
# LLM Judges

LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.

Reference an LLM judge in your eval file:

```yaml
execution:
  evaluators:
    - name: semantic_check
      type: llm_judge
      prompt: ./judges/correctness.md
```

The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.

Write evaluation instructions as markdown. `{{variable}}` placeholders are interpolated with values from the eval case:

```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}
**Expected Outcome:** {{expected_outcome}}
**Reference Answer:** {{reference_answer}}
**Candidate Answer:** {{candidate_answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```
| Variable | Source |
| --- | --- |
| `question` | First user message content |
| `expected_outcome` | Eval case `expected_outcome` field |
| `reference_answer` | Last expected message content |
| `candidate_answer` | Last candidate response content |
| `sidecar` | Eval case sidecar metadata |
| `rubrics` | Eval case rubrics (if defined) |
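
Conceptually, the `{{variable}}` substitution amounts to plain string replacement. The sketch below only illustrates that idea; it is not AgentV's actual renderer:

```typescript
// Illustration only: naive {{variable}} substitution into a markdown template.
// AgentV's real renderer may handle escaping and missing variables differently.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (placeholder, name) =>
    name in vars ? vars[name] : placeholder, // leave unknown placeholders untouched
  );
}

renderTemplate('**Question:** {{question}}', { question: 'What is 2+2?' });
// => "**Question:** What is 2+2?"
```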

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:

```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;

  return `You are evaluating an AI assistant's response.

## Question
${ctx.question}

## Candidate Answer
${ctx.candidateAnswer}

${ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : ''}
${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

Evaluate and provide a score from 0 to 1.`;
});
```
When an eval case runs:

  1. AgentV renders the prompt template with variables from the eval case
  2. The rendered prompt is sent to the judge target (configured in targets.yaml)
  3. The LLM returns a structured evaluation with score, hits, misses, and reasoning
  4. Results are recorded in the output JSONL
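
The fields of that structured evaluation are named above; as a rough sketch (the interface name and exact schema are assumptions, not exported types), each recorded result carries something like:

```typescript
// Assumed shape for orientation only; inspect the output JSONL for the actual schema.
interface JudgeEvaluation {
  score: number;     // 0.0 to 1.0, per the prompt's scoring guidelines
  hits: string[];    // criteria the candidate answer satisfied
  misses: string[];  // criteria the candidate answer failed
  reasoning: string; // the judge model's explanation of the score
}
```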

When using TypeScript templates, configure them in YAML with optional config data passed to the script:

```yaml
evaluators:
  - name: custom-eval
    type: llm_judge
    prompt:
      script: [bun, run, ../prompts/custom-evaluator.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
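
For instance, the `strictMode` flag from the YAML above could be read the same way as `rubric`. This is a sketch of one possible use, not a built-in behavior:

```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

// Sketch: both values arrive via ctx.config; what they mean is up to the template author.
export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;
  const strict = ctx.config?.strictMode === true;

  return `You are evaluating an AI assistant's response.

## Question
${ctx.question}

## Candidate Answer
${ctx.candidateAnswer}
${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

${strict ? 'Apply the criteria strictly: any factual error scores 0.' : 'Score the response from 0 to 1.'}`;
});
```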

TypeScript templates receive a context object with these fields:

| Field | Type | Description |
| --- | --- | --- |
| `question` | `string` | First user message content |
| `candidateAnswer` | `string` | Last entry in `output_messages` |
| `referenceAnswer` | `string` | Last entry in `expected_messages` |
| `expectedOutcome` | `string` | Eval case `expected_outcome` field |
| `expectedMessages` | `Message[]` | Full resolved expected messages |
| `outputMessages` | `Message[]` | Full provider output messages |
| `traceSummary` | `TraceSummary` | Execution metrics summary |
| `config` | `object` | Custom config from YAML |
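
Taken together, the context can be pictured roughly as the interface below. The interface name and the stand-in `Message`/`TraceSummary` definitions are simplifications for illustration; the real types come from `@agentv/eval`:

```typescript
// Simplified stand-ins; the actual Message and TraceSummary types come from @agentv/eval.
type Message = { role: string; content: string };
type TraceSummary = Record<string, unknown>;

// Rough sketch of the context passed to a definePromptTemplate callback.
// Field names follow the table above; optionality and the exact exported type may differ.
interface PromptTemplateContext {
  question: string;                 // first user message content
  candidateAnswer: string;          // last entry in output_messages
  referenceAnswer?: string;         // last entry in expected_messages
  expectedOutcome?: string;         // eval case expected_outcome field
  expectedMessages: Message[];      // full resolved expected messages
  outputMessages: Message[];        // full provider output messages
  traceSummary?: TraceSummary;      // execution metrics summary
  config?: Record<string, unknown>; // custom config from YAML
}
```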

Template variables are derived internally through three layers:

Layer 1 is what users write in YAML or JSONL:

- `input` or `input_messages` — two syntaxes for the same data. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`. If both are present, `input_messages` takes precedence.
- `expected_output` or `expected_messages` — two syntaxes for the same data. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`. If both are present, `expected_messages` takes precedence.

Layer 2 is produced by parsing: canonical message arrays replace the shorthand fields:

- `input_messages: TestMessage[]` — canonical resolved input
- `expected_messages: TestMessage[]` — canonical resolved expected output

At this layer, `input` and `expected_output` no longer exist as separate fields.
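
A minimal sketch of this shorthand resolution (not AgentV's actual parser):

```typescript
// Sketch of the shorthand resolution described above; not AgentV's actual parser.
type TestMessage = { role: 'user' | 'assistant' | 'system'; content: string };

interface RawEvalCase {
  input?: string;
  input_messages?: TestMessage[];
  expected_output?: string;
  expected_messages?: TestMessage[];
}

function resolveMessages(raw: RawEvalCase) {
  return {
    // input_messages wins when both syntaxes are present
    input_messages:
      raw.input_messages ??
      (raw.input !== undefined ? [{ role: 'user' as const, content: raw.input }] : []),
    // expected_messages wins when both syntaxes are present
    expected_messages:
      raw.expected_messages ??
      (raw.expected_output !== undefined
        ? [{ role: 'assistant' as const, content: raw.expected_output }]
        : []),
  };
}
```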

Layer 3 is the set of derived strings injected into evaluator prompts:

| Variable | Derivation |
| --- | --- |
| `question` | Content of the first `user` role entry in `input_messages` |
| `expected_outcome` | Passed through from the eval case field |
| `reference_answer` | Content of the last entry in `expected_messages` |
| `candidate_answer` | Content of the last entry in `output_messages` |
| `input_messages` | Full resolved input array, JSON-serialized |
| `expected_messages` | Full resolved expected array, JSON-serialized |
| `output_messages` | Full provider output array, JSON-serialized |
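
These derivations amount to picking specific entries out of the canonical arrays and serializing the rest. A sketch of that mapping (not the actual implementation):

```typescript
// Sketch of how the derived prompt variables relate to the canonical message arrays;
// not the actual implementation.
type TestMessage = { role: string; content: string };

function deriveVariables(
  inputMessages: TestMessage[],
  expectedMessages: TestMessage[],
  outputMessages: TestMessage[],
  expectedOutcome?: string,
) {
  return {
    question: inputMessages.find((m) => m.role === 'user')?.content,
    expected_outcome: expectedOutcome,
    reference_answer: expectedMessages.at(-1)?.content,
    candidate_answer: outputMessages.at(-1)?.content,
    input_messages: JSON.stringify(inputMessages),
    expected_messages: JSON.stringify(expectedMessages),
    output_messages: JSON.stringify(outputMessages),
  };
}
```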

Example flow:

```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input_messages: [{ role: "user", content: "What is 2+2?" }]
expected_messages: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
question: "What is 2+2?"
reference_answer: "The answer is 4"
candidate_answer: (extracted from provider output at runtime)
```