# LLM Judges
LLM judges use a language model to evaluate agent responses against custom criteria defined in a prompt file.
## Configuration

Reference an LLM judge in your eval file:

```yaml
execution:
  evaluators:
    - name: semantic_check
      type: llm_judge
      prompt: ./judges/correctness.md
```
## Prompt Files

The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.
### Markdown Template

Write evaluation instructions as markdown. Template variables are interpolated:
```markdown
# Evaluation Criteria

Evaluate the candidate's response to the following question:

**Question:** {{question}}
**Expected Outcome:** {{expected_outcome}}
**Reference Answer:** {{reference_answer}}
**Candidate Answer:** {{candidate_answer}}

## Scoring

Score the response from 0.0 to 1.0 based on:

1. Correctness — does the answer match the expected outcome?
2. Completeness — does it address all parts of the question?
3. Clarity — is the response clear and well-structured?
```
### Available Template Variables

| Variable | Source |
|---|---|
| `question` | First user message content |
| `expected_outcome` | Eval case `expected_outcome` field |
| `reference_answer` | Last expected message content |
| `candidate_answer` | Last candidate response content |
| `sidecar` | Eval case sidecar metadata |
| `rubrics` | Eval case rubrics (if defined) |
### TypeScript Template

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:
```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  const rubric = ctx.config?.rubric as string | undefined;

  return `You are evaluating an AI assistant's response.

## Question
${ctx.question}

## Candidate Answer
${ctx.candidateAnswer}

${ctx.referenceAnswer ? `## Reference Answer\n${ctx.referenceAnswer}` : ''}

${rubric ? `## Evaluation Criteria\n${rubric}` : ''}

Evaluate and provide a score from 0 to 1.`;
});
```
## How It Works

- AgentV renders the prompt template with variables from the eval case
- The rendered prompt is sent to the judge target (configured in `targets.yaml`)
- The LLM returns a structured evaluation with score, hits, misses, and reasoning
- Results are recorded in the output JSONL
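The structured evaluation in the third step can be pictured as a record with those four fields. The interface below is only a sketch inferred from the terms named above; the actual JSONL record written by AgentV may use a different or richer schema.

```typescript
// Illustrative shape only, inferred from the fields named above
// (score, hits, misses, reasoning); the real output schema may differ.
export interface JudgeEvaluation {
  score: number;      // 0.0 to 1.0, as the prompt requests
  hits: string[];     // criteria the candidate answer satisfied
  misses: string[];   // criteria the candidate answer failed
  reasoning: string;  // the judge model's explanation for the score
}
```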
## Script Configuration

When using TypeScript templates, configure them in YAML with optional config data passed to the script:

```yaml
evaluators:
  - name: custom-eval
    type: llm_judge
    prompt:
      script: [bun, run, ../prompts/custom-evaluator.ts]
      config:
        rubric: "Your rubric here"
        strictMode: true
```

The `config` object is available as `ctx.config` inside the template function.
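Inside the template, those config values arrive untyped and are typically narrowed before use. A minimal sketch of what the `../prompts/custom-evaluator.ts` referenced above might look like, assuming it reads the `rubric` and `strictMode` keys from the YAML (the strict-mode prompt wording is purely illustrative):

```typescript
#!/usr/bin/env bun
// Hypothetical contents of ../prompts/custom-evaluator.ts for the YAML above.
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  // Values from the YAML `config:` block; narrow their types before use.
  const rubric = ctx.config?.rubric as string | undefined;
  const strictMode = ctx.config?.strictMode as boolean | undefined;

  return `Evaluate the candidate answer.

## Question
${ctx.question}

## Candidate Answer
${ctx.candidateAnswer}

${rubric ? `## Rubric\n${rubric}` : ''}
${strictMode ? 'Apply the rubric strictly: every unmet criterion must lower the score.' : ''}

Provide a score from 0 to 1.`;
});
```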
## Available Context Fields

TypeScript templates receive a context object with these fields:

| Field | Type | Description |
|---|---|---|
| `question` | `string` | First user message content |
| `candidateAnswer` | `string` | Last entry in `output_messages` |
| `referenceAnswer` | `string` | Last entry in `expected_messages` |
| `expectedOutcome` | `string` | Eval case `expected_outcome` field |
| `expectedMessages` | `Message[]` | Full resolved expected messages |
| `outputMessages` | `Message[]` | Full provider output messages |
| `traceSummary` | `TraceSummary` | Execution metrics summary |
| `config` | `object` | Custom config from YAML |
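The array fields are useful when the judge should see more than the final answer. A sketch, assuming it is acceptable to embed the provider output as a JSON transcript in the prompt (the prompt wording itself is illustrative):

```typescript
#!/usr/bin/env bun
import { definePromptTemplate } from '@agentv/eval';

export default definePromptTemplate((ctx) => {
  // Serialize the full provider output so the judge can inspect intermediate
  // messages, not only the final candidate answer.
  const transcript = JSON.stringify(ctx.outputMessages, null, 2);

  return `Evaluate the assistant's full transcript, not just its final reply.

## Question
${ctx.question}

## Expected Outcome
${ctx.expectedOutcome}

## Transcript (JSON)
${transcript}

Score from 0 to 1 how well the transcript achieves the expected outcome.`;
});
```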
## Template Variable Derivation

Template variables are derived internally through three layers:
### 1. Authoring Layer

What users write in YAML or JSONL:

- `input` or `input_messages` — two syntaxes for the same data. `input: "What is 2+2?"` expands to `[{ role: "user", content: "What is 2+2?" }]`. If both are present, `input_messages` takes precedence.
- `expected_output` or `expected_messages` — two syntaxes for the same data. `expected_output: "4"` expands to `[{ role: "assistant", content: "4" }]`. If both are present, `expected_messages` takes precedence.
### 2. Resolved Layer

After parsing, canonical message arrays replace the shorthand fields:

- `input_messages: TestMessage[]` — canonical resolved input
- `expected_messages: TestMessage[]` — canonical resolved expected output

At this layer, `input` and `expected_output` no longer exist as separate fields.
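The expansion and precedence rules above amount to a small normalization step. The following is only a sketch of that rule in TypeScript, not AgentV's actual parser; the message shape is simplified, and the real `TestMessage` type may carry more fields.

```typescript
// Sketch of the shorthand resolution described above; not the real parser.
type TestMessage = { role: 'user' | 'assistant'; content: string };

interface AuthoredCase {
  input?: string;
  input_messages?: TestMessage[];
  expected_output?: string;
  expected_messages?: TestMessage[];
}

function resolveCase(c: AuthoredCase): {
  input_messages: TestMessage[];
  expected_messages: TestMessage[];
} {
  return {
    // The explicit message array wins when both syntaxes are present.
    input_messages:
      c.input_messages ?? (c.input ? [{ role: 'user', content: c.input }] : []),
    expected_messages:
      c.expected_messages ??
      (c.expected_output ? [{ role: 'assistant', content: c.expected_output }] : []),
  };
}
```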
### 3. Template Variable Layer

Derived strings injected into evaluator prompts:
| Variable | Derivation |
|---|---|
| `question` | Content of the first `user` role entry in `input_messages` |
| `expected_outcome` | Passed through from the eval case field |
| `reference_answer` | Content of the last entry in `expected_messages` |
| `candidate_answer` | Content of the last entry in `output_messages` |
| `input_messages` | Full resolved input array, JSON-serialized |
| `expected_messages` | Full resolved expected array, JSON-serialized |
| `output_messages` | Full provider output array, JSON-serialized |
Example flow:
```yaml
# User writes:
input: "What is 2+2?"
expected_output: "The answer is 4"

# Resolved:
input_messages: [{ role: "user", content: "What is 2+2?" }]
expected_messages: [{ role: "assistant", content: "The answer is 4" }]

# Derived template variables:
question: "What is 2+2?"
reference_answer: "The answer is 4"
candidate_answer: (extracted from provider output at runtime)
```