Structured Data & Metrics Evaluators

Built-in evaluators for grading structured outputs and gating on execution metrics:

  • field_accuracy — compare JSON fields against ground truth
  • latency — gate on response time
  • cost — gate on monetary cost
  • token_usage — gate on token consumption

Put the expected structured output in the eval case's expected_messages, typically as the last assistant message whose content is an object. Evaluators read expected values from there.

evalcases:
  - id: invoice-001
    expected_messages:
      - role: assistant
        content:
          invoice_number: "INV-2025-001234"
          net_total: 1889

field_accuracy

Use field_accuracy to compare fields in the candidate JSON against the ground-truth object in expected_messages.

execution:
  evaluators:
    - name: invoice_fields
      type: field_accuracy
      aggregation: weighted_average
      fields:
        - path: invoice_number
          match: exact
          required: true
          weight: 2.0
        - path: invoice_date
          match: date
          formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
Match type           Description                         Options
exact                Strict equality                     (none)
date                 Compares dates after parsing        formats — list of accepted date formats
numeric_tolerance    Numeric compare within tolerance    tolerance — absolute threshold; relative: true for relative tolerance
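
As a rough sketch (the evaluator's exact semantics may differ), the absolute vs. relative interpretation of tolerance can be modeled as:

```python
def numeric_match(candidate: float, expected: float,
                  tolerance: float, relative: bool = False) -> bool:
    """Sketch of numeric_tolerance matching.

    Absolute by default; with relative=True the tolerance is read as a
    fraction of the expected value.
    """
    diff = abs(candidate - expected)
    if relative:
        return diff <= tolerance * abs(expected)
    return diff <= tolerance
```

With the config above (tolerance: 1.0), a candidate net_total of 1888.5 would pass against an expected 1889, while 1887.5 would not.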

For fuzzy string matching, use a code_judge evaluator (e.g. Levenshtein distance) instead of adding a fuzzy mode to field_accuracy.
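
As one possible shape for such a judge (the code_judge interface itself is not shown here, and the normalization choice is an assumption), a Levenshtein-based similarity score in [0, 1] could look like:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(candidate: str, expected: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not candidate and not expected:
        return 1.0
    return 1.0 - levenshtein(candidate, expected) / max(len(candidate), len(expected))
```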

The aggregation option selects how per-field scores combine into one evaluator score:

Strategy                     Description
weighted_average (default)   Weighted mean of field scores
all_or_nothing               Score is 1.0 only if all graded fields pass
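
A minimal sketch of the two strategies, assuming each graded field yields a (score, weight) pair and that all_or_nothing treats a perfect field score as a pass (the real pass criterion may differ):

```python
def aggregate(results, strategy="weighted_average", pass_threshold=1.0):
    """results: list of (score, weight) pairs for graded fields."""
    if strategy == "all_or_nothing":
        # Every field must meet the pass threshold, or the whole score is 0.
        return 1.0 if all(score >= pass_threshold for score, _ in results) else 0.0
    # weighted_average: weight-normalized mean of field scores.
    total_weight = sum(w for _, w in results)
    return sum(score * w for score, w in results) / total_weight
```

For example, with invoice_number correct (weight 2.0) and net_total wrong (weight 1.0), weighted_average yields 2/3 while all_or_nothing yields 0.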

latency

Gate on execution time (in milliseconds) reported by the provider via traceSummary.

execution:
  evaluators:
    - name: performance
      type: latency
      threshold: 2000

cost

Gate on monetary cost reported by the provider via traceSummary.

execution:
  evaluators:
    - name: budget
      type: cost
      budget: 0.10

token_usage

Gate on provider-reported token usage. Useful when cost is unavailable or model pricing differs.

execution:
  evaluators:
    - name: token-budget
      type: token_usage
      max_total: 10000
      # or:
      # max_input: 8000
      # max_output: 2000

composite

Use a composite evaluator to produce a single “release gate” score from multiple checks:

execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: correctness
          type: field_accuracy
          fields:
            - path: invoice_number
              match: exact
        - name: latency
          type: latency
          threshold: 2000
        - name: cost
          type: cost
          budget: 0.10
        - name: tokens
          type: token_usage
          max_total: 10000
      aggregator:
        type: weighted_average
        weights:
          correctness: 0.8
          latency: 0.1
          cost: 0.05
          tokens: 0.05
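
For intuition, the gate score under weighted_average is the weight-normalized sum of sub-evaluator scores. A quick worked example with illustrative sub-scores (correctness partially passed, the three gates fully passed):

```python
weights = {"correctness": 0.8, "latency": 0.1, "cost": 0.05, "tokens": 0.05}
# Illustrative sub-evaluator scores, not real output.
scores = {"correctness": 0.9, "latency": 1.0, "cost": 1.0, "tokens": 1.0}

gate = sum(weights[k] * scores[k] for k in weights) / sum(weights.values())
print(round(gate, 2))  # 0.8*0.9 + 0.1 + 0.05 + 0.05 = 0.92
```

Because correctness carries 0.8 of the weight, a correctness drop dominates the gate even when every metric gate passes.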