Structured Data & Metrics Evaluators

Built-in evaluators for grading structured outputs and gating on execution metrics:

  • field_accuracy — compare JSON fields against ground truth
  • latency — gate on response time
  • cost — gate on monetary cost
  • token_usage — gate on token consumption

Put the expected structured output in the eval case's expected_messages, typically as the last assistant message whose content is an object. Evaluators read expected values from there.

evalcases:
  - id: invoice-001
    expected_messages:
      - role: assistant
        content:
          invoice_number: "INV-2025-001234"
          net_total: 1889

field_accuracy

Use field_accuracy to compare fields in the candidate JSON against the ground-truth object in expected_messages.

execution:
  evaluators:
    - name: invoice_fields
      type: field_accuracy
      aggregation: weighted_average
      fields:
        - path: invoice_number
          match: exact
          required: true
          weight: 2.0
        - path: invoice_date
          match: date
          formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
Match type           Description                         Options
exact                Strict equality                     (none)
date                 Compares dates after parsing        formats — list of accepted date formats
numeric_tolerance    Numeric compare within tolerance    tolerance — absolute threshold; relative: true for relative tolerance
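
As a rough sketch (the evaluator's exact semantics may differ), the absolute vs. relative interpretation of tolerance can be modeled as:

```python
def numeric_match(candidate: float, expected: float,
                  tolerance: float, relative: bool = False) -> bool:
    """Sketch of numeric_tolerance matching.

    Absolute by default; with relative=True the tolerance is read as a
    fraction of the expected value.
    """
    diff = abs(candidate - expected)
    if relative:
        return diff <= tolerance * abs(expected)
    return diff <= tolerance
```

With the config above (tolerance: 1.0), a candidate net_total of 1888.5 would pass against an expected 1889, while 1887.5 would not.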

For fuzzy string matching, use a code_judge evaluator (e.g. Levenshtein distance) instead of adding a fuzzy mode to field_accuracy.
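
As one possible shape for such a judge (the code_judge interface itself is not shown here, and the normalization choice is an assumption), a Levenshtein-based similarity score in [0, 1] could look like:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(candidate: str, expected: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not candidate and not expected:
        return 1.0
    return 1.0 - levenshtein(candidate, expected) / max(len(candidate), len(expected))
```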

The aggregation option selects how per-field scores combine into one evaluator score:

Strategy                     Description
weighted_average (default)   Weighted mean of field scores
all_or_nothing               Score is 1.0 only if all graded fields pass
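
A minimal sketch of the two strategies, assuming each graded field yields a (score, weight) pair and that all_or_nothing treats a perfect field score as a pass (the real pass criterion may differ):

```python
def aggregate(results, strategy="weighted_average", pass_threshold=1.0):
    """results: list of (score, weight) pairs for graded fields."""
    if strategy == "all_or_nothing":
        # Every field must meet the pass threshold, or the whole score is 0.
        return 1.0 if all(score >= pass_threshold for score, _ in results) else 0.0
    # weighted_average: weight-normalized mean of field scores.
    total_weight = sum(w for _, w in results)
    return sum(score * w for score, w in results) / total_weight
```

For example, with invoice_number correct (weight 2.0) and net_total wrong (weight 1.0), weighted_average yields 2/3 while all_or_nothing yields 0.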

latency

Gate on execution time (in milliseconds) reported by the provider via traceSummary.

execution:
  evaluators:
    - name: performance
      type: latency
      threshold: 2000

cost

Gate on monetary cost reported by the provider via traceSummary.

execution:
  evaluators:
    - name: budget
      type: cost
      budget: 0.10

token_usage

Gate on provider-reported token usage. Useful when cost is unavailable or model pricing differs.

execution:
  evaluators:
    - name: token-budget
      type: token_usage
      max_total: 10000
      # or:
      # max_input: 8000
      # max_output: 2000

composite

Use a composite evaluator to produce a single “release gate” score from multiple checks:

execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: correctness
          type: field_accuracy
          fields:
            - path: invoice_number
              match: exact
        - name: latency
          type: latency
          threshold: 2000
        - name: cost
          type: cost
          budget: 0.10
        - name: tokens
          type: token_usage
          max_total: 10000
      aggregator:
        type: weighted_average
        weights:
          correctness: 0.8
          latency: 0.1
          cost: 0.05
          tokens: 0.05
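
For intuition, the gate score under weighted_average is the weight-normalized sum of sub-evaluator scores. A quick worked example with illustrative sub-scores (correctness partially passed, the three gates fully passed):

```python
weights = {"correctness": 0.8, "latency": 0.1, "cost": 0.05, "tokens": 0.05}
# Illustrative sub-evaluator scores, not real output.
scores = {"correctness": 0.9, "latency": 1.0, "cost": 1.0, "tokens": 1.0}

gate = sum(weights[k] * scores[k] for k in weights) / sum(weights.values())
print(round(gate, 2))  # 0.8*0.9 + 0.1 + 0.05 + 0.05 = 0.92
```

Because correctness carries 0.8 of the weight, a correctness drop dominates the gate even when every metric gate passes.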