# Tool Trajectory Evaluators

Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (`codex`, `vscode`, and `cli` with trace support).
## any_order — Minimum Tool Counts

Validates that each tool was called at least N times, regardless of order:

```yaml
execution:
  evaluators:
    - name: tool-usage
      type: tool_trajectory
      mode: any_order
      minimums:
        knowledgeSearch: 2 # Must be called at least twice
        documentRetrieve: 1 # Must be called at least once
```

Use `any_order` when you want to ensure required tools are used but don't care about execution order.
## in_order — Sequential Matching

Validates that tools appear in the expected sequence, while allowing gaps (other tools can appear between expected ones):

```yaml
execution:
  evaluators:
    - name: workflow-sequence
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: fetchData
        - tool: validateSchema
        - tool: transformData
        - tool: saveResults
```

Use `in_order` when you need to verify logical workflow order while allowing the agent to use additional helper tools between steps.
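For example, an actual run of `fetchData`, `logMetrics`, `validateSchema`, `transformData`, `saveResults` would still pass against the expectation above: the extra call (here a hypothetical `logMetrics` helper) occupies a gap without disturbing the expected order.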
## exact — Strict Sequence Match

Validates the exact tool sequence, with no gaps or extra tools:

```yaml
execution:
  evaluators:
    - name: auth-sequence
      type: tool_trajectory
      mode: exact
      expected:
        - tool: checkCredentials
        - tool: generateToken
        - tool: auditLog
```

Use `exact` for security-critical workflows, strict protocol validation, or regression testing of specific behavior.
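Unlike `in_order`, any deviation fails here: a run of `checkCredentials`, `rateLimitCheck`, `generateToken`, `auditLog` would fail because the extra call (a hypothetical `rateLimitCheck`) breaks the exact sequence, even though the expected tools still appear in order.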
## Argument Matching

For `in_order` and `exact` modes, you can optionally validate tool arguments:

```yaml
execution:
  evaluators:
    - name: search-validation
      type: tool_trajectory
      mode: in_order
      expected:
        # Partial match — only specified keys are checked
        - tool: search
          args: { query: "machine learning" }

        # Skip argument validation for this tool
        - tool: process
          args: any

        # No args field = no argument validation (same as args: any)
        - tool: saveResults
```

| Syntax | Behavior |
|---|---|
| `args: { key: value }` | Partial deep equality — only specified keys are checked |
| `args: any` | Skip argument validation |
| No `args` field | Same as `args: any` |
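To illustrate partial deep equality: `args: { query: "machine learning" }` matches a call whose recorded `input` is `{ "query": "machine learning", "limit": 10 }`, since the unspecified (and here hypothetical) `limit` key is ignored; the match fails only if `query` is missing or holds a different value.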
## Latency Assertions

For `in_order` and `exact` modes, you can validate per-tool timing with `max_duration_ms`:

```yaml
execution:
  evaluators:
    - name: perf-check
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: Read
          max_duration_ms: 100 # Must complete within 100ms
        - tool: Edit
          max_duration_ms: 500 # Allow 500ms for edits
        - tool: Write # No timing requirement
```

Each `max_duration_ms` assertion counts as a separate scoring aspect. The rules:
| Condition | Result |
|---|---|
| `actual_duration <= max_duration_ms` | Pass (counts as a hit) |
| `actual_duration > max_duration_ms` | Fail (counts as a miss) |
| No `duration_ms` in trace output | Warning logged; neutral (neither hit nor miss) |
Set generous thresholds to avoid flaky tests from timing variance. Only add latency assertions where timing matters on critical paths.
## Scoring

| Mode | Score Calculation |
|---|---|
| `any_order` | (tools meeting minimum) / (total tools with minimums) |
| `in_order` | (sequence hits + latency hits) / (expected tools + latency assertions) |
| `exact` | (sequence hits + latency hits) / (expected tools + latency assertions) |
Example: 3 expected tools with 2 latency assertions = 5 total aspects scored.
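Concretely, if a run matches 2 of those 3 expected tools in sequence and passes 1 of the 2 latency assertions, the score is (2 + 1) / (3 + 2) = 0.6.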
## Trace Data Format

Tool trajectory evaluators require trace data from the agent provider. Providers return `output_messages` containing `tool_calls`:

```json
{
  "id": "eval-001",
  "output_messages": [
    {
      "role": "assistant",
      "content": "I'll search for information about this topic.",
      "tool_calls": [
        {
          "tool": "knowledgeSearch",
          "input": { "query": "REST vs GraphQL" },
          "output": { "results": [] },
          "id": "call_123",
          "timestamp": "2024-01-15T10:30:00Z",
          "duration_ms": 45
        }
      ]
    }
  ]
}
```

The evaluator extracts tool calls from `output_messages[].tool_calls[]`. The `tool` and `input` fields are required. Optional fields:
- `id` and `timestamp` — for debugging
- `duration_ms` — required if using `max_duration_ms` latency assertions
## Supported Providers

- **codex** — returns `output_messages` via JSONL log events
- **vscode / vscode-insiders** — returns `output_messages` from Copilot execution
- **cli** — returns `output_messages` with `tool_calls`
## CLI Options

```bash
# Write trace files to disk
agentv eval evals/test.yaml --dump-traces

# Include full trace in result output
agentv eval evals/test.yaml --include-trace
```

Use `--dump-traces` to inspect actual traces and understand agent behavior before writing evaluators.
## Complete Examples

### Research Agent Validation

```yaml
description: Validate research agent tool usage
execution:
  target: codex_agent

evalcases:
  - id: comprehensive-research
    expected_outcome: Agent thoroughly researches the topic

    input_messages:
      - role: user
        content: Research machine learning frameworks

    execution:
      evaluators:
        # Check minimum tool usage
        - name: coverage
          type: tool_trajectory
          mode: any_order
          minimums:
            webSearch: 1
            documentRead: 2
            noteTaking: 1

        # Check workflow order
        - name: workflow
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: documentRead
            - tool: summarize
```

### Multi-Step Pipeline
```yaml
evalcases:
  - id: data-pipeline
    expected_outcome: Process data through complete pipeline

    input_messages:
      - role: user
        content: Process the customer dataset

    execution:
      evaluators:
        - name: pipeline-check
          type: tool_trajectory
          mode: exact
          expected:
            - tool: loadData
            - tool: validate
            - tool: transform
            - tool: export
```

### Pipeline with Latency Assertions
```yaml
evalcases:
  - id: data-pipeline-perf
    expected_outcome: Process data within timing budgets

    input_messages:
      - role: user
        content: Process the customer dataset quickly

    execution:
      evaluators:
        - name: pipeline-perf
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: loadData
              max_duration_ms: 1000 # Network fetch within 1s
            - tool: validate # No timing requirement
            - tool: transform
              max_duration_ms: 500 # Transform must be fast
            - tool: export
              max_duration_ms: 200 # Export should be quick
```

## Best Practices
- Start with `any_order`, then tighten to `in_order` or `exact` as needed.
- Combine with other evaluators — use tool trajectory for execution validation and LLM judges for output quality.
- Inspect traces first with `--dump-traces` to understand agent behavior before writing evaluators.
- Use generous latency thresholds to avoid flaky tests from timing variance.
- Use code judges for custom validation — write custom tool validation scripts when built-in modes are insufficient (see the sketch below).
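As a rough starting point for such a script, here is a minimal Python sketch that checks minimum tool counts against a trace dumped with `--dump-traces`. It assumes only the trace shape documented above under Trace Data Format; the file path argument, the `check_minimums` helper, and the specific minimums are illustrative, not part of agentv.

```python
import json
import sys


def tool_calls(trace: dict):
    """Yield every tool call from output_messages[].tool_calls[]."""
    for message in trace.get("output_messages", []):
        yield from message.get("tool_calls", [])


def check_minimums(trace: dict, minimums: dict) -> bool:
    """Return True if each tool was called at least the required number of times."""
    counts: dict = {}
    for call in tool_calls(trace):
        counts[call["tool"]] = counts.get(call["tool"], 0) + 1

    ok = True
    for tool, minimum in minimums.items():
        actual = counts.get(tool, 0)
        if actual < minimum:
            print(f"FAIL: {tool} called {actual}x, expected at least {minimum}")
            ok = False
    return ok


if __name__ == "__main__":
    # Hypothetical invocation: python check_trace.py path/to/trace.json
    with open(sys.argv[1]) as f:
        trace = json.load(f)
    sys.exit(0 if check_minimums(trace, {"knowledgeSearch": 2}) else 1)
```

The dumped file's actual name and location depend on your setup; a non-zero exit code signals a failed check, which makes the script easy to wire into CI or a code judge.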