Evaluations
Validate your agent’s outputs against expectations using trajectory examples and scorers. You can run evals inline during traces or as part of CI.
Define examples
Use Example to describe inputs, expected outputs, retrieval context, and expected tools.
example.py
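A minimal sketch of an Example covering the fields above (input, expected output, retrieval context, expected tools). The import path and the literal values are illustrative assumptions, not the library’s confirmed API:

```python
# Illustrative only: the module name `eval_sdk` and the exact keyword
# names are assumptions -- adjust them to match your SDK.
from eval_sdk import Example  # placeholder import path

refund_example = Example(
    input="What is your refund policy for damaged items?",
    expected_output="Damaged items can be refunded within 30 days with proof of purchase.",
    retrieval_context=[
        "Refund policy: damaged items are eligible for a full refund within 30 days.",
    ],
    expected_tools=["search_policy_docs"],  # tools the agent is expected to call
)
```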
Choose scorers
Scorers judge different aspects (faithfulness to context, answer relevancy, tool-calling order, etc.).
scorers.py
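A sketch of choosing scorers for those aspects. Only ToolCallOrderScorer is named in this guide; the other class names, the import path, and the threshold parameter are assumptions:

```python
# Illustrative scorer selection; class names other than ToolCallOrderScorer
# and the `threshold` keyword are assumptions.
from eval_sdk.scorers import (  # placeholder import path
    FaithfulnessScorer,
    AnswerRelevancyScorer,
    ToolCallOrderScorer,
)

scorers = [
    FaithfulnessScorer(threshold=0.5),     # is the answer grounded in the retrieval context?
    AnswerRelevancyScorer(threshold=0.5),  # does the answer address the input?
    ToolCallOrderScorer(),                 # did the agent call tools in the expected order?
]
```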
Run evals inline during tracing
Call tracer.async_evaluate(...) within observed functions or inside a with tracer.trace(...): block to attach results to the current span.
inline_eval.py
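A sketch of attaching an inline evaluation to the current span from inside an observed function. The Tracer constructor, the observe decorator, and async_evaluate’s exact keywords are assumptions beyond what this page states:

```python
# Illustrative inline evaluation; constructor arguments, the observe
# decorator, and async_evaluate's keywords are assumptions.
from eval_sdk import Tracer, Example            # placeholder imports
from eval_sdk.scorers import FaithfulnessScorer  # assumed scorer class

tracer = Tracer(project_name="support-agent")   # assumed constructor

@tracer.observe()  # assumed decorator; a `with tracer.trace(...):` block also works
def answer_question(question: str) -> str:
    context = ["Refund policy: damaged items are eligible within 30 days."]
    answer = "Damaged items can be refunded within 30 days."  # stand-in for your LLM call

    # Attach the eval result to the span created for this function.
    tracer.async_evaluate(
        scorers=[FaithfulnessScorer(threshold=0.5)],
        example=Example(
            input=question,
            actual_output=answer,
            retrieval_context=context,
        ),
    )
    return answer
```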
Pass a span_id explicitly if you manage spans manually.
Batch or CI-style usage
Wrap your agent function and evaluate across a suite of examples. Keep each Example focused on a single behavior.
ci_suite.py
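A sketch of a CI-style suite that wraps the agent and runs each Example through the chosen scorers. The client class, the run_evaluation entry point, the result shape, and the agent import are assumptions used for illustration:

```python
# Illustrative CI suite; EvalClient, run_evaluation, and the result objects
# are assumptions, and `my_agent.run_agent` is a hypothetical entry point.
from eval_sdk import EvalClient, Example                 # placeholder imports
from eval_sdk.scorers import AnswerRelevancyScorer, ToolCallOrderScorer

from my_agent import run_agent                           # hypothetical agent function

def build_examples() -> list[Example]:
    # One focused behavior per Example.
    prompt = "Cancel my subscription and confirm by email."
    return [
        Example(
            input=prompt,
            actual_output=run_agent(prompt),
            expected_tools=["cancel_subscription", "send_email"],
        ),
    ]

def test_agent_suite():
    client = EvalClient()
    results = client.run_evaluation(                     # assumed batch-eval entry point
        examples=build_examples(),
        scorers=[AnswerRelevancyScorer(threshold=0.5), ToolCallOrderScorer()],
    )
    assert all(r.success for r in results)               # assumed result shape
```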
- Keep thresholds conservative at first (e.g., 0.5) and tune after inspecting failures.
- Use trajectory-level scorers like ToolCallOrderScorer when validating multi-tool agent behavior.
- Persist traces in CI to inspect failures in your dashboard.

