
Evals (beta)

View and analyze your evaluation results in Pydantic Logfire's web interface. The Evals view gives you observability into how your AI systems perform across different test cases and experiments over time.

Code-First Evaluation

Evals are created and run using Pydantic Evals, a sub-package of Pydantic AI. Logfire serves as a read-only observability layer where you can view and compare results.

What are Evals?

Evals help you systematically test and evaluate AI systems by running them against predefined test cases. Each evaluation experiment appears in Logfire automatically when you run the Pydantic Evals package with Logfire integration enabled.
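As a minimal sketch of what this looks like in code: configuring Logfire before running a dataset is enough for the experiment to be recorded. The task function `answer_question`, the example case, and the `IsInstance` check are illustrative placeholders, not a prescribed setup; see the Pydantic Evals docs for the full API.

```python
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

# Send spans to Logfire; 'if-token-present' keeps local runs working
# even when no LOGFIRE_TOKEN is configured.
logfire.configure(send_to_logfire='if-token-present')


# A hypothetical task under test: any sync or async function works.
async def answer_question(question: str) -> str:
    return 'Paris'


dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
    ],
    evaluators=[IsInstance(type_name='str')],
)

# Running the dataset emits OpenTelemetry spans; the experiment then
# appears under the Evals tab in Logfire.
report = dataset.evaluate_sync(answer_question)
report.print()
```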

For the data model, examples, and full documentation on creating and running evals, read the Pydantic Evals docs.

Viewing Experiments

Evals overview

The Evals tab shows all evaluation experiments for your project. Each experiment represents a single run of a dataset against a task function.

Experiment List

Each experiment displays:

  • Experiment name - Auto-generated by Logfire (e.g., "gentle-sniff-buses")
  • Task name - The function being evaluated
  • Span link - Direct link to the detailed trace
  • Created timestamp - When the experiment was run

Click on any experiment to view detailed results.

Experiment Details

Individual experiment pages show comprehensive results including:

  • Test cases with inputs, expected outputs, and actual outputs
  • Assertion results - Pass/fail status for each evaluator
  • Performance metrics - Duration, token usage, and custom scores
  • Evaluation scores - Detailed scoring from all evaluators

Experiment details

Comparing Experiments

Use the experiment comparison view to analyze performance across different runs:

  1. Select multiple experiments from the list
  2. Click Compare selected
  3. View side-by-side results for the same test cases

Experiment comparison

The comparison view highlights:

  • Differences in outputs between experiment runs
  • Score variations across evaluators
  • Performance changes in metrics like duration and token usage
  • Regression detection when comparing baseline vs current implementations

Integration with Traces

Every evaluation experiment generates detailed OpenTelemetry traces that appear in Logfire:

  • Experiment span - Root span containing all evaluation metadata
  • Case execution spans - Individual test case runs with full context
  • Task function spans - Detailed tracing of your AI system under test
  • Evaluator spans - Scoring and assessment execution details

Navigate from experiment results to full trace details using the span links.
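To make the task-function spans richer, you can instrument your task with Logfire directly, so each evaluated case carries the internal steps of your AI system. The sketch below assumes `logfire.configure()` has already been called; `fetch_context` and `call_model` are hypothetical placeholders for your own retrieval and model-call steps.

```python
import logfire


@logfire.instrument('answer_question')  # wraps each call in its own span
async def answer_question(question: str) -> str:
    with logfire.span('retrieve context'):
        context = await fetch_context(question)  # hypothetical retrieval step
    with logfire.span('call model'):
        return await call_model(question, context)  # hypothetical model call
```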

Best Practices

Organizing Experiments

  • Use descriptive dataset names that will appear in experiment metadata
  • Add commit messages or version information to track code changes
  • Run evaluations consistently as part of your development workflow
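One way to carry version information, as suggested above, is to attach it to your cases via the `metadata` field. This is a sketch under the assumption that recording the current git SHA per case is enough for your tracking needs; exactly where the metadata surfaces in the Logfire UI may vary.

```python
import subprocess

from pydantic_evals import Case, Dataset

# Record the current commit so experiments can be traced back to the code
# that produced them (assumes the evals run from inside a git checkout).
git_sha = subprocess.run(
    ['git', 'rev-parse', '--short', 'HEAD'],
    capture_output=True, text=True, check=True,
).stdout.strip()

dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
            metadata={'git_sha': git_sha},  # recorded with the case for later review
        ),
    ],
)
```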

Monitoring Performance

  • Set up regular evaluation runs to track performance over time
  • Use the comparison view to identify regressions
  • Monitor both accuracy metrics and performance characteristics

Collaborative Analysis

  • Share experiment links with team members for collaborative review
  • Use the trace integration to debug specific test case failures
  • Document significant findings in your evaluation dataset metadata

For implementation help, refer to the Pydantic Evals installation guide.