Eval Results Comparator

Compare two eval runs from JSON or JSONL input and quantify score and pass-rate deltas plus case-level regressions.

  • Matched Cases: 0
  • Score Delta: 0.0
  • Pass Delta (pp): 0.0
  • Improved: 0
  • Regressed: 0

Run metrics

  • Run A records: 0
  • Run B records: 0
  • Run A pass rate: 0.0%
  • Run B pass rate: 0.0%
  • Fail to pass: 0
  • Pass to fail: 0
  • Only in Run A: 0
  • Only in Run B: 0
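The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names, the `score` and `passed` field names, the use of mean score over matched cases for the score delta, and the computation of pass rates over all records in each run are assumptions.

```python
from statistics import mean


def pass_rate(run):
    """Percentage of records in a run whose 'passed' flag is true."""
    if not run:
        return 0.0
    return 100.0 * sum(1 for r in run.values() if r["passed"]) / len(run)


def compare_runs(run_a, run_b):
    """Compare two runs given as {case_id: {"score": float, "passed": bool}}.

    Assumption: deltas are reported as Run B (candidate) minus Run A (baseline).
    """
    matched = sorted(run_a.keys() & run_b.keys())

    # Assumed definition: difference of mean scores over matched cases only.
    if matched:
        score_delta = (mean(run_b[c]["score"] for c in matched)
                       - mean(run_a[c]["score"] for c in matched))
    else:
        score_delta = 0.0

    return {
        "matched_cases": len(matched),
        "score_delta": score_delta,
        # Percentage points: a difference of two percentages.
        "pass_delta_pp": pass_rate(run_b) - pass_rate(run_a),
        "improved": sum(1 for c in matched
                        if run_b[c]["score"] > run_a[c]["score"]),
        "regressed": sum(1 for c in matched
                         if run_b[c]["score"] < run_a[c]["score"]),
        "fail_to_pass": sum(1 for c in matched
                            if not run_a[c]["passed"] and run_b[c]["passed"]),
        "pass_to_fail": sum(1 for c in matched
                            if run_a[c]["passed"] and not run_b[c]["passed"]),
        "only_in_a": len(run_a.keys() - run_b.keys()),
        "only_in_b": len(run_b.keys() - run_a.keys()),
    }
```

Keying each run by case ID keeps the matching step a plain set intersection, and cases present in only one run fall out of the two set differences.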

Comparison JSON

Top case changes

No comparable scored cases yet.

About This Tool

Eval Results Comparator helps you compare baseline and candidate eval runs quickly by quantifying score deltas, pass-rate changes, and case-level regressions.

Frequently Asked Questions

What formats are supported?

Both a JSON array and JSONL (one JSON object per line) are supported. Case IDs can be provided in a custom_id, id, test_id, case_id, or name field.
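As a rough sketch of how input in either format might be read, the Python below parses a JSON array or JSONL text and looks up a case ID. The function names are hypothetical, and checking the ID fields in the order they are listed above is an assumption; the tool's real precedence is not documented here.

```python
import json

# ID fields named in the FAQ; the lookup order is an assumption.
ID_KEYS = ("custom_id", "id", "test_id", "case_id", "name")


def parse_records(text):
    """Parse either a JSON array or JSONL text into a list of records."""
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        # A leading '[' is treated as a single JSON array.
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of records")
        return data
    # Otherwise treat each non-blank line as one JSON object (JSONL).
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def case_id(record):
    """Return the first ID field present as a string, or None if none is set."""
    for key in ID_KEYS:
        value = record.get(key)
        if value is not None:
            return str(value)
    return None
```

Converting the ID to a string means a numeric `id` in one run still matches a string ID in the other.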

What if pass flags are missing?

The tool derives pass/fail from score using your selected threshold.
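A minimal sketch of that fallback in Python follows. The `passed` and `score` field names, the default threshold of 0.5, and the inclusive `>=` comparison are all assumptions made for illustration; the tool's actual field names and boundary handling may differ.

```python
def resolve_pass(record, threshold=0.5):
    """Return the explicit pass flag if present, else derive it from the score.

    Returns None when the record has neither a boolean flag nor a score.
    """
    flag = record.get("passed")  # hypothetical field name
    if isinstance(flag, bool):
        return flag
    score = record.get("score")  # hypothetical field name
    if score is None:
        return None
    # Assumption: a score exactly at the threshold counts as a pass.
    return float(score) >= threshold
```

An explicit flag takes priority here, so changing the threshold only affects records that lack one.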

Is eval data uploaded?

No. Parsing and comparison run entirely in your browser.