Eval Results Comparator
Compare two eval runs from JSON or JSONL and quantify score and pass-rate deltas plus case-level regressions.
- Matched Cases: 0
- Score Delta: 0.0
- Pass Delta (pp): 0.0
- Improved: 0
- Regressed: 0
Run metrics
- Run A records: 0
- Run B records: 0
- Run A pass rate: 0.0%
- Run B pass rate: 0.0%
- Fail to pass: 0
- Pass to fail: 0
- Only in Run A: 0
- Only in Run B: 0
Comparison JSON
Top case changes
No comparable scored cases yet.
About This Tool
Eval Results Comparator helps you compare baseline and candidate eval runs quickly by quantifying score deltas, pass-rate changes, and case-level regressions.
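As a rough illustration of the kind of comparison described here, the sketch below computes the headline numbers for two small runs. The metric definitions are assumptions, not the tool's confirmed logic: score delta is taken as mean score of Run B minus mean score of Run A over matched cases, pass delta is the difference in pass rates in percentage points over all records of each run, and the function name `compare_runs` and the `score`/`passed` keys are hypothetical.

```python
from statistics import mean


def compare_runs(run_a, run_b):
    """Compare two runs shaped as {case_id: {"score": float, "passed": bool}}.

    Assumed definitions (not confirmed by the tool):
    - score delta: mean(B) - mean(A) over cases present in both runs
    - pass delta: pass rate of B minus pass rate of A, in percentage points
    """
    matched = sorted(set(run_a) & set(run_b))

    def pass_rate(run):
        # Percentage of records flagged as passed; 0.0 for an empty run.
        return 100.0 * sum(r["passed"] for r in run.values()) / len(run) if run else 0.0

    if matched:
        score_delta = mean(run_b[c]["score"] for c in matched) - mean(
            run_a[c]["score"] for c in matched
        )
    else:
        score_delta = 0.0

    return {
        "matched_cases": len(matched),
        "score_delta": score_delta,
        "pass_delta_pp": pass_rate(run_b) - pass_rate(run_a),
        "improved": [c for c in matched if run_b[c]["score"] > run_a[c]["score"]],
        "regressed": [c for c in matched if run_b[c]["score"] < run_a[c]["score"]],
        "fail_to_pass": sum(
            1 for c in matched if not run_a[c]["passed"] and run_b[c]["passed"]
        ),
        "pass_to_fail": sum(
            1 for c in matched if run_a[c]["passed"] and not run_b[c]["passed"]
        ),
        "only_in_a": sorted(set(run_a) - set(run_b)),
        "only_in_b": sorted(set(run_b) - set(run_a)),
    }


# Illustrative data only; these cases are made up for the example.
baseline = {
    "c1": {"score": 0.25, "passed": False},
    "c2": {"score": 1.0, "passed": True},
    "c3": {"score": 0.75, "passed": True},
}
candidate = {
    "c1": {"score": 1.0, "passed": True},
    "c2": {"score": 0.5, "passed": False},
    "c4": {"score": 0.25, "passed": False},
}
result = compare_runs(baseline, candidate)
print(result)
```

Keeping the matched, only-in-A, and only-in-B sets separate means cases missing from one run do not silently distort the score delta.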
Frequently Asked Questions
What formats are supported?
JSON array and JSONL are supported. IDs can be provided as custom_id, id, test_id, case_id, or name.
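A minimal sketch of how input in either format might be parsed and how a case ID might be picked from the accepted keys. The function names `parse_records` and `case_id` are hypothetical, and the key precedence (the order listed above) is an assumption; the tool may resolve conflicts differently.

```python
import json

# Accepted ID keys, checked in the order the FAQ lists them (assumed precedence).
ID_KEYS = ("custom_id", "id", "test_id", "case_id", "name")


def parse_records(text):
    """Parse a JSON array or JSONL string into a list of records.

    Assumption: input starting with "[" is a JSON array; anything else
    is treated as JSONL with one JSON object per non-blank line.
    """
    text = text.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def case_id(record):
    """Return the first non-empty ID found under the accepted keys, as a string."""
    for key in ID_KEYS:
        value = record.get(key)
        if value is not None and value != "":
            return str(value)
    return None  # no usable ID; such a record cannot be matched across runs


array_input = '[{"id": "a", "score": 1}]'
jsonl_input = '{"custom_id": "x", "score": 0.5}\n\n{"name": "y", "score": 0.9}\n'

print([case_id(r) for r in parse_records(array_input)])
print([case_id(r) for r in parse_records(jsonl_input)])
```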
What if pass flags are missing?
The tool derives pass/fail from score using your selected threshold.
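The fallback can be sketched as follows. The field names `passed` and `score`, the default threshold of 0.5, and the use of `>=` (a score equal to the threshold counts as a pass) are all assumptions for illustration; the tool's actual field names and comparison may differ.

```python
def derive_pass(record, threshold=0.5):
    """Return the pass flag for a record, deriving it from the score if needed.

    Assumptions: the explicit flag lives under "passed", the score under
    "score", and a score equal to the threshold counts as a pass.
    """
    flag = record.get("passed")
    if isinstance(flag, bool):
        return flag  # an explicit flag always wins over the score

    score = record.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, (int, float)) and not isinstance(score, bool):
        return score >= threshold

    return None  # neither a flag nor a numeric score is available


print(derive_pass({"score": 0.8}, threshold=0.7))
print(derive_pass({"score": 0.8, "passed": False}, threshold=0.7))
```

Because the flag is derived, changing the threshold can change pass-rate deltas even when the scores themselves are identical.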
Is eval data uploaded?
No. Parsing and comparison run entirely in your browser.
Related Tools
OpenAI Batch JSONL Validator
Validate Batch API JSONL lines, detect errors, and export valid records.
LLM Response Grader
Grade model responses using weighted rubric rules, regex checks, and banned-term penalties.
Prompt Regression Suite Builder
Compare prompt versions, detect removed constraints, and generate deterministic QA suites.
Compare With Similar Tools
Decision pages that show when to use each tool.
Workflow Links
Suggested step-by-step tools based on this page's intent.
Before This Tool
JSON Output Repairer
Repair malformed AI JSON outputs and recover parser-safe structured data.
Prompt A/B Test Matrix
Generate deterministic prompt variant matrices across tone, length, and output format.
Agent Safety Checklist
Audit agent runbooks for allowlists, confirmation gates, budgets, fallbacks, and logging.
Next Step Tools
Prompt Diff Optimizer
Compare prompt revisions, estimate token delta, and spot removed constraint lines.
Prompt Injection Simulator
Simulate prompt-injection attacks and score guardrail resilience before release.
Answer Consistency Checker
Compare multiple model answers and detect conflicts, drift, and stability issues.