Eval Results Comparator

Compare two eval runs from JSON or JSONL and quantify score and pass-rate deltas plus case-level regressions.

Derived pass threshold (if pass flag is missing)

Run A (baseline)

Run B (candidate)

Matched Cases: 0
Score Delta: 0.0
Pass Delta (pp): 0.0
Improved: 0
Regressed: 0

Run metrics

Run A records: 0
Run B records: 0
Run A pass rate: 0.0%
Run B pass rate: 0.0%
Fail to pass: 0
Pass to fail: 0
Only in Run A: 0
Only in Run B: 0

Comparison JSON

{
  "threshold": 70,
  "counts": {
    "runA": 0,
    "runB": 0,
    "matched": 0,
    "onlyA": 0,
    "onlyB": 0
  },
  "metrics": {
    "avgScoreA": 0,
    "avgScoreB": 0,
    "scoreDelta": 0,
    "passRateA": 0,
    "passRateB": 0,
    "passRateDelta": 0,
    "improved": 0,
    "regressed": 0,
    "failToPass": 0,
    "passToFail": 0
  },
  "topChanges": []
}
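As a rough illustration of how the fields in the comparison JSON above could be computed, here is a minimal Python sketch. It is not the tool's actual implementation. The function name `compare_runs`, the record keys `score` and `pass`, the `>=` threshold rule, the choice to average over matched cases only, and the shape of each `topChanges` entry are all assumptions made for the example.

```python
def compare_runs(run_a, run_b, threshold=70.0, top_n=5):
    """Compare two runs given as {case_id: {"score": number, "pass": bool (optional)}}."""

    def passed(rec):
        # Assumption: an explicit boolean "pass" flag wins; otherwise score >= threshold.
        flag = rec.get("pass")
        return flag if isinstance(flag, bool) else rec["score"] >= threshold

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    def pass_rate(run, ids):
        # Pass rate in percent over the given case IDs.
        return 100.0 * mean([1.0 if passed(run[c]) else 0.0 for c in ids])

    # Only cases present in both runs are comparable.
    matched = sorted(set(run_a) & set(run_b))
    deltas = {c: run_b[c]["score"] - run_a[c]["score"] for c in matched}
    avg_a = mean([run_a[c]["score"] for c in matched])
    avg_b = mean([run_b[c]["score"] for c in matched])
    rate_a, rate_b = pass_rate(run_a, matched), pass_rate(run_b, matched)

    # Largest absolute score changes first (hypothetical entry shape).
    top = sorted(matched, key=lambda c: -abs(deltas[c]))[:top_n]

    return {
        "threshold": threshold,
        "counts": {
            "runA": len(run_a),
            "runB": len(run_b),
            "matched": len(matched),
            "onlyA": len(set(run_a) - set(run_b)),
            "onlyB": len(set(run_b) - set(run_a)),
        },
        "metrics": {
            "avgScoreA": avg_a,
            "avgScoreB": avg_b,
            "scoreDelta": avg_b - avg_a,
            "passRateA": rate_a,
            "passRateB": rate_b,
            # Difference of two percentages, i.e. percentage points (pp).
            "passRateDelta": rate_b - rate_a,
            "improved": sum(1 for d in deltas.values() if d > 0),
            "regressed": sum(1 for d in deltas.values() if d < 0),
            "failToPass": sum(
                1 for c in matched if not passed(run_a[c]) and passed(run_b[c])
            ),
            "passToFail": sum(
                1 for c in matched if passed(run_a[c]) and not passed(run_b[c])
            ),
        },
        "topChanges": [
            {"id": c, "scoreA": run_a[c]["score"], "scoreB": run_b[c]["score"],
             "delta": deltas[c]}
            for c in top
        ],
    }
```

Calling `compare_runs({}, {})` yields all-zero counts and metrics with an empty `topChanges` list, which mirrors the empty state shown above.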

Top case changes

No comparable scored cases yet.

About This Tool

Eval Results Comparator helps you compare baseline and candidate eval runs quickly by quantifying score deltas, pass-rate changes, and case-level regressions.

Frequently Asked Questions

What formats are supported?

JSON array and JSONL are supported. IDs can be provided as custom_id, id, test_id, case_id, or name.
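A minimal Python sketch of how input in either format might be parsed and how a case ID might be resolved. The function names `parse_records` and `case_id` are hypothetical, and the order in which the ID fields are checked is an assumption; the page only lists which fields are accepted.

```python
import json

# ID fields named in the answer above; the lookup order is an assumption.
ID_KEYS = ("custom_id", "id", "test_id", "case_id", "name")


def parse_records(text):
    """Parse a JSON array or JSONL text into a list of dict records."""
    stripped = text.strip()
    if not stripped:
        return []
    if stripped.startswith("["):
        # Whole input is a single JSON array.
        data = json.loads(stripped)
        return [rec for rec in data if isinstance(rec, dict)]
    # Otherwise treat the input as JSONL: one JSON object per non-empty line.
    records = []
    for line in stripped.splitlines():
        line = line.strip()
        if line:
            obj = json.loads(line)
            if isinstance(obj, dict):
                records.append(obj)
    return records


def case_id(record):
    """Return the first non-empty ID field as a string, or None if absent."""
    for key in ID_KEYS:
        value = record.get(key)
        if value is not None and value != "":
            return str(value)
    return None
```

Records for which `case_id` returns `None` cannot be matched across runs, so a comparator would presumably skip or report them.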

What if pass flags are missing?

The tool derives pass/fail from score using your selected threshold.
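One way this fallback could work, sketched in Python. The field names `pass` and `score`, the function name `derive_pass`, and the use of `>=` (rather than `>`) at the threshold are assumptions; the page does not specify them.

```python
def derive_pass(record, threshold=70.0):
    """Return the pass/fail status of a record, or None if it cannot be determined."""
    flag = record.get("pass")
    if isinstance(flag, bool):
        # An explicit pass flag takes precedence over the score.
        return flag
    score = record.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, (int, float)) and not isinstance(score, bool):
        # Assumption: a score exactly at the threshold counts as a pass.
        return score >= threshold
    return None
```

With the default threshold of 70, a record with `score` 70 and no flag is treated as a pass, while one with `score` 69.9 is treated as a fail.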

Is eval data uploaded?

No. Parsing and comparison run entirely in your browser.

Related Tools

OpenAI Batch JSONL Validator

Validate Batch API JSONL lines, detect errors, and export valid records.

LLM Response Grader

Grade model responses using weighted rubric rules, regex checks, and banned-term penalties.

Prompt Regression Suite Builder

Compare prompt versions, detect removed constraints, and generate deterministic QA suites.

Compare With Similar Tools

Decision pages to quickly see when to use each tool.

Eval Results Comparator vs Prompt Regression Suite Builder

Run-to-run eval delta analysis vs deterministic regression suite construction.

AI QA Workflow Runner vs Eval Results Comparator

End-to-end QA gate decisioning vs baseline-candidate eval delta analytics.

Workflow Links

Suggested tools to use before and after this one, based on this page's intent.

Before This Tool

JSON Output Repairer

Repair malformed AI JSON outputs and recover parser-safe structured data.

Prompt A/B Test Matrix

Generate deterministic prompt variant matrices across tone, length, and output format.

Agent Safety Checklist

Audit agent runbooks for allowlists, confirmation gates, budgets, fallbacks, and logging.

Next Step Tools

Prompt Diff Optimizer

Compare prompt revisions, estimate token delta, and spot removed constraint lines.

Prompt Injection Simulator

Simulate prompt-injection attacks and score guardrail resilience before release.

Answer Consistency Checker

Compare multiple model answers and detect conflicts, drift, and stability issues.