Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
Eddie Landesberg, Manjari Narayan

TL;DR
Causal Judge Evaluation (CJE) calibrates cheap LLM judge scores against a small oracle-labeled slice, reaching 99% pairwise ranking accuracy at 14x lower cost while keeping the surrogate auditable through per-policy diagnostics.
Contribution
CJE introduces a calibration framework that makes surrogate LLM evaluation metrics auditable, enabling scalable, reliable assessment with minimal oracle data.
Findings
CJE achieves 99% pairwise ranking accuracy at 14x lower cost.
Naive confidence intervals on raw scores have 0% coverage; CJE achieves ~95%.
Adversarial policies are correctly flagged and not misrepresented.
Abstract
Measuring long-run LLM outcomes (user satisfaction, expert judgment, downstream KPIs) is expensive. Teams default to cheap LLM judges, but uncalibrated proxies can invert rankings entirely. Causal Judge Evaluation (CJE) makes it affordable to aim at the right target: calibrate cheap scores against a small oracle slice, then evaluate at scale with valid uncertainty. We treat surrogate validity as auditable: for each policy or deployment context, a small oracle audit tests whether the learned calibration remains mean-unbiased, turning an uncheckable identification condition into a falsifiable diagnostic. On 4,961 Chatbot Arena prompts comparing five policies with a 16x oracle/judge cost ratio, at a 5% oracle fraction CJE achieves 99% pairwise ranking accuracy at 14x lower cost; across all configurations (5-50% oracle, varying n), accuracy averages 94%. An adversarial policy fails the…
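The calibrate-then-audit idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which calibrator or audit statistic CJE uses, so the sketch substitutes a plain isotonic (monotone) fit and a simple z-test on the mean residual, run on synthetic data. The names fit_isotonic, predict, audit_mean_bias, and simulate are hypothetical and are not taken from the paper or its code.

```python
import random
import statistics

def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: nondecreasing step fit of labels on scores."""
    blocks = []  # each block holds [label_sum, count, largest_score]
    for x, y in sorted(zip(scores, labels)):
        blocks.append([y, 1, x])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]

def predict(steps, score):
    """Map a judge score to its calibrated value (clamped at both ends)."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]

def audit_mean_bias(steps, judge, oracle):
    """Check whether mean(oracle - calibrated) is consistent with zero."""
    residuals = [o - predict(steps, j) for j, o in zip(judge, oracle)]
    bias = statistics.fmean(residuals)
    se = statistics.stdev(residuals) / len(residuals) ** 0.5
    return bias, se, abs(bias) <= 1.96 * se

rng = random.Random(0)

def simulate(n, shift=0.0):
    """Synthetic data (assumption): oracle = judge**2 + shift + noise."""
    judge = [rng.random() for _ in range(n)]
    oracle = [s * s + shift + rng.gauss(0, 0.1) for s in judge]
    return judge, oracle

# Score everything with the cheap judge; label only a 5% slice with the oracle.
judge, oracle = simulate(5000)
slice_idx = rng.sample(range(5000), 250)
steps = fit_isotonic([judge[i] for i in slice_idx],
                     [oracle[i] for i in slice_idx])

raw_mean = statistics.fmean(judge)                         # uncalibrated proxy
estimate = statistics.fmean([predict(steps, s) for s in judge])
truth = statistics.fmean(oracle)                           # known only in simulation

# Hypothetical shifted policy: same judge scores, systematically worse outcomes.
adv_judge, adv_oracle = simulate(250, shift=-0.3)
adv_bias, adv_se, adv_ok = audit_mean_bias(steps, adv_judge, adv_oracle)

print(f"raw={raw_mean:.3f} calibrated={estimate:.3f} truth={truth:.3f}")
print(f"shifted-policy audit: bias={adv_bias:.3f} se={adv_se:.3f} pass={adv_ok}")
```

In this toy setup the raw judge mean overstates the oracle mean because the judge scale is nonlinear in the outcome, while the calibrated mean tracks it using only the labeled slice. The audit on the shifted policy rejects mean-unbiasedness, which is the kind of signal that would argue against reusing the calibration for that policy. The sketch does not attempt the paper's uncertainty quantification; valid confidence intervals would also need to account for error in the fitted calibrator.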
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Imbalanced Data Classification Techniques · Topic Modeling
