Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Eddie Landesberg; Manjari Narayan

arXiv:2512.11150·stat.ME·January 22, 2026

TL;DR

Causal Judge Evaluation (CJE) calibrates cheap LLM-judge scores against a small oracle-labeled slice, reaching 99% pairwise ranking accuracy at 14x lower cost at a 5% oracle fraction, and uses per-policy oracle audits to test whether the calibration still holds.

Contribution

CJE introduces a calibration framework that makes surrogate LLM evaluation metrics auditable, enabling scalable, reliable assessment with minimal oracle data.

Findings

01

At a 5% oracle fraction, CJE achieves 99% pairwise ranking accuracy at 14x lower cost.

02

Naive confidence intervals on raw scores have 0% coverage; CJE achieves ~95%.

03

Adversarial policies are correctly flagged and not misrepresented.

Abstract

Measuring long-run LLM outcomes (user satisfaction, expert judgment, downstream KPIs) is expensive. Teams default to cheap LLM judges, but uncalibrated proxies can invert rankings entirely. Causal Judge Evaluation (CJE) makes it affordable to aim at the right target: calibrate cheap scores against a small oracle slice, then evaluate at scale with valid uncertainty. We treat surrogate validity as auditable: for each policy or deployment context, a small oracle audit tests whether the learned calibration remains mean-unbiased, turning an uncheckable identification condition into a falsifiable diagnostic. On 4,961 Chatbot Arena prompts comparing five policies with a 16x oracle/judge cost ratio, at a 5% oracle fraction CJE achieves 99% pairwise ranking accuracy at 14x lower cost; across all configurations (5-50% oracle, varying n), accuracy averages 94%. An adversarial policy fails the…
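
The calibrate-then-audit workflow described in the abstract can be illustrated with a minimal sketch on simulated data. Everything here is an assumption made for illustration, not the authors' implementation: the abstract does not name the calibrator, so isotonic regression (pool-adjacent-violators) is swapped in as one plausible monotone choice; the data-generating process, the helper names `fit_isotonic`, `predict`, and `audit_mean_residual`, and the 1.96 threshold are all hypothetical. The sketch also omits the uncertainty accounting that the paper's confidence intervals would require.

```python
import math
import random
import statistics


def fit_isotonic(xs, ys):
    """Pool-adjacent-violators fit. Returns a list of (upper_x, value) blocks."""
    blocks = []  # each block is [sum_y, count, upper_x]
    for x, y in sorted(zip(xs, ys)):
        blocks.append([y, 1, x])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c, upper = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] = upper
    return [(b[2], b[0] / b[1]) for b in blocks]


def predict(model, x):
    """Step-function prediction: value of the first block whose upper bound covers x."""
    for upper, value in model:
        if x <= upper:
            return value
    return model[-1][1]


def audit_mean_residual(model, judge_scores, oracle_labels, z=1.96):
    """Check whether the mean of (oracle - calibrated judge) is consistent with zero."""
    residuals = [y - predict(model, s) for s, y in zip(judge_scores, oracle_labels)]
    mean = statistics.fmean(residuals)
    se = statistics.stdev(residuals) / math.sqrt(len(residuals))
    return mean, se, abs(mean) <= z * se


random.seed(0)

# Simulated base policy (assumption): the judge is monotone in quality but
# miscalibrated, so the true oracle mean is 1/3 while the raw judge mean is ~0.5.
n = 5000
judge = [random.random() for _ in range(n)]
oracle = [s ** 2 + random.gauss(0, 0.1) for s in judge]

# Only a 5% oracle slice is used to learn the calibration.
slice_idx = random.sample(range(n), n // 20)
model = fit_isotonic([judge[i] for i in slice_idx], [oracle[i] for i in slice_idx])

raw_estimate = statistics.fmean(judge)
calibrated_estimate = statistics.fmean([predict(model, s) for s in judge])

# Simulated adversarial policy (assumption): the judge overrates it by 0.3,
# so the calibration learned above should not transfer.
adv_judge = [random.random() for _ in range(200)]
adv_oracle = [s ** 2 - 0.3 + random.gauss(0, 0.1) for s in adv_judge]
adv_mean, adv_se, adv_ok = audit_mean_residual(model, adv_judge, adv_oracle)

print(f"raw judge mean:        {raw_estimate:.3f}")
print(f"calibrated estimate:   {calibrated_estimate:.3f}")
print(f"adversarial audit ok?: {adv_ok} (mean residual {adv_mean:.3f})")
```

In this toy setup the calibrated estimate should land much closer to the simulated oracle mean than the raw judge mean does, and the audit on the shifted policy should reject mean-unbiasedness, which is the kind of falsifiable check the abstract describes.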

Code & Models

Datasets

elandy/cje-chatbot-arena
dataset · 24 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Imbalanced Data Classification Techniques · Topic Modeling