CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation
Jitian Zhao, Changho Shin, Tzu-Heng Huang, Satya Sai Srinath Namburi GNVV, Frederic Sala

TL;DR
CARE is a confounder-aware aggregation method for LLM evaluation that models biases shared among judges, yielding more accurate quality assessments across diverse benchmarks.
Contribution
CARE provides a framework that explicitly accounts for shared confounders in LLM judge scores, backed by theoretical guarantees and practical improvements.
Findings
Reduces aggregation error by up to 26.8%
Improves accuracy across multiple benchmark types
Offers theoretical guarantees for confounder modeling
Abstract
LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and…
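The failure mode described in the abstract can be illustrated with a toy simulation. The sketch below is not CARE and not the authors' model. It assumes a hypothetical additive setup in which every judge's score is true quality, plus one shared confounder (for example verbosity), plus independent noise. The function name `simulate_mean_error` and all parameters are illustrative assumptions. Under this setup the mean squared error of plain averaging is roughly `confounder_weight**2 + noise_sd**2 / n_judges`, so adding judges shrinks only the independent-noise term.

```python
import random
import statistics


def simulate_mean_error(n_judges, n_items=2000, confounder_weight=0.0,
                        noise_sd=1.0, seed=0):
    """Mean squared error of the plain judge average against true quality.

    Hypothetical toy model (an assumption, not taken from the paper):
        score = quality + confounder_weight * confounder + independent noise
    """
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_items):
        quality = rng.gauss(0.0, 1.0)     # latent true quality of the item
        confounder = rng.gauss(0.0, 1.0)  # shared by all judges for this item
        scores = [
            quality + confounder_weight * confounder + rng.gauss(0.0, noise_sd)
            for _ in range(n_judges)
        ]
        estimate = statistics.fmean(scores)  # standard averaging rule
        squared_errors.append((estimate - quality) ** 2)
    return statistics.fmean(squared_errors)


# Independent judges: error falls as the ensemble grows.
independent_3 = simulate_mean_error(3)
independent_30 = simulate_mean_error(30)

# Judges sharing a confounder: error plateaus near confounder_weight**2.
confounded_3 = simulate_mean_error(3, confounder_weight=1.0)
confounded_30 = simulate_mean_error(30, confounder_weight=1.0)

print(f"independent judges: n=3 -> {independent_3:.3f}, n=30 -> {independent_30:.3f}")
print(f"confounded judges:  n=3 -> {confounded_3:.3f}, n=30 -> {confounded_30:.3f}")
```

In this toy setting, growing the ensemble from 3 to 30 judges removes most of the error when judges are independent but leaves most of it in place when they share a confounder. This is the gap that confounder-aware aggregation aims to close.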
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI
