CARE: Confounder-Aware Aggregation for Reliable LLM Evaluation

Jitian Zhao; Changho Shin; Tzu-Heng Huang; Satya Sai Srinath Namburi GNVV; Frederic Sala

arXiv:2603.00039·cs.LG·March 3, 2026

Open Access

TL;DR

CARE introduces a confounder-aware aggregation method for LLM evaluation that models shared biases among judges, leading to more accurate quality assessments across diverse benchmarks.

Contribution

CARE provides a framework that explicitly models the confounders shared across LLM judge scores, backed by theoretical guarantees and empirical improvements.

Findings

01 Reduces aggregation error by up to 26.8%

02 Improves accuracy across multiple benchmark types

03 Offers theoretical guarantees for confounder modeling

Abstract

LLM-as-a-judge ensembles are the standard paradigm for scalable evaluation, but their aggregation mechanisms suffer from a fundamental flaw: they implicitly assume that judges provide independent estimates of true quality. However, in practice, LLM judges exhibit correlated errors caused by shared latent confounders -- such as verbosity, stylistic preferences, or training artifacts -- causing standard aggregation rules like majority vote or averaging to provide little gain or even amplify systematic mistakes. To address this, we introduce CARE, a confounder-aware aggregation framework that explicitly models LLM judge scores as arising from both a latent true-quality signal and shared confounding factors. Rather than heuristically re-weighting judges, CARE separates quality from confounders without access to ground-truth labels. We provide theoretical guarantees for identifiability and…
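
To illustrate why averaging can fail when judges share a confounder, here is a minimal simulation sketch. It is not CARE's algorithm. It assumes a simple additive model (score = true quality + weight × shared confounder + independent noise) that the abstract does not specify; the function name `averaging_mse` and all of its parameters are hypothetical and exist only for this illustration.

```python
import random


def averaging_mse(num_items, num_judges, confounder_weight, noise_sd=1.0, seed=0):
    """Mean squared error of the plain judge average against true quality.

    Assumed toy model (not taken from the paper): each judge's score is
    quality + confounder_weight * confounder + independent Gaussian noise,
    where the confounder (e.g. verbosity) is shared by every judge on an item.
    """
    rng = random.Random(seed)
    total_sq_err = 0.0
    for _ in range(num_items):
        quality = rng.gauss(0.0, 1.0)      # latent true quality of the item
        confounder = rng.gauss(0.0, 1.0)   # latent factor shared by all judges
        scores = [
            quality + confounder_weight * confounder + rng.gauss(0.0, noise_sd)
            for _ in range(num_judges)
        ]
        mean_score = sum(scores) / num_judges
        total_sq_err += (mean_score - quality) ** 2
    return total_sq_err / num_items


if __name__ == "__main__":
    # Compare independent judges (weight 0) with confounded judges (weight 1).
    for k in (1, 5, 50):
        independent = averaging_mse(2000, k, confounder_weight=0.0)
        confounded = averaging_mse(2000, k, confounder_weight=1.0)
        print(f"judges={k:2d}  independent={independent:.3f}  confounded={confounded:.3f}")
```

Under this toy model, adding judges shrinks the independent-noise term but leaves the shared confounder term in the average untouched, which is the failure mode the abstract attributes to majority vote and averaging.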

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Ethics and Social Impacts of AI