Unbiased Evaluation of Large Language Models from a Causal Perspective

Meilin Chen; Jian Tian; Liang Ma; Di Xie; Weijie Chen; Jiang Zhu

arXiv:2502.06655 · cs.AI · May 13, 2025

TL;DR

This paper introduces a theoretical framework and a new unbiased evaluation protocol for large language models, addressing biases in current assessment methods and revealing significant room for improvement in LLM performance.

Contribution

It provides a formal analysis of evaluation bias and proposes the Unbiased Evaluator to deliver more accurate and interpretable assessments of LLMs.

Findings

01. Current LLMs show substantial room for improvement.

02. The Unbiased Evaluator detects benchmark contamination.

03. Evaluation biases can be systematically characterized and mitigated.

Abstract

Benchmark contamination has become a significant concern in the LLM evaluation community. Previous Agents-as-an-Evaluator methods address this issue by involving agents in the generation of questions. Despite their success, the biases in Agents-as-an-Evaluator methods remain largely unexplored. In this paper, we present a theoretical formulation of evaluation bias, providing valuable insights into designing unbiased evaluation protocols. Furthermore, we identify two types of bias in Agents-as-an-Evaluator through carefully designed probing tasks on a minimal Agents-as-an-Evaluator setup. To address these issues, we propose the Unbiased Evaluator, an evaluation protocol that delivers a more comprehensive, unbiased, and interpretable assessment of LLMs. Extensive experiments reveal significant room for improvement in current LLMs. Additionally, we demonstrate that the Unbiased Evaluator not only…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques