Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen

TL;DR
This paper critically examines the agreement testing of language model benchmarks, identifies methodological issues, proposes best practices, and introduces BenchBench, a tool to improve benchmark validation reliability.
Contribution
The paper introduces standardized procedures for Benchmark Agreement Testing (BAT), a new Python package called BenchBench, and a meta-benchmark leaderboard to enhance the validity of benchmark evaluations.
Findings
Methodological choices significantly affect BAT results.
Implementing best practices improves BAT robustness.
The BenchBench tool facilitates standardized benchmark agreement testing.
Abstract
Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and upending the ability to properly choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of…
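To make the idea of an agreement metric concrete, here is a minimal sketch of a rank-correlation check between a new benchmark and an established one. It is not the BenchBench API: the function name `kendall_tau`, the model names, and all scores are hypothetical and chosen only for illustration. The sketch computes Kendall's tau (the tau-a variant, which ignores tie corrections) over the models that both benchmarks have scored.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two benchmarks over the models they share.

    scores_a, scores_b: dicts mapping model name -> benchmark score.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both benchmarks order the pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for this example only.
established = {"model_a": 0.82, "model_b": 0.75, "model_c": 0.61, "model_d": 0.58}
new_bench = {"model_a": 0.70, "model_b": 0.73, "model_c": 0.52, "model_d": 0.40}

tau_all = kendall_tau(established, new_bench)

# Same benchmarks, but compared only on the two top-scoring models.
top_two = {"model_a", "model_b"}
tau_top = kendall_tau(
    {m: s for m, s in established.items() if m in top_two},
    {m: s for m, s in new_bench.items() if m in top_two},
)

print(f"all four models: tau = {tau_all:.2f}")
print(f"top two models:  tau = {tau_top:.2f}")
```

On this toy data the two benchmarks agree fairly well across all four models but disagree completely on the top two. The choice of which models to compare is used here as one illustrative methodological choice; it is an assumption of this sketch rather than a detail taken from the text above.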
Taxonomy
Topics: Artificial Intelligence in Law
Methods: Sparse Evolutionary Training
