Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

Yotam Perlitz; Ariel Gera; Ofir Arviv; Asaf Yehudai; Elron Bandel; Eyal Shnarch; Michal Shmueli-Scheuer; Leshem Choshen

arXiv:2407.13696·cs.CL·September 13, 2024

TL;DR

This paper critically examines the agreement testing of language model benchmarks, identifies methodological issues, proposes best practices, and introduces BenchBench, a tool to improve benchmark validation reliability.

Contribution

The paper introduces standardized procedures for Benchmark Agreement Testing (BAT), a new Python package (BenchBench), and a meta-benchmark leaderboard, all aimed at improving the validity of benchmark evaluations.

Findings

01 Methodological choices significantly affect BAT results.

02 Implementing best practices improves BAT robustness.

03 The BenchBench tool facilitates standardized benchmark agreement testing.

Abstract

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and upending the ability to properly choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of…
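To make the agreement metric mentioned in the abstract concrete, here is a minimal sketch of a rank-correlation check between a new benchmark and an established one. It is not the BenchBench API: the function `kendall_tau`, the model names, and all scores are hypothetical and chosen only for illustration. The sketch computes Kendall's tau-a over the models that both benchmarks share.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two benchmarks, over the models they share.

    Both arguments map a model name to its score on one benchmark.
    Tied pairs count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both benchmarks order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores, invented for this example only.
new_benchmark = {"model_a": 71.2, "model_b": 64.5, "model_c": 58.9, "model_d": 40.1}
reference_benchmark = {"model_a": 0.83, "model_b": 0.85, "model_c": 0.70, "model_d": 0.52}

tau = kendall_tau(new_benchmark, reference_benchmark)
print(f"Kendall tau: {tau:.2f}")  # prints "Kendall tau: 0.67"
```

In this toy data the two benchmarks disagree on only one of the six model pairs, giving a tau of about 0.67. Changing which models or which reference benchmark enter the comparison can shift that number, which is why the choice of inputs matters when reporting agreement.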

Code & Models

Repositories

ibm/benchbench (Official)

Taxonomy

Topics: Artificial Intelligence in Law

Methods: Sparse Evolutionary Training