Do Large Language Model Benchmarks Test Reliability?

Joshua Vendrow; Edward Vendrow; Sara Beery; Aleksander Madry

arXiv:2502.03461 · cs.LG · February 6, 2025

TL;DR

This paper examines how well current large language model benchmarks measure reliability, shows that label errors compromise these evaluations, and introduces platinum benchmarks to better evaluate model reliability and expose persistent failures.

Contribution

The paper identifies the limitations of existing benchmarks in measuring reliability, proposes platinum benchmarks curated to minimize label errors and ambiguity, and reveals new failure patterns in frontier LLMs.

Findings

01. Label errors in benchmarks can obscure true model failures.
02. Frontier LLMs still fail on simple tasks such as elementary math problems.
03. Curated platinum benchmarks uncover new failure patterns in models.

Abstract

When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum…

Code & Models

Repositories

MadryLab/platinum-benchmarks (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Focus