Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

Longyuan Zhu; Hairan Hua; Linlin Miao; Bing Zhao

arXiv:2602.11674 · cs.AI · February 13, 2026

TL;DR

The paper introduces the Benchmark Health Index (BHI), a comprehensive data-driven framework to evaluate and manage the reliability and longevity of benchmarks used for assessing Large Language Models.

Contribution

The paper presents what it describes as the first macro-level framework for quantifying benchmark health, targeting the score inflation and selective reporting that undermine LLM evaluation.

Findings

01. Analyzed 106 validated benchmarks gathered from 91 models in 2025.

02. Identified three axes for benchmark assessment: Capability Discrimination, Anti-Saturation, and Impact.

03. Provided a systematic characterization of the evaluation landscape.

Abstract

Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a pure data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating remaining headroom before ceiling effects erode resolution and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical…
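
To make the three axes concrete, here is a minimal Python sketch of what simple proxies for them could look like. It is not the paper's method: the abstract does not give the BHI formulas, so the function names (`discrimination`, `headroom`, `adoption_breadth`), the `noise_floor` and `top_k` parameters, and the choice of standard deviation and top-k gap as statistics are all assumptions made for illustration.

```python
from statistics import pstdev


def discrimination(scores, noise_floor=0.0):
    """Hypothetical proxy for Capability Discrimination.

    Spread of model scores on one benchmark, minus an assumed noise
    floor, clipped at zero. The paper's actual statistic may differ.
    """
    return max(pstdev(scores) - noise_floor, 0.0)


def headroom(scores, max_score=1.0, top_k=3):
    """Hypothetical proxy for Anti-Saturation.

    Gap between the benchmark ceiling and the mean of the top-k
    model scores; a small gap suggests the benchmark is saturating.
    """
    top = sorted(scores, reverse=True)[:top_k]
    return max_score - sum(top) / len(top)


def adoption_breadth(num_reporting, num_models):
    """Hypothetical proxy for one part of Impact.

    Fraction of surveyed models that report results on the benchmark.
    """
    return num_reporting / num_models


# Made-up scores for illustration only (fractions of the maximum score).
spread_out = [0.35, 0.50, 0.62, 0.78, 0.90]
saturated = [0.96, 0.97, 0.98, 0.98, 0.99]

for name, scores in [("spread_out", spread_out), ("saturated", saturated)]:
    print(name, round(discrimination(scores), 3), round(headroom(scores), 3))
```

Under these assumed proxies, a benchmark whose top scores cluster near the ceiling yields both low spread and low headroom, which matches the abstract's description of ceiling effects eroding resolution.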

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Computational and Text Analysis Methods