Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
Longyuan Zhu, Hairan Hua, Linlin Miao, Bing Zhao

TL;DR
The paper introduces the Benchmark Health Index (BHI), a comprehensive data-driven framework to evaluate and manage the reliability and longevity of benchmarks used for assessing Large Language Models.
Contribution
It presents the first macro-level framework for quantifying benchmark health, addressing issues of score inflation and selective reporting in LLM evaluation.
Findings
Analyzed 106 validated benchmarks reported across 91 models in 2025.
Identified three axes for benchmark assessment: Capability Discrimination, Anti-Saturation, and Impact.
Provided a systematic characterization of the evaluation landscape.
Abstract
Large Language Models (LLMs) are advancing rapidly, yet the benchmarks used to measure this progress are becoming increasingly unreliable. Score inflation and selective reporting have eroded the authority of standard benchmarks, leaving the community uncertain about which evaluation results remain trustworthy. We introduce the Benchmark Health Index (BHI), a purely data-driven framework for auditing evaluation sets along three orthogonal and complementary axes: (1) Capability Discrimination, measuring how sharply a benchmark separates model performance beyond noise; (2) Anti-Saturation, estimating the remaining headroom before ceiling effects erode resolution, and thus the benchmark's expected longevity; and (3) Impact, quantifying influence across academic and industrial ecosystems via adoption breadth and practice-shaping power. By distilling 106 validated benchmarks from the technical…
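The abstract names the three axes, but this excerpt does not give their formulas. The sketch below is a purely illustrative toy and not the paper's actual definitions: the function names, the pairwise-gap rule, the `noise_margin` and `max_score` parameters, and the toy numbers are all assumptions chosen only to make the intuition behind each axis concrete.

```python
from itertools import combinations


def discrimination(scores, noise_margin=1.0):
    """Hypothetical proxy: fraction of model pairs whose score gap
    exceeds an assumed noise margin (in score points)."""
    pairs = list(combinations(scores, 2))
    if not pairs:
        return 0.0
    separated = sum(1 for a, b in pairs if abs(a - b) > noise_margin)
    return separated / len(pairs)


def anti_saturation(scores, max_score=100.0):
    """Hypothetical proxy: headroom left above the best observed score,
    expressed as a fraction of the assumed score ceiling."""
    return 1.0 - max(scores) / max_score


def adoption_breadth(num_reporting, num_models):
    """Hypothetical proxy for one part of Impact: share of surveyed
    models that report a score on this benchmark."""
    return num_reporting / num_models


if __name__ == "__main__":
    # Made-up scores for four models on one benchmark (not real data).
    toy_scores = [62.0, 62.5, 70.0, 81.0]
    print("discrimination:", discrimination(toy_scores))
    print("anti-saturation:", anti_saturation(toy_scores))
    print("adoption breadth:", adoption_breadth(45, 90))
```

Under these assumptions, a benchmark on which most models land within the noise margin of each other scores low on discrimination, and one whose best score is near the ceiling scores low on anti-saturation. How the paper actually measures and combines the axes is not stated in this excerpt.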
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Computational and Text Analysis Methods
