BenchScope: How Many Independent Signals Does Your Benchmark Provide?
Tommy Sha, Stella Zhao

TL;DR
This paper introduces Effective Dimensionality (ED), a diagnostic tool to assess the independence and measurement breadth of AI benchmarks, revealing redundancy and guiding benchmark design.
Contribution
It proposes ED as a fast, population-conditional upper-bound diagnostic for benchmark scores, with a practical workflow and reference atlas for maintainers.
Findings
ED reveals substantial redundancy in benchmarks
The Open LLM Leaderboard behaves like roughly two measurement axes
Measurement breadth varies more than 20x across benchmarks
Abstract
AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96, stable across seven subpopulations), and measurement breadth varies more than 20x across current benchmarks. We show that relative ED rankings are stable under matched-dimension controls and that ED can flag redundant suite components, monitor performance-conditional compression, and guide…
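The abstract describes ED as the participation ratio of a centered benchmark-score spectrum. The sketch below illustrates that idea under stated assumptions: it uses the standard participation ratio, (sum of eigenvalues)^2 divided by the sum of squared eigenvalues, and assumes a score matrix with one row per model and one column per item or sub-score. The function name `effective_dimensionality` and the matrix layout are illustrative choices, not taken from the paper, whose exact preprocessing and normalization may differ.

```python
import numpy as np

def effective_dimensionality(scores):
    """Participation ratio of the spectrum of a column-centered score matrix.

    scores: array-like of shape (n_models, n_columns). The layout is an
    assumption made for this sketch, not the paper's stated convention.
    """
    X = np.asarray(scores, dtype=float)
    # Center each column so the spectrum reflects variation across models.
    X = X - X.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the covariance eigenvalues;
    # the proportionality constant cancels in the ratio below.
    singular_values = np.linalg.svd(X, compute_uv=False)
    eigenvalues = singular_values ** 2
    total = eigenvalues.sum()
    if total == 0:
        return 0.0  # no variation across models at all
    return float(total ** 2 / (eigenvalues ** 2).sum())

# One shared ability factor drives every column: ED is close to 1.
ability = np.array([0.2, 0.5, 0.7, 0.9])
difficulty = np.array([1.0, 0.8, 0.6])
one_factor = np.outer(ability, difficulty)
ed_one = effective_dimensionality(one_factor)

# Two uncorrelated, equal-variance columns: ED is close to 2.
two_factor = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
ed_two = effective_dimensionality(two_factor)
```

Under this formula ED ranges from 1 (all variance on one axis) up to the number of columns (variance spread evenly), which is consistent with a six-score suite reporting an ED near 2 when its scores are strongly correlated.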
