BenchScope: How Many Independent Signals Does Your Benchmark Provide?

Tommy Sha; Stella Zhao

arXiv:2603.29357 · cs.AI · April 1, 2026

TL;DR

This paper introduces Effective Dimensionality (ED), a diagnostic tool to assess the independence and measurement breadth of AI benchmarks, revealing redundancy and guiding benchmark design.

Contribution

The paper proposes ED as a fast, population-conditional upper-bound diagnostic of a benchmark's measurement breadth, together with a practical workflow and a reference atlas for benchmark maintainers.

Findings

01 ED reveals substantial redundancy in benchmarks

02 The Open LLM Leaderboard behaves like roughly two measurement axes

03 Measurement breadth varies more than 20x across benchmarks

Abstract

AI evaluation suites often report many scores without checking whether those scores carry independent information. We introduce Effective Dimensionality (ED), the participation ratio of a centered benchmark-score spectrum, as a fast, population-conditional upper-bound diagnostic of measurement breadth. Applied at per-instance granularity to 22 benchmarks across 8 domains and more than 8,400 model evaluations, ED reveals substantial redundancy: the six-score Open LLM Leaderboard behaves like roughly two effective measurement axes (ED = 1.7), BBH and MMLU-Pro are near-interchangeable (rho = 0.96, stable across seven subpopulations), and measurement breadth varies more than 20x across current benchmarks. We show that relative ED rankings are stable under matched-dimension controls and that ED can flag redundant suite components, monitor performance-conditional compression, and guide…
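
The abstract defines ED as the participation ratio of a centered benchmark-score spectrum. A minimal sketch of that quantity follows, using the standard participation-ratio formula (sum of eigenvalues squared, divided by the sum of squared eigenvalues). The function name `effective_dimensionality`, the matrix layout (rows are models, columns are scores or instances), and the use of unnormalized covariance eigenvalues are assumptions made for illustration; the paper's exact preprocessing may differ.

```python
import numpy as np


def effective_dimensionality(scores):
    """Participation ratio of the spectrum of a column-centered score matrix.

    Assumed layout (not confirmed by the source): one row per evaluated
    model, one column per score or benchmark instance.
    """
    scores = np.asarray(scores, dtype=float)
    # Center each column so the spectrum reflects variation across models.
    centered = scores - scores.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the covariance eigenvalues;
    # the proportionality constant cancels in the ratio below.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values ** 2
    total = eigenvalues.sum()
    if total == 0.0:
        raise ValueError("scores have no variance across models")
    # Participation ratio: (sum of eigenvalues)^2 / sum of squared eigenvalues.
    return float(total ** 2 / np.sum(eigenvalues ** 2))


# Three columns that are scalar multiples of one another carry one signal.
redundant = np.outer(np.arange(10.0), [1.0, 2.0, 3.0])

# Two uncorrelated, equal-variance columns carry two signals.
independent = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])

ed_redundant = effective_dimensionality(redundant)      # close to 1
ed_independent = effective_dimensionality(independent)  # close to 2
```

The ratio ranges from 1, when all variance lies along a single axis, up to the number of columns, when variance is spread evenly over uncorrelated axes. Because the spectrum is computed from whichever models are included, the value depends on the evaluated population, which is consistent with the abstract's description of ED as population-conditional.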
