The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains
Jung Min Kang

TL;DR
This paper shows that simple averaging in evaluations can be misleading under data sparsity and difficulty gaps, but Item Response Theory (IRT) can recover true rankings across various domains.
Contribution
It demonstrates that IRT models outperform simple averaging in sparse, heterogeneous evaluation matrices, providing more reliable rankings across multiple AI and safety domains.
Findings
The rank correlation between simple-average rankings and ground truth drops significantly as sparsity and difficulty heterogeneity increase.
IRT maintains high correlation (≥0.996) regardless of sparsity and difficulty gaps.
The evaluation failure surface depends on the interaction between sparsity and difficulty gap.
Abstract
Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that the Spearman rank correlation between simple-average rankings and ground-truth rankings degrades as coverage falls from 100% to 67% under high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains a correlation of at least 0.996 across all conditions. A 150-condition grid sweep over sparsity and difficulty gap $D \in [0.5,…
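The comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's code. It assumes a toy setup: 20 systems, 60 items split evenly into an easy and a hard group separated by a gap, cells missing at random, and a 2PL model fitted by plain gradient ascent with a small L2 penalty. All function names, parameter names, and default values (`simulate`, `fit_2pl`, `gap`, `coverage`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL model: probability that ability theta solves an item (a, b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def spearman(x, y):
    """Spearman correlation via ranks (ties broken arbitrarily, a simplification)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

def fit_2pl(X, mask, steps=3000, lr=0.01, l2=0.1):
    """Joint penalised maximum likelihood by gradient ascent (assumed fitting method)."""
    n, m = X.shape
    theta, b, log_a = np.zeros(n), np.zeros(m), np.zeros(m)
    for _ in range(steps):
        a = np.exp(log_a)                      # keeps discrimination positive
        p = p_correct(theta[:, None], a, b)
        r = (X - p) * mask                     # residuals on observed cells only
        g_theta = (r * a).sum(axis=1) - l2 * theta
        g_b = -(r * a).sum(axis=0) - l2 * b
        g_log_a = (r * (theta[:, None] - b)).sum(axis=0) * a - l2 * log_a
        theta += lr * g_theta
        b += lr * g_b
        log_a = np.clip(log_a + lr * g_log_a, -1.0, 1.0)  # clipped for stability
    return theta

def simulate(n_systems=20, n_items=60, coverage=0.67, gap=3.0, seed=0):
    """Compare simple averaging and 2PL ability estimates against true abilities."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_systems)                       # ground-truth abilities
    b = np.where(np.arange(n_items) % 2 == 0, -gap / 2, gap / 2)
    a = np.ones(n_items)
    probs = p_correct(theta[:, None], a, b)
    X = (rng.random((n_systems, n_items)) < probs).astype(float)
    mask = (rng.random((n_systems, n_items)) < coverage).astype(float)
    avg = (X * mask).sum(axis=1) / mask.sum(axis=1)          # mean over observed cells
    theta_hat = fit_2pl(X, mask)
    return {"avg": spearman(avg, theta), "irt": spearman(theta_hat, theta)}
```

Calling `simulate()` returns the two rank correlations for one seed. Varying `coverage` and `gap` gives a rough, small-scale analogue of the grid sweep; the values this toy produces should not be read as reproducing the paper's reported numbers.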
