The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

Jung Min Kang

arXiv:2605.11205 · cs.LG · May 13, 2026

TL;DR

This paper shows that simple averaging produces misleading benchmark rankings when the evaluation matrix is sparse and items differ substantially in difficulty, and that an Item Response Theory (IRT) model recovers the ground-truth rankings in controlled simulations across four domains.

Contribution

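It demonstrates that a standard two-parameter logistic (2PL) IRT model outperforms simple averaging on sparse, heterogeneous evaluation matrices, yielding more reliable rankings in simulations spanning NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity.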

Findings

01

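The Spearman rank correlation between simple-average and ground-truth rankings falls from ρ = 1.000 at 100% coverage to ρ = 0.809 at 67% coverage with high difficulty heterogeneity (mean over 20 seeds).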

02

A standard 2PL IRT model maintains ρ ≥ 0.996 across all tested sparsity and difficulty-gap conditions.

03

The evaluation failure surface depends on the interaction between sparsity and difficulty gap.
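
For reference, a widely used IRT model, the two-parameter logistic (2PL), is conventionally written as below. The notation is an assumption made for illustration and is not taken from the paper: $\theta_i$ is the latent ability of system $i$, $b_j$ the difficulty of item $j$, $a_j$ its discrimination, and $X_{ij} \in \{0, 1\}$ the observed outcome.

```latex
\[
  P(X_{ij} = 1 \mid \theta_i, a_j, b_j)
  = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
\]
```

Because $b_j$ absorbs item difficulty, a system scored only on hard items is not penalized in $\theta_i$ the way it is in a raw average, which is the intuition behind the sparsity and difficulty-gap interaction.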

Abstract

Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that Spearman rank correlation $ρ$ between simple-average rankings and ground-truth rankings degrades from $ρ = 1.000$ at 100% coverage to $ρ = 0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $ρ \geq 0.996$ across all conditions. A 150-condition grid sweep over sparsity $S \in [0, 0.70]$ and difficulty gap $D \in [0.5,…
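
The comparison described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's experimental code: the function names (`simulate`, `simple_average`, `fit_2pl`, `spearman`), the two-block difficulty layout, the binary outcomes, the random missingness pattern, and all parameter values are hypothetical. The 2PL fit uses a simple penalized joint-likelihood gradient ascent chosen for brevity, which may differ from the estimation procedure the author used, so the printed correlations should not be expected to match the reported numbers.

```python
import numpy as np

def sigmoid(z):
    # Clip the logit to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def avg_ranks(x):
    # Ranks starting at 1, with ties replaced by their average rank.
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size)
    ranks[order] = np.arange(1, x.size + 1)
    for v in np.unique(x):
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    # Spearman rho is the Pearson correlation of the ranks.
    return float(np.corrcoef(avg_ranks(x), avg_ranks(y))[0, 1])

def simulate(n_models=30, n_items=40, sparsity=0.5, gap=3.0, seed=0):
    # Assumed setup: an easy and a hard item block separated by `gap`.
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_models)            # true abilities
    b = np.where(np.arange(n_items) < n_items // 2, -gap / 2, gap / 2)
    b = b + rng.normal(0.0, 0.3, n_items)             # item difficulties
    a = rng.uniform(0.8, 2.0, n_items)                # discriminations
    p = sigmoid(a * (theta[:, None] - b))
    x = (rng.random((n_models, n_items)) < p).astype(float)
    mask = rng.random((n_models, n_items)) >= sparsity  # True = observed
    # Guarantee at least one observed item per model.
    mask[np.arange(n_models), rng.integers(n_items, size=n_models)] = True
    return theta, x, mask

def simple_average(x, mask):
    # Mean outcome over the observed items of each model.
    return (x * mask).sum(axis=1) / mask.sum(axis=1)

def fit_2pl(x, mask, n_iter=800, lr=0.3):
    # Penalized joint fit of theta, b and log(a) by gradient ascent,
    # with each gradient scaled by its number of observations.
    n_models, n_items = x.shape
    theta, b, log_a = np.zeros(n_models), np.zeros(n_items), np.zeros(n_items)
    n_row = mask.sum(axis=1)
    n_col = np.maximum(mask.sum(axis=0), 1)
    for _ in range(n_iter):
        a = np.exp(log_a)
        diff = theta[:, None] - b
        resid = mask * (x - sigmoid(a * diff))        # zero where unobserved
        g_theta = (resid * a).sum(axis=1) - theta     # N(0, 1) prior on theta
        g_b = -(resid * a).sum(axis=0) - 0.1 * b      # weak prior on b
        g_log_a = (resid * diff * a).sum(axis=0) - log_a
        theta = theta + lr * g_theta / n_row
        b = b + lr * g_b / n_col
        log_a = np.clip(log_a + lr * g_log_a / n_col, -2.0, 2.0)
    return theta, np.exp(log_a), b

theta, x, mask = simulate()
theta_hat, a_hat, b_hat = fit_2pl(x, mask)
print("rho (simple average):", spearman(simple_average(x, mask), theta))
print("rho (2PL estimate):  ", spearman(theta_hat, theta))
```

Varying `sparsity` and `gap` in `simulate` gives a rough, small-scale analogue of the grid sweep mentioned above; averaging over several seeds reduces the noise in each cell.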
