Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Michael Hardy, Joshua Gilbert, Benjamin Domingue

TL;DR
This paper introduces a scalable, nonparametric method using signed isotonic R^2 to efficiently identify problematic items in large assessment datasets, outperforming traditional diagnostics.
Contribution
The authors propose a novel, model-agnostic scalability coefficient based on interitem isotonic regression for detecting bad items without assuming linearity.
Findings
Signed isotonic R^2 achieves top-tier AUC in ranking bad items.
The method remains robust with small sample sizes and mixed item types.
It is computationally efficient and outperforms classical diagnostics.
Abstract
The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic R^2, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's τ. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the…
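As a rough illustration of the idea in the abstract, the sketch below computes a signed isotonic R^2 for one pair of items and averages the pairwise values into an item-level score. It is a minimal reading of the abstract, not the authors' implementation. The function names (`isotonic_r2`, `signed_isotonic_r2`, `item_scores`), taking the larger of the increasing and decreasing fits as the "maximal" monotone fit, and aggregating by a simple mean are all assumptions made for illustration.

```python
from itertools import combinations


def _pava(values, weights):
    """Weighted pool-adjacent-violators: best nondecreasing fit to `values`."""
    blocks = []  # each block holds [mean, weight, number_of_points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit


def isotonic_r2(x, y, increasing=True):
    """Share of variance in y explained by a monotone function of x."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    if sst == 0:
        return 0.0  # a constant item has no variance to explain
    # Pool responses that share the same x value (ties are common for 0/1 items).
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    keys = sorted(groups, reverse=not increasing)
    means = [sum(groups[k]) / len(groups[k]) for k in keys]
    counts = [len(groups[k]) for k in keys]
    fit = dict(zip(keys, _pava(means, counts)))
    sse = sum((yi - fit[xi]) ** 2 for xi, yi in zip(x, y))
    return 1.0 - sse / sst


def kendall_sign(x, y):
    """Sign of Kendall's tau: concordant minus discordant pairs (O(n^2))."""
    s = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        s += (prod > 0) - (prod < 0)
    return (s > 0) - (s < 0)


def signed_isotonic_r2(x, y):
    """Assumed form: best monotone R^2 of y on x, signed by Kendall's tau."""
    r2 = max(isotonic_r2(x, y, True), isotonic_r2(x, y, False))
    return kendall_sign(x, y) * r2


def item_scores(responses):
    """Mean signed coefficient per item (assumed aggregation rule).

    `responses` is a list of rows, one per respondent, one column per item.
    """
    cols = list(zip(*responses))
    k = len(cols)
    return [
        sum(signed_isotonic_r2(cols[j], cols[i]) for j in range(k) if j != i) / (k - 1)
        for i in range(k)
    ]


# Toy data: items 0 and 1 agree, item 2 is reverse-keyed.
toy = [[0, 0, 1], [0, 0, 1], [1, 1, 0], [1, 1, 0]]
scores = item_scores(toy)
```

On this toy matrix the reverse-keyed third item receives the lowest score, which is the kind of separation the abstract describes. The quadratic-time Kendall sign is used here only for clarity; a production version would likely rely on an optimized routine.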
