Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

Michael Hardy; Joshua Gilbert; Benjamin Domingue

arXiv:2603.24999 · stat.AP · March 30, 2026

TL;DR

This paper introduces a scalable, nonparametric method based on the signed isotonic $R^{2}$ that efficiently identifies problematic items in large assessment datasets and outperforms traditional diagnostics.

Contribution

The authors propose a novel, model-agnostic scalability coefficient based on interitem isotonic regression for detecting bad items without assuming linearity.

Findings

01 Signed isotonic $R^{2}$ achieves top-tier AUC in ranking bad items.

02 The method remains robust with small sample sizes and mixed item types.

03 The method is computationally efficient and outperforms classical diagnostics.

Abstract

The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^{2}$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $\tau$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the…
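As a rough illustration of the idea described in the abstract, the sketch below computes a signed isotonic $R^{2}$ for one pair of items and then averages the pairwise values into an item-level score. It is not the authors' implementation. Several choices are assumptions made here for illustration: the function names (`pava`, `kendall_sign`, `signed_isotonic_r2`, `item_scores`) are hypothetical, the monotone fit direction is taken from the sign of Kendall's $\tau$, tied predictor values are pooled before fitting, and the aggregation rule is a plain mean over all other items. The paper may define each of these differently.

```python
import numpy as np

def pava(values, weights):
    """Weighted pool-adjacent-violators: nondecreasing least-squares fit."""
    blocks = []  # each block holds [mean, total weight, number of pooled levels]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    return np.concatenate([np.full(n, m) for m, _, n in blocks])

def kendall_sign(x, y):
    """Sign of Kendall's tau (concordant minus discordant pairs), O(n^2)."""
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    return float(np.sign(np.sum(dx * dy)))

def signed_isotonic_r2(x, y):
    """Signed R^2 of the best monotone fit of y on x (assumed definition)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)
    if sst == 0:
        return 0.0  # a constant item has no variance to explain
    s = kendall_sign(x, y)
    if s == 0:
        return 0.0  # no direction of association
    # Flip x when tau is negative so a nondecreasing fit in s*x is a
    # nonincreasing fit in x. Tied x values are pooled to their mean y.
    _, inverse, counts = np.unique(s * x, return_inverse=True, return_counts=True)
    means = np.bincount(inverse, weights=y) / counts
    fitted = pava(means, counts)[inverse]
    sse = np.sum((y - fitted) ** 2)
    return s * (1.0 - sse / sst)

def item_scores(responses):
    """Mean signed isotonic R^2 of each item against every other item.

    `responses` is a persons-by-items array. The mean is an assumed
    aggregation rule, chosen only to keep the sketch short.
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    scores = np.empty(n_items)
    for j in range(n_items):
        pairwise = [
            signed_isotonic_r2(responses[:, k], responses[:, j])
            for k in range(n_items)
            if k != j
        ]
        scores[j] = np.mean(pairwise)
    return scores
```

Under these assumptions, calling `item_scores` on a persons-by-items response matrix returns one score per item. An item that runs against the rest of the instrument, such as a miskeyed one, would be expected to receive a negative score, while well-behaved items should score positive. Because the fit is monotone rather than linear, a pair such as `x = [1, 2, 3, 4]` and `y = [1, 4, 9, 16]` is treated as perfectly related.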
