Position: Science of AI Evaluation Requires Item-level Benchmark Data
Han Jiang, Susu Zhang, Xiaoyuan Yi, Xing Xie, Ziang Xiao

TL;DR
This paper advocates for the use of item-level benchmark data to improve the validity and diagnostic power of AI evaluations, especially in high-stakes applications.
Contribution
The paper argues that item-level data is essential for rigorous AI evaluation and presents OpenEval, a repository supporting this approach.
Findings
Item-level analysis reveals validity issues in current benchmarks.
OpenEval provides a resource for community adoption of item-level evaluation.
Granular diagnostics improve understanding of AI system performance.
Abstract
AI evaluations have become the primary evidence for deploying generative AI systems across high-stakes domains. However, current evaluation paradigms often exhibit systemic validity failures. These issues, ranging from unjustified design choices to misaligned metrics, remain intractable without a principled framework for gathering validity evidence and conducting granular diagnostic analysis. In this position paper, we argue that item-level AI benchmark data is essential for establishing a rigorous science of AI evaluation. Item-level analysis enables fine-grained diagnostics and principled validation of benchmarks. We substantiate this position by dissecting current validity failures and revisiting evaluation paradigms across computer science and psychometrics. Through illustrative analyses of item properties and latent constructs, we demonstrate the unique insights afforded by…
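As a rough illustration of what an analysis of item properties can look like, the sketch below computes two classical test theory statistics from a model-by-item matrix of 0/1 scores: item difficulty (the proportion of models that answer correctly) and item discrimination (the correlation between an item and the rest-of-benchmark score). This is a minimal sketch under stated assumptions, not the paper's method or the OpenEval API; the function names, the model names, and the toy scores are all hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_statistics(responses):
    """Compute (difficulty, discrimination) for each benchmark item.

    responses: dict mapping a model name to a list of 0/1 scores,
    one score per item, with items in the same order for every model.
    """
    models = list(responses)
    n_items = len(responses[models[0]])
    totals = {m: sum(responses[m]) for m in models}
    stats = []
    for j in range(n_items):
        item = [responses[m][j] for m in models]
        # Rest score: total with item j removed, so the item is not
        # correlated with itself.
        rest = [totals[m] - responses[m][j] for m in models]
        stats.append((mean(item), pearson(item, rest)))
    return stats


# Hypothetical toy data: four models scored on four items.
responses = {
    "model_a": [1, 1, 1, 0],
    "model_b": [1, 1, 0, 0],
    "model_c": [1, 0, 0, 1],
    "model_d": [1, 0, 0, 0],
}

stats = item_statistics(responses)
for j, (difficulty, discrimination) in enumerate(stats):
    print(f"item {j}: difficulty={difficulty:.2f}, "
          f"discrimination={discrimination:.2f}")
```

In this toy data the first item is answered correctly by every model, so it cannot separate models and its discrimination is undefined, while the last item correlates negatively with the rest score, the kind of pattern that would mark an item for manual review. Neither signal is visible in an aggregate accuracy number, which is the sort of diagnostic that item-level data makes possible.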
