Fantastic Bugs and Where to Find Them in AI Benchmarks
Sang Truong, Yuheng Tu, Michael Hardy, Anka Reuel, Zeyu Tang, Jirayu Burapacheep, Jonathan Perera, Chibuike Uwakwe, Ben Domingue, Nick Haber, Sanmi Koyejo

TL;DR
This paper presents a scalable framework for identifying and revising invalid questions in AI benchmarks using statistical analysis and large language model judgments, improving evaluation reliability.
Contribution
It introduces a systematic method combining statistical analysis and LLM-based review to efficiently flag and correct problematic benchmark questions.
Findings
Achieved up to 84% precision in identifying problematic questions.
Demonstrated effectiveness across nine widely used benchmarks.
Reduced human review effort by using an LLM judge as a first pass.
Abstract
Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation. In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be…
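To make the flagging idea concrete, here is a minimal sketch of one statistic that fits the unidimensional assumption described above. It is an illustration, not the paper's implementation: the excerpt does not say which statistics the authors compute, so the item-rest correlation from classical test theory is swapped in as a stand-in. If a single latent ability drives performance, a correct answer on any one item should correlate positively with the score on the remaining items; a negative value is a reason to send the item to review. The function names, the toy response matrix, and the threshold of 0.0 are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def flag_items(responses, threshold=0.0):
    """Flag items whose item-rest correlation falls below `threshold`.

    responses[m][i] is 1 if model m answered item i correctly, else 0.
    Returns a list of (item_index, correlation) pairs for flagged items.
    The threshold is an assumed cutoff chosen for illustration.
    """
    n_items = len(responses[0])
    flagged = []
    for i in range(n_items):
        item_scores = [row[i] for row in responses]
        # Rest score: total correct on all other items, excluding item i itself.
        rest_scores = [sum(row) - row[i] for row in responses]
        r = pearson(item_scores, rest_scores)
        if r is not None and r < threshold:
            flagged.append((i, r))
    return flagged


# Toy data (hypothetical): rows are models from strongest to weakest.
# Item 3 is answered correctly only by the weaker models, which is the
# pattern a wrong answer key might produce.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

flagged = flag_items(responses)
flagged_indices = [i for i, _ in flagged]
print(flagged_indices)
```

In this toy matrix only item 3 has a negative item-rest correlation, so it is the only one flagged. A flag is a prompt for expert or LLM-judge review, not proof that the item is invalid.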
Taxonomy
Topics: Expert finding and Q&A systems · Mobile Crowdsensing and Crowdsourcing · Meta-analysis and systematic reviews
