BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
Nishant Balepur, Bhavya Rajasekaran, Jane Oh, Michael Xie, Atrey Desai, Vipul Gupta, Steven James Moore, Eunsol Choi, Rachel Rudinger, Jordan Lee Boyd-Graber

TL;DR
BenchMarker is a toolkit inspired by education research that uses LLM judges to identify common flaws in multiple-choice benchmarks, revealing persistent issues that distort model accuracy and benchmark rankings in NLP evaluation.
Contribution
Introduces BenchMarker, an education-inspired toolkit that uses LLM judges to detect contamination, shortcuts, and writing errors in MCQ benchmarks, giving benchmark builders a quality-control check.
Findings
47% of TruthfulQA questions appear verbatim online, indicating contamination.
100% of HellaSwag questions violate multiple writing rules.
Flaws in MCQs can inflate or deflate model accuracy and alter benchmark rankings.
Abstract
Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination: items appearing exactly online; 2) shortcuts: cues in the choices that enable guessing; and 3) writing errors: structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 1) flaws persist in MCQA benchmarks, especially in automatically-made and crowdsourced data (we find that 47% of TruthfulQA appears online and 100% of HellaSwag violates multiple writing rules); 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 3) prior benchmark repairs address their targeted issues (i.e., lowering…
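To make the second finding concrete, the following is a minimal Python sketch of how one might compare a model's accuracy on flagged versus unflagged items once each item carries flaw labels. It is not BenchMarker's implementation: the `Item` class, the label strings such as "contamination", and the `accuracy_gap` helper are hypothetical names introduced here, and the labels are assumed to come from some upstream judge.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One benchmark question plus a model outcome (hypothetical schema)."""
    question_id: str
    correct: bool  # whether the model answered this item correctly
    flaws: set = field(default_factory=set)  # assumed labels, e.g. {"contamination"}


def accuracy(items):
    """Fraction of items answered correctly; NaN when the list is empty."""
    if not items:
        return float("nan")
    return sum(item.correct for item in items) / len(items)


def accuracy_gap(items, flaw):
    """Accuracy on items flagged with `flaw` minus accuracy on the rest.

    A positive gap means flagged items are answered correctly more often
    (consistent with inflation); a negative gap means less often.
    """
    flagged = [item for item in items if flaw in item.flaws]
    clean = [item for item in items if flaw not in item.flaws]
    return accuracy(flagged) - accuracy(clean)


# Toy data, invented purely for illustration.
toy_items = [
    Item("q1", True, {"contamination"}),
    Item("q2", True, {"contamination"}),
    Item("q3", True),
    Item("q4", False),
]
toy_gap = accuracy_gap(toy_items, "contamination")
```

On the toy data, the two flagged items are both answered correctly and one of the two unflagged items is, so `toy_gap` is 0.5. A raw gap like this does not control for item difficulty or sample size, so it is only a first diagnostic rather than evidence of a causal effect.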
