BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Nishant Balepur; Bhavya Rajasekaran; Jane Oh; Michael Xie; Atrey Desai; Vipul Gupta; Steven James Moore; Eunsol Choi; Rachel Rudinger; Jordan Lee Boyd-Graber

arXiv:2602.06221·cs.CL·April 21, 2026

TL;DR

BenchMarker is an education-inspired toolkit that uses LLM judges to flag three common flaws in multiple-choice benchmarks (contamination, shortcuts, and writing errors), revealing persistent issues that can skew model accuracy and rankings.

Contribution

Introduces BenchMarker, an education-inspired toolkit that uses LLM judges to flag contamination, shortcuts, and writing errors in MCQ benchmarks, giving benchmark builders a tool for quality control.

Findings

1. 47% of TruthfulQA questions appear online, indicating contamination.

2. 100% of HellaSwag questions violate multiple writing rules.

3. Flaws in MCQs can inflate or deflate model accuracy and alter benchmark rankings.

Abstract

Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination: items appearing exactly online; 2) shortcuts: cues in the choices that enable guessing; and 3) writing errors: structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 1) flaws persist in MCQA benchmarks, especially in automatically made and crowdsourced data: we find that 47% of TruthfulQA appears online and 100% of HellaSwag violates multiple writing rules; 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 3) prior benchmark repairs address their targeted issues (i.e., lowering…
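To make the first flaw concrete, here is a minimal sketch of an exact-match contamination check. It is not BenchMarker's method (the abstract says the toolkit uses LLM judges); it only illustrates what "items appearing exactly online" could mean in practice. The function names `normalize`, `is_contaminated`, and `contamination_rate` are hypothetical, and the sketch assumes the candidate web documents have already been retrieved as plain strings.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse every run of
    characters outside a-z and 0-9 into a single space."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def is_contaminated(question: str, documents: list[str]) -> bool:
    """Return True if the normalized question occurs verbatim, on word
    boundaries, inside any normalized document.

    Assumption: `documents` is a list of already-retrieved page texts;
    how they are retrieved is outside the scope of this sketch.
    """
    q = normalize(question)
    if not q:
        return False
    # Pad both sides with spaces so that matches respect word boundaries.
    needle = f" {q} "
    return any(needle in f" {normalize(doc)} " for doc in documents)


def contamination_rate(questions: list[str], documents: list[str]) -> float:
    """Fraction of questions flagged as contaminated (0.0 for an empty list)."""
    if not questions:
        return 0.0
    flagged = sum(is_contaminated(q, documents) for q in questions)
    return flagged / len(questions)
```

Because the normalization keeps only ASCII letters and digits, this sketch suits English-language items; a real audit would also need a retrieval step and a policy for near-duplicates, neither of which is shown here.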
