Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber

TL;DR
This paper critiques multiple choice question answering (MCQA) as a format for LLM evaluation and proposes reforms and alternative generative formats that better assess knowledge and capabilities.
Contribution
It identifies flaws in MCQA, advocates for generative testing formats, and offers education-inspired fixes to improve LLM evaluation methods.
Findings
MCQA struggles to test generation and subjectivity, to match LLM use cases, and to fully test knowledge
Proposed fixes include rubrics to guide MCQ writing, scoring methods to curb guessing, and Item Response Theory to build harder MCQs
Generative formats better capture LLM capabilities and user needs
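Two of the fixes listed above, scoring methods that discourage guessing and Item Response Theory, can be made concrete with a small sketch. This summary does not say which scoring rule or IRT model the paper uses, so the classical correction-for-guessing formula and the two-parameter logistic (2PL) model below are textbook examples chosen as assumptions, and the function names are hypothetical.

```python
import math


def formula_score(num_right, num_wrong, num_choices):
    """Classical correction for guessing (assumed example, not the paper's method).

    Each wrong answer costs 1 / (k - 1) points, so a test-taker who guesses
    uniformly at random among k choices scores zero in expectation.
    Skipped questions count as neither right nor wrong.
    """
    if num_choices < 2:
        raise ValueError("a multiple choice question needs at least two choices")
    return num_right - num_wrong / (num_choices - 1)


def irt_2pl(ability, discrimination, difficulty):
    """Two-parameter logistic IRT model (assumed example).

    Returns the probability that a test-taker with the given ability answers
    an item correctly. `difficulty` shifts the curve; `discrimination`
    controls how sharply the item separates weak from strong test-takers.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


if __name__ == "__main__":
    # A random guesser on 100 four-choice questions expects 25 right, 75 wrong.
    print(formula_score(25, 75, 4))
    # The same model faces an easy item and a hard item.
    print(irt_2pl(1.0, 1.5, 0.0), irt_2pl(1.0, 1.5, 2.0))
```

Under the first rule, random guessing earns no credit on average, which is the sense in which a scoring method can curb guessing. Under the second, items with a fitted difficulty well above the ability of current models are the ones that keep a benchmark from saturating.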
Abstract
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA, robustness, biases, and unfaithful explanations, showing how our prior solutions better measure…
Taxonomy
Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law · Occupational and Professional Licensing Regulation
