Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Nishant Balepur; Rachel Rudinger; Jordan Lee Boyd-Graber

arXiv:2502.14127 · cs.CL · June 3, 2025

TL;DR

This paper critiques multiple choice question answering (MCQA) as a format for LLM evaluation and proposes both reforms to MCQA and alternative generative formats that better assess knowledge and capabilities.

Contribution

The paper identifies flaws in MCQA's format and datasets, advocates for generative testing formats, and offers education-inspired fixes to improve LLM evaluation.

Findings

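01 MCQA struggles to test generation and subjectivity, to match LLM use cases, and to fully test knowledge

02 Proposed fixes from education include rubrics to guide MCQ writing, scoring methods that curb guessing, and Item Response Theory to build harder MCQs

03 Generative formats, where LLMs construct and explain answers, better capture user needs and knowledge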

Abstract

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA, robustness, biases, and unfaithful explanations, showing how our prior solutions better measure…
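
The abstract names "scoring methods to bridle guessing" and Item Response Theory without specifying which variants the paper uses. As a rough illustration only, the sketch below uses two classical choices from educational testing, both assumptions rather than the authors' method: formula scoring (right minus wrong/(k-1), with abstentions scored as zero) and the three-parameter logistic (3PL) IRT model. The function names `formula_score` and `p_correct_3pl` are hypothetical.

```python
import math


def formula_score(responses, answer_key, num_options=4):
    """Classical correction-for-guessing score (an assumed scoring rule).

    Each correct answer earns 1 point, each wrong answer costs
    1 / (num_options - 1) points, and an abstention (None) earns 0.
    """
    right = 0
    wrong = 0
    for response, answer in zip(responses, answer_key):
        if response is None:
            continue  # abstaining is neither rewarded nor penalized
        if response == answer:
            right += 1
        else:
            wrong += 1
    return right - wrong / (num_options - 1)


def p_correct_3pl(theta, a, b, c):
    """Probability of a correct answer under the 3PL IRT model.

    theta: ability of the test taker (here, a model)
    a: item discrimination, b: item difficulty, c: guessing floor
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Two right, one wrong, one abstention on four-option items.
    print(formula_score(["A", "B", None, "D"], ["A", "C", "B", "D"]))
    # An item whose difficulty equals the test taker's ability.
    print(p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.25))
```

Under this scoring rule, uniform random guessing on k-option items has an expected score of zero, which is what removes the incentive to guess. In the 3PL model, a larger difficulty `b` lowers the probability of a correct answer at a fixed ability, which is the sense in which IRT parameters can help identify harder items.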

Taxonomy

Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law · Occupational and Professional Licensing Regulation