Reasoning Models are Test Exploiters: Rethinking Multiple-Choice
Narun Raman, Taylor Lundy, Kevin Leyton-Brown

TL;DR
This paper critically examines how multiple-choice question-answering (MCQA) tests influence the perceived reasoning abilities of large language models, revealing that models often exploit options rather than genuinely reasoning.
Contribution
It systematically evaluates 15 benchmarks and 27 models, showing how MCQA can misrepresent models' reasoning skills and providing guidelines for more accurate assessment.
Findings
Models score higher on MCQA when they are allowed to reason after seeing the options.
Models raise their scores by exploiting the answer choices rather than through genuine reasoning.
Guidelines are proposed for better evaluation of reasoning capabilities.
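One way to picture the first two findings is to compare a model's accuracy on the same questions under two conditions: options shown before reasoning, and options withheld. The sketch below is illustrative only and is not the paper's evaluation code. The `Result` record, the `option_gap` helper, and the four toy rows are all hypothetical; they stand in for whatever per-question grading a real harness would produce.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Result:
    """Hypothetical per-question record; field names are assumptions."""
    question_id: str
    correct_with_options: bool  # graded answer when choices were shown before reasoning
    correct_free_text: bool     # graded answer when no choices were shown


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def option_gap(results: List[Result]) -> float:
    """Accuracy with options shown minus accuracy with options withheld."""
    with_options = accuracy(r.correct_with_options for r in results)
    free_text = accuracy(r.correct_free_text for r in results)
    return with_options - free_text


# Made-up toy data for illustration, not results from the paper.
results = [
    Result("q1", True, True),
    Result("q2", True, False),
    Result("q3", True, False),
    Result("q4", False, False),
]

print(f"gap = {option_gap(results):.2f}")
```

A large positive gap on a benchmark would be consistent with the model leaning on the answer choices, although a gap alone cannot separate exploitation from the ordinary difficulty of grading free-text answers.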
Abstract
When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K) and 27 different LLMs (including small models such as Qwen-2.5 7B, mid-sized models such as Llama-3.3 70B, and large state-of-the-art models such as OpenAI's o3).…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
