Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Narun Raman; Taylor Lundy; Kevin Leyton-Brown

arXiv:2507.15337 · cs.CL · October 3, 2025

Open Access

TL;DR

This paper examines how multiple-choice question-answering (MCQA) benchmarks shape the perceived reasoning ability of large language models, and finds that models often exploit the answer options rather than reason their way to an answer.

Contribution

The paper systematically evaluates 27 models on 15 benchmarks, shows how MCQA can misrepresent models' reasoning skills, and provides guidelines for more accurate assessment.

Findings

01. Models score higher on MCQA when they are allowed to reason after seeing the answer options.

02. Models improve their scores by exploiting the answer choices rather than through genuine reasoning.

03. The authors propose guidelines for evaluating reasoning capabilities more accurately.
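
The contrast behind the first two findings can be illustrated with a minimal sketch. It is not the authors' evaluation harness: the prompt wording, the `model` callable (any function mapping a prompt string to a response string), the item dictionary keys, and all function names are assumptions made for illustration. One condition lets the model reason with the options visible; the other has it reason on the question alone and only afterwards map its answer onto an option.

```python
from typing import Callable, Dict, Sequence

# Hypothetical model interface: prompt in, response text out.
Model = Callable[[str], str]

LETTERS = "ABCDEFGHIJ"


def format_options(options: Sequence[str]) -> str:
    """Render options as lettered lines: 'A. ...', 'B. ...'."""
    return "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))


def prompt_with_options(question: str, options: Sequence[str]) -> str:
    """Condition 1: the model sees the options before it reasons."""
    return (
        f"Question: {question}\n{format_options(options)}\n"
        "Reason step by step, then give one letter."
    )


def prompt_without_options(question: str) -> str:
    """Condition 2, stage 1: the model reasons on the question alone."""
    return f"Question: {question}\nReason step by step, then state your answer."


def prompt_select(question: str, options: Sequence[str], free_answer: str) -> str:
    """Condition 2, stage 2: options appear only after reasoning is done."""
    return (
        f"Question: {question}\nYour earlier answer: {free_answer}\n"
        f"{format_options(options)}\n"
        "Without further reasoning, give the letter matching your earlier answer."
    )


def run_options_visible(model: Model, item: Dict) -> str:
    return model(prompt_with_options(item["question"], item["options"]))


def run_reason_first(model: Model, item: Dict) -> str:
    free_answer = model(prompt_without_options(item["question"]))
    return model(prompt_select(item["question"], item["options"], free_answer))


def accuracy(run: Callable[[Model, Dict], str], model: Model, items: Sequence[Dict]) -> float:
    """Fraction of items where the returned letter equals item['answer']."""
    correct = sum(run(model, it).strip().upper() == it["answer"] for it in items)
    return correct / len(items)
```

Comparing `accuracy(run_options_visible, model, items)` with `accuracy(run_reason_first, model, items)` on the same items gives a rough measure of how much a model gains from seeing the options while it reasons. A real harness would also need robust answer parsing, which this sketch omits.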

Abstract

When evaluating Large Language Models (LLMs) in question answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K) and 27 different LLMs (including small models such as Qwen-2.5 7B, mid-sized models such as Llama-3.3 70B, and large state-of-the-art models such as OpenAI's o3).…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications