Choices Speak Louder than Questions
Gyeongje Cho, Yeonkyoung So, Jaejin Lee

TL;DR
This paper investigates the limitations of current multiple-choice question answering (MCQA) evaluation methods, introduces the NPSQ scoring method to better assess true comprehension, and demonstrates its robustness to superficial features of the answer choices.
Contribution
It proposes a novel scoring method, NPSQ, that isolates question impact and improves the reliability of MCQA evaluation over traditional approaches.
Findings
Traditional scoring methods are sensitive to superficial answer features.
NPSQ remains stable despite modifications to answer options.
Choice sensitivity affects the assessment of language model comprehension.
Abstract
Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of choice sensitivity, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called Normalized Probability Shift by the Question (NPSQ), designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods, such as those based on log-likelihood or its length-normalized variant, are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable…
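To make the scoring methods named in the abstract concrete, here is a minimal Python sketch. It assumes per-token log-probabilities for each answer choice are already available from a language model. The first two functions are the standard log-likelihood and length-normalized scores. The third, `illustrative_question_shift`, is a hypothetical stand-in and not the paper's NPSQ definition, which this summary does not reproduce: it only illustrates the general idea of comparing normalized choice probabilities with and without the question. All numbers are made-up toy values.

```python
import math
from typing import List


def log_likelihood(token_logprobs: List[float]) -> float:
    """Sum of token log-probabilities for one answer choice."""
    return sum(token_logprobs)


def length_normalized(token_logprobs: List[float]) -> float:
    """Average log-probability per token for one answer choice."""
    return sum(token_logprobs) / len(token_logprobs)


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into a probability distribution over choices."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def illustrative_question_shift(with_q: List[float], without_q: List[float]) -> List[float]:
    """HYPOTHETICAL score, not the paper's NPSQ formula.

    Returns, per choice, the change in normalized choice probability
    when the question is added to the prompt.
    """
    p_with = softmax(with_q)
    p_without = softmax(without_q)
    return [a - b for a, b in zip(p_with, p_without)]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy token log-probabilities (assumed values, not real model output).
# Choice 0 is a four-token answer, choice 1 is a one-token answer.
choices_with_q = [[-0.5, -0.5, -0.5, -0.5], [-1.5]]
choices_without_q = [[-0.6, -0.6, -0.6, -0.6], [-1.4]]

ll_scores = [log_likelihood(c) for c in choices_with_q]
ln_scores = [length_normalized(c) for c in choices_with_q]
shift = illustrative_question_shift(
    ll_scores, [log_likelihood(c) for c in choices_without_q]
)

print("log-likelihood picks choice", argmax(ll_scores))
print("length-normalized picks choice", argmax(ln_scores))
print("question shift per choice", shift)
```

In this toy case the two traditional scores disagree purely because of answer length, which is the kind of superficial feature the abstract refers to. The shift score instead asks which choice gains probability once the question is present.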
Peer Reviews
Decision: ICLR 2026 Poster
I liked this paper. I think it's well-written, addresses an interesting issue, primes other researchers for future work in this area, and makes non-obvious contributions to the literature.
* Understanding choice sensitivity seems like an important issue in benchmark design.
* The method of calculating choice sensitivity is natural and intuitive.
* The empirical results convincingly show a wide variety of surprising behavior of LLMs wrt choice sensitivity.
* The authors (generally) do a great
I think the biggest weakness of the paper lies in the presentation. I think Section 3 was beautiful. It flowed naturally, the experiments were clean, and the authors did a great job of pulling out crisp conclusions. Section 4 was fine; it introduces NPSQ. It would have been nice to connect the mathematical notation in Section 3 to that in Section 4 a bit more directly -- it was VERY unclear what the "score" function in Section 3 was -- and since it was mentioned that "log p(x|q,c)" was part o
1. The paper identifies and formalizes a pervasive evaluation artifact, choice sensitivity, and gives a principled, testable metric to mitigate it. 2. The core construct (question-conditioned vs. question-ablated likelihood shift with normalization) is simple, auditable, and easy to slot into existing LM-eval pipelines.
1. The normalization in NPSQ is not stress-tested against plausible alternatives (e.g., z-scores, temperature scaling, ECE), leaving ranking stability under-substantiated. 2. Key stability claims (flip rates, adversarial drops) lack uncertainty quantification and significance testing, weakening the statistical support for the conclusions. 3. The metric relies on token-level probabilities and a hand-crafted “no-question” template whose wording or API backend may change outcomes, reducing reprod
The paper addresses a clear and increasingly important issue in LLM evaluation. As models achieve high scores on benchmarks, it is crucial to understand if this reflects true comprehension or artifact exploitation. The proposed NPSQ metric is intuitive, well-motivated, and directly targets the identified problem by quantifying the "value" of the question. The use of adversarial choices provides a very clear and convincing demonstration of the weaknesses of existing metrics and the robustness o
While valuable, the contribution is an incremental improvement in evaluation methodology rather than a new task or model, and it offers no fundamental insight into model reasoning. The analysis of why models exhibit this sensitivity is not deeply explored, though this is not the primary focus. The experiments are solid but could be extended to a wider range of model architectures and benchmarks.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Expert finding and Q&A systems
