Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident, Especially When They are Wrong
Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego

TL;DR
This paper investigates how reasoning before answering affects LLM confidence in multiple choice question (MCQ) tests. It finds that reasoning inflates confidence, especially when the selected answer is wrong, and degrades calibration.
Contribution
It demonstrates that reasoning increases LLM confidence levels and worsens calibration, highlighting the need for cautious interpretation of probabilities in MCQ evaluations.
Findings
Models are more confident when reasoning before answering.
Confidence increase is larger when the answer is incorrect.
Chain-of-Thought prompting degrades calibration: stated confidence tracks actual accuracy less closely.
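To make "calibration" concrete: a model is well calibrated when, among answers given with confidence around p, roughly a fraction p are correct. One common way to quantify the gap is expected calibration error (ECE). The sketch below is illustrative only. This summary does not state which calibration metric the paper uses, and the function name, the bin count, and the equal-width binning are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins.

    confidences: probabilities in [0, 1] assigned to the selected answers.
    correct:     booleans, True where the selected answer was right.
    n_bins:      number of equal-width bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions by confidence; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes its confidence/accuracy gap, weighted by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Under this measure, a model that becomes more confident without becoming more accurate, and especially one that becomes more confident on wrong answers, gets a higher (worse) score.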
Abstract
Multiple Choice Question (MCQ) tests are among the most used methods for evaluating large language models (LLMs). Besides checking the correctness of the selected answer, evaluations often consider the model's confidence through the probability assigned to its response. In this work, we investigate how LLM confidence is influenced by the answering approach when the model answers directly or reasons before responding. Experiments on a general knowledge benchmark, covering 57 subjects and seven LLMs, show that models are systematically more confident when providing reasoning before answering, and that this confidence increase is larger when the selected answer is incorrect than when it is correct. We hypothesize that the reasoning process alters token probabilities, as the final answer prediction depends jointly on the question and the model's self-generated reasoning, leading to inflated…
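The abstract describes confidence as the probability the model assigns to its selected answer. A minimal sketch of that idea follows, assuming access to the model's logits for the option tokens (for example A to D) and normalizing over those options only. Whether the paper normalizes over the options or over the full vocabulary is not stated here, and all function names and the record format are hypothetical.

```python
import math


def option_probabilities(logits):
    """Softmax over the logits of the answer-option tokens (assumed input format)."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {option: math.exp(value - peak) for option, value in logits.items()}
    total = sum(exps.values())
    return {option: e / total for option, e in exps.items()}


def answer_and_confidence(logits):
    """Return the most probable option and the probability assigned to it."""
    probs = option_probabilities(logits)
    choice = max(probs, key=probs.get)
    return choice, probs[choice]


def mean_confidence_by_correctness(records):
    """records: iterable of (option_logits, gold_option) pairs.

    Returns the mean confidence separately for correct and incorrect answers;
    a group with no members maps to None.
    """
    groups = {"correct": [], "incorrect": []}
    for logits, gold in records:
        choice, confidence = answer_and_confidence(logits)
        groups["correct" if choice == gold else "incorrect"].append(confidence)
    return {
        name: (sum(values) / len(values) if values else None)
        for name, values in groups.items()
    }
```

Running `mean_confidence_by_correctness` once on logits collected with direct answering and once on logits collected after the model has generated its reasoning would allow the comparison the abstract describes: the change in confidence for correct versus incorrect answers.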
