Confidence in the Reasoning of Large Language Models
Yudi Pawitan, Chris Holmes

TL;DR
This paper investigates how confident large language models are in their own answers. It finds that they tend to overstate their confidence and lack an internally coherent sense of certainty, even though they perform better than random guessing.
Contribution
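It introduces a novel evaluation of LLM confidence along two lines: qualitatively, by whether a model persists with its answer when prompted to reconsider, and quantitatively, by a self-reported confidence score. The results highlight limitations in the models' ability to assess themselves.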
Findings
LLMs perform better than random guessing on reasoning tasks.
There is a positive correlation between qualitative confidence and accuracy.
LLMs tend to overstate their confidence scores.
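One common way to check a claim such as "better than random guessing" is an exact one-sided binomial test against the chance rate. The sketch below is only an illustration of that idea: the function name `binomial_tail`, the chance rate of 0.5 (a yes/no question format is assumed), and the counts are all hypothetical and are not taken from the paper, whose own statistical procedure may differ.

```python
from math import comb


def binomial_tail(successes: int, trials: int, chance: float) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts for illustration only (not results from the paper):
# 70 correct answers out of 100 two-choice questions, chance rate 0.5.
p_value = binomial_tail(70, 100, 0.5)
print(f"one-sided p-value against random guessing: {p_value:.2e}")
```

A small p-value here would indicate that the observed accuracy is unlikely under pure guessing; it says nothing about whether the model's confidence is well calibrated.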
Abstract
There is a growing literature on reasoning by large language models (LLMs), but the discussion on the uncertainty in their responses is still lacking. Our aim is to assess the extent of confidence that LLMs have in their answers and how it correlates with accuracy. Confidence is measured (i) qualitatively in terms of persistence in keeping their answer when prompted to reconsider, and (ii) quantitatively in terms of self-reported confidence score. We investigate the performance of three LLMs -- GPT4o, GPT4-turbo and Mistral -- on two benchmark sets of questions on causal judgement and formal fallacies and a set of probability and statistical puzzles and paradoxes. Although the LLMs show significantly better performance than random guessing, there is a wide variability in their tendency to change their initial answers. There is a positive correlation between qualitative confidence and…
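The two confidence measures described in the abstract can be summarized with a few simple statistics. The following is a minimal sketch under stated assumptions, not the authors' analysis: the `Trial` record, the `summarize` helper, the rescaling of self-reported scores to [0, 1], and the five data points are all hypothetical. It reports how often answers are kept, whether kept answers are more often correct than changed ones, and the gap between mean stated confidence and accuracy.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    initial_correct: bool      # was the first answer correct?
    kept_answer: bool          # did the model keep it when asked to reconsider?
    stated_confidence: float   # self-reported score, assumed rescaled to [0, 1]


def summarize(trials):
    """Summarize qualitative (persistence) and quantitative (stated) confidence."""
    n = len(trials)

    def accuracy(group):
        # Share of initially correct answers; NaN for an empty group.
        return sum(t.initial_correct for t in group) / len(group) if group else float("nan")

    kept = [t for t in trials if t.kept_answer]
    changed = [t for t in trials if not t.kept_answer]
    mean_confidence = sum(t.stated_confidence for t in trials) / n
    return {
        "accuracy": accuracy(trials),
        "keep_rate": len(kept) / n,
        "accuracy_when_kept": accuracy(kept),
        "accuracy_when_changed": accuracy(changed),
        # Positive values mean stated confidence exceeds actual accuracy.
        "overconfidence_gap": mean_confidence - accuracy(trials),
    }


# Made-up records for illustration only (not data from the paper).
trials = [
    Trial(True, True, 0.9),
    Trial(True, True, 1.0),
    Trial(False, False, 0.9),
    Trial(True, False, 0.8),
    Trial(False, True, 0.9),
]
summary = summarize(trials)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy data, higher accuracy among kept answers than among changed ones would correspond to a positive association between qualitative confidence and accuracy, and a positive `overconfidence_gap` would correspond to overstated confidence scores.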
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
