Confidence in the Reasoning of Large Language Models

Yudi Pawitan; Chris Holmes

arXiv:2412.15296·cs.CL·December 23, 2024·2 cites

TL;DR

This paper investigates how large language models assess their own confidence in answers, revealing that they tend to overestimate confidence and lack an internally coherent sense of certainty, despite performing better than random guessing.

Contribution

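The paper evaluates LLM confidence in two ways: qualitatively, by whether a model keeps its answer when prompted to reconsider, and quantitatively, by its self-reported confidence score. The results highlight limitations in the models' ability to assess their own answers.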

Findings

01. LLMs perform better than random guessing on reasoning tasks.
02. There is a positive correlation between qualitative confidence and accuracy.
03. LLMs tend to overstate their confidence scores.

Abstract

There is a growing literature on reasoning by large language models (LLMs), but the discussion on the uncertainty in their responses is still lacking. Our aim is to assess the extent of confidence that LLMs have in their answers and how it correlates with accuracy. Confidence is measured (i) qualitatively in terms of persistence in keeping their answer when prompted to reconsider, and (ii) quantitatively in terms of self-reported confidence score. We investigate the performance of three LLMs -- GPT4o, GPT4-turbo and Mistral -- on two benchmark sets of questions on causal judgement and formal fallacies and a set of probability and statistical puzzles and paradoxes. Although the LLMs show significantly better performance than random guessing, there is a wide variability in their tendency to change their initial answers. There is a positive correlation between qualitative confidence and…
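The two confidence measures described in the abstract can be illustrated with a minimal sketch. It assumes each trial records whether the initial answer was correct, whether the model kept that answer when asked to reconsider, and a self-reported confidence on a 0 to 1 scale. The `Trial` record, the `summarize` helper, the 0 to 1 scale, and the toy data are all hypothetical and invented for illustration; they are not the authors' code, protocol, or results.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question posed to a model (hypothetical record layout)."""
    initial_correct: bool  # was the first answer correct?
    kept_answer: bool      # did the model keep its answer when asked to reconsider?
    confidence: float      # self-reported confidence, assumed to lie in [0, 1]


def rate(flags):
    """Fraction of True values in a list, or None if the list is empty."""
    return sum(flags) / len(flags) if flags else None


def summarize(trials):
    """Compute accuracy, persistence, accuracy split by persistence,
    and the gap between mean self-reported confidence and accuracy."""
    accuracy = rate([t.initial_correct for t in trials])
    mean_confidence = sum(t.confidence for t in trials) / len(trials)
    return {
        "accuracy": accuracy,
        # Qualitative confidence: how often the model keeps its answer.
        "persistence_rate": rate([t.kept_answer for t in trials]),
        "accuracy_when_kept": rate(
            [t.initial_correct for t in trials if t.kept_answer]
        ),
        "accuracy_when_changed": rate(
            [t.initial_correct for t in trials if not t.kept_answer]
        ),
        # Positive values mean stated confidence exceeds actual accuracy.
        "overconfidence_gap": mean_confidence - accuracy,
    }


# Toy data invented for this sketch; not results from the paper.
toy_trials = [
    Trial(True, True, 0.95),
    Trial(True, True, 0.90),
    Trial(True, False, 0.90),
    Trial(False, True, 0.85),
    Trial(False, False, 0.90),
    Trial(False, False, 0.80),
]

summary = summarize(toy_trials)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy data, accuracy is higher among kept answers than among changed ones, which is the kind of positive association between persistence and accuracy the abstract refers to, and the positive gap mirrors the listed finding that confidence scores are overstated.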

Code & Models

Repositories

yudpaw-git/statspuzzle (Official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training