Do Large Language Models Know What They Are Capable Of?

Casey O. Barkan; Sid Black; Oliver Sourbut

arXiv:2512.24661·cs.CL·January 1, 2026

Open Access · 3 Reviews

TL;DR

This paper examines whether large language models can accurately predict their success on tasks, how their self-assessment evolves during multi-step tasks, and how in-context learning influences their decision-making and overconfidence.

Contribution

It provides empirical insights into LLMs' self-assessment abilities, their overconfidence issues, and the impact of in-context experiences on improving their decision-making.

Findings

01

Most LLMs predict success better than random chance.

02

Newer and larger LLMs do not necessarily have better self-assessment accuracy.

03

In-context experiences can reduce overconfidence in some LLMs.

Abstract

We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to…
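The abstract distinguishes two properties of a model's success predictions: overconfidence and discriminatory power. The sketch below illustrates one common way to quantify each; it is not taken from the paper, whose exact metrics are not given in this summary. It assumes overconfidence is measured as mean stated confidence minus observed success rate, and discriminatory power as AUROC (the probability that a randomly chosen success received higher confidence than a randomly chosen failure, so 0.5 is random). The function names and the sample data are hypothetical.

```python
from typing import Sequence


def overconfidence(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean stated confidence minus observed success rate (positive = overconfident).

    Assumed definition for illustration; the paper may use a different one.
    """
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    mean_confidence = sum(confidences) / len(confidences)
    success_rate = sum(outcomes) / len(outcomes)
    return mean_confidence - success_rate


def auroc(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """AUROC via pairwise comparison of successes against failures.

    Each (success, failure) pair scores 1 if the success had higher
    confidence, 0.5 on a tie, and 0 otherwise.
    """
    successes = [c for c, y in zip(confidences, outcomes) if y == 1]
    failures = [c for c, y in zip(confidences, outcomes) if y == 0]
    if not successes or not failures:
        raise ValueError("need at least one success and one failure")
    score = 0.0
    for s in successes:
        for f in failures:
            if s > f:
                score += 1.0
            elif s == f:
                score += 0.5
    return score / (len(successes) * len(failures))


# Hypothetical data: stated success probabilities and actual outcomes (1 = success).
stated = [0.9, 0.8, 0.9, 0.7, 0.6, 0.95]
actual = [1, 0, 0, 1, 0, 1]

print(f"overconfidence: {overconfidence(stated, actual):.3f}")
print(f"AUROC: {auroc(stated, actual):.3f}")
```

On this made-up data the model is overconfident (its average stated confidence exceeds its success rate) yet still discriminates better than random, which is the combination the abstract reports for most tested LLMs.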

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. Comprehensive experimental design and clear visualization of the results.
2. Provides empirical evidence of 'self-awareness of ability' across LLMs of different sizes.

Weaknesses

1. The paper needs an explicitly defined research gap and motivation, and an explanation of why the study can fill that gap. The contributions of this work are not distinguished from previous work.
2. How LLMs know what they are capable of is quantified by self-reported confidence in this study. Autoregressive LLMs trained by next-token prediction estimate confidence on a given task in a fundamentally different manner from humans, making it somewhat ambiguous to represent 'real confidence' in the tasks merely

Reviewer 02 · Rating 8 · Confidence 4

Strengths

* While the research question is relatively simple, the paper evaluates the question thoroughly and considers the problem from multiple angles. And to the best of my knowledge, this paper introduces some new paradigms to evaluate model confidence.
* The paper is very well written, the experimental setup is very clear, and it connects very well to previous work.
* The paper adequately discusses its limitations.

Weaknesses

* While I think the task in Experiment 2 with fictional costs is an interesting approach, I was wondering whether it has been validated that LLMs can make such risk-based decisions when they have direct access to the underlying risks. In other words, is there evidence that LLMs can accurately compute the expected reward and base decisions on this implicit calculation? While I don't think this would change results here, since the confidence estimates still seem to be inaccurate anyway, it woul
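The expected-reward calculation this reviewer refers to can be illustrated with a minimal sketch. It assumes a simple payoff structure that is not confirmed by this summary: attempting a task earns a fixed `reward` on success and incurs a fixed `cost` on failure, while declining yields zero. Under that assumption, attempting is worthwhile only when the success probability exceeds `cost / (reward + cost)`. All names and numbers below are hypothetical.

```python
def expected_value(p_success: float, reward: float, cost: float) -> float:
    """Expected payoff of attempting: gain `reward` on success, lose `cost` on failure."""
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    return p_success * reward - (1.0 - p_success) * cost


def break_even_probability(reward: float, cost: float) -> float:
    """Success probability at which attempting and declining have equal expected payoff."""
    if reward < 0 or cost < 0 or reward + cost == 0:
        raise ValueError("reward and cost must be non-negative and not both zero")
    return cost / (reward + cost)


def should_attempt(p_success: float, reward: float, cost: float) -> bool:
    """Attempt only if the expected payoff beats declining (assumed payoff 0)."""
    return expected_value(p_success, reward, cost) > 0.0


# Hypothetical scenario: failure costs three times what success earns.
reward, cost = 1.0, 3.0
stated_p, true_p = 0.8, 0.5  # an overconfident self-estimate vs. the real success rate

print("break-even probability:", break_even_probability(reward, cost))
print("attempts based on stated confidence:", should_attempt(stated_p, reward, cost))
print("expected value at the true success rate:", expected_value(true_p, reward, cost))
```

In this made-up scenario the decision rule itself is applied correctly, but the overconfident estimate clears the break-even threshold while the true success rate does not, so the attempt loses value in expectation. This separates the two failure modes the reviewer distinguishes: miscomputing the expected reward versus feeding it an inaccurate confidence.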

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* The paper is very well written with great flow, good schematics, and no major grammatical or formatting errors.
* The paper crystallizes a well-known phenomenon, understood intuitively and anecdotally, into a rigorous analysis.
* Calibration is a very important problem to study and understand.

Weaknesses

* Unfortunately, the world of foundation models moves incredibly fast, and the models tested in this work are already fairly dated. For example, GPT-4.1, Sonnet 3.5, etc. are no longer available. As noted by the authors, some of the behaviors characterized are divergent, and thus likely no longer relevant to the newest class of models (e.g. GPT-5).
* It would benefit the audience to draw connections between the in-advance setup of the work and calibration in traditional neural network architecture (

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI