Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi

TL;DR
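This paper systematically evaluates black-box confidence elicitation methods for large language models. It finds that models tend to be overconfident when verbalizing confidence, that calibration and failure prediction improve with model scale, and that prompting, sampling, and aggregation strategies can improve uncertainty estimation without access to model internals.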
Contribution
It introduces a comprehensive framework for black-box confidence elicitation in LLMs and benchmarks various prompting, sampling, and aggregation strategies across multiple models and tasks.
Findings
LLMs tend to be overconfident when verbalizing confidence.
Scaling up models improves calibration and failure prediction.
Proposed strategies can mitigate overconfidence and narrow the gap with white-box methods.
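The two evaluation tasks named in these findings, calibration and failure prediction, can be illustrated with a minimal sketch. It assumes a binned expected calibration error (ECE) for calibration and a pairwise AUROC for failure prediction; the function names, the bin count, and the toy data are illustrative assumptions, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted average of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    n_bins=10 is an assumed default, not a value from the paper.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def failure_prediction_auroc(confidences, correct):
    """AUROC: probability that a correct answer gets higher confidence
    than an incorrect one (ties count as half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example of overconfidence: 90% stated confidence, 50% accuracy.
overconfident_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, True, False]
)
separable_auroc = failure_prediction_auroc(
    [0.9, 0.8, 0.3, 0.6], [True, True, False, False]
)
```

In the toy data, a model that states 0.9 confidence while being right half the time has a large gap between confidence and accuracy, which is what the ECE captures.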
Abstract
Empowering large language models to accurately express confidence in their answers is essential for trustworthy decision-making. Previous confidence elicitation methods, which primarily rely on white-box access to internal model information or model fine-tuning, have become less suitable for LLMs, especially closed-source commercial APIs. This leads to a growing need to explore the untapped area of black-box approaches for LLM uncertainty estimation. To better break down the problem, we define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency. We then benchmark these methods on two key tasks, confidence calibration and failure prediction, across five types of datasets (e.g., commonsense and arithmetic reasoning) and five widely-used…
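The sampling and aggregation components described in the abstract can be sketched as follows. This is a minimal illustration that assumes the sampled answers (and, optionally, their verbalized confidences) have already been collected from a model; the function names and the confidence-weighted variant are hypothetical and may differ from the aggregation rules the paper actually benchmarks.

```python
from collections import Counter


def consistency_confidence(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top_answer, freq = counts.most_common(1)[0]
    return top_answer, freq / len(answers)


def weighted_confidence(answers, verbalized):
    """Hypothetical hybrid: weight each sampled answer by its verbalized confidence.

    Returns the answer with the largest total weight and its share of all weight.
    """
    if len(answers) != len(verbalized) or not answers:
        raise ValueError("inputs must be non-empty and of equal length")
    totals = {}
    for answer, conf in zip(answers, verbalized):
        totals[answer] = totals.get(answer, 0.0) + conf
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("verbalized confidences sum to zero")
    best = max(totals, key=totals.get)
    return best, totals[best] / total_weight


# Five sampled responses to the same question (toy data).
majority, agreement = consistency_confidence(["42", "42", "41", "42", "40"])
weighted_answer, weighted_score = weighted_confidence(["A", "A", "B"], [0.9, 0.8, 0.6])
```

Agreement among samples gives a confidence signal that does not depend on the model's own stated number, which is why it can act as a check on overconfident verbalized scores.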
Taxonomy
Topics: Artificial Intelligence in Law · Financial Distress and Bankruptcy Prediction
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Softmax · Dropout · Byte Pair Encoding · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer
