Benchmarking Large Language Model Uncertainty for Prompt Optimization
Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

TL;DR
This paper introduces a benchmark dataset for evaluating uncertainty estimation in large language models. It shows that current metrics mainly capture answer confidence rather than correctness, and argues for better, optimization-aware uncertainty metrics.
Contribution
The paper provides a benchmark dataset and analysis framework for evaluating uncertainty metrics in LLMs, and highlights the lack of metrics that reflect correctness, the quantity that matters for prompt optimization.
Findings
Current uncertainty metrics mainly reflect answer confidence and output diversity.
Existing metrics correlate poorly with correctness uncertainty.
Improved, optimization-objective-aware uncertainty metrics are needed.
Abstract
Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, rather than Correctness Uncertainty, highlighting the need for improved metrics that are optimization-objective-aware to better guide prompt optimization. Our code and dataset are available at https://github.com/0Frett/PO-Uncertainty-Benchmarking.
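The distinction between Answer Uncertainty and Correctness Uncertainty can be illustrated with a small sketch. This is not the authors' implementation: it assumes, for illustration only, that both quantities are Shannon entropies over a set of sampled answers to one question, and the function names (entropy, answer_uncertainty, correctness_uncertainty) and the sample data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = len(labels)
    # p * log2(1 / p), written with counts to keep the arithmetic simple.
    return sum((c / total) * log2(total / c) for c in counts.values())


def answer_uncertainty(answers):
    """Assumed definition: entropy over the distinct sampled answers."""
    return entropy(answers)


def correctness_uncertainty(answers, reference):
    """Assumed definition: entropy over whether each sampled answer is correct."""
    return entropy([a == reference for a in answers])


# Hypothetical samples for one question whose reference answer is "42".
diverse_but_wrong = ["12", "15", "9", "20", "12", "7", "15", "3"]
evenly_split = ["42"] * 4 + ["12"] * 4

for name, samples in [("diverse_but_wrong", diverse_but_wrong),
                      ("evenly_split", evenly_split)]:
    au = answer_uncertainty(samples)
    cu = correctness_uncertainty(samples, "42")
    print(f"{name}: answer={au:.2f} bits, correctness={cu:.2f} bits")
```

In the first case the answers are highly diverse, so the answer entropy is high, yet every sample is wrong, so the correctness entropy is zero. A metric that tracks only output diversity would therefore flag the question as uncertain even though the model is consistently incorrect, which is the kind of mismatch the abstract describes.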
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Byte Pair Encoding · Softmax · Layer Normalization · Dropout · Residual Connection · Attention Dropout · Linear Layer
