Benchmarking Large Language Model Uncertainty for Prompt Optimization

Pei-Fu Guo; Yun-Da Tsai; Shou-De Lin

arXiv:2409.10044 · cs.LG · December 30, 2024

TL;DR

This paper introduces a benchmark dataset for evaluating uncertainty estimation in large language models. It shows that current metrics mainly capture answer confidence rather than correctness, and it highlights the need for optimization-aware uncertainty metrics.

Contribution

It provides a new benchmark dataset and analysis framework for evaluating uncertainty metrics in LLMs, emphasizing the gap in metrics that truly reflect correctness for prompt optimization.

Findings

01 Current uncertainty metrics mainly reflect answer confidence.

02 Existing metrics correlate poorly with correctness uncertainty.

03 Improved uncertainty metrics that are aware of the optimization objective are needed.

Abstract

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models like GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, rather than Correctness Uncertainty, highlighting the need for improved metrics that are optimization-objective-aware to better guide prompt optimization. Our code and dataset are available at https://github.com/0Frett/PO-Uncertainty-Benchmarking.
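
To make the distinction in the abstract concrete, here is a minimal sketch of how the two quantities can diverge on the same set of sampled answers. It assumes, purely for illustration, that Answer Uncertainty is measured as the entropy of the empirical distribution over sampled answers and Correctness Uncertainty as the entropy of the empirical correct/incorrect split. The function names and the toy samples are hypothetical and are not taken from the paper's repository.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def answer_uncertainty(answers):
    """Entropy of the empirical distribution over distinct sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return entropy(c / n for c in counts.values())


def correctness_uncertainty(answers, gold):
    """Entropy of the empirical correct/incorrect split against a gold answer."""
    p_correct = sum(a == gold for a in answers) / len(answers)
    return entropy([p_correct, 1.0 - p_correct])


if __name__ == "__main__":
    gold = "12"
    # Hypothetical samples: four different answers, all of them wrong.
    scattered_wrong = ["7", "9", "3", "5"]
    # Hypothetical samples: two distinct answers, half of them correct.
    split = ["12", "12", "7", "7"]

    for name, samples in [("scattered_wrong", scattered_wrong), ("split", split)]:
        au = answer_uncertainty(samples)
        cu = correctness_uncertainty(samples, gold)
        print(f"{name}: answer={au:.3f} bits, correctness={cu:.3f} bits")
```

In the first toy case the answers are maximally diverse, so the answer entropy is high, yet every sample is wrong, so there is no uncertainty about correctness. A metric that tracks only answer diversity would therefore give a misleading signal to a prompt optimizer whose objective is correctness.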

Code & Models

Repositories

0frett/po-uncertainty-benchmarking (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Softmax · Layer Normalization · Dropout · Residual Connection · Attention Dropout · Linear Layer