UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
Xunzhi Wang, Zhuowei Zhang, Gaonan Chen, Qiongyu Li, Bitong Luo, Zhixin Han, Haotian Wang, Zhiyu li, Hang Gao, Mengting Hu

TL;DR
UBench introduces a confidence interval-based benchmark for evaluating the uncertainty of large language models across diverse tasks, enabling effective comparison without requiring internal model access or extensive retraining.
Contribution
The paper presents UBench, a novel benchmark for LLM uncertainty evaluation using confidence intervals, applicable to both open and closed-source models, and explores factors affecting uncertainty estimation.
Findings
Confidence interval methods effectively quantify uncertainty.
Open-source models perform competitively with closed-source models.
Chain-of-Thought and role-playing prompts can improve uncertainty estimates.
Abstract
Despite recent progress in systematic evaluation frameworks, benchmarking the uncertainty of large language models (LLMs) remains a highly challenging task. Existing methods for benchmarking the uncertainty of LLMs face three key challenges: the need for internal model access, additional training, or high computational costs. This is particularly unfavorable for closed-source models. To this end, we introduce UBench, a new benchmark for evaluating the uncertainty of LLMs. Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities. Based on this, we conduct extensive experiments. This includes comparisons with other advanced uncertainty estimation methods, the assessment of the uncertainty of 20 LLMs, and an exploration of the effects of Chain-of-Thought (CoT)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
MethodsResidual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
