UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

Xunzhi Wang; Zhuowei Zhang; Gaonan Chen; Qiongyu Li; Bitong Luo; Zhixin Han; Haotian Wang; Zhiyu Li; Hang Gao; Mengting Hu

arXiv:2406.12784·cs.CL·June 5, 2025

TL;DR

UBench introduces a confidence interval-based benchmark for evaluating the uncertainty of large language models across diverse tasks, enabling effective comparison without requiring internal model access or extensive retraining.

Contribution

The paper presents UBench, a novel benchmark for LLM uncertainty evaluation using confidence intervals, applicable to both open and closed-source models, and explores factors affecting uncertainty estimation.

Findings

01. Confidence interval methods effectively quantify uncertainty.
02. Open-source models perform competitively with closed-source models.
03. Chain-of-Thought and role-playing prompts can improve uncertainty estimates.

Abstract

Despite recent progress in systematic evaluation frameworks, benchmarking the uncertainty of large language models (LLMs) remains a highly challenging task. Existing methods for benchmarking the uncertainty of LLMs face three key challenges: the need for internal model access, additional training, or high computational costs. This is particularly unfavorable for closed-source models. To this end, we introduce UBench, a new benchmark for evaluating the uncertainty of LLMs. Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities. Based on this, we conduct extensive experiments. This includes comparisons with other advanced uncertainty estimation methods, the assessment of the uncertainty of 20 LLMs, and an exploration of the effects of Chain-of-Thought (CoT)…
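The abstract says UBench is based on confidence intervals, but this excerpt does not give the interval layout or the scoring metric. As a rough illustration only, the sketch below assumes the model picks one of ten equal-width confidence intervals for each multiple-choice answer, treats the interval midpoint as its stated confidence, and measures calibration with expected calibration error (ECE). The function names, the ten-bin layout, and the choice of ECE are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple


def interval_midpoint(interval: Tuple[float, float]) -> float:
    """Turn a chosen confidence interval, e.g. (0.9, 1.0), into a single number."""
    low, high = interval
    return (low + high) / 2.0


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between accuracy and stated confidence per bin.

    Assumption: equal-width bins over [0, 1]; this is a generic ECE,
    not necessarily the exact metric UBench reports.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins        # correct answers per bin
    counts = [0] * n_bins      # answers per bin

    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue
        accuracy = hits[b] / counts[b]
        mean_conf = conf_sum[b] / counts[b]
        ece += (counts[b] / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical responses: the interval the model chose and whether it was right.
chosen = [(0.9, 1.0), (0.9, 1.0), (0.5, 0.6), (0.5, 0.6)]
was_correct = [True, True, True, False]
ece = expected_calibration_error(
    [interval_midpoint(iv) for iv in chosen], was_correct
)
```

Because this only needs the model's chosen answer and chosen interval, it works from text output alone, which is consistent with the abstract's point about not requiring internal model access.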

Code & Models

Repositories

Cyno2232/UBENCH (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer