UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation

Ruihan Yang; Caiqi Zhang; Zhisong Zhang; Xinting Huang; Dong Yu; Nigel Collier; Deqing Yang

arXiv:2505.16922·cs.CL·October 10, 2025

TL;DR

This paper introduces UNCLE, a benchmark for evaluating how well large language models express uncertainty in long-form generation, revealing current limitations and proposing methods to improve uncertainty communication.

Contribution

The paper presents UNCLE, the first benchmark linking short- and long-form QA for uncertainty expression, along with new metrics and analysis of model performance.

Findings

01 Current models poorly express uncertainty in long-form generation.

02 Training-based methods improve uncertainty expression more than prompt-based methods.

03 UNCLE reveals significant gaps between short- and long-form uncertainty communication.

Abstract

Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE covers five domains and includes more than 1,000 entities, each with paired short- and long-form QA items. Our dataset is the first to directly link short- and long-form QA through aligned questions and gold-standard answers. Along with UNCLE, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. We then demonstrate…
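As an illustration of the pairing described in the abstract, the following is a minimal sketch of how one entity could carry a long-form question alongside aligned short-form questions with gold answers. Every class, field, and function name here is hypothetical, as are the placeholder values; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ShortQA:
    """Hypothetical short-form item: one question about one aspect of an entity."""
    question: str
    gold_answer: str


@dataclass
class EntityItem:
    """Hypothetical record pairing one long-form question with aligned short-form items."""
    entity: str
    domain: str
    long_question: str
    short_items: List[ShortQA] = field(default_factory=list)

    def gold_facts(self) -> List[str]:
        # The gold answers of the short-form items double as the facts
        # a long-form answer about the same entity would be checked against.
        return [item.gold_answer for item in self.short_items]


def count_by_domain(items: List[EntityItem]) -> Dict[str, int]:
    """Count entities per domain (the abstract mentions five domains)."""
    counts: Dict[str, int] = {}
    for item in items:
        counts[item.domain] = counts.get(item.domain, 0) + 1
    return counts


# Placeholder data only; not taken from the benchmark.
example = EntityItem(
    entity="Example Entity",
    domain="example-domain",
    long_question="Tell me about Example Entity.",
    short_items=[
        ShortQA("When was Example Entity founded?", "placeholder year"),
        ShortQA("Where is Example Entity located?", "placeholder place"),
    ],
)
```

Keeping the short-form items inside the entity record is one way to make the alignment explicit: the same gold answers can be used to score a short answer directly or to check the corresponding claim in a long-form response.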

Code & Models

Datasets

rhyang2021/UNCLE
dataset· 12 dl

Videos

UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation· underline

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Misinformation and Its Impacts