UNCLE: Benchmarking Uncertainty Expressions in Long-Form Generation
Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Dong Yu, Nigel Collier, Deqing Yang

TL;DR
This paper introduces UNCLE, a benchmark for evaluating how well large language models express uncertainty in long-form generation, revealing current limitations and proposing methods to improve uncertainty communication.
Contribution
The paper presents UNCLE, the first benchmark linking short- and long-form QA for uncertainty expression, along with new metrics and analysis of model performance.
Findings
Current models express uncertainty poorly in long-form generation.
Training-based methods improve uncertainty expression more than prompt-based methods.
UNCLE reveals significant gaps between short- and long-form uncertainty communication.
Abstract
Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE covers five domains and includes more than 1,000 entities, each with paired short- and long-form QA items. Our dataset is the first to directly link short- and long-form QA through aligned questions and gold-standard answers. Along with UNCLE, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. We then demonstrate…
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Misinformation and Its Impacts
