AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, Samuel J. Bell

TL;DR
This paper introduces AbstentionBench, a comprehensive benchmark to evaluate how well large language models can recognize unanswerable questions and abstain, revealing current limitations and the impact of scaling and prompting strategies.
Contribution
The work provides the first large-scale, systematic evaluation framework for LLM abstention, covering 20 datasets and 20 frontier LLMs, and examines how model scaling and prompting affect abstention.
Findings
Abstention remains an unsolved problem across models.
Scaling models is of little use for improving abstention.
Prompt engineering can boost abstention but does not address the underlying problem.
Abstract
For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which can be underspecified, ill-posed, or fundamentally unanswerable, require LLMs to reason about uncertainty and selectively abstain -- i.e., refuse to answer definitively. However, abstention remains understudied, without a systematic evaluation framework for modern LLMs. In this work, we introduce AbstentionBench, a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information. Evaluating 20 frontier LLMs reveals abstention is an unsolved problem, and one where scaling models is of little use. While recent reasoning LLMs have shown…
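To make "abstention performance" concrete, here is a minimal sketch of how abstention could be scored once each response has been labeled. It is not the paper's implementation: the `Record` schema, the `abstention_scores` helper, and the choice of recall and precision as metrics are assumptions made for illustration, and the step that decides whether a response counts as an abstention is taken as given.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One evaluated question (hypothetical schema, not from the paper)."""
    should_abstain: bool  # ground truth: the question is unanswerable as posed
    did_abstain: bool     # judged outcome: the model declined to answer definitively


def abstention_scores(records):
    """Return (recall, precision) of abstention over a list of Records."""
    # True positives: the model abstained on a question where it should have.
    tp = sum(r.should_abstain and r.did_abstain for r in records)
    should = sum(r.should_abstain for r in records)
    did = sum(r.did_abstain for r in records)
    # Guard against empty denominators instead of raising ZeroDivisionError.
    recall = tp / should if should else 0.0
    precision = tp / did if did else 0.0
    return recall, precision


# Toy data: three unanswerable questions, two answerable ones.
records = [
    Record(should_abstain=True, did_abstain=True),
    Record(should_abstain=True, did_abstain=False),
    Record(should_abstain=False, did_abstain=False),
    Record(should_abstain=False, did_abstain=True),
    Record(should_abstain=True, did_abstain=True),
]
recall, precision = abstention_scores(records)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Recall captures how often a model abstains when it should, while precision penalizes over-abstaining on answerable questions; reporting both guards against a model that simply refuses everything.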
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
