AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions

Polina Kirichenko; Mark Ibrahim; Kamalika Chaudhuri; Samuel J. Bell

arXiv:2506.09038·cs.AI·June 11, 2025

TL;DR

This paper introduces AbstentionBench, a comprehensive benchmark to evaluate how well large language models can recognize unanswerable questions and abstain, revealing current limitations and the impact of scaling and prompting strategies.

Contribution

The work provides the first large-scale, systematic evaluation framework for LLM abstention, showing where models struggle and how model scaling and prompting affect abstention.

Findings

01. Abstention remains an unsolved problem across models.

02. Scaling models is of little use for improving abstention.

03. Prompt engineering can boost abstention in practice but does not address the underlying difficulty of reasoning about uncertainty.

Abstract

For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is as critical as answering correctly. Real-world user queries, which can be underspecified, ill-posed, or fundamentally unanswerable, require LLMs to reason about uncertainty and selectively abstain, i.e., refuse to answer definitively. However, abstention remains understudied, without a systematic evaluation framework for modern LLMs. In this work, we introduce AbstentionBench, a large-scale benchmark for holistically evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information. Evaluating 20 frontier LLMs reveals abstention is an unsolved problem, and one where scaling models is of little use. While recent reasoning LLMs have shown…
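To make the notion of evaluating abstention concrete, the following is a minimal sketch of how abstention could be scored once each response has been labeled. It is not taken from the AbstentionBench code: the `Judgement` record, its field names, and the `abstention_scores` helper are hypothetical, and the choice of recall and precision as summary metrics is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """One labeled model response (hypothetical record, not the benchmark's schema)."""
    should_abstain: bool  # ground truth: the question is unanswerable as posed
    did_abstain: bool     # the model declined to answer definitively


def abstention_scores(items):
    """Return abstention recall and precision over a list of Judgement records."""
    # True positives: the model abstained when it should have.
    tp = sum(j.should_abstain and j.did_abstain for j in items)
    # False negatives: the model answered an unanswerable question.
    fn = sum(j.should_abstain and not j.did_abstain for j in items)
    # False positives: the model abstained on an answerable question.
    fp = sum((not j.should_abstain) and j.did_abstain for j in items)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"recall": recall, "precision": precision}


if __name__ == "__main__":
    sample = [
        Judgement(should_abstain=True, did_abstain=True),
        Judgement(should_abstain=True, did_abstain=False),
        Judgement(should_abstain=False, did_abstain=False),
        Judgement(should_abstain=False, did_abstain=True),
    ]
    print(abstention_scores(sample))
```

Recall captures how often a model abstains when it should, while precision penalizes over-refusal on answerable questions. Reporting both guards against a model that scores well simply by refusing everything.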

Code & Models

Repositories

facebookresearch/abstentionbench
PyTorch · Official

Datasets

facebook/AbstentionBench
dataset · 1.4k downloads

Videos

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions · SlidesLive

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks