Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty
Sravanthi Machcha, Sushrita Yerra, Sahil Gupta, Aishwarya Sahoo, Sharmin Sultana, Hong Yu, Zonghai Yao

TL;DR
This paper introduces MedAbstain, a benchmark for evaluating medical LLMs' ability to abstain under uncertainty, revealing current models' limitations and guiding safer deployment in critical settings.
Contribution
The paper presents MedAbstain, a new benchmark and evaluation protocol for assessing abstention in medical LLMs, emphasizing the importance of explicit abstention options for safety.
Findings
Explicit abstention options increase model uncertainty and lead to safer abstention.
Larger models and advanced prompting offer limited improvements in abstention behavior.
Input perturbations are far less effective than explicit abstention options at inducing uncertainty and abstention.
Abstract
Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce MedAbstain, a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA) -- a discrete-choice setting that generalizes to agentic action selection -- integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain when uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and yields safer abstention, far more than input perturbations do, while scaling model size or advanced prompting…
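To illustrate how conformal prediction can drive abstention in a multiple-choice setting like the one the abstract describes, here is a minimal sketch of split conformal prediction over answer options. It is not taken from the paper: the nonconformity score (one minus the probability of the correct option), the rule "abstain unless the prediction set contains exactly one option", and the function names `conformal_threshold`, `prediction_set`, and `answer_or_abstain` are all assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out questions with known answers.

    cal_probs: list of per-option probability lists (one list per question).
    cal_labels: index of the correct option for each calibration question.
    alpha: target miscoverage rate (assumed value, not from the paper).
    """
    # Nonconformity score: 1 - probability assigned to the correct option.
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank used in split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return 1.0  # too few calibration points: every option is kept
    return scores[k - 1]


def prediction_set(probs, qhat):
    """Return indices of all options whose score does not exceed the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]


def answer_or_abstain(probs, qhat):
    """Answer only when the prediction set is a single option; otherwise abstain."""
    options = prediction_set(probs, qhat)
    return options[0] if len(options) == 1 else None  # None means "abstain"


if __name__ == "__main__":
    # Toy calibration data: probability of the correct option (index 0) varies.
    true_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2]
    cal_probs = [[p, 1.0 - p, 0.0, 0.0] for p in true_probs]
    cal_labels = [0] * len(true_probs)
    qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

    print(answer_or_abstain([0.90, 0.05, 0.03, 0.02], qhat))  # confident question
    print(answer_or_abstain([0.45, 0.40, 0.10, 0.05], qhat))  # ambiguous question
```

Under this rule, a larger prediction set signals higher uncertainty, so abstention follows from set size rather than from a hand-tuned confidence cutoff. A benchmark could equally score set size directly or treat an explicit "I don't know" option as one of the choices; the sketch does not model either.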
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
