CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, and Bradley A. Malin

arXiv:2605.01011·cs.CL·May 12, 2026

TL;DR

This paper introduces the CLEAR framework to evaluate how noise and ambiguity impact the reasoning and reliability of medical LLMs, revealing significant limitations in current benchmarks.

Contribution

The paper presents a novel evaluation framework that systematically assesses the effects of ambiguity and uncertainty on medical LLMs' performance, highlighting key limitations of existing benchmarks.

Findings

01 Increasing the number of plausible answer options reduces both model accuracy and the model's ability to abstain.

02 The framing of the abstention option affects model caution; an "I don't know" (IDK) framing increases errors.

03 Scaling models does not fully address these reliability issues, indicating a humility deficit.

Abstract

Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from…
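The three perturbation axes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Item` class, the `perturb` function, its parameters, and the example question are all hypothetical, and the abstention wording is passed in as a plain string because the abstract does not list the exact framings used.

```python
import random
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical container for one multiple-choice benchmark question."""
    question: str
    correct: str
    distractors: list[str]


def perturb(item, n_options, include_truth=True, abstain_label=None, seed=0):
    """Build one perturbed answer set (illustrative only, not the CLEAR code).

    n_options     -- number of content options shown (axis 1)
    include_truth -- whether the ground-truth answer is present (axis 2)
    abstain_label -- text of the abstention option, or None (axes 2 and 3)
    """
    rng = random.Random(seed)
    n_distractors = n_options - 1 if include_truth else n_options
    if n_distractors > len(item.distractors):
        raise ValueError("not enough distractors for the requested option count")

    options = rng.sample(item.distractors, n_distractors)
    if include_truth:
        options.append(item.correct)
    rng.shuffle(options)

    if abstain_label is not None:
        options.append(abstain_label)  # abstention option is kept last

    # Expected answer: the truth when present, otherwise the abstention label
    # (None if the truth is removed and no abstention option is offered).
    expected = item.correct if include_truth else abstain_label
    return options, expected


# Example: ground truth removed, abstention framed as "I don't know".
item = Item(
    question="Which vitamin deficiency causes scurvy?",
    correct="Vitamin C",
    distractors=["Vitamin A", "Vitamin D", "Vitamin K", "Vitamin B12"],
)
options, expected = perturb(item, n_options=3, include_truth=False,
                            abstain_label="I don't know")
```

Under these assumptions, a model that picks any content option when `include_truth=False` would be scored as failing to abstain, and sweeping `n_options` and `abstain_label` would generate the kind of variation the abstract describes.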
