CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, and Bradley A. Malin

TL;DR
This paper introduces the CLEAR framework to evaluate how noise and ambiguity impact the reasoning and reliability of medical LLMs, revealing significant limitations in current benchmarks.
Contribution
The paper presents a novel evaluation framework that systematically assesses the effects of ambiguity and uncertainty on medical LLMs' performance, highlighting key limitations of existing benchmarks.
Findings
Increasing the number of plausible answer options reduces both a model's accuracy and its ability to abstain when the options are incorrect.
The framing of the abstention option affects model caution, with an "I don't know" (IDK) framing increasing errors.
Scaling up models does not fully resolve these reliability issues, indicating a persistent humility deficit.
Abstract
Medical large language model (LLM) evaluations rely on simplified, exam-style benchmarks that rarely reflect the ambiguity of real-world medical inquiries. We introduce the CLinical Evaluation of Ambiguity and Reliability (CLEAR) framework, which assesses how decision-space presentation, ambiguity, and uncertainty affect LLMs' reasoning on medical benchmarks. CLEAR systematically perturbs (1) the number of plausible answer options, (2) the presence of a ground truth or abstention option, and (3) the semantic framing of answer options. Applying CLEAR on three benchmarks evaluated across 17 LLMs reveals three notable limitations of existing evaluation methods. First, increasing the number of plausible answers degrades a model's ability to identify the correct answer and abstain against incorrect ones. Second, this lack of caution intensifies as the framing of abstention shifts from…
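The three perturbations listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Item` class, the `perturb` function, its parameters, and the "None of the above" label are all hypothetical names chosen for illustration, and subsampling distractors is only one assumed way to vary the number of answer options.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """A multiple-choice question (hypothetical container, not from the paper)."""
    question: str
    options: List[str]
    answer: Optional[str]  # text of the option a reliable model should pick


def perturb(item: Item,
            n_distractors: int,
            include_truth: bool = True,
            abstain_label: Optional[str] = None,
            seed: int = 0) -> Item:
    """Return a perturbed copy of `item`.

    n_distractors : how many incorrect options to keep (perturbation 1)
    include_truth : whether the ground-truth option is shown (perturbation 2)
    abstain_label : wording of the abstention option, or None to omit it
                    (perturbations 2 and 3; the wording is an assumption)
    """
    rng = random.Random(seed)
    distractors = [o for o in item.options if o != item.answer]
    if n_distractors > len(distractors):
        raise ValueError("not enough distractors in the original item")

    options = rng.sample(distractors, n_distractors)
    if include_truth:
        options.append(item.answer)
    rng.shuffle(options)

    # The abstention option, if any, is always listed last.
    if abstain_label is not None:
        options.append(abstain_label)

    # The expected choice: the ground truth when it is shown, the
    # abstention option when it is hidden, otherwise undefined.
    if include_truth:
        target = item.answer
    else:
        target = abstain_label
    return Item(item.question, options, target)
```

For example, `perturb(item, 2, include_truth=False, abstain_label="None of the above")` yields an item with two incorrect options plus the abstention option, where abstaining is the expected response. Swapping the `abstain_label` string is how this sketch would vary the semantic framing of abstention.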
