Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu, Lu Gan

TL;DR
This paper introduces a standardized patient simulator and benchmark to evaluate large language models' active diagnostic reasoning, revealing performance drops in interactive settings compared to static evaluations.
Contribution
It presents an OSCE-inspired benchmark for active evidence-seeking in clinical diagnosis, highlighting challenges and limitations of current models in interactive scenarios.
Findings
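Across 468 cases and 15 models, multi-turn evidence seeking reduces diagnostic accuracy by 12.75% relative to full-context evaluation.
Supporting-evidence quality drops by 24.36% in interactive settings, again relative to full-context evaluation; error analyses associate both drops with premature diagnostic closure and inefficient questioning.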
Static benchmarks may overestimate model performance in real-world clinical interactions.
Abstract
Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.
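To make the contrast between full-context and interactive evaluation concrete, the following is a minimal sketch in Python. It is not the authors' simulator or protocol: the `Case` structure, the function names, and the toy policies are all hypothetical, and the single case is an illustrative placeholder rather than benchmark data. It only shows how a model that stops asking too early can fail a case it would solve when given every finding up front.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Case:
    # Hypothetical case format: topic -> finding the simulated patient can reveal.
    findings: Dict[str, str]
    diagnosis: str


def full_context_eval(case: Case, diagnose: Callable[[Dict[str, str]], str]) -> bool:
    """Static setting: the model sees every finding at once."""
    return diagnose(dict(case.findings)) == case.diagnosis


def interactive_eval(
    case: Case,
    choose_question: Callable[[Dict[str, str]], Optional[str]],
    diagnose: Callable[[Dict[str, str]], str],
    max_turns: int,
) -> bool:
    """Interactive setting: the model must request each finding, one per turn."""
    gathered: Dict[str, str] = {}
    for _ in range(max_turns):
        topic = choose_question(gathered)
        if topic is None:  # the model stops asking (possibly premature closure)
            break
        if topic in case.findings:  # the simulated patient only answers known topics
            gathered[topic] = case.findings[topic]
    return diagnose(gathered) == case.diagnosis


# Toy stand-ins for a model (assumptions for illustration only).
def toy_diagnose(evidence: Dict[str, str]) -> str:
    if evidence.get("ecg") == "ST elevation":
        return "myocardial infarction"
    return "unknown"


def premature_asker(gathered: Dict[str, str]) -> Optional[str]:
    # Asks about the symptom once, then closes without requesting the ECG.
    return "symptom" if "symptom" not in gathered else None


case = Case(
    findings={"symptom": "chest pain", "ecg": "ST elevation"},
    diagnosis="myocardial infarction",
)
static_correct = full_context_eval(case, toy_diagnose)
interactive_correct = interactive_eval(case, premature_asker, toy_diagnose, max_turns=5)
```

In this toy setup the same diagnostic rule succeeds in the static setting and fails in the interactive one, because the questioning policy never surfaces the decisive finding. Averaging such per-case outcomes over a case set would give the two accuracy figures being compared.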
