Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs
Yunpeng Xiao, Carl Yang, Mark Mai, Xiao Hu, Kai Shu

TL;DR
This paper critiques current LLM evaluations in medicine and proposes a new paradigm that better captures real-world clinical decision-making by characterizing tasks along two dimensions, clinical backgrounds and clinical questions. It also extends evaluation metrics beyond accuracy.
Contribution
It introduces a unifying framework for clinical decision-making tasks, reviews existing datasets and methods, and emphasizes comprehensive evaluation metrics for clinically meaningful LLMs.
Findings
Existing datasets underrepresent real clinical complexity
Methods vary in effectiveness depending on task difficulty
Extended evaluation metrics improve assessment of LLMs in clinical settings
Abstract
Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified Question-Answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies…
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling
