Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

Yunpeng Xiao; Carl Yang; Mark Mai; Xiao Hu; Kai Shu

arXiv:2510.20001 · cs.CL · October 24, 2025

TL;DR

This paper critiques current LLM evaluations in medicine and proposes a paradigm that better captures real-world clinical decision-making by characterizing tasks along two dimensions, clinical backgrounds and clinical questions. It also extends evaluation beyond accuracy.

Contribution

It introduces a unifying framework for clinical decision-making tasks, reviews existing datasets and methods, and emphasizes comprehensive evaluation metrics for clinically meaningful LLMs.

Findings

01 Existing datasets underrepresent real clinical complexity

02 Methods vary in effectiveness depending on task difficulty

03 Extended evaluation metrics improve assessment of LLMs in clinical settings

Abstract

Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, such as MedQA, rely on simplified Question-Answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the backgrounds and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies…

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling