Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators
Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R. Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

TL;DR
This study evaluates the ability of large language models to support clinical decision-making by recommending medical calculators, finding that they are currently less accurate than the human annotators they were compared against.
Contribution
It provides a comprehensive comparison of multiple LLMs and humans in clinical calculator recommendation, highlighting current limitations of LLMs in this domain.
Findings
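The highest-performing LLM, OpenAI o1, reached 66.0% accuracy (CI: 56.7-75.3%) on a subset of 100 questions.
Two human annotators nominally outperformed the LLMs, with an average accuracy of 79.5% (CI: 73.5-85.0%).
The majority of LLM errors stem from comprehension issues.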
Abstract
Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We assessed nine LLMs, including open-source, proprietary, and domain-specific models, with 1,009 multiple-choice question-answer pairs across 35 clinical calculators and compared LLMs to humans on a subset of questions. While the highest-performing LLM, OpenAI o1, provided an answer accuracy of 66.0% (CI: 56.7-75.3%) on the subset of 100 questions, two human annotators nominally outperformed LLMs with an average answer accuracy of 79.5% (CI: 73.5-85.0%). Ultimately, we evaluated medical trainees and LLMs in recommending medical calculators across clinical scenarios like risk stratification and diagnosis. With error analysis showing that the highest-performing LLMs…
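The abstract reports an interval around the 66.0% accuracy figure but does not say how it was computed. As an illustration only, the sketch below assumes a 95% normal-approximation (Wald) interval for a proportion, taking 66 correct answers out of 100; under that assumption the bounds agree with the reported 56.7-75.3%. The function name `wald_interval` is hypothetical, and the human-annotator interval may have been derived with a different method.

```python
import math


def wald_interval(correct, total, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: z = 1.96 corresponds to a 95% interval. The source does not
    state which interval method the authors actually used.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] so the interval stays a valid proportion range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# 66 correct answers on the 100-question subset, as reported in the abstract.
low, high = wald_interval(66, 100)
print(f"{low:.1%} - {high:.1%}")  # prints "56.7% - 75.3%"
```

With only 100 questions the interval spans nearly 19 percentage points, which is worth keeping in mind when comparing the LLM figure with the human one.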
