Humans and Large Language Models in Clinical Decision Support: A Study with Medical Calculators

Nicholas Wan; Qiao Jin; Joey Chan; Guangzhi Xiong; Serina Applebaum; Aidan Gilson; Reid McMurry; R. Andrew Taylor; Aidong Zhang; Qingyu Chen; Zhiyong Lu

arXiv:2411.05897·cs.CL·March 25, 2025

TL;DR

This study evaluates how well large language models support clinical decision-making by recommending medical calculators, finding that they are currently less accurate than human annotators.

Contribution

It provides a comprehensive comparison of multiple LLMs and humans in clinical calculator recommendation, highlighting current limitations of LLMs in this domain.

Findings

01

The best-performing LLM, OpenAI o1, reached 66.0% accuracy on a 100-question subset.

02

Two human annotators averaged 79.5% accuracy on the same subset, nominally outperforming the LLMs.

03

The majority of LLM errors stem from comprehension issues.

Abstract

Although large language models (LLMs) have been assessed for general medical knowledge using licensing exams, their ability to support clinical decision-making, such as selecting medical calculators, remains uncertain. We assessed nine LLMs, including open-source, proprietary, and domain-specific models, with 1,009 multiple-choice question-answer pairs across 35 clinical calculators and compared LLMs to humans on a subset of questions. While the highest-performing LLM, OpenAI o1, provided an answer accuracy of 66.0% (CI: 56.7-75.3%) on the subset of 100 questions, two human annotators nominally outperformed LLMs with an average answer accuracy of 79.5% (CI: 73.5-85.0%). Ultimately, we evaluated medical trainees and LLMs in recommending medical calculators across clinical scenarios like risk stratification and diagnosis. With error analysis showing that the highest-performing LLMs…
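The interval reported for OpenAI o1 can be checked with simple arithmetic. The sketch below is only an illustration: it assumes a normal-approximation (Wald) interval at the 95% level, which happens to reproduce the quoted 56.7-75.3% range for 66.0% accuracy on 100 questions. The excerpt does not say how the authors computed their intervals, so the method, the function name `wald_interval`, and the `z=1.96` default are assumptions, not the paper's procedure.

```python
import math


def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Hypothetical helper for illustration; the paper's actual interval
    method is not stated in the abstract.
    """
    # Standard error of a proportion, scaled by the z critical value.
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


# Figures taken from the abstract: 66.0% accuracy on 100 questions.
low, high = wald_interval(0.66, 100)
print(f"{low:.1%} to {high:.1%}")  # prints "56.7% to 75.3%"
```

The same formula does not exactly reproduce the human interval of 73.5-85.0% under a simple pooling assumption, so the authors may well have used a different approach for one or both intervals.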
