MedCalc-Bench: Evaluating Large Language Models for Medical Calculations

Nikhil Khandekar; Qiao Jin; Guangzhi Xiong; Soren Dunn; Serina S Applebaum; Zain Anwar; Maame Sarfo-Gyamfi; Conrad W Safranek; Abid A Anwar; Andrew Zhang; Aidan Gilson; Maxwell B Singer; Amisha Dave; Andrew Taylor; Aidong Zhang; Qingyu Chen; and Zhiyong Lu

arXiv:2406.12036 · cs.CL · July 2, 2024 · 6 cites

TL;DR

MedCalc-Bench introduces a new dataset to evaluate large language models' ability to perform medical calculations, revealing current limitations in accuracy and reasoning for clinical use.

Contribution

This paper presents the first dataset focused on medical calculation tasks for LLMs, enabling targeted evaluation of their quantitative reasoning in medicine.

Findings

01

LLMs struggle with accurate entity extraction in medical calculations

02

Current LLMs often use incorrect equations or rules for medical computations

03

Arithmetic errors are common in LLMs when performing medical calculations
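The three findings correspond to three stages of a medical calculation: extracting the inputs from the note, choosing the right equation, and doing the arithmetic. The sketch below illustrates those stages with the Cockcroft-Gault creatinine clearance formula. The patient note, the regular expressions, and the function names are hypothetical and written for illustration only; they are not taken from the dataset or the authors' code.

```python
import re


def extract_entities(note: str) -> dict:
    """Stage 1 (entity extraction): pull the needed values out of free text.

    The patterns are illustrative assumptions that fit the sample note below;
    real clinical notes are far less regular.
    """
    age = int(re.search(r"(\d+)-year-old", note).group(1))
    weight_kg = float(re.search(r"(\d+(?:\.\d+)?)\s*kg", note).group(1))
    scr = float(
        re.search(r"creatinine\s+(?:of\s+)?(\d+(?:\.\d+)?)\s*mg/dL", note).group(1)
    )
    is_female = re.search(r"\b(woman|female)\b", note, re.IGNORECASE) is not None
    return {
        "age": age,
        "weight_kg": weight_kg,
        "scr_mg_dl": scr,
        "is_female": is_female,
    }


def cockcroft_gault(age: int, weight_kg: float, scr_mg_dl: float, is_female: bool) -> float:
    """Stages 2 and 3 (equation choice and arithmetic).

    Creatinine clearance in mL/min:
    (140 - age) * weight / (72 * serum creatinine), times 0.85 for women.
    """
    crcl = (140 - age) * weight_kg / (72 * scr_mg_dl)
    return crcl * 0.85 if is_female else crcl


# Hypothetical patient note, invented for this example.
note = (
    "A 60-year-old woman weighing 72 kg presents with fatigue. "
    "Labs show a serum creatinine of 1.0 mg/dL."
)

entities = extract_entities(note)
crcl = cockcroft_gault(**entities)
print(f"Creatinine clearance: {crcl:.1f} mL/min")
```

A mistake at any stage changes the final number: misreading the weight, omitting the 0.85 factor for a female patient, or miscomputing the product each produces a different wrong answer.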

Abstract

As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a…
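The instance layout described in the abstract can be pictured as a small record. The sketch below is a minimal illustration, not the dataset's actual schema: the class name, field names, sample values, and the 5% relative tolerance are all assumptions, and only the three components named before the abstract is cut off are modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class CalcInstance:
    """Hypothetical record for one benchmark instance (field names assumed)."""

    patient_note: str
    question: str
    ground_truth: float


def is_correct(predicted: float, instance: CalcInstance, rel_tol: float = 0.05) -> bool:
    """Compare a model's numeric answer with the ground truth.

    The relative tolerance is an arbitrary choice for this sketch.
    """
    return math.isclose(predicted, instance.ground_truth, rel_tol=rel_tol)


# Invented example: BMI = weight (kg) / height (m) squared.
example = CalcInstance(
    patient_note="A 45-year-old man, height 1.75 m, weight 70 kg.",
    question="What is the patient's body mass index in kg/m^2?",
    ground_truth=70 / 1.75**2,
)

print(is_correct(22.9, example))  # answer within tolerance
print(is_correct(25.0, example))  # answer outside tolerance
```

Scoring against a tolerance rather than exact equality avoids penalizing answers that differ only in rounding.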

Code & Models

Repositories

ncbi-nlp/medcalc-bench
Official

Models

sigjhl/medgemma-1.5-4b-it-MedCalcCaller
model · 2 downloads

Videos

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations · SlidesLive

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling

Methods: Sparse Evolutionary Training