MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
Nikhil Khandekar, Qiao Jin, Guangzhi Xiong, Soren Dunn, Serina S Applebaum, Zain Anwar, Maame Sarfo-Gyamfi, Conrad W Safranek, Abid A Anwar, Andrew Zhang, Aidan Gilson, Maxwell B Singer, Amisha Dave, Andrew Taylor, Aidong Zhang, Qingyu Chen, and Zhiyong Lu

TL;DR
MedCalc-Bench introduces a new dataset to evaluate large language models' ability to perform medical calculations, revealing current limitations in accuracy and reasoning for clinical use.
Contribution
This paper presents the first dataset focused on medical calculation tasks for LLMs, enabling targeted evaluation of their quantitative reasoning in medicine.
Findings
LLMs struggle with accurate entity extraction in medical calculations
Current LLMs often use incorrect equations or rules for medical computations
Arithmetic errors are common in LLMs when performing medical calculations
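The three findings above line up with three stages of a medical calculation: extracting the inputs from the note, choosing the right equation, and doing the arithmetic. The following is a minimal sketch of those stages, not the paper's evaluation code. It uses body mass index (weight in kg divided by height in metres squared) only as a familiar illustration. The patient note, the regex-based extract_entities helper, and the bmi function are all hypothetical and assumed for this example.

```python
import re


def extract_entities(note: str) -> dict:
    """Stage 1 (entity extraction): pull numeric inputs out of free text.

    Hypothetical helper: assumes weight is written in kg and height in cm.
    """
    weight = re.search(r"(\d+(?:\.\d+)?)\s*kg", note)
    height = re.search(r"(\d+(?:\.\d+)?)\s*cm", note)
    if weight is None or height is None:
        raise ValueError("could not find weight (kg) and height (cm) in note")
    return {
        "weight_kg": float(weight.group(1)),
        "height_cm": float(height.group(1)),
    }


def bmi(weight_kg: float, height_cm: float) -> float:
    """Stages 2 and 3: apply the BMI equation, then carry out the arithmetic."""
    height_m = height_cm / 100.0  # the equation expects metres, not centimetres
    return weight_kg / (height_m ** 2)


# Made-up note for illustration only; it is not drawn from the dataset.
note = "A 45-year-old man, weight 70 kg, height 175 cm, presents with fatigue."
entities = extract_entities(note)
result = bmi(**entities)
print(round(result, 1))  # 22.9
```

A mistake at any stage gives a wrong final answer: reading 45 as the weight is an extraction error, skipping the centimetre-to-metre conversion is an equation error, and miscomputing 70 / 3.0625 is an arithmetic error.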
Abstract
As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a…
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling
Methods: Sparse Evolutionary Training
