MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

Kangkun Mao; Jinru Ding; Jiayuan Chen; Mouxiao Bian; Ruiyao Chen; Xinwei Peng; Sijie Ren; Linyang Li; Jie Xu

arXiv:2510.27267·cs.CL·November 3, 2025

TL;DR

This paper introduces MedCalc-Eval, a benchmark of 700+ tasks for evaluating large language models' medical calculation skills, and MedCalc-Env, a reinforcement learning environment for improving multi-step clinical reasoning. A Qwen2.5-32B model fine-tuned in MedCalc-Env achieves state-of-the-art performance.

Contribution

The paper presents MedCalc-Eval, which the authors describe as the largest medical calculation benchmark, and MedCalc-Env, a new RL environment for improving LLMs' quantitative reasoning in medicine.

Findings

01

Qwen2.5-32B fine-tuned in MedCalc-Env achieves state-of-the-art performance.

02

The benchmark covers equation-based and rule-based calculation tasks across specialties including internal medicine, surgery, pediatrics, and cardiology.

03

The paper identifies remaining challenges such as unit conversion and multi-condition reasoning.
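The unit-conversion challenge can be illustrated with serum creatinine, which laboratories report in either mg/dL or µmol/L. The sketch below is an illustration only and is not taken from the paper: the function name `creatinine_to_mg_dl` and the unit strings are hypothetical, while the factor of 88.4 µmol/L per mg/dL is the standard conversion for creatinine.

```python
# Illustrative only: normalizing a lab value before plugging it into a formula.
# The function name and accepted unit strings are assumptions, not from the paper.

UMOL_L_PER_MG_DL = 88.4  # standard creatinine conversion factor


def creatinine_to_mg_dl(value: float, unit: str) -> float:
    """Return serum creatinine in mg/dL, converting from umol/L if needed."""
    if unit == "mg/dL":
        return value
    if unit in ("umol/L", "µmol/L"):
        return value / UMOL_L_PER_MG_DL
    raise ValueError(f"unsupported creatinine unit: {unit!r}")
```

A model that skips this step and feeds a µmol/L value into a formula expecting mg/dL will be off by nearly two orders of magnitude, which is one plausible reason unit handling shows up as a failure mode.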

Abstract

As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs' medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling…
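To make the two task types in the abstract concrete, here is a minimal sketch of the calculators it names. This is not the benchmark's code: the function names and signatures are hypothetical, the BSA variant shown is the Mosteller formula (the abstract does not say which BSA formula is used), and the other formulas are the commonly published ones.

```python
import math

# Equation-based calculators: a closed-form formula over numeric inputs.

def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def bsa_mosteller(weight_kg: float, height_cm: float) -> float:
    """Body surface area in m^2 (Mosteller formula, chosen here as an assumption)."""
    return math.sqrt(weight_kg * height_cm / 3600.0)


def cockcroft_gault(age_years: float, weight_kg: float,
                    serum_creatinine_mg_dl: float, is_female: bool) -> float:
    """Estimated creatinine clearance in mL/min."""
    crcl = (140 - age_years) * weight_kg / (72.0 * serum_creatinine_mg_dl)
    return crcl * 0.85 if is_female else crcl


# Rule-based scoring: sum discrete component scores, each with a fixed range.

def glasgow_coma_scale(eye: int, verbal: int, motor: int) -> int:
    """Total GCS (3 to 15) from eye (1-4), verbal (1-5), and motor (1-6) scores."""
    for name, score, upper in (("eye", eye, 4), ("verbal", verbal, 5), ("motor", motor, 6)):
        if not 1 <= score <= upper:
            raise ValueError(f"{name} score must be between 1 and {upper}")
    return eye + verbal + motor
```

The contrast is the point: equation-based tasks test arithmetic over extracted values, while rule-based tasks test mapping clinical findings to the correct component score before summing.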

Taxonomy

Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education · Topic Modeling