MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

Artus Krohn-Grimberghe

arXiv:2603.02222 · cs.LG · March 4, 2026

Open Access

TL;DR

This paper audits the MedCalc-Bench benchmark, identifies and fixes over 20 errors in its calculator implementations, shows that open-book prompting raises accuracy from ~52% to 81-85% without fine-tuning, and argues that the benchmark mainly measures memorization rather than clinical reasoning.

Contribution

The paper systematically audits and corrects errors in MedCalc-Bench's calculator implementations, shows that open-book prompting surpasses all published results, including RL-trained systems, and advocates reframing the benchmark as a tool-use evaluation.

Findings

01. Over 20 errors identified and fixed in the benchmark

02. Open-book prompting raises accuracy to 81-85%

03. GPT-5.2 achieves 95-97% accuracy, limited by dataset issues

Abstract

MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach, RL with verifiable rewards, reaching 74%. We present three contributions that challenge the benchmark's current framing. First, we conduct a systematic audit of the benchmark's calculator implementations, identifying and fixing over 20 errors ranging from critical formula inaccuracies to runtime bugs in a NeurIPS-published dataset. Second, we show that a simple intervention, providing the model with the calculator specification at inference time ("open-book" prompting), raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we…
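The difference between direct and open-book prompting described in the abstract can be sketched as a change in how the prompt is assembled. The following is a minimal illustration only, not the paper's actual prompt format: the `CalculatorSpec` type, both builder functions, the prompt wording, and the example patient note are all hypothetical, and BMI is used simply because its formula is well known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalculatorSpec:
    """Hypothetical container for a calculator specification."""
    name: str
    formula: str
    notes: str = ""


def build_closed_book_prompt(patient_note: str, question: str) -> str:
    """Direct prompting: the model must recall the formula on its own."""
    return (
        f"Patient note:\n{patient_note}\n\n"
        f"Task: {question}\n"
        "Answer with a single number."
    )


def build_open_book_prompt(
    patient_note: str, question: str, spec: CalculatorSpec
) -> str:
    """Open-book prompting: prepend the calculator specification."""
    spec_block = f"Calculator: {spec.name}\nFormula: {spec.formula}"
    if spec.notes:
        spec_block += f"\nNotes: {spec.notes}"
    closed = build_closed_book_prompt(patient_note, question)
    return f"Calculator specification:\n{spec_block}\n\n{closed}"


# Illustrative inputs (assumed, not taken from the benchmark).
bmi_spec = CalculatorSpec(
    name="Body Mass Index (BMI)",
    formula="weight_kg / (height_m ** 2)",
    notes="Convert height from cm to m before applying the formula.",
)
note = "A 54-year-old man, weight 82 kg, height 178 cm."
question = "What is the patient's BMI in kg/m^2?"

closed_prompt = build_closed_book_prompt(note, question)
open_prompt = build_open_book_prompt(note, question, bmi_spec)
```

In this sketch the only difference between the two prompts is the prepended specification block, which is the property that makes the comparison a test of recall versus application: the patient note and the question are held fixed.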

Taxonomy

Topics: Topic Modeling · Electronic Health Records Systems · Machine Learning in Healthcare