MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
Artus Krohn-Grimberghe

TL;DR
This paper audits the MedCalc-Bench benchmark, identifies and fixes over 20 errors in its calculator implementations, shows that open-book prompting raises accuracy from ~52% to 81-85% without fine-tuning, and argues that the benchmark mainly measures memorization rather than clinical reasoning.
Contribution
The paper systematically audits and corrects errors in MedCalc-Bench, shows that open-book prompting surpasses all published results, including RL-trained systems, and advocates reframing the benchmark as a tool-use evaluation.
Findings
Over 20 errors identified and fixed in the benchmark
Open-book prompting raises accuracy to 81-85%
GPT-5.2 achieves 95-97% accuracy, limited by dataset issues
Abstract
MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach, RL with verifiable rewards, reaching 74%. We present three contributions that challenge the benchmark's current framing. First, we conduct a systematic audit of the benchmark's calculator implementations, identifying and fixing over 20 errors ranging from critical formula inaccuracies to runtime bugs in a NeurIPS-published dataset. Second, we show that a simple intervention, providing the model with the calculator specification at inference time ("open-book" prompting), raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we…
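To make the open-book idea concrete, the following is a minimal sketch of how a prompt might differ between the closed-book and open-book settings. It is an illustration under stated assumptions, not the paper's actual prompts: the names `CalculatorSpec` and `build_prompt`, the prompt wording, and the BMI example are all hypothetical and are not drawn from MedCalc-Bench.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CalculatorSpec:
    """Hypothetical container for a calculator's written specification."""
    name: str
    formula: str
    units: str


def build_prompt(note: str, question: str,
                 spec: Optional[CalculatorSpec] = None) -> str:
    """Assemble a prompt; include the calculator spec only when one is given.

    With spec=None this is a closed-book prompt (the model must recall the
    formula). With a spec it is an open-book prompt (the formula is supplied).
    """
    parts = [f"Patient note:\n{note}", f"Question:\n{question}"]
    if spec is not None:
        parts.append(
            "Calculator specification:\n"
            f"Name: {spec.name}\n"
            f"Formula: {spec.formula}\n"
            f"Units: {spec.units}"
        )
    parts.append("Answer with a single number.")
    return "\n\n".join(parts)


# Illustrative inputs (made up for this sketch, not benchmark data).
note = "Adult patient, weight 70 kg, height 1.75 m."
question = "What is the patient's body mass index?"
bmi_spec = CalculatorSpec(
    name="Body Mass Index",
    formula="BMI = weight_kg / (height_m ** 2)",
    units="kg/m^2",
)

closed_prompt = build_prompt(note, question)
open_prompt = build_prompt(note, question, spec=bmi_spec)
```

The only difference between the two prompts is the specification section, which is what lets a comparison of the two settings separate formula recall from the extraction and arithmetic steps.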
Taxonomy
Topics: Topic Modeling · Electronic Health Records Systems · Machine Learning in Healthcare
