MedMCP-Calc: Benchmarking LLMs for Realistic Medical Calculator Scenarios via MCP Integration
Yakun Zhu, Yutong Huang, Shengqian Qin, Zhongzhen Huang, Shaoting Zhang, Xiaofan Zhang

TL;DR
MedMCP-Calc is a benchmark for evaluating large language models in realistic medical calculator scenarios. It emphasizes multi-step, context-aware tasks that involve external data and tools, and it reveals significant performance gaps that point to where models need to improve.
Contribution
This work presents the first benchmark for LLMs in complex medical calculator scenarios. It integrates the Model Context Protocol (MCP) to support realistic, multi-stage clinical tasks, and it introduces CalcMate, a model fine-tuned for these tasks.
Findings
Top models struggle with fuzzy queries and calculator selection.
Performance varies significantly across clinical domains.
Fine-tuned CalcMate achieves state-of-the-art results among open-source models.
Abstract
Medical calculators are fundamental to quantitative, evidence-based clinical practice. However, their real-world use is an adaptive, multi-stage process, requiring proactive EHR data acquisition, scenario-dependent calculator selection, and multi-step computation, whereas current benchmarks focus only on static single-step calculations with explicit instructions. To address these limitations, we introduce MedMCP-Calc, the first benchmark for evaluating LLMs in realistic medical calculator scenarios through Model Context Protocol (MCP) integration. MedMCP-Calc comprises 118 scenario tasks across 4 clinical domains, featuring fuzzy task descriptions mimicking natural queries, structured EHR database interaction, external reference retrieval, and process-level evaluation. Our evaluation of 23 leading models reveals critical limitations: even top performers like Claude Opus 4.5 exhibit…
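The three stages named in the abstract (EHR data acquisition, scenario-dependent calculator selection, and computation) can be illustrated with a toy sketch. This is not the benchmark's implementation: the in-memory SQLite table, its schema, the keyword-based selector, and every function name below are hypothetical stand-ins, and body mass index is used only because its formula is simple and well known.

```python
import sqlite3

def build_mock_ehr():
    """Create a tiny in-memory table standing in for a structured EHR (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vitals (patient_id TEXT, name TEXT, value REAL, unit TEXT)")
    conn.executemany(
        "INSERT INTO vitals VALUES (?, ?, ?, ?)",
        [("p1", "weight", 70.0, "kg"), ("p1", "height", 175.0, "cm")],
    )
    return conn

def fetch_vitals(conn, patient_id):
    """Stage 1: pull the measurements the calculator needs instead of expecting them in the prompt."""
    rows = conn.execute(
        "SELECT name, value FROM vitals WHERE patient_id = ?", (patient_id,)
    ).fetchall()
    return dict(rows)

def bmi(vitals):
    """Body mass index: weight in kg divided by the square of height in metres."""
    height_m = vitals["height"] / 100.0
    return vitals["weight"] / height_m ** 2

# Toy registry: each calculator is paired with keywords that might appear in a fuzzy query.
CALCULATORS = {"bmi": (("weight", "body mass", "obesity"), bmi)}

def select_calculator(query):
    """Stage 2: map a loosely worded query to a calculator (naive keyword matching)."""
    q = query.lower()
    for name, (keywords, fn) in CALCULATORS.items():
        if any(k in q for k in keywords):
            return name, fn
    raise LookupError("no calculator matched the query")

def run_task(conn, patient_id, query):
    """Stage 3: run the selected calculator on the retrieved data."""
    name, fn = select_calculator(query)
    vitals = fetch_vitals(conn, patient_id)
    return name, round(fn(vitals), 1)

if __name__ == "__main__":
    conn = build_mock_ehr()
    print(run_task(conn, "p1", "Is this patient's weight in a healthy range?"))
```

In an MCP setting, a model would reach each stage through tool calls rather than direct function calls. The sketch only shows why a query that names neither the calculator nor the input values is harder than a single-step calculation with explicit instructions.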
Taxonomy
Topics: Electronic Health Records Systems · Machine Learning in Healthcare · Biomedical Text Mining and Ontologies
