How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

Minh-Vuong Nguyen; Fatemeh Shiri; Zhuang Li; Karin Verspoor

arXiv:2604.11133 · cs.CL · April 16, 2026

TL;DR

This paper introduces ClinicNumRobBench, a comprehensive benchmark for evaluating the robustness of large language models in clinical numerical reasoning across diverse note formats and question types.

Contribution

It provides a new benchmark with 1,624 instances to assess LLMs on clinical numeracy, highlighting strengths and weaknesses in value retrieval, comparison, and aggregation tasks.

Findings

01. Value retrieval accuracy exceeds 85% for most models.

02. Relational comparison and aggregation are significantly more challenging than value retrieval.

03. Fine-tuning can decrease numeracy performance and increase sensitivity to note format.

Abstract

Large Language Models (LLMs) are increasingly being explored for clinical question answering and decision support, yet safe deployment critically requires reliable handling of patient measurements in heterogeneous clinical notes. Existing evaluations of LLMs for clinical numerical reasoning provide limited operation-level coverage, restricted primarily to arithmetic computation, and rarely assess the robustness of numerical understanding across clinical note formats. We introduce ClinicNumRobBench, a benchmark of 1,624 context-question instances with ground-truth answers that evaluates four main types of clinical numeracy: value retrieval, arithmetic computation, relational comparison, and aggregation. To stress-test robustness, ClinicNumRobBench presents longitudinal MIMIC-IV vital-sign records in three semantically equivalent representations, including a real-world note-style variant…
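
The four numeracy types named in the abstract can be illustrated with a minimal sketch over a toy vital-sign record. Everything here is an assumption made for illustration: the record values, the field names (`heart_rate`, `systolic_bp`), and the helper functions are hypothetical and are not taken from ClinicNumRobBench or MIMIC-IV. The benchmark's actual question templates and answer formats may differ.

```python
from statistics import mean

# Hypothetical longitudinal vital-sign record (illustrative values only,
# not drawn from MIMIC-IV or from the benchmark itself).
record = [
    {"time": "08:00", "heart_rate": 82, "systolic_bp": 118},
    {"time": "12:00", "heart_rate": 95, "systolic_bp": 124},
    {"time": "16:00", "heart_rate": 88, "systolic_bp": 131},
]


def retrieve(record, vital, time):
    """Value retrieval: look up one measurement at a given time."""
    for row in record:
        if row["time"] == time:
            return row[vital]
    raise KeyError(f"no measurement recorded at {time}")


def change(record, vital, t_start, t_end):
    """Arithmetic computation: difference between two measurements."""
    return retrieve(record, vital, t_end) - retrieve(record, vital, t_start)


def is_higher(record, vital, t_a, t_b):
    """Relational comparison: is the value at t_a greater than at t_b?"""
    return retrieve(record, vital, t_a) > retrieve(record, vital, t_b)


def average(record, vital):
    """Aggregation: mean of one vital sign over all recorded times."""
    return mean(row[vital] for row in record)
```

Under these assumptions, a ground-truth answer for each question type is computed deterministically from the structured record, so a model's answer can be checked against it regardless of how the same record is rendered as text.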

Code & Models

Repository: MinhVuong2000/ClinicNumRobBench (GitHub)
