Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models

Haoyang Li; Xuejia Chen; Zhanchao XU; Darian Li; Nicole Hu; Fei Teng; Yiming Li; Luyu Qiu; Chen Jason Zhang; Qing Li; Lei Chen

arXiv:2502.11075 · cs.CL · June 4, 2025

TL;DR

This paper introduces NumericBench, a new benchmark designed to evaluate and expose the weaknesses of large language models in fundamental numerical reasoning tasks, highlighting the need for models to better understand and manipulate numbers.

Contribution

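The paper presents NumericBench, a comprehensive benchmark covering six core numerical abilities (number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning), addressing gaps in how existing benchmarks evaluate LLMs' numerical reasoning.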

Findings

01 State-of-the-art LLMs show significant weaknesses in numerical reasoning.

02 NumericBench reveals persistent numerical understanding gaps in models like GPT-4.

03 Benchmark datasets include synthetic and real-world data, challenging models with complex tasks.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning. NumericBench includes…
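The distinction the abstract draws between surface-level patterns and numbers as continuous magnitudes can be illustrated with a small sketch. It is not taken from NumericBench; the function names `compare_as_text` and `compare_as_magnitude` and the sample pairs are hypothetical, chosen only to show how treating numerals as character sequences can disagree with treating them as values.

```python
from decimal import Decimal

def compare_as_text(a: str, b: str) -> str:
    """Pick the 'larger' numeral by character-by-character string order.

    This stands in for a purely surface-level treatment of numbers.
    """
    return a if a > b else b

def compare_as_magnitude(a: str, b: str) -> str:
    """Pick the larger numeral by its actual numeric value."""
    return a if Decimal(a) > Decimal(b) else b

# Hypothetical sample pairs, not drawn from the benchmark.
pairs = [("100", "99"), ("10.5", "9.75"), ("-3", "-5"), ("0.5", "0.25")]

for a, b in pairs:
    by_text = compare_as_text(a, b)
    by_value = compare_as_magnitude(a, b)
    status = "agree" if by_text == by_value else "DISAGREE"
    print(f"{a} vs {b}: text={by_text}, magnitude={by_value} ({status})")
```

For the pair ("100", "99"), the text comparison picks "99" because the character "9" sorts after "1", while the magnitude comparison picks "100". The two approaches agree on ("0.5", "0.25"), which shows that surface-level shortcuts can look correct on some inputs and fail on others.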

Code & Models

Repositories

treeai-lab/numericbench · Official

Datasets

TreeAILab/NumericBench · dataset · 189 downloads

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Mathematics Education and Teaching Techniques · Educational Tools and Methods

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax