Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models

Zhishen Sun; Guang Dai; Ivor Tsang; Haishan Ye

arXiv:2511.08022·cs.AI·November 12, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates the reasoning robustness of large language models by introducing perturbations, revealing their sensitivity to numerical information and reliance on pattern matching over logical reasoning.

Contribution

The study develops a novel perturbation framework to evaluate LLM reasoning, highlighting their vulnerabilities and limitations in complex environments.

Findings

01

Models are more sensitive to perturbation sentences that contain numbers than to those that do not.

02

Performance drops significantly with increased perturbation intensity.

03

LLMs often rely on pattern matching rather than true reasoning.

Abstract

LLMs have made significant progress in the field of mathematical reasoning, but whether they have true mathematical understanding is still controversial. To explore this issue, we propose a new perturbation framework to evaluate LLMs' reasoning ability in complex environments by injecting additional semantically irrelevant perturbation sentences and gradually increasing the perturbation intensity. At the same time, we use an additional perturbation method, removal of the core questioning instruction, to further analyze the LLMs' problem-solving mechanism. The experimental results show that LLMs perform stably when facing perturbation sentences without numbers, but there is also a robustness boundary. As the perturbation intensity increases, the performance exhibits varying degrees of decline; when facing perturbation sentences with numbers, the performance decreases more…
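The two perturbations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the regex-based sentence splitter, the random insertion positions, and the assumption that the core question is the final sentence ending in "?" are all hypothetical choices made for illustration. Perturbation intensity is modelled simply as the number of inserted sentences `k`.

```python
import random
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ?, or ! followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def inject_distractors(problem: str, distractors: list[str], k: int, seed: int = 0) -> str:
    """Insert k irrelevant sentences before the final sentence of the problem.

    Assumes the final sentence is the question and leaves it in last position.
    """
    if k > len(distractors):
        raise ValueError("k exceeds the number of available distractor sentences")
    rng = random.Random(seed)
    sentences = split_sentences(problem)
    body, question = sentences[:-1], sentences[-1]
    for sentence in rng.sample(distractors, k):
        # Pick any slot in the body, including the very start or very end.
        body.insert(rng.randint(0, len(body)), sentence)
    return " ".join(body + [question])


def drop_core_question(problem: str) -> str:
    """Remove the final sentence if it is a question (hypothetical heuristic)."""
    sentences = split_sentences(problem)
    if sentences and sentences[-1].endswith("?"):
        sentences = sentences[:-1]
    return " ".join(sentences)


if __name__ == "__main__":
    problem = "Tom has 3 apples. He buys 2 more. How many apples does Tom have?"
    # Hand-written placeholder distractors: two without numbers, one with a number.
    distractors = [
        "The sky was overcast that morning.",
        "A nearby shop sold 17 umbrellas.",
        "The river runs north.",
    ]
    for k in (1, 2, 3):  # increasing perturbation intensity
        print(k, "->", inject_distractors(problem, distractors, k))
    print("no question ->", drop_core_question(problem))
```

Separating distractors with and without numbers into two pools and sweeping `k` for each would give the kind of comparison the abstract reports, under the assumptions stated above.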

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The paper systematically explores multiple perturbation types (numerical vs. non-numerical) and intensities, scaling from single-sentence insertions to twice the original problem length. The inclusion of both GSM8K and AIME provides a well-rounded evaluation across difficulty levels.
2. Wide model coverage – The study evaluates a broad spectrum of models, including open-source LLMs (Qwen, DeepSeek, LLaMA, Gemma) and proprietary reasoning models, offering valuable cross-model insights into rob

Weaknesses

1. The core finding, that models are brittle to irrelevant or noisy context, has been partially demonstrated in prior works such as GSM-Plus and MathCheck. The paper could benefit from a deeper discussion on what unique insight it contributes beyond confirming existing robustness issues, or how its new findings could be applied.
2. Several promising analyses are missing: (1) The perturbation length is relatively small. Scaling the irrelevant text by 10× or 100× could better approximate real-world

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is well written and easy to follow.
2. The experiments are straightforward and cover a variety of LLMs.
3. The core question removal is an interesting discovery.

Weaknesses

1. While the authors claim to propose a novel perturbation method, similar approaches have been explored in the literature. For example, GSM-IC [1] (which you referenced in line 80) found that LLMs can be distracted by irrelevant context, which directly contradicts your statement *"...still remain open that whether semantically irrelevant perturbations can affect the problem solution"*. Coleg [2] also investigated extending question context length with (mostly) irrelevant details, and proposed a

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- Extensive evaluation across 13 models, two benchmarks, and multiple perturbation levels. Results are reproducible and statistically clear.
- Highlights a critical limitation, numerical distractibility, that impacts real-world deployment (e.g., in finance or science). The “memorization over reasoning” finding aligns with broader concerns in the field.

Weaknesses

- Similar “distractor sentence” or “context perturbation” methodologies already exist (Shi et al. 2023; Mirzadeh et al. 2024; Huang et al. 2025). The work does not propose a fundamentally new paradigm, only a modest variation.
- The paper claims to “guarantee that all factual statements are semantically irrelevant to the training samples” (lines 203–206). However, this assumption is neither verifiable nor logically guaranteed under the proposed generation procedure. The authors use GPT-4 to generate

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Cognitive and developmental aspects of mathematical skills · Mathematics Education and Teaching Techniques