An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

Yuren Hao; Xiang Wan; ChengXiang Zhai

arXiv:2508.08833·cs.CL·December 5, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a new benchmark and evaluation framework to assess the robustness of large language models in mathematical reasoning by testing their sensitivity to mathematically-equivalent linguistic and parametric variations.

Contribution

The paper presents PutnamGAP, a novel benchmark dataset, and a systematic evaluation methodology to measure LLMs' robustness in mathematical reasoning beyond traditional accuracy metrics.

Findings

01

Models show significant performance drops on variants.

02

OpenAI's O3 model, which scores 51.5% on the originals, drops 4.7% on surface variants and 12.9% on parametric variants.

03

Smaller models perform substantially worse overall.

Abstract

In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals…
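As a rough illustration of the idea in the abstract, the sketch below builds a surface-level variant of a problem by renaming a variable, then measures the accuracy drop between originals and variants. It is not the paper's pipeline: the function names `rename_variables` and `accuracy_drop`, the toy problem, and the toy correctness lists are all hypothetical, and treating the reported drop as a difference in accuracy (in percentage points) is an assumption.

```python
import re


def rename_variables(problem: str, mapping: dict[str, str]) -> str:
    """Rename whole-word identifiers in a single pass.

    A single pass keeps swaps such as {"x": "y", "y": "x"} safe.
    Assumption: identifiers are delimited by word boundaries, so a
    coefficient written as "5n" would NOT be renamed by this sketch.
    """
    if not mapping:
        return problem
    names = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + names + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], problem)


def accuracy(results: list[bool]) -> float:
    """Fraction of problems answered correctly."""
    return sum(results) / len(results)


def accuracy_drop(original: list[bool], variant: list[bool]) -> float:
    """Accuracy on originals minus accuracy on variants, in percentage points."""
    return 100.0 * (accuracy(original) - accuracy(variant))


# Hypothetical toy problem: the mathematics is unchanged by the renaming.
original_problem = "Let n be a positive integer. Prove that n^2 + n is even."
surface_variant = rename_variables(original_problem, {"n": "k"})

# Hypothetical per-problem correctness for one model on four problems.
original_results = [True, True, False, True]
variant_results = [True, False, False, True]
drop = accuracy_drop(original_results, variant_results)
```

A parametric variant would instead change constants or problem parameters while preserving the required reasoning; that step needs mathematical verification and is not attempted here.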

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The benchmark seems extensive, covering diverse categories and building on high-level mathematical problems. The important contribution seems to be the evaluation metric (some concerns below).
2. The evaluation is extensive. Each step seems to be supported by many experiments.
3. The analysis gives clear insights. The point about curriculum learning is important, but I did not find much detail in the main paper.

Weaknesses

The main motivation is well known. Other work has tried this in different ways: GSM8k_MORE, GSM-symbolic. Even PUTNAM-AXIOM does this, but not in a scalable way. The GAP framework seems innovative, though it depends heavily on LLMs for every step. I am unsure how errors in generation are handled. Many important things are in the Appendix, which makes the main contributions hard to follow, such as the robustness metric details and motivation, curriculum learning training, etc.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* **Originality:** The paper's originality is high. While robustness testing is not new, the GAP framework's focus on **mathematical equivalence** is a crucial distinction from prior work on contrast sets or perturbations that change the problem's substance. The specific methodology, distinguishing between surface-level ($\mathcal{T}_{surf}$) and deep-structural ($\mathcal{T}_{para}$) perturbations, provides a novel and insightful way to disentangle different reasoning failures.
* **Quality:**

Weaknesses

The paper's primary weakness is that it is more descriptive than diagnostic. It excels at *identifying* and *quantifying* the robustness failure but offers limited insight into *why* it occurs or how to fix it.

* **Analysis is Descriptive, Not Diagnostic:** The central finding, that LLM performance drops on perturbed inputs, is well-proven but not entirely surprising. The paper stops short of a deep analysis of these failures.
  * The error taxonomy (Section 5.3) is a good start, but it's

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. Robustness in mathematical reasoning is an increasingly important and underexplored direction. The paper tackles this with a clear motivation and a well-defined experimental setup.
2. The authors introduce five transformation types (four surface-level renamings and one parametric rewrite), providing a systematic way to probe reasoning robustness.
3. The evaluation spans 18 models and demonstrates consistent degradation under mathematically equivalent perturbations, validating the effectiv

Weaknesses

1. The experimental analysis is relatively limited and could be enriched by additional studies. (1) It would be useful to include math-specialized models in the evaluation to see how training objectives or dataset composition influence robustness, and to provide insights into how robustness might be improved. (2) The paper could explore whether specific prompting strategies (e.g., instructing models like O1 to pay attention to variable names or perform meta-reasoning) could help defend against t

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Mathematics Education and Teaching Techniques · Teaching and Learning Programming