An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems
Yuren Hao, Xiang Wan, ChengXiang Zhai

TL;DR
This paper introduces a new benchmark and evaluation framework to assess the robustness of large language models in mathematical reasoning by testing their sensitivity to mathematically-equivalent linguistic and parametric variations.
Contribution
The paper presents PutnamGAP, a novel benchmark dataset, and a systematic evaluation methodology to measure LLMs' robustness in mathematical reasoning beyond traditional accuracy metrics.
Findings
Across the 18 commercial and open-source models evaluated, accuracy drops sharply on the variants.
OpenAI's O3 model scores 51.5% on the originals, then drops 4.7% on surface variants and 12.9% on parametric variants.
Smaller models perform substantially worse overall.
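To make the reported numbers concrete, the short sketch below works out what the O3 figures imply for accuracy on each variant family. It assumes the "4.7%" and "12.9%" drops are absolute percentage points subtracted from the 51.5% original score quoted in the abstract; the summary does not say this explicitly. The function and variable names are illustrative and are not taken from the paper.

```python
# Assumption: the reported drops are absolute percentage points,
# measured against the 51.5% accuracy on the original problems.
O3_ORIGINAL_PCT = 51.5
REPORTED_DROPS_PP = {"surface": 4.7, "parametric": 12.9}


def variant_accuracy(original_pct, drop_pp):
    """Accuracy on the variant set implied by an absolute drop."""
    return round(original_pct - drop_pp, 1)


def relative_drop(original_pct, drop_pp):
    """The same drop expressed as a percentage of the original accuracy."""
    return round(100.0 * drop_pp / original_pct, 1)


results = {
    family: {
        "variant_accuracy": variant_accuracy(O3_ORIGINAL_PCT, drop),
        "relative_drop": relative_drop(O3_ORIGINAL_PCT, drop),
    }
    for family, drop in REPORTED_DROPS_PP.items()
}

for family, stats in results.items():
    print(family, stats)
```

Under that reading, the parametric drop removes roughly a quarter of the model's original accuracy, which is why the relative figure is worth reporting alongside the absolute one.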
Abstract
In this paper, we introduce a systematic framework that goes beyond conventional methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 51.5% on the originals…
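As a toy illustration of what a mathematically-equivalent surface variation can look like, the sketch below renames a variable in a problem statement without changing its mathematical content. It is not the authors' pipeline: it swaps in a plain regex substitution as a simple stand-in, and the function `rename_variables` and the sample problem are hypothetical.

```python
import re


def rename_variables(problem, mapping):
    """Rename standalone variable symbols in one pass.

    A symbol is replaced only when it is not adjacent to another letter,
    so the 'x' in '5x' is renamed but letters inside words are left alone.
    Substituting in a single pass keeps swaps such as x<->y from chaining.
    """
    # Longest names first so a longer symbol is preferred over its prefix.
    names = sorted(mapping, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(r"(?<![A-Za-z])(" + alternatives + r")(?![A-Za-z])")
    return pattern.sub(lambda match: mapping[match.group(1)], problem)


# Hypothetical sample problem, not taken from the benchmark.
original = "Find all real x such that x^2 - 5x + 6 = 0."
variant = rename_variables(original, {"x": "t"})
print(variant)
```

The solution set is unchanged by the renaming, so a model that solves the original but fails the variant is reacting to surface form rather than to the mathematics. A regex of this kind is fragile on real problem text (for example, the article "a" would collide with a variable named `a`), so a real pipeline would need a more careful approach.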
Peer Reviews
Decision · Submitted to ICLR 2026
1. The benchmark seems extensive, covering diverse categories, and is based on high-level mathematical problems. The important contribution seems to be the evaluation metric (some concerns below). 2. Evaluation is extensive. Each step seems to be supported by many experiments. 3. The analysis gives clear insights. The point about curriculum learning is important, but I did not find many details in the main paper.
The main motivation is well known. Other work has tried this in different ways: GSM8k_MORE, GSM-symbolic. Even PUTNAM-AXIOM does this, but not in a scalable way. The GAP framework seems innovative, though it depends heavily on LLMs for every step. I am unsure how errors in generation are handled. Many important things are in the Appendix, which makes the main contributions hard to follow, such as the robustness metric details and motivation, curriculum learning training, etc.
* **Originality:** The paper's originality is high. While robustness testing is not new, the GAP framework's focus on **mathematical equivalence** is a crucial distinction from prior work on contrast sets or perturbations that change the problem's substance. The specific methodology, distinguishing between surface-level ($\mathcal{T}_{surf}$) and deep-structural ($\mathcal{T}_{para}$) perturbations, provides a novel and insightful way to disentangle different reasoning failures. * **Quality:**
The paper's primary weakness is that it is more descriptive than diagnostic. It excels at *identifying* and *quantifying* the robustness failure but offers limited insight into *why* it occurs or how to fix it. * **Analysis is Descriptive, Not Diagnostic:** The central finding—that LLM performance drops on perturbed inputs—is, while well-proven, not entirely surprising. The paper stops short of a deep analysis of these failures. * The error taxonomy (Section 5.3) is a good start, but it's
1. Robustness in mathematical reasoning is an increasingly important and underexplored direction. The paper tackles this with a clear motivation and a well-defined experimental setup. 2. The authors introduce five transformation types (four surface-level renamings and one parametric rewrite), providing a systematic way to probe reasoning robustness. 3. The evaluation spans 18 models and demonstrates consistent degradation under mathematically equivalent perturbations, validating the effectiv
1. The experimental analysis is relatively limited and could be enriched by additional studies. (1) It would be useful to include math-specialized models in the evaluation to see how training objectives or dataset composition influence robustness, and to provide insights into how robustness might be improved. (2) The paper could explore whether specific prompting strategies (e.g., instructing models like O1 to pay attention to variable names or perform meta-reasoning) could help defend against t
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Mathematics Education and Teaching Techniques · Teaching and Learning Programming
