GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

TL;DR
This paper introduces GSM-Symbolic, a new benchmark for evaluating mathematical reasoning in large language models, revealing their fragility and limited genuine reasoning abilities through controlled, diverse question generation.
Contribution
The paper presents GSM-Symbolic, an improved, controllable benchmark for assessing LLMs' mathematical reasoning, exposing their performance variability and reasoning fragility.
Findings
LLM accuracy drops when only the numerical values in a question are changed.
Adding irrelevant clauses significantly reduces model accuracy.
Models struggle with genuine logical reasoning, relying on training data patterns.
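To make the second finding concrete, here is a minimal sketch of what adding an irrelevant clause might look like. The helper `add_noop_clause`, the sample question, and the clause are all hypothetical illustrations, not the authors' actual tooling or data. The point is only that the inserted sentence leaves the correct answer unchanged.

```python
def add_noop_clause(question: str, clause: str) -> str:
    """Insert an irrelevant clause just before the final sentence of a question.

    Hypothetical helper: it assumes sentences are separated by ". " and that
    the last sentence is the one that asks for the answer.
    """
    body, sep, final = question.rstrip().rpartition(". ")
    if not sep:
        # Single-sentence question: put the clause in front of it.
        return f"{clause} {question}"
    return f"{body}. {clause} {final}"


# Illustrative question and clause (not taken from the benchmark).
original = (
    "Liam picks 10 apples on Monday and 7 apples on Tuesday. "
    "How many apples does Liam pick in total?"
)
noop = add_noop_clause(original, "Five of the apples are a bit smaller than average.")

# The correct answer is 17 for both versions; only the surface text differs.
print(noop)
```

A model that reasons about the quantities should give the same answer for `original` and `noop`; a model that pattern-matches may try to subtract the five smaller apples.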
Abstract
Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for…
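The idea of generating questions from symbolic templates can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the template text, the name list, the variable ranges, and the function names `sample_instance` and `generate_variants` are all hypothetical.

```python
import random

# Hypothetical template: placeholders stand in for names and numeric values.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Ava", "Noah"]  # illustrative name pool


def sample_instance(rng: random.Random) -> tuple[str, int]:
    """Draw one (question, answer) pair from the template."""
    while True:
        # Assumed value range; resample until the constraint below holds.
        x, y, z = (rng.randint(5, 100) for _ in range(3))
        if z <= x + y:  # condition: the answer must not be negative
            break
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    return question, x + y - z


def generate_variants(n: int, seed: int = 0) -> list[tuple[str, int]]:
    """Generate n question variants reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [sample_instance(rng) for _ in range(n)]


for question, answer in generate_variants(3):
    print(question, "->", answer)
```

Because every variant shares the same reasoning steps and differs only in names and numbers, accuracy can be compared across many generated sets rather than on a single fixed test set.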
Peer Reviews
Decision·ICLR 2025 Poster
Paper tests a very good question. The state of LLMs is very strange right now. They can clearly solve very tricky problems. At the same time, they often fail in very elementary ways as well. This paper is an exceptional analysis of this, especially testing across several different variants of GSM-Symbolic. In general, I think this paper is a fantastic contribution to the field and as a result vote accept.
Some weaknesses are referenced in the questions section. One additional question: would the authors be able to include statistical significance results in the Appendix or main results? I think this would significantly improve the paper. Secondly, while the GSM-NoOp experiments are very interesting, I think there is a large difference between the claim (also made by prior work) that LLMs are bad at handling irrelevant context and the claim that they do not perform reasoning. Related Work: I think Srivastava+ 2024
**Strengths:**
1. **Comprehensive Experimental Design**: The authors explore various perturbations in mathematical questions such as changing names or numbers, adjusting the number of clauses, and introducing irrelevant information. These detailed experiments provide robust evidence of the fragility exhibited by prominent language models under slight modifications, thereby contributing significantly to our understanding of their limitations.
2. **Development of the GSM-Symbolic Dataset**: The
**Weaknesses:**
1. **Oversight of Computational Complexity**: The paper does not sufficiently address the computational challenges posed by the range of parameter values (5 to 100) used in their experiments. This range suggests that computing expressions like x + y + z, particularly with all parameters as two-digit numbers, could introduce significant computational errors unrelated to the models' reasoning capabilities. The potential impact of these computational difficulties on the results, pa
+ GSM-Symbolic provides a strategy to enrich math reasoning datasets. The symbolic template can be applied to other datasets.
+ The NoOp design is innovative; it shows a previously undiscussed weakness of current LLMs on reasoning tasks.
+ The experiment design is adequate and valid. This paper evaluates 25 models, and the results provide a comprehensive overview of the current SOTA methods.
- Figure 3 shows that chatgpt-o1 and GPT-4o have a tiny accuracy drop. But even without GSM-Symbolic, running any LLM 50 times on the same dataset can cause this variance.
- For numerical variables, how did the authors choose their boundaries? This is not explained in the paper.
- When a clause is added in GSM-NoOp, results drop a lot in Figure 8, but what is the standard for adding this extra clause? Adding a clause can be subjective if the authors already know some patterns will make the m
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
