Not All LLM Reasoners Are Created Equal
Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal

TL;DR
This paper investigates the reasoning capabilities of large language models on grade-school math problems, revealing significant gaps in compositional reasoning that vary across models and tuning methods.
Contribution
It introduces Compositional GSM, an evaluation that chains pairs of GSM8K test questions to assess multi-step reasoning in LLMs, and highlights systematic differences in their reasoning abilities.
Findings
Smaller and math-specialized models show larger reasoning gaps.
Instruction tuning and code generation have inconsistent effects across models.
Finetuning on GSM can cause overfitting and reduce reasoning flexibility.
Abstract
We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their…
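The setup described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions not stated in this excerpt: the helper names `chain_questions` and `reasoning_gap` are hypothetical, the prompt wording and the convention of replacing a number in the second question with the symbol X are illustrative, and the baseline for the gap is taken to be the product of the two standalone accuracies. The toy questions and scores are made up and are not drawn from GSM8K.

```python
from statistics import mean


def chain_questions(q1: str, q2_template: str) -> str:
    """Join two word problems so that Q2 refers to Q1's answer as X.

    Assumption: q2_template already has one of its numbers replaced
    by the symbol X (a hypothetical convention for this sketch).
    """
    return (
        "Let X be the answer to Q1.\n\n"
        f"Q1: {q1}\n\n"
        f"Q2: {q2_template}"
    )


def reasoning_gap(q1_correct, q2_correct, pair_correct) -> float:
    """Chained-pair accuracy minus the accuracy expected from independent solves.

    Assumption: the baseline is the product of the two standalone
    accuracies. A negative value means the model does worse on chained
    pairs than its standalone accuracy would suggest.
    """
    s1 = mean(q1_correct)        # accuracy on the first questions alone
    s2 = mean(q2_correct)        # accuracy on the second questions alone
    s_comp = mean(pair_correct)  # accuracy on the chained pairs
    return s_comp - s1 * s2


# Toy, made-up data: 1 = correct, 0 = incorrect.
prompt = chain_questions(
    "Ana has 3 bags with 4 apples each. How many apples does she have?",
    "A crate holds X apples. How many apples are in 5 crates?",
)
gap = reasoning_gap([1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 0, 0])
print(prompt)
print(f"reasoning gap: {gap:+.4f}")
```

Under these assumptions, a model that solves each question 75% of the time in isolation would be expected to solve about 56% of chained pairs, so a chained accuracy of 25% gives a clearly negative gap.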
Peer Reviews
Decision · Submitted to ICLR 2025
1. The proposed compositional GSM8K is a straightforward and effective method for testing the two-hop math reasoning abilities of LLMs.
2. The paper is easy to follow. The experiments are comprehensive, covering various types of models. The analysis explores various potential factors that could lead to weaker performance, such as model sizes, instruction tuning, fine-tuning, and using code as a format. It sufficiently supports their central claim regarding the deficiencies of LLMs in compositio
1. As noted by the authors in the related work section, there are already several benchmarks for assessing the robustness of LLMs in math reasoning. Although this paper includes extensive experiments, the conclusion that LLMs struggle with multi-hop reasoning is not particularly surprising to me.
2. The two-hop QA format appears to be a minimal approach for testing mathematical compositional reasoning abilities. I was hoping for a more intricately designed benchmark, similar to SCAN or CFQ.
1. Introduction of Compositional GSM, a new evaluation approach that chains two GSM8K test questions together, requiring models to correctly solve both in sequence
2. Comprehensive evaluation of various LLMs, including Gemini, Gemma2, LLAMA3, GPT, Phi, Qwen2.5, and Mistral families
3. Several important empirical findings:
   - Most models show a clear performance gap between standard GSM8K and compositional problems
   - Smaller, cost-efficient, and math-specialized models show larger reasoning gaps
- My biggest concern is the gap between the motivation of the study and the proposed evaluation method. The authors propose the Compositional GSM as a tool to evaluate LLM reasoning, but don't sufficiently validate that the task actually measures real compositional reasoning ability. While they show poor performance on the second question even when the first is solved correctly, they don't establish whether this definitively indicates a reasoning gap versus other potential issues like prompt sen
It also highlights multiple interesting findings, such as "small and cost efficient LLMs, which are broadly accessible and crucial for real-world applications (Wan et al., 2024), exhibit larger reasoning gaps".
This paper tries to cover too many aspects without studying each in a detailed and comprehensive manner. For example, it shows that cost-efficient LLMs perform poorly on compositional GSM8K, but does not probe the reasons behind this with more detailed error analysis or case studies. The study focuses on GSM8K versus two GSM8K questions chained together. Although GSM8K is a very important math reasoning benchmark, it fails to discuss other popular math reasoning datasets such as MATH whic
Taxonomy
Topics: Legal Systems and Judicial Processes · Dispute Resolution and Class Actions · Business Law and Ethics
