Can LLMs Solve Longer Math Word Problems Better?
Xin Xu, Tong Xiao, Zitong Chao, Zhenya Huang, Can Yang, Yang Wang

TL;DR
This paper investigates how well large language models can solve longer math word problems, introduces new datasets and metrics, and proposes methods to improve their reasoning capabilities with extended narratives.
Contribution
It introduces the E-GSM dataset, new evaluation metrics for context length generalizability, and tailored prompting and fine-tuning techniques to enhance LLM performance on lengthy MWPs.
Findings
Existing LLMs show limited ability to handle longer MWPs.
Proposed methods improve LLM performance on extended narratives.
Enhanced methods also generalize across other MWP benchmarks.
Abstract
Math Word Problems (MWPs) play a vital role in assessing the capabilities of Large Language Models (LLMs), yet current research primarily focuses on questions with concise contexts. The impact of longer contexts on mathematical reasoning remains under-explored. This study pioneers the investigation of Context Length Generalizability (CoLeG), which refers to the ability of LLMs to solve MWPs with extended narratives. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs featuring lengthy narratives, and propose two novel metrics to evaluate the efficacy and resilience of LLMs in tackling these problems. Our analysis of existing zero-shot prompting techniques with proprietary LLMs along with open-source LLMs reveals a general deficiency in CoLeG. To alleviate these issues, we propose tailored approaches for different categories of LLMs. For proprietary LLMs, we introduce a…
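The abstract mentions two metrics for efficacy and resilience but does not define them in this excerpt. As a rough illustration only, the sketch below assumes one plausible reading: an efficacy score equal to the fraction of seed problems answered correctly at every extension level, and a robustness score equal to accuracy on the longest variants relative to accuracy on the originals. The function name, the input layout, and both formulas are assumptions made for this example and may differ from the paper's definitions.

```python
from typing import Dict, List


def length_generalization_scores(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute two illustrative scores from per-problem correctness records.

    `results` maps a seed problem id to a list of booleans, one per
    extension level (index 0 = original question, last index = longest
    variant). This layout is hypothetical, chosen for the example.
    """
    if not results:
        raise ValueError("results must not be empty")

    levels = {len(record) for record in results.values()}
    if len(levels) != 1 or 0 in levels:
        raise ValueError("every problem needs the same non-zero number of levels")

    n = len(results)

    # Assumed efficacy: share of problems solved at *all* extension levels.
    efficacy = sum(all(record) for record in results.values()) / n

    # Accuracy on the original questions and on the longest variants.
    acc_original = sum(record[0] for record in results.values()) / n
    acc_longest = sum(record[-1] for record in results.values()) / n

    # Assumed robustness: how much of the original accuracy is retained.
    robustness = acc_longest / acc_original if acc_original > 0 else 0.0

    return {"efficacy": efficacy, "robustness": robustness}


if __name__ == "__main__":
    toy = {
        "q1": [True, True, True],
        "q2": [True, False, False],
        "q3": [True, True, False],
        "q4": [False, False, False],
    }
    print(length_generalization_scores(toy))
```

Under these assumptions, a model that solves the original questions but fails as the narrative grows would score low on both measures, which is the kind of deficiency the abstract reports.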
Peer Reviews
Decision · ICLR 2025 Poster
- This paper introduces E-GSM, a dataset with lengthy, distracting sentences that make it considerably more challenging than the original GSM. This dataset offers a valuable tool for evaluating the robustness of LLMs.
- The approach used to create E-GSM can also be applied to expand existing math training datasets, providing new supervised fine-tuning (SFT) data in the math domain.
- The augmented math questions may include contradicting sentences. The augmented math questions may become unsolvable or yield answers that differ from the original ones. Although human evaluations on 200 samples suggest that “94.5% of questions meet acceptable quality,” this accuracy may still be inadequate, particularly given that the labels in the GSM8K test set might contain errors. An alternative could be to release these 200 samples as a verified subset of the E-GSM dataset. Reporting C
The paper explored the impact of question length on LLMs’ performance and proposed a method to extend the length of GSM questions. It also presented a method called CoRe to help proprietary LLMs better handle these long-form questions. For open-source LLMs, the authors fine-tuned the models on a dataset of 65K CoT examples that they created.
1. The paper explores artificially lengthened math problems, but real questions are seldom written the way the authors present them, i.e., as very verbose text describing a relatively simple math problem. It is therefore unclear whether this work helps with real-world long math problems, where the question is long even though it is already stated as succinctly as possible. Better solving them is our ultimate goal, rather than solving the
- Strong motivation through rigorous statistical analysis shows LLMs struggle with longer MWPs (Section 2.1).
- Proposes creative solutions (CoRe prompting and extension fine-tuning) to address identified limitations.
- Well-designed metrics (CoLeG-E and CoLeG-R) that capture both efficacy and robustness of LLMs on long MWPs.
- Sufficient experiments have proven the effectiveness of the method.
The paper focuses on LLMs tackling longer math word problems, rather than genuinely difficult ones. Addressing truly challenging problems would likely yield more impactful and valuable research insights. A deeper analysis of the types of errors LLMs make on extended MWPs would strengthen the paper. This could shed light on whether mistakes stem from misinterpreting context, losing track of key information, or actual computational errors. The authors don't explore whether breaking down problems
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Open Education and E-Learning
