Can LLMs Solve Longer Math Word Problems Better?

Xin Xu; Tong Xiao; Zitong Chao; Zhenya Huang; Can Yang; Yang Wang

arXiv:2405.14804·cs.CL·February 27, 2025

TL;DR

This paper investigates how well large language models can solve longer math word problems, introduces new datasets and metrics, and proposes methods to improve their reasoning capabilities with extended narratives.

Contribution

It introduces the E-GSM dataset, new evaluation metrics for context length generalizability, and tailored prompting and fine-tuning techniques to enhance LLM performance on lengthy MWPs.

Findings

01

Existing LLMs show limited ability to handle longer MWPs.

02

Proposed methods improve LLM performance on extended narratives.

03

Enhanced methods also generalize across other MWP benchmarks.

Abstract

Math Word Problems (MWPs) play a vital role in assessing the capabilities of Large Language Models (LLMs), yet current research primarily focuses on questions with concise contexts. The impact of longer contexts on mathematical reasoning remains under-explored. This study pioneers the investigation of Context Length Generalizability (CoLeG), which refers to the ability of LLMs to solve MWPs with extended narratives. We introduce Extended Grade-School Math (E-GSM), a collection of MWPs featuring lengthy narratives, and propose two novel metrics to evaluate the efficacy and resilience of LLMs in tackling these problems. Our analysis of existing zero-shot prompting techniques with proprietary LLMs along with open-source LLMs reveals a general deficiency in CoLeG. To alleviate these issues, we propose tailored approaches for different categories of LLMs. For proprietary LLMs, we introduce a…
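The abstract names two metrics, one for efficacy and one for resilience, without defining them here. As a rough illustration only, the sketch below assumes that efficacy is the fraction of problems answered correctly at every length (the original question and all of its extensions), and that resilience is one minus the relative accuracy drop from the original questions to the longest extensions. Both definitions, the function names, and the `results` layout are assumptions made for this sketch, not the paper's stated formulas.

```python
from typing import Dict, List

# Hypothetical layout: problem id -> correctness flags,
# index 0 = original question, index r = r-th (longer) extension.
Results = Dict[str, List[bool]]


def accuracy_per_round(results: Results) -> List[float]:
    """Accuracy at each extension round, averaged over all problems."""
    n_rounds = len(next(iter(results.values())))
    n_problems = len(results)
    return [
        sum(flags[r] for flags in results.values()) / n_problems
        for r in range(n_rounds)
    ]


def efficacy(results: Results) -> float:
    """Assumed efficacy: share of problems solved at every length."""
    return sum(all(flags) for flags in results.values()) / len(results)


def resilience(results: Results) -> float:
    """Assumed resilience: 1 - relative accuracy drop, original vs. longest."""
    acc = accuracy_per_round(results)
    if acc[0] == 0:
        raise ValueError("accuracy on the original questions is zero")
    return 1 - (acc[0] - acc[-1]) / acc[0]


# Toy data, invented purely for illustration.
toy = {
    "p1": [True, True, True],
    "p2": [True, True, False],
    "p3": [True, False, False],
    "p4": [False, False, False],
}
print(accuracy_per_round(toy), efficacy(toy), resilience(toy))
```

Under these assumptions, a model whose accuracy stays flat as questions grow longer would score a resilience of 1, while a high efficacy additionally requires that the same problems stay solved at every length.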

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- This paper introduces E-GSM, a dataset with lengthy, distracting sentences that make it considerably more challenging than the original GSM. This dataset offers a valuable tool for evaluating the robustness of LLMs.
- The approach used to create E-GSM can also be applied to expand existing math training datasets, providing new supervised fine-tuning (SFT) data in the math domain.

Weaknesses

- The augmented math questions may include contradictory sentences, and may become unsolvable or yield answers that differ from the original ones. Although human evaluations on 200 samples suggest that “94.5% of questions meet acceptable quality,” this accuracy may still be inadequate, particularly given that the labels in the GSM8K test set might contain errors. An alternative could be to release these 200 samples as a verified subset of the E-GSM dataset. Reporting C

Reviewer 02 · Rating 3 · Confidence 4

Strengths

The paper explores the impact of question length on LLM performance and proposes a method to extend the length of GSM questions. It presents a method called CoRe to help proprietary LLMs better handle these long-form questions. For open-source LLMs, the authors fine-tune them on a dataset of 65K CoT examples that they created.

Weaknesses

1. The paper explores artificially lengthened math problems, but real questions are seldom written the way the authors present them, i.e., as very verbose narratives around a relatively simple math problem. It is therefore unknown whether this work helps with real-world long math problems, where the question is long even though it already describes the problem as succinctly as it can. Better solving them is our ultimate goal, rather than solving the

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- Strong motivation through rigorous statistical analysis shows LLMs struggle with longer MWPs (Section 2.1).
- Proposes creative solutions (CoRe prompting and extension fine-tuning) to address identified limitations.
- Well-designed metrics (CoLeG-E and CoLeG-R) that capture both efficacy and robustness of LLMs on long MWPs.
- Sufficient experiments have proven the effectiveness of the method.

Weaknesses

- The paper focuses on LLMs tackling longer math word problems, rather than genuinely difficult ones. Addressing truly challenging problems would likely yield more impactful and valuable research insights.
- A deeper analysis of the types of errors LLMs make on extended MWPs would strengthen the paper. This could shed light on whether mistakes stem from misinterpreting context, losing track of key information, or actual computational errors.
- The authors don't explore whether breaking down problems

Code & Models

Repositories

xinxu-ustc/coleg-math
PyTorch · Official

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Open Education and E-Learning