Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Anselm R. Strohmaier; Wim Van Dooren; Kathrin Seßler; Brian Greer; Lieven Verschaffel

arXiv:2506.24006·cs.CL·August 12, 2025

Open Access

TL;DR

This paper reviews how well large language models solve mathematical word problems. It finds that they handle problems that yield to a superficial solution process but struggle to make sense of real-world context, which limits their usefulness in education.

Contribution

The paper provides a scoping review in three parts: a technical overview, a systematic review of the word problems used in research, and an empirical evaluation of LLMs. Taken from a mathematics education perspective, the review shows that LLMs understand word problems only superficially.

Findings

01. LLMs solve s-problems with near-perfect accuracy.

02. Most word problems used in research lack real-world context.

03. LLMs struggle with problems involving real-world or nonsensical contexts.

Abstract

The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Text Readability and Simplification · Computational and Text Analysis Methods

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer · PrIme Sample Attention · ALIGN