Large Language Models and Mathematical Reasoning Failures

Johan Boye; Birger Moell

arXiv:2502.11574·cs.AI·February 24, 2025

Open Access

TL;DR

This study tests the mathematical reasoning of large language models on 50 newly constructed high-school-level word problems. It finds persistent reasoning failures even where final answers are correct, particularly on complex, multi-step, or real-world problems.

Contribution

The paper analyzes the solution steps as well as the final answers of each model, identifies specific failure modes, and argues that evaluation should cover the reasoning process rather than answer correctness alone.
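To make the distinction between answer correctness and reasoning correctness concrete, here is a minimal sketch of how manually graded responses could be tallied on both axes. The record layout (`Graded`), the `tally` helper, and the sample grades are all hypothetical illustrations; they are not the authors' tooling and not data from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Graded:
    """One manually graded model response (hypothetical record layout)."""
    problem_id: str
    answer_correct: bool   # final answer matches the reference answer
    reasoning_sound: bool  # a human grader judged every solution step valid


def tally(records):
    """Count responses in each (answer, reasoning) cell."""
    cells = Counter()
    for r in records:
        answer = "right answer" if r.answer_correct else "wrong answer"
        steps = "sound steps" if r.reasoning_sound else "flawed steps"
        cells[(answer, steps)] += 1
    return cells


# Made-up grades for illustration only; not results from the paper.
records = [
    Graded("p1", True, True),
    Graded("p2", True, False),   # correct answer reached through flawed logic
    Graded("p3", False, False),
    Graded("p4", True, True),
]

cells = tally(records)
answer_accuracy = sum(r.answer_correct for r in records) / len(records)
fully_correct = cells[("right answer", "sound steps")] / len(records)

print("right answer, flawed steps:", cells[("right answer", "flawed steps")])
print("answer accuracy:", answer_accuracy)
print("answer and reasoning both correct:", fully_correct)
```

In this toy data, answer accuracy alone overstates performance: one of the three correct answers rests on flawed steps, which an answer-only metric would not reveal.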

Findings

01 All models exhibit reasoning errors in spatial, strategic, and arithmetic tasks.

02 Newer models perform better but still struggle with multi-step deduction.

03 Models often produce correct answers through flawed logic.

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models - including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants - we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step…

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning