Large Language Models and Mathematical Reasoning Failures
Johan Boye, Birger Moell

TL;DR
This study assesses the mathematical reasoning of large language models on high-school-level word problems. It finds persistent reasoning failures even when final answers are correct, especially on complex, multi-step, or real-world problems.
Contribution
The paper analyzes the solution steps that LLMs produce, not only their final answers. It identifies specific failure modes and emphasizes the importance of evaluating the reasoning process rather than answer correctness alone.
Findings
All models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
Newer models perform better but still struggle with multi-step deduction.
Models sometimes produce correct answers through flawed logic.
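The distinction between a correct answer and a sound solution can be made concrete with a small tally. The sketch below is not the authors' evaluation code. It assumes each solution has already been graded by hand on two independent flags, and the names `GradedSolution`, `tally`, and the sample records are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GradedSolution:
    """One manually graded model solution (hypothetical record layout)."""
    problem_id: str
    answer_correct: bool  # final answer matches the reference answer
    steps_valid: bool     # a human grader judged every solution step sound


def tally(graded):
    """Count solutions by outcome, keeping flawed-but-correct ones separate."""
    counts = {"sound": 0, "right_answer_flawed_logic": 0, "wrong_answer": 0}
    for g in graded:
        if not g.answer_correct:
            counts["wrong_answer"] += 1
        elif g.steps_valid:
            counts["sound"] += 1
        else:
            counts["right_answer_flawed_logic"] += 1
    return counts


def answer_accuracy(graded):
    """Fraction of solutions whose final answer is correct."""
    return sum(g.answer_correct for g in graded) / len(graded)


def sound_accuracy(graded):
    """Fraction of solutions that are correct and reached by valid steps."""
    return sum(g.answer_correct and g.steps_valid for g in graded) / len(graded)


# Made-up grades for illustration only; these are not results from the paper.
sample = [
    GradedSolution("p1", answer_correct=True, steps_valid=True),
    GradedSolution("p2", answer_correct=True, steps_valid=False),
    GradedSolution("p3", answer_correct=False, steps_valid=False),
    GradedSolution("p4", answer_correct=True, steps_valid=True),
]

print(tally(sample))
print(answer_accuracy(sample), sound_accuracy(sample))
```

On the made-up sample, answer accuracy is 0.75 while sound accuracy is 0.5. The gap between the two numbers is the share of correct answers reached through flawed logic, which an answer-only benchmark would not reveal.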
Abstract
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models - including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants - we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus
