Mathematical Computation and Reasoning Errors by Large Language Models
Liang Zhang, Edith Aurora Graf

TL;DR
This study evaluates the final-answer accuracy and step-level reasoning errors of four large language models on deliberately challenging math tasks. Procedural slips are the main source of error, and dual-agent setups improve performance; both findings can inform how AI is integrated into math education.
Contribution
It introduces a systematic analysis of LLM reasoning errors on custom math tasks and demonstrates how dual-agent configurations enhance accuracy in mathematical problem-solving.
Findings
OpenAI o1 achieves high accuracy across all three math task categories.
Procedural slips are the most common errors affecting performance.
Dual-agent configurations significantly improve problem-solving accuracy.
Abstract
Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math problem-solving tasks is foundational for ensuring reliable and precise feedback and assessment in math education practices. Our study evaluates the accuracy of four LLMs (OpenAI GPT-4o and o1, DeepSeek-V3 and DeepSeek-R1) on three categories of math tasks (arithmetic, algebra, and number theory) and identifies step-level reasoning errors within their solutions. Instead of relying on standard benchmarks, we intentionally build math tasks (via item models) that are challenging for LLMs and on which they are prone to errors. The accuracy of final answers and the presence of errors in individual solution steps were systematically analyzed and coded.…
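The abstract says the tasks were built via item models rather than drawn from standard benchmarks. As a rough illustration only, an item model can be thought of as a parameterized template that produces many task instances along with their answer keys. The sketch below is not the authors' implementation: the multi-digit multiplication template, the names `Item`, `multiplication_item`, and `accuracy`, and the exact-match scoring rule are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One generated task instance with its answer key (hypothetical structure)."""
    prompt: str
    answer: int


def multiplication_item(rng: random.Random, digits: int = 5) -> Item:
    """Hypothetical item model: multiply two integers with `digits` digits each.

    The template and the digit range are assumptions, chosen only because
    long multiplication is a plausible arithmetic task for this kind of study.
    """
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return Item(prompt=f"Compute {a} * {b}.", answer=a * b)


def accuracy(items, responses):
    """Fraction of final answers that exactly match the answer key."""
    if not items:
        return 0.0
    correct = sum(1 for item, resp in zip(items, responses) if resp == item.answer)
    return correct / len(items)


# Usage: generate a small reproducible item set and score a set of responses.
rng = random.Random(0)
items = [multiplication_item(rng) for _ in range(3)]
perfect_responses = [item.answer for item in items]
print(accuracy(items, perfect_responses))
```

Scoring final answers this way covers only the accuracy side of the analysis; the step-level error coding described in the abstract requires inspecting each solution step and is not captured by this sketch.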
