Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning

Tiasa Singha Roy; Aditeya Baral; Ayush Rajesh Jhaveri; Yusuf Baig

arXiv:2505.15623 · cs.CL · May 22, 2025

TL;DR

This paper investigates the limitations of large language models in mathematical reasoning, introducing the MAPLE score to better evaluate their multi-step logic capabilities beyond mere accuracy.

Contribution

It presents a novel evaluation framework and metric, MAPLE score, to assess LLMs' reasoning processes more comprehensively than traditional accuracy measures.

Findings

01 LLMs struggle with multi-step mathematical reasoning

02 MAPLE score effectively captures reasoning errors and redundancies

03 Current accuracy metrics overlook reasoning process flaws

Abstract

Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
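
To make the gap between final-answer accuracy and process-level evaluation concrete, here is a minimal sketch. It is not the MAPLE formula, which the abstract does not spell out. The `StepLabels` fields, the weights, the reading of "validity" as final-answer correctness, and the weighted-sum form of `toy_misalignment` are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StepLabels:
    """Hypothetical per-solution annotations; not the paper's schema."""
    total_steps: int      # steps in the model's reasoning chain
    wrong_steps: int      # steps that contain an error
    redundant_steps: int  # steps that repeat earlier work or add nothing
    final_correct: bool   # whether the final answer matches the reference


def final_answer_accuracy(solutions):
    """Fraction of solutions whose final answer is correct."""
    return sum(s.final_correct for s in solutions) / len(solutions)


def toy_misalignment(s, w_error=0.5, w_redundancy=0.25, w_validity=0.25):
    """Illustrative composite in [0, 1]; higher means worse reasoning.

    Assumed weighted sum for illustration, NOT the paper's MAPLE score.
    """
    error_rate = s.wrong_steps / s.total_steps
    redundancy = s.redundant_steps / s.total_steps
    invalid = 0.0 if s.final_correct else 1.0  # assumed reading of "validity"
    return w_error * error_rate + w_redundancy * redundancy + w_validity * invalid


# Two solutions that both reach the right answer by very different routes.
clean = StepLabels(total_steps=4, wrong_steps=0, redundant_steps=0, final_correct=True)
messy = StepLabels(total_steps=8, wrong_steps=2, redundant_steps=4, final_correct=True)

print(final_answer_accuracy([clean, messy]))
print(toy_misalignment(clean), toy_misalignment(messy))
```

Both solutions count as correct under accuracy alone, while the step-level composite separates them. That is the kind of distinction the abstract says accuracy-only evaluation misses.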

Taxonomy

Topics: Mathematics, Computing, and Information Processing