Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning
Tiasa Singha Roy, Aditeya Baral, Ayush Rajesh Jhaveri, Yusuf Baig

TL;DR
This paper investigates the limitations of large language models in mathematical reasoning, introducing the MAPLE score to better evaluate their multi-step logic capabilities beyond mere accuracy.
Contribution
It presents a novel evaluation framework and metric, MAPLE score, to assess LLMs' reasoning processes more comprehensively than traditional accuracy measures.
Findings
LLMs struggle with multi-step mathematical reasoning
MAPLE score effectively captures reasoning errors and redundancies
Current accuracy metrics overlook reasoning process flaws
Abstract
Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning, particularly in executing precise, multi-step logic. However, current evaluation frameworks judge their performance solely based on accuracy, which only accounts for the final answer. This study explores these pitfalls by employing a novel evaluation framework. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
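The gap the abstract describes, between scoring only the final answer and inspecting the reasoning that led to it, can be illustrated with a toy example. This is a minimal sketch and not the MAPLE score: the abstract does not give the formula, so no combined score is computed here. The `Step` class, its `correct` and `redundant` labels, and both helper functions are hypothetical names introduced for illustration, and the labels are assumed to come from some external annotation.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step with hypothetical, externally supplied labels."""
    text: str
    correct: bool    # assumed label: is the step mathematically valid?
    redundant: bool  # assumed label: does the step repeat earlier work?


def final_answer_accuracy(predicted: str, reference: str) -> float:
    """Score only the final answer, ignoring the reasoning trace."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def step_level_summary(steps: list[Step]) -> dict[str, float]:
    """Fraction of steps labeled erroneous and fraction labeled redundant."""
    n = len(steps)
    if n == 0:
        return {"error_rate": 0.0, "redundancy_rate": 0.0}
    return {
        "error_rate": sum(not s.correct for s in steps) / n,
        "redundancy_rate": sum(s.redundant for s in steps) / n,
    }


# Toy trace for "solve 2x + 6 = 10": the final answer is right,
# but one step is wrong and one merely repeats the previous line.
trace = [
    Step("2x + 6 = 10, so 2x = 4", correct=True, redundant=False),
    Step("2x = 4, so 2x = 4", correct=True, redundant=True),
    Step("x = 4 / 2 = 3", correct=False, redundant=False),
    Step("Checking again, x = 2", correct=True, redundant=False),
]

accuracy = final_answer_accuracy("2", "2")
summary = step_level_summary(trace)
print(accuracy, summary)
```

In this toy trace the final-answer accuracy is perfect even though a quarter of the steps are wrong and a quarter are redundant, which is the kind of process flaw that an accuracy-only evaluation cannot see.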
Taxonomy
Topics: Mathematics, Computing, and Information Processing
