Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

Joykirat Singh; Akshay Nambi; Vibhav Vineet

arXiv:2406.10834·cs.CL·June 18, 2024

Open Access

TL;DR

This paper evaluates large language models' ability to detect and correct reasoning mistakes in math word problems, introducing a new dataset and benchmarking models' reasoning robustness.

Contribution

It introduces MWP-MISTAKE, a novel dataset for assessing LLMs' reasoning error detection and correction capabilities, and provides comprehensive benchmarking insights.

Findings

01. GPT-4o outperforms other models in mistake detection

02. Smaller models face significant challenges in reasoning tasks

03. Data contamination affects model reliability

Abstract

Large Language Models (LLMs) have been applied to Math Word Problems (MWPs) with transformative impacts, revolutionizing how these complex problems are approached and solved in various domains including educational settings. However, the evaluation of these models often prioritizes final accuracy, overlooking the crucial aspect of reasoning capabilities. This work addresses this gap by focusing on the ability of LLMs to detect and correct reasoning mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking reveals significant insights into the strengths and weaknesses of state-of-the-art models, such as GPT-4o, GPT-4, GPT-3.5 Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the…

Taxonomy

Topics: Educational Assessment and Pedagogy

Methods: Cosine Annealing · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Dropout