From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems

A M Muntasir Rahman; Junyi Ye; Wei Yao; Sierra S. Liu; Jesse Yu; Jonathan Yu; Wenpeng Yin; Guiling Wang

arXiv:2410.18921·cs.CL·April 8, 2025

TL;DR

This paper introduces FaultyMath, a benchmark dataset to evaluate whether large language models can identify logical flaws in mathematical problems, revealing that most models act as blind solvers rather than logical thinkers.

Contribution

The paper presents a new diverse dataset and comprehensive evaluation framework to assess LLMs' ability to detect logical inconsistencies in math problems, highlighting current limitations.

Findings

01. Most LLMs act as blind solvers without deeper reasoning.

02. Models struggle to reliably detect faulty math problems.

03. Hints and explanations have limited impact on improving model reasoning.

Abstract

Consider the math problem: "Lily received 3 cookies from her best friend yesterday and ate 5 for breakfast. Today, her friend gave her 3 more cookies. How many cookies does Lily have now?" Many large language models (LLMs) in previous research approach this problem by calculating the answer "1" using the equation "3 - 5 + 3." However, from a human perspective, we recognize the inherent flaw in this problem: Lily cannot eat 5 cookies if she initially had only 3. This discrepancy prompts a key question: Are current LLMs merely Blind Solvers that apply mathematical operations without deeper reasoning, or can they function as Logical Thinkers capable of identifying logical inconsistencies? To explore this question, we propose a benchmark dataset, FaultyMath, which includes faulty math problems of rich diversity: i) multiple mathematical categories, e.g., algebra, geometry, number theory,…
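
To make the cookie example concrete, here is a minimal Python sketch contrasting the two behaviors the abstract describes. It is an illustration only, not the paper's evaluation method: the function names `blind_solver` and `logical_thinker` are hypothetical, and the sketch assumes Lily starts with zero cookies, which the problem statement implies but does not state.

```python
def blind_solver(changes):
    """Add up the signed changes without checking whether each step is possible."""
    return sum(changes)


def logical_thinker(changes, start=0):
    """Apply each change in order and flag the problem if the count goes negative.

    Assumption (not from the paper): the count starts at `start` = 0.
    Returns (answer, None) for a consistent problem, or (None, reason) for a faulty one.
    """
    count = start
    for step, delta in enumerate(changes, start=1):
        if count + delta < 0:
            return None, f"step {step}: cannot remove {-delta} when only {count} remain"
        count += delta
    return count, None


# Lily's cookies: +3 received, -5 eaten, +3 received.
cookie_problem = [+3, -5, +3]

print(blind_solver(cookie_problem))     # 1
print(logical_thinker(cookie_problem))  # (None, 'step 2: cannot remove 5 when only 3 remain')
```

The blind solver reproduces the "3 - 5 + 3" answer of 1, while the step-by-step check stops at the second event, which is the inconsistency a human reader notices.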

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Statistics Education and Methodologies