Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Marianna Nezhurina; Lucia Cipolina-Kun; Mehdi Cherti; Jenia Jitsev

arXiv:2406.02061·cs.LG·March 6, 2025·20 cites

Open Access · 2 Repos · 1 Dataset

TL;DR

This paper reveals that state-of-the-art large language models, including GPT-4 and Claude 3, fail dramatically on simple reasoning tasks, exposing significant gaps in their generalization and reasoning abilities despite high benchmark scores.

Contribution

The study demonstrates that a simple reasoning problem causes major performance breakdowns in SOTA LLMs, calling into question the validity of current benchmarks and highlighting the need for improved evaluation methods.

Findings

01

SOTA models perform poorly on simple common sense math problems

02

Models exhibit overconfidence and confabulations in incorrect solutions

03

Standard interventions like chain-of-thought prompting fail to improve reasoning accuracy

Abstract

Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws, and therefore transfer robustly across various conditions in a few- or zero-shot manner. Such claims rely on standardized benchmarks that are supposed to measure generalization and reasoning, where state-of-the-art (SOTA) models score high. We demonstrate here a dramatic breakdown of generalization and basic reasoning of all SOTA models claiming strong function, including large-scale advanced models like GPT-4 or Claude 3 Opus, using a simple, short common sense math problem formulated in concise natural language, easily solvable by humans (AIW problem). The breakdown is dramatic as it manifests on a simple problem in both low average performance and strong performance fluctuations on natural variations in problem template that do not change either problem…
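The truncated abstract above does not spell out the AIW problem. In the paper it takes the form "Alice has N brothers and she also has M sisters. How many sisters does Alice's brother have?", with correct answer M + 1, since each brother's sisters are Alice's sisters plus Alice herself. The following is a minimal sketch of how template variations and a correct-response rate might be computed. The function names and the (N, M) values are hypothetical and illustrative, not the authors' evaluation code.

```python
# Minimal sketch of the AIW problem template; helper names are hypothetical.
AIW_TEMPLATE = (
    "Alice has {n} brothers and she also has {m} sisters. "
    "How many sisters does Alice's brother have?"
)


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Fill the template with one (N, M) variation."""
    return AIW_TEMPLATE.format(n=n_brothers, m=m_sisters)


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Each brother has Alice's M sisters plus Alice herself as sisters."""
    return m_sisters + 1


def correct_rate(responses: list[int], n_brothers: int, m_sisters: int) -> float:
    """Fraction of parsed numeric model responses that match the correct answer."""
    expected = aiw_answer(n_brothers, m_sisters)
    return sum(r == expected for r in responses) / len(responses)


if __name__ == "__main__":
    # Illustrative (N, M) pairs; changing them alters the numbers, not the structure.
    for n, m in [(3, 6), (4, 2)]:
        print(aiw_prompt(n, m), "->", aiw_answer(n, m))
```

Comparing `correct_rate` across several (N, M) pairs is one way to observe the fluctuations the abstract refers to: the reasoning needed is identical for every pair, so a robust solver should score the same on all of them.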

Code & Models

Repositories

Datasets

ryota39/aiw_ja
dataset· 51 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling