Frontier LLMs Still Struggle with Simple Reasoning Tasks

Alan Malek; Jiawei Ge; Nevena Lazic; Chi Jin; András György; Csaba Szepesvári

arXiv:2507.07313·cs.CL·July 11, 2025

Open Access · 3 Reviews

TL;DR

This paper reveals that state-of-the-art large language models still struggle with simple reasoning tasks, showing systematic failures and poor out-of-distribution generalization even on trivialized problems.

Contribution

The study introduces a suite of procedurally generated simple reasoning tasks and the unpuzzles dataset to analyze model failures and out-of-distribution generalization issues.

Findings

01

Models fail on simple reasoning tasks due to statistical shortcuts and errors.

02

Modern LLMs perform poorly on trivialized puzzles, indicating memorization issues.

03

Out-of-distribution generalization remains a challenge for frontier LLMs.

Abstract

While state-of-the-art large language models (LLMs) demonstrate advanced reasoning capabilities, achieving remarkable performance on challenging competitive math and coding benchmarks, they also frequently fail on tasks that are easy for humans. This work studies the performance of frontier LLMs on a broad set of such "easy" reasoning problems. By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even…
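To make "procedurally generated with changeable parameters" concrete, here is a minimal sketch of what a word-counting task with a tunable document length might look like. It is not the authors' generator: the function name `make_counting_task`, the vocabulary, and the prompt wording are all hypothetical choices made for illustration.

```python
import random


def make_counting_task(doc_length, target="apple",
                       fillers=("pear", "plum", "fig"), seed=0):
    """Build one counting problem (hypothetical format, not from the paper).

    doc_length is the tunable parameter: raising it increases the amount of
    computation needed, while the task itself (count one word) stays the same.
    """
    rng = random.Random(seed)  # seeded so every instance is reproducible
    vocab = (target,) + tuple(fillers)
    words = [rng.choice(vocab) for _ in range(doc_length)]
    prompt = (
        f'How many times does the word "{target}" appear in the text below?\n'
        + " ".join(words)
    )
    answer = words.count(target)  # ground truth computed by the generator
    return prompt, words, answer


if __name__ == "__main__":
    # Sweep the parameter to produce progressively longer instances.
    for n in (10, 100, 1000):
        prompt, words, answer = make_counting_task(n, seed=42)
        print(f"length={n:5d}  ground-truth count={answer}")
```

Because the generator computes the ground truth itself, fresh instances can be drawn at any length without manual labeling, which is also what makes this kind of benchmark less prone to data contamination.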

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

S1. The paper provides a timely and comprehensive evaluation of thinking models on simple reasoning tasks, filling a gap in the literature since most prior work focused on earlier model generations. The inclusion of o1, o3, DeepSeek R1, and Gemini thinking variants makes this a valuable reference for understanding current capabilities.

S2. The procedurally generated task suite with tunable parameters (document length, tree depth, number of cities) is well-designed and allows systematic study of…

Weaknesses

W1. Several tasks conflate multiple difficulty dimensions simultaneously, making it unclear which factors drive failures. For example, increasing tree depth in logic tasks increases both the number of reasoning steps and context length. More controlled ablations isolating individual factors would strengthen causal claims about failure modes.

W2. The UNPUZZLES dataset, while introducing an interesting phenomenon, is limited in size (97 puzzles) and requires manual construction and evaluation. Th…

Reviewer 02 · Rating 2 · Confidence 4

Strengths

In-depth error analysis. Nice idea with the unpuzzle puzzles, especially the context-shifted unpuzzle, which allows for a more detailed failure analysis and attribution.

Weaknesses

No human evaluation. You write “quite easy for humans,” but do not test this statement. You make this a central part of the paper, yet it remains untested.

The story should be cleaner. Sections 3 and 4 feel related, but currently they should be two papers. I suggest thinking more about how the paper is presented to tie these two together.

“One suggestion from our paper is that LLMs should be evaluated not only by the most difficult problem they can solve, but also by the simplest problem they…

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The paper shows that state-of-the-art LLMs still struggle on simple reasoning tasks by constructing procedurally generated reasoning tasks and a small human-annotated puzzle dataset. It reveals several failure modes that are common among frontier LLMs.

Weaknesses

The novelty of the paper is limited: the idea that frontier LLMs struggle with "simple"/"tedious" tasks has been studied in several works [1, 2, 3]. The paper's advantage over previous works, e.g., procedural generation that makes the benchmark less prone to data contamination, does not provide new information about the fact that LLMs struggle with these simple tasks.

[1] Kazemnejad, Amirhossein, et al. "The impact of positional encoding on length generalization in transformer

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification