Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Md. Fahad Ullah Utsho; Mohd. Ruhul Ameen; Akif Islam; Md. Golam Rashed; Dipankar Das

arXiv:2604.13371 · cs.CL · April 16, 2026

TL;DR

This paper introduces a benchmarking framework for measuring how the accuracy and reasoning quality of large reasoning models change as task complexity increases across nine classical reasoning problems, and finds that both decline sharply.

Contribution

The paper systematically demonstrates a reasoning-collapse phase transition in large reasoning models as problem complexity increases.

Findings

1. Models perform well at low complexity but sharply decline beyond certain thresholds.
2. Accuracy drops often exceed 50% as complexity increases.
3. Increased reasoning length does not consistently improve correctness.

Abstract

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik's Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and…
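The abstract describes tasks whose complexity is controlled by a parameter and whose answers are scored by deterministic validators. A minimal sketch of that idea for Tower of Hanoi, one of the nine tasks, follows. It assumes the model's answer has already been parsed into a list of (source, target) peg moves and that complexity is the number of disks n. The function names, peg numbering, and move format are hypothetical illustrations, not the paper's implementation.

```python
from typing import List, Tuple

# Hypothetical move format: (source peg, target peg), pegs numbered 0..2.
Move = Tuple[int, int]


def validate_hanoi(n: int, moves: List[Move]) -> bool:
    """Return True iff `moves` legally transfers n disks from peg 0 to peg 2.

    Illustrative deterministic validator; assumes moves are already parsed.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each peg is a stack; disk n (the largest) starts at the bottom of peg 0.
    pegs: List[List[int]] = [list(range(n, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False  # malformed move
        if not pegs[src]:
            return False  # cannot move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Goal state: all n disks stacked on peg 2 in the original order.
    return pegs[2] == list(range(n, 0, -1))


def optimal_hanoi(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Reference solver using the classic recursion (2**n - 1 moves)."""
    if n == 0:
        return []
    return (optimal_hanoi(n - 1, src, dst, aux)    # park n-1 disks on aux
            + [(src, dst)]                          # move the largest disk
            + optimal_hanoi(n - 1, aux, src, dst))  # restack n-1 disks on dst


if __name__ == "__main__":
    # Sweep the complexity parameter and check the reference solution.
    for n in range(1, 6):
        moves = optimal_hanoi(n)
        print(n, len(moves), validate_hanoi(n, moves))
```

Because the minimal solution has 2**n - 1 moves, each added disk roughly doubles the length of a correct answer, which is one concrete way a single parameter can scale difficulty while the rules stay the same. The validator checks every intermediate state, so a transcript that reaches the goal through an illegal move is still rejected.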