Reasoning Models Reason Well, Until They Don't
Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, Abulhair Saparov

TL;DR
Large reasoning models (LRMs) show impressive performance on existing benchmarks but fail to generalize as reasoning complexity increases, highlighting the need for more robust methods.
Contribution
The paper introduces the Deep Reasoning Dataset (DeepRD) and demonstrates that LRMs' performance drops sharply with increased complexity, revealing limitations of current benchmarks.
Findings
LRMs perform well on existing benchmarks but fail at higher complexity levels.
Most real-world reasoning tasks fall within LRMs' success regime, but long-tail cases expose failures.
Existing benchmarks underestimate the true complexity of reasoning problems.
Abstract
Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable…
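To make the idea of "scaling the complexity of reasoning problems" concrete, here is a minimal sketch of a generator for graph path-finding problems with a single complexity knob. It is not the DeepRD generative process, whose details are not given in the text above. The function names (`make_problem`, `path_length`), the `depth` and `branches` parameters, and the chain-plus-dead-ends construction are all assumptions chosen for illustration.

```python
import random
from collections import deque


def make_problem(depth, branches=2, seed=0):
    """Build a directed graph whose start-to-goal path has exactly `depth` edges.

    Hypothetical construction (not from the paper): one goal chain plus
    `branches` dead-end chains of the same length, all leaving the start
    node, with randomly permuted node labels.
    """
    rng = random.Random(seed)
    n_nodes = 1 + depth * (branches + 1)
    labels = list(range(n_nodes))
    rng.shuffle(labels)  # random labels so node ids carry no hint
    start = labels[0]
    goal = start
    edges = []
    next_free = 1
    for chain in range(branches + 1):
        prev = start
        for _ in range(depth):
            node = labels[next_free]
            next_free += 1
            edges.append((prev, node))
            prev = node
        if chain == 0:
            goal = prev  # the end of the first chain is the goal
    rng.shuffle(edges)  # hide the chain structure in the edge order
    return edges, start, goal


def path_length(edges, start, goal):
    """Breadth-first search: edges on a shortest start-to-goal path, or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == goal:
            return dist[u]
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


if __name__ == "__main__":
    # Increasing `depth` lengthens the chain a solver must follow.
    for depth in (2, 8, 32):
        edges, start, goal = make_problem(depth, branches=3, seed=depth)
        print(depth, len(edges), path_length(edges, start, goal))
```

Because the generator is seeded and parameterized, it can produce an unlimited number of instances at any chosen depth, and the breadth-first search gives a ground-truth answer to score a model against.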
