LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Sumeet Ramesh Motwani; Daniel Nichols; Charles London; Peggy Li; Fabio Pizzati; Acer Blake; Hasan Hammoud; Tavish McDonald; Akshat Naik; Alesia Ivanova; Vignesh Baskaran; Ivan Laptev; Ruben Glatt; Tal Ben-Nun; Philip Torr; Natasha Jaques; Ameya Prabhu; Brian Bartoldson; Bhavya Kailkhura; Christian Schroeder de Witt

arXiv:2604.14140 · cs.LG · April 16, 2026

TL;DR

LongCoT is a benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic. It evaluates the long-horizon chain-of-thought reasoning of frontier language models; at release, the best models score below 10% accuracy.

Contribution

The paper introduces LongCoT, a scalable benchmark with 2,500 problems to measure long-horizon chain-of-thought reasoning in frontier models.

Findings

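01 At release, the best models achieve less than 10% accuracy on LongCoT (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%).

02 Each local step is individually tractable for frontier models, so failures on LongCoT reflect long-horizon reasoning limitations rather than missing local skills.

03 Current frontier models show substantial limitations on problems whose solutions span tens to hundreds of thousands of reasoning tokens.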

Abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT,…
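The problem structure described in the abstract (a short input, a verifiable answer, and a graph of interdependent steps) can be pictured with a toy sketch. Nothing below is taken from the benchmark itself: the step names, the arithmetic, and the `solve` and `verify` helpers are hypothetical, and real LongCoT problems are far larger and domain-specific. The sketch only illustrates how an answer can depend on many small steps, each easy in isolation, that must all be carried through correctly.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy problem (not from LongCoT): each entry maps a step name
# to (names of the steps it depends on, function computing this step).
STEPS = {
    "a": ((), lambda: 7),
    "b": ((), lambda: 5),
    "c": (("a", "b"), lambda a, b: a * b),
    "d": (("c",), lambda c: c + 1),
    "answer": (("c", "d"), lambda c, d: d - c + 41),
}


def solve(steps):
    """Evaluate every step in dependency order and return all results."""
    graph = {name: deps for name, (deps, _) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, fn = steps[name]
        # Each local step is trivial; the difficulty is keeping every
        # intermediate result correct along the whole dependency graph.
        results[name] = fn(*(results[d] for d in deps))
    return results


def verify(predicted, reference):
    """Exact-match check of a final answer against a reference answer."""
    return predicted == reference


results = solve(STEPS)
print(verify(results["answer"], 42))
```

In this toy case a single wrong intermediate value (for example in step `c`) propagates to every later step that depends on it, which is the kind of failure a verifiable final answer exposes.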


Code & Models

Datasets

LongHorizonReasoning/longcot
dataset · 552 downloads
