LongReasonArena: A Long Reasoning Benchmark for Large Language Models
Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei

TL;DR
LongReasonArena is a benchmark for evaluating the long reasoning abilities of large language models. Its tasks require executing multi-step algorithms whose reasoning length scales up to one million tokens, and they prove highly challenging for current models.
Contribution
This paper presents LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs through multi-step tasks whose required reasoning length can be scaled by controlling the inputs.
Findings
Deepseek-R1 achieves only 7.5% accuracy on the benchmark.
Model accuracy declines linearly with the logarithm of the number of reasoning steps.
The benchmark challenges both open-source and proprietary LLMs.
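To make the log-linear finding concrete, the following is a minimal sketch of how such a trend could be checked: fit accuracy ≈ a + b · log10(steps) by ordinary least squares and look at the sign of b. The function name `fit_log_linear` and the data points are hypothetical and purely synthetic; they are not results from the paper.

```python
import math


def fit_log_linear(steps, accuracy):
    """Least-squares fit of accuracy ≈ a + b * log10(steps).

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.log10(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    # Standard closed-form slope and intercept for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic, made-up points chosen to lie exactly on a line in log space.
steps = [10**2, 10**3, 10**4, 10**5]
accuracy = [0.8, 0.6, 0.4, 0.2]

a, b = fit_log_linear(steps, accuracy)
print(f"intercept={a:.3f}, slope per decade={b:.3f}")
```

A negative slope `b` means accuracy drops by a fixed amount each time the number of reasoning steps grows tenfold, which is what a linear decline in the logarithm of the step count amounts to.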
Abstract
Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a…
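As a rough illustration of how controlling the input can scale the amount of reasoning a backtracking procedure requires, the sketch below counts the candidate placements an N-Queens search examines for different board sizes. N-Queens and the function `count_queens` are assumptions chosen for illustration only; the source does not state which algorithms the benchmark tasks use.

```python
def count_queens(n):
    """Return (solutions, steps) for the n-queens puzzle via backtracking.

    `steps` counts every candidate placement examined, as a crude proxy
    for the number of reasoning steps. Illustrative only.
    """
    steps = 0
    solutions = 0
    cols, diag1, diag2 = set(), set(), set()

    def place(row):
        nonlocal steps, solutions
        if row == n:
            solutions += 1
            return
        for col in range(n):
            steps += 1  # one candidate placement examined
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue  # conflict: try the next column
            cols.add(col)
            diag1.add(row - col)
            diag2.add(row + col)
            place(row + 1)
            # Backtrack: undo the placement before trying the next column.
            cols.remove(col)
            diag1.remove(row - col)
            diag2.remove(row + col)

    place(0)
    return solutions, steps


for n in (4, 6, 8):
    solutions, steps = count_queens(n)
    print(f"n={n}: solutions={solutions}, steps={steps}")
```

Here the single input parameter `n` determines how many placements must be tried and undone, so a larger input directly demands a longer chain of steps from anything that solves the problem by simulating the search.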
