LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Zhan Ling; Kang Liu; Kai Yan; Yifan Yang; Weijian Lin; Ting-Han Fan; Lingfeng Shen; Zhengyin Du; Jiecao Chen

arXiv:2501.15089·cs.CL·November 19, 2025

TL;DR

LongReason is a synthetic benchmark designed to evaluate the long-context reasoning abilities of large language models across diverse tasks, revealing current models' limitations as context length increases.

Contribution

We introduce LongReason, a comprehensive synthetic benchmark for assessing long-context reasoning in LLMs, covering multiple reasoning types and providing a new standard for evaluation.

Findings

01. Most LLMs' performance drops with longer context

02. State-of-the-art models still have significant room for improvement

03. LongReason is publicly available for research use

Abstract

Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating their long-context reasoning abilities have not kept pace. Existing benchmarks often focus on a narrow range of tasks, or on tasks that do not demand complex reasoning. To address this gap and enable a more comprehensive evaluation of the long-context reasoning capabilities of current LLMs, we propose a new synthetic benchmark, LongReason, constructed by synthesizing long-context reasoning questions from a varied set of short-context reasoning questions through context expansion. LongReason consists of 794 multiple-choice reasoning questions with diverse reasoning patterns across three task categories: reading comprehension, logical inference, and mathematical word problems. We evaluate 21 LLMs on LongReason, revealing that most models…
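The abstract names context expansion only at a high level, so the following is a minimal sketch of one plausible reading of the idea, not the authors' pipeline: the sentences a short question depends on are scattered, in their original order, among unrelated filler passages until a target length is reached. The names `ShortQuestion`, `expand_context`, and `format_prompt`, the word-count length measure, and the prompt layout are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ShortQuestion:
    # Hypothetical container; the field names are illustrative assumptions.
    background: list[str]  # sentences the answer depends on
    question: str
    choices: list[str]
    answer_index: int


def expand_context(item, fillers, target_words, seed=0):
    """Scatter the background sentences among unrelated filler passages."""
    rng = random.Random(seed)
    pool = list(fillers)
    rng.shuffle(pool)

    # Add filler passages until the word budget is met or fillers run out.
    chosen = []
    words = sum(len(s.split()) for s in item.background)
    for passage in pool:
        if words >= target_words:
            break
        chosen.append(passage)
        words += len(passage.split())

    # Sorted insertion slots keep the background sentences in original order.
    slots = sorted(rng.randint(0, len(chosen)) for _ in item.background)
    pending = list(zip(slots, item.background))
    context = []
    for i in range(len(chosen) + 1):
        while pending and pending[0][0] == i:
            context.append(pending.pop(0)[1])
        if i < len(chosen):
            context.append(chosen[i])
    return "\n\n".join(context)


def format_prompt(item, context):
    """Render the expanded context as a multiple-choice prompt."""
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item.choices))
    return f"{context}\n\nQuestion: {item.question}\n{options}\nAnswer:"
```

Calling `expand_context` with several values of `target_words` on the same item yields the same question at different context lengths, which is the kind of comparison the findings above refer to.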

Code & Models

Datasets

lz1bytedance/LongReason (dataset · 615 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Recommender Systems and Techniques

Methods: Sparse Evolutionary Training · Focus