LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark

Ziyang Chen; Xing Wu; Junlong Jia; Chaochen Gao; Qi Fu; Debing Zhang; Songlin Hu

arXiv:2601.02872 · cs.CL · January 7, 2026

TL;DR

LongBench Pro is a comprehensive bilingual benchmark with 1,500 real-world long-context samples in English and Chinese, designed to evaluate and analyze large language models' understanding of extended contexts across multiple tasks and difficulty levels.

Contribution

The paper introduces a scalable, realistic bilingual benchmark with task-specific metrics, together with a Human-Model Collaborative Construction pipeline for creating the data efficiently.

Findings

01. Long-context optimization improves long-context comprehension more than parameter scaling does.

02. Effective context length is often shorter than the claimed context length, and it is misaligned across languages.

03. The "thinking" paradigm mainly benefits models with native reasoning; mixed-thinking designs offer a Pareto trade-off.

Abstract

The rapid expansion of context length in large language models (LLMs) has outpaced existing evaluation benchmarks. Current long-context benchmarks often trade off scalability and realism: synthetic tasks underrepresent real-world complexity, while fully manual annotation is costly to scale to extreme lengths and diverse scenarios. We present LongBench Pro, a more realistic and comprehensive bilingual benchmark of 1,500 naturally occurring long-context samples in English and Chinese spanning 11 primary tasks and 25 secondary tasks, with input lengths from 8k to 256k tokens. LongBench Pro supports fine-grained analysis with task-specific metrics and a multi-dimensional taxonomy of context requirement (full vs. partial dependency), length (six levels), and difficulty (four levels calibrated by model performance). To balance quality with scalability, we propose a Human-Model Collaborative…
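
The multi-dimensional taxonomy described in the abstract suggests slicing results by language, context requirement, length level, and difficulty level. The following is a minimal sketch of such a breakdown, not the benchmark's actual schema or evaluation code: the `Sample` class, its field names, the integer encoding of the six length levels and four difficulty levels, and the `score` field are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """Hypothetical record; field names are illustrative, not the dataset's."""
    language: str          # "en" or "zh"
    context: str           # "full" or "partial" dependency
    length_level: int      # assumed encoding: 1..6 for the six length levels
    difficulty_level: int  # assumed encoding: 1..4 for the four difficulty levels
    score: float           # assumed per-sample metric value in [0, 1]


def mean_score_by(samples, field):
    """Average the per-sample score within each value of one taxonomy field."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[getattr(s, field)].append(s.score)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # Made-up scores, used only to show the call pattern.
    demo = [
        Sample("en", "full", 1, 1, 1.0),
        Sample("en", "partial", 1, 2, 0.5),
        Sample("zh", "full", 2, 1, 0.0),
    ]
    for field in ("language", "context", "length_level", "difficulty_level"):
        print(field, mean_score_by(demo, field))
```

Grouping on one field at a time keeps each slice easy to read; a joint breakdown (for example language by length level) would need a tuple key instead.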

Code & Models

Datasets

caskcsg/LongBench-Pro
dataset · 890 downloads

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications