LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

TL;DR
LongBench v2 is a comprehensive benchmark with 503 challenging long-context questions across diverse tasks, designed to evaluate and improve the deep understanding and reasoning capabilities of large language models in real-world scenarios.
Contribution
This paper introduces LongBench v2, a new benchmark with diverse long-context tasks and high-quality data, highlighting the need for enhanced reasoning and scaling in LLMs.
Findings
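The best-performing model achieves only 50.1% accuracy when answering questions directly.
LongBench v2 questions are challenging: human experts reach only 53.7% accuracy under a 15-minute time constraint.
Models with enhanced reasoning, such as o1-preview, outperform the human baseline by 4 percentage points.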
Abstract
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when…
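Because the benchmark is multiple-choice and reported as accuracy, scoring can be sketched in a few lines. The following is a minimal illustration, not the authors' evaluation code: the record fields `category`, `answer`, and `prediction` are hypothetical names chosen here, and the sample records are made up.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) for multiple-choice records.

    Each record is a dict with hypothetical keys:
      "category"   - task category name
      "answer"     - gold choice label
      "prediction" - model's chosen label
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec["category"]] += 1
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_category


# Made-up records for illustration only; these are not benchmark data.
records = [
    {"category": "single-document QA", "answer": "A", "prediction": "A"},
    {"category": "single-document QA", "answer": "C", "prediction": "B"},
    {"category": "code repository understanding", "answer": "D", "prediction": "D"},
    {"category": "multi-document QA", "answer": "B", "prediction": "B"},
]

overall, per_category = accuracy_by_category(records)
print(f"overall: {overall:.1%}")
for cat, acc in per_category.items():
    print(f"{cat}: {acc:.1%}")
```

Reporting accuracy per category alongside the overall figure makes it easier to see whether a model's score is driven by one task type, since the six categories differ in context length and structure.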
Taxonomy
Topics: Topic Modeling · Semantic Web and Ontologies
