LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Yushi Bai; Shangqing Tu; Jiajie Zhang; Hao Peng; Xiaozhi Wang; Xin Lv; Shulin Cao; Jiazheng Xu; Lei Hou; Yuxiao Dong; Jie Tang; Juanzi Li

arXiv:2412.15204·cs.CL·January 6, 2025·2 cites

TL;DR

LongBench v2 is a comprehensive benchmark with 503 challenging long-context questions across diverse tasks, designed to evaluate and improve the deep understanding and reasoning capabilities of large language models in real-world scenarios.

Contribution

This paper introduces LongBench v2, a new benchmark with diverse long-context tasks and high-quality data, highlighting the need for enhanced reasoning and scaling in LLMs.

Findings

01 The best-performing model reaches 50.1% accuracy on LongBench v2 when answering directly.

02 The questions are hard even for human experts, who reach only 53.7% accuracy under a 15-minute time limit.

03 Reasoning-enhanced models such as o1-preview outperform the human baseline by 4%.

Abstract

This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when…
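Since every question is multiple-choice, the reported numbers are plain accuracies. As a rough illustration only, the sketch below scores a handful of invented records overall and per task category. The field names (`category`, `answer`, `prediction`) and the toy data are assumptions made for this example; they are not the schema or evaluation code of the official release.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}) for multiple-choice records.

    Each record is a dict with hypothetical keys:
      "category"   - task category name
      "answer"     - gold option label
      "prediction" - option label chosen by the model
    """
    if not records:
        raise ValueError("records must not be empty")

    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["category"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["category"]] += 1

    per_category = {name: correct[name] / total[name] for name in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category


# Toy records, invented purely for illustration.
records = [
    {"category": "single-document QA", "answer": "A", "prediction": "A"},
    {"category": "single-document QA", "answer": "C", "prediction": "B"},
    {"category": "code repository understanding", "answer": "D", "prediction": "D"},
    {"category": "long in-context learning", "answer": "B", "prediction": "B"},
]

overall, per_category = accuracy_by_category(records)
print(f"overall accuracy: {overall:.1%}")
for name, value in sorted(per_category.items()):
    print(f"{name}: {value:.1%}")
```

Reporting per-category accuracy next to the overall figure makes it easier to see which of the six task categories drive a model's score.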

Code & Models

Repositories

thudm/longbench
PyTorch · Official

Videos

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks · underline

Taxonomy

Topics: Topic Modeling · Semantic Web and Ontologies