XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies
Xuanfan Ni, Hengyi Cai, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Piji Li

TL;DR
This paper introduces XL$^2$Bench, a comprehensive benchmark designed to evaluate large language models' ability to understand extremely long texts with long-range dependencies, addressing a critical gap in current evaluation methods.
Contribution
The paper presents a new benchmark with diverse long-text scenarios and tasks, covering 27 subtasks in English and Chinese, to better assess LLMs' long context understanding capabilities.
Findings
LLMs perform significantly worse than humans on XL$^2$Bench.
Performance declines are consistent across datasets, indicating challenges in long-text comprehension.
The benchmark effectively highlights the limitations of current LLMs in handling extremely long contexts.
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks but are constrained by their small context window sizes. Various efforts have been proposed to expand the context window to accommodate even up to 200K input tokens. Meanwhile, building high-quality benchmarks with much longer text lengths and more demanding tasks to provide comprehensive evaluations is of immense practical interest to facilitate long context understanding research of LLMs. However, prior benchmarks create datasets that ostensibly cater to long-text comprehension by expanding the input of traditional tasks, which falls short of exhibiting the unique characteristics of long-text understanding, including long dependency tasks and longer text length compatible with modern LLMs' context window size. In this paper, we introduce a benchmark for extremely long context understanding with…
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Human Pose and Action Recognition · Image Retrieval and Classification Techniques
