Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li

TL;DR
This paper introduces Loong, a realistic benchmark for evaluating long-context LLMs through multi-document QA tasks that reflect real-world scenarios, revealing current models' limitations.
Contribution
The paper presents Loong, a novel benchmark with realistic multi-document QA tasks and diverse context lengths, addressing limitations of previous benchmarks.
Findings
Existing models show significant room for improvement.
Retrieval augmented generation performs poorly on Loong.
Loong effectively assesses long-context understanding.
Abstract
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases, each document is relevant to the final answer; ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices
