Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

Minzheng Wang; Longze Chen; Cheng Fu; Shengyi Liao; Xinghua Zhang; Bingli Wu; Haiyang Yu; Nan Xu; Lei Zhang; Run Luo; Yunshui Li; Min Yang; Fei Huang; Yongbin Li

arXiv:2406.17419·cs.CL·October 4, 2024·1 cites

TL;DR

This paper introduces Loong, a realistic benchmark for evaluating long-context LLMs through multi-document QA tasks that reflect real-world scenarios, revealing current models' limitations.

Contribution

The paper presents Loong, a novel benchmark with realistic multi-document QA tasks and diverse context lengths, addressing limitations of previous benchmarks.

Findings

01. Existing models show significant room for improvement.

02. Retrieval-augmented generation performs poorly on Loong.

03. Loong effectively assesses long-context understanding.

Abstract

Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases, each document is relevant to the final answer; ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to…

Code & Models

Repositories

mozerwang/loong (Official)

Videos

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA · Underline

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices