CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan, Fei Huang

TL;DR
CorpusQA introduces a large-scale benchmark for evaluating language models' ability to perform reasoning across extensive document collections, highlighting current limitations and proposing new directions.
Contribution
The paper presents a novel 10-million-token benchmark and a data synthesis framework for challenging corpus-level reasoning tasks, improving evaluation and training of long-context models.
Findings
State-of-the-art models struggle as input length increases.
Standard retrieval methods fail on large, dispersed corpora.
Memory-augmented architectures outperform traditional models.
Abstract
While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single long texts or rely on a "sparse retrieval" assumption: that answers can be derived from a few relevant chunks. This assumption fails for true corpus-level analysis, where evidence is highly dispersed across hundreds of documents and answers require global integration, comparison, and statistical aggregation. To address this critical gap, we introduce CorpusQA, a new benchmark scaling up to 10 million tokens, generated via a novel data synthesis framework. By decoupling reasoning from textual representation, this framework creates complex, computation-intensive queries with programmatically guaranteed ground-truth answers, challenging systems to perform holistic…
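To make "decoupling reasoning from textual representation" concrete, here is a minimal Python sketch. It is not the authors' framework. The `Record` schema, the report template, and the sample figures are all hypothetical and chosen only for illustration. The answer is computed from structured records, while the corpus a model would read is rendered from those same records as free text, so the ground truth is correct by construction.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Record:
    """Hypothetical structured fact that underlies one document."""
    company: str
    year: int
    revenue_musd: float


def render_document(r: Record) -> str:
    # Textual representation: the only thing a model under test would see.
    return (f"Annual report. In fiscal year {r.year}, {r.company} "
            f"reported revenue of {r.revenue_musd:.1f} million USD.")


def ground_truth_mean_revenue(records, year):
    # Reasoning step: computed on the structured data, never on the text.
    return mean(r.revenue_musd for r in records if r.year == year)


def build_example(records, year):
    corpus = [render_document(r) for r in records]
    question = ("What was the average revenue, in million USD, across all "
                f"companies in fiscal year {year}?")
    return corpus, question, ground_truth_mean_revenue(records, year)


# Illustrative data only; these values do not come from the benchmark.
records = [
    Record("Acme", 2023, 150.0),
    Record("Birch", 2023, 80.0),
    Record("Cobalt", 2023, 100.0),
    Record("Acme", 2022, 90.0),
]
corpus, question, answer = build_example(records, 2023)

# An answer built from only the first two documents differs from the
# corpus-wide answer, which is why a few retrieved chunks are not enough
# for aggregation queries.
partial = mean(r.revenue_musd for r in records[:2] if r.year == 2023)
```

In this toy case the full-corpus average and the two-document average disagree, which mirrors the abstract's point that aggregation queries need evidence from every relevant document rather than a handful of retrieved chunks.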
