PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts

Mo Yu; Tsz Ting Chung; Chulun Zhou; Tong Li; Rui Lu; Jiangnan Li; Liyan Xu; Haoshu Lu; Ning Zhang; Jing Li; Jie Zhou

arXiv:2508.09848 · cs.CL · August 15, 2025


TL;DR

PRELUDE is a benchmark that tests whether models can understand and reason over long narratives by asking them to judge if a character's prequel story is consistent with the original book. Current approaches, including RAG, in-domain training, and commercial DeepResearch services, trail humans by more than 15%.

Contribution

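The paper introduces PRELUDE, a benchmark built around prequel-consistency judgments. Because the prequels are not part of the original story, answering typically requires finding and integrating indirectly related evidence from across the book rather than locating a single passage.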

Findings

01 88% of instances require evidence from multiple parts of the narrative

02 Models lag behind humans by more than 15%

03 Models often reach correct answers through flawed reasoning

Abstract

We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in…
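The task described in the abstract is a binary judgment: given a book and a character's prequel story, decide whether the prequel is consistent with the canonical narrative. The sketch below shows one way such an instance and a scoring loop could be represented. It is not the authors' code or data format: the `PrequelInstance` fields, the `judge` callable, and the use of plain accuracy as the metric are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PrequelInstance:
    """Hypothetical record for one benchmark item (field names are assumed)."""
    book_text: str      # full text of the original book (the long context)
    character: str      # character the prequel is about
    prequel: str        # candidate prequel story to be judged
    consistent: bool    # gold label: True if the prequel fits the canon


def accuracy(
    instances: Sequence[PrequelInstance],
    judge: Callable[[PrequelInstance], bool],
) -> float:
    """Fraction of instances where the judge's verdict matches the gold label.

    Accuracy is an illustrative choice; the abstract does not name the metric.
    """
    if not instances:
        raise ValueError("need at least one instance")
    correct = sum(judge(item) == item.consistent for item in instances)
    return correct / len(instances)


def always_consistent(item: PrequelInstance) -> bool:
    """Trivial baseline that accepts every prequel (stand-in for a model)."""
    return True
```

In practice `judge` would wrap a long-context model or a retrieval pipeline. Keeping it as a plain callable lets different approaches be scored by the same loop.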
