ChapterBreak: A Challenge Dataset for Long-Range Language Models
Simeng Sun, Katherine Thai, Mohit Iyyer

TL;DR
ChapterBreak introduces a challenging dataset for evaluating long-range language models' ability to understand discourse at the chapter level, revealing current models' limitations in leveraging global context.
Contribution
The paper presents a new dataset, ChapterBreak, designed to evaluate long-range language models on complex narrative chapter transitions, highlighting their current shortcomings.
Findings
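Existing LRLMs substantially underperform a segment-level model trained directly for this task
Current models fail to effectively leverage long-range context
Human annotation shows the dataset contains many complex chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context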
Abstract
While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on ChapterBreak show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We…
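The task described in the abstract can be read as a multiple-choice ranking problem: given a long prefix ending at a chapter boundary, pick the true next-chapter beginning from a set of candidates drawn from the same narrative. The following is a minimal sketch of that evaluation loop, not the authors' implementation. It assumes each candidate receives one scalar score given the prefix (for an LRLM this might be a conditional log-likelihood, which is an assumption here). The names `Example`, `pick_continuation`, `accuracy`, and `overlap_score` are hypothetical, and `overlap_score` is only a toy stand-in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Example:
    """One hypothetical ChapterBreak-style item."""
    prefix: str                 # long segment ending at a chapter boundary
    candidates: Sequence[str]   # gold next-chapter start plus negatives
    gold_index: int             # position of the gold candidate


# A scorer maps (prefix, candidate) to a number; higher means more plausible.
Scorer = Callable[[str, str], float]


def pick_continuation(example: Example, score: Scorer) -> int:
    """Return the index of the highest-scoring candidate."""
    scores = [score(example.prefix, c) for c in example.candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: Sequence[Example], score: Scorer) -> float:
    """Fraction of examples where the gold candidate is ranked first."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(pick_continuation(ex, score) == ex.gold_index for ex in examples)
    return correct / len(examples)


def overlap_score(prefix: str, candidate: str) -> float:
    """Toy stand-in scorer (assumption): share of candidate words seen in the prefix."""
    prefix_words = set(prefix.lower().split())
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    return sum(w in prefix_words for w in cand_words) / len(cand_words)
```

Swapping `overlap_score` for a function that returns a model's score of the candidate given the prefix would turn this into an evaluation of that model. Varying how much of the prefix the scorer sees is one way to probe whether long-range context helps.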
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
