ChapterBreak: A Challenge Dataset for Long-Range Language Models

Simeng Sun; Katherine Thai; Mohit Iyyer

arXiv:2204.10878 · cs.CL · April 26, 2022

TL;DR

ChapterBreak introduces a challenging dataset for evaluating long-range language models' ability to understand discourse at the chapter level, revealing current models' limitations in leveraging global context.

Contribution

The paper presents a new dataset, ChapterBreak, designed to evaluate long-range language models on complex narrative chapter transitions, highlighting their current shortcomings.

Findings

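01 Existing LRLMs substantially underperform a segment-level model trained directly for the task

02 Current models fail to make effective use of long-range (global) context

03 The dataset contains complex chapter transition types (e.g., parallel narratives, cliffhanger endings) that require global context to comprehend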

Abstract

While numerous architectures for long-range language models (LRLMs) have recently been proposed, a meaningful evaluation of their discourse-level language understanding capabilities has not yet followed. To this end, we introduce ChapterBreak, a challenge dataset that provides an LRLM with a long segment from a narrative that ends at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative. A fine-grained human annotation reveals that our dataset contains many complex types of chapter transitions (e.g., parallel narratives, cliffhanger endings) that require processing global context to comprehend. Experiments on ChapterBreak show that existing LRLMs fail to effectively leverage long-range context, substantially underperforming a segment-level model trained directly for this task. We…
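The task setup described in the abstract can be illustrated with a minimal sketch: given a prefix that ends at a chapter boundary and a set of candidate segments, a model picks the candidate it scores as the most plausible continuation, and accuracy is the fraction of examples where that pick is the ground-truth next chapter. Everything named below is an assumption for illustration: the `prefix` / `candidates` / `gold` fields, the function names, and especially `overlap_score`, which is a toy word-overlap scorer standing in for a real language model's scoring of a candidate given the prefix. It is not the paper's evaluation code or data format.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer type: higher means "more likely to follow the prefix".
Scorer = Callable[[str, str], float]


def pick_continuation(prefix: str, candidates: Sequence[str], score: Scorer) -> int:
    """Return the index of the candidate the scorer rates highest for this prefix."""
    scores = [score(prefix, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def accuracy(examples: List[Dict], score: Scorer) -> float:
    """Fraction of examples where the top-scored candidate is the gold continuation."""
    correct = sum(
        pick_continuation(ex["prefix"], ex["candidates"], score) == ex["gold"]
        for ex in examples
    )
    return correct / len(examples)


def overlap_score(prefix: str, candidate: str) -> float:
    """Toy stand-in for a model score: number of distinct words shared with the prefix."""
    prefix_words = set(prefix.lower().split())
    return float(len(prefix_words & set(candidate.lower().split())))


# Made-up example in an assumed format: one gold segment plus one negative.
examples = [
    {
        "prefix": "the ship sailed into the storm",
        "candidates": ["the storm broke the mast", "a quiet morning in town"],
        "gold": 0,
    }
]

result = accuracy(examples, overlap_score)
```

In practice the scorer would be replaced by whatever quantity a given model provides for a candidate conditioned on the prefix; the selection and accuracy logic stays the same.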

Code & Models

Datasets

chtmp223/suri
dataset· 107 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis