Chapter Captor: Text Segmentation in Novels

Charuta Pethe; Allen Kim; Steven Skiena

arXiv:2011.04163 · cs.CL · November 10, 2020

TL;DR

This paper introduces a new dataset and methods for segmenting novels into chapters, treating chapter boundary prediction as a proxy for segmenting long texts, and reports historical trends in how novels are divided into chapters.

Contribution

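It presents a new Project Gutenberg dataset of 9,126 English novels, a hybrid neural and rule-based method for recognizing chapter headers, cut-based and neural methods for chapter boundary detection, and insights into historical changes in the chapter structure of novels.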

Findings

01 Achieved an F1-score of 0.77 in chapter header recognition

02 Attained an F1-score of 0.453 in exact chapter boundary prediction

03 Revealed historical trends in the chapter structure of novels
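
The F1-scores above combine precision and recall. As a minimal sketch of how an exact-match boundary F1 could be computed, assuming boundaries are represented as integer positions (for example sentence or paragraph indices) and that a prediction counts only if it matches a true boundary exactly; the function name `boundary_f1` and this representation are illustrative assumptions, not the paper's evaluation code:

```python
def boundary_f1(predicted, gold):
    """Exact-match F1 between predicted and gold boundary positions.

    Both arguments are iterables of integer positions (an assumed
    representation; the paper's own evaluation script may differ).
    """
    pred, true = set(predicted), set(gold)
    true_positives = len(pred & true)

    # Precision: fraction of predicted breaks that are real breaks.
    precision = true_positives / len(pred) if pred else 0.0
    # Recall: fraction of real breaks that were predicted.
    recall = true_positives / len(true) if true else 0.0

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two of three predictions are correct; two of four gold breaks are found.
score = boundary_f1([10, 50, 90], [10, 55, 90, 120])
print(round(score, 3))
```

Under this strict criterion a prediction that is off by a single position earns no credit, which is one reason exact break prediction over book-length documents is hard.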

Abstract

Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
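
To illustrate the rule-matching half of the header recognition step described in the abstract, here is a minimal sketch that flags lines resembling chapter headers. The regular expression, the 80-character length limit, and the name `find_headers` are all assumptions made for illustration; the paper's actual rules and its neural component are not reproduced here.

```python
import re

# Hypothetical rule: a line that starts with "chapter" followed by an
# Arabic number or a Roman numeral. Real novels use many more header
# styles, so this pattern is deliberately narrow.
HEADER_RE = re.compile(r"^\s*chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def find_headers(lines, max_len=80):
    """Return indices of lines that look like chapter headers."""
    hits = []
    for index, line in enumerate(lines):
        # Headers are usually short; long lines are likely body text.
        if len(line) <= max_len and HEADER_RE.match(line):
            hits.append(index)
    return hits


sample = [
    "CHAPTER I",
    "It was a dark night.",
    "",
    "Chapter 2: The Storm",
    "He read the next chapter slowly.",
]
print(find_headers(sample))
```

A rule like this misses headers without the word "chapter" and can misfire on body text that happens to begin with it, which suggests why a purely rule-based approach might be combined with a learned model.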

Code & Models

Repositories

cpethe/chapter-captor
PyTorch · Official

Datasets

afoland/chapterized_PG
dataset · 10 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques