Chapter Captor: Text Segmentation in Novels
Charuta Pethe, Allen Kim, Steven Skiena

TL;DR
This paper introduces a new dataset and methods for segmenting novels into chapters, using chapter boundary prediction as a proxy for segmenting long texts in general, and reveals historical trends in chapter organization.
Contribution
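It presents a new Project Gutenberg dataset of 9,126 English novels, a hybrid approach (neural inference combined with rule matching) for recognizing chapter title headers, cut-based and neural methods for chapter boundary detection, and insights into historical changes in the chapter structure of novels.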
Findings
Achieved 0.77 F1-score in chapter header recognition
Attained 0.453 F1-score in exact chapter boundary prediction
Revealed historical trends in novel chapter structures
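To make the "exact chapter boundary prediction" score concrete, here is a minimal sketch of how an F1-score over exact break positions could be computed. It assumes boundaries are given as integer positions (for example sentence or paragraph indices) and that a prediction counts only if it matches a true boundary exactly; the function name `boundary_f1` and this matching rule are illustrative assumptions, not the authors' evaluation code.

```python
def boundary_f1(predicted, gold):
    """Return (precision, recall, F1) for exact-match boundary prediction.

    `predicted` and `gold` are iterables of integer boundary positions.
    Assumption: a predicted boundary is correct only if the same
    position appears in the gold set (no tolerance window).
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 3 predicted breaks, 4 true breaks, 2 exact matches.
precision, recall, f1 = boundary_f1([10, 20, 30], [10, 25, 30, 40])
```

Under exact matching, a break predicted one position away from the true boundary counts as both a false positive and a false negative, which is one reason scores on book-length documents can be low.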
Abstract
Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
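The rule-matching half of the hybrid header recognizer can be pictured with a small sketch like the one below. The regular expression, the keyword list (`chapter`, `book`, `part`), and the function name `find_headers` are hypothetical choices for illustration; the abstract does not specify the authors' actual rules, and the neural component is not shown.

```python
import re

# Hypothetical rule: a line that starts with a structural keyword
# followed by a Roman or Arabic numeral, with an optional title.
HEADER_RE = re.compile(
    r"^\s*(chapter|book|part)\s+([ivxlcdm]+|\d+)\b[.:]?\s*(.*)$",
    re.IGNORECASE,
)


def find_headers(text):
    """Return (line_index, stripped_line) for lines that look like chapter headers."""
    hits = []
    for index, line in enumerate(text.splitlines()):
        if HEADER_RE.match(line):
            hits.append((index, line.strip()))
    return hits


sample = (
    "Some preface\n"
    "CHAPTER I\n"
    "It was a dark night.\n"
    "Chapter 2. The Return\n"
    "The chapter of accidents"
)
headers = find_headers(sample)
```

A rule this simple misses headers that carry no keyword or numeral and can fire on ordinary sentences that happen to begin with a keyword and a numeral, which suggests why pairing rules with a learned model is attractive.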
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
