Mapping Mutable Genres in Structurally Complex Volumes
Ted Underwood, Michael L. Black, Loretta Auvil, Boris Capitanu

TL;DR
This paper presents a multi-layered approach to dividing large digital library collections by genre: hidden Markov models segment internally heterogeneous volumes, and ensembles of overlapping classifiers account for gradual historical change in the genres themselves.
Contribution
It combines hidden Markov model segmentation with ensemble classification to map genres in large digital libraries, addressing two problems of scale: genres change gradually across a collection that spans several centuries, and volumes are long enough to be internally heterogeneous.
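One way to picture an ensemble of overlapping classifiers is sketched below. This is a minimal illustration under stated assumptions, not the authors' implementation: the date ranges, the `predict` callables, the genre labels, and the simple averaging rule are all hypothetical. Each model is assumed to be trained on one time span, the spans overlap, and a volume is scored by every model whose span covers its publication year.

```python
# Hypothetical sketch: models trained on overlapping date ranges.
# The ranges, labels, and fixed outputs are invented for illustration.

def covering_models(year, models):
    """Return the models whose training span includes the given year."""
    return [m for m in models if m["start"] <= year <= m["end"]]

def ensemble_predict(year, features, models):
    """Average genre probabilities over all models that cover the year."""
    active = covering_models(year, models)
    if not active:
        raise ValueError(f"no model covers year {year}")
    totals = {}
    for model in active:
        for genre, prob in model["predict"](features).items():
            totals[genre] = totals.get(genre, 0.0) + prob
    return {genre: total / len(active) for genre, total in totals.items()}

# Toy stand-ins for trained classifiers: each ignores its input and
# returns a fixed distribution, so the averaging step is easy to follow.
toy_models = [
    {"start": 1700, "end": 1850,
     "predict": lambda feats: {"fiction": 0.6, "nonfiction": 0.4}},
    {"start": 1800, "end": 1922,
     "predict": lambda feats: {"fiction": 0.8, "nonfiction": 0.2}},
]

early = ensemble_predict(1750, None, toy_models)    # only the first model applies
overlap = ensemble_predict(1820, None, toy_models)  # both models apply
```

In this toy setup, a volume dated inside the overlap is scored by both models, so no single date boundary decides which definition of a genre is applied to it.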
Findings
Tested the approach on a collection of 469,200 volumes drawn from HathiTrust
Extracted and analyzed 32,209 fiction volumes for narrative perspective trends
Identified genre-specific associations with narrative points of view
Abstract
To mine large digital libraries in humanistically meaningful ways, scholars need to divide them by genre. This is a task that classification algorithms are well suited to assist, but they need adjustment to address the specific challenges of this domain. Digital libraries pose two problems of scale not usually found in the article datasets used to test these algorithms. 1) Because libraries span several centuries, the genres being identified may change gradually across the time axis. 2) Because volumes are much longer than articles, they tend to be internally heterogeneous, and the classification task needs to begin with segmentation. We describe a multi-layered solution that trains hidden Markov models to segment volumes, and uses ensembles of overlapping classifiers to address historical change. We test this approach on a collection of 469,200 volumes drawn from HathiTrust Digital…
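The segmentation step described above can be illustrated with a small sketch. It assumes, hypothetically, that a page-level classifier has already produced genre probabilities for each page, and it uses a hidden Markov model with a high probability of staying in the same genre so that Viterbi decoding groups pages into contiguous segments. The genre labels, the `stay` parameter, and the example probabilities are invented for illustration and are not taken from the paper.

```python
import math

# Hypothetical genre labels; the paper's actual categories may differ.
GENRES = ["front matter", "fiction", "back matter"]

def viterbi_smooth(page_probs, stay=0.9):
    """Decode the most likely genre sequence for a volume's pages.

    page_probs: one list of per-genre probabilities per page (assumed to
    come from some page-level classifier). stay: assumed probability that
    the next page keeps the current page's genre.
    """
    n = len(GENRES)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]

    def log_emit(p):
        return math.log(max(p, 1e-12))  # guard against log(0)

    # Uniform prior over the genre of the first page.
    score = [math.log(1.0 / n) + log_emit(p) for p in page_probs[0]]
    backpointers = []
    for probs in page_probs[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][j] + log_emit(probs[j]))
        backpointers.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [GENRES[s] for s in path]

# Invented example: page 5 looks slightly more like front matter when
# judged alone, but sits in the middle of a run of fiction pages.
example_pages = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.80, 0.10],
    [0.45, 0.40, 0.15],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
]
smoothed = viterbi_smooth(example_pages)
```

Because switching genres is penalized, the ambiguous fifth page is assigned to the surrounding fiction segment rather than being labeled on its own evidence, which is the kind of smoothing that motivates treating segmentation as a sequence problem.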
