Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set
Arthur M. Jacobs, Annette Kinder

TL;DR
This paper demonstrates that quasi error-free text classification and authorship recognition are achievable across the entire Gutenberg Literary English Corpus using a consistent set of style and content features, advancing digital humanities research.
Contribution
Introduces a novel, unified feature set and method for accurate text classification and authorship recognition across diverse literary texts in the GLEC.
Findings
High accuracy in text classification and authorship recognition across the corpus
Identification of key diagnostic features including type-token ratio and surprise
A simple, versatile tool applicable to both short poems and long novels
Abstract
The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. However, so far only a small subcorpus, the Gutenberg English Poetry Corpus, has been submitted to quantitative text analyses providing predictions for scientific studies of literature. Here we show that in the entire GLEC quasi error-free text classification and authorship recognition are possible with a method using the same set of five style and five content features, computed via style and sentiment analysis, in both tasks. Our results identify two standard and two novel features (i.e., type-token ratio, frequency, sonority score, surprise) as most diagnostic in these tasks. By providing a simple tool applicable to both short poems and long novels generating quantitative predictions about features that co-determe…
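For readers unfamiliar with the type-token ratio named above as one of the most diagnostic features, the following is a minimal sketch of how such a ratio is commonly computed. It is not the authors' implementation: the function name `type_token_ratio`, the lowercasing step, and the regex tokenizer are assumptions made for illustration, and the paper may tokenize or normalize text differently.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by total words (tokens).

    Assumption: tokens are lowercase runs of letters and apostrophes;
    the paper's own tokenization is not specified in this summary.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        # Empty or non-alphabetic input: define the ratio as 0.0
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat too"
    # 8 distinct words out of 11 tokens
    print(round(type_token_ratio(sample), 3))
```

Because the raw ratio tends to fall as texts get longer, comparisons between short poems and long novels usually need some form of length control; this summary does not say how the paper handles that.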
Taxonomy
Topics: Authorship Attribution and Profiling · Sentiment Analysis and Opinion Mining · Advanced Text Analysis Techniques
