Explorations in an English Poetry Corpus: A Neurocognitive Poetics Perspective
Arthur M. Jacobs

TL;DR
This paper introduces a large English poetry corpus and demonstrates its utility for digital humanities, NLP, and neurocognitive poetics research through quantitative analysis of author and text features.
Contribution
It provides a new, extensive poetry corpus with analytical tools and demonstrates its application in author similarity, topic detection, and sentiment analysis.
Findings
Author similarities based on latent semantic analysis
Identification of significant topics per author
Text-analytic metrics for lexical diversity and sentiment
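As a rough illustration of the kind of lexical-diversity metric mentioned above, here is a minimal sketch of the type-token ratio (distinct words divided by total words). The summary does not say which diversity measure the paper uses, so the choice of type-token ratio, the simple regex tokenizer, and the function names `tokenize` and `type_token_ratio` are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a word is a run of ASCII letters, optionally followed by
    an apostrophe and more letters (e.g. "lov'd"). A real poetry corpus
    would likely need a more careful tokenizer.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(text):
    """Return distinct word types divided by total word tokens.

    Returns 0.0 for text with no word tokens, to avoid dividing by zero.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

The type-token ratio falls as texts get longer, so comparisons across poems of very different lengths would need a length-normalized variant.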
Abstract
This paper describes a corpus of about 3000 English literary texts, with about 250 million words, extracted from Project Gutenberg. The texts span a range of genres from both fiction and non-fiction and were written by more than 130 authors (e.g., Darwin, Dickens, Shakespeare). Quantitative Narrative Analysis (QNA) is used to explore a cleaned subcorpus, the Gutenberg English Poetry Corpus (GEPC), which comprises over 100 poetic texts with around 2 million words from about 50 authors (e.g., Keats, Joyce, Wordsworth). Some exemplary QNA studies show author similarities based on latent semantic analysis, significant topics for each author, and various text-analytic metrics (e.g., lexical diversity and sentiment) for George Eliot's poem 'How Lisa Loved the King' and James Joyce's 'Chamber Music'. The GEPC is particularly suited for research in Digital Humanities, Natural Language…
Taxonomy
Topics: Advanced Text Analysis Techniques · Digital Humanities and Scholarship · Topic Modeling
