Explorations in an English Poetry Corpus: A Neurocognitive Poetics Perspective

Arthur M. Jacobs

arXiv:1801.02054·cs.CL·January 9, 2018

TL;DR

This paper introduces a large English poetry corpus and demonstrates its utility for digital humanities, NLP, and neurocognitive poetics research through quantitative analysis of author and text features.

Contribution

It provides a new, extensive poetry corpus with analytical tools and demonstrates its application in author similarity, topic detection, and sentiment analysis.

Findings

01 Author similarities based on latent semantic analysis

02 Identification of significant topics per author

03 Text-analytic metrics for lexical diversity and sentiment

Abstract

This paper describes a corpus of about 3000 English literary texts with about 250 million words extracted from the Gutenberg project that span a range of genres from both fiction and non-fiction written by more than 130 authors (e.g., Darwin, Dickens, Shakespeare). Quantitative Narrative Analysis (QNA) is used to explore a cleaned subcorpus, the Gutenberg English Poetry Corpus (GEPC) which comprises over 100 poetic texts with around 2 million words from about 50 authors (e.g., Keats, Joyce, Wordsworth). Some exemplary QNA studies show author similarities based on latent semantic analysis, significant topics for each author or various text-analytic metrics for George Eliot's poem 'How Lisa Loved the King' and James Joyce's 'Chamber Music', concerning e.g. lexical diversity or sentiment analysis. The GEPC is particularly suited for research in Digital Humanities, Natural Language…
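
As an illustration of the kind of text-analytic metric the abstract mentions, the sketch below computes a type-token ratio (TTR), one common measure of lexical diversity. The abstract does not say which diversity measure the paper uses, so TTR is an assumption here, and the function names `tokenize` and `type_token_ratio` as well as the sample sentence are hypothetical and not taken from the corpus.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return the number of distinct word types divided by the number of tokens.

    Returns 0.0 for text that contains no tokens.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample text, not drawn from the GEPC.
sample = "the cat sat on the mat and the dog sat too"
print(f"TTR = {type_token_ratio(sample):.3f}")
```

TTR tends to fall as a text gets longer, so comparing texts of very different lengths usually calls for a length-normalized variant or fixed-size samples.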

Taxonomy

Topics: Advanced Text Analysis Techniques · Digital Humanities and Scholarship · Topic Modeling