Evaluating Diversity in Automatic Poetry Generation
Yanran Chen, Hannes Gröner, Sina Zarrieß, Steffen Eger

TL;DR
This paper assesses the diversity of automatically generated poetry, revealing current models' limitations in variety and style, and demonstrating that style-conditioning and character-level modeling improve diversity.
Contribution
It introduces a comprehensive evaluation of diversity in automatic poetry generation across multiple dimensions and compares various model types and fine-tuning methods.
Findings
Current models are underdiverse in rhyme, semantics, and length.
Style-conditioning and character-level modeling increase diversity.
The evaluation identifies key limitations for future poetry generation systems to address.
Abstract
Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation -- can humans distinguish between automatic and human generated poetry -- we evaluate the diversity of automatically generated poetry (with a focus on quatrains), by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions, assessing different model types (word vs. character-level, general purpose LLMs vs. poetry-specific models), including the very recent LLaMA3-8B, and types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably…
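As a rough illustration of what comparing a distribution of generated poetry to a distribution of human poetry can look like for one structural dimension (length), here is a minimal sketch. It is not taken from the paper: the choice of histogram intersection as the overlap score, the whitespace tokenization, and the function names `length_distribution` and `histogram_intersection` are all assumptions made for this example, and the toy poems are invented.

```python
from collections import Counter


def length_distribution(poems):
    """Map each poem length (in whitespace tokens) to its relative frequency.

    Whitespace tokenization is an assumption of this sketch.
    """
    counts = Counter(len(poem.split()) for poem in poems)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def histogram_intersection(p, q):
    """Overlap of two discrete distributions.

    1.0 means identical distributions, 0.0 means no shared support.
    This score is an illustrative choice, not the paper's stated metric.
    """
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))


# Invented toy data: human poems vary in length, generated ones mostly do not.
human = ["a b c d", "a b c d e f", "a b", "a b c d e f g h"]
generated = ["a b c d", "a b c d", "a b c d", "a b c d e f"]

overlap = histogram_intersection(
    length_distribution(human), length_distribution(generated)
)
print(f"length overlap: {overlap:.2f}")
```

A low overlap would suggest that the generated sample covers only part of the range of lengths seen in the human sample, which is the kind of underdiversity the findings describe. The same pattern could in principle be applied to other countable features, such as rhyme patterns.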
Taxonomy
Topics: Artificial Intelligence in Games · Educational Games and Gamification · Human Motion and Animation
Methods: Focus
