Training Data Size Sensitivity in Unsupervised Rhyme Recognition

Petr Plecháč; Artjoms Šeļa; Silvie Cinková; Mirella De Sisto; Lara Nugues; Neža Kočnik; Antonina Martynenko; Ben Nagy; Luca Giovannini; Robert Kolár

arXiv:2604.08156 · cs.CL · April 10, 2026

TL;DR

This paper explores how training data size impacts the accuracy of unsupervised rhyme recognition across multiple languages, using RhymeTagger and comparing it to large language models.

Contribution

It provides a comprehensive analysis of training data requirements for reliable rhyme recognition and benchmarks RhymeTagger against human agreement and LLMs.

Findings

1. RhymeTagger outperforms human agreement with sufficient data

2. LLMs struggle without phonetic representations

3. Training size significantly influences rhyme recognition accuracy

Abstract

Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words rhyme. This complicates automated rhyme recognition and evaluation, especially in a multilingual context. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic…
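
The abstract describes rhyme recognition driven by repeating patterns in a poetry corpus. The Python sketch below illustrates that general idea only; it is not RhymeTagger's actual algorithm or API. It assumes that the last two letters of a line stand in for a phonetic line ending, that rhymes occur between adjacent lines, and that a pair of endings counts as a rhyme when it co-occurs more often than chance predicts. All names (`ending`, `train`, `association`, `tag`, `POEMS`) and the threshold of 1.0 are hypothetical.

```python
from collections import Counter

def ending(line, n=2):
    """Crude stand-in for a phonetic ending: last n letters of the last word."""
    words = [w.strip(".,;:!?\"'()").lower() for w in line.split()]
    words = [w for w in words if w]
    return words[-1][-n:] if words else ""

def train(poems, window=1):
    """Count endings and how often two endings fall within `window` lines."""
    pair_counts, ending_counts = Counter(), Counter()
    for poem in poems:
        ends = [ending(line) for line in poem]
        ending_counts.update(ends)
        for i, a in enumerate(ends):
            for b in ends[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((a, b)))] += 1
    return pair_counts, ending_counts

def association(a, b, pair_counts, ending_counts):
    """Observed co-occurrence rate divided by the rate expected by chance."""
    total_pairs = sum(pair_counts.values())
    total_ends = sum(ending_counts.values())
    if not total_pairs or not total_ends:
        return 0.0
    observed = pair_counts[tuple(sorted((a, b)))] / total_pairs
    pa = ending_counts[a] / total_ends
    pb = ending_counts[b] / total_ends
    # An unordered pair of distinct endings can arise in two orders.
    expected = pa * pb if a == b else 2 * pa * pb
    return observed / expected if expected else 0.0

def tag(poem, model, window=1, threshold=1.0):
    """Return index pairs of nearby lines whose endings are associated."""
    pair_counts, ending_counts = model
    ends = [ending(line) for line in poem]
    rhymes = []
    for i in range(len(ends)):
        for j in range(i + 1, min(i + 1 + window, len(ends))):
            score = association(ends[i], ends[j], pair_counts, ending_counts)
            if score > threshold:
                rhymes.append((i, j))
    return rhymes

# Toy corpus, invented for illustration only.
POEMS = [
    ["The stars are bright", "across the night",
     "the river ran", "beside the man"],
    ["A candle's light", "held out of sight",
     "the day began", "with a simple plan"],
]

model = train(POEMS)
print(tag(POEMS[0], model))
```

On this toy corpus the sketch pairs lines 0 and 1 and lines 2 and 3 of the first poem. The association scores are only as trustworthy as the counts behind them, which is one way to see why the amount of training data matters for this family of methods.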
