A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry
Silvio Calderaro, Johanna Monti

TL;DR
This paper introduces A Bolu, a structured corpus of Sardinian improvisational poetry, and analyzes its linguistic features to support NLP development for minority oral languages.
Contribution
It creates the first structured dataset of Sardinian extemporaneous poetry and applies computational analysis to reveal recurring patterns and formulaic structures.
Findings
Poetry production shows recurring patterns supporting formulaicity theory.
The dataset enables better understanding of oral creativity in Sardinian.
Results contribute to developing inclusive NLP tools for minority languages.
Abstract
Growing interest in minority languages within Natural Language Processing (NLP) has not yet closed the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry, a performative genre based on real-time improvisation and metrical-rhetorical competence, remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map…
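As a rough illustration of what "descriptive statistical indices" and a search for recurring patterns could look like on a stanza-level corpus, the following is a minimal sketch. It is not the authors' pipeline: the function names, the whitespace tokenizer, the choice of type-token ratio, and the n-gram repetition measure are all assumptions made for illustration, and the toy stanzas are placeholder strings rather than text from A Bolu. For scale, the reported totals (141,321 tokens over 2,835 stanzas) imply roughly 49.8 tokens per stanza.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(stanza: str) -> list[str]:
    """Naive whitespace tokenizer (an assumption, not the paper's method)."""
    stripped = (word.strip(PUNCT).lower() for word in stanza.split())
    return [word for word in stripped if word]


def descriptive_stats(stanzas: list[str]) -> dict[str, float]:
    """Basic corpus-level indices: size, mean stanza length, type-token ratio."""
    tokenized = [tokenize(s) for s in stanzas]
    n_tokens = sum(len(toks) for toks in tokenized)
    types = {word for toks in tokenized for word in toks}
    return {
        "stanzas": len(stanzas),
        "tokens": n_tokens,
        "mean_tokens_per_stanza": n_tokens / len(stanzas) if stanzas else 0.0,
        "type_token_ratio": len(types) / n_tokens if n_tokens else 0.0,
    }


def repeated_ngrams(stanzas: list[str], n: int = 3, min_stanzas: int = 2) -> dict[tuple[str, ...], int]:
    """Return n-grams that occur in at least `min_stanzas` distinct stanzas.

    Counting each n-gram once per stanza is a crude proxy for formulaic reuse
    across performances; it is a hypothetical measure chosen for this sketch.
    """
    stanza_freq: Counter = Counter()
    for stanza in stanzas:
        toks = tokenize(stanza)
        grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        stanza_freq.update(grams)
    return {gram: count for gram, count in stanza_freq.items() if count >= min_stanzas}


# Placeholder stanzas (not drawn from the corpus).
toy_stanzas = [
    "alpha beta gamma delta",
    "alpha beta gamma epsilon",
    "zeta eta theta",
]

stats = descriptive_stats(toy_stanzas)
formulas = repeated_ngrams(toy_stanzas, n=3)
```

On real data, a tokenizer suited to Sardinian orthography and a stanza-aware notion of position (for example, line-initial or line-final n-grams) would likely matter more than the simple counts shown here.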
