Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect
Nikola Ljubešić, Peter Rupnik, Tea Perinčić

TL;DR
This paper presents a digitized version of The Little Prince in the Chakavian dialect, with text and audio aligned at the word level, to preserve cultural content and improve speech recognition models for dialectal speech.
Contribution
It introduces a new aligned audio-text dataset of The Little Prince in the Chakavian dialect and demonstrates its use in adapting speech recognition models.
Findings
Word error rate halved with model adaptation
Character error rate reduced by up to two thirds
Dataset enables diverse AI and dialectal research
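For readers unfamiliar with the two metrics named in the findings, the sketch below shows how word error rate (WER) and character error rate (CER) are conventionally computed: the edit distance between a reference transcript and a recognizer's output, divided by the reference length. It follows the standard definitions only; the evaluation tooling and text normalization used in the paper are not stated here, and the function names and the example sentence are illustrative assumptions, not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented toy example (not from the dataset): one misspelled word out of three.
print(wer("the little prince", "the litle prince"))
print(cer("the little prince", "the litle prince"))
```

In this toy case a single misspelled word counts as one full word error but only one character error, which is why CER is usually much lower than WER for the same output and why the two metrics can improve by different proportions.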
Abstract
This paper documents the release of the printed book and the audio book of the Chakavian translation of the famous novel The Little Prince as a computer-readable, AI-ready dataset, with the text and audio of the two releases now aligned at the level of each written and spoken word. We have several motivations for this release. The first is our wish to preserve this highly valuable and specific content beyond the small editions of the printed and the audio book. With the dataset published in the CLARIN.SI repository, the content is now at the fingertips of any interested individual. The second is to make the data available for various artificial-intelligence-related usage scenarios, such as the one we already pursue in this paper: adapting the Whisper-large-v3 open automatic speech recognition model, with…
Taxonomy
Topics: Diverse Musicological Studies · Language and cultural evolution · Forensic and Genetic Research
