The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings
Nikola Ljubešić, Peter Rupnik, Danijel Koržinek

TL;DR
This paper introduces a novel approach to creating large, high-quality speech-text aligned datasets for less-resourced languages using parliamentary proceedings, significantly aiding speech and language technology development.
Contribution
The paper presents a new method for aligning long sequences of speech and text in low-resource languages, demonstrated through three datasets for Croatian, Polish, and Serbian.
Findings
Created over 5,000 hours of speech-text aligned data
Developed a novel alignment approach for long sequences
Produced high-quality datasets for three Slavic languages
Abstract
Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data and from various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful type of explicit supervision is still the alignment between the speech signal and its corresponding text transcript, a data type that is not available for many languages. In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings. Our starting point is the ParlaMint comparable corpora of transcripts of parliamentary proceedings of 26 national European parliaments. In the pilot run on expanding the ParlaMint corpora with aligned publicly available recordings, we focus on three Slavic languages,…
Taxonomy
Topics: Natural Language Processing Techniques
