The ParlaSpeech Collection of Automatically Generated Speech and Text Datasets from Parliamentary Proceedings

Nikola Ljubešić; Peter Rupnik; Danijel Koržinek

arXiv:2409.15397·eess.AS·March 17, 2025

Open Access

TL;DR

This paper presents an approach to building large, open speech-and-text-aligned datasets for less-resourced languages from transcripts of parliamentary proceedings and their recordings, supplying the explicit supervision that speech and language technologies rely on.

Contribution

The paper presents a new method for aligning long sequences of speech and text in low-resource languages, demonstrated through three datasets for Croatian, Polish, and Serbian.

Findings

01 Created over 5,000 hours of speech-text aligned data

02 Developed a novel alignment approach for long sequences

03 Produced high-quality datasets for three Slavic languages

Abstract

Recent significant improvements in speech and language technologies come both from self-supervised approaches over raw language data and from various types of explicit supervision. To ensure high-quality processing of spoken data, the most useful type of explicit supervision is still the alignment between the speech signal and its corresponding text transcript, a data type that is not available for many languages. In this paper, we present our approach to building large and open speech-and-text-aligned datasets of less-resourced languages based on transcripts of parliamentary proceedings and their recordings. Our starting point is the ParlaMint comparable corpora of transcripts of parliamentary proceedings of 26 national European parliaments. In the pilot run on expanding the ParlaMint corpora with aligned publicly available recordings, we focus on three Slavic languages,…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus