ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian
Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek

TL;DR
ParlaSpeech 3.0 is a collection of spoken parliamentary corpora for Croatian, Czech, Polish, and Serbian, about 6 thousand hours in total, now enriched with automatic linguistic, sentiment, and filled-pause annotations.
Contribution
This release significantly enriches the ParlaSpeech corpora with automatic linguistic, sentiment, and disfluency annotations, expanding their utility for multidisciplinary research.
Findings
Corpora include 6,000 hours of speech data across four languages.
All four corpora are enriched with linguistic annotations and sentiment predictions on the transcripts, and with automatically detected filled pauses on the speech.
Enhanced datasets facilitate advanced analysis, exemplified by sentiment-related acoustic studies.
Abstract
ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - altogether 6 thousand hours in size. The corpora were built automatically from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of the respective parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation…
