Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus
Lena S. Oberkircher, Jesujoba O. Alabi, Dietrich Klakow, Jürgen Trouvain

TL;DR
Saar-Voice is a newly created six-hour speech corpus for the Saarbrücken dialect of German, aimed at advancing dialect-aware NLP and speech technologies in low-resource settings.
Contribution
The paper introduces Saar-Voice, a novel dialect speech corpus with aligned text and audio, addressing resource scarcity and methodological challenges in dialect speech processing.
Findings
The dataset comprises recordings from nine speakers, with analyses of both its textual and speech components.
Challenges in orthographic and speaker variation were identified.
The corpus supports future dialect-aware TTS research, especially in low-resource scenarios.
Abstract
Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus…
