Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

Lena S. Oberkircher; Jesujoba O. Alabi; Dietrich Klakow; Jürgen Trouvain

arXiv:2604.11803·cs.CL·April 14, 2026

TL;DR

Saar-Voice is a newly created six-hour speech corpus for the Saarbrücken dialect of German, aimed at advancing dialect-aware NLP and speech technologies in low-resource settings.

Contribution

The paper introduces Saar-Voice, a novel dialect speech corpus with aligned text and audio, addressing resource scarcity and methodological challenges in dialect speech processing.

Findings

01

The dataset comprises recordings from nine speakers, accompanied by analyses of its textual and speech components.

02

Challenges in orthographic and speaker variation were identified.

03

The corpus supports future dialect-aware TTS research, especially in low-resource scenarios.

Abstract

Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text through digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus…
