Voice Adaptation for Swiss German
Samuel Stucki, Jan Deriu, Mark Cieliebak

TL;DR
This paper studies voice adaptation for Swiss German dialects. The authors build approximately 5000 hours of weakly labeled podcast speech and fine-tune the XTTSv2 speech synthesis model on it, so that Standard German text is rendered as speech in the desired Swiss German dialect.
Contribution
It introduces a large weakly labeled dataset of Swiss German speech and demonstrates effective fine-tuning of the XTTSv2 model for dialect-specific voice synthesis.
Findings
Achieved CMOS scores of up to -0.28
SMOS scores of 3.8
Effective dialect rendering in speech synthesis
Abstract
This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.
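The CMOS and SMOS figures quoted above are averages over listener ratings. As a minimal sketch of how such scores are commonly aggregated, the snippet below assumes that CMOS ratings compare synthesized speech against a reference on a -3 to +3 scale and that SMOS ratings judge speaker similarity on a 1 to 5 scale. Both scales, the helper name `mean_with_ci`, and all rating values are illustrative assumptions, not details taken from the paper.

```python
import math
from statistics import mean, stdev


def mean_with_ci(ratings, z=1.96):
    """Return the mean rating and the half-width of an approximate 95% CI.

    Uses a normal approximation, which is only a rough guide for small samples.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    m = mean(ratings)
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return m, half_width


# Hypothetical listener ratings, invented for illustration only.
cmos_ratings = [-1, 0, 0, -1, 1, 0, -1, 0]  # assumed comparative scale, -3..+3
smos_ratings = [4, 4, 3, 5, 4, 3, 4, 4]     # assumed similarity scale, 1..5

cmos, cmos_ci = mean_with_ci(cmos_ratings)
smos, smos_ci = mean_with_ci(smos_ratings)

print(f"CMOS = {cmos:+.2f} ± {cmos_ci:.2f}")
print(f"SMOS = {smos:.2f} ± {smos_ci:.2f}")
```

Under these assumptions, a CMOS slightly below zero means listeners rated the synthesized speech marginally lower than the reference, while an SMOS near 4 indicates that the cloned voice was judged fairly close to the target speaker.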
Taxonomy
Topics: Speech Recognition and Synthesis · Linguistic Variation and Morphology · Natural Language Processing Techniques
