Voice Adaptation for Swiss German

Samuel Stucki; Jan Deriu; Mark Cieliebak

arXiv:2505.22054 · cs.CL · May 29, 2025

TL;DR

This paper explores voice adaptation for Swiss German dialects. The authors build a large, weakly labeled dataset from Swiss podcasts and fine-tune the XTTSv2 speech synthesis model on it so that it renders the desired dialect, a step towards adapting voice cloning technology to underrepresented languages.

Contribution

It introduces a weakly labeled dataset of approximately 5000 hours of Swiss German podcast speech, automatically transcribed and annotated with dialect classes, and shows that fine-tuning the XTTSv2 model on this data yields dialect-specific voice synthesis.

Findings

01 CMOS scores of up to -0.28

02 SMOS scores of 3.8

03 Correct rendering of the desired dialect in synthesized speech

Abstract

This work investigates the performance of Voice Adaptation models for Swiss German dialects, i.e., translating Standard German text to Swiss German dialect speech. For this, we preprocess a large dataset of Swiss podcasts, which we automatically transcribe and annotate with dialect classes, yielding approximately 5000 hours of weakly labeled training material. We fine-tune the XTTSv2 model on this dataset and show that it achieves good scores in human and automated evaluations and can correctly render the desired dialect. Our work shows a step towards adapting Voice Cloning technology to underrepresented languages. The resulting model achieves CMOS scores of up to -0.28 and SMOS scores of 3.8.
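
The abstract reports CMOS and SMOS values without spelling out how such scores are aggregated. As a rough illustration only, the sketch below assumes the common convention that CMOS is the mean of comparative ratings (here on an assumed -3 to +3 scale, negative meaning the synthesized sample was rated below the reference) and that SMOS is the mean of speaker-similarity ratings (here on an assumed 1 to 5 scale). The function name and the ratings are made up for illustration and are not taken from the paper; the confidence interval uses a simple normal approximation.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval).

    Hypothetical helper: the paper does not describe its aggregation code.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    avg = mean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return avg, half_width


# Made-up listener ratings, NOT data from the paper.
cmos_ratings = [-1, 0, 0, -1, 1, 0, -1, 0]   # assumed scale: -3 .. +3
smos_ratings = [4, 4, 3, 5, 4, 3, 4, 4]      # assumed scale: 1 .. 5

cmos, cmos_ci = mean_opinion_score(cmos_ratings)
smos, smos_ci = mean_opinion_score(smos_ratings)

print(f"CMOS: {cmos:.2f} +/- {cmos_ci:.2f}")
print(f"SMOS: {smos:.2f} +/- {smos_ci:.2f}")
```

Under this convention, a CMOS close to zero means listeners rated the synthesized speech almost on par with the reference it was compared against.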

Taxonomy

Topics: Speech Recognition and Synthesis · Linguistic Variation and Morphology · Natural Language Processing Techniques