An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

Cheng Gong; Erica Cooper; Xin Wang; Chunyu Qiang; Mengzhe Geng; Dan Wells; Longbiao Wang; Jianwu Dang; Marc Tessier; Aidan Pine; Korin Richmond; Junichi Yamagishi

arXiv:2406.08911·cs.CL·June 14, 2024

TL;DR

This paper investigates how self-supervised multilingual models can be adapted for low-resource language TTS, analyzing factors affecting performance and revealing insights into optimal fine-tuning strategies.

Contribution

It provides an empirical analysis of language adaptation in SSL-based TTS systems, highlighting the impact of phonetic similarity, dataset size, and data pairing on adaptation success.

Findings

01. Phonetic similarity between the pre-training and target languages, along with language category, affects adaptation performance.

02. Fine-tuning dataset size and the number of speakers influence adaptability.

03. Paired data is not always better than audio-only data for fine-tuning.

Abstract

Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech…

Code & Models

Repositories

nii-yamagishilab/ZMM-TTS (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech Recognition and Synthesis