Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Yingting Li; Ambuj Mehrish; Bryan Chew; Bo Cheng; Soujanya Poria

arXiv:2406.17257·cs.CL·June 26, 2024

Open Access

TL;DR

This paper introduces parameter-efficient transfer learning methods like adapters and hypernetworks to multilingual Text-to-Speech systems, significantly reducing training costs while maintaining or improving synthesis quality.

Contribution

It demonstrates that PETL techniques can effectively adapt large multilingual TTS models while tuning only about 2.5% of the parameters, achieving performance comparable to or better than traditional full fine-tuning.

Findings

01. PETL methods achieve comparable or better performance than full fine-tuning.

02. Only approximately 2.5% of parameters need to be tuned for effective adaptation.

03. Code and samples are publicly available.

Abstract

Different languages have distinct phonetic systems and vary in their prosodic features, making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, a TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build a transformer-based model such as SpeechT5 and train it on a large multilingual dataset. As the size of these models grows, conventional fine-tuning for adapting these models becomes impractical due to heavy computational cost. In this paper, we propose to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetworks with the TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods are able to achieve comparable or even better…
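
To make the idea of tuning a small fraction of parameters concrete, the following is a minimal sketch of a bottleneck adapter and of how a trainable-parameter fraction can be computed. It is not the authors' implementation. The layer sizes (`D_MODEL`, `D_FF`, `BOTTLENECK`, `N_LAYERS`), the one-adapter-per-layer placement, and all function names are hypothetical choices for illustration. They are not SpeechT5's actual configuration, and the resulting fraction is not expected to match the roughly 2.5% reported above.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one dense layer."""
    return d_in * d_out + d_out


def adapter_params(d_model, bottleneck):
    """Down-projection plus up-projection of a bottleneck adapter."""
    return linear_params(d_model, bottleneck) + linear_params(bottleneck, d_model)


def backbone_layer_params(d_model, d_ff):
    """Rough count for one transformer layer (layer norms omitted):
    four attention projections and a two-layer feed-forward block."""
    attention = 4 * linear_params(d_model, d_model)
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    return attention + feed_forward


def adapter_forward(h, w_down, w_up):
    """Residual adapter on one hidden vector: h + W_up * relu(W_down * h).
    Biases are left out to keep the sketch short."""
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_down]
    delta = [sum(w * x for w, x in zip(row, z)) for row in w_up]
    return [x + d for x, d in zip(h, delta)]


# Hypothetical sizes, chosen only for illustration (assumption).
D_MODEL, D_FF, BOTTLENECK, N_LAYERS = 768, 3072, 64, 12

frozen = N_LAYERS * backbone_layer_params(D_MODEL, D_FF)   # backbone stays fixed
trainable = N_LAYERS * adapter_params(D_MODEL, BOTTLENECK)  # one adapter per layer
fraction = trainable / (frozen + trainable)

print(f"trainable fraction: {fraction:.2%}")
```

With a zero up-projection the adapter returns its input unchanged, so an adapted model can start from the pretrained model's behaviour; only the small `w_down` and `w_up` matrices are then updated while the backbone stays frozen.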

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: HyperNetwork