Continual Speaker Adaptation for Text-to-Speech Synthesis
Hamed Hemati, Damian Borth

TL;DR
This paper addresses the challenge of adding new speakers to a multi-speaker TTS model without degrading performance on previously learned speakers, by applying continual learning techniques such as experience replay and weight regularization.
Contribution
It introduces a continual learning framework for TTS, demonstrating how to mitigate catastrophic forgetting during sequential speaker adaptation.
Findings
Experience replay reduces speaker forgetting.
Weight regularization helps preserve previous speaker quality.
Extended experience replay improves performance with small buffers.
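To make the replay idea in these findings concrete, here is a minimal sketch of a fixed-size buffer that keeps utterances from earlier speakers and mixes them into batches for a new speaker. The names `ReplayBuffer` and `mixed_batch`, the reservoir-sampling policy, and the tuple format of the samples are assumptions made for illustration; this summary does not say how the paper fills or samples its buffer.

```python
import random


class ReplayBuffer:
    """Fixed-size store of samples from earlier speakers (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling (an assumed policy): each sample seen so far
        # has the same probability of remaining in the buffer.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        # Draw up to k stored samples without replacement.
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def mixed_batch(new_samples, buffer, replay_size):
    """Append replayed old-speaker samples to the new-speaker samples."""
    return list(new_samples) + buffer.sample(replay_size)


# Usage: store utterances from an earlier speaker, then build a batch
# for a new speaker that also contains two replayed samples.
buffer = ReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("speaker_A", f"utt_{i}"))

batch = mixed_batch(
    [("speaker_B", "utt_0"), ("speaker_B", "utt_1")], buffer, replay_size=2
)
```

A smaller `capacity` means fewer old utterances are available for replay, which is the regime the last finding refers to.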
Abstract
Training a multi-speaker Text-to-Speech (TTS) model from scratch is computationally expensive and adding new speakers to the dataset requires the model to be re-trained. The naive solution of sequential fine-tuning of a model for new speakers can lead to poor performance of older speakers. This phenomenon is known as catastrophic forgetting. In this paper, we look at TTS modeling from a continual learning perspective, where the goal is to add new speakers without forgetting previous speakers. Therefore, we first propose an experimental setup and show that serial fine-tuning for new speakers can cause the forgetting of the earlier speakers. Then we exploit two well-known techniques for continual learning, namely experience replay and weight regularization. We reveal how one can mitigate the effect of degradation in speech synthesis diversity in sequential training of new speakers using…
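Weight regularization, the second technique named in the abstract, can be sketched as a penalty that discourages parameters from drifting away from the values learned on earlier speakers. The plain L2 penalty, the function names, and the `strength` parameter below are assumptions for illustration; the abstract does not specify which regularizer the authors use.

```python
def l2_anchor_penalty(params, anchor, strength):
    """Squared distance between current and previous weights, scaled by strength.

    `params` and `anchor` are flat lists of floats here; a real model would
    iterate over its parameter tensors instead (assumed simplification).
    """
    return strength * sum((p - a) ** 2 for p, a in zip(params, anchor))


def regularized_loss(task_loss, params, anchor, strength):
    """New-speaker loss plus the penalty that protects earlier speakers."""
    return task_loss + l2_anchor_penalty(params, anchor, strength)


# Usage: weights saved after the previous speaker act as the anchor.
anchor = [1.0, 0.0]
params = [1.0, 2.0]
total = regularized_loss(task_loss=0.25, params=params, anchor=anchor, strength=0.5)
```

A larger `strength` keeps the model closer to its previous weights at the cost of adapting less to the new speaker.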
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Domain Adaptation and Few-Shot Learning
Methods: Experience Replay
