FedSpeech: Federated Text-to-Speech with Continual Learning
Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao

TL;DR
FedSpeech introduces a federated text-to-speech system using continual learning techniques to preserve speaker identity and privacy, achieving high-quality multi-speaker synthesis with limited local data.
Contribution
The paper presents a novel federated learning architecture for text-to-speech built on continual learning methods: gradual pruning masks that isolate parameters to preserve each speaker's tone, selective masks that reuse knowledge from other tasks, and a private speaker embedding that protects user privacy.
Findings
Nearly matches multi-task training in speech quality
Retains speaker tones effectively
Outperforms multi-task training in speaker similarity
Abstract
Federated learning enables collaborative training of machine learning models under strict privacy restrictions. Federated text-to-speech aims to synthesize natural speech for multiple users from a few audio training samples stored locally on their devices. However, federated text-to-speech faces several challenges: very few training samples are available from each speaker, the training samples are all stored on each user's local device, and the global model is vulnerable to various attacks. In this paper, we propose a novel federated learning architecture based on continual learning approaches to overcome these difficulties. Specifically, 1) we use gradual pruning masks to isolate parameters for preserving speakers' tones; 2) we apply selective masks for effectively reusing knowledge from tasks; 3) a private speaker embedding is introduced to protect users' privacy. Experiments on a reduced…
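The idea of isolating parameters per speaker with pruning masks can be illustrated with a small sketch. This is not the authors' implementation: the magnitude-based selection, the one-shot (rather than gradual) pruning, and the names `assign_masks`, `masked_update`, and `keep_ratio` are all assumptions made for illustration. Each speaker claims a disjoint subset of the still-free weights, and later updates touch only that speaker's subset.

```python
# Illustrative sketch only. Magnitude-based, one-shot pruning is an
# assumption here; the paper's gradual pruning schedule is not reproduced.

def prune_mask_for_speaker(weights, free, keep_ratio):
    """Pick the largest-magnitude free weights and reserve them for one speaker."""
    n_keep = max(1, int(len(free) * keep_ratio))
    # Sort by descending magnitude; the index breaks ties deterministically.
    ranked = sorted(free, key=lambda i: (-abs(weights[i]), i))
    return set(ranked[:n_keep])


def assign_masks(weights, num_speakers, keep_ratio):
    """Give each speaker a disjoint set of weight indices, in speaker order."""
    free = set(range(len(weights)))
    masks = []
    for _ in range(num_speakers):
        if not free:
            break  # no capacity left for further speakers
        mask = prune_mask_for_speaker(weights, free, keep_ratio)
        masks.append(mask)
        free -= mask  # reserved weights are no longer available
    return masks, free


def masked_update(weights, grads, mask, lr):
    """Apply a gradient step only to the weights owned by this speaker."""
    return [
        w - lr * g if i in mask else w
        for i, (w, g) in enumerate(zip(weights, grads))
    ]


weights = [0.9, -0.1, 0.5, -0.7, 0.2, 0.05, -0.3, 0.8]
masks, free = assign_masks(weights, num_speakers=2, keep_ratio=0.5)
updated = masked_update(weights, [1.0] * len(weights), masks[1], lr=0.1)
```

Because the masks are disjoint, training on the second speaker leaves the first speaker's reserved weights unchanged, which is the property that parameter isolation is meant to provide.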
Taxonomy
Methods: Pruning
