Multi-speaker Text-to-speech Training with Speaker Anonymized Data
Wen-Chin Huang, Yi-Chiao Wu, and Tomoki Toda

TL;DR
This paper explores training multi-speaker TTS models using speaker-anonymized data to enhance privacy, evaluating various anonymization methods and identifying metrics that predict TTS performance.
Contribution
It introduces the use of speaker anonymization techniques for multi-speaker TTS training and evaluates their effectiveness with new performance indicators.
Findings
UTMOS, a data-driven predictor of subjective ratings, and GVD (gain of voice distinctiveness) are good indicators of downstream TTS performance.
Speaker anonymization can be successfully integrated into TTS training.
Objective and subjective evaluations confirm the viability of anonymized data.
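As a rough illustration of the GVD metric named in the findings, the sketch below follows the definition commonly used in the VoicePrivacy Challenge evaluation: GVD is the ratio, in decibels, of the diagonal dominance of the anonymized voice similarity matrix to that of the original one. This page does not spell out the formula, so treat it as an assumption rather than the authors' exact implementation. The function names `diagonal_dominance` and `gvd_db` are hypothetical, and the similarity matrices are assumed to be precomputed by a speaker verification system.

```python
import numpy as np


def diagonal_dominance(sim: np.ndarray) -> float:
    """Absolute gap between the mean diagonal and mean off-diagonal entries.

    `sim` is assumed to be a square speaker-by-speaker similarity matrix
    (hypothetical input; how it is computed is outside this sketch).
    """
    n = sim.shape[0]
    if sim.ndim != 2 or sim.shape[1] != n or n < 2:
        raise ValueError("sim must be a square matrix with at least 2 speakers")
    diag_sum = float(np.trace(sim))
    diag_mean = diag_sum / n
    off_mean = (float(sim.sum()) - diag_sum) / (n * (n - 1))
    return abs(diag_mean - off_mean)


def gvd_db(sim_original: np.ndarray, sim_anonymized: np.ndarray) -> float:
    """Gain of voice distinctiveness in dB (assumed VoicePrivacy-style definition)."""
    d_orig = diagonal_dominance(sim_original)
    d_anon = diagonal_dominance(sim_anonymized)
    if d_orig == 0.0 or d_anon == 0.0:
        raise ValueError("diagonal dominance must be non-zero to take the log ratio")
    return float(10.0 * np.log10(d_anon / d_orig))
```

Under this definition, a GVD of 0 dB means the anonymized voices are as distinguishable from one another as the original voices, and a negative value means distinctiveness was lost.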
Abstract
The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which was then used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS,…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
