Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice
Yan Deng, Lei He, Frank Soong

TL;DR
This paper introduces a multi-speaker neural TTS model that adapts to new speakers with less than 5 minutes of speech and enhances premium voices by drawing on data from other speakers, improving naturalness and speaker similarity.
Contribution
The study proposes a multi-speaker latent space approach in neural TTS that enables quick speaker adaptation and premium voice enhancement using limited data and multi-speaker information.
Findings
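Achieves a naturalness MOS of 4.16 and a speaker-similarity MOS of 4.64 for new speakers with less than 5 minutes of training data.
Attains an MOS of 4.5 for premium voices on out-of-domain texts, comparable to 4.58 for professional recordings.
Outperforms single-speaker models in naturalness and similarity metrics.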
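The figures above are mean opinion scores (MOS). As a rough illustration of how such a score is usually computed, the sketch below averages listener ratings and attaches a 95% normal-approximation confidence interval. It assumes the conventional 1 to 5 rating scale; the function name `mean_opinion_score` and the ratings are hypothetical and are not taken from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of listener ratings.

    half_width is the half-width of a 95% normal-approximation
    confidence interval around the mean.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings, not data from the paper.
ratings = [4, 5, 4, 4, 5, 3, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean helps when judging whether a gap such as 4.5 versus 4.58 is meaningful for a given number of listeners.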
Abstract
Neural TTS has shown it can generate high-quality synthesized speech. In this paper, we investigate the multi-speaker latent space to improve neural TTS, either by adapting the system to new speakers with only several minutes of speech or by enhancing a premium voice using data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with the embedded speaker information in both spectral and speaker latent space. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model can achieve an MOS of 4.16 in naturalness and 4.64 in speaker similarity, close to human recordings (4.74). For a well-trained premium voice, we can achieve an MOS of 4.5 for out-of-domain texts, which is comparable to an MOS of 4.58 for professional recordings, and significantly outperforms…
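The abstract does not spell out how speaker information enters the model, so the following is only a generic sketch of speaker-embedding conditioning and should not be read as the paper's architecture. It assumes a lookup table of per-speaker vectors, initialises a new speaker at the mean of the known speakers, and concatenates the speaker vector onto every encoder frame. All names (`condition_on_speaker`, `init_new_speaker`, `speaker_table`) and all numbers are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def condition_on_speaker(encoder_frames: List[Vector],
                         speaker_embedding: Vector) -> List[Vector]:
    """Append the same speaker embedding to every encoder frame."""
    return [frame + speaker_embedding for frame in encoder_frames]


def init_new_speaker(table: Dict[str, Vector]) -> Vector:
    """Start a new speaker at the mean of the known speakers' embeddings."""
    dim = len(next(iter(table.values())))
    n = len(table)
    return [sum(vec[i] for vec in table.values()) / n for i in range(dim)]


# Hypothetical 2-dimensional embeddings for two training speakers.
speaker_table = {"spk_a": [0.2, -0.1], "spk_b": [0.6, 0.3]}
new_speaker = init_new_speaker(speaker_table)

# Hypothetical encoder output: 3 frames of 4 features each.
frames = [
    [0.0, 0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 1.0, 1.1],
]
conditioned = condition_on_speaker(frames, new_speaker)
print(len(conditioned), len(conditioned[0]))
```

In a trained system the new speaker's vector would then be fine-tuned on the few minutes of enrollment speech; that step is omitted here because it depends on details the abstract does not give.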
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
