Multi-Speaker End-to-End Speech Synthesis
Jihyun Park, Kexin Zhao, Kainan Peng, Wei Ping

TL;DR
This paper extends ClariNet, a fully end-to-end text-to-wave model, to multiple speakers. Low-dimensional trainable speaker embeddings are shared across the model's components, and the authors report better naturalness than state-of-the-art systems.
Contribution
The paper extends ClariNet to multi-speaker synthesis by incorporating speaker embeddings and demonstrates improved naturalness over state-of-the-art models.
Findings
Outperforms state-of-the-art systems in naturalness, which the authors attribute to joint end-to-end optimization
Uses shared speaker embeddings across model components
Achieves high-fidelity multi-speaker speech synthesis
Abstract
In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristics of different voices, low-dimensional trainable speaker embeddings are shared across each component of ClariNet and trained together with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
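The idea of one speaker embedding shared across every component can be illustrated with a minimal sketch. This is not the authors' implementation. The class names, the component names, the dimensions, and the choice of conditioning (projecting the embedding and adding it as a bias) are all assumptions made for illustration; the abstract only states that the embeddings are low-dimensional, trainable, and shared across components. Random values stand in for trained parameters.

```python
import random


class SharedSpeakerEmbedding:
    """One low-dimensional vector per speaker (hypothetical helper).

    In a real model these vectors would be trained jointly with the
    rest of the network; here they are just random initial values.
    """

    def __init__(self, num_speakers, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(num_speakers)]

    def lookup(self, speaker_id):
        return self.table[speaker_id]


class Component:
    """Stand-in for one model component (names below are placeholders).

    Assumption: each component owns its own projection of the shared
    embedding and adds the projected vector as a bias to its hidden state.
    """

    def __init__(self, name, embed_dim, hidden_dim, seed):
        rng = random.Random(seed)
        self.name = name
        self.proj = [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                     for _ in range(hidden_dim)]

    def condition(self, hidden, speaker_vec):
        bias = [sum(w * e for w, e in zip(row, speaker_vec))
                for row in self.proj]
        return [h + b for h, b in zip(hidden, bias)]


def synthesize_step(components, embedding, speaker_id, hidden):
    # A single lookup is reused by every component: this is the sharing.
    speaker_vec = embedding.lookup(speaker_id)
    for component in components:
        hidden = component.condition(hidden, speaker_vec)
    return hidden


embedding = SharedSpeakerEmbedding(num_speakers=4, dim=3)
components = [Component(name, embed_dim=3, hidden_dim=5, seed=i + 1)
              for i, name in enumerate(["encoder", "decoder", "vocoder"])]
h0 = [0.0] * 5
out_a = synthesize_step(components, embedding, 0, h0)
out_b = synthesize_step(components, embedding, 1, h0)
```

With identical input activations, `out_a` and `out_b` differ only because of the speaker vector, which is the property that lets one set of model weights produce several voices.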
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · Attention Is All You Need · Weight Normalization · Softmax · L1 Regularization · WaveNet · Dense Connections · Softsign Activation
