Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Sercan Arik, Gregory Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou

TL;DR
This paper presents Deep Voice 2, a neural TTS system that uses trainable speaker embeddings to synthesize many voices from a single model, producing high-quality, speaker-specific speech from limited data per speaker.
Contribution
The paper introduces Deep Voice 2, which improves on the Deep Voice 1 architecture and adds a multi-speaker extension, and demonstrates high-quality synthesis for hundreds of voices from less than half an hour of data per speaker.
Findings
Deep Voice 2 outperforms Deep Voice 1 in audio quality.
The multi-speaker model learns hundreds of voices from less than 30 minutes of data per speaker.
Multi-speaker synthesis preserves speaker identity well.
Abstract
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data…
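The core idea in the abstract, one shared model plus a small trainable vector per speaker, can be illustrated with a toy sketch. This is not the paper's architecture: the single tanh layer, all dimensions, and the names speaker_table, W_spk, and synthesize_features are hypothetical, chosen only to show how an embedding lookup can condition shared weights so the same input produces speaker-dependent output.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen for illustration only.
NUM_SPEAKERS = 4   # the paper reports hundreds of voices
EMBED_DIM = 16     # "low-dimensional" speaker embedding size
FEATURE_DIM = 8    # per-frame input/output feature size
HIDDEN_DIM = 32    # hidden size of the shared layer

# One vector per speaker. In a real system these would be trained
# jointly with the rest of the model by backpropagation.
speaker_table = rng.normal(scale=0.1, size=(NUM_SPEAKERS, EMBED_DIM))

# Weights shared by every speaker.
W_in = rng.normal(scale=0.1, size=(FEATURE_DIM, HIDDEN_DIM))
W_spk = rng.normal(scale=0.1, size=(EMBED_DIM, HIDDEN_DIM))
W_out = rng.normal(scale=0.1, size=(HIDDEN_DIM, FEATURE_DIM))


def synthesize_features(frames, speaker_id):
    """Map (T, FEATURE_DIM) input frames to speaker-conditioned outputs."""
    embedding = speaker_table[speaker_id]           # (EMBED_DIM,)
    speaker_bias = embedding @ W_spk                # (HIDDEN_DIM,)
    # The projected embedding is broadcast across all T time steps.
    hidden = np.tanh(frames @ W_in + speaker_bias)  # (T, HIDDEN_DIM)
    return hidden @ W_out                           # (T, FEATURE_DIM)


frames = rng.normal(size=(5, FEATURE_DIM))
out_a = synthesize_features(frames, speaker_id=0)
out_b = synthesize_features(frames, speaker_id=1)
```

Here out_a and out_b come from identical input frames and identical shared weights; only the looked-up embedding differs. Adding a new voice in this sketch means adding one row to speaker_table rather than training a separate model.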
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network
