Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Sercan Arik; Gregory Diamos; Andrew Gibiansky; John Miller; Kainan Peng; Wei Ping; Jonathan Raiman; Yanqi Zhou

arXiv:1705.08947 · cs.CL · September 22, 2017 · 212 cites

TL;DR

This paper presents Deep Voice 2, a neural TTS system with speaker embeddings enabling multi-speaker synthesis, achieving high-quality, speaker-specific speech from limited data per speaker.

Contribution

The paper introduces Deep Voice 2, which has an improved architecture and a multi-speaker extension, and demonstrates high-quality synthesis for hundreds of voices from minimal data per speaker.

Findings

01

Deep Voice 2 outperforms Deep Voice 1 in audio quality.

02

The multi-speaker model learns hundreds of voices from less than 30 minutes of data per speaker.

03

Multi-speaker synthesis preserves speaker identity well.

Abstract

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data…
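
The core idea in the abstract, a small trainable embedding per speaker that steers a single shared model, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the dimensions, the tanh projection, the additive conditioning, and the names `speaker_table`, `speaker_bias`, and `condition` are hypothetical and are not taken from the paper, whose actual conditioning sites and nonlinearities may differ. Training is omitted; the parameters are simply randomly initialised.

```python
import math
import random

random.seed(0)

NUM_SPEAKERS = 4   # hypothetical; the paper reports hundreds of voices
EMBED_DIM = 3      # assumed size of the low-dimensional speaker embedding
FEATURE_DIM = 5    # assumed size of each text-derived feature frame


def rand_matrix(rows, cols, scale=0.1):
    """Return a rows x cols matrix of small random values."""
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


# Parameters that would be learned jointly with the rest of the model.
speaker_table = rand_matrix(NUM_SPEAKERS, EMBED_DIM)  # one row per speaker
proj_weights = rand_matrix(FEATURE_DIM, EMBED_DIM)    # embedding -> feature size
proj_bias = [0.0] * FEATURE_DIM


def speaker_bias(speaker_id):
    """Look up a speaker embedding and project it to the feature dimension."""
    emb = speaker_table[speaker_id]
    return [
        math.tanh(sum(w * e for w, e in zip(row, emb)) + b)
        for row, b in zip(proj_weights, proj_bias)
    ]


def condition(features, speaker_id):
    """Add the projected speaker vector to every frame of text features."""
    bias = speaker_bias(speaker_id)
    return [[f + b for f, b in zip(frame, bias)] for frame in features]


# The same text features conditioned on two different speakers.
text_features = [[0.5] * FEATURE_DIM for _ in range(2)]
out_a = condition(text_features, speaker_id=0)
out_b = condition(text_features, speaker_id=1)
```

In this sketch the shared weights (`proj_weights`, and by extension the rest of the network) are the same for all speakers; only the row selected from `speaker_table` changes, which is what lets one model produce many voices.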

Code & Models

Repositories

barronalex/Tacotron (TensorFlow)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU · Highway Network