Deep Voice: Real-time Neural Text-to-Speech

Sercan O. Arik; Mike Chrzanowski; Adam Coates; Gregory Diamos; Andrew Gibiansky; Yongguo Kang; Xian Li; John Miller; Andrew Ng; Jonathan Raiman; Shubho Sengupta; Mohammad Shoeybi

arXiv:1702.07825 · cs.CL · March 9, 2017 · 395 cites

TL;DR

Deep Voice is a real-time text-to-speech system built entirely from deep neural networks. Each of its five pipeline components is a neural model, and inference runs faster than real time with the help of optimized WaveNet inference kernels.

Contribution

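The paper introduces a TTS system in which every component is a neural network, which the authors argue makes it simpler and more flexible than traditional systems. It also proposes CTC-based phoneme boundary detection and a WaveNet variant that requires fewer parameters and trains faster than the original.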

Findings

01. Inference runs faster than real time.

02. Optimized WaveNet inference kernels on CPU and GPU achieve up to 400x speedups over existing implementations.

03. Simplifies the TTS pipeline by replacing each traditional component with a neural network.
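
"Faster than real time" is usually quantified with a real-time factor: the time spent synthesizing divided by the duration of the audio produced, where a value below 1 means audio is generated faster than it plays back. The sketch below illustrates that arithmetic only. The function names, the 16 kHz sample rate, and the timings are illustrative assumptions, not figures reported for Deep Voice.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, num_samples: int, sample_rate: int) -> bool:
    """True when the audio is produced in less time than it takes to play."""
    return real_time_factor(synthesis_seconds, num_samples, sample_rate) < 1.0


# Illustrative numbers (assumed): 32,000 samples at 16 kHz is 2 s of audio,
# produced in 1 s of compute.
rtf = real_time_factor(1.0, 32000, 16000)
print(rtf)  # 0.5
```

For an autoregressive synthesizer that emits one sample per step, staying under a real-time factor of 1 means generating at least `sample_rate` samples per second of compute.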

Abstract

We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component…
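
To make the division of labor among the components concrete, here is a minimal sketch of how four of the five blocks could be chained at synthesis time. It is not the authors' code: the class, the function names, the sample rate, and the toy stand-in models are all hypothetical. It also assumes that the segmentation model is used only to prepare training labels and so does not appear in the synthesis path, which the abstract excerpt above does not state.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

SAMPLE_RATE = 16000  # hypothetical output rate, chosen for illustration


@dataclass
class TTSPipeline:
    """Chains hypothetical stand-ins for the component models."""
    g2p: Callable[[str], List[str]]
    predict_durations: Callable[[Sequence[str]], List[float]]
    predict_f0: Callable[[Sequence[str]], List[float]]
    synthesize_audio: Callable[[Sequence[str], Sequence[float], Sequence[float]], List[float]]

    def synthesize(self, text: str) -> List[float]:
        phonemes = self.g2p(text)                    # grapheme-to-phoneme conversion
        durations = self.predict_durations(phonemes) # seconds per phoneme
        f0 = self.predict_f0(phonemes)               # fundamental frequency per phoneme, in Hz
        if not (len(phonemes) == len(durations) == len(f0)):
            raise ValueError("each phoneme needs one duration and one F0 value")
        return self.synthesize_audio(phonemes, durations, f0)


# Toy stand-ins so the sketch runs; real components would be trained networks.
def toy_g2p(text: str) -> List[str]:
    return [ch for ch in text.lower() if ch.isalpha()]  # one "phoneme" per letter


def toy_durations(phonemes: Sequence[str]) -> List[float]:
    return [0.1 for _ in phonemes]  # constant 100 ms


def toy_f0(phonemes: Sequence[str]) -> List[float]:
    return [120.0 for _ in phonemes]  # constant pitch


def toy_vocoder(phonemes, durations, f0s) -> List[float]:
    samples: List[float] = []
    for dur, f0 in zip(durations, f0s):
        n = round(dur * SAMPLE_RATE)
        samples.extend(math.sin(2 * math.pi * f0 * i / SAMPLE_RATE) for i in range(n))
    return samples


pipeline = TTSPipeline(toy_g2p, toy_durations, toy_f0, toy_vocoder)
audio = pipeline.synthesize("hi there")
```

Because each stage is just a callable with a fixed interface, any one of them can be swapped for a different model without touching the others, which is the kind of flexibility the abstract attributes to using a neural network for each component.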

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
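
As a brief illustration of the dilated causal convolution listed under Methods, the sketch below implements the operation in plain Python and shows how stacking layers with doubling dilations grows the receptive field. It is a generic textbook formulation with hypothetical function names and toy weights, not the Deep Voice or WaveNet implementation, and it omits gating, residual connections, and conditioning.

```python
from typing import List, Sequence


def causal_dilated_conv(x: Sequence[float], weights: Sequence[float], dilation: int) -> List[float]:
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Inputs before t = 0 are treated as zero, so y[t] never depends on
    future samples (the convolution is causal).
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples that can influence one output of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Feed an impulse through four layers with dilations 1, 2, 4, 8 and
# all-ones kernels of size 2; the response spreads over 16 time steps.
signal = [1.0] + [0.0] * 19
for d in (1, 2, 4, 8):
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)

print(receptive_field(2, [1, 2, 4, 8]))  # 16
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is why the technique suits sample-level audio models.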