Deep Voice: Real-time Neural Text-to-Speech
Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi

TL;DR
Deep Voice is a text-to-speech system built entirely from deep neural networks. It combines five neural components, lays the groundwork for end-to-end speech synthesis, and runs inference faster than real time, using a WaveNet variant for audio synthesis.
Contribution
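The paper introduces a TTS system in which every component is a neural network, which makes it simpler and more flexible than traditional pipelines. It also proposes phoneme boundary detection trained with connectionist temporal classification (CTC) loss, and a WaveNet variant that needs fewer parameters and trains faster than the original.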
Findings
Inference runs faster than real time.
Optimized WaveNet inference kernels on CPU and GPU achieve up to 400x speedups over existing implementations.
Simplifies the TTS pipeline by replacing each traditional component with a neural network.
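"Faster than real time" is usually quantified with a real-time factor: the time taken to synthesize a clip divided by the clip's duration, where a value below 1 means synthesis outpaces playback. The sketch below illustrates that arithmetic only. The function names and the sample timings are assumptions made for illustration and are not figures from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the audio is generated in less time than it takes to play."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


# Hypothetical timings: 0.5 s of compute to produce 2.0 s of audio.
rtf = real_time_factor(0.5, 2.0)
print(f"real-time factor: {rtf:.2f}")
```

A system must keep this ratio below 1 for every utterance it serves, not just on average, if audio is streamed as it is generated.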
Abstract
We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component…
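The audio synthesis model is a WaveNet variant, and the core operation in WaveNet-style models is the dilated causal convolution: each output sample depends only on the current and earlier input samples, spaced `dilation` steps apart. The following is a minimal NumPy sketch of that operation for a single channel. It is not the paper's implementation, and the kernel size and dilation schedule shown are illustrative assumptions.

```python
import numpy as np


def dilated_causal_conv1d(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the signal are treated as zero, so y[t]
    never depends on any x[s] with s > t (causality).
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift >= len(x):
            break  # this tap only reaches samples before the signal start
        y[shift:] += w * x[: len(x) - shift]
    return y


def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Illustrative values only (not taken from the paper).
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dilated_causal_conv1d(signal, weights=[1.0, 1.0], dilation=2))
print(receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is why this structure suits long audio contexts.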
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Mixture of Logistic Distributions · Dilated Causal Convolution · WaveNet
