SpeedySpeech: Efficient Neural Speech Synthesis

Jan Vainer; Ondřej Dušek

arXiv:2008.03802·eess.AS·August 11, 2020

TL;DR

SpeedySpeech is a fast, resource-efficient neural speech synthesis system. It trains quickly, synthesizes high-quality audio faster than real time, and its voice quality was rated higher than that of Tacotron 2.

Contribution

The paper presents a novel student-teacher model for speech synthesis that eliminates the need for self-attention layers, enabling faster training and inference while maintaining high audio quality.

Findings

01. Achieves high-quality speech synthesis faster than real-time.

02. Requires low computational resources and short training time.

03. Outperforms Tacotron 2 in voice quality ratings.
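
"Faster than real-time" is usually quantified with a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1 mean faster than real time. The sketch below illustrates that arithmetic only. The function names, the hop length of 256 samples, and the 22050 Hz sample rate are illustrative assumptions, not values taken from this page.

```python
def audio_duration(num_frames: int, hop_length: int = 256, sample_rate: int = 22050) -> float:
    """Duration in seconds of audio covered by `num_frames` spectrogram frames.

    hop_length and sample_rate are assumed defaults, not values from the paper.
    """
    return num_frames * hop_length / sample_rate


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical measurement: 0.5 s of compute for 2.0 s of audio.
    rtf = real_time_factor(0.5, 2.0)
    print(f"RTF = {rtf:.2f}")
```

With these hypothetical numbers the system would need a quarter of the playback time to produce the audio.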

Abstract

While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio…
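
To make "simple convolutional blocks with residual connections" concrete, here is a minimal single-channel sketch of a dilated 1D convolution followed by a ReLU and a skip connection. It is an illustration under stated assumptions, not the authors' implementation: the function names, the zero padding, the ReLU activation, and the single-channel simplification are all assumptions, and real models of this kind operate on many channels with learned weights.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded, length-preserving 1D convolution with an odd-length kernel.

    Kernel taps are spaced `dilation` steps apart and centred on each position.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd")
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            idx = t + (i - k // 2) * dilation
            if 0 <= idx < len(x):  # positions outside the input count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Return x + ReLU(conv(x)): the input is added back to the block's output."""
    conv = dilated_conv1d(x, kernel, dilation)
    activated = [max(0.0, v) for v in conv]
    return [a + b for a, b in zip(x, activated)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(residual_block(signal, kernel=[1.0, 0.0, 1.0], dilation=2))
```

Because the block adds its input back to the convolution output, stacking several such blocks with growing dilation widens the receptive field without any attention or recurrence.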

Taxonomy

Methods: 1x1 Convolution · Dilated Causal Convolution · Bidirectional GRU · Dilated Convolution · Weight Normalization · Residual Connection · Highway Layer · Grouped Convolution · Max Pooling