TL;DR
SpeedySpeech is a fast, resource-efficient neural speech synthesis system. It generates spectrograms faster than real time, trains quickly, and, when coupled with a MelGAN vocoder, was rated higher in voice quality than Tacotron 2.
Contribution
The paper presents a student-teacher model for speech synthesis built from simple convolutional blocks with residual connections. It uses no self-attention layers, only a single attention layer in the teacher, which enables faster training and inference while maintaining high audio quality.
Findings
Achieves high-quality speech synthesis faster than real-time.
Trains efficiently on a single GPU and runs in real time even on a CPU.
Rated significantly higher than Tacotron 2 in voice quality when coupled with a MelGAN vocoder.
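The "faster than real-time" claim is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, with values below 1 meaning faster than real time. The sketch below is illustrative only. The function name, the hop length of 256 samples, the 22050 Hz sample rate, and the timing figures are assumptions, not values taken from the paper.

```python
def real_time_factor(synthesis_seconds: float, n_frames: int,
                     hop_length: int = 256, sample_rate: int = 22050) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    hop_length and sample_rate are assumed defaults, not values from the paper.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    # Each spectrogram frame advances the waveform by hop_length samples.
    audio_seconds = n_frames * hop_length / sample_rate
    return synthesis_seconds / audio_seconds


# Hypothetical measurement: 861 frames (about 10 s of audio) synthesized in 0.5 s.
rtf = real_time_factor(synthesis_seconds=0.5, n_frames=861)
print(f"RTF = {rtf:.3f}; faster than real time: {rtf < 1.0}")
```

An RTF measured this way covers only the stage that was timed; a full text-to-audio figure would also need to include the vocoder.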
Abstract
While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio…
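To make the idea of "simple convolutional blocks with residual connections" concrete, here is a minimal single-channel sketch in plain Python. It is not the paper's block: the real model operates on multi-channel features and its exact layer ordering and normalization are not given in this summary. The function names, the ReLU activation, and the zero padding are assumptions chosen for illustration.

```python
from typing import List


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Non-causal, zero-padded 1D convolution that keeps the input length."""
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    half = k // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` steps apart around position t.
            idx = t + (j - half) * dilation
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Add the input back to the activated convolution output (skip connection)."""
    conv = dilated_conv1d(x, kernel, dilation)
    activated = [max(0.0, v) for v in conv]  # ReLU, an assumed activation
    return [xi + ai for xi, ai in zip(x, activated)]


signal = [1.0, 2.0, 3.0, 4.0]
print(residual_block(signal, kernel=[1.0, 0.0, 1.0], dilation=1))
print(residual_block(signal, kernel=[1.0, 0.0, 1.0], dilation=2))
```

Stacking such blocks with growing dilation widens the context each output position sees without any attention mechanism, and every position can be computed independently, which is what makes this style of layer easy to parallelize.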
Taxonomy
Methods: 1x1 Convolution · Dilated Causal Convolution · Bidirectional GRU · Dilated Convolution · Weight Normalization · Residual Connection · Highway Layer · Grouped Convolution · Max Pooling
