A Spectral Energy Distance for Parallel Speech Synthesis
Alexey A. Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner

TL;DR
This paper introduces a spectral energy distance for training parallel speech synthesis models. The method achieves high-quality audio generation without a likelihood function or adversarial training, and reports state-of-the-art results among implicit models.
Contribution
The paper proposes a novel spectral energy distance for training implicit speech models, enabling stable, parallel, and likelihood-free learning with statistical guarantees.
Findings
Achieves state-of-the-art quality among implicit models, as measured by the cFDSD metric.
Improves GAN-TTS scores when combined with adversarial techniques.
Provides a stable, bias-free training method without adversarial learning.
Abstract
Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical…
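The generalized energy distance described in the abstract can be illustrated with a minimal sketch. For a distance d, the squared energy distance between the data distribution and the model is 2 E[d(x, y)] - E[d(x, x')] - E[d(y, y')]; the middle term does not depend on the model, so a training loss only needs the attractive term between real and generated audio and the repulsive term between two independent generated samples. The code below is an assumption-laden illustration, not the authors' implementation: the function names, the single STFT scale, the frame settings, and the plain L1 distance between magnitude spectrograms are all hypothetical simplifications.

```python
import numpy as np

def magnitude_spectrogram(wave, frame_len=256, hop=64):
    """Magnitude STFT of a 1-D waveform with a Hann window (illustrative settings)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins per frame
    return np.abs(np.fft.rfft(frames, axis=-1))

def spectral_distance(a, b):
    """L1 distance between magnitude spectrograms (assumed stand-in distance)."""
    return np.abs(magnitude_spectrogram(a) - magnitude_spectrogram(b)).sum()

def energy_distance_loss(real, fake_1, fake_2):
    """Single-example estimate of 2*E[d(x, y)] - E[d(y, y')].

    real:   target waveform x
    fake_1: model sample y for the same conditioning input
    fake_2: second, independent model sample y' for that input
    """
    # Attractive term: pull both generated samples toward the real audio.
    attract = spectral_distance(real, fake_1) + spectral_distance(real, fake_2)
    # Repulsive term: push the two generated samples apart, discouraging collapse.
    repel = spectral_distance(fake_1, fake_2)
    return attract - repel

# Usage with random stand-in waveforms (no real audio or model involved).
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y1 = rng.standard_normal(1024)
y2 = rng.standard_normal(1024)
loss = energy_distance_loss(x, y1, y2)
```

Because the stand-in distance satisfies the triangle inequality, this estimate is never negative, and it is exactly zero when both generated samples equal the real waveform. Dropping the repulsive term would reduce the loss to a plain spectrogram regression, which no longer distinguishes a model that matches the full distribution from one that outputs a single averaged sample.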
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Dense Connections · Batch Normalization · Dilated Convolution · Linear Layer · Feedforward Network · Average Pooling · DBlock · Residual Connection · Conditional Batch Normalization
