A Spectral Energy Distance for Parallel Speech Synthesis

Alexey A. Gritsenko; Tim Salimans; Rianne van den Berg; Jasper Snoek; Nal Kalchbrenner

arXiv:2008.01160 · eess.AS · October 26, 2020 · 20 citations

TL;DR

This paper introduces a spectral energy distance for training parallel speech synthesis models. The method generates high-quality audio without requiring an analytical likelihood function or adversarial training, and reaches state-of-the-art quality among implicit models.

Contribution

The paper proposes a novel spectral energy distance for training implicit speech models, enabling stable, parallel, and likelihood-free learning with statistical guarantees.

Findings

01

Achieves state-of-the-art quality among implicit models, as measured by the cFDSD metric.

02

Improves GAN-TTS scores when combined with adversarial techniques.

03

Provides a stable, bias-free training method without adversarial learning.

Abstract

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical…
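The idea of an energy distance computed on magnitude spectrograms can be illustrated with a minimal sketch. It relies on the standard generalized energy distance, D²(p, q) = 2·E d(x, y) − E d(x, x′) − E d(y, y′), and drops the E d(x, x′) term because it does not depend on the model. Everything else is an assumption made for illustration: this page does not give the paper's exact distance, so a plain L1 difference between single-resolution magnitude spectrograms stands in for it, and the function names, frame length, and hop size are hypothetical.

```python
import numpy as np


def magnitude_spectrogram(wave, frame_len=64, hop=16):
    """Magnitude STFT of a 1-D waveform (Hann window, no padding)."""
    window = np.hanning(frame_len)
    starts = range(0, len(wave) - frame_len + 1, hop)
    frames = np.stack([wave[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=-1))


def spectral_distance(a, b):
    """Assumed stand-in distance: L1 norm between magnitude spectrograms."""
    return np.abs(magnitude_spectrogram(a) - magnitude_spectrogram(b)).sum()


def energy_distance_loss(real, fake_1, fake_2):
    """Single-example estimate of 2*E d(x, y) - E d(y, y').

    real:   real waveform x
    fake_1: model sample y for the same conditioning
    fake_2: a second, independent model sample y'
    """
    # Attractive term: pull both generated samples toward the real audio.
    attract = spectral_distance(real, fake_1) + spectral_distance(real, fake_2)
    # Repulsive term: reward diversity between the two generated samples.
    repel = spectral_distance(fake_1, fake_2)
    return attract - repel


# Toy usage with random "waveforms" standing in for real and generated audio.
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
y1 = rng.standard_normal(256)
y2 = rng.standard_normal(256)
loss = energy_distance_loss(x, y1, y2)
```

Because the stand-in distance satisfies the triangle inequality, this estimate is never negative, and it is exactly zero when both generated samples equal the real waveform. The repulsive term is what separates the objective from a plain spectrogram regression loss: a generator that ignores its noise input (so that `fake_1` equals `fake_2`) gets no credit from it.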

Code & Models

Videos

A Spectral Energy Distance for Parallel Speech Synthesis · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Dense Connections · Batch Normalization · Dilated Convolution · Linear Layer · Feedforward Network · Average Pooling · DBlock · Residual Connection · Conditional Batch Normalization