Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

Ron J. Weiss; RJ Skerry-Ryan; Eric Battenberg; Soroosh Mariooryad; Diederik P. Kingma

arXiv:2011.03568 · cs.CL · February 9, 2021

TL;DR

Wave-Tacotron introduces a spectrogram-free, end-to-end neural TTS model that directly synthesizes speech waveforms from text, leveraging normalizing flows for parallel training and high-quality output.

Contribution

The model extends Tacotron by incorporating a normalizing flow into the autoregressive decoder loop, which enables direct waveform generation without intermediate features and speeds up synthesis.

Findings

01 Produces speech quality close to state-of-the-art systems

02 Allows parallel training and faster synthesis

03 Eliminates need for intermediate spectrogram representations

Abstract

We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features…
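
The block-wise factorization described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single affine coupling layer, the linear conditioning weights `w_s` and `w_t`, the helper names, and the choice to condition only on the immediately preceding block (with no text encoder or attention) are all assumptions made for illustration. What it shows is the general pattern: frame the waveform into non-overlapping blocks, score each block with the change-of-variables formula in one vectorized pass when the ground-truth previous blocks are available, and sample block by block by inverting the flow.

```python
import numpy as np

def frame_blocks(wave, block_size):
    """Split a 1-D waveform into non-overlapping blocks, zero-padding the tail."""
    pad = (-len(wave)) % block_size
    return np.pad(wave, (0, pad)).reshape(-1, block_size)

def init_params(block_size, rng, scale=0.1):
    """Hypothetical linear conditioner weights (block_size must be even)."""
    half = block_size // 2
    in_dim = half + block_size  # first half of the block + previous block
    return {"w_s": scale * rng.standard_normal((in_dim, half)),
            "w_t": scale * rng.standard_normal((in_dim, half))}

def _scale_shift(a, cond, params):
    # Scale and shift for the second half depend on the first half and the condition.
    h = np.concatenate([a, cond], axis=-1)
    return np.tanh(h @ params["w_s"]), h @ params["w_t"]

def flow_forward(y, cond, params):
    """Map waveform blocks y to latents z; also return log|det dz/dy| per block."""
    half = y.shape[-1] // 2
    a, b = y[..., :half], y[..., half:]
    log_scale, shift = _scale_shift(a, cond, params)
    z = np.concatenate([a, (b - shift) * np.exp(-log_scale)], axis=-1)
    return z, -np.sum(log_scale, axis=-1)

def flow_inverse(z, cond, params):
    """Invert flow_forward: map latents z back to waveform blocks."""
    half = z.shape[-1] // 2
    a, zb = z[..., :half], z[..., half:]
    log_scale, shift = _scale_shift(a, cond, params)
    return np.concatenate([a, zb * np.exp(log_scale) + shift], axis=-1)

def log_likelihood(blocks, params):
    """Sum of log p(y_t | y_{t-1}) over blocks, computed for all blocks at once."""
    k = blocks.shape[1]
    # Teacher forcing: each block is conditioned on the ground-truth previous block.
    prev = np.vstack([np.zeros((1, k)), blocks[:-1]])
    z, log_det = flow_forward(blocks, prev, params)
    log_prior = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * k * np.log(2 * np.pi)
    return float(np.sum(log_prior + log_det))

def sample(num_blocks, block_size, params, rng):
    """Generate blocks one at a time; samples within a block are produced together."""
    prev = np.zeros(block_size)
    out = []
    for _ in range(num_blocks):
        z = rng.standard_normal(block_size)
        prev = flow_inverse(z, prev, params)
        out.append(prev)
    return np.concatenate(out)
```

In this sketch the loop in `sample` runs once per block rather than once per waveform sample, which is the source of the speed advantage over sample-level autoregression. A practical flow would stack many invertible layers and take its conditioning from a learned decoder state rather than the raw previous block.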

Code & Models

Repositories

ai-unicamp/tts-objective-metrics (PyTorch)

Taxonomy

Methods: Sigmoid Activation · Residual GRU · Max Pooling · Batch Normalization · Dropout · Bidirectional GRU · Residual Connection · Dense Connections · Highway Layer · Gated Recurrent Unit