TL;DR
Wave-Tacotron introduces a spectrogram-free, end-to-end neural TTS model that directly synthesizes speech waveforms from text, leveraging normalizing flows for parallel training and high-quality output.
Contribution
It extends Tacotron by integrating normalizing flows, enabling direct waveform generation without intermediate features and improving synthesis speed.
Findings
Produces speech quality close to state-of-the-art systems
Allows parallel training and faster synthesis
Eliminates need for intermediate spectrogram representations
Abstract
We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features…
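The block-wise factorization described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's architecture: the block size of 256, the function names, and the single affine transform whose shift and scale are computed from the previous block are all assumptions chosen for illustration, standing in for the learned decoder and multi-layer flow. The sketch only shows how a waveform is cut into non-overlapping blocks and how a change-of-variables log-likelihood is summed over blocks, each conditioned on the one before it.

```python
import numpy as np

def frame_waveform(wave, block_size):
    """Split a 1-D waveform into non-overlapping blocks, zero-padding the tail."""
    pad = (-len(wave)) % block_size
    return np.pad(wave, (0, pad)).reshape(-1, block_size)

def affine_flow_log_likelihood(block, shift, log_scale):
    """Log-density of a block under z = (block - shift) * exp(-log_scale), z ~ N(0, I)."""
    z = (block - shift) * np.exp(-log_scale)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    log_det = -np.sum(log_scale)  # log |det dz/dx| of the affine map
    return log_prior + log_det

def sequence_log_likelihood(blocks):
    """Sum block log-likelihoods, conditioning each block on the previous one.

    The conditioning rule here (mean and spread of the previous block) is a
    toy stand-in for a learned autoregressive decoder.
    """
    total = 0.0
    prev = np.zeros(blocks.shape[1])
    for block in blocks:
        shift = np.full(block.shape, prev.mean())
        log_scale = np.full(block.shape, np.log(prev.std() + 1.0))
        total += affine_flow_log_likelihood(block, shift, log_scale)
        prev = block
    return total

# Hypothetical input: 1000 samples of a sine wave, block size 256 (assumed).
wave = np.sin(np.linspace(0.0, 20.0, 1000))
blocks = frame_waveform(wave, block_size=256)
print(blocks.shape)  # (4, 256)
print(sequence_log_likelihood(blocks))
```

Because every sample inside a block is transformed in one vectorized step, the per-block computation is parallel, while the loop over blocks carries the longer-term, autoregressive dependency. Maximizing the summed log-likelihood is the only objective in this sketch, mirroring the abstract's point that no extra loss terms are needed.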
Taxonomy
Methods: Sigmoid Activation · Residual GRU · Max Pooling · Batch Normalization · Dropout · Bidirectional GRU · Residual Connection · Dense Connections · Highway Layer · Gated Recurrent Unit
