End-to-End Adversarial Text-to-Speech
Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan

TL;DR
This paper introduces an end-to-end adversarial text-to-speech model that directly converts text or phonemes into raw speech audio, achieving high quality with a simple, efficient architecture.
Contribution
It presents a novel, fully end-to-end, feed-forward TTS model using adversarial training and differentiable alignment, eliminating the need for multi-stage pipelines.
Findings
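Achieves a mean opinion score above 4 on a 5-point scale, comparable to state-of-the-art models.
Uses soft dynamic time warping in the spectrogram prediction loss, so the generated audio can vary in timing relative to the ground truth.
Runs as a single feed-forward, differentiable model, which keeps both training and inference efficient.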
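As a rough illustration of the soft dynamic time warping finding above, the sketch below implements the generic soft-DTW recurrence in plain Python. It is not the paper's exact loss: the function names `soft_min` and `soft_dtw`, the L1 frame cost, and the smoothing value `gamma` are all assumptions chosen for the example. The point it shows is that a prediction which matches the target up to a timing shift gets a near-zero cost, where a frame-by-frame comparison would penalise it.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), ignoring infinities."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lowest = min(finite)
    # Subtract the minimum before exponentiating for numerical stability.
    total = sum(math.exp(-(v - lowest) / gamma) for v in finite)
    return lowest - gamma * math.log(total)


def soft_dtw(pred, target, gamma=0.01):
    """Soft-DTW cost between two sequences of feature frames (lists of floats).

    Assumption for this sketch: the per-frame cost is the L1 distance.
    """
    n, m = len(pred), len(target)
    # r[i][j] holds the soft-minimal cost of aligning pred[:i] with target[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum(abs(a - b) for a, b in zip(pred[i - 1], target[j - 1]))
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma
            )
    return r[n][m]


# Same content, shifted by one frame: the warped cost stays near zero.
shifted_cost = soft_dtw([[0.0], [0.0], [1.0]], [[0.0], [1.0], [1.0]])
```

Because `soft_min` is smooth, the whole cost is differentiable with respect to the predicted frames, which is what makes this kind of loss usable for gradient-based training.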
Abstract
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic…
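One way to picture a differentiable alignment scheme based on token length prediction is sketched below. The abstract names only the idea, so the details here are assumptions made for illustration: the function name `align_tokens`, the Gaussian-style kernel around each token centre, and the `temperature` value are all hypothetical. Given a predicted length per token, the sketch places each token centre by cumulative sum and gives every output frame a softmax weighting over tokens, so gradients can flow back into the predicted lengths.

```python
import math


def align_tokens(token_lengths, num_frames, temperature=10.0):
    """Turn per-token predicted lengths into per-frame token weights.

    Returns (weights, total_length), where weights[t][n] is the weight of
    token n at output frame t and each row sums to 1.
    """
    # Token end positions are the cumulative sum of predicted lengths.
    ends, total_length = [], 0.0
    for length in token_lengths:
        total_length += length
        ends.append(total_length)
    # Each token's centre sits half a token length before its end position.
    centres = [end - length / 2.0 for end, length in zip(ends, token_lengths)]

    weights = []
    for t in range(num_frames):
        # Assumed kernel: squared distance to the token centre, softmaxed.
        logits = [-((t - c) ** 2) / temperature for c in centres]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights, total_length


# Two tokens predicted to last 2 and 4 frames, aligned over 6 output frames.
weights, total_length = align_tokens([2.0, 4.0], num_frames=6)
```

The returned `total_length` is the quantity a duration loss could compare against the ground-truth utterance length, in the spirit of the total-duration constraint the abstract mentions.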
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Time Series Analysis and Forecasting
