End-to-End Adversarial Text-to-Speech

Jeff Donahue; Sander Dieleman; Mikołaj Bińkowski; Erich Elsen; Karen Simonyan

arXiv:2006.03575 · cs.SD · March 18, 2021 · 33 cites

TL;DR

This paper introduces an end-to-end adversarial text-to-speech model that directly converts text or phonemes into raw speech audio, achieving high quality with a simple, efficient architecture.

Contribution

It presents a novel, fully end-to-end, feed-forward TTS model using adversarial training and differentiable alignment, eliminating the need for multi-stage pipelines.

Findings

01 Achieves a mean opinion score above 4 on a 5-point scale, comparable to state-of-the-art models.

02 Uses soft dynamic time warping in the spectrogram-based prediction loss so that the generated audio may vary in timing relative to the ground truth.

03 Runs efficiently as a single, differentiable, feed-forward model.
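The soft dynamic time warping mentioned in the findings can be illustrated with the generic soft-DTW recursion, in which the hard minimum of classic DTW is replaced by a smooth minimum so that the alignment cost stays differentiable. The following is a minimal NumPy sketch under stated assumptions: the L1 frame distance, the smoothing parameter `gamma`, and the function names are illustrative choices, not details taken from the paper, whose loss may differ (for example in how it penalises warping).

```python
import numpy as np

def pairwise_l1(x, y):
    """L1 distance between every frame of x (n, d) and every frame of y (m, d)."""
    return np.abs(x[:, None, :] - y[None, :, :]).sum(axis=-1)

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    scaled = -np.asarray(values, dtype=float) / gamma
    peak = scaled.max()
    return -gamma * (peak + np.log(np.exp(scaled - peak).sum()))

def soft_dtw(cost, gamma=0.1):
    """Soft-DTW value for a pairwise cost matrix of shape (n, m).

    As gamma approaches 0 this approaches the classic (hard) DTW cost.
    """
    n, m = cost.shape
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Best (softly minimised) cost of reaching cell (i, j) from its
            # three predecessors: up, left, and diagonal.
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]
```

Applied to two spectrogram-like arrays of different lengths, `soft_dtw(pairwise_l1(pred, target))` gives a scalar that tolerates timing differences between the two sequences. An autodiff framework would be needed to use it as a training loss.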

Abstract

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic…
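To make the idea of a differentiable alignment based on token length prediction more concrete, here is a minimal NumPy sketch. It assumes that predicted token lengths are accumulated into token centre positions and that each output frame takes a softmax-weighted average of token features, with weights given by a Gaussian-style kernel around each centre. The kernel form, the `sigma2` value, and all names are assumptions made for illustration and should be checked against the paper.

```python
import numpy as np

def align_tokens(token_features, token_lengths, num_frames, sigma2=10.0):
    """Upsample per-token features to num_frames frames using predicted lengths.

    token_features: (N, D) array, one feature vector per input token.
    token_lengths:  (N,) array of predicted positive lengths, in frames.
    Returns the aligned features (num_frames, D) and the weights (num_frames, N).
    """
    token_lengths = np.asarray(token_lengths, dtype=float)
    ends = np.cumsum(token_lengths)           # end position of each token
    centres = ends - token_lengths / 2.0      # centre position of each token
    t = np.arange(num_frames)[:, None]        # (T, 1) output frame positions
    logits = -((t - centres[None, :]) ** 2) / sigma2   # (T, N)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over tokens
    return weights @ token_features, weights
```

Every step here (cumulative sum, squared distance, softmax, matrix product) is differentiable, so the same computation written in an autodiff framework would let gradients reach the length predictor. NumPy is used only to keep the sketch short.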

Code & Models

Videos

End-to-End Adversarial Text-to-Speech (Paper Explained) · YouTube

End-to-end Adversarial Text-to-Speech · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Time Series Analysis and Forecasting