DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Yanqing Liu; Ruiqing Xue; Lei He; Xu Tan; Sheng Zhao

arXiv:2207.04646 · cs.SD · July 12, 2022 · 1 citation

Open Access

TL;DR

DelightfulTTS 2 introduces an end-to-end speech synthesis system that jointly optimizes acoustic modeling and waveform reconstruction using adversarial vector-quantized auto-encoders, improving speech quality over previous methods.

Contribution

The paper proposes a VQ-GAN-based codec that learns intermediate speech representations, and it jointly optimizes the acoustic model and vocoder in an end-to-end framework.

Findings

01 Achieves a +0.14 CMOS gain over DelightfulTTS

02 Effectively learns intermediate speech representations

03 Joint optimization improves speech synthesis quality
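
To make the CMOS figure easier to read: a comparative mean opinion score is the average of listener preference ratings between two systems. The sketch below assumes the commonly used seven-point scale from -3 to +3, where positive values favor the new system. The function name and the ratings are hypothetical illustrations, not data from the paper.

```python
def cmos(ratings):
    """Average comparative ratings between two systems.

    Assumes each rating is on a -3..+3 scale, where a positive value
    means the listener preferred the new system over the baseline.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if not -3 <= rating <= 3:
            raise ValueError(f"rating out of range: {rating}")
    return sum(ratings) / len(ratings)


# Hypothetical ratings for illustration only (not from the paper).
hypothetical_ratings = [1, 0, 0, -1, 1, 0, 0, 0]
score = cmos(hypothetical_ratings)
```

A small positive score such as this one indicates a mild average preference for the new system across listeners.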

Abstract

Current text-to-speech (TTS) systems usually use a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representation. This design has two limitations: 1) the acoustic model and vocoder are trained separately rather than jointly optimized, which leads to cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which makes them sub-optimal. To solve these problems, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like…
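
The vector-quantization step that gives a vector-quantized auto-encoder its name can be illustrated with a minimal sketch: each frame-level vector is replaced by its nearest entry in a codebook. This is only the lookup step under stated assumptions. The encoder, decoder, and adversarial training described in the abstract are not shown, and the function names, codebook, and frame values are hypothetical.

```python
def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to `frame`
    under squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(frame, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def vector_quantize(frames, codebook):
    """Map every frame vector to its nearest codebook entry.

    Returns the chosen indices and the quantized vectors.
    """
    indices = [nearest_code(frame, codebook) for frame in frames]
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical 2-dimensional codebook with four entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical encoder outputs, one vector per frame.
frames = [[0.9, 0.1], [0.1, 0.8], [0.95, 1.05]]

indices, quantized = vector_quantize(frames, codebook)
```

In a trained codec the codebook is learned along with the encoder and decoder, so the discrete indices act as the intermediate representation that an acoustic model could predict in place of mel-spectrogram frames.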

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques