Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech

Zhengxi Liu; Qiao Tian; Chenxu Hu; Xudong Liu; Menglin Wu; Yuping Wang; Hang Zhao; Yuxuan Wang

arXiv:2207.06088 · cs.SD · July 14, 2022 · 5 cites

Open Access

TL;DR

This paper introduces an end-to-end text-to-speech model that generates raw waveforms directly from text, with no intermediate mel-spectrogram stage. It combines phoneme-level prosody modeling with supervision from ground-truth acoustic features during training.

Contribution

It proposes a phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows, together with a dual parallel autoencoder that introduces supervision from ground-truth acoustic features during training, aiming at lossless and expressive speech synthesis.
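The following is a minimal NumPy sketch of the general idea of a phoneme-level latent passed through a normalizing flow. It is an illustration under stated assumptions, not the paper's architecture: the function names, the mean-pooling step, the use of pooled features as the posterior mean, the single affine coupling layer, and all shapes and weights are hypothetical choices made for this example.

```python
import numpy as np

def pool_frames_to_phonemes(frames, durations):
    """Average frame-level features over each phoneme's span (assumed pooling)."""
    bounds = np.concatenate([[0], np.cumsum(durations)])
    return np.stack([frames[s:e].mean(axis=0)
                     for s, e in zip(bounds[:-1], bounds[1:])])

def reparameterize(mu, log_var, rng):
    """VAE reparameterization trick: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def affine_coupling(z, w_scale, w_shift):
    """One affine coupling layer; the first half of z conditions the second."""
    half = z.shape[-1] // 2
    z_a, z_b = z[..., :half], z[..., half:]
    log_s = np.tanh(z_a @ w_scale)   # bounded log-scale
    t = z_a @ w_shift                # shift
    y = np.concatenate([z_a, z_b * np.exp(log_s) + t], axis=-1)
    return y, log_s.sum(axis=-1)     # transformed latent, log|det Jacobian|

def affine_coupling_inverse(y, w_scale, w_shift):
    """Exact inverse of affine_coupling."""
    half = y.shape[-1] // 2
    y_a, y_b = y[..., :half], y[..., half:]
    log_s = np.tanh(y_a @ w_scale)
    t = y_a @ w_shift
    return np.concatenate([y_a, (y_b - t) * np.exp(-log_s)], axis=-1)

# Toy data: 10 frames of 4-dim features, 3 phonemes lasting 3, 2 and 5 frames.
rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))
durations = np.array([3, 2, 5])

pooled = pool_frames_to_phonemes(frames, durations)   # shape (3, 4)
# Assumption: pooled features stand in for an encoder's posterior mean.
z = reparameterize(pooled, np.zeros_like(pooled), rng)

w_scale = 0.1 * rng.standard_normal((2, 2))
w_shift = 0.1 * rng.standard_normal((2, 2))
y, log_det = affine_coupling(z, w_scale, w_shift)
z_rec = affine_coupling_inverse(y, w_scale, w_shift)
```

In this sketch each phoneme gets one latent vector, and the coupling layer is invertible with a cheap log-determinant, which is the property that lets a flow reshape a simple Gaussian latent into a more flexible distribution. A trained model would replace the random weights and the pooled-mean stand-in with learned networks.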

Findings

01 Outperforms state-of-the-art TTS systems in quality and expressiveness

02 Effectively models prosody for more natural speech

03 Generates high-quality speech without loss of information

Abstract

Some recent studies have demonstrated the feasibility of single-stage neural text-to-speech, which generates raw waveforms directly from the text without first generating mel-spectrograms. Single-stage text-to-speech often faces two problems: a) the one-to-many mapping problem due to multiple speech variations and b) insufficient high-frequency reconstruction due to the lack of supervision from ground-truth acoustic features during training. To address problem a) and generate more expressive speech, we propose a novel phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows to model underlying prosodic information in speech. We also use the prosody predictor to support end-to-end expressive speech synthesis. Furthermore, we propose the dual parallel autoencoder to introduce supervision of the ground-truth acoustic features during…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques