Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech
Zhengxi Liu, Qiao Tian, Chenxu Hu, Xudong Liu, Menglin Wu, Yuping Wang, Hang Zhao, Yuxuan Wang

TL;DR
This paper introduces an end-to-end text-to-speech model that generates expressive speech waveforms directly from text, without an intermediate mel-spectrogram stage, by combining phoneme-level prosody modeling with supervision from ground-truth acoustic features during training.
Contribution
It proposes a phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows, a prosody predictor to support end-to-end expressive synthesis, and a dual parallel autoencoder that introduces supervision from ground-truth acoustic features during training.
Findings
Outperforms state-of-the-art TTS systems in quality and expressiveness
Effectively models prosody for more natural speech
Generates high-quality speech without loss of information
Abstract
Some recent studies have demonstrated the feasibility of single-stage neural text-to-speech, which generates raw waveforms directly from text without first generating mel-spectrograms. Single-stage text-to-speech often faces two problems: a) the one-to-many mapping problem caused by multiple speech variations, and b) insufficient high-frequency reconstruction caused by the lack of supervision from ground-truth acoustic features during training. To address problem a) and generate more expressive speech, we propose a novel phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows to model the underlying prosodic information in speech. We also use a prosody predictor to support end-to-end expressive speech synthesis. Furthermore, we propose the dual parallel autoencoder to introduce supervision of the ground-truth acoustic features during…
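The general idea of a variational autoencoder posterior refined by a normalizing flow, applied once per phoneme, can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say which flow family, latent size, or prior is used, so the planar flow, the 2-dimensional latent, the standard-normal prior, and every function name and parameter value below are assumptions chosen only for illustration.

```python
import math
import random


def gaussian_log_prob(z, mean, log_std):
    """Log density of a diagonal Gaussian evaluated at z."""
    total = 0.0
    for zi, mi, si in zip(z, mean, log_std):
        total += -0.5 * math.log(2.0 * math.pi) - si
        total += -0.5 * ((zi - mi) / math.exp(si)) ** 2
    return total


def reparameterize(mean, log_std, rng):
    """Draw z0 = mean + std * eps with eps ~ N(0, I)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def planar_flow(z, u, w, b):
    """One planar flow step f(z) = z + u * tanh(w.z + b).

    Returns the transformed latent and log|det J|.
    The step is invertible only when u.w >= -1 (assumed by the caller).
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    uw = sum(ui * wi for ui, wi in zip(u, w))
    log_det = math.log(abs(1.0 + (1.0 - h * h) * uw))
    return z_new, log_det


def phoneme_kl_estimate(mean, log_std, flow_params, rng):
    """Single-sample estimate of KL(q(z_K) || N(0, I)) for one phoneme."""
    z = reparameterize(mean, log_std, rng)
    log_q = gaussian_log_prob(z, mean, log_std)
    for u, w, b in flow_params:
        z, log_det = planar_flow(z, u, w, b)
        log_q -= log_det  # change of variables: log q_K = log q_0 - sum of log|det J|
    log_p = gaussian_log_prob(z, [0.0] * len(z), [0.0] * len(z))
    return z, log_q - log_p


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical encoder outputs (mean, log_std) for three phonemes.
    posteriors = [
        ([0.1, -0.2], [-0.5, -0.3]),
        ([0.4, 0.0], [-0.2, -0.7]),
        ([-0.3, 0.2], [-0.4, -0.4]),
    ]
    # One hypothetical planar flow layer, shared across phonemes.
    flow_params = [([0.3, 0.1], [0.5, 0.2], 0.0)]
    for mean, log_std in posteriors:
        z, kl = phoneme_kl_estimate(mean, log_std, flow_params, rng)
        print(z, kl)
```

In a trained system the per-phoneme means, log standard deviations, and flow parameters would come from learned networks, and the KL estimate would be one term of the training loss; here they are fixed constants so the sketch runs on its own.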
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
