NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality
Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

TL;DR
This paper introduces NaturalSpeech, an end-to-end TTS system built on a variational autoencoder (VAE). On a benchmark dataset, subjective listening tests find no statistically significant difference between its output and human recordings.
Contribution
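The paper defines human-level quality in terms of the statistical significance of subjective measures, and presents NaturalSpeech, a VAE-based text-to-waveform system whose modules include phoneme pre-training, differentiable duration modeling, and bidirectional prior/posterior modeling.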
Findings
Achieves a comparative mean opinion score (CMOS) of -0.01 relative to human recordings
Shows no statistically significant difference from human recordings (p >> 0.05)
Reports these results on the LJSpeech dataset
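The two quantities in the findings above can be illustrated with a short sketch. It assumes that CMOS is the mean of per-comparison listener scores (system minus human, on a -3 to +3 scale) and that significance is checked with a paired Wilcoxon signed-rank test; the summary above does not name the test, so that choice is an assumption. The function names and the ratings are hypothetical and are not data from the paper.

```python
import math

def cmos(scores):
    """Mean of per-comparison scores (system minus reference), each in [-3, 3]."""
    return sum(scores) / len(scores)

def wilcoxon_signed_rank_p(scores):
    """Two-sided p-value for a Wilcoxon signed-rank test on paired differences.

    Sketch only: zero differences are dropped, tied magnitudes get average
    ranks, and the p-value uses the normal approximation without a tie
    correction to the variance.
    """
    diffs = [s for s in scores if s != 0]
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical listener scores: positive means the system was preferred.
balanced = [1, -1, 0, 2, -2, 0, 1, -1, 0, 0]
skewed = [1, 2, 3, 1, 2, 3, 1, 2, 3, 2]

print(cmos(balanced), wilcoxon_signed_rank_p(balanced))
print(cmos(skewed), wilcoxon_signed_rank_p(skewed))
```

A CMOS near zero together with a large p-value, as in the balanced example, is the pattern the findings describe; the skewed example shows the opposite case, where listeners consistently prefer one side.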
Abstract
Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Some questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of subjective measures and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory…
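For readers unfamiliar with the "prior from text" and "posterior from speech" wording, the generic conditional VAE objective can be written as below. This is the standard evidence lower bound, given as a sketch rather than the paper's full training loss; the symbols are assumptions introduced here, with x the speech waveform, y the input text, and z the latent variable.

```latex
\log p(x \mid y) \;\ge\;
  \mathbb{E}_{q(z \mid x)}\bigl[\log p(x \mid z)\bigr]
  \;-\; \mathrm{KL}\bigl(q(z \mid x) \,\|\, p(z \mid y)\bigr)
```

The KL term penalizes the gap between the speech-derived posterior q(z | x) and the text-derived prior p(z | y), which is why the abstract speaks of enhancing the prior and simplifying the posterior.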
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
