NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Xu Tan; Jiawei Chen; Haohe Liu; Jian Cong; Chen Zhang; Yanqing Liu; Xi Wang; Yichong Leng; Yuanhao Yi; Lei He; Frank Soong; Tao Qin; Sheng Zhao; Tie-Yan Liu

arXiv:2205.04421·eess.AS·May 11, 2022·35 cites

TL;DR

This paper introduces NaturalSpeech, an end-to-end TTS system built on a variational autoencoder (VAE). On a benchmark dataset, subjective listening tests show no statistically significant difference between its output and human recordings, which the paper uses as its definition of human-level quality.

Contribution

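The paper defines human-level quality in terms of the statistical significance of a subjective measure, then presents NaturalSpeech, a VAE-based text-to-waveform system whose key modules (including phoneme pre-training, differentiable duration modeling, and bidirectional prior/posterior modeling) strengthen the prior from text and simplify the posterior from speech.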

Findings

01. Achieves a comparative mean opinion score (CMOS) of -0.01 relative to human recordings

02. No statistically significant difference from human recordings (p >> 0.05)

03. Evaluated on the LJSpeech benchmark dataset
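
The first two findings rest on a paired comparison: each listener judgement scores a synthesized utterance against the matching human recording, CMOS is the mean of those scores, and a significance test asks whether the scores are centred on zero. A minimal sketch of that calculation follows, assuming a two-sided Wilcoxon signed-rank test with a normal approximation. The function names and the example scores are hypothetical and are not taken from the paper's evaluation.

```python
import math


def cmos(deltas):
    """Mean of paired comparative scores (system minus human recording)."""
    return sum(deltas) / len(deltas)


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Zero differences are dropped and tied magnitudes share an average rank.
    Returns (w_plus, p_value). The approximation is rough for small samples.
    """
    nonzero = [d for d in deltas if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t  # tie correction for the variance
        i = j

    w_plus = sum(r for r, d in zip(ranks, ordered) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    if var <= 0:
        return w_plus, 1.0
    z = (w_plus - mean) / math.sqrt(var)
    p_value = min(1.0, math.erfc(abs(z) / math.sqrt(2)))
    return w_plus, p_value


if __name__ == "__main__":
    # Made-up scores on a -3..+3 comparison scale, for illustration only.
    example = [0, 1, -1, 0, 2, -2, 0, -1, 1, 0]
    print("CMOS:", cmos(example))
    print("W+, p:", wilcoxon_signed_rank(example))
```

A p-value well above 0.05 means the test cannot distinguish the two sets of recordings, which is the sense in which a CMOS near zero is read as human-level quality.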

Abstract

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text to waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory…
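
The roles of "the prior from text" and "the posterior from speech" are easiest to see in the standard conditional VAE bound, sketched below. This is the generic textbook objective, not the paper's full training loss, and the symbols are assumptions introduced here: x is the speech waveform, y is the input text (phoneme sequence), and z is the latent representation.

```latex
\log p(x \mid y) \;\ge\;
  \mathbb{E}_{q(z \mid x)}\bigl[\log p(x \mid z)\bigr]
  \;-\; \mathrm{KL}\bigl(q(z \mid x) \,\|\, p(z \mid y)\bigr)
```

The KL term penalizes the gap between the posterior q(z | x) encoded from speech and the prior p(z | y) predicted from text. A more expressive prior or a simpler posterior both shrink that gap, which is the stated purpose of the modules listed in the abstract.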

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing