Revisiting Over-Smoothness in Text to Speech

Yi Ren; Xu Tan; Tao Qin; Zhou Zhao; Tie-Yan Liu

arXiv:2202.13066 · eess.AS · March 1, 2022

TL;DR

This paper investigates over-smoothness in non-autoregressive text-to-speech (NAR-TTS) models. It studies two remedies, reducing the complexity of the data distribution and strengthening the modeling method, and reports that both reduce over-smoothing and improve voice quality.

Contribution

The paper analyzes the causes of over-smoothing in NAR-TTS and explores combining additional condition inputs with advanced modeling methods to improve speech quality.

Findings

1. Condition inputs reduce data complexity and improve voice quality.

2. Laplacian mixture loss effectively models multimodal distributions.

3. Combining methods further alleviates over-smoothness and enhances speech quality.
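
To make the second finding concrete, here is a minimal sketch of a Laplacian mixture negative log-likelihood for a single scalar target. It is an illustration under stated assumptions, not the paper's implementation: the function name `laplace_mixture_nll`, the toy values, and the scalar (rather than per-bin mel-spectrogram) setting are all hypothetical choices made for this example.

```python
import math


def laplace_mixture_nll(y, weights, locs, scales):
    """Negative log-likelihood of a scalar y under a mixture of Laplace components.

    Hypothetical helper for illustration; weights are assumed to sum to 1
    and scales are assumed to be positive.
    """
    # Log of each weighted component density:
    # log w_k - log(2 * b_k) - |y - mu_k| / b_k
    log_terms = [
        math.log(w) - math.log(2.0 * b) - abs(y - mu) / b
        for w, mu, b in zip(weights, locs, scales)
    ]
    # Log-sum-exp over components for numerical stability.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))


# Toy bimodal setting: targets cluster near -1 and +1, and we score y = 1.0.
# A single Laplace fitted to such data sits at 0 with scale 1.
single = laplace_mixture_nll(1.0, [1.0], [0.0], [1.0])
# A two-component mixture can place one sharp mode on each cluster.
mixture = laplace_mixture_nll(1.0, [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
```

In this toy case the mixture assigns a lower negative log-likelihood to the observed value than the single Laplace does, because it can place a narrow mode on each cluster instead of one broad mode between them.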

Abstract

Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions and the capability of modeling methods. Both simplifying data distributions and improving modeling methods can alleviate the problem. Accordingly, we first study methods reducing the complexity of data distributions. Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods. Based on these studies, we find that 1) methods that provide…
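
The gap the abstract describes, between the complexity of the data distribution and the capability of the modeling method, can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not taken from the paper: it uses made-up scalar targets and a hypothetical `mse` helper to show that a single point prediction trained with mean squared error settles on the average of two modes.

```python
import statistics

# Made-up bimodal targets: two equally plausible values for the same input.
targets = [-1.0, -1.0, 1.0, 1.0]


def mse(pred, ys):
    """Mean squared error of one constant prediction against all targets."""
    return sum((pred - y) ** 2 for y in ys) / len(ys)


# The constant that minimizes MSE is the mean of the targets.
best_point = statistics.fmean(targets)

# Compare the averaged prediction with committing to one of the real modes.
error_at_mean = mse(best_point, targets)
error_at_mode = mse(1.0, targets)
```

Here the error-minimizing prediction lies between the two modes and matches neither, which is the scalar analogue of an over-smoothed output.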

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Activation Normalization · Normalizing Flows · Invertible 1x1 Convolution · Affine Coupling · GLOW