Revisiting Over-Smoothness in Text to Speech
Yi Ren, Xu Tan, Tao Qin, Zhou Zhao, Tie-Yan Liu

TL;DR
This paper investigates over-smoothness in non-autoregressive text-to-speech (NAR-TTS) models. It studies two remedies, reducing the complexity of the data distribution and strengthening the modeling method, and finds that both improve voice quality and reduce over-smoothing.
Contribution
It offers a comprehensive analysis of over-smoothing causes and explores combining condition inputs with advanced modeling methods to improve speech quality in NAR-TTS.
Findings
Providing additional condition inputs reduces the complexity of the data distribution and improves voice quality.
Laplacian mixture loss effectively models multimodal distributions.
Combining condition inputs with advanced modeling methods further alleviates over-smoothness and enhances speech quality.
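To make the second finding concrete, the following is a minimal sketch of a Laplacian mixture negative log-likelihood, written with NumPy. It is a generic mixture-density loss, not the paper's implementation: the function name `laplace_mixture_nll`, the parameterization (logits, locations, log-scales), and the toy targets are all assumptions chosen for illustration. The toy data are bimodal, so a single Laplace component placed between the two modes (the over-smoothed prediction) scores worse than a two-component mixture placed on the modes.

```python
import numpy as np

def laplace_mixture_nll(y, logits, mu, log_b):
    """Mean negative log-likelihood under a K-component Laplacian mixture.

    Hypothetical helper for illustration only.
    y:      (N,)   scalar targets, e.g. one mel-spectrogram bin per frame
    logits: (N, K) unnormalized mixture weights
    mu:     (N, K) component locations
    log_b:  (N, K) log of the component scales
    """
    # Normalize the mixture weights in log space (log-softmax over K).
    log_w = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    # Log-density of Laplace(mu, b): -log(2b) - |y - mu| / b.
    log_p = -np.log(2.0) - log_b - np.abs(y[:, None] - mu) * np.exp(-log_b)
    # Log-sum-exp over components gives the mixture log-likelihood.
    return -np.logaddexp.reduce(log_w + log_p, axis=1).mean()

# Toy bimodal targets: half the samples at -1, half at +1 (assumed data).
y = np.array([-1.0, -1.0, 1.0, 1.0])
n = y.shape[0]

# One component centred between the modes with scale 1.
single_nll = laplace_mixture_nll(
    y,
    logits=np.zeros((n, 1)),
    mu=np.zeros((n, 1)),
    log_b=np.zeros((n, 1)),
)

# Two equally weighted components sitting on the modes with scale 0.1.
mixture_nll = laplace_mixture_nll(
    y,
    logits=np.zeros((n, 2)),
    mu=np.tile([-1.0, 1.0], (n, 1)),
    log_b=np.full((n, 2), np.log(0.1)),
)

print(f"single Laplace NLL:    {single_nll:.3f}")
print(f"Laplace mixture NLL:   {mixture_nll:.3f}")
```

In a real NAR-TTS model the three parameter arrays would be predicted per mel-spectrogram bin by the decoder; here they are fixed by hand so that the effect of the extra components is visible in isolation.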
Abstract
Non-autoregressive text to speech (NAR-TTS) models have attracted much attention from both academia and industry due to their fast generation speed. One limitation of NAR-TTS models is that they ignore the correlation in time and frequency domains while generating speech mel-spectrograms, and thus cause blurry and over-smoothed results. In this work, we revisit this over-smoothing problem from a novel perspective: the degree of over-smoothness is determined by the gap between the complexity of data distributions and the capability of modeling methods. Both simplifying data distributions and improving modeling methods can alleviate the problem. Accordingly, we first study methods reducing the complexity of data distributions. Then we conduct a comprehensive study on NAR-TTS models that use some advanced modeling methods. Based on these studies, we find that 1) methods that provide…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Activation Normalization · Normalizing Flows · Invertible 1x1 Convolution · Affine Coupling · GLOW
