Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

Jae-Sung Bae; Jinhyeok Yang; Tae-Jun Bak; Young-Sun Joo

arXiv:2204.04004 · eess.AS · August 16, 2022

TL;DR

This paper introduces HiMuV-TTS, a hierarchical, multi-scale variational autoencoder-based non-autoregressive TTS model. It aims to improve the diversity and naturalness of synthesized speech by modeling prosody at a global and a local scale, and it uses adversarial training to improve speech quality.

Contribution

The paper presents a novel hierarchical multi-scale VAE framework for NAR-TTS that improves speech diversity and naturalness over single-scale models.

Findings

01

Generated speech is more diverse and natural.

02

Model effectively captures prosody at different scales.

03

Outperforms existing single-scale VAE TTS models.

Abstract

This paper proposes a hierarchical and multi-scale variational autoencoder-based non-autoregressive text-to-speech model (HiMuV-TTS) to generate natural speech with diverse speaking styles. Recent advances in non-autoregressive TTS (NAR-TTS) models have significantly improved the inference speed and robustness of synthesized speech. However, the diversity of speaking styles and the naturalness of the synthesized speech still need to be improved. To solve this problem, we propose the HiMuV-TTS model, which first determines the global-scale prosody and then determines the local-scale prosody by conditioning on the global-scale prosody and the learned text representation. In addition, we improve the quality of speech by adopting an adversarial training technique. Experimental results verify that the proposed HiMuV-TTS model can generate more diverse and natural speech than TTS models with single-scale…
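The two-step ordering described in the abstract (global-scale prosody first, then local-scale prosody conditioned on it and on the text representation) can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the paper's architecture: the function names, the standard-normal global prior, the averaging rule in `local_prior_params`, and the fixed standard deviation are all hypothetical stand-ins for learned networks.

```python
import random


def sample_gaussian(mean, std, rng):
    """Sample a diagonal Gaussian as z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def local_prior_params(text_vec, z_global):
    """Hypothetical conditioning rule (assumption, not from the paper):
    the local prior mean is the elementwise average of the token's text
    vector and the global latent; the standard deviation is fixed."""
    mean = [0.5 * (t + g) for t, g in zip(text_vec, z_global)]
    std = [0.1] * len(mean)
    return mean, std


def sample_prosody(text_repr, dim, rng):
    """Draw one global latent per utterance, then one local latent per token."""
    # Step 1: global-scale prosody, sampled once for the whole utterance.
    z_global = sample_gaussian([0.0] * dim, [1.0] * dim, rng)
    # Step 2: local-scale prosody, one latent per token, conditioned on
    # both the global latent and that token's text representation.
    z_local = []
    for text_vec in text_repr:
        mean, std = local_prior_params(text_vec, z_global)
        z_local.append(sample_gaussian(mean, std, rng))
    return z_global, z_local


rng = random.Random(0)
text_repr = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # two tokens, toy 3-dim encodings
z_global, z_local = sample_prosody(text_repr, dim=3, rng=rng)
```

Resampling `z_global` changes every local prior at once, which is one way to picture how a global latent could shift the overall speaking style while the local latents vary prosody token by token.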

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings