VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

Hui Lu; Zhiyong Wu; Xixin Wu; Xu Li; Shiyin Kang; Xunying Liu; Helen Meng

arXiv:2107.03298 · cs.SD · July 8, 2021

TL;DR

This paper introduces VAENAR-TTS, a non-autoregressive text-to-speech model based on a variational auto-encoder. It synthesizes high-quality speech without phoneme-level duration labels, at a speed comparable to other non-autoregressive TTS models.

Contribution

The proposed VAENAR-TTS model is an end-to-end, non-autoregressive TTS system that encodes alignment in latent variables, eliminating the need for phoneme duration labels and recurrent structures.

Findings

01. Achieves state-of-the-art synthesis quality.
02. Provides synthesis speed comparable to other NAR-TTS models.
03. Does not require phoneme-level duration labels.

Abstract

This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. The autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient with the parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, either through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed model of VAENAR-TTS is an end-to-end approach that does not require phoneme-level durations. The VAENAR-TTS model does…
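
The hard alignment that the abstract attributes to earlier NAR-TTS models can be sketched as follows. This is a minimal Python illustration, not code from the paper: the function name `expand_by_duration`, the toy vectors, and the duration values are all assumptions made for the example. Each phoneme's hidden vector is repeated for as many spectrogram frames as its duration label specifies, which is the step VAENAR-TTS is designed to avoid.

```python
from typing import List, Sequence


def expand_by_duration(
    phoneme_vectors: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme vector once per spectrogram frame it spans.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[List[float]] = []
    for vector, n_frames in zip(phoneme_vectors, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Hard alignment: each of these frames is copied from exactly one phoneme.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Toy input: three phonemes with 2-dimensional hidden vectors (made-up values).
phonemes = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Frames per phoneme; labels of this kind are what the paper avoids needing.
durations = [2, 1, 3]

frames = expand_by_duration(phonemes, durations)
print(len(frames))  # equals sum(durations)
```

Because every frame is tied to exactly one phoneme, the output length and timing are fixed entirely by the duration labels, which is why such models depend on forced alignment or knowledge distillation to supply them.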

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques