DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis

Yu Gu; Qiushi Zhu; Guangzhi Lei; Chao Weng; Dan Su

arXiv:2410.13288·eess.AS·October 18, 2024

Open Access

TL;DR

DurIAN-E 2 introduces an advanced duration-informed attention network for expressive TTS, integrating VAEs, normalizing flows, and adversarial training to significantly enhance speech quality and expressiveness.

Contribution

It presents a novel TTS model combining stacked SwishRNN Transformer encoders, Style-Adaptive Instance Normalization, VAEs with normalizing flows, and adversarial training, outperforming previous models.

Findings

01 Achieves superior speech quality and expressiveness in objective tests.

02 Outperforms several state-of-the-art TTS approaches.

03 Demonstrates effectiveness of combining VAEs, normalizing flows, and adversarial training.

Abstract

This paper proposes DurIAN-E 2, an improved version of DurIAN-E and likewise a duration-informed attention neural network for expressive, high-fidelity text-to-speech (TTS) synthesis. As in the DurIAN-E model, multiple stacked SwishRNN-based Transformer blocks serve as linguistic encoders, and Style-Adaptive Instance Normalization (SAIN) layers are incorporated into the frame-level encoders to improve the modeling of expressiveness. In addition, motivated by TTS models built on generative models, such as VITS, DurIAN-E 2 uses variational autoencoders (VAEs) augmented with normalizing flows and a BigVGAN waveform generator trained with an adversarial strategy, which further improves synthesized speech quality and expressiveness. Both objective test and subjective evaluation results prove that the proposed…
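The abstract names SAIN layers without describing them. As a rough illustration only, the sketch below assumes SAIN follows the usual adaptive-instance-normalization pattern: each channel is normalized over time, then scaled and shifted by values predicted from a style vector. The function names, the linear projections (`w_gamma`, `b_gamma`, `w_beta`, `b_beta`), and the toy dimensions are hypothetical and are not taken from the paper.

```python
import math


def instance_norm(frames, eps=1e-5):
    """Normalize each channel over the time axis (zero mean, unit variance).

    frames: list of T frames, each a list of C channel values.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    means = [sum(f[c] for f in frames) / num_frames for c in range(num_channels)]
    variances = [
        sum((f[c] - means[c]) ** 2 for f in frames) / num_frames
        for c in range(num_channels)
    ]
    return [
        [(f[c] - means[c]) / math.sqrt(variances[c] + eps) for c in range(num_channels)]
        for f in frames
    ]


def style_adaptive_instance_norm(frames, style, w_gamma, b_gamma, w_beta, b_beta):
    """Instance-normalize, then apply a style-dependent per-channel scale and shift.

    Assumption: gamma and beta come from plain linear projections of the
    style vector; the paper's actual parameterization may differ.
    """
    def linear(weights, bias):
        # One output per weight row: dot(row, style) + bias.
        return [sum(w * s for w, s in zip(row, style)) + b
                for row, b in zip(weights, bias)]

    gamma = linear(w_gamma, b_gamma)  # per-channel scale
    beta = linear(w_beta, b_beta)     # per-channel shift
    normed = instance_norm(frames)
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)] for frame in normed]


# Toy usage: 3 frames, 2 channels, 1-dimensional style vector.
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
style = [0.5]
out = style_adaptive_instance_norm(
    frames, style,
    w_gamma=[[2.0], [0.0]], b_gamma=[0.0, 1.0],  # gamma = [1.0, 1.0]
    w_beta=[[0.0], [4.0]], b_beta=[0.0, 0.0],    # beta  = [0.0, 2.0]
)
```

In this toy setup the style vector only moves the per-channel statistics: after the call, the time-average of each output channel equals that channel's `beta`, regardless of the input's original mean and scale.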

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques

Methods: Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout