DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis
Yu Gu, Qiushi Zhu, Guangzhi Lei, Chao Weng, Dan Su

TL;DR
DurIAN-E 2 is an improved duration-informed attention network for expressive TTS that integrates VAEs, normalizing flows, and adversarial training to improve speech quality and expressiveness.
Contribution
It presents a novel TTS model combining stacked SwishRNN Transformer encoders, Style-Adaptive Instance Normalization, VAEs with normalizing flows, and adversarial training, outperforming previous models.
Findings
Achieves superior speech quality and expressiveness in both objective tests and subjective evaluations.
Outperforms several state-of-the-art TTS approaches.
Demonstrates effectiveness of combining VAEs, normalizing flows, and adversarial training.
Abstract
This paper proposes an improved version of DurIAN-E (DurIAN-E 2), which is also a duration-informed attention neural network for expressive and high-fidelity text-to-speech (TTS) synthesis. As in the DurIAN-E model, multiple stacked SwishRNN-based Transformer blocks are used as linguistic encoders, and Style-Adaptive Instance Normalization (SAIN) layers are incorporated into the frame-level encoders to improve the modeling of expressiveness in the proposed DurIAN-E 2. Meanwhile, motivated by other TTS models that use generative models, such as VITS, the proposed DurIAN-E 2 utilizes variational autoencoders (VAEs) augmented with normalizing flows and a BigVGAN waveform generator with an adversarial training strategy, which further improve the synthesized speech quality and expressiveness. Both objective test and subjective evaluation results prove that the proposed…
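The abstract names Style-Adaptive Instance Normalization (SAIN) but does not give its formula. The following is a minimal sketch, assuming SAIN follows the general adaptive instance normalization pattern: each channel is normalized over its frames, then rescaled and shifted by a gain and bias projected from a style embedding. The function names, the projection parameters (`w_gamma`, `b_gamma`, `w_beta`, `b_beta`), and the pure-Python list layout are all hypothetical and are not taken from the paper.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel's frame sequence to zero mean and unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    return [(x - mean) / math.sqrt(var + eps) for x in channel]


def linear(vec, weights, bias):
    """Affine projection: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def style_adaptive_instance_norm(features, style,
                                 w_gamma, b_gamma, w_beta, b_beta):
    """Assumed SAIN-like layer (illustrative, not the authors' code).

    features: list of channels, each a list of per-frame values.
    style:    style embedding vector.
    The per-channel gain (gamma) and bias (beta) are predicted from `style`.
    """
    gammas = linear(style, w_gamma, b_gamma)
    betas = linear(style, w_beta, b_beta)
    return [[g * x + b for x in instance_norm(ch)]
            for ch, g, b in zip(features, gammas, betas)]
```

Under this assumption, the style embedding alone determines each output channel's mean and scale, while the normalized frame-level features carry the content. A trained model would learn the projection weights rather than have them set by hand.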
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
Methods: Dense Connections · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Attention Is All You Need · Adam · Linear Layer · Softmax · Multi-Head Attention · Dropout
