DurIAN: Duration Informed Attention Network For Multimodal Synthesis
Chengzhu Yu, Heng Lu, Na Hu, Meng Yu, Chao Weng, Kun Xu, Peng Liu, Deyi Tuo, Shiyin Kang, Guangzhi Lei, Dan Su, Dong Yu

TL;DR
This paper introduces DurIAN, a duration-informed attention network for multimodal synthesis that produces natural speech and facial expressions, improving efficiency and avoiding common artifacts of end-to-end systems.
Contribution
The paper presents DurIAN, a novel autoregressive model using duration-informed alignments, and a multi-band WaveRNN for faster speech generation, enhancing multimodal synthesis quality and efficiency.
Findings
DurIAN achieves speech quality comparable to state-of-the-art systems.
Multi-band WaveRNN reduces computational complexity significantly.
The system enables synchronized speech and facial expression generation.
Abstract
In this paper, we present a generic and robust multimodal synthesis system that simultaneously produces highly natural speech and facial expressions. The key component of this system is the Duration Informed Attention Network (DurIAN), an autoregressive model in which the alignments between the input text and the output acoustic features are inferred from a duration model. This differs from the end-to-end attention mechanism that is used in existing end-to-end speech synthesis systems such as Tacotron and that accounts for various unavoidable artifacts in them. Furthermore, DurIAN can be used to generate high-quality facial expressions that can be synchronized with the generated speech, with or without parallel speech and face data. To improve the efficiency of speech generation, we also propose a multi-band parallel generation strategy on top of the WaveRNN model. The proposed Multi-band WaveRNN effectively…
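The idea of inferring alignments from a duration model, rather than learning them with soft attention, can be illustrated with a minimal sketch. It assumes that a duration model has already produced an integer frame count per phoneme; the function names `expand_by_duration` and `frame_to_phoneme` are hypothetical and are not taken from the paper, whose actual implementation may differ.

```python
from typing import List, Sequence


def frame_to_phoneme(durations: Sequence[int]) -> List[int]:
    """Return, for every output frame, the index of the phoneme it is aligned to.

    Assumption: durations[i] is the (non-negative) number of acoustic frames
    predicted for phoneme i by some duration model.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    return [i for i, d in enumerate(durations) for _ in range(d)]


def expand_by_duration(
    states: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat each per-phoneme encoder state for its predicted number of frames.

    The result has one row per output frame, so a decoder can consume it
    frame by frame without having to learn where to attend.
    """
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")
    return [list(states[i]) for i in frame_to_phoneme(durations)]


# Toy input: three phonemes with 2-dimensional encoder states (illustrative values).
toy_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_durations = [2, 0, 3]

toy_alignment = frame_to_phoneme(toy_durations)
toy_frames = expand_by_duration(toy_states, toy_durations)
```

Because every frame is tied to exactly one phoneme, a hard alignment of this kind cannot skip or repeat input tokens, which is one plausible reading of why duration-informed alignment avoids the artifacts the abstract attributes to end-to-end attention.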
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Softmax · WaveRNN
