MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation
Xiaofeng Mao, Zhengkai Jiang, Qilin Wang, Chencan Fu, Jiangning Zhang, Jiafu Wu, Yabiao Wang, Chengjie Wang, Wei Li, Mingmin Chi

TL;DR
This paper introduces MDT-A2G, a novel Masked Diffusion Transformer for co-speech gesture generation that improves temporal reasoning, integrates multi-modal data, and significantly accelerates both learning and inference.
Contribution
The paper presents a new Masked Diffusion Transformer architecture for gesture generation, enhancing temporal relation learning and multi-modal integration, with faster training and inference.
Findings
Over 6× faster learning speed compared to traditional diffusion transformers.
Inference speed increased by 5.7× over standard diffusion models.
Achieves coherent and realistic gesture generation.
Abstract
Recent advancements in the field of Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture in the domain of co-speech gesture generation remains relatively unexplored, as prior methodologies have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple transformer layers. In an attempt to bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which directly implements the denoising process on gesture sequences. To enhance the contextual reasoning capability of temporally aligned speech-driven gestures, we incorporate a novel Masked Diffusion Transformer. This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among…
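The mask modeling idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name mask_gesture_frames, the mask_ratio parameter, and the use of an all-zero vector as the mask token are assumptions made purely for illustration. The sketch hides a random subset of frames in a gesture sequence, which is the kind of input a model would then be trained to reconstruct from the surrounding temporal context.

```python
import random


def mask_gesture_frames(gestures, mask_ratio=0.3, seed=0):
    """Hide a random fraction of frames in a gesture sequence.

    gestures: list of frames, each frame a list of pose features.
    Returns (masked_sequence, mask), where mask[t] is True if frame t was hidden.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    num_frames = len(gestures)
    num_masked = round(mask_ratio * num_frames)

    # Choose which time steps to hide, without repetition.
    masked_idx = set(rng.sample(range(num_frames), num_masked))
    mask = [t in masked_idx for t in range(num_frames)]

    # Replace hidden frames with a zero vector (an assumed stand-in for a mask token).
    masked = [
        [0.0] * len(frame) if hidden else list(frame)
        for frame, hidden in zip(gestures, mask)
    ]
    return masked, mask


# Toy sequence: 10 frames, 4 pose features each.
gestures = [[1.0] * 4 for _ in range(10)]
masked, mask = mask_gesture_frames(gestures, mask_ratio=0.3)
```

In a training loop, the masked sequence would be fed to the model and the loss would compare its prediction against the original frames; a learned mask embedding would typically replace the zero vector used here.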
Taxonomy
Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
