MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

Xiaofeng Mao; Zhengkai Jiang; Qilin Wang; Chencan Fu; Jiangning Zhang; Jiafu Wu; Yabiao Wang; Chengjie Wang; Wei Li; Mingmin Chi

arXiv:2408.03312·cs.CV·August 7, 2024

TL;DR

This paper introduces MDT-A2G, a novel Masked Diffusion Transformer for co-speech gesture generation that improves temporal reasoning, integrates multi-modal data, and significantly speeds up both learning and inference.

Contribution

The paper presents a new Masked Diffusion Transformer architecture for gesture generation, enhancing temporal relation learning and multi-modal integration, with faster training and inference.

Findings

01

Over 6× faster learning speed compared to traditional diffusion transformers.

02

Inference speed increased by 5.7× over standard diffusion models.

03

Achieves coherent and realistic gesture generation.

Abstract

Recent advancements in the field of Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture in the domain of co-speech gesture generation remains relatively unexplored, as prior methodologies have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple transformer layers. In an attempt to bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which directly implements the denoising process on gesture sequences. To enhance the contextual reasoning capability of temporally aligned speech-driven gestures, we incorporate a novel Masked Diffusion Transformer. This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among…
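
The abstract mentions a mask modeling scheme applied to gesture sequences, but the excerpt is cut off before the details. As a rough illustration of the general idea only, the sketch below hides a random subset of frames in a toy gesture sequence, which is the kind of input a masked model would be asked to reconstruct from the remaining context. The function name `mask_frames`, the 0.5 mask ratio, and the uniform random sampling are all assumptions made for illustration; they are not taken from the paper.

```python
import random


def mask_frames(sequence, mask_ratio=0.5, seed=None):
    """Randomly hide a fraction of frames in a gesture sequence.

    Hypothetical helper for illustration; not the paper's masking scheme.
    sequence: list of per-frame pose vectors (each a list of floats).
    Returns (visible_frames, visible_indices, masked_indices).
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_frames = len(sequence)
    num_masked = int(num_frames * mask_ratio)
    # Sample frame indices to hide, uniformly without replacement.
    masked_indices = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_indices)
    visible_indices = [i for i in range(num_frames) if i not in masked_set]
    visible_frames = [sequence[i] for i in visible_indices]
    return visible_frames, visible_indices, masked_indices


# Toy sequence: 10 frames, each a 2-dimensional pose vector.
frames = [[float(t), float(t) * 0.5] for t in range(10)]
visible_frames, visible_idx, masked_idx = mask_frames(frames, mask_ratio=0.5, seed=0)
print("visible frame indices:", visible_idx)
print("masked frame indices:", masked_idx)
```

In a masked training setup of this general kind, only the visible frames would be passed to the encoder, and the model would be trained to recover the hidden ones from temporal context. How MDT-A2G chooses which frames to mask, and at what ratio, is not stated in this excerpt.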

Taxonomy

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections