DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework

Fan Zhang; Naye Ji; Fuxing Gao; Bozuo Zhao; Jingmei Wu; Yanbing Jiang; Hui Du; Zhenqing Ye; Jiayang Zhu; WeiFan Zhong; Leyao Yan; Xiaomeng Ma

arXiv:2408.00370·cs.GR·August 2, 2024

Open Access

TL;DR

DiM-Gesture is an end-to-end model that generates personalized 3D full-body gestures from raw speech audio using a Mamba-based diffusion architecture, improving inference speed and memory efficiency over Transformer-based methods.

Contribution

Introduces DiM-Gesture, which combines a Mamba-based fuzzy feature extractor with a non-autoregressive AdaLN Mamba-2 diffusion architecture for efficient, high-fidelity gesture generation from speech.

Findings

01. Outperforms state-of-the-art methods in gesture-speech synchronization.

02. Achieves faster inference with lower memory usage.

03. Maintains naturalness and personalization in generated gestures.

Abstract

Speech-driven gesture generation is an emerging domain within virtual human creation. Current methods predominantly use Transformer-based architectures, which require extensive memory and suffer from slow inference. In response to these limitations, we propose DiM-Gesture, a novel end-to-end generative model that creates highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. The model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, built on a Mamba framework and a pre-trained WavLM model, autonomously derives implicit, continuous fuzzy features, which are then unified into a single latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform…
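As a rough illustration of the adaptive layer normalization idea named in the abstract, the sketch below shows the generic AdaLN pattern used in conditional diffusion models: features are layer-normalized without learned affine parameters, then scaled and shifted by values predicted from a conditioning vector. This is not the paper's implementation. The function names (`layer_norm`, `adaln`), the linear projections `w_scale` and `w_shift`, and all shapes are assumptions made for illustration, and the Mamba-2 block itself is not modeled.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row of x over its last axis (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln(x, cond, w_scale, w_shift):
    """Generic adaptive layer normalization (hypothetical helper).

    x:       (T, D) sequence of gesture features
    cond:    (C,)   conditioning vector (e.g. a speech-derived latent)
    w_scale: (C, D) assumed linear map from condition to per-channel scale
    w_shift: (C, D) assumed linear map from condition to per-channel shift
    """
    scale = cond @ w_scale  # (D,)
    shift = cond @ w_shift  # (D,)
    # The "1 + scale" form reduces to plain layer norm when scale is zero.
    return layer_norm(x) * (1.0 + scale) + shift


rng = np.random.default_rng(0)
T, D, C = 4, 8, 3
x = rng.normal(size=(T, D))
cond = rng.normal(size=C)

# Zero-initialized projections: the condition has no effect yet.
w_scale = np.zeros((C, D))
w_shift = np.zeros((C, D))
out = adaln(x, cond, w_scale, w_shift)
```

With zero-initialized projections the output equals plain layer normalization of `x`; once the projections are trained, the same conditioning vector modulates every time step uniformly, which is one plausible reading of how a single latent feature can steer the whole sequence.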

Taxonomy

Topics: Hand Gesture Recognition Systems · Speech and dialogue systems · Hearing Impairment and Communication

Methods: Layer Normalization · Diffusion · Mamba: Linear-Time Sequence Modeling with Selective State Spaces