DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework
Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma

TL;DR
DiM-Gesture is a novel end-to-end model that generates personalized 3D full-body gestures from speech audio using a diffusion architecture, improving speed and memory efficiency over Transformer-based methods.
Contribution
Introduces DiM-Gesture, which combines a Mamba-based fuzzy feature extractor with an AdaLN Mamba-2 diffusion architecture for efficient, high-fidelity gesture generation from speech.
Findings
Outperforms state-of-the-art methods in gesture-speech synchronization.
Achieves faster inference with lower memory usage.
Maintains naturalness and personalization in generated gestures.
Abstract
Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly utilize Transformer-based architectures that necessitate extensive memory and are characterized by slow inference speeds. In response to these limitations, we propose DiM-Gestures, a novel end-to-end generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. This model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba framework and a WavLM pre-trained model, autonomously derives implicit, continuous fuzzy features, which are then unified into a singular latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform…
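The abstract names Adaptive Layer Normalization (AdaLN) but is cut off before describing it. As a rough illustration only, the sketch below shows the common form of AdaLN, in which a conditioning vector predicts a per-feature scale and shift that are applied after a plain layer normalization. The function names, the `(1 + scale)` parameterization, and the toy dimensions are assumptions made for illustration; they are not taken from the paper and may differ from what DiM-Gestures actually implements.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(w, b, c):
    """Affine map w @ c + b, with w given as a list of rows (hypothetical helper)."""
    return [sum(wi * ci for wi, ci in zip(row, c)) + bi for row, bi in zip(w, b)]

def adaln(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive layer norm: scale and shift are predicted from the condition.

    Assumed form: y = (1 + scale(cond)) * LN(x) + shift(cond).
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [(1.0 + s) * h + t for h, s, t in zip(layer_norm(x), scale, shift)]

# Toy example: a 4-dimensional feature and a 2-dimensional condition.
x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]
zeros_w = [[0.0, 0.0] for _ in x]
zeros_b = [0.0 for _ in x]

# With zero-initialized projections, AdaLN reduces to plain layer normalization.
out = adaln(x, cond, zeros_w, zeros_b, zeros_w, zeros_b)
```

In a conditional diffusion model of this kind, the condition would typically carry the diffusion timestep and the speech-derived latent feature, so each layer's normalization statistics are modulated by the conditioning signal rather than by fixed learned parameters.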
Taxonomy
Topics: Hand Gesture Recognition Systems · Speech and dialogue systems · Hearing Impairment and Communication
Methods: Layer Normalization · Diffusion · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
