Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu

TL;DR
This paper introduces a novel motion-decoupled diffusion framework for generating realistic, audio-driven co-speech gesture videos that maintain appearance details and temporal coherence, surpassing previous skeleton-based methods.
Contribution
It proposes a nonlinear thin-plate spline (TPS) transformation for extracting latent motion features and a transformer-based diffusion model for generating gestures that are temporally aligned with the driving speech, improving visual quality and temporal consistency.
Findings
Outperforms existing methods in motion quality
Produces more coherent and detailed gesture videos
Effective in preserving appearance information
Abstract
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is…
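To make the term "TPS transformation" concrete, the following is a minimal sketch of classical 2D thin-plate spline interpolation in NumPy. It only illustrates the textbook warp that maps a set of control points onto target points; it is not the paper's motion-feature extractor, and the function names (`tps_kernel`, `fit_tps`, `apply_tps`) and the sample points are assumptions made for illustration.

```python
import numpy as np

def tps_kernel(r2):
    """Radial basis U(r) = r^2 * log(r^2), taking squared distances; U(0) = 0."""
    r2 = np.asarray(r2, dtype=float)
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out

def fit_tps(src, dst):
    """Solve for TPS parameters mapping src control points (n, 2) onto dst (n, 2).

    Returns (w, a): w has shape (n, 2) with the nonlinear kernel weights,
    a has shape (3, 2) with the affine part (bias, x, y).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])  # homogeneous coordinates [1, x, y]
    # Block system [[K, P], [P^T, 0]] @ [w; a] = [dst; 0]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    params = np.linalg.solve(A, b)
    return params[:n], params[n:]

def apply_tps(points, src, w, a):
    """Warp arbitrary points (m, 2) with a TPS fitted on control points src."""
    points = np.asarray(points, dtype=float)
    src = np.asarray(src, dtype=float)
    d2 = np.sum((points[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return tps_kernel(d2) @ w + P @ a

# Hypothetical control points: unit-square corners plus the centre, slightly displaced.
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
dst = src + np.array([[0.1, 0], [0, 0.1], [-0.1, 0], [0, -0.1], [0.05, 0.05]])
w, a = fit_tps(src, dst)
warped = apply_tps(src, src, w, a)  # reproduces dst at the control points
```

The warp interpolates the control points exactly and bends smoothly in between, which is what makes it nonlinear, in contrast to a single affine map. When the displacement is a pure translation, the kernel weights `w` vanish and only the affine part `a` remains.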
Taxonomy
Topics: Human Motion and Animation · Hand Gesture Recognition Systems · Simulation and Modeling Applications
Methods: Focus · Diffusion
