Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Xu He; Qiaochu Huang; Zhensong Zhang; Zhiwei Lin; Zhiyong Wu; Sicheng Yang; Minglei Li; Zhiyi Chen; Songcen Xu; Xiaofei Wu

arXiv:2404.01862 · cs.CV · April 3, 2024 · 1 cites

Open Access · 1 Repo

TL;DR

This paper introduces a novel motion-decoupled diffusion framework for generating realistic, audio-driven co-speech gesture videos that maintain appearance details and temporal coherence, surpassing previous skeleton-based methods.

Contribution

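The paper proposes a nonlinear thin-plate spline (TPS) transformation that extracts latent motion features while preserving appearance information, and a transformer-based diffusion model that generates gestures temporally aligned with the driving speech, improving visual quality and temporal consistency.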

Findings

01. Outperforms existing methods in motion quality

02. Produces more coherent and detailed gesture videos

03. Effective in preserving appearance information

Abstract

Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is…
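To make the "nonlinear TPS transformation" concrete, here is a minimal sketch of a classical 2D thin-plate spline (TPS) warp fitted to keypoint correspondences. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify the paper's exact formulation, and the names `tps_kernel`, `solve`, `fit_tps`, and `warp` are hypothetical. The sketch uses the textbook kernel U(r) = r² log r² plus an affine term, and solves the resulting linear system with plain Gaussian elimination so that it needs only the standard library.

```python
import math


def tps_kernel(r2):
    # Radial basis U(r) = r^2 * log(r^2), taking the squared distance r2.
    # U(0) is defined as 0 by continuity.
    return 0.0 if r2 == 0.0 else r2 * math.log(r2)


def solve(A, B):
    # Gauss-Jordan elimination with partial pivoting.
    # A is n x n, B is n x m (several right-hand sides solved at once).
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: points may be collinear or repeated")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [[M[i][n + j] / M[i][i] for j in range(len(B[0]))] for i in range(n)]


def fit_tps(src, dst):
    # Build and solve the standard TPS system [[K, P], [P^T, 0]] [w; a] = [dst; 0].
    # Rows 0..n-1 of the result are kernel weights w; the last 3 rows are the
    # affine coefficients for the basis (1, x, y). Columns are output x and y.
    n = len(src)
    A = [[0.0] * (n + 3) for _ in range(n + 3)]
    B = [[0.0, 0.0] for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(src):
        for j, (xj, yj) in enumerate(src):
            A[i][j] = tps_kernel((xi - xj) ** 2 + (yi - yj) ** 2)
        for k, v in enumerate((1.0, xi, yi)):
            A[i][n + k] = v
            A[n + k][i] = v
        B[i] = [float(dst[i][0]), float(dst[i][1])]
    return solve(A, B)


def warp(point, src, params):
    # Map one (x, y) point through the fitted spline.
    x, y = point
    n = len(src)
    out = []
    for d in range(2):
        val = params[n][d] + params[n + 1][d] * x + params[n + 2][d] * y
        for i, (xi, yi) in enumerate(src):
            val += params[i][d] * tps_kernel((x - xi) ** 2 + (y - yi) ** 2)
        out.append(val)
    return tuple(out)


if __name__ == "__main__":
    # Hypothetical keypoints: four fixed corners and one centre point that moves.
    src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    dst = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.6, 0.4)]
    params = fit_tps(src, dst)
    print(warp((0.5, 0.5), src, params))
    print(warp((0.25, 0.75), src, params))
```

The fitted warp passes exactly through the control points and bends smoothly in between, which is what lets a small set of keypoints describe non-rigid motion that a single affine map cannot. When the target points are an affine image of the source points, the kernel weights vanish and the warp reduces to that affine map.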

Code & Models

Repositories

thuhcsi/s2g-mddiffusion (PyTorch, official)

Taxonomy

Topics: Human Motion and Animation · Hand Gesture Recognition Systems · Simulation and Modeling Applications

Methods: Focus · Diffusion