Co-speech Gesture Video Generation via Motion-Based Graph Retrieval
Yafei Song, Peng Zhang, Bang Zhang

TL;DR
This paper introduces a framework that combines a diffusion model with motion graph retrieval to generate synchronized, natural co-speech gesture videos. It targets the many-to-many relationship between audio and gestures, which earlier one-to-one mapping methods cannot adequately capture.
Contribution
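The paper proposes two components: a diffusion model that implicitly learns the joint distribution of audio and motion, and a motion-based retrieval algorithm that uses the generated motion to select a trajectory from the motion graph for gesture video synthesis.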
Findings
Significant improvement in synchronization accuracy
Enhanced naturalness of generated gestures
Effective integration of multi-level audio features
Abstract
Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate…
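The retrieval idea described above can be illustrated with a small sketch: given a motion sequence (standing in for the diffusion model's output), find the path through a motion graph whose node poses are closest to it. This is not the authors' algorithm. The Viterbi-style dynamic program, the Euclidean distance, the function name `retrieve_path`, and the toy one-dimensional pose features are all assumptions made for illustration.

```python
import math


def retrieve_path(node_poses, edges, query):
    """Return (path, cost): the graph path closest to the query motion.

    node_poses: list of pose feature vectors, one per graph node (hypothetical).
    edges: dict mapping a node index to the nodes reachable in one step.
    query: list of pose vectors, e.g. motion produced by a generative model.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(node_poses)
    # cost[i] = best total distance of any path ending at node i so far.
    cost = [dist(query[0], node_poses[i]) for i in range(n)]
    back = []  # back[t][i] = predecessor of node i at step t + 1
    for frame in query[1:]:
        new_cost = [math.inf] * n
        prev = [-1] * n
        for src in range(n):
            if cost[src] == math.inf:
                continue
            for dst in edges.get(src, ()):
                c = cost[src] + dist(frame, node_poses[dst])
                if c < new_cost[dst]:
                    new_cost[dst] = c
                    prev[dst] = src
        back.append(prev)
        cost = new_cost

    end = min(range(n), key=cost.__getitem__)
    if cost[end] == math.inf:
        raise ValueError("no path of the required length exists in the graph")
    # Walk the predecessor tables backwards to recover the path.
    path = [end]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    path.reverse()
    return path, cost[end]


# Toy graph: four nodes with 1-D poses and a few allowed transitions.
poses = [[0.0], [1.0], [2.0], [5.0]]
graph = {0: [1], 1: [2], 2: [3, 0], 3: [0]}
generated = [[0.1], [0.9], [2.2]]  # placeholder for generated motion
path, total = retrieve_path(poses, graph, generated)
print(path)  # → [0, 1, 2]
```

Because the query here is a motion sequence rather than an audio feature, the search compares like with like. The many-to-many ambiguity between audio and gesture is left to the generative step, which in this sketch is only a hard-coded placeholder.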
Taxonomy
Topics: Human Motion and Animation · Hand Gesture Recognition Systems · Human Pose and Action Recognition
