Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

Yafei Song; Peng Zhang; Bang Zhang

arXiv:2512.02576 · cs.CV · December 3, 2025

Open Access

TL;DR

This paper introduces a novel framework combining diffusion models and motion graph retrieval to generate synchronized, natural co-speech gesture videos, addressing limitations of previous one-to-one mapping methods.

Contribution

It proposes two components: a diffusion-based model that learns the joint distribution of audio and motion, and a motion-based retrieval algorithm that uses the generated motion to select a trajectory from the motion graph for gesture video synthesis.

Findings

01. Significant improvement in synchronization accuracy
02. Enhanced naturalness of generated gestures
03. Effective integration of multi-level audio features

Abstract

Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate…
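
The excerpt above does not spell out the retrieval algorithm, so the following is only a minimal sketch of what retrieving a trajectory from a motion graph by motion similarity could look like. It assumes each graph node carries a fixed-length motion feature, that edges list the allowed transitions, and that the diffusion model has already produced a target motion sequence. The function name `retrieve_path`, the squared-distance cost, and the Viterbi-style dynamic program are all illustrative assumptions, not the authors' method.

```python
import numpy as np


def retrieve_path(node_feats, edges, target):
    """Find the graph path whose node motions best match a target motion.

    node_feats: (N, D) array, one motion feature per graph node (assumed).
    edges:      dict mapping a node index to its allowed successor nodes.
    target:     (T, D) array, e.g. a motion sequence from a generative model.
    Returns a list of T node indices that follows the edges and minimizes
    the summed squared distance to the target.
    """
    n_nodes = node_feats.shape[0]
    n_steps = target.shape[0]

    # cost[t, i] = squared distance between target frame t and node i.
    diff = target[:, None, :] - node_feats[None, :, :]
    cost = (diff ** 2).sum(axis=-1)

    # best[t, i] = cheapest cost of any valid path ending at node i at step t.
    best = np.full((n_steps, n_nodes), np.inf)
    back = np.full((n_steps, n_nodes), -1, dtype=int)
    best[0] = cost[0]  # any node may start the path

    for t in range(1, n_steps):
        for src, successors in edges.items():
            if not np.isfinite(best[t - 1, src]):
                continue
            for dst in successors:
                candidate = best[t - 1, src] + cost[t, dst]
                if candidate < best[t, dst]:
                    best[t, dst] = candidate
                    back[t, dst] = src

    end = int(np.argmin(best[-1]))
    if not np.isfinite(best[-1, end]):
        raise ValueError("no path of the requested length exists in the graph")

    # Walk the back-pointers from the last step to the first.
    path = [end]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because the search is driven by a generated motion rather than by a direct audio-to-node distance, the audio never has to be matched one-to-one against graph entries in this sketch; the many-to-many ambiguity is left to the generative model that produced `target`.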

Taxonomy

Topics: Human Motion and Animation · Hand Gesture Recognition Systems · Human Pose and Action Recognition