Democratizing High-Fidelity Co-Speech Gesture Video Generation

Xu Yang; Shaoli Huang; Shenbo Xie; Xuelin Chen; Yifei Liu; Changxing Ding

arXiv:2507.06812·cs.CV·July 15, 2025

Open Access

TL;DR

This paper introduces a lightweight diffusion-based framework for generating realistic co-speech gesture videos, leveraging 2D skeletons and a new large-scale dataset to improve synchronization and visual quality.

Contribution

It presents a novel skeleton-conditioned diffusion model and the first public dataset for high-fidelity co-speech gesture video synthesis, enhancing accessibility and performance.

Findings

01 Outperforms state-of-the-art methods in visual quality and synchronization

02 Generalizes well across speakers and contexts

03 Provides a new dataset with extensive annotations

Abstract

Co-speech gesture video generation aims to synthesize realistic, audio-aligned videos of speakers, complete with synchronized facial expressions and body gestures. This task presents challenges due to the significant one-to-many mapping between audio and visual content, further complicated by the scarcity of large-scale public datasets and high computational demands. We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. Our approach introduces a diffusion model conditioned on fine-grained audio segments and a skeleton extracted from the speaker's reference image, predicting skeletal motions through skeleton-audio feature fusion to ensure strict audio coordination and body shape consistency. The generated skeletons are then fed into an off-the-shelf human video generation model with the…
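
The conditioning idea described in the abstract (per-frame audio features fused with a skeleton from the reference image, then used to condition a diffusion model over skeletal motion) can be illustrated with a minimal sketch. This is not the authors' architecture: the dimensions, the additive fusion in `fuse`, the one-layer `denoise` network, the random weights, and the linear noise schedule are all assumptions made for illustration. Only the forward noising step in `q_sample` is the standard DDPM formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
T, J, A, D = 8, 17, 32, 64  # frames, 2D joints, audio feature dim, hidden dim

def fuse(audio_feats, ref_skeleton, W_a, W_s):
    """Fuse per-frame audio features (T, A) with one reference skeleton (J, 2).

    Assumed fusion: project both to D dims and add; returns (T, D).
    """
    a = audio_feats @ W_a                 # (T, D)
    s = ref_skeleton.reshape(-1) @ W_s    # (D,), broadcast over frames below
    return np.tanh(a + s)

def q_sample(x0, t, alpha_bar, noise):
    """Standard DDPM forward step: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def denoise(x_t, cond, W_x, W_c, W_out):
    """Toy conditional denoiser: predicts the noise from x_t and the fused condition."""
    h = np.tanh(x_t.reshape(x_t.shape[0], -1) @ W_x + cond @ W_c)  # (T, D)
    return (h @ W_out).reshape(x_t.shape)

# Random stand-ins for learned weights and real data (assumptions).
W_a = rng.normal(scale=0.1, size=(A, D))
W_s = rng.normal(scale=0.1, size=(J * 2, D))
W_x = rng.normal(scale=0.1, size=(J * 2, D))
W_c = rng.normal(scale=0.1, size=(D, D))
W_out = rng.normal(scale=0.1, size=(D, J * 2))

audio_feats = rng.normal(size=(T, A))    # placeholder audio-segment features
ref_skeleton = rng.normal(size=(J, 2))   # placeholder skeleton from the reference image
x0 = rng.normal(size=(T, J, 2))          # placeholder ground-truth skeleton motion

# Assumed linear beta schedule with 100 steps.
betas = np.linspace(1e-4, 0.02, 100)
alpha_bar = np.cumprod(1.0 - betas)

cond = fuse(audio_feats, ref_skeleton, W_a, W_s)
noise = rng.normal(size=x0.shape)
x_t = q_sample(x0, 50, alpha_bar, noise)
eps_hat = denoise(x_t, cond, W_x, W_c, W_out)
loss = float(np.mean((eps_hat - noise) ** 2))  # noise-prediction training loss
```

In a two-stage pipeline of the kind the abstract outlines, the skeleton sequence recovered by such a denoiser would then be passed to a separate human video generation model; that stage is not sketched here.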

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face recognition and analysis