CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild
Xingqun Qi, Hengyuan Zhang, Yatian Wang, Jiahao Pan, Chen Liu, Peng Li, Xiaowei Chi, Mengfei Li, Wei Xue, Shanghang Zhang, Wenhan Luo, Qifeng Liu, Yike Guo

TL;DR
CoCoGesture introduces a large-scale dataset and a diffusion-based framework with a novel training paradigm for generating diverse, coherent 3D gestures from unseen speech inputs, significantly improving zero-shot speech-to-gesture synthesis.
Contribution
The paper presents GES-X, a new large-scale co-speech 3D gesture dataset, and a diffusion model that is pretrained unconditionally and then fine-tuned with an audio ControlNet and a Mixture-of-Gesture-Experts for improved gesture generation.
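The ControlNet-style fine-tuning mentioned above can be illustrated with a minimal sketch. This is not the paper's architecture, which this summary does not specify. It assumes a frozen pretrained denoiser plus a trainable audio branch whose output passes through a zero-initialized projection (the original ControlNet uses zero-initialized convolutions; a plain linear layer stands in here). All class and function names are hypothetical, and the Mixture-of-Gesture-Experts component is not modeled.

```python
import random


def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class FrozenBackbone:
    """Hypothetical stand-in for a pretrained unconditional gesture denoiser."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        self.b = [0.0] * dim

    def __call__(self, noisy_pose):
        return linear(self.w, self.b, noisy_pose)


class AudioControlBranch:
    """Hypothetical trainable branch mapping audio features to a residual."""

    def __init__(self, audio_dim, dim, seed=1):
        rng = random.Random(seed)
        self.enc_w = [[rng.uniform(-0.5, 0.5) for _ in range(audio_dim)]
                      for _ in range(dim)]
        self.enc_b = [0.0] * dim
        # Zero-initialized projection: the branch contributes nothing
        # until fine-tuning moves these weights away from zero.
        self.zero_w = [[0.0] * dim for _ in range(dim)]
        self.zero_b = [0.0] * dim

    def encode(self, audio_feat):
        return linear(self.enc_w, self.enc_b, audio_feat)

    def __call__(self, audio_feat):
        return linear(self.zero_w, self.zero_b, self.encode(audio_feat))


def conditioned_denoise(backbone, branch, noisy_pose, audio_feat):
    """Add the audio-conditioned residual to the frozen backbone's output."""
    base = backbone(noisy_pose)
    residual = branch(audio_feat)
    return [b + r for b, r in zip(base, residual)]


backbone = FrozenBackbone(dim=4)
branch = AudioControlBranch(audio_dim=3, dim=4)
pose = [0.1, -0.2, 0.3, 0.0]
audio = [1.0, 0.5, -1.0]

# At initialization the conditioned output equals the pretrained output.
out_init = conditioned_denoise(backbone, branch, pose, audio)

# Simulate one fine-tuning update by opening a single projection weight.
branch.zero_w[0][0] = 1.0
out_tuned = conditioned_denoise(backbone, branch, pose, audio)
```

The point of the zero initialization is that fine-tuning starts exactly from the pretrained model's behavior, so the audio condition is introduced gradually rather than disrupting what was learned during pretraining.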
Findings
Outperforms state-of-the-art zero-shot speech-to-gesture methods
Constructs the GES-X dataset, with more than 40M meshed posture instances across 4.3K speakers
Demonstrates vivid and diverse gesture synthesis from unseen speech prompts
Abstract
Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upon the custom-designed pretrain-finetune training paradigm. At the pretraining stage, we aim to formulate a large generalizable gesture diffusion model by learning the abundant postures manifold. Therefore, to alleviate the scarcity of 3D data, we first construct a large-scale co-speech 3D gesture dataset containing more than 40M meshed posture instances across 4.3K speakers, dubbed GES-X. Then, we scale up the large unconditional diffusion model to 1B parameters and pre-train it to be our…
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Motion and Animation · Robotics and Automated Systems
Methods: Diffusion
