Learning Speech-driven 3D Conversational Gestures from Video
Ikhsanul Habibie, Weipeng Xu, Dushyant Mehta, Lingjie Liu, Hans-Peter Seidel, Gerard Pons-Moll, Mohamed Elgharib, Christian Theobalt

TL;DR
This paper introduces a novel method for automatically generating synchronized 3D body, hand, face, and head gestures for virtual characters directly from speech input, using a CNN and GAN architecture trained on a large in-the-wild video dataset.
Contribution
It presents the first joint synthesis approach for full 3D conversational gestures from speech, leveraging a new large-scale annotated dataset and advanced deep learning models.
Findings
Achieved state-of-the-art quality in 3D conversational gesture synthesis.
Created a corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people.
Demonstrated the effectiveness of CNN and GAN architectures for synchronized gesture generation.
Abstract
We propose the first approach to automatically and jointly synthesize both the synchronous 3D conversational body and hand gestures, as well as 3D face and head animations, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem since many similar gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people. To this end, we apply…
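The adversarial part of the abstract, a model that scores how plausible a 3D body motion sequence is when paired with the input audio features, might be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' network: the feature sizes (26 audio channels, 30 pose channels), the single temporal convolution layer, the hidden width, and the standard non-saturating GAN losses are all assumptions chosen for the example, and the names `conv1d`, `discriminator_score`, `init_params`, and `gan_losses` are hypothetical.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1D cross-correlation over time (what deep learning calls convolution).
    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,) -> (T - K + 1, C_out)."""
    k = w.shape[0]
    t_out = x.shape[0] - k + 1
    # Stack the K shifted copies so each output frame sees a K-frame window.
    windows = np.stack([x[i:i + t_out] for i in range(k)], axis=1)  # (t_out, K, C_in)
    return np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b

def init_params(c_audio, c_pose, hidden=16, kernel=5, seed=0):
    """Random weights; all sizes here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    c_in = c_audio + c_pose
    return {
        "w1": rng.normal(0.0, 0.1, (kernel, c_in, hidden)),
        "b1": np.zeros(hidden),
        "w_out": rng.normal(0.0, 0.1, hidden),
        "b_out": 0.0,
    }

def discriminator_score(audio_feats, poses, params):
    """Return a plausibility score in (0, 1) for a pose sequence given audio.
    audio_feats: (T, c_audio), poses: (T, c_pose)."""
    # Condition on the audio by concatenating it with the motion per frame.
    x = np.concatenate([audio_feats, poses], axis=1)
    h = conv1d(x, params["w1"], params["b1"])
    h = np.where(h > 0, h, 0.2 * h)            # leaky ReLU
    pooled = h.mean(axis=0)                     # average over time
    logit = pooled @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid

def gan_losses(score_real, score_fake, eps=1e-8):
    """Standard discriminator loss and non-saturating generator loss."""
    d_loss = -(np.log(score_real + eps) + np.log(1.0 - score_fake + eps))
    g_loss = -np.log(score_fake + eps)
    return d_loss, g_loss

# Usage: score a ground-truth and a generated 64-frame sequence for the same audio.
rng = np.random.default_rng(1)
params = init_params(c_audio=26, c_pose=30)
audio = rng.normal(size=(64, 26))
real_poses = rng.normal(size=(64, 30))
fake_poses = rng.normal(size=(64, 30))
d_loss, g_loss = gan_losses(
    discriminator_score(audio, real_poses, params),
    discriminator_score(audio, fake_poses, params),
)
```

Because the discriminator sees the audio and the motion together, it can penalize a gesture sequence that looks natural in isolation but does not fit the speech, which is one way to handle the one-to-many nature of the problem that the abstract describes.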
