Investigating the impact of 2D gesture representation on co-speech gesture generation
Teo Guichoux, Laure Soulier, Nicolas Obin, Catherine Pelachaud

TL;DR
This paper examines how the dimensionality of gesture data (2D vs. 3D) affects the quality of generated co-speech gestures in deep learning models, highlighting the importance of data representation choices.
Contribution
It provides an empirical comparison of 2D and 3D gesture representations in deep generative models for co-speech gesture synthesis.
Findings
2D gesture data can be effectively used for gesture generation.
Lifting 2D gestures to 3D impacts the naturalness of generated gestures.
Direct 3D gesture generation may outperform lifted 2D approaches.
Abstract
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECAs). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on…
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Human Motion and Animation
