BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer
Kunkun Pang, Dafei Qin, Yingruo Fan, Julian Habekost, Takaaki Shiratori, Junichi Yamagishi, Taku Komura

TL;DR
This paper introduces BodyFormer, a transformer-based model that synthesizes diverse and realistic 3D body gestures from speech by modeling stochasticity and motion speed variations, even with limited data.
Contribution
The paper presents a variational transformer with mode positional embedding and intra-modal pre-training for speech-to-gesture synthesis, addressing data scarcity and gesture diversity.
Findings
Produces more realistic and appropriate gestures than state-of-the-art methods.
Effectively models stochastic gesture variations during speech.
Copes with limited training data through an intra-modal pre-training scheme.
Abstract
Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games, and the Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of a rich cross-modal dataset that is needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To learn the stochastic nature of the body gesture during speech, we propose a variational transformer to effectively model a probabilistic distribution over gestures, which can produce diverse gestures during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds in different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can…
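As a rough illustration of how a variational model can produce diverse outputs at inference time, the sketch below shows the standard Gaussian reparameterization step and the matching KL term used in variational autoencoders generally. It is not taken from the paper: the function names, the three-dimensional latent, and the example values are all assumptions made for illustration, and BodyFormer's actual architecture and loss may differ.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension.

    mu and log_var are hypothetical encoder outputs (lists of floats).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Hypothetical latent statistics for one speech segment (assumed values).
rng = random.Random(0)
mu = [0.2, -0.1, 0.0]
log_var = [-1.0, -1.0, -1.0]

# Two draws from the same distribution give two different latent codes,
# which a decoder could turn into two different gesture sequences.
z_a = sample_latent(mu, log_var, rng)
z_b = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Sampling a fresh latent code for the same speech input is what lets a model of this kind return a different plausible gesture each time, rather than a single averaged motion.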
