BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer

Kunkun Pang; Dafei Qin; Yingruo Fan; Julian Habekost; Takaaki Shiratori; Junichi Yamagishi; Taku Komura

arXiv:2310.06851·cs.CV·October 12, 2023

TL;DR

This paper introduces BodyFormer, a transformer-based model that synthesizes diverse and realistic 3D body gestures from speech by modeling stochasticity and motion speed variations, even with limited data.

Contribution

The paper presents a variational transformer with mode positional embedding and intra-modal pre-training for speech-to-gesture synthesis, addressing data scarcity and gesture diversity.
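To make the "variational" part concrete, the sketch below shows the generic reparameterization trick and the KL term used by variational models in general. It is not taken from the paper: the function names, the latent size, and the standard-normal prior are assumptions for illustration, and the actual BodyFormer architecture and loss may differ.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Hypothetical encoder outputs for a 3-dimensional latent (assumed values).
rng = random.Random(0)
mu = [0.2, -0.1, 0.0]
log_var = [0.0, -1.0, 0.5]

# Two draws from the same distribution give two different latents, which a
# decoder could turn into two different gestures for the same speech input.
z1 = sample_latent(mu, log_var, rng)
z2 = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Sampling a fresh latent at inference time is what lets a model of this kind return varied outputs for one input, which is the behaviour the summary calls gesture diversity.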

Findings

01 Produces more realistic and appropriate gestures than state-of-the-art methods.

02 Effectively models stochastic gesture variations during speech.

03 Handles limited training data through an intra-modal pre-training scheme.

Abstract

Automatic gesture synthesis from speech is a topic that has attracted researchers for applications in remote communication, video games and Metaverse. Learning the mapping between speech and 3D full-body gestures is difficult due to the stochastic nature of the problem and the lack of a rich cross-modal dataset that is needed for training. In this paper, we propose a novel transformer-based framework for automatic 3D body gesture synthesis from speech. To learn the stochastic nature of the body gesture during speech, we propose a variational transformer to effectively model a probabilistic distribution over gestures, which can produce diverse gestures during inference. Furthermore, we introduce a mode positional embedding layer to capture the different motion speeds in different speaking modes. To cope with the scarcity of data, we design an intra-modal pre-training scheme that can…
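One plausible reading of a mode positional embedding is a standard sinusoidal position encoding with a per-mode vector added to each frame. The sketch below illustrates that reading only. The mode names ("speaking", "silent"), the lookup table values, and the additive combination are all assumptions, and the layer described in the paper may be built differently.

```python
import math


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal positional encoding for a single time step."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def mode_positional_embedding(modes, mode_table, dim):
    """Add a per-mode vector (hypothetical lookup) to each frame's encoding."""
    out = []
    for t, mode in enumerate(modes):
        pos = sinusoidal_encoding(t, dim)
        out.append([p + m for p, m in zip(pos, mode_table[mode])])
    return out


# Hypothetical two-mode table; in a trained model these would be learned.
MODE_TABLE = {
    "speaking": [0.1, 0.1, 0.1, 0.1],
    "silent": [-0.1, -0.1, -0.1, -0.1],
}

frames = ["silent", "speaking", "speaking"]
emb = mode_positional_embedding(frames, MODE_TABLE, dim=4)
```

Under this reading, frames at the same position receive different embeddings depending on the speaking mode, giving the model a signal it can use to adapt motion speed per mode.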
