Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders
Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He,, Linchao Bao

TL;DR
This paper introduces a novel conditional variational autoencoder approach for generating diverse and realistic gestures from speech audio, effectively modeling the one-to-many mapping and enabling user-controlled motion synthesis.
Contribution
The paper proposes a new VAE model with shared and motion-specific latent codes to generate diverse gestures from speech, addressing limitations of previous one-to-one mapping methods.
Findings
Generates more realistic and diverse gestures than state-of-the-art methods
Successfully models one-to-many audio-to-motion mapping
Enables user-controlled motion sequence generation
Abstract
Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHuman Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization
