Audio2Gestures: Generating Diverse Gestures from Audio
Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao, Zhenyu He

TL;DR
This paper introduces a novel method for generating diverse co-speech gestures from audio by explicitly modeling the one-to-many mapping using a split latent code, resulting in more realistic and varied motions.
Contribution
It proposes a new VAE-based framework with shared and motion-specific codes, along with training strategies to improve diversity and realism in gesture generation from audio.
Findings
Generated motions are more diverse and realistic than those produced by previous methods.
The approach is compatible with different backbones, such as RNNs and Transformers.
Structured losses improve motion dynamics and detail.
Abstract
People may perform diverse gestures affected by various mental and physical factors when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, easily resulting in plain/boring motions during inference. So we propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated to the audio while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial…
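The split-latent idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random linear "encoder" and "decoder", and every function name below are hypothetical, and drawing the motion-specific code from a standard normal at inference time is an assumption based on the VAE framing. The sketch only shows why one audio input can yield several distinct motions: the shared code is computed from the audio and stays fixed, while the motion-specific code is resampled for each candidate.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
SHARED_DIM, SPECIFIC_DIM, AUDIO_DIM, POSE_DIM = 4, 3, 6, 5


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: a rows x cols Gaussian matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear_tanh(vec, weights):
    """Apply tanh(vec @ weights) using plain Python lists."""
    n_out = len(weights[0])
    return [
        math.tanh(sum(v * row[j] for v, row in zip(vec, weights)))
        for j in range(n_out)
    ]


def encode_audio(audio_frame, w_audio):
    """Toy audio encoder: maps audio features to the shared code."""
    return linear_tanh(audio_frame, w_audio)


def decode(shared, specific, w_dec):
    """Toy motion decoder: consumes the concatenated [shared, specific] code."""
    return linear_tanh(shared + specific, w_dec)


def generate(audio_frame, w_audio, w_dec, n_samples, seed=0):
    """Return the shared code and n_samples candidate poses for one audio frame."""
    rng = random.Random(seed)
    shared = encode_audio(audio_frame, w_audio)  # deterministic given the audio
    poses = []
    for _ in range(n_samples):
        # Assumption: the motion-specific code is sampled from N(0, I).
        specific = [rng.gauss(0.0, 1.0) for _ in range(SPECIFIC_DIM)]
        poses.append(decode(shared, specific, w_dec))
    return shared, poses


init = random.Random(42)
W_AUDIO = random_matrix(AUDIO_DIM, SHARED_DIM, init)
W_DEC = random_matrix(SHARED_DIM + SPECIFIC_DIM, POSE_DIM, init)

audio = [0.1 * i for i in range(AUDIO_DIM)]
shared_code, candidates = generate(audio, W_AUDIO, W_DEC, n_samples=3)
```

In this sketch, a one-to-one model would correspond to dropping the motion-specific code entirely, so the same audio would always decode to the same pose.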
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization
