Audio2Gestures: Generating Diverse Gestures from Audio

Jing Li; Di Kang; Wenjie Pei; Xuefei Zhe; Ying Zhang; Linchao Bao; Zhenyu He

arXiv:2301.06690 · cs.CV · January 18, 2023

Open Access

TL;DR

This paper introduces a method for generating diverse co-speech gestures from audio. It explicitly models the one-to-many audio-to-motion mapping with a split latent code, which yields more realistic and varied motions.

Contribution

The paper proposes a VAE-based framework that splits the latent code into a shared code and a motion-specific code, along with training strategies that improve the diversity and realism of gestures generated from audio.

Findings

01. Generated motions are more diverse and realistic than those of previous methods.

02. The approach is compatible with different backbones, such as RNNs and Transformers.

03. Structured losses improve motion dynamics and detail.

Abstract

People may perform diverse gestures, affected by various mental and physical factors, when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping, and thus tend to predict the average of all possible target motions, which easily results in plain, boring motions during inference. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial…
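The split latent code described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToySplitLatentModel`, the function `generate`, the dimensions, the random linear maps, and the choice to sample the motion-specific code from a standard normal at inference are all assumptions made for illustration. The sketch only shows the structural idea that the same audio yields one shared code, while different motion-specific codes lead to different decoded motions.

```python
import random

# All sizes below are hypothetical and chosen only for illustration.
AUDIO_DIM = 6      # size of a toy audio feature vector
SHARED_DIM = 4     # size of the shared (audio-correlated) code
SPECIFIC_DIM = 4   # size of the motion-specific code
MOTION_DIM = 3     # size of a toy motion (pose) vector


def make_matrix(rows, cols, rng):
    """Build a random weight matrix standing in for a trained network."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class ToySplitLatentModel:
    """Toy stand-in for an audio encoder plus a motion decoder."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.audio_encoder = make_matrix(SHARED_DIM, AUDIO_DIM, rng)
        self.decoder = make_matrix(MOTION_DIM, SHARED_DIM + SPECIFIC_DIM, rng)

    def encode_audio(self, audio_features):
        # The shared code depends only on the audio input.
        return matvec(self.audio_encoder, audio_features)

    def sample_motion_specific(self, rng):
        # Assumption: at inference the motion-specific code is drawn
        # from a standard normal, as is common for VAE priors.
        return [rng.gauss(0.0, 1.0) for _ in range(SPECIFIC_DIM)]

    def decode(self, shared_code, specific_code):
        # The decoder sees both codes concatenated.
        return matvec(self.decoder, shared_code + specific_code)


def generate(model, audio_features, seed):
    """Decode one motion vector for the given audio and sampling seed."""
    rng = random.Random(seed)
    shared = model.encode_audio(audio_features)
    specific = model.sample_motion_specific(rng)
    return model.decode(shared, specific)


model = ToySplitLatentModel()
audio = [0.1] * AUDIO_DIM
motion_a = generate(model, audio, seed=1)
motion_b = generate(model, audio, seed=2)
```

Here `motion_a` and `motion_b` come from the same audio and the same shared code but differ because their motion-specific codes differ, which is the one-to-many behavior the abstract describes. A deterministic one-to-one model would return a single output for that audio.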

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization