Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Jing Li; Di Kang; Wenjie Pei; Xuefei Zhe; Ying Zhang; Zhenyu He; Linchao Bao

arXiv:2108.06720·cs.CV·August 17, 2021·6 cites

Open Access

TL;DR

This paper introduces a novel conditional variational autoencoder approach for generating diverse and realistic gestures from speech audio, effectively modeling the one-to-many mapping and enabling user-controlled motion synthesis.

Contribution

The paper proposes a new VAE model with shared and motion-specific latent codes to generate diverse gestures from speech, addressing limitations of previous one-to-one mapping methods.

Findings

01. Generates more realistic and diverse gestures than state-of-the-art methods
02. Successfully models one-to-many audio-to-motion mapping
03. Enables user-controlled motion sequence generation

Abstract

Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code. The shared code mainly models the strong correlation between audio and motion (such as the synchronized audio and motion beats), while the motion-specific code captures diverse motion information independent of the audio. However, splitting the latent code into two parts poses training difficulties for the VAE model. A mapping network facilitating…
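As a rough illustration of the shared/motion-specific split described in the abstract, the sketch below holds one shared code fixed (standing in for what an audio encoder might produce) and resamples only the motion-specific code, so a single audio input yields several different motions. This is not the paper's model: the toy linear decoder, the dimensions, and the names `decode` and `sample_gestures` are all assumptions made for illustration.

```python
import random

def decode(latent, weights):
    """Toy linear decoder (an assumption): motion = weights @ latent."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]

def sample_gestures(shared_code, specific_dim, weights, num_samples, rng):
    """Decode several motions from one shared code.

    Only the motion-specific code is resampled from a standard normal,
    so every output is conditioned on the same (hypothetical) audio features.
    """
    motions = []
    for _ in range(num_samples):
        specific_code = [rng.gauss(0.0, 1.0) for _ in range(specific_dim)]
        latent = shared_code + specific_code  # concatenate the two code parts
        motions.append(decode(latent, weights))
    return motions

rng = random.Random(0)
shared_dim, specific_dim, motion_dim = 4, 3, 6  # illustrative sizes only

# Stand-in for the shared code an audio encoder would produce.
shared_code = [rng.gauss(0.0, 1.0) for _ in range(shared_dim)]

# Random decoder weights; a real model would learn these.
weights = [
    [rng.gauss(0.0, 1.0) for _ in range(shared_dim + specific_dim)]
    for _ in range(motion_dim)
]

motions = sample_gestures(shared_code, specific_dim, weights, 5, rng)
```

Fixing the motion-specific code instead of resampling it would give a repeatable motion for the same input, which is one way to read the user-controlled generation mentioned above.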

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Video Analysis and Summarization