EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, Michael J. Black

TL;DR
EMAGE is a novel framework that generates realistic, full-body co-speech gestures from audio, utilizing a new dataset and masked gesture modeling to improve gesture synthesis quality and diversity.
Contribution
The paper introduces BEAT2, a comprehensive 3D gesture dataset, and a Masked Audio Gesture Transformer for improved holistic gesture generation from speech.
Findings
State-of-the-art gesture generation performance
Effective integration of masked gesture priors
Flexible generation with predefined spatial-temporal inputs
Abstract
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively…
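The masked gesture reconstruction objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the zero-fill stand-in for a mask token, the 50% mask ratio, and the 165-dimensional pose vector are all assumptions chosen for illustration. It only shows the general idea of hiding some gesture frames and scoring a reconstruction on the hidden frames alone.

```python
import numpy as np


def mask_gesture_frames(gestures, mask_ratio, rng):
    """Hide a random subset of frames; return (masked copy, boolean frame mask).

    Hypothetical helper: zero-fill stands in for a learned mask token.
    """
    num_frames = gestures.shape[0]
    num_masked = int(round(mask_ratio * num_frames))
    masked_idx = rng.choice(num_frames, size=num_masked, replace=False)
    mask = np.zeros(num_frames, dtype=bool)
    mask[masked_idx] = True
    masked = gestures.copy()
    masked[mask] = 0.0
    return masked, mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the masked frames only."""
    if not mask.any():
        return 0.0
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
# 64 frames of an illustrative 165-dimensional pose vector (assumed layout).
gestures = rng.normal(size=(64, 165))
masked, mask = mask_gesture_frames(gestures, mask_ratio=0.5, rng=rng)

# A model would predict the hidden frames from audio plus the visible frames;
# here the zero-filled input is scored as a trivial baseline prediction.
baseline_loss = masked_reconstruction_loss(masked, gestures, mask)
```

In a joint training setup like the one the abstract describes, a loss of this kind would be combined with an audio-to-gesture loss; how the two are weighted is not stated in the text above.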
Taxonomy
Topics: Speech and dialogue systems · Hand Gesture Recognition Systems · Subtitles and Audiovisual Media
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Label Smoothing · Adam · Dropout · Absolute Position Encodings · Layer Normalization
