EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Haiyang Liu; Zihao Zhu; Giorgio Becherini; Yichen Peng; Mingyang Su; You Zhou; Xuefei Zhe; Naoya Iwamoto; Bo Zheng; Michael J. Black

arXiv:2401.00374·cs.CV·April 2, 2024·6 cites

TL;DR

EMAGE is a novel framework that generates realistic, full-body co-speech gestures from audio, utilizing a new dataset and masked gesture modeling to improve gesture synthesis quality and diversity.

Contribution

The paper introduces BEAT2, a mesh-level holistic 3D co-speech gesture dataset, and a Masked Audio Gesture Transformer that jointly trains audio-to-gesture generation and masked gesture reconstruction.

Findings

01 State-of-the-art gesture generation performance
02 Effective integration of masked gesture priors
03 Flexible generation with predefined spatial-temporal inputs

Abstract

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively…
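
The masked gesture reconstruction objective mentioned in the abstract can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the authors' implementation: the function names `mask_gesture_frames` and `masked_reconstruction_loss`, the frame-level random masking, the zero-fill stand-in for a mask token, and the mean-squared-error loss are all hypothetical choices made for illustration. The abstract does not specify the masking scheme or the loss.

```python
import random


def mask_gesture_frames(frames, mask_ratio=0.5, seed=0):
    """Hide a random subset of frames (assumed scheme, not from the paper).

    frames: list of per-frame pose vectors (lists of floats).
    Returns (masked_frames, mask) where mask[i] is True for hidden frames.
    """
    rng = random.Random(seed)
    num_masked = round(mask_ratio * len(frames))
    masked_idx = set(rng.sample(range(len(frames)), num_masked))
    mask = [i in masked_idx for i in range(len(frames))]
    # Zero-filling stands in for a learned mask token (an assumption).
    masked = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    return masked, mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the hidden frames only."""
    errors = [
        (p - t) ** 2
        for pred_frame, target_frame, m in zip(pred, target, mask)
        if m
        for p, t in zip(pred_frame, target_frame)
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Toy usage: 8 frames with 3 pose values each, half of them hidden.
frames = [[1.0, 1.0, 1.0] for _ in range(8)]
masked, mask = mask_gesture_frames(frames, mask_ratio=0.5)
# Scoring the untouched masked input gives the loss a model would start from.
loss = masked_reconstruction_loss(masked, frames, mask)
```

In a joint training setup of the kind the abstract describes, a loss like this would be combined with an audio-to-gesture generation loss; how the two are weighted is not stated in this summary.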

Code & Models

Repositories

PantoMatrix/PantoMatrix
PyTorch · Official

Models

🤗 camenduru/EMAGE · model

Taxonomy

Topics: Speech and dialogue systems · Hand Gesture Recognition Systems · Subtitles and Audiovisual Media

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Softmax · Label Smoothing · Adam · Dropout · Absolute Position Encodings · Layer Normalization