BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis
Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, Bo Zheng

TL;DR
This paper introduces BEAT, a large-scale multi-modal dataset with 76 hours of conversational gesture data from 30 speakers, covering eight emotions and four languages, and proposes a baseline model for gesture synthesis.
Contribution
The creation of BEAT, the largest multi-modal dataset for conversational gestures, and the development of a cascaded model for gesture synthesis conditioned on multiple modalities.
Findings
The BEAT dataset contains 76 hours of multi-modal data from 30 speakers.
The proposed CaMN model achieves state-of-the-art performance in gesture synthesis.
The SRGR metric effectively evaluates the semantic relevance of generated gestures.
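This page does not define SRGR, so the following is only a rough sketch of how a semantic-relevance-weighted metric of this kind could be computed. It assumes SRGR weights a PCK-style term (the fraction of joints within a distance threshold of the ground truth) by per-frame semantic relevance scores. The function name `srgr_sketch`, the threshold value, and the normalization by total weight are illustrative assumptions, not the authors' implementation.

```python
import math


def srgr_sketch(pred, target, semantic_scores, threshold=0.1):
    """Semantic-weighted fraction of joints within `threshold` of ground truth.

    pred, target: lists of frames; each frame is a non-empty list of (x, y, z) joints.
    semantic_scores: one non-negative relevance weight per frame.
    NOTE: threshold and normalization are assumptions made for illustration.
    """
    if not (len(pred) == len(target) == len(semantic_scores)):
        raise ValueError("pred, target and semantic_scores need the same frame count")

    weighted_hits = 0.0
    total_weight = 0.0
    for frame_pred, frame_gt, weight in zip(pred, target, semantic_scores):
        if not frame_gt or len(frame_pred) != len(frame_gt):
            raise ValueError("each frame needs the same non-zero number of joints")
        # PCK-style term: count joints whose Euclidean error is below the threshold.
        hits = sum(
            1 for p, g in zip(frame_pred, frame_gt) if math.dist(p, g) < threshold
        )
        weighted_hits += weight * hits / len(frame_gt)
        total_weight += weight

    # Normalize so the score lies in [0, 1]; return 0.0 if no frame carries weight.
    return weighted_hits / total_weight if total_weight > 0 else 0.0


# Toy usage: frame 0 is accurate but weakly semantic, frame 1 is wrong but highly semantic.
pred = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)]]
target = [[(0.0, 0.0, 0.05)], [(1.0, 0.0, 0.0)]]
score = srgr_sketch(pred, target, semantic_scores=[1.0, 3.0])
```

In this toy case the error on the highly weighted frame dominates, which is the intended effect of such a weighting: mistakes on semantically important frames cost more than mistakes on filler motion.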
Abstract
Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem due to the lack of available datasets, models and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which consists of the above six modalities modeled in a cascaded architecture for gesture…
Taxonomy
Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Multimodal Machine Learning Applications
