BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

Haiyang Liu; Zihao Zhu; Naoya Iwamoto; Yichen Peng; Zhengqing Li; You Zhou; Elif Bozkurt; Bo Zheng

arXiv:2203.05297 · cs.CV · September 21, 2022 · 1 citation

TL;DR

This paper introduces BEAT, a large-scale multi-modal dataset with 76 hours of data capturing conversational gestures across different emotions and languages, and proposes a baseline model for gesture synthesis.

Contribution

The creation of BEAT, the largest multi-modal dataset for conversational gestures, and the development of a cascaded model for gesture synthesis conditioned on multiple modalities.

Findings

01 The BEAT dataset contains 76 hours of multi-modal data from 30 speakers.

02 The proposed CaMN model achieves state-of-the-art performance in gesture synthesis.

03 The proposed SRGR metric (Semantic Relevance Gesture Recall) effectively measures how semantically relevant generated gestures are.

Abstract

Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem due to the lack of available datasets, models, and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which models the above six modalities in a cascaded architecture for gesture…
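
As a rough consistency check on the figures quoted in the abstract, the sketch below divides the annotation count by the recording duration. It assumes the annotations are spread over the full 76 hours and that "32 million" is a rounded figure. The abstract does not state a capture frame rate, so the result is only an implied rate, and the per-speaker figure is a plain average, not a claim about how recording time was actually distributed. The function names are illustrative and do not come from the paper or its repository.

```python
# Figures quoted in the abstract; "32 million" is treated as a rounded value.
HOURS_OF_DATA = 76
FRAME_LEVEL_ANNOTATIONS = 32_000_000
NUM_SPEAKERS = 30


def implied_annotations_per_second(hours: float, annotations: int) -> float:
    """Annotations divided by total seconds of data (illustrative helper)."""
    return annotations / (hours * 3600)


def mean_hours_per_speaker(hours: float, speakers: int) -> float:
    """Simple average; the real per-speaker split may be uneven."""
    return hours / speakers


rate = implied_annotations_per_second(HOURS_OF_DATA, FRAME_LEVEL_ANNOTATIONS)
avg_hours = mean_hours_per_speaker(HOURS_OF_DATA, NUM_SPEAKERS)

print(f"Implied annotation rate: {rate:.1f} per second")
print(f"Average data per speaker: {avg_hours:.2f} hours")
```

Under these assumptions the implied rate is a little under 120 annotations per second of data, which is plausible for frame-level labels on motion-capture recordings.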

Code & Models

Repositories

PantoMatrix/PantoMatrix
PyTorch · Official

Taxonomy

Topics: Hand Gesture Recognition Systems · Human Pose and Action Recognition · Multimodal Machine Learning Applications