MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
Anna Deichler, Jim O'Regan, Jonas Beskow

TL;DR
This paper introduces MM-Conv, a multimodal dataset of conversations recorded with a VR headset inside the AI2-THOR simulator, intended for studying co-speech gesture generation in referential 3D scenes.
Contribution
The paper presents a new multimodal dataset that pairs conversational recordings with scene context, intended to support co-speech gesture generation in virtual environments.
Findings
The dataset includes motion capture, speech, gaze, and scene graphs.
It is intended to support the development of context-aware gesture generation models.
It aims to improve understanding of multimodal communication in 3D scenes.
Abstract
In this paper, we present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings. Participants engaged in various conversational scenarios, all based on referential communication tasks. The dataset provides a rich set of multimodal recordings such as motion capture, speech, gaze, and scene graphs. This comprehensive dataset aims to enhance the understanding and development of gesture generation models in 3D scenes by providing diverse and contextually rich data.
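To make the combination of gaze and scene-graph data more concrete, the following is a minimal sketch of how one recorded frame might be linked to an object in the scene. Everything here is an assumption for illustration: the `Frame` and `SceneObject` classes, their field names, and the nearest-angle heuristic are hypothetical and are not taken from the MM-Conv release format or from the authors' method.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


# Hypothetical schema: field names are illustrative, not the MM-Conv format.
@dataclass
class SceneObject:
    name: str
    position: Vec3  # world coordinates (units assumed to be metres)


@dataclass
class Frame:
    time_s: float
    head_position: Vec3
    gaze_direction: Vec3  # need not be normalized
    transcript_word: Optional[str] = None  # word spoken at this time, if any


def angle_to_object(frame: Frame, obj: SceneObject) -> float:
    """Angle in radians between the gaze ray and the direction to the object."""
    to_obj = tuple(o - h for o, h in zip(obj.position, frame.head_position))
    dot = sum(g * t for g, t in zip(frame.gaze_direction, to_obj))
    norm = math.hypot(*frame.gaze_direction) * math.hypot(*to_obj)
    if norm == 0.0:
        return math.pi  # degenerate input: treat as maximally misaligned
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def gazed_object(frame: Frame, objects: Sequence[SceneObject]) -> SceneObject:
    """Return the scene object whose direction is closest to the gaze ray."""
    return min(objects, key=lambda obj: angle_to_object(frame, obj))


if __name__ == "__main__":
    frame = Frame(0.0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), "that")
    objects = [
        SceneObject("Mug", (0.0, 0.0, 2.0)),
        SceneObject("Lamp", (2.0, 0.0, 0.0)),
    ]
    print(gazed_object(frame, objects).name)
```

In this toy example the gaze points straight along the z-axis, so the object placed on that axis is selected. A real pipeline would likely also need to handle occlusion, object extent, and pointing gestures, none of which this sketch attempts.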
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems
Methods: Sparse Evolutionary Training
