MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans

Anna Deichler; Jim O'Regan; Jonas Beskow

arXiv:2410.00253 · cs.CV · October 2, 2024

TL;DR

This paper introduces MM-Conv, a multimodal dataset of conversations recorded with a VR headset inside the AI2-THOR physics simulator, intended to support co-speech gesture generation in referential 3D scenes.

Contribution

The paper presents a new multimodal dataset with rich contextual data for advancing co-speech gesture generation in virtual environments.

Findings

01. Dataset includes motion capture, speech, gaze, and scene graphs.

02. Supports development of context-aware gesture generation models.

03. Enhances understanding of multimodal communication in 3D scenes.

Abstract

In this paper, we present a novel dataset captured using a VR headset to record conversations between participants within a physics simulator (AI2-THOR). Our primary objective is to extend the field of co-speech gesture generation by incorporating rich contextual information within referential settings. Participants engaged in various conversational scenarios, all based on referential communication tasks. The dataset provides a rich set of multimodal recordings such as motion capture, speech, gaze, and scene graphs. This comprehensive dataset aims to enhance the understanding and development of gesture generation models in 3D scenes by providing diverse and contextually rich data.
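
To make the combination of modalities concrete, the following is a minimal Python sketch of how one recording with motion capture, speech, gaze, and a scene graph might be represented and queried. It is an illustration only: the abstract does not describe the dataset's file layout, so every class, field, and function name here (SceneGraph, Frame, Utterance, Recording, gaze_targets_during, build_example) is hypothetical, and the sample values are invented for the example rather than taken from MM-Conv.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# NOTE: hypothetical schema for illustration; not the dataset's actual format.
Vec3 = Tuple[float, float, float]


@dataclass
class SceneObject:
    object_id: str
    object_type: str
    position: Vec3


@dataclass
class SceneGraph:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    # Each relation is (subject_id, predicate, object_id).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.object_id] = obj

    def relate(self, subject_id: str, predicate: str, object_id: str) -> None:
        # Reject relations that mention objects missing from the scene.
        for oid in (subject_id, object_id):
            if oid not in self.objects:
                raise KeyError(f"unknown object id: {oid}")
        self.relations.append((subject_id, predicate, object_id))


@dataclass
class Frame:
    time_s: float                      # timestamp in seconds
    joints: Dict[str, Vec3]            # motion-capture joint positions
    gaze_target_id: Optional[str] = None  # object looked at, if any


@dataclass
class Utterance:
    start_s: float
    end_s: float
    text: str


@dataclass
class Recording:
    scene: SceneGraph
    frames: List[Frame]
    utterances: List[Utterance]

    def gaze_targets_during(self, utt: Utterance) -> List[str]:
        """Object ids gazed at while `utt` is spoken, in order of first appearance."""
        seen: List[str] = []
        for frame in self.frames:
            in_window = utt.start_s <= frame.time_s <= utt.end_s
            if in_window and frame.gaze_target_id and frame.gaze_target_id not in seen:
                seen.append(frame.gaze_target_id)
        return seen


def build_example() -> Recording:
    """Build a tiny made-up recording for demonstration purposes."""
    scene = SceneGraph()
    scene.add_object(SceneObject("table_1", "Table", (0.0, 0.0, 1.0)))
    scene.add_object(SceneObject("mug_1", "Mug", (0.1, 0.8, 1.0)))
    scene.relate("mug_1", "on", "table_1")
    frames = [
        Frame(0.0, {"right_wrist": (0.3, 1.0, 0.2)}, None),
        Frame(0.5, {"right_wrist": (0.3, 1.1, 0.4)}, "mug_1"),
        Frame(1.0, {"right_wrist": (0.2, 1.1, 0.6)}, "table_1"),
        Frame(2.5, {"right_wrist": (0.3, 1.0, 0.2)}, "mug_1"),
    ]
    utterances = [Utterance(0.4, 1.2, "Can you pass me that mug on the table?")]
    return Recording(scene, frames, utterances)
```

Aligning gaze and motion frames with utterance time spans, as gaze_targets_during does, is one plausible way to link speech to the objects being referred to; the actual alignment used by the dataset may differ.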

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Speech and dialogue systems

Methods: Sparse Evolutionary Training