Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, Youngjae Yu

TL;DR
This paper introduces MARS, a multimodal language model trained on VENUS, a large-scale dataset of videos with aligned text and nonverbal cues, to enhance AI understanding and generation of nonverbal communication in conversations.
Contribution
The paper presents VENUS, a novel large-scale dataset of annotated videos with nonverbal cues, and MARS, a model that integrates text and nonverbal understanding for conversational AI.
Findings
MARS effectively generates text and nonverbal cues from conversational input.
The VENUS dataset is large-scale and effective for training multimodal models.
MARS demonstrates improved multimodal understanding and generation capabilities.
Abstract
Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS…
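The abstract describes training with a next-token prediction objective over text combined with vector-quantized nonverbal representations. The sketch below illustrates that general idea only, under stated assumptions: the toy codebook, the 2-D "frame" features, the choice to append nonverbal tokens after the text tokens, and the helper names `quantize`, `build_sequence`, and `next_token_pairs` are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def quantize(frame: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector nearest to `frame` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))


def build_sequence(
    text_ids: Sequence[int],
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    text_vocab_size: int,
) -> List[int]:
    """Place text tokens and quantized nonverbal codes in one shared ID space.

    Assumption: nonverbal code k is mapped to token ID text_vocab_size + k,
    and nonverbal tokens simply follow the text tokens.
    """
    nonverbal_ids = [text_vocab_size + quantize(f, codebook) for f in frames]
    return list(text_ids) + nonverbal_ids


def next_token_pairs(seq: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Shift by one: the model reads seq[:-1] and is trained to predict seq[1:]."""
    return list(seq[:-1]), list(seq[1:])


# Toy data (hypothetical): 3 codebook entries, 2 nonverbal frames, 3 text tokens.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2]]
text_ids = [5, 7, 2]

seq = build_sequence(text_ids, frames, codebook, text_vocab_size=10)
inputs, targets = next_token_pairs(seq)
print(seq)      # [5, 7, 2, 10, 11]
print(targets)  # [7, 2, 10, 11]
```

Because both modalities share one token sequence, a single cross-entropy loss over `targets` would cover text and nonverbal tokens alike; how the actual model orders or interleaves the two streams is not stated in the text above.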
Taxonomy
Topics: Language, Metaphor, and Cognition · Speech and dialogue systems · Language, Discourse, Communication Strategies
