TL;DR
This paper introduces the MM-Dia dataset and an accompanying benchmark for controllable, expressive dialogue generation across speech, vision, and text, and highlights the limitations of current models and the challenges that remain.
Contribution
It presents a novel dataset and benchmark for multimodal dialogue generation, enabling explicit style control and evaluation of cross-modal consistency in human-like interactions.
Findings
Training on MM-Dia improves controllability of dialogue generation.
Current models struggle to replicate nuanced human expressiveness.
The dataset enables evaluation of style consistency across modalities.
Abstract
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations of interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates…