From Natural Alignment to Conditional Controllability in Multimodal Dialogue

Zeyu Jin; Songtao Zhou; Haoyu Wang; Minghao Tian; Kaifeng Yun; Zhuo Chen; Xiaoyu Qin; Jia Jia

arXiv:2603.29162 · cs.MM · May 12, 2026

TL;DR

This paper introduces MM-Dia, a multimodal dialogue dataset (360+ hours, 54,700 dialogues) curated from movies and TV series, together with a benchmark for controllable, expressive dialogue generation across speech, vision, and text. It also highlights the limitations of current models and the challenges that remain.

Contribution

The paper presents a new dataset and benchmark for multimodal dialogue generation that support explicit style control and the evaluation of cross-modal consistency in human-like interaction.

Findings

1. Training on MM-Dia improves controllability of dialogue generation.
2. Current models struggle to replicate nuanced human expressiveness.
3. The dataset enables evaluation of style consistency across modalities.

Abstract

The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations in interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates…
