MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Youxin Pang; Jiajun Liu; Lingfeng Tan; Yong Zhang; Feng Gao; Xiang Deng; Zhuoliang Kang; Xiaoming Wei; Yebin Liu

arXiv:2512.03034·cs.CV·March 10, 2026

Open Access

TL;DR

MAViD is a multimodal framework that combines audio-visual understanding and generation in a single dialogue system. It splits the work between a Conductor, which interprets the query and issues separate motion and speech instructions, and a Creator, which produces the interactive response from those instructions.

Contribution

The paper presents a Conductor-Creator architecture with novel fusion and dual generative models, advancing multimodal dialogue understanding and generation capabilities.

Findings

01. Generated long, coherent audio-visual dialogues.

02. Achieved accurate interpretation of multimodal queries.

03. Enhanced multimodal fusion for seamless interactions.

Abstract

We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent…
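
The division of labor described in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: the names `Instruction`, `Conductor.plan`, `Creator.render`, and `respond` are hypothetical, and both components are stubs that pass strings around instead of running any model. It shows only the data flow the abstract states, in which the Conductor turns a query into separate motion and speech instructions and the Creator produces a response from them.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    """Hypothetical container for the Conductor's output."""
    motion: str  # what the speaker should do on screen
    speech: str  # what the speaker should say


class Conductor:
    """Stand-in for the understanding and reasoning component."""

    def plan(self, query: str) -> Instruction:
        # A real Conductor would interpret an audio-visual query; this stub
        # only shows that one query yields separate motion and speech parts.
        return Instruction(
            motion=f"react to: {query}",
            speech=f"reply to: {query}",
        )


class Creator:
    """Stand-in for the generation component."""

    def render(self, instruction: Instruction) -> dict:
        # A real Creator would synthesize video and audio; this stub returns
        # labelled placeholders so the data flow stays visible.
        return {
            "video": f"<video conditioned on '{instruction.motion}'>",
            "audio": f"<audio conditioned on '{instruction.speech}'>",
        }


def respond(query: str) -> dict:
    """Run one dialogue turn: plan with the Conductor, render with the Creator."""
    instruction = Conductor().plan(query)
    return Creator().render(instruction)
```

Keeping motion and speech as separate fields mirrors the fine-grained control the abstract mentions: under this assumed interface, either part could be inspected or replaced before the Creator is called.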

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization