MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

Liyang Li; Wen Wang; Canyu Zhao; Tianjian Feng; Zhiyue Zhao; Hao Chen; and Chunhua Shen

arXiv:2604.19679·cs.CV·April 23, 2026

TL;DR

MMControl introduces a unified framework for multi-modal control in joint audio-video generation, enabling fine-grained, composable control over multiple conditions including visual and acoustic signals.

Contribution

MMControl presents a dual-stream conditional injection mechanism and modality-specific guidance scaling, which together improve multi-modal controllability in diffusion-based joint audio-video generation.
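
To make the idea of modality-specific guidance scaling concrete, here is a minimal sketch. It assumes a common composable form of classifier-free guidance, in which each condition contributes its own scaled offset from the unconditional prediction. The summary above does not give the paper's exact formulation, so the function name `combine_guidance`, the condition names, and the additive rule are all illustrative assumptions.

```python
from typing import Dict, List


def combine_guidance(
    uncond: List[float],
    cond: Dict[str, List[float]],
    scales: Dict[str, float],
) -> List[float]:
    """Combine per-condition predictions with one guidance scale per condition.

    Assumed rule (not taken from the paper):
        out = uncond + sum_k scales[k] * (cond[k] - uncond)
    A condition with no entry in `scales` gets scale 0.0 and has no effect.
    """
    out = list(uncond)
    for name, pred in cond.items():
        if len(pred) != len(uncond):
            raise ValueError(f"shape mismatch for condition {name!r}")
        scale = scales.get(name, 0.0)
        for i, (p, u) in enumerate(zip(pred, uncond)):
            out[i] += scale * (p - u)
    return out


# Toy predictions: "visual" and "audio" are hypothetical condition names.
uncond = [0.0, 1.0]
cond = {"visual": [1.0, 1.0], "audio": [0.0, 3.0]}
guided = combine_guidance(uncond, cond, {"visual": 2.0, "audio": 0.5})
```

Under this assumed rule, changing one scale leaves the other conditions' contributions untouched, which is one plausible way to adjust control signals independently at inference time.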

Findings

01

Achieves identity consistency in generated video and timbre consistency in generated audio.

02

Enables independent, dynamic adjustment of control signals during inference.

03

Demonstrates fine-grained, composable control over multiple generation aspects.

Abstract

Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control, which limits comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism that incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent…
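
The bypass-branch injection described in the abstract can be pictured with a small sketch. It assumes that each control signal is encoded by a side branch into a residual, and that the residual is added to the hidden state of either the video stream or the audio stream. The routing table, the function name `inject_bypass`, and the per-condition strengths are hypothetical; the abstract does not specify how the branches connect to the backbone.

```python
from typing import Dict, List, Tuple

# Assumed routing of each control signal to a stream (hypothetical, not from the paper).
ROUTING = {
    "reference_image": "video",
    "depth": "video",
    "pose": "video",
    "reference_audio": "audio",
}


def inject_bypass(
    video_hidden: List[float],
    audio_hidden: List[float],
    residuals: Dict[str, List[float]],
    strengths: Dict[str, float],
) -> Tuple[List[float], List[float]]:
    """Add each bypass-branch residual to the stream it is routed to.

    `residuals` maps a condition name to the output of its (not modeled) branch.
    `strengths` maps a condition name to a weight; missing entries default to 1.0.
    """
    streams = {"video": list(video_hidden), "audio": list(audio_hidden)}
    for name, residual in residuals.items():
        target = streams[ROUTING[name]]
        if len(residual) != len(target):
            raise ValueError(f"shape mismatch for condition {name!r}")
        weight = strengths.get(name, 1.0)
        for i, r in enumerate(residual):
            target[i] += weight * r
    return streams["video"], streams["audio"]


# Toy hidden states and residuals standing in for real token features.
video_out, audio_out = inject_bypass(
    video_hidden=[1.0, 1.0],
    audio_hidden=[0.0],
    residuals={"depth": [0.5, 0.5], "reference_audio": [2.0]},
    strengths={"depth": 2.0},
)
```

Because the residuals are additive in this sketch, omitting a condition leaves the backbone's hidden states unchanged, which is the usual motivation for a bypass design.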
