3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation

Yaoru Li; Heyu Si; Federico Landi; Pilar Oplustil Gallegos; Ioannis Koutsoumpas; O. Ricardo Cortez Vazquez; Ruiju Fu; Qi Guo; Xin Jin; Shunyu Liu; Mingli Song

arXiv:2511.21780 · cs.MM · December 1, 2025

TL;DR

This paper introduces 3MDiT, a unified tri-modal diffusion transformer that jointly models text, audio, and video streams for synchronized audio-video generation, improving quality and alignment.

Contribution

It proposes a novel tri-modal diffusion transformer framework that models audio, video, and text as evolving streams with feature fusion and dynamic text conditioning, enabling better synchronization and reuse of T2V models.
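As a rough illustration of what treating audio, video, and text as jointly evolving streams could look like, the following is a minimal NumPy sketch of one fusion step in which all three token sets attend over their concatenation and all three are updated, so the text representation is not held fixed. This is an assumption-laden toy, not the paper's architecture: the function name `joint_attention_step`, the single-head attention, the shared projection matrices, and all shapes are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention_step(video, audio, text, w_q, w_k, w_v):
    """One hypothetical fusion step (not the paper's actual block).

    All three streams attend over the concatenation of all tokens, and
    all three receive a residual update, so the text tokens evolve too.
    """
    tokens = np.concatenate([video, audio, text], axis=0)  # (N, d)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))          # (N, N)
    fused = tokens + attn @ v                                # residual update
    n_v, n_a = video.shape[0], audio.shape[0]
    return fused[:n_v], fused[n_v:n_v + n_a], fused[n_v + n_a:]


# Toy inputs: 8 video tokens, 6 audio tokens, 4 text tokens, width 16.
rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))
audio = rng.normal(size=(6, d))
text = rng.normal(size=(4, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

new_video, new_audio, new_text = joint_attention_step(
    video, audio, text, w_q, w_k, w_v
)
```

In this sketch the text tokens come out of the step changed, which is the contrast with static, single-shot conditioning where the text embedding would be computed once and reused unchanged at every layer.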

Findings

01. High-quality synchronized audio-video generation demonstrated.

02. Improved audio-video synchronization metrics.

03. Flexible training and adaptation regimes achieved.

Abstract

Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are trained under separate objectives. Recent joint audio-video generators alleviate this issue but often rely on dual-tower architectures with ad-hoc cross-modal bridges and static, single-shot text conditioning, making it difficult to both reuse T2V backbones and to reason about how audio, video and language interact over time. To address these challenges, we propose 3MDiT, a unified tri-modal diffusion transformer for text-driven synchronized audio-video generation. Our framework models video, audio and text as jointly evolving streams: an isomorphic audio branch mirrors a T2V…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Video Analysis and Summarization