MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer

Guile Wu; David Huang; Dongfeng Bai; Bingbing Liu

arXiv:2508.14327·cs.CV·March 16, 2026

TL;DR

This paper introduces MoVieDrive, a unified diffusion transformer that generates multi-modal, multi-view urban scene videos for autonomous driving. Where existing approaches primarily generate RGB video, it targets additional modalities such as depth maps and semantic maps within a single controllable model.

Contribution

The paper presents a multi-modal multi-view diffusion transformer, composed of modal-shared and modal-specific components, that generates several modalities within a single controllable framework for urban scene synthesis.

Findings

01 Achieves high-quality multi-modal video generation

02 Supports controllable scene structure and content

03 Outperforms state-of-the-art methods in experiments

Abstract

Urban scene synthesis with video generation models has recently shown great potential for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage…
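
The split between modal-shared and modal-specific components can be pictured with a toy sketch. This is not the authors' architecture: the class name, the layer sizes, the single linear layer standing in for a transformer block, and the choice of three modalities are all assumptions made for illustration. The only idea taken from the abstract is that one shared computation feeds separate per-modality parts.

```python
import random

# Assumed modality set; the abstract names depth and semantic maps as examples.
MODALITIES = ("rgb", "depth", "semantic")


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(tokens, weight):
    """Apply the same linear map to every token (each token is a list of floats)."""
    return [
        [sum(t * w for t, w in zip(token, column)) for column in zip(*weight)]
        for token in tokens
    ]


class SharedPlusSpecific:
    """Hypothetical toy model: one shared layer, then one head per modality."""

    def __init__(self, dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Modal-shared weights, reused for every modality.
        self.shared = make_matrix(dim, dim, rng)
        # Modal-specific weights, one matrix per modality.
        self.heads = {m: make_matrix(dim, out_dim, rng) for m in MODALITIES}

    def forward(self, tokens):
        hidden = linear(tokens, self.shared)  # computed once for all modalities
        return {m: linear(hidden, head) for m, head in self.heads.items()}


model = SharedPlusSpecific(dim=4, out_dim=2)
latent_tokens = [
    [1.0, 0.0, 0.5, -0.5],
    [0.2, 0.3, 0.1, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
outputs = model.forward(latent_tokens)
for name, tokens in outputs.items():
    print(name, len(tokens), "tokens of width", len(tokens[0]))
```

In this sketch the shared layer is evaluated once and its output is reused by every head, which is one way a single model could exploit cues common to all modalities instead of running a separate model per modality.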

Taxonomy

Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Video Surveillance and Tracking Methods