MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer
Guile Wu, David Huang, Dongfeng Bai, Bingbing Liu

TL;DR
This paper introduces MoVieDrive, a unified diffusion transformer model that generates multi-modal, multi-view urban scene videos for autonomous driving. Unlike existing RGB-only methods, it also produces modalities such as depth maps and semantic maps, and it supports controllable generation.
Contribution
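It presents a multi-modal multi-view diffusion transformer that generates RGB video together with other modalities, such as depth maps and semantic maps, in a single controllable framework for urban scene synthesis, rather than using a separate model per modality.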
Findings
Achieves high-quality multi-modal video generation
Supports controllable scene structure and content
Outperforms state-of-the-art methods in experiments
Abstract
Urban scene synthesis with video generation models has recently shown great potential for autonomous driving. Existing video generation approaches to autonomous driving primarily focus on RGB video generation and lack the ability to support multi-modal video generation. However, multi-modal data, such as depth maps and semantic maps, are crucial for holistic urban scene understanding in autonomous driving. Although it is feasible to use multiple models to generate different modalities, this increases the difficulty of model deployment and does not leverage complementary cues for multi-modal data generation. To address this problem, in this work, we propose a novel multi-modal multi-view video generation approach to autonomous driving. Specifically, we construct a unified diffusion transformer model composed of modal-shared components and modal-specific components. Then, we leverage…
Taxonomy
Topics: Advanced Vision and Imaging · Video Analysis and Summarization · Video Surveillance and Tracking Methods
