MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes
Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, Qiang Xu

TL;DR
MagicDrive3D is a new framework that enables controllable, high-quality 3D street scene generation from autonomous driving data, supporting multi-condition control and any-view rendering, with applications in autonomous driving simulation.
Contribution
It introduces a novel multi-view video synthesis approach combined with 3D scene generation, reducing data collection challenges and enabling flexible control in 3D street scene modeling.
Findings
Generates diverse, high-quality 3D street scenes
Supports multi-condition control including maps, objects, and text
Enhances downstream tasks like BEV segmentation
Abstract
Controllable generative models for images and videos have seen significant success, yet 3D scene generation, especially in unbounded scenarios like autonomous driving, remains underdeveloped. Existing methods lack flexible controllability and often rely on dense view data collection in controlled environments, limiting their generalizability across common datasets (e.g., nuScenes). In this paper, we introduce MagicDrive3D, a novel framework for controllable 3D street scene generation that combines video-based view synthesis with 3D representation (3DGS) generation. It supports multi-condition control, including road maps, 3D objects, and text descriptions. Unlike previous approaches that require 3D representation before training, MagicDrive3D first trains a multi-view video generation model to synthesize diverse street views. This method utilizes routinely collected autonomous driving…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper presents impressive visualization results, with generated scenes that are virtually indistinguishable from real-world counterparts. It introduces an innovative generation-first, reconstruction-later pipeline, which simplifies both scene control and data acquisition, offering a more streamlined approach to 3D scene synthesis. The deformable Gaussian splatting (DGS) method significantly enhances the quality of both generated and reconstructed views, demonstrating robust performance in
The method occasionally struggles with generating intricate objects, such as pedestrians, and detailed texture areas, like road fences, which can affect the realism of the scenes in certain contexts. The experiments are conducted solely on the nuScenes dataset, which includes 700 training and 150 validation clips. Although widely used, this dataset may not fully capture the complexity of real-world environments, raising concerns about the method’s generalizability to more diverse and challengin
- [S1: Significance] The paper addresses an important problem in the field of computer vision: controllable 3D scene generation. The proposed method has the potential to be used in a variety of applications, including autonomous driving simulation, virtual reality, and video gaming.
- [W1] The technical contribution of pose-conditioned video generation and its role in the framework are not clearly stated. - [W1.1] According to Figure 2, it looks like the video generator works without conditioning on input camera images. If that is the case, the reviewer would like to understand what the benefit is of feeding the video-generated multi-view data to Stage 2 compared to using ground-truth data. Based on my understanding, the exposure discrepancy across multi-views and dyn
1. The paper is well-structured and straightforward to understand. 2. The concept of breaking down 3D scene generation into a sequential multi-view generative stage followed by a static reconstruction stage, utilizing two distinct representations that have proven effective in their respective areas, is particularly intriguing. 3. The ablation studies demonstrate a significant improvement over the selected baselines (3DGS and LucidDreamer).
1. The performance on test views is not particularly strong. As noted in the manuscript, the PSNR on novel views in both test settings is below 22. While this work does advance the field of scene generation, it is not yet suitable for practical applications, such as generating synthetic data for end-to-end autonomous driving policy training. 2. The manuscript lacks a comparison with key baselines during the reconstruction phase, specifically Street Gaussians [A]. 3. Have you attempted a long-ter
1. The proposed framework supports controllable scene generation using BEV maps, 3D bounding boxes, and text descriptions, which enhances its applicability in tasks like autonomous driving simulations. 2. The introduction of deformable 3D GS effectively addresses local dynamics and exposure discrepancies, ensuring better scene generation quality.
1. MagicDrive3D is composed of two parts: a video generation model and a 3DGS stage that recovers 3D scenes from images. Both were proposed in previous works; although the paper shows technical improvements, this still limits its overall novelty. 2. The comparison in Table 2 is made only with vanilla 3D-GS, yet there are several other dynamic 3D-GS methods for road scenes (e.g., PVG[1], StreetGaussian[2]) that should also be considered for comparison. [1] Periodic Vibration Gaussian: Dynamic Urban Scene
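One review above notes that PSNR on novel views is below 22. As a rough aid for reading that number, the following is a minimal sketch of the standard PSNR definition, assuming 8-bit pixel values (peak of 255) and images flattened to plain lists. The function names `psnr` and `rmse_for_psnr` are illustrative and do not come from the paper or its code.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes pixel intensities lie in [0, max_value] (8-bit by default).
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db, max_value=255.0):
    """Root-mean-square pixel error implied by a given PSNR value."""
    return max_value / (10.0 ** (psnr_db / 20.0))


if __name__ == "__main__":
    # A PSNR of 22 dB corresponds to an RMS error of about 20 intensity
    # levels on an 8-bit scale.
    print(round(rmse_for_psnr(22.0), 2))
```

Under these assumptions, a PSNR of 22 dB corresponds to a root-mean-square error of roughly 20 intensity levels out of 255. How the paper itself computes PSNR (color space, value range, averaging over views) is not stated in this excerpt.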
Taxonomy
Topics: Computer Graphics and Visualization Techniques · 3D Modeling in Geospatial Applications · 3D Surveying and Cultural Heritage
