Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang

TL;DR
Dream4Drive introduces a synthetic data generation framework that enhances perception tasks in autonomous driving by creating diverse, multi-view, photorealistic videos, significantly improving corner case detection and perception model performance.
Contribution
The paper presents Dream4Drive, a novel synthetic data generator that improves downstream perception tasks by enabling scalable, multi-view, photorealistic video creation. It also introduces the DriveObj3D dataset for diverse 3D-aware video editing.
Findings
Synthetic data boosts perception model performance across training epochs.
Dream4Drive generates multi-view corner cases at scale.
The framework enhances perception in autonomous driving scenarios.
Abstract
Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into…
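The epoch-accounting argument in the abstract can be made concrete with a small sketch. The code below is only an illustration, not the authors' training code: the `Stage` type, the helper names, and the base epoch count of 24 are all hypothetical. It shows why a pretrain-then-finetune schedule should be compared against a real-data-only baseline trained for the same total number of epochs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage: which data source is used and for how many epochs."""
    data: str    # "synthetic" or "real" (labels are illustrative)
    epochs: int


def total_epochs(stages):
    """Sum the epochs over all stages of a schedule."""
    return sum(stage.epochs for stage in stages)


def epoch_matched_baseline(stages):
    """Real-data-only schedule with the same total epoch budget as `stages`."""
    return [Stage("real", total_epochs(stages))]


BASE_EPOCHS = 24  # hypothetical value, not taken from the paper

# Baseline as commonly reported: real data only.
baseline = [Stage("real", BASE_EPOCHS)]

# Strategy described in the abstract: pretrain on synthetic, then finetune on real.
pretrain_finetune = [Stage("synthetic", BASE_EPOCHS), Stage("real", BASE_EPOCHS)]

# Baseline with a matched epoch budget, i.e. the "doubled" baseline.
fair_baseline = epoch_matched_baseline(pretrain_finetune)

for name, schedule in [
    ("baseline", baseline),
    ("pretrain+finetune", pretrain_finetune),
    ("epoch-matched baseline", fair_baseline),
]:
    print(f"{name}: {total_epochs(schedule)} epochs")
```

Under these assumptions the pretrain-then-finetune schedule uses twice the epochs of the plain baseline, so any gain it shows mixes the effect of synthetic data with the effect of longer training. Comparing against `fair_baseline` isolates the former.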
Peer Reviews
Decision·ICLR 2026 Poster
- Identifies and analyzes the limitations of current generative methods in downstream tasks, showing that even a small amount (as little as 2%) of high-quality generated data can lead to significant performance gains.
- Introduces DriveObj3D, a large-scale 3D asset dataset encompassing typical object categories in driving scenarios, which supports diverse and 3D-aware video editing applications.
- The method relies on scene insertion and asset insertion that currently lack sufficient automation and are primarily dependent on manual effort.
1. Dream4Drive introduces a combination of dense 3D-aware guidance maps with generative video editing, enabling realistic, geometry-consistent, multi-view scene synthesis.
2. Demonstrating that fewer than 2% synthetic samples can enhance real data training is a strong and practically relevant result, suggesting high efficiency in data augmentation.
3. The proposed DriveObj3D fills a gap in open 3D asset resources for driving research, supporting reproducibility, and future extension in video-l
1. The paper does not clarify how inserted 3D assets are constrained to remain physically valid: ensuring no collisions, adherence to drivable areas, and correct orientations, nor how occlusion layers between objects are resolved across depth, normal, and edge maps.
2. There is no qualitative criterion for evaluating 3D asset realism or consistency, raising uncertainty about the reliability of the generated dataset and edited scenes.
3. As shown in Table 3, the method’s performance fluctuates
- The paper identifies a key shortcoming in most generative simulators for autonomous driving data: improvements in downstream task from training on both real and synthetic data have in the past been attributed only to the addition of synthetic data while ignoring that increasing training epochs improves performance even just using the base dataset (nuScenes).
- The paper provides a large-scale dataset (DriveObj3D) along with a method to insert assets from the dataset in a natural manner to matc
W1. The paper lacks novelty beyond the Epoch related finding. The actual data generation framework is a composition of existing models and methods rather than a technical contribution. While adding multiview images as conditions to generate a 3D mesh for 3D assets does improve asset quality over meshes generated from single view, it seems like an unsurprising improvement. Moreover, the actual improvement in scores from Table 5 compared to other 3D asset generation methods is relatively small, de
