UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving

Guosheng Zhao; Yaozeng Wang; Xiaofeng Wang; Zheng Zhu; Tingdong Yu; Guan Huang; Yongchen Zai; Ji Jiao; Changliang Xue; Xiaole Wang; Zhen Yang; Futang Zhu; Xingang Wang

arXiv:2602.02002·cs.CV·February 3, 2026

Open Access

TL;DR

UniDriveDreamer introduces a unified multimodal world model for autonomous driving that directly generates synchronized multi-camera and LiDAR observations, improving data synthesis and downstream task performance.

Contribution

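The paper presents a single-stage framework that combines modality-specific VAEs (one for LiDAR sequences, one for multi-camera video), a cross-modal alignment technique called Unified Latent Anchoring (ULA), and a diffusion transformer that jointly generates future multimodal observations.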

Findings

01. Outperforms previous methods in multimodal data synthesis
02. Improves downstream autonomous driving tasks
03. Demonstrates stable training and cross-modal consistency

Abstract

World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer…
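The abstract does not say how ULA performs the alignment, so the following is only an illustrative sketch of the general idea of bringing two latent distributions onto a common scale before fusion. It assumes a simple per-channel standardization (zero mean, unit variance) and token-wise concatenation; the function names `anchor_latents` and `fuse`, the tensor shapes, and the random stand-in latents are all hypothetical and are not taken from the paper.

```python
import numpy as np


def anchor_latents(z, eps=1e-6):
    """Standardize latents per channel (zero mean, unit variance over tokens).

    Assumption: moment matching stands in for the paper's ULA, whose
    actual formulation is not given in the abstract.
    """
    mean = z.mean(axis=0, keepdims=True)
    std = z.std(axis=0, keepdims=True)
    return (z - mean) / (std + eps)


def fuse(video_latents, lidar_latents):
    """Align each modality, then concatenate along the token axis."""
    return np.concatenate(
        [anchor_latents(video_latents), anchor_latents(lidar_latents)], axis=0
    )


# Stand-in latents with shape (tokens, channels); real ones would come from
# the video VAE and the LiDAR VAE encoders.
rng = np.random.default_rng(0)
video_latents = rng.normal(loc=0.0, scale=1.0, size=(64, 8))
lidar_latents = rng.normal(loc=5.0, scale=20.0, size=(32, 8))

fused = fuse(video_latents, lidar_latents)
print(fused.shape)
```

Without some alignment of this kind, the LiDAR latents in this toy setup would dominate the fused sequence purely because of their larger scale, which is one plausible reason mismatched latent statistics could hurt training stability.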

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis