Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

Xiangyu Guo; Zhanqian Wu; Kaixin Xiong; Ziyang Xu; Lijun Zhou; Gangwei Xu; Shaoqing Xu; Haiyang Sun; Bing Wang; Guang Chen; Hangjun Ye; Wenyu Liu; Xinggang Wang

arXiv:2506.07497 · cs.CV · June 23, 2025

TL;DR

Genesis jointly generates multi-view driving videos and LiDAR sequences that stay consistent with each other, using a shared latent space and semantic supervision to improve the realism of the generated data and its utility for autonomous driving applications.

Contribution

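Genesis introduces a two-stage architecture that pairs a DiT-based video diffusion model using 3D-VAE encoding with a BEV-aware LiDAR generator using NeRF-based rendering. The two modalities are coupled through a shared latent space, and a captioning module, DataCrafter, supplies structured semantic guidance.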

Findings

01 Achieves state-of-the-art video and LiDAR metrics on the nuScenes benchmark (FVD 16.95, FID 4.24, Chamfer 0.611)

02 Benefits downstream tasks such as segmentation and 3D detection

03 Demonstrates the semantic fidelity and practical utility of the generated data

Abstract

We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including…
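The Chamfer figure quoted above measures how closely a generated LiDAR point cloud matches a reference one. As a rough illustration only, the sketch below computes one common symmetric form of the Chamfer distance (mean nearest-neighbor Euclidean distance in both directions) in plain Python. The exact variant, units, and normalization used by the authors are not stated here, so the function name `chamfer_distance` and the toy point sets are assumptions made for illustration.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed form (not confirmed as the authors' variant): the mean
    nearest-neighbor Euclidean distance from a to b, plus the same
    quantity from b to a.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # For each source point, find the closest destination point,
        # then average those distances over the source set.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)


# Toy point clouds (hypothetical data, not taken from nuScenes).
generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

print(chamfer_distance(generated, reference))
```

This brute-force version is quadratic in the number of points, which is fine for a toy example; real LiDAR sweeps would normally call for a spatial index or a batched GPU implementation.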

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Vision and Imaging

Methods: Diffusion