SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
Hiba Dahmani, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Laurent Caraffa, Jean-Philippe Tarel, Roland Brémond

TL;DR
This paper introduces SEM-ROVER, a novel 3D generative framework for large-scale outdoor driving scenes using a semantic-conditioned diffusion model on a discrete voxel grid, enabling consistent, photorealistic scene generation.
Contribution
The work presents a scalable 3D scene generation method based on a new voxel representation and diffusion model, overcoming limitations of prior small-scale or view-dependent approaches.
Findings
Generates diverse large-scale urban outdoor scenes.
Produces photorealistic images with various sensor setups and camera paths.
Maintains scene consistency across multiple viewpoints.
Abstract
Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming the geometric coherence and restricting the rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on -Voxfield grid, a discrete representation where each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated…
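The representation and the outpainting schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `VoxelGrid`, `K`, and `overlapping_windows`, the sample layout `(x, y, z, r, g, b)`, and all sizes are hypothetical, and the diffusion model itself is omitted. The sketch only shows a sparse grid in which each occupied voxel holds a fixed number of colorized surface samples, and a way to enumerate overlapping regions along one axis so that each new region shares cells with the previous one.

```python
from dataclasses import dataclass, field

K = 4  # hypothetical number of surface samples stored per occupied voxel


@dataclass
class VoxelGrid:
    """Sparse grid: only occupied voxels are stored (illustrative layout)."""
    voxel_size: float
    # (i, j, k) voxel index -> list of exactly K (x, y, z, r, g, b) samples
    cells: dict = field(default_factory=dict)

    def set_voxel(self, index, samples):
        # Enforce the fixed per-voxel sample count.
        if len(samples) != K:
            raise ValueError(f"expected {K} samples, got {len(samples)}")
        self.cells[index] = list(samples)


def overlapping_windows(extent, window, overlap):
    """Return start offsets of windows covering [0, extent) along one axis.

    Consecutive windows share `overlap` cells, so a generator could
    condition each new region on the already generated overlap.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + window >= extent:  # this window reaches the end
            break
        start += stride
    return starts


grid = VoxelGrid(voxel_size=0.5)
grid.set_voxel((0, 0, 0), [(0.1, 0.1, 0.1, 255, 0, 0)] * K)
starts = overlapping_windows(extent=10, window=4, overlap=2)
```

With an extent of 10 cells, a window of 4, and an overlap of 2, the windows start at offsets 0, 2, 4, and 6. A larger overlap gives each step more context from earlier regions at the cost of more generation steps.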
