BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving

Zeming Chen; Hang Zhao

arXiv:2507.00707·cs.CV·July 2, 2025

TL;DR

BEV-VAE is a multi-view image generation framework for autonomous driving that models the scene explicitly in 3D through a unified BEV latent space, enabling spatially consistent and controllable view synthesis.

Contribution

The paper proposes BEV-VAE, which pairs a multi-view image variational autoencoder that learns a compact, unified BEV latent space with a latent diffusion transformer that generates scenes in that space.

Findings

01

Strong performance on nuScenes and Argoverse 2 datasets.

02

3D-consistent reconstruction and generation across camera views.

03

Supports arbitrary view synthesis given camera configurations, optionally conditioned on 3D layouts.

Abstract

Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task and lack explicit 3D modeling. We argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder to obtain a compact and unified BEV latent space, and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D-consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.
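
To make "arbitrary view generation given camera configurations" more concrete, the following is a minimal sketch of the standard pinhole projection that ties a 3D point in the ego-vehicle frame to a pixel in one camera. It is not taken from the BEV-VAE code base. The function name `project_point`, the axis conventions (ego frame with x forward, y left, z up; camera frame with x right, y down, z forward), the yaw-only camera rotation, and all numeric values are assumptions chosen for illustration.

```python
import math


def project_point(point_ego, cam_pos, cam_yaw, focal, cx, cy):
    """Project a 3D ego-frame point into one camera (hypothetical helper).

    Assumptions: the ego frame is x forward, y left, z up; the camera is
    rotated only about the vertical axis by cam_yaw (radians); the camera
    frame is x right, y down, z forward; there is no lens distortion.
    Returns (u, v) in pixels, or None if the point is behind the camera.
    """
    # Translate so the camera centre is the origin.
    dx = point_ego[0] - cam_pos[0]
    dy = point_ego[1] - cam_pos[1]
    dz = point_ego[2] - cam_pos[2]

    # Undo the camera yaw (inverse rotation about the z axis).
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    forward = c * dx + s * dy
    left = -s * dx + c * dy
    up = dz

    # Points at or behind the image plane are not visible in this view.
    if forward <= 0.0:
        return None

    # Re-label axes into the camera convention, then apply the pinhole model.
    x_cam, y_cam, z_cam = -left, -up, forward
    u = focal * x_cam / z_cam + cx
    v = focal * y_cam / z_cam + cy
    return (u, v)


if __name__ == "__main__":
    # Illustrative rig: a front and a rear camera at the ego origin.
    cameras = {"front": 0.0, "rear": math.pi}
    point = (10.0, -1.0, 0.0)  # 10 m ahead, 1 m to the right
    for name, yaw in cameras.items():
        pixel = project_point(point, (0.0, 0.0, 0.0), yaw, 1000.0, 800.0, 450.0)
        print(name, pixel)
```

The same 3D point lands at different pixels, or falls outside the view entirely, depending on the camera pose and intrinsics. A shared 3D scene representation lets every view be derived from one set of geometry in this way, which is the intuition behind generating views from a unified BEV latent rather than as independent 2D images.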
