BEVWorld: A Multimodal World Simulator for Autonomous Driving via Scene-Level BEV Latents
Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiaofan Li, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

TL;DR
BEVWorld introduces a unified multimodal BEV latent space for holistic environment modeling in autonomous driving, enabling realistic future scene generation and improved downstream task performance.
Contribution
It presents a novel framework combining a multi-modal tokenizer and a BEV sequence diffusion model for joint scene encoding and future forecasting.
Findings
Generates realistic future scenes
Improves perception and motion prediction tasks
Performs strongly on autonomous driving benchmarks
Abstract
World models have attracted increasing attention in autonomous driving for their ability to forecast potential future scenarios. In this paper, we propose BEVWorld, a novel framework that transforms multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for holistic environment modeling. The proposed world model consists of two main components: a multi-modal tokenizer and a latent BEV sequence diffusion model. The multi-modal tokenizer first encodes heterogeneous sensory data, and its decoder reconstructs the latent BEV tokens into LiDAR and surround-view image observations via ray-casting rendering in a self-supervised manner. This enables joint modeling and bidirectional encoding-decoding of panoramic imagery and point cloud data within a shared spatial representation. On top of this, the latent BEV sequence diffusion model performs temporally…
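The abstract says the tokenizer's decoder reconstructs observations from BEV latents "via ray-casting rendering" but gives no formulation. As a rough illustration only, the sketch below shows a generic density-based volume rendering step along a single ray, of the kind popularized by neural radiance fields. This is an assumption, not the paper's confirmed method. The function name `render_ray` and its inputs (per-sample densities and sample depths, imagined here as values decoded from a BEV latent) are hypothetical.

```python
import math


def render_ray(densities, depths):
    """Composite per-sample densities along one ray.

    Generic volume-rendering sketch (an assumption, not BEVWorld's
    confirmed formulation).

    densities: non-negative density at each sample along the ray
               (hypothetical values decoded from a BEV latent)
    depths:    strictly increasing sample distances along the ray
    Returns (weights, expected_depth, residual_transmittance).
    """
    if len(densities) != len(depths):
        raise ValueError("densities and depths must have the same length")

    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for i, sigma in enumerate(densities):
        # Segment length: distance to the next sample; the last sample
        # reuses the previous spacing (or 1.0 if it is the only sample).
        if i + 1 < len(depths):
            delta = depths[i + 1] - depths[i]
        elif i > 0:
            delta = depths[i] - depths[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha

    # Unnormalized expected depth: it underestimates the true distance
    # whenever the ray is not fully absorbed (residual transmittance > 0).
    expected_depth = sum(w * t for w, t in zip(weights, depths))
    return weights, expected_depth, transmittance
```

For example, calling `render_ray([0.0, 0.0, 50.0, 0.0], [1.0, 2.0, 3.0, 4.0])` puts nearly all of the weight on the third sample, so the rendered depth is close to 3.0. In a sketch like this, a LiDAR-style range would be read from the expected depth, while an image pixel could be formed by applying the same weights to per-sample colors or features.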
Taxonomy
Topics: Data Management and Algorithms · Topic Modeling · Big Data Technologies and Applications
Methods: Softmax · Attention Is All You Need · Diffusion
