GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
Zhenya Yang, Zhe Liu, Yuxiang Lu, Liping Hou, Chenxuan Miao, Siyi Peng, Bailan Feng, Xiang Bai, Hengshuang Zhao

TL;DR
GenieDrive introduces a physics-informed framework for driving video generation that leverages 4D occupancy, a specialized VAE, and attention mechanisms to produce controllable, multi-view, and physically consistent driving videos efficiently.
Contribution
The paper presents a novel physics-aware driving video generation framework using 4D occupancy, a compact VAE, and attention modules, achieving improved accuracy and efficiency over prior methods.
Findings
7.2% improvement in forecasting mIoU
20.7% reduction in FVD for video quality
41 FPS inference speed with 3.47 M parameters
Abstract
Physics-aware driving world model is essential for drive planning, out-of-distribution data synthesis, and closed-loop evaluation. However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. Our approach starts by generating 4D occupancy, which serves as a physics-informed foundation for subsequent video generation. 4D occupancy contains rich physical information, including high-resolution 3D structures and dynamics. To facilitate effective compression of such high-resolution occupancy, we propose a VAE that encodes occupancy into a latent tri-plane representation, reducing the latent size to only 58% of that used in previous methods. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Human Motion and Animation · Model Reduction and Neural Networks
