DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes
Hengwei Bian, Lingdong Kong, Haozhe Xie, Liang Pan, Yu Qiao, Ziwei Liu

TL;DR
DynamicCity is a 4D occupancy generation framework that produces large-scale, high-quality dynamic urban scenes with semantics. It pairs a VAE that learns a compact HexPlane representation with a diffusion model for generation, and it outperforms existing methods on the CarlaSC and Waymo datasets.
Contribution
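The paper introduces a 4D occupancy generation framework that combines a HexPlane-based scene representation, learned by a VAE, with a diffusion model. The reported gains cover quality (HexPlane fitting mIoU), efficiency (training speed and memory use), and versatility (generation under diverse conditions).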
Findings
Up to 12.56 mIoU improvement in HexPlane fitting
2.06x training speedup and 70.84% memory reduction
Outperforms state-of-the-art methods on CarlaSC and Waymo datasets
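The mIoU figure above refers to mean intersection-over-union between predicted and ground-truth semantic occupancy labels. The following is a minimal sketch of that metric on flattened label lists. The function name `mean_iou` and the choice to skip classes absent from both inputs are assumptions made for illustration; the paper's exact evaluation protocol (ignored classes, per-frame averaging) is not stated in this summary.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU (in percent) over classes present in pred or target.

    pred, target: equal-length sequences of integer class labels,
    e.g. a flattened semantic occupancy grid.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class appears in pred or target")
    return 100.0 * sum(ious) / len(ious)


# Toy example: six voxels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 0, 1, 2, 2, 2]
print(round(mean_iou(pred, target, num_classes=3), 2))
```

In the toy example the per-class IoUs are 1, 1/2, and 2/3, so the mean is about 72.22.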
Abstract
Urban scene generation has been developing rapidly recently. However, existing methods primarily focus on generating static and single-frame scenes, overlooking the inherently dynamic nature of real-world driving environments. In this work, we introduce DynamicCity, a novel 4D occupancy generation framework capable of generating large-scale, high-quality dynamic 4D scenes with semantics. DynamicCity mainly consists of two key models. 1) A VAE model for learning HexPlane as the compact 4D representation. Instead of using naive averaging operations, DynamicCity employs a novel Projection Module to effectively compress 4D features into six 2D feature maps for HexPlane construction, which significantly enhances HexPlane fitting quality (up to 12.56 mIoU gain). Furthermore, we utilize an Expansion & Squeeze Strategy to reconstruct 3D feature volumes in parallel, which improves both network…
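To make the HexPlane idea concrete, here is a minimal NumPy sketch of the naive averaging baseline that the abstract contrasts with the Projection Module, followed by a broadcast-based reconstruction of a dense volume from the six planes. Everything here is an illustrative assumption: the `(C, T, X, Y, Z)` layout, the function names, and especially the choice to combine the expanded planes by summation. It is not the paper's Projection Module or its Expansion & Squeeze Strategy, whose details are not given in this summary.

```python
import numpy as np

# Assumed feature layout: (C, T, X, Y, Z). Axis 0 holds channels.
AXES = {"t": 1, "x": 2, "y": 3, "z": 4}
PLANES = ["xy", "xz", "yz", "tx", "ty", "tz"]


def average_to_hexplane(feat):
    """Collapse a (C, T, X, Y, Z) volume into six 2D planes by averaging.

    Each plane keeps two of the four axes and averages over the other two,
    e.g. "xy" has shape (C, X, Y) and "tz" has shape (C, T, Z).
    """
    planes = {}
    for name in PLANES:
        keep = {AXES[a] for a in name}
        drop = tuple(ax for ax in AXES.values() if ax not in keep)
        planes[name] = feat.mean(axis=drop)
    return planes


def expand_and_combine(planes, shape):
    """Broadcast every plane back to `shape` and sum them in one pass.

    Summation is an assumed combination rule chosen for simplicity.
    The point is that broadcasting fills the whole volume at once
    instead of querying the planes voxel by voxel.
    """
    out = np.zeros(shape, dtype=np.float64)
    for name, plane in planes.items():
        keep = {AXES[a] for a in name}
        missing = tuple(ax for ax in AXES.values() if ax not in keep)
        out += np.expand_dims(plane, axis=missing)  # singleton axes broadcast
    return out


feat = np.ones((2, 3, 4, 5, 6))  # toy volume: C=2, T=3, X=4, Y=5, Z=6
planes = average_to_hexplane(feat)
volume = expand_and_combine(planes, feat.shape)
print({name: p.shape for name, p in planes.items()})
print(volume.shape)
```

Averaging discards everything that varies along the dropped axes, which is one plausible reason a learned projection could fit scenes better than this baseline.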
Peer Reviews
Decision: ICLR 2025 Spotlight
- Compared to existing methods that lack the ability of long-term dynamic generation, this paper utilizes HexPlane as the compact 4D representation and reorganizes it into one feature map to achieve efficient reconstruction and generation.
- The parallel decoding proposed in the Expansion & Squeeze Strategy (ESS) alleviates the problem of dense queries and further improves the generation efficiency.
- Based on the VAE and DiT pipeline, the authors introduce diverse conditions for generatio…

- Despite the compact HexPlane and parallel decoder, the dense feature volume and projection module of the autoencoder are still very heavy, which limits its efficiency and scalability.
- The sample of the dataset is quite limited, which may lead to overfitting and memorization of the data by the generation model. This paper also lacks clarification of the division of training and test sets, as well as experiments and comparative results for their generalization ability and generative diversity.

- [S0] Flexible modeling approach which seems to scale well to large scenes while still allowing a wide range of rich conditioning methods.
- [S1] The proposed approach outperforms OccSora, a very modern competitor, in a wide range of metrics, including FID, precision, and recall.
- [S2] Some interesting implementation tricks could potentially be applied to other related tasks. For example, diffusion runs on a 2D setting with a clever tiling of the six HexPlanes into a single plane (Fig…

- [W0] The pretrained networks used to calculate IS, FID, and KID should be motivated more thoroughly, especially in the 2D case.
  - For example, it is not clear why it is meaningful to use a CNN presumably trained on ImageNet or COCO to reason about samples consisting of semantic color maps. Unless this 2D CNN is trained to process semantic color maps as inputs, passing semantic color maps to such a CNN would produce OOD feature maps.
- [W1] One conceptual limitation is the f…

1. The paper is well-structured and easy to follow.
2. The generated results are impressive, with an accompanying demonstration video that effectively showcases the method’s capabilities.
3. The proposed method has a clear motivation and presents a well-reasoned pipeline.

1. I’m a little confused by the title and the task definition. While the authors state that the proposed method is designed for “LiDAR” generation, the results seem more akin to “occupancy” generation. These concepts are distinct, despite both representing the scene’s geometry. For true LiDAR generation [1][2], the outputs should be LiDAR point clouds that reflect the sampling properties of LiDAR sensors (e.g., ray drop, ray-based sampling, etc.).
2. The authors note that the method can support l…
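One of the reviews above notes that diffusion runs in a 2D setting because the six HexPlanes are tiled into a single plane. The sketch below shows one simple way such a tiling could work: each plane is zero-padded into a cell of a 2 x 3 grid, and the inverse crops the planes back out. The grid layout, the function names `tile_planes` and `untile_planes`, and the toy sizes are assumptions for illustration only; the paper's actual arrangement is not described in this summary.

```python
import numpy as np

ORDER = ["xy", "xz", "yz", "tx", "ty", "tz"]


def tile_planes(planes, order=ORDER):
    """Pack six (C, H_i, W_i) planes into one (C, 2*H, 3*W) canvas.

    H and W are the largest plane height and width. Smaller planes
    sit in the top-left corner of their cell; the rest stays zero.
    """
    c = planes[order[0]].shape[0]
    h = max(planes[n].shape[1] for n in order)
    w = max(planes[n].shape[2] for n in order)
    canvas = np.zeros((c, 2 * h, 3 * w), dtype=planes[order[0]].dtype)
    for i, name in enumerate(order):
        row, col = divmod(i, 3)
        ph, pw = planes[name].shape[1:]
        canvas[:, row * h:row * h + ph, col * w:col * w + pw] = planes[name]
    return canvas, (h, w)


def untile_planes(canvas, shapes, cell, order=ORDER):
    """Crop the planes back out of the canvas (inverse of tile_planes)."""
    h, w = cell
    out = {}
    for i, name in enumerate(order):
        row, col = divmod(i, 3)
        ph, pw = shapes[name]
        out[name] = canvas[:, row * h:row * h + ph, col * w:col * w + pw]
    return out


# Toy sizes (assumed): C=2 channels, T=3, X=4, Y=5, Z=6.
size = {"t": 3, "x": 4, "y": 5, "z": 6}
rng = np.random.default_rng(0)
planes = {n: rng.normal(size=(2, size[n[0]], size[n[1]])) for n in ORDER}

canvas, cell = tile_planes(planes)
shapes = {n: p.shape[1:] for n, p in planes.items()}
restored = untile_planes(canvas, shapes, cell)
print(canvas.shape)
```

A single canvas like this can be fed to an ordinary 2D image-style diffusion backbone, at the cost of some wasted zero-padded area when the planes differ in size.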
Taxonomy
Topics: Remote Sensing and LiDAR Applications · 3D Surveying and Cultural Heritage · Robotics and Sensor-Based Localization
Methods: Diffusion · Focus
