TL;DR
$I^{2}$-World introduces an efficient 4D scene forecasting framework that hierarchically tokenizes 3D scenes, capturing spatial and temporal dependencies, and achieves state-of-the-art accuracy with high computational efficiency.
Contribution
The paper proposes a novel intra-inter scene tokenization method with an encoder-decoder architecture for efficient and accurate 4D occupancy forecasting.
Findings
Outperforms existing methods on 4D occupancy forecasting by 25.1% in mIoU and 36.9% in IoU.
Requires only 2.9 GB of memory and runs at 37 FPS.
Achieves state-of-the-art results in 4D scene forecasting.
Abstract
Forecasting the evolution of 3D scenes and generating unseen scenarios via occupancy-based world models offers substantial potential for addressing corner cases in autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design preserves the compactness of 3D tokenizers while retaining the dynamic expressiveness of 4D tokenizers. Unlike…
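The multi-scale residual quantization mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a 1D feature list in place of a 3D scene, a scalar codebook, average pooling for downsampling, and nearest-neighbour upsampling. All function names are hypothetical. Each scale quantizes whatever the coarser scales failed to explain, so the token lists run from coarse to fine.

```python
def avg_pool(x, size):
    """Downsample a 1D list to `size` cells by averaging equal-width chunks.
    Assumes len(x) is divisible by size."""
    step = len(x) // size
    return [sum(x[i * step:(i + 1) * step]) / step for i in range(size)]


def upsample(x, length):
    """Nearest-neighbour upsample back to `length` cells.
    Assumes length is divisible by len(x)."""
    step = length // len(x)
    return [v for v in x for _ in range(step)]


def quantize(x, codebook):
    """Return the index of the nearest codebook entry for every cell."""
    return [min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
            for v in x]


def multiscale_residual_quantize(features, codebook, scales):
    """Tokenize `features` coarse-to-fine; returns (tokens, reconstruction)."""
    residual = list(features)
    recon = [0.0] * len(features)
    tokens = []
    for size in scales:
        # Quantize the part of the signal not yet explained, at this scale.
        idx = quantize(avg_pool(residual, size), codebook)
        tokens.append(idx)
        # Add this scale's contribution to the running reconstruction.
        approx = upsample([codebook[k] for k in idx], len(features))
        recon = [r + a for r, a in zip(recon, approx)]
        # The next (finer) scale sees only the remaining error.
        residual = [f - r for f, r in zip(features, recon)]
    return tokens, recon


# Hypothetical toy input: 8 feature cells, 4 scales, 7 scalar codes.
features = [0.9, 1.1, 0.4, 0.6, -0.3, -0.2, 0.0, 0.1]
codebook = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]
tokens, recon = multiscale_residual_quantize(features, codebook, (1, 2, 4, 8))
print(tokens)
print(recon)
```

Because the toy codebook contains 0.0, a scale can always choose to add nothing, so the squared reconstruction error never increases from one scale to the next.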
