OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence
Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang

TL;DR
This paper introduces OccSTeP, a benchmark for 4D occupancy spatio-temporal persistence in autonomous driving, together with OccSTeP-WM, a world model for scene understanding and forecasting that stays robust to noisy or missing data.
Contribution
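The paper presents the first OccSTeP benchmark, which includes challenging scenarios such as erroneous semantic labels and dropped frames, and proposes OccSTeP-WM, a tokenizer-free, dense voxel-based world model with a linear-complexity attention backbone and a recurrent state-space module.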
Findings
Achieved 23.70% semantic mIoU with a 6.56% improvement.
Achieved 35.89% occupancy IoU with a 9.26% improvement.
Demonstrated robustness in online inference with noisy or missing data.
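For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how occupancy IoU and semantic mIoU are commonly computed on voxel label grids. It is not the paper's evaluation code: the function names, the `free_label` convention (class 0 meaning empty space), and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
import numpy as np

def occupancy_iou(pred, gt, free_label=0):
    """Class-agnostic IoU: every voxel that is not `free_label` counts as occupied."""
    pred_occ = pred != free_label
    gt_occ = gt != free_label
    union = np.logical_or(pred_occ, gt_occ).sum()
    if union == 0:
        return float("nan")  # no occupied voxel in either grid
    return float(np.logical_and(pred_occ, gt_occ).sum() / union)

def semantic_miou(pred, gt, num_classes, free_label=0):
    """Mean of per-class IoU over the non-free classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # assumption: classes absent from both grids are skipped
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 grid with labels 0 (free), 1, and 2.
pred = np.array([[0, 1], [2, 2]])
gt = np.array([[0, 1], [1, 2]])
iou = occupancy_iou(pred, gt)
miou = semantic_miou(pred, gt, num_classes=3)
```

In the toy grid the occupied regions coincide exactly, while one voxel carries the wrong class, so the semantic score is lower than the occupancy score. This is why the two numbers are reported separately.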
Abstract
Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory…
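The two disturbance types named in the abstract, erroneous semantic labels and dropped frames, can be illustrated with a small sketch. This is a hypothetical simulation, not the benchmark's actual protocol: the function `corrupt_sequence`, its parameters `drop_prob` and `label_noise`, and the use of `None` to mark a dropped frame are all assumptions.

```python
import numpy as np

def corrupt_sequence(frames, num_classes, drop_prob=0.2, label_noise=0.1, seed=0):
    """Return a copy of a voxel-label sequence with random frame drops and label flips.

    frames: list of integer arrays holding per-voxel class labels.
    A dropped frame is represented by None (an illustrative convention).
    """
    rng = np.random.default_rng(seed)
    corrupted = []
    for frame in frames:
        if rng.random() < drop_prob:
            corrupted.append(None)  # simulate a missing observation
            continue
        noisy = frame.copy()
        flip = rng.random(frame.shape) < label_noise
        # Replace the selected voxels with uniformly sampled class labels.
        noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
        corrupted.append(noisy)
    return corrupted

# Five empty 2x2x2 frames; extreme settings make the behaviour easy to check.
frames = [np.zeros((2, 2, 2), dtype=np.int64) for _ in range(5)]
clean = corrupt_sequence(frames, num_classes=4, drop_prob=0.0, label_noise=0.0)
all_dropped = corrupt_sequence(frames, num_classes=4, drop_prob=1.0)
```

A model evaluated on such a sequence has to carry its scene state across the `None` entries, which is the kind of persistence the benchmark is described as testing.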
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Generative Adversarial Networks and Image Synthesis · Robotics and Sensor-Based Localization
