Occupancy World Model for Robots

Zhang Zhang; Qiang Zhang; Wei Cui; Shuai Shi; Yijie Guo; Gang Han; Wen Zhao; Jingkai Sun; Jiahang Cao; Jiaxu Wang; Hao Cheng; Xiaozhu Ju; Zhengping Che; Renjing Xu; Jian Tang

arXiv:2505.05512 · cs.CV · May 12, 2025

TL;DR

This paper introduces RoboOccWorld, a novel occupancy world model for indoor robots that predicts scene evolution using a guided autoregressive transformer and spatio-temporal aggregation, outperforming existing methods.

Contribution

The work presents a new framework for indoor 3D occupancy prediction built on Conditional Causal State Attention (CCSA) and HSTA, a spatio-temporal aggregation module, and restructures the OccWorld-ScanNet benchmark for better evaluation.

Findings

01. RoboOccWorld outperforms state-of-the-art methods in indoor 3D occupancy prediction.

02. The proposed CCSA effectively guides the transformer with camera pose conditions.

03. HSTA improves the exploitation of spatio-temporal cues from observations.

Abstract

Understanding and forecasting scene evolution deeply affects the exploration and decision-making of embodied agents. While traditional methods simulate scene evolution through trajectory prediction of potential instances, current works use the occupancy world model as a generative framework for describing fine-grained overall scene dynamics. However, existing methods concentrate on structured outdoor road scenes and neglect forecasting 3D occupancy scene evolution for robots in indoor scenes. In this work, we explore a new framework for learning the scene evolution of observed fine-grained occupancy and propose an occupancy world model, called RoboOccWorld, that forecasts scene evolution using a combined spatio-temporal receptive field and a guided autoregressive transformer. We propose the Conditional Causal State Attention (CCSA), which utilizes camera poses…
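
The idea of guiding an autoregressive transformer with camera poses can be illustrated with a small sketch. This is not the paper's CCSA, whose details are not given on this page; it is a generic causal self-attention layer in which a projected camera pose is added to each query, under stated assumptions. The function name `pose_conditioned_causal_attention`, the 7-dimensional pose (translation plus quaternion), the projection matrix `w_pose`, and the choice to condition only the queries are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pose_conditioned_causal_attention(states, poses, w_pose):
    """Hypothetical sketch, not the paper's CCSA.

    states: (T, D) one scene token per time step
    poses:  (T, P) one camera pose per time step (assumed layout)
    w_pose: (P, D) assumed linear projection of the pose into token space
    """
    T, D = states.shape
    cond = poses @ w_pose                 # (T, D) pose condition per step
    q = states + cond                     # queries carry the pose condition
    k, v = states, states                 # keys and values are the raw states
    scores = q @ k.T / np.sqrt(D)         # (T, T) scaled dot-product scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # step t cannot see steps > t
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(4, 8))
poses = rng.normal(size=(4, 7))
w_pose = rng.normal(size=(7, 8)) * 0.1
out, weights = pose_conditioned_causal_attention(states, poses, w_pose)
```

Because of the causal mask, changing a later state leaves the outputs for earlier steps unchanged, which is the property an autoregressive forecaster relies on.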

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Robotics and Sensor-Based Localization · Advanced Vision and Imaging

Methods: Softmax · Attention Is All You Need