SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Jiayuan Du; Yiming Zhao; Zhenglong Guo; Yong Pan; Wenbo Hou; Zhihui Hao; Kun Zhan; Qijun Chen

arXiv:2511.22039 · cs.CV · April 15, 2026

TL;DR

This paper presents a transformer-based approach for trajectory-conditioned 3D scene occupancy forecasting that outperforms existing methods by directly predicting multi-frame occupancy from raw image features without relying on BEV projections or VAEs.

Contribution

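The architecture predicts future occupancy directly from image features, avoiding both the intermediate BEV projection and VAE-based discrete tokenization, which the authors argue lets the transformer model spatiotemporal dependencies more effectively.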

Findings

1. Achieves state-of-the-art results on the nuScenes benchmark.

2. Outperforms existing approaches by a significant margin.

3. Demonstrates robust understanding of scene dynamics under arbitrary trajectories.

Abstract

This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves…
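To make the term "sparse occupancy representation" concrete, here is a minimal sketch of the general idea: store only the occupied voxels (coordinates plus class labels) instead of a full dense grid. This is an illustration under assumptions and not the authors' implementation. The grid shape, the label convention (0 for free space), and the helper names `dense_to_sparse` and `sparse_to_dense` are all hypothetical; the summary above does not describe how the paper encodes its sparse representation.

```python
import numpy as np

def dense_to_sparse(grid):
    """Return (N, 3) voxel coordinates and (N,) labels of the occupied cells.

    Assumption for this sketch: label 0 means free space, any positive
    integer is a semantic class.
    """
    coords = np.argwhere(grid > 0)      # row-major order over (x, y, z)
    labels = grid[tuple(coords.T)]      # gather the label at each coordinate
    return coords, labels

def sparse_to_dense(coords, labels, shape):
    """Scatter sparse voxels back into a dense grid filled with 0 (free)."""
    grid = np.zeros(shape, dtype=labels.dtype)
    grid[tuple(coords.T)] = labels
    return grid

# Hypothetical toy scene: an 8 x 8 x 4 grid with three occupied voxels.
shape = (8, 8, 4)
grid = np.zeros(shape, dtype=np.int64)
grid[2, 3, 0] = 1
grid[2, 4, 0] = 1
grid[5, 5, 1] = 2

coords, labels = dense_to_sparse(grid)
restored = sparse_to_dense(coords, labels, shape)

print(coords.shape[0], "occupied voxels out of", grid.size)
```

In this toy scene only 3 of the 256 cells are occupied, so the sparse form carries 3 coordinate rows rather than a full grid, and the round trip through `sparse_to_dense` recovers the original grid exactly.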
