SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction
Suzeyu Chen, Leheng Li, Ying-Cong Chen

TL;DR
SPOT-Occ introduces a prototype-guided sparse transformer decoder for efficient, accurate camera-based 3D occupancy prediction, targeting the real-time performance that safe autonomous driving requires.
Contribution
The paper proposes a novel prototype-based sparse transformer decoder that combines a guided feature selection mechanism with a denoising paradigm, improving both efficiency and accuracy in 3D occupancy prediction.
Findings
Outperforms previous methods in speed and accuracy
Uses a two-stage prototype-guided feature aggregation
Leverages ground-truth masks for stable query-prototype association
Abstract
Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed…
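The two-stage idea in the abstract (guided feature selection, then focused aggregation) can be illustrated with a minimal sketch. Since the abstract is truncated, everything specific here is an assumption rather than the authors' implementation: the dot-product salience score, the top-k selection rule, the softmax-weighted aggregation, and all function names (`select_prototypes`, `aggregate`, `prototype_guided_attention`) are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_prototypes(query, voxel_feats, k):
    """Stage 1 (assumed): indices of the k voxel features with the
    highest dot-product similarity to the query."""
    scores = [dot(query, v) for v in voxel_feats]
    order = sorted(range(len(voxel_feats)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def aggregate(query, voxel_feats, idx):
    """Stage 2 (assumed): scaled dot-product attention restricted to the
    selected features, so the cost depends on k, not on the voxel count."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, voxel_feats[i]) * scale for i in idx])
    dim = len(query)
    return [sum(w * voxel_feats[i][d] for w, i in zip(weights, idx)) for d in range(dim)]


def prototype_guided_attention(queries, voxel_feats, k):
    """Run selection followed by aggregation independently for each query."""
    return [aggregate(q, voxel_feats, select_prototypes(q, voxel_feats, k)) for q in queries]


# Toy example: one 2-D query and four sparse voxel features.
queries = [[1.0, 0.0]]
voxel_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
updated = prototype_guided_attention(queries, voxel_feats, k=2)
```

In this toy setup the query attends only to the two voxel features most aligned with it and ignores the rest, which is the sense in which attention over a compact selected set avoids dense attention over all voxels. A learned selection module would replace the plain dot-product score in practice.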
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · 3D Shape Modeling and Analysis
