A Spatiotemporal Approach to Tri-Perspective Representation for 3D Semantic Occupancy Prediction
Sathira Silva, Savindu Bhashitha Wannigama, Gihan Jayatilaka, Muhammad Haris Khan, Roshan Ragel

TL;DR
This paper introduces S2TPVFormer, a spatiotemporal transformer that uses temporal cues, fused through a novel attention mechanism, to improve vision-based 3D semantic occupancy prediction, reporting a +4.1% mIoU gain on nuScenes.
Contribution
It proposes a new spatiotemporal transformer architecture with a Temporal Cross-View Hybrid Attention mechanism for enhanced 3D scene understanding.
Findings
Achieved a +4.1% mIoU improvement on the nuScenes dataset.
Demonstrated the effectiveness of incorporating temporal cues in 3D occupancy prediction.
Validated the approach's superiority over the TPVFormer baseline.
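For readers unfamiliar with the metric behind the first finding, mean Intersection over Union (mIoU) averages the per-class IoU between predicted and ground-truth voxel labels. The following is a minimal sketch of that definition, not the paper's evaluation code: the function name `mean_iou`, the flattened label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and the benchmark's actual protocol (ignored classes, accumulation over the whole dataset) may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over flattened voxel labels (hypothetical helper, illustration only)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # voxels where prediction and label agree on class c
    pred_count = [0] * num_classes    # voxels predicted as class c
    target_count = [0] * num_classes  # voxels labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with 6 voxels and 3 classes.
score = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Because the metric averages over classes rather than voxels, rare classes weigh as much as frequent ones.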
Abstract
Holistic understanding and reasoning in 3D scenes are crucial for the success of autonomous driving systems. The evolution of 3D semantic occupancy prediction as a pretraining task for autonomous driving and robotic applications captures finer 3D details compared to traditional 3D detection methods. Vision-based 3D semantic occupancy prediction is increasingly overlooked in favor of LiDAR-based approaches, which have shown superior performance in recent years. However, we present compelling evidence that there is still potential for enhancing vision-based methods. Existing approaches predominantly focus on spatial cues such as tri-perspective view (TPV) embeddings, often overlooking temporal cues. This study introduces S2TPVFormer, a spatiotemporal transformer architecture designed to predict temporally coherent 3D semantic occupancy. By introducing temporal cues through a novel…
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Advanced Neural Network Applications
