A Spatiotemporal Approach to Tri-Perspective Representation for 3D Semantic Occupancy Prediction

Sathira Silva; Savindu Bhashitha Wannigama; Gihan Jayatilaka; Muhammad Haris Khan; Roshan Ragel

arXiv:2401.13785 · cs.CV · February 18, 2025 · 2 cites

TL;DR

This paper introduces S2TPVFormer, a spatiotemporal transformer that leverages temporal cues via a novel attention mechanism to improve vision-based 3D semantic occupancy prediction, showing significant performance gains on nuScenes.

Contribution

It proposes a new spatiotemporal transformer architecture with a Temporal Cross-View Hybrid Attention mechanism for enhanced 3D scene understanding.

Findings

01 Achieved a 4.1% mIoU improvement on the nuScenes dataset.

02 Demonstrated the effectiveness of incorporating temporal cues in 3D occupancy prediction.

03 Validated the approach's superiority over the TPVFormer baseline.
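
The findings above are stated in terms of mIoU (mean intersection over union). For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is commonly computed over per-voxel class labels. It is not the authors' evaluation code and does not reproduce the nuScenes evaluation protocol; the function name `mean_iou`, its parameters, and the `ignore_index` convention are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes for two flat sequences of integer labels.

    Hypothetical helper for illustration only. Classes that appear in
    neither the prediction nor the ground truth are skipped.
    """
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled voxels (assumed convention)
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch adds to the union of both classes involved.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1

    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with two classes and four voxels.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12. Averaging per class rather than per voxel keeps rare classes from being swamped by frequent ones.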

Abstract

Holistic understanding and reasoning in 3D scenes are crucial for the success of autonomous driving systems. The evolution of 3D semantic occupancy prediction as a pretraining task for autonomous driving and robotic applications captures finer 3D details compared to traditional 3D detection methods. Vision-based 3D semantic occupancy prediction is increasingly overlooked in favor of LiDAR-based approaches, which have shown superior performance in recent years. However, we present compelling evidence that there is still potential for enhancing vision-based methods. Existing approaches predominantly focus on spatial cues such as tri-perspective view (TPV) embeddings, often overlooking temporal cues. This study introduces S2TPVFormer, a spatiotemporal transformer architecture designed to predict temporally coherent 3D semantic occupancy. By introducing temporal cues through a novel…

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Advanced Neural Network Applications