ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, Dinesh Manocha

arXiv:2410.11019 · cs.CV · March 4, 2025

TL;DR

ET-Former is an end-to-end method that combines a novel triplane deformable attention mechanism with a Conditional Variational AutoEncoder (CVAE) to improve 3D semantic scene completion from monocular images, achieving state-of-the-art accuracy with low memory use.

Contribution

The paper introduces a triplane deformable attention mechanism and CVAE-based uncertainty estimation for monocular 3D scene completion, improving both geometric understanding and efficiency.

Findings

01 Achieves the highest IoU and mIoU scores on the Semantic-KITTI dataset.

02 Reduces GPU memory usage compared to previous methods.

03 Improves the SOTA IoU from 44.71 to 51.49, a gain of 6.78 points.

Abstract

We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for the semantic predictions. By designing a triplane-based deformable attention mechanism, our approach captures scene geometry better than other SOTA approaches and reduces noise in the semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the Semantic-KITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the…
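The abstract reports results as IoU and mIoU. As a rough illustration of how these two metrics are commonly computed on voxel label grids, here is a minimal NumPy sketch. It is not the paper's or the benchmark's evaluation code: the function names, the use of label 0 for empty space, and the use of label 255 for voxels excluded from scoring are all assumptions made for this example.

```python
import numpy as np


def scene_completion_iou(pred, target, empty_class=0, ignore_label=255):
    """Geometric IoU: occupied vs. empty voxels, semantic class disregarded.

    empty_class and ignore_label are assumed conventions, not taken from the paper.
    """
    valid = target != ignore_label
    pred_occ = (pred != empty_class) & valid
    target_occ = (target != empty_class) & valid
    intersection = np.logical_and(pred_occ, target_occ).sum()
    union = np.logical_or(pred_occ, target_occ).sum()
    return intersection / union if union > 0 else float("nan")


def semantic_miou(pred, target, num_classes, empty_class=0, ignore_label=255):
    """Mean of per-class IoU over the non-empty semantic classes."""
    valid = target != ignore_label
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        p = (pred == c) & valid
        t = (target == c) & valid
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both grids; skipped in this sketch
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy 1 x 2 x 3 voxel grid (hypothetical labels, for illustration only).
pred = np.array([0, 1, 1, 2, 2, 0]).reshape(1, 2, 3)
target = np.array([0, 1, 2, 2, 255, 1]).reshape(1, 2, 3)

iou = scene_completion_iou(pred, target)
miou = semantic_miou(pred, target, num_classes=3)
print(f"IoU = {iou:.3f}, mIoU = {miou:.3f}")
```

In this sketch IoU measures only how well occupancy is recovered, while mIoU additionally requires the right semantic class per voxel, which is why the two numbers can differ substantially for the same prediction.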

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need