2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision
Cheng-Kun Yang, Min-Hung Chen, Yung-Yu Chuang, Yen-Yu Lin

TL;DR
This paper introduces a Multimodal Interlaced Transformer (MIT) that fuses 2D and 3D features for weakly supervised point cloud segmentation using only scene-level class tags, and reports better results than existing methods on the S3DIS and ScanNet benchmarks.
Contribution
Proposes a transformer with two encoders and one decoder, where the decoder uses interlaced 2D-3D cross-attention to fuse 2D and 3D features for weakly supervised point cloud segmentation without extra 2D annotations.
Findings
Outperforms existing methods on the S3DIS and ScanNet benchmarks.
Effectively fuses 2D and 3D features through iterative interlaced attention.
Avoids the extra 2D annotations that earlier 2D-3D fusion methods require by training with scene-level class tags only.
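To make "scene-level supervision" concrete, here is a minimal sketch of how per-point predictions can be trained against scene-level class tags alone. It is a generic multiple-instance-style formulation, not necessarily the loss used in the paper: the average pooling, the function name `scene_level_loss`, and the toy shapes are all assumptions made for illustration.

```python
import numpy as np

def scene_level_loss(point_logits, scene_tags):
    """Multi-label BCE between pooled point logits and scene-level tags.

    point_logits: (n_points, n_classes) per-point class scores.
    scene_tags:   (n_classes,) with 1 if the class occurs in the scene, else 0.
    """
    # Assumption: global average pooling turns point scores into scene scores.
    z = point_logits.mean(axis=0)
    y = scene_tags
    # Numerically stable binary cross-entropy with logits.
    per_class = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_class.mean()

# Hypothetical toy scene: 10 points, 3 classes, classes 0 and 2 present.
tags = np.array([1.0, 0.0, 1.0])
uninformed = np.zeros((10, 3))
loss_uninformed = scene_level_loss(uninformed, tags)
```

No point in the scene carries its own label here; the only training signal is which classes appear somewhere in the scene.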
Abstract
We present a Multimodal Interlaced Transformer (MIT) that jointly considers 2D and 3D data for weakly supervised point cloud segmentation. Research studies have shown that 2D and 3D features are complementary for point cloud segmentation. However, existing methods require extra 2D annotations to achieve 2D-3D information fusion. Considering the high annotation cost of point clouds, effective 2D and 3D feature fusion based on weakly supervised learning is in great demand. To this end, we propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation using only scene-level class tags. Specifically, the two encoders compute the self-attended features for 3D point clouds and 2D multi-view images, respectively. The decoder implements interlaced 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion. We alternately switch the…
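The decoder described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it reads "interlaced" as alternating which modality supplies the queries from one decoder layer to the next, and it omits learned projections, multiple heads, and layer normalization. The names `cross_attention` and `interlaced_decoder`, the token counts, and the feature width are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head cross-attention with a residual connection.

    queries: (n_q, d) tokens to be updated.
    context: (n_c, d) tokens used as both keys and values.
    """
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))  # (n_q, n_c)
    return queries + weights @ context

def interlaced_decoder(feats_3d, feats_2d, num_layers=4):
    """Alternate the query modality across layers (assumed reading)."""
    for layer in range(num_layers):
        if layer % 2 == 0:
            # 3D tokens query the 2D multi-view features.
            feats_3d = cross_attention(feats_3d, feats_2d)
        else:
            # 2D tokens query the updated 3D features.
            feats_2d = cross_attention(feats_2d, feats_3d)
    return feats_3d, feats_2d

# Stand-ins for the outputs of the two encoders (random, for shape only).
rng = np.random.default_rng(0)
feats_3d = rng.normal(size=(128, 32))  # hypothetical: 128 point tokens
feats_2d = rng.normal(size=(64, 32))   # hypothetical: 64 image tokens
out_3d, out_2d = interlaced_decoder(feats_3d, feats_2d)
```

Because each layer updates only the querying modality, the two feature sets keep their own token counts while information passes back and forth between them.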
Taxonomy
Topics: 3D Surveying and Cultural Heritage · 3D Shape Modeling and Analysis · Industrial Vision Systems and Defect Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Dense Connections · Residual Connection · Absolute Position Encodings · Adam · Byte Pair Encoding
