Learning 3D Semantic Segmentation with only 2D Image Supervision
Kyle Genova, Xiaoqi Yin, Abhijit Kundu, Caroline Pantofaru, Forrester Cole, Avneesh Sud, Brian Brewington, Brian Shucker, Thomas Funkhouser

TL;DR
This paper introduces a method to train 3D semantic segmentation models using only 2D image annotations. 2D semantic segmentations are fused across multiple views into 3D pseudo-labels, which sidesteps the scarcity of ground-truth 3D annotations and their poor transfer across sensors.
Contribution
The paper proposes 2D3DNet, a 3D semantic segmentation model trained from 2D image labels, along with strategies for selecting trusted pseudo-labels and for sampling 3D scenes that contain rare objects.
Findings
Improves mIoU by 6.2 to 11.4 points over baselines
Shows that pseudo-label selection and scene sampling are effective
Demonstrates strong generalization across diverse urban datasets
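For readers unfamiliar with the metric behind the headline number: mIoU (mean intersection over union) averages the per-class score IoU_c = TP_c / (TP_c + FP_c + FN_c) over all classes that occur. The snippet below is a minimal sketch of that standard definition, not the paper's evaluation code; the function name `mean_iou` and the `ignore` label value are assumptions made for illustration.

```python
def mean_iou(pred, truth, num_classes, ignore=-1):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, truth: equal-length sequences of integer class ids, one per point.
    Points whose ground-truth label equals `ignore` are skipped (assumed convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if t == ignore:
            continue
        if p == t:
            # True positive: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # False negative for class t, false positive for class p.
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

With this definition, an improvement of 6.2 mIoU corresponds to the class-averaged IoU rising by 0.062 on a 0 to 1 scale.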
Abstract
With the recent growth of urban mapping and autonomous driving efforts, there has been an explosion of raw 3D data collected from terrestrial platforms with lidar scanners and color cameras. However, due to high labeling costs, ground-truth 3D semantic segmentation annotations are limited in both quantity and geographic diversity, while also being difficult to transfer across sensors. In contrast, large image collections with ground-truth semantic segmentations are readily available for diverse sets of scenes. In this paper, we investigate how to use only those labeled 2D image collections to supervise training 3D semantic segmentation models. Our approach is to train a 3D model from pseudo-labels derived from 2D semantic image segmentations using multiview fusion. We address several novel issues with this approach, including how to select trusted pseudo-labels, how to sample 3D scenes…
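The multiview fusion idea in the abstract can be sketched as follows: project each 3D point into every labeled image, collect the 2D class labels it lands on, and keep the majority label only when the views agree strongly enough. This is an illustrative sketch under simplifying assumptions, not the authors' procedure. It assumes a pinhole camera, ignores occlusion, and uses a hard majority vote; the names `project`, `fuse_pseudo_labels`, `min_agreement`, and the view dictionary layout are all hypothetical.

```python
import math
from collections import defaultdict


def project(point, view):
    """Project a world-space 3D point into a view; return (row, col) or None.

    `view` is a dict with a 3x3 rotation "R", a translation "t", pinhole
    intrinsics "fx", "fy", "cx", "cy", and "labels", a 2D list of class ids.
    """
    R, t = view["R"], view["t"]
    # World frame -> camera frame: p_cam = R @ p + t
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 0:  # point is behind the camera
        return None
    col = math.floor(view["fx"] * x / z + view["cx"])
    row = math.floor(view["fy"] * y / z + view["cy"])
    labels = view["labels"]
    if 0 <= row < len(labels) and 0 <= col < len(labels[0]):
        return row, col
    return None  # projects outside the image


def fuse_pseudo_labels(points, views, min_agreement=0.6, ignore=-1):
    """Majority-vote a pseudo-label per point; untrusted points get `ignore`."""
    fused = []
    for p in points:
        votes = defaultdict(int)
        for view in views:
            pixel = project(p, view)
            if pixel is not None:
                votes[view["labels"][pixel[0]][pixel[1]]] += 1
        total = sum(votes.values())
        if total == 0:  # point is not visible in any view
            fused.append(ignore)
            continue
        label, count = max(votes.items(), key=lambda kv: kv[1])
        # Keep the label only if enough of the observing views agree on it.
        fused.append(label if count / total >= min_agreement else ignore)
    return fused
```

Points marked `ignore` would simply be left out of the 3D training loss, which is one simple way to act on the notion of trusted pseudo-labels. A real pipeline would also need a visibility test so that points hidden behind other surfaces do not inherit labels from the occluding object.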
