TL;DR
UniD-Shift introduces a unified framework for 2D-3D semantic segmentation that decomposes features into shared and private components, enhancing accuracy and robustness across benchmarks.
Contribution
It proposes a shared-private multimodal decomposition approach with explicit feature separation and a fusion module, improving cross-modal segmentation performance.
Findings
Achieves consistent segmentation accuracy improvements on SemanticKITTI and nuScenes.
Demonstrates stable generalization under distribution shifts in nuScenes USA-Singapore.
Offers competitive computational efficiency compared to baseline methods.
Abstract
Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with an SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared…
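The shared/private split described in the abstract can be illustrated with a small NumPy sketch. This is a generic shared-private pattern and not the paper's implementation: the random projections stand in for learned layers, the alignment and orthogonality losses are common choices for this kind of decomposition rather than losses confirmed by the abstract, and all names and dimensions (`decompose`, `fuse`, `d_sub`, and so on) are hypothetical. It also assumes image features have already been sampled at the pixel locations the LiDAR points project to, so both modalities have one feature row per point.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random projection standing in for a learned linear layer (hypothetical)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

def decompose(feats, w_shared, w_private):
    """Project per-point features into a shared and a private subspace."""
    return feats @ w_shared, feats @ w_private

def alignment_loss(shared_a, shared_b):
    """Mean squared distance between the two modalities' shared features."""
    return float(np.mean((shared_a - shared_b) ** 2))

def orthogonality_loss(shared, private):
    """Squared Frobenius norm of shared^T private, scaled by the point count."""
    n = shared.shape[0]
    return float(np.sum((shared.T @ private / n) ** 2))

def fuse(shared_2d, shared_3d, private_2d, private_3d):
    """Concatenate the averaged shared part with both private parts."""
    shared = 0.5 * (shared_2d + shared_3d)
    return np.concatenate([shared, private_2d, private_3d], axis=1)

# Assumed sizes: 128 points, 64-d image features, 32-d point features.
n_points, d_img, d_pts, d_sub = 128, 64, 32, 16
feats_2d = rng.standard_normal((n_points, d_img))
feats_3d = rng.standard_normal((n_points, d_pts))

s2d, p2d = decompose(feats_2d, linear(d_img, d_sub), linear(d_img, d_sub))
s3d, p3d = decompose(feats_3d, linear(d_pts, d_sub), linear(d_pts, d_sub))

fused = fuse(s2d, s3d, p2d, p3d)          # one fused feature row per point
loss_align = alignment_loss(s2d, s3d)     # pulls shared features together
loss_orth = orthogonality_loss(s2d, p2d) + orthogonality_loss(s3d, p3d)
```

In a trained model, the two loss terms would be added to the segmentation loss so that the shared branches converge on modality-agnostic semantics while the private branches keep what is specific to images or point clouds. A segmentation head would then consume `fused`.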
