TL;DR
Concerto is a self-supervised learning framework that pairs 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding to learn coherent spatial representations. It outperforms standalone SOTA 2D and 3D self-supervised models in 3D scene perception and supports open-world perception.
Contribution
Concerto introduces a joint 2D-3D self-supervised learning method that yields more coherent spatial features and stronger results across multiple scene understanding benchmarks.
Findings
Outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, in linear probing for 3D scene perception
Achieves 80.7% mIoU on ScanNet with fine-tuning
Enables open-world perception via CLIP space projection
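For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed for semantic segmentation. It is not the paper's evaluation code: the function name `mean_iou`, the `ignore_index` convention, and the toy labels are assumptions made for illustration only.

```python
def mean_iou(pred, target, num_classes, ignore_index=-1):
    """Average per-class IoU over classes that appear in pred or target.

    pred and target are equal-length sequences of integer class labels.
    Points whose target equals ignore_index (an assumed convention for
    unlabeled points) are skipped.
    """
    inter = [0] * num_classes  # points where prediction and label agree on class c
    union = [0] * num_classes  # points predicted as c or labeled as c
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Classes absent from both prediction and label do not enter the mean.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six points (hypothetical data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is 0.5. A reported 80.7% mIoU corresponds to the same average taken over the benchmark's classes.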
Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial…