Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Yujia Zhang; Xiaoyang Wu; Yixing Lao; Chengyao Wang; Zhuotao Tian; Naiyan Wang; Hengshuang Zhao

arXiv:2510.23607 · cs.CV · March 3, 2026

TL;DR

Concerto is a self-supervised learning framework that combines 2D and 3D modalities to learn coherent spatial representations, outperforming existing models in scene perception and enabling open-world understanding.

Contribution

Concerto introduces a joint 2D-3D self-supervised learning method that yields more coherent spatial features and improves performance across multiple scene understanding benchmarks.

Findings

01

Outperforms standalone SOTA 2D and 3D models in scene perception

02

Achieves 80.7% mIoU on ScanNet with fine-tuning

03

Enables open-world perception via CLIP space projection
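For readers unfamiliar with the metric in finding 02, the following is a minimal sketch of how mean intersection-over-union (mIoU) is commonly computed for per-point semantic labels. It is not the paper's evaluation code: the function name `mean_iou`, the `ignore_index` convention, and the toy labels are assumptions made for illustration, and benchmark protocols typically accumulate counts over a whole dataset rather than a single sample.

```python
from collections import Counter


def mean_iou(pred, target, num_classes, ignore_index=-1):
    """Mean IoU over classes that appear in the predictions or the labels.

    pred, target: equal-length sequences of integer class ids.
    Points whose label equals ignore_index are skipped (assumed convention).
    """
    inter = Counter()         # per-class count of correctly labelled points
    pred_count = Counter()    # per-class count of predicted points
    target_count = Counter()  # per-class count of ground-truth points

    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union == 0:
            continue  # class absent from both prediction and label
        ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six points, three classes, one unlabelled point.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, -1]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

Averaging per-class IoU rather than per-point accuracy keeps rare classes from being swamped by frequent ones such as floors and walls.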

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial…
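The abstract names two ingredients, 3D intra-modal self-distillation and 2D-3D cross-modal joint embedding, without giving the loss functions. The sketch below shows one plausible way such terms could be combined; it is an illustration under stated assumptions, not the authors' objective. Assumed here: a teacher-student cross-entropy for the self-distillation term, a cosine-similarity term between each point feature and the image feature at the pixel it projects to, and a hypothetical weight `weight_2d`. All function names and temperatures are invented for the example.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.05):
    """Assumed 3D term: cross-entropy from teacher to student, averaged over points."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        teacher_probs = [math.exp(v) for v in log_softmax(t, t_teacher)]
        student_logp = log_softmax(s, t_student)
        total -= sum(p * lp for p, lp in zip(teacher_probs, student_logp))
    return total / len(student_logits)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def cross_modal_loss(point_feats, pixel_feats):
    """Assumed 2D-3D term: 1 - mean cosine similarity over matched point/pixel pairs."""
    sims = [cosine_similarity(p, q) for p, q in zip(point_feats, pixel_feats)]
    return 1.0 - sum(sims) / len(sims)


def joint_loss(student_logits, teacher_logits, point_feats, pixel_feats,
               weight_2d=1.0):
    """Weighted sum of the two assumed terms (weight_2d is hypothetical)."""
    return (self_distillation_loss(student_logits, teacher_logits)
            + weight_2d * cross_modal_loss(point_feats, pixel_feats))


# Toy inputs: two points, each with prototype logits and a matched pixel feature.
student = [[2.0, 0.0], [0.0, 1.0]]
teacher = [[2.0, 0.0], [0.0, 1.0]]
points = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [0.0, 3.0]]
loss = joint_loss(student, teacher, points, pixels)
print(f"joint loss = {loss:.6f}")
```

In this toy case the student agrees with the teacher and the point features are parallel to their matched pixel features, so both terms are close to zero; disagreement in either modality raises the loss.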

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Models

🤗 Pointcept/Concerto · model · ♡ 19

Videos

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations · SlidesLive