Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence
Zebin He, Mingxin Yang, Shuhui Yang, Hanxiao Sun, Xintong Han, Chunchao Guo, Wenhan Luo

TL;DR
Tango3D is a foundation model that unifies dense pixel-to-point correspondence with global semantic retrieval for 3D understanding, enabling fine-grained alignment and rich semantic injection.
Contribution
It combines a geometry-aware 2D visual backbone and a pretrained 3D VAE with a three-stage progressive training strategy, mapping 2D patches and 3D tokens into a single shared space for joint dense and global alignment.
Findings
Successfully achieves object-level pixel-to-point alignment.
Maintains competitive global retrieval performance.
Establishes a fine-grained alignment feature space for 3D tasks.
Abstract
Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, achieving strong cross-modal retrieval by compressing each 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To address this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global…
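The abstract does not state the loss functions, so the following is only a minimal sketch of how a dense and a global objective could be combined in one shared space. It assumes a symmetric InfoNCE loss for both terms, mean pooling to obtain the global vector, and a known one-to-one pairing between patch n and token n. The function names, the temperature, and the weighting are all hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` is the positive match for row i of `b`.

    a, b: arrays of shape (N, D). The temperature is an assumed value.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def diag_cross_entropy(l):
        # Numerically stable log-softmax over each row, then pick the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def joint_alignment_loss(patch_feats, token_feats, dense_weight=1.0):
    """Combine a global and a dense alignment term (illustrative only).

    patch_feats: (B, N, D) 2D patch embeddings in the shared space.
    token_feats: (B, N, D) 3D token embeddings in the shared space,
                 with patch n assumed to correspond to token n.
    Returns (total, global_loss, dense_loss).
    """
    # Global term: pool each object to one vector, contrast across the batch.
    global_loss = info_nce(patch_feats.mean(axis=1), token_feats.mean(axis=1))
    # Dense term: within each object, contrast every patch against every token.
    dense_loss = np.mean([info_nce(p, t) for p, t in zip(patch_feats, token_feats)])
    return global_loss + dense_weight * dense_loss, global_loss, dense_loss
```

In this sketch the global term only sees pooled vectors, so it is insensitive to how patches and tokens are paired within an object; the dense term is what penalizes wrong pixel-to-point matches. That mirrors the abstract's point that global-only alignment cannot establish fine-grained correspondence.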
