Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Zebin He; Mingxin Yang; Shuhui Yang; Hanxiao Sun; Xintong Han; Chunchao Guo; Wenhan Luo

arXiv:2605.19727·cs.CV·May 20, 2026

TL;DR

Tango3D is a novel foundation model that unifies dense pixel-to-point correspondence with global semantic retrieval in 3D understanding, enabling fine-grained alignment and rich semantic injection.

Contribution

Tango3D combines a geometry-aware 2D visual backbone with a pretrained 3D VAE, and introduces a three-stage progressive training strategy to learn dense and global alignment jointly.

Findings

01 Achieves object-level pixel-to-point alignment.

02 Maintains competitive global retrieval performance.

03 Establishes a fine-grained alignment feature space for 3D tasks.

Abstract

Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global…
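
The abstract describes mapping 2D patches and 3D tokens into one shared space with both a local and a global objective, but it does not give the loss functions. The sketch below is one plausible reading, not the authors' implementation: it assumes a symmetric InfoNCE loss over mean-pooled embeddings for the global term and a per-patch InfoNCE loss over patch-to-token similarities for the dense term. The function names, the mean pooling, the temperature of 0.07, and the `match` array of ground-truth patch-to-token indices are all hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(sim, targets, temperature=0.07):
    """Cross-entropy over rows of `sim`; targets[i] is the positive column of row i."""
    logp = log_softmax(sim / temperature, axis=-1)
    return float(-logp[np.arange(sim.shape[0]), targets].mean())

def global_loss(patches, tokens, temperature=0.07):
    """Assumed global term: symmetric InfoNCE between pooled image and shape vectors.

    patches: (B, Np, D) image patch features; tokens: (B, Nt, D) 3D token features.
    """
    img = l2_normalize(patches.mean(axis=1))   # (B, D), mean pooling is an assumption
    shp = l2_normalize(tokens.mean(axis=1))    # (B, D)
    sim = img @ shp.T                          # (B, B); diagonal entries are the true pairs
    targets = np.arange(sim.shape[0])
    return 0.5 * (info_nce(sim, targets, temperature)
                  + info_nce(sim.T, targets, temperature))

def dense_loss(patches, tokens, match, temperature=0.07):
    """Assumed dense term: each patch should be most similar to its matched 3D token.

    match: (B, Np) integer array; match[b, i] is the token index paired with patch i.
    """
    losses = []
    for p, t, m in zip(patches, tokens, match):
        sim = l2_normalize(p) @ l2_normalize(t).T   # (Np, Nt)
        losses.append(info_nce(sim, m, temperature))
    return float(np.mean(losses))

# Toy check: identical features with the identity matching give a low loss,
# while a deliberately wrong matching gives a higher one.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 6, 16))
tokens = patches.copy()
match = np.tile(np.arange(6), (4, 1))
aligned_dense = dense_loss(patches, tokens, match)
shifted_dense = dense_loss(patches, tokens, np.roll(match, 1, axis=1))
aligned_global = global_loss(patches, tokens)
reversed_global = global_loss(patches, tokens[::-1])
```

In a staged schedule such as the three-stage strategy the abstract mentions, the two terms could be weighted differently per stage; the abstract does not say how, so no weighting is shown here.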
