TL;DR
POMA-3D is a self-supervised 3D representation model learned from point maps. It transfers priors from 2D foundation models to a range of 3D scene understanding tasks while using only geometric inputs.
Contribution
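The paper presents POMA-3D, a point map-based 3D representation model, together with a view-to-scene alignment strategy for transferring 2D priors, POMA-JEPA for enforcing multi-view consistency of point map features, and ScenePoint, a point map dataset for large-scale pretraining.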
Findings
POMA-3D outperforms existing methods on 3D understanding tasks.
The model benefits tasks like question answering, navigation, and scene retrieval.
It demonstrates strong generalization with only geometric inputs.
Abstract
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits…
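As an illustration of what a point map is, the sketch below back-projects a depth image into an H x W grid of 3D points using a standard pinhole camera model, then optionally maps those points into a shared world frame. This is not the paper's ScenePoint construction pipeline, which is not described here. The function names (`depth_to_point_map`, `apply_rigid_transform`), the intrinsics `fx, fy, cx, cy`, and the 4x4 `cam_to_world` pose are assumptions chosen for the example, and plain Python lists are used to keep it dependency-free.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def apply_rigid_transform(T: Sequence[Sequence[float]], p: Point) -> Point:
    """Apply a 4x4 homogeneous transform T to a 3D point p (hypothetical helper)."""
    x, y, z = p
    return tuple(T[i][0] * x + T[i][1] * y + T[i][2] * z + T[i][3] for i in range(3))


def depth_to_point_map(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Optional[Sequence[Sequence[float]]] = None,
) -> List[List[Point]]:
    """Back-project an H x W depth image into an H x W grid of 3D points.

    Each pixel (u, v) with depth z maps to camera-frame coordinates
    ((u - cx) * z / fx, (v - cy) * z / fy, z). If cam_to_world is given,
    the point is moved into that shared frame, so several views of one
    scene can be expressed in the same coordinates.
    """
    point_map: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, z in enumerate(row):
            p: Point = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            if cam_to_world is not None:
                p = apply_rigid_transform(cam_to_world, p)
            out_row.append(p)
        point_map.append(out_row)
    return point_map


# Toy 2x2 depth image with every pixel 2 units from the camera.
depth = [[2.0, 2.0], [2.0, 2.0]]
camera_points = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)

# Same view, with the camera shifted 10 units along the world x-axis.
pose = [[1, 0, 0, 10], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
world_points = depth_to_point_map(depth, 1.0, 1.0, 0.5, 0.5, cam_to_world=pose)
```

The result keeps the image's row and column layout but stores an (x, y, z) triple per pixel instead of a color, which is what lets a point map be fed to models that expect image-shaped input. Because the coordinates depend on the camera pose, two views of the same surface produce different grids, which is the view dependence the abstract refers to.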
