TL;DR
This paper introduces JM3D, a multi-view joint modality approach that enriches 3D representations by integrating multi-view images and hierarchical text, significantly improving zero-shot 3D classification performance.
Contribution
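The paper proposes two components: a Structured Multimodal Organizer (SMO), which addresses information degradation by supplying multi-view images and hierarchical text, and a Joint Multi-modal Alignment (JMA), which addresses insufficient synergy by aligning the 3D representation with a joint vision-language space.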
Findings
Achieves state-of-the-art zero-shot 3D classification accuracy on ModelNet40.
Outperforms ULIP by about 4.3% in zero-shot accuracy with a PointMLP backbone.
Improves zero-shot accuracy over ULIP by up to 6.5% with a PointNet++ backbone.
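For readers unfamiliar with the zero-shot setting behind these numbers, the following is a minimal sketch of how classification typically works once point clouds and category text share an embedding space: each shape is assigned the class whose text embedding is most similar. The function names and the toy 2-D embeddings are illustrative assumptions, not taken from the JM3D code, and real encoders would replace the hand-written arrays.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def zero_shot_classify(point_emb, text_emb):
    """Pick, for each shape, the class with the most similar text embedding.

    point_emb: (N, D) embeddings from a 3D encoder (hypothetical inputs here).
    text_emb:  (C, D) embeddings of the C class-name prompts.
    Returns an (N,) array of predicted class indices.
    """
    sims = l2_normalize(point_emb) @ l2_normalize(text_emb).T  # (N, C) cosine similarities
    return sims.argmax(axis=1)

# Toy data: three classes and three shapes in a 2-D embedding space.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
point_emb = np.array([[0.9, 0.1], [-2.0, 0.5], [0.2, 3.0]])
pred = zero_shot_classify(point_emb, text_emb)
print(pred.tolist())  # [0, 2, 1]
```

No labels from the target dataset are used at any point, which is why the accuracy of this procedure depends entirely on how well the 3D encoder was aligned with text during pre-training.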
Abstract
In recent years, 3D understanding has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text,…
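The difference between aligning with each modality independently and aligning with a joint vision-language target can be sketched as follows. This is only an illustration under stated assumptions: the fusion rule in `joint_target` (average the views, then add the text embedding) and the temperature value are hypothetical choices made for the example, not the paper's JMA formulation, and all embeddings are random placeholders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss; row i of `a` is the positive for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))

def joint_target(view_emb, text_emb):
    """Hypothetical fusion: pool multi-view image embeddings, then combine with text.

    view_emb: (B, V, D) embeddings of V rendered views per shape.
    text_emb: (B, D) embeddings of the (sub)category text per shape.
    """
    image_emb = l2_normalize(l2_normalize(view_emb).mean(axis=1))
    return l2_normalize(image_emb + l2_normalize(text_emb))

# Placeholder embeddings: B shapes, V views each, D dimensions.
rng = np.random.default_rng(0)
B, V, D = 4, 3, 8
view_emb = rng.normal(size=(B, V, D))
text_emb = rng.normal(size=(B, D))
point_emb = rng.normal(size=(B, D))

joint = joint_target(view_emb, text_emb)

# Independent alignment: one loss per modality, summed.
loss_independent = info_nce(point_emb, view_emb.mean(axis=1)) + info_nce(point_emb, text_emb)
# Joint alignment: a single loss against the fused vision-language target.
loss_joint = info_nce(point_emb, joint)
```

In the independent variant the point-cloud encoder is pulled toward two separate targets; in the joint variant it is pulled toward one target that already combines both modalities, which is the intuition the abstract describes.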
