Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

Haowei Wang; Jiji Tang; Jiayi Ji; Xiaoshuai Sun; Rongsheng Zhang; Yiwei Ma; Minda Zhao; Lincheng Li; Zeng Zhao; Tangjie Lv; Rongrong Ji

arXiv:2308.02982 · cs.CV · January 26, 2024

TL;DR

This paper introduces JM3D, a multi-view joint modality approach that enriches 3D representations by integrating multi-view images and hierarchical text, significantly improving zero-shot 3D classification performance.

Contribution

The paper proposes two components: a Structured Multimodal Organizer (SMO), which targets information degradation, and a Joint Multi-modal Alignment (JMA), which targets insufficient synergy in 3D understanding.

Findings

01. Achieves state-of-the-art zero-shot 3D classification accuracy on ModelNet40.

02. Outperforms previous methods such as ULIP by about 4.3% in zero-shot accuracy with a PointMLP backbone.

03. Improves zero-shot accuracy by up to 6.5% with a PointNet++ backbone.
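
For readers unfamiliar with the zero-shot setting these findings refer to, the following is a minimal sketch of how zero-shot 3D classification is commonly done with aligned embeddings: a point cloud embedding is compared against text embeddings of the category names, and the closest one wins. It is not taken from the JM3D code. The function names, the three-dimensional toy vectors, and the category labels are all assumptions made for illustration; real embeddings would come from a trained 3D encoder and a text encoder.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(point_embedding, text_embeddings):
    """Pick the label whose text embedding is most similar to the point cloud.

    text_embeddings maps a category label to its (hypothetical) text embedding.
    Returns the predicted label and the per-label similarity scores.
    """
    scores = {
        label: cosine(point_embedding, embedding)
        for label, embedding in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for encoder outputs (assumed values, not real features).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
point_embedding = [0.9, 0.3, 0.1]

predicted, scores = zero_shot_classify(point_embedding, text_embeddings)
print(predicted)
```

No category-specific training is involved at prediction time, which is why the quality of the alignment between the 3D and text embedding spaces directly determines the reported accuracy.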

Abstract

In recent years, 3D understanding has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text,…
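
The distinction the abstract draws between aligning with each modality independently and aligning with a joint vision-language space can be illustrated with a small sketch. The code below fuses several view embeddings and one text embedding into a single target, then scores point cloud embeddings against those targets with a standard InfoNCE-style contrastive loss. The fusion rule (normalized mean of views plus normalized text), the temperature value, the function names, and the toy vectors are all assumptions chosen for illustration; they are not the paper's formulation.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def joint_target(view_embeddings, text_embedding):
    """Fuse multi-view image embeddings and one text embedding (assumed rule)."""
    image_mean = normalize(mean_vectors(view_embeddings))
    text_unit = normalize(text_embedding)
    return normalize([i + t for i, t in zip(image_mean, text_unit)])


def info_nce(point_embeddings, targets, temperature=0.07):
    """Mean cross-entropy where the i-th point cloud should match the i-th target."""
    losses = []
    for index, point in enumerate(point_embeddings):
        point = normalize(point)
        logits = [dot(point, target) / temperature for target in targets]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum - logits[index])
    return sum(losses) / len(losses)


# Toy data: two objects, each with two rendered views and one caption embedding.
views_a, text_a = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]], [1.0, 0.0, 0.0]
views_b, text_b = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]], [0.0, 1.0, 0.0]
points = [[0.95, 0.05, 0.0], [0.05, 0.95, 0.0]]

targets = [joint_target(views_a, text_a), joint_target(views_b, text_b)]
aligned_loss = info_nce(points, targets)
swapped_loss = info_nce(points, targets[::-1])
print(aligned_loss < swapped_loss)
```

With independent alignment, the same loss would instead be computed twice, once against image embeddings and once against text embeddings, and the two terms summed. The joint variant gives the 3D encoder a single target that already combines both modalities.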

Code & Models

Repositories

mr-neko/jm3d (PyTorch, official)

Taxonomy

Methods: ALIGN