Multi-modal Relation Distillation for Unified 3D Representation Learning

Huiqun Wang; Yiping Bao; Panwang Pan; Zeming Li; Xiao Liu; Ruijie Yang; Di Huang

arXiv:2407.14007 · cs.CV · September 19, 2024

Open Access

TL;DR

This paper introduces Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework that enhances 3D shape representations by capturing intra- and cross-modal relations, leading to state-of-the-art results in zero-shot classification and retrieval.

Contribution

The paper proposes a novel tri-modal relation distillation method that effectively incorporates structural relations across modalities into 3D representations.

Findings

01. Significant improvements in zero-shot classification accuracy.

02. State-of-the-art performance in cross-modality retrieval.

03. Effective distillation of large Vision-Language Models into 3D backbones.

Abstract

Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework, which is designed to effectively distill reputable large Vision-Language Models (VLM) into 3D backbones. MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality…
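
The abstract does not spell out the loss, so the following is only a minimal sketch of one generic way to distill intra- and cross-modal relations, not the authors' formulation. It assumes frozen teacher features for images and texts (as a VLM would supply) and student features from a 3D backbone, builds softmax-normalized cosine-similarity matrices over a batch, and penalizes the KL divergence between teacher and student relations. All function names, the temperature value, and the equal weighting of the two terms are hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def relation_matrix(a, b, temperature=0.1):
    # Row-wise softmax over cosine similarities between rows of a and rows of b.
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    rows = []
    for u in a:
        logits = [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def kl_rows(p, q, eps=1e-12):
    # Mean over rows of KL(p_row || q_row); eps guards against log(0).
    total = 0.0
    for pr, qr in zip(p, q):
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(pr, qr))
    return total / len(p)

def relation_distillation_loss(points, images, texts, temperature=0.1):
    # Intra-modal term (assumed form): the student's point-point relations
    # should match the teacher's image-image relations.
    intra = kl_rows(relation_matrix(images, images, temperature),
                    relation_matrix(points, points, temperature))
    # Cross-modal term (assumed form): the student's point-text relations
    # should match the teacher's image-text relations.
    cross = kl_rows(relation_matrix(images, texts, temperature),
                    relation_matrix(points, texts, temperature))
    return intra + cross

# Toy batch of two samples with 2-D features (illustrative values only).
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
points = [[1.0, 1.0], [1.0, 0.9]]
loss = relation_distillation_loss(points, images, texts)
```

In this toy batch the student features are nearly parallel while the teacher features are orthogonal, so the loss is positive; it drops to zero when the student reproduces the teacher's relation structure exactly. In practice a term like this would typically be combined with a standard contrastive alignment loss and computed with a tensor library rather than plain lists.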

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition · 3D Surveying and Cultural Heritage