Joint Representation Learning for Text and 3D Point Cloud
Rui Huang, Xuran Pan, Henry Zheng, Haojun Jiang, Zhifeng Xie, Shiji Song, Gao Huang

TL;DR
This paper introduces Text4Point, a framework that leverages 2D images as a bridge to align 3D point cloud representations with language, improving performance on various 3D understanding tasks.
Contribution
The Text4Point framework aligns 3D point clouds with text by using 2D images as a bridge and applying contrastive learning, addressing both the scarcity of 3D-text data pairs and the irregular structure of 3D data.
Findings
Improved performance on point cloud segmentation and detection tasks.
Effective alignment of 3D features with language embeddings.
Versatile framework applicable to multiple 3D tasks.
Abstract
Recent advancements in vision-language pre-training (e.g., CLIP) have shown that vision models can benefit from language supervision. While many models using the language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point clouds with text remains under-explored due to the difficulty of acquiring 3D-text data pairs and the irregularity of the 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is to use 2D images as a bridge to connect the point cloud and language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud…
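To make the pre-training idea concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss between paired point cloud and image features. It is not the authors' implementation: the abstract above is truncated and does not specify the exact loss, so the function names, the temperature value of 0.07, and the random toy features are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def symmetric_info_nce(point_feats, image_feats, temperature=0.07):
    # Hypothetical helper: row i of point_feats is assumed to correspond
    # to row i of image_feats (e.g., matched through RGB-D data).
    p = l2_normalize(point_feats)
    q = l2_normalize(image_feats)
    logits = p @ q.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(p))                 # positives lie on the diagonal
    loss_p2i = -log_softmax(logits)[idx, idx].mean()
    loss_i2p = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_p2i + loss_i2p)

# Toy data (assumed): 8 pairs of 16-dimensional features.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(8, 16))
aligned = image_feats + 0.01 * rng.normal(size=(8, 16))  # nearly matching pairs
shuffled = aligned[::-1].copy()                          # deliberately mismatched pairs

loss_aligned = symmetric_info_nce(aligned, image_feats)
loss_shuffled = symmetric_info_nce(shuffled, image_feats)
print(loss_aligned < loss_shuffled)
```

In this sketch, the loss is low when each point cloud feature is most similar to its own paired image feature and high when the pairing is broken, which is the behavior a contrastive alignment objective is meant to reward.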
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Advanced Neural Network Applications
Methods: Contrastive Language-Image Pre-training · Contrastive Learning · ALIGN
