TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

Fan Li; Zanyi Wang; Zeyi Huang; Guang Dai; Jingdong Wang; Mengmeng Wang

arXiv:2507.14904 · cs.CV · September 5, 2025

TL;DR

TriCLIP-3D introduces a unified, parameter-efficient framework leveraging a 2D CLIP model with adapter fine-tuning to enhance tri-modal 3D visual grounding, reducing complexity and improving performance.

Contribution

The paper presents a novel unified 2D pre-trained multi-modal network for processing RGB, text, and point clouds, simplifying architecture and boosting efficiency in 3D visual grounding.

Findings

1. Reduces trainable parameters by approximately 58%
2. Achieves 6.52% improvement in 3D detection
3. Achieves 6.25% improvement in 3D visual grounding

Abstract

3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis