CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

Siyu Jiao; Haoye Dong; Yuyang Yin; Zequn Jie; Yinlong Qian; Yao Zhao; Humphrey Shi; Yunchao Wei

arXiv:2412.19142 · cs.CV · January 13, 2026

TL;DR

CLIP-GS introduces a unified 3D multimodal learning framework using 3D Gaussian Splatting and CLIP, enabling improved 3D understanding and retrieval by integrating texture and shape information.

Contribution

The paper proposes CLIP-GS, a novel framework that combines 3D Gaussian Splatting with CLIP for unified vision-language 3D representation learning, introducing the GS Tokenizer and new training strategies.

Findings

1. Outperforms point cloud models on 3D tasks
2. Enables zero-shot and few-shot 3D classification
3. Achieves versatile multimodal retrieval results

Abstract

Recent work in 3D multimodal learning has made remarkable progress. However, 3D multimodal models can typically handle only point clouds. Compared to the emerging 3D representation technique, 3D Gaussian Splatting (3DGS), the spatially sparse point cloud cannot depict the texture information of 3D objects, resulting in inferior reconstruction capabilities. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed through transformer layers pre-initialized with weights from point cloud models, resulting in the 3DGS embeddings. CLIP-GS leverages contrastive loss between 3DGS and the visual-text embeddings of CLIP, and we introduce an…
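
The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a standard symmetric InfoNCE loss of the kind used in CLIP-style training, and the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical. It uses plain Python lists in place of real 3DGS and CLIP embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(gs_embeds, clip_embeds, temperature=0.07):
    """Symmetric InfoNCE loss between paired embeddings (assumed formulation).

    gs_embeds[i] and clip_embeds[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    gs = [l2_normalize(v) for v in gs_embeds]
    clip = [l2_normalize(v) for v in clip_embeds]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(g, c)) / temperature for c in clip] for g in gs]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the 3DGS-to-CLIP and CLIP-to-3DGS directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch of two pairs: aligned pairs should score a lower loss than swapped ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(aligned, aligned)
loss_swapped = contrastive_loss(aligned, swapped)
```

In this toy batch the loss is near zero when each 3DGS embedding points the same way as its paired CLIP embedding and grows large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.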

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training