CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding

Guibiao Liao; Jiankun Li; Zhenyu Bao; Xiaoqing Ye; Qing Li; Kanglin Liu

arXiv:2404.14249 · cs.CV · June 24, 2025 · 1 citation



TL;DR

CLIP-GS introduces a novel method combining CLIP and Gaussian Splatting with semantic regularizations to achieve real-time, view-consistent 3D indoor scene understanding with significant accuracy improvements.

Contribution

The paper proposes Semantic Attribute Compactness and 3D Coherent Regularization to enhance semantic consistency and efficiency in CLIP-guided 3D Gaussian Splatting.

Findings

1. Achieves over 100 FPS rendering speed.
2. Improves mIoU by 21.20% on ScanNet.
3. Outperforms state-of-the-art methods on multiple datasets.

Abstract

Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. Existing methods typically attach high-dimensional CLIP semantic embeddings to 3D Gaussians and leverage view-inconsistent 2D CLIP semantics as Gaussian supervision, resulting in efficiency bottlenecks and deficient 3D semantic consistency. To address these challenges, we present CLIP-GS, which efficiently achieves a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). The SAC approach exploits the naturally unified semantics within objects to learn compact, yet effective, semantic Gaussian representations, enabling highly efficient rendering (>100 FPS). 3DCR enforces semantic consistency in 2D and 3D…
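The efficiency argument in the abstract can be made concrete with a rough sketch. This is not the authors' implementation: it only illustrates why storing a full CLIP embedding on every Gaussian is costly, and how rendered semantic features could be matched to text prompts by cosine similarity. The compact width `COMPACT_DIM`, the function names, and the assumption that rendered features are already expressed in CLIP space are all hypothetical choices made for illustration.

```python
import numpy as np

CLIP_DIM = 512      # embedding width of CLIP ViT-B models
COMPACT_DIM = 8     # hypothetical compact width, chosen only for illustration


def semantic_memory_bytes(num_gaussians, dim, bytes_per_value=4):
    """Bytes needed to store one float32 semantic vector of width `dim` per Gaussian."""
    return num_gaussians * dim * bytes_per_value


def label_pixels(pixel_feats, text_embeds):
    """Label each pixel with the index of its most similar text embedding.

    pixel_feats: (H, W, D) rendered semantic features, assumed to be in CLIP space.
    text_embeds: (C, D) one embedding per candidate class prompt.
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    similarity = p @ t.T          # cosine similarity, shape (H, W, C)
    return similarity.argmax(axis=-1)


if __name__ == "__main__":
    # Random stand-ins for rendered features and prompt embeddings.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 4, CLIP_DIM))
    prompts = rng.normal(size=(3, CLIP_DIM))
    print(label_pixels(feats, prompts).shape)

    # Storage ratio between full-width and compact per-Gaussian semantics.
    full = semantic_memory_bytes(1_000_000, CLIP_DIM)
    compact = semantic_memory_bytes(1_000_000, COMPACT_DIM)
    print(full // compact)
```

Under these assumptions, one million Gaussians with 512 float32 values each need about 2 GB for semantics alone, which suggests why a compact per-Gaussian representation matters for real-time rendering.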


Code & Models

Repositories

gbliao/clip-gs (PyTorch, official)


Taxonomy

Topics: Human Pose and Action Recognition · Image Processing and 3D Reconstruction · 3D Shape Modeling and Analysis

Methods: Dilated Convolution · Global Average Pooling · 1x1 Convolution · Convolution · Average Pooling · Switchable Atrous Convolution · Contrastive Language-Image Pre-training