CLIP-GS: CLIP-Informed Gaussian Splatting for View-Consistent 3D Indoor Semantic Understanding
Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Qing Li, Kanglin Liu

TL;DR
CLIP-GS combines CLIP with 3D Gaussian Splatting, using compact semantic attributes and a 3D coherence regularization, to achieve real-time (>100 FPS), view-consistent semantic understanding of 3D indoor scenes.
Contribution
The paper proposes Semantic Attribute Compactness and 3D Coherent Regularization to enhance semantic consistency and efficiency in CLIP-guided 3D Gaussian Splatting.
Findings
Achieves over 100 FPS rendering speed.
Improves mIoU by 21.20% on ScanNet.
Outperforms state-of-the-art methods on multiple datasets.
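For readers unfamiliar with the mIoU metric cited above, the following is a minimal sketch of the standard mean intersection-over-union computation over flat label lists. The function name `mean_iou` and the toy labels are illustrative assumptions; this page does not state how the paper aggregates the metric (for example, per scene or per dataset), so treat this as the textbook definition rather than the authors' evaluation code.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Intersection: positions where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Union: positions where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-pixel labels (hypothetical values, not from the paper).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.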
Abstract
Exploiting 3D Gaussian Splatting (3DGS) with Contrastive Language-Image Pre-Training (CLIP) models for open-vocabulary 3D semantic understanding of indoor scenes has emerged as an attractive research focus. Existing methods typically attach high-dimensional CLIP semantic embeddings to 3D Gaussians and leverage view-inconsistent 2D CLIP semantics as Gaussian supervision, resulting in efficiency bottlenecks and deficient 3D semantic consistency. To address these challenges, we present CLIP-GS, which efficiently achieves a coherent semantic understanding of 3D indoor scenes via the proposed Semantic Attribute Compactness (SAC) and 3D Coherent Regularization (3DCR). The SAC approach exploits the naturally unified semantics within objects to learn compact, yet effective, semantic Gaussian representations, enabling highly efficient rendering (>100 FPS). 3DCR enforces semantic consistency in 2D and 3D…
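As a rough illustration of what "open-vocabulary" querying means in CLIP-based pipelines, the sketch below assigns a label to a semantic feature by picking the text embedding with the highest cosine similarity. This is a generic CLIP-style matching step, not the paper's SAC or 3DCR; the helper names (`cosine`, `classify`) and the 3-dimensional toy vectors are assumptions standing in for real CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(feature, text_embeddings):
    """Return the label whose text embedding is most similar to the feature."""
    return max(text_embeddings, key=lambda label: cosine(feature, text_embeddings[label]))


# Toy 3-D vectors standing in for CLIP text embeddings (hypothetical values).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "floor": [0.0, 0.0, 1.0],
}

# A toy feature standing in for a rendered per-pixel semantic embedding.
rendered_feature = [0.2, 0.9, 0.1]
label = classify(rendered_feature, text_embeddings)
print(label)
```

Because the label set is just a dictionary of text embeddings, new categories can be queried by adding entries, with no retraining of the classifier itself.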
Taxonomy
Topics: Human Pose and Action Recognition · Image Processing and 3D Reconstruction · 3D Shape Modeling and Analysis
Methods: Dilated Convolution · Global Average Pooling · 1x1 Convolution · Convolution · Average Pooling · Switchable Atrous Convolution · Contrastive Language-Image Pre-training
