UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting

Haoyuan Li; Yanpeng Zhou; Tao Tang; Jifei Song; Yihan Zeng; Michael Kampffmeyer; Hang Xu; Xiaodan Liang

arXiv:2502.17860·cs.CV·February 28, 2025

TL;DR

UniGS introduces a novel 3D Gaussian Splatting-based pretraining framework that enhances multi-modal 3D representations by better capturing scene intricacies and aligning with language and image data, leading to state-of-the-art results.

Contribution

The paper proposes integrating 3D Gaussian Splatting into multi-modal pretraining to improve 3D scene representation and alignment with language and images, introducing a Gaussian-Aware Guidance module.

Findings

01 · Achieves +9.36% in zero-shot classification

02 · Improves text-driven retrieval by +4.3%

03 · Enhances open-world understanding by +7.92%

Abstract

Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pre-training, UniGS starts with a pre-trained vision-language model to establish a shared visual and textual space through extensive…
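To make the contrast between a point cloud and a 3DGS scene concrete, here is a minimal sketch of the kind of per-primitive record the abstract describes. The class name `Gaussian3D`, the helper names, and the exact attribute layout (position, scale, rotation quaternion, RGB color, opacity) are illustrative assumptions based on the standard 3DGS parameterization, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian3D:
    """One 3D Gaussian primitive (hypothetical layout, for illustration only)."""
    position: Tuple[float, float, float]          # center (x, y, z)
    scale: Tuple[float, float, float]             # per-axis extent of the Gaussian
    rotation: Tuple[float, float, float, float]   # orientation as a quaternion (w, x, y, z)
    color: Tuple[float, float, float]             # RGB in [0, 1]
    opacity: float                                # in [0, 1]

    def to_feature(self) -> List[float]:
        # Flatten every attribute into one vector: 3 + 3 + 4 + 3 + 1 = 14 values.
        return [*self.position, *self.scale, *self.rotation, *self.color, self.opacity]


def scene_to_features(scene: List[Gaussian3D]) -> List[List[float]]:
    """Turn a scene (a list of Gaussians) into an N x 14 table an encoder could consume."""
    return [g.to_feature() for g in scene]


def point_cloud_view(scene: List[Gaussian3D]) -> List[List[float]]:
    """Keep only what a colored point cloud stores: position and color (N x 6)."""
    return [[*g.position, *g.color] for g in scene]
```

Under these assumptions, a colored point cloud is the same table with the scale, rotation, and opacity columns dropped, which is one way to read the abstract's claim that discrete points carry less of the scene than Gaussians do.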

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The proposed approach achieves state-of-the-art performance on various challenging datasets, which demonstrates its effectiveness in learning strong cross-modal representations.

Weaknesses

1. No detailed explanation is given of how negative samples were selected. Do different tasks require adjusting the negative-sample strategy?
2. 3DGS and traditional point cloud data differ significantly in their structural and spatial characterization. Does this affect the final performance?
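For context on the negative-sample question, the following is a minimal sketch of in-batch negatives in a symmetric InfoNCE loss, a common choice in CLIP-style contrastive pre-training. It is an assumption for illustration, not the strategy the paper actually uses (this page does not state it). The function names and the temperature value 0.07 are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors: Sequence[Sequence[float]],
             targets: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Mean cross-entropy of matching anchor i to target i.

    Every other target j != i in the batch acts as a negative for anchor i,
    so no explicit negative mining is required.
    """
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_info_nce(emb_3d: Sequence[Sequence[float]],
                       emb_other: Sequence[Sequence[float]],
                       temperature: float = 0.07) -> float:
    """Average the 3D-to-other and other-to-3D directions (other = image or text)."""
    return 0.5 * (info_nce(emb_3d, emb_other, temperature)
                  + info_nce(emb_other, emb_3d, temperature))
```

With this formulation the negatives are determined entirely by batch composition, so a task-specific negative strategy would amount to changing how batches are sampled rather than changing the loss.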

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The proposed method performs well on various multi-modal datasets.
- Experimental results are given on two different datasets, COCO and VisDrone.

Weaknesses

- Uni3D has used one model to unify the 3D representations from different models, which can be used to align with image and text. What is the main advantage of the proposed method using 3DGS?
- The proposed method introduces 3DGS for feature experimentation. Does it increase computational cost?
- It seems that most experiments are inconsistent with the results in Uni3D. In this paper, the performance is relatively poor. What is the difference?
- I think that it is better to use the similar setti

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The improvement is significant compared with previous methods, which shows the effectiveness of using 3DGS as the unified 3D representation.
2. The paper introduces an innovative Gaussian-Aware Guidance module that utilizes priors from pre-trained point cloud encoders as an initialization to enhance the learning of 3DGS features. This design is effective since it doesn't require training from scratch but can make use of existing models from a different 3D representation.
3. The intuiti

Weaknesses

1. The caption of Figure 2 could provide an overall description of the information flow (such as how the pipeline works in general). The figure could also be improved by adding diagrams to represent downstream tasks instead of using text only.
2. I think one significant weakness of using 3DGS as a unified 3D representation is that raw data usually doesn't use this representation, like a point cloud from a LiDAR sensor. In this way, this method needs to optimize or process a 3DGS using those

Taxonomy

Topics: 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization

Methods: ALIGN