3D Vision-Language Gaussian Splatting
Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu

TL;DR
This paper introduces a 3D vision-language Gaussian splatting model that improves multi-modal scene understanding by better balancing visual and semantic representations, leading to state-of-the-art semantic segmentation results.
Contribution
It presents a novel cross-modal rasterizer and camera-view blending technique to enhance semantic rasterization and reduce over-fitting in 3D vision-language models.
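The review comments further down describe camera-view blending as a mix-up-style operation on feature maps. The sketch below illustrates that general idea only: it forms a convex combination of two same-shaped maps with a weight drawn from a Beta distribution, as in standard mix-up. The function names `blend_views` and `sample_lambda`, the list-of-rows map layout, and the Beta(alpha, alpha) sampling are assumptions made for illustration, not the authors' implementation.

```python
import random


def blend_views(view_a, view_b, lam):
    """Convex combination of two same-shaped 2D maps (lists of rows).

    Hypothetical helper: lam weights view_a, (1 - lam) weights view_b.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(view_a, view_b)
    ]


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha), as in standard mix-up.

    The choice of distribution is an assumption, not taken from the paper.
    """
    return rng.betavariate(alpha, alpha)


# Toy 1x2 "feature maps" from two hypothetical neighbouring camera views.
blended = blend_views([[0.0, 2.0]], [[4.0, 6.0]], 0.25)
```

In a training loop, a blended map of this kind would serve as the target for a view rendered from a correspondingly interpolated camera, which is one way such augmentation can discourage over-fitting to the training views.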
Findings
Achieves state-of-the-art open-vocabulary semantic segmentation performance.
Effectively balances visual and semantic modalities in 3D scene understanding.
Outperforms existing methods by a significant margin.
Abstract
Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a balance between visual and language modalities, which leads to unsatisfying semantic rasterization of translucent or reflective objects, as well as over-fitting on color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision-language Gaussian splatting model for scene understanding, to put emphasis on the representation learning of language modality. We propose a novel cross-modal rasterizer, using…
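To make the translucent-object problem concrete, here is a minimal sketch of front-to-back alpha compositing along a single ray. It compares blending semantic features with the color opacity against blending them with a separate per-Gaussian semantic weight. The separate weight follows the reviewers' description of a learnable parameter replacing the shared color opacity; the function `composite`, the toy one-hot features, and all numbers are hypothetical and do not reproduce the paper's rasterizer.

```python
def composite(values, weights):
    """Front-to-back alpha compositing along one ray.

    values:  per-Gaussian vectors (lists of floats), sorted near to far
    weights: per-Gaussian blending weights in [0, 1], same order
    """
    dim = len(values[0])
    out = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for vec, w in zip(values, weights):
        contribution = w * transmittance
        for k in range(dim):
            out[k] += contribution * vec[k]
        transmittance *= 1.0 - w
    return out


# Hypothetical ray: a glass pane (index 0) in front of a wall (index 1).
# Semantic features are toy one-hot vectors: [glass, wall].
semantic_features = [[1.0, 0.0], [0.0, 1.0]]
color_opacity = [0.1, 0.9]       # glass is nearly transparent in color
semantic_indicator = [0.9, 0.9]  # assumed separate, learnable semantic weight

shared = composite(semantic_features, color_opacity)
separate = composite(semantic_features, semantic_indicator)
```

With these toy numbers, reusing the color opacity yields a semantic result of roughly [0.1, 0.81], so the pixel is labeled as wall even though the nearest surface is glass. The separate weight yields roughly [0.9, 0.09], which keeps the glass label.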
Peer Reviews
Decision·ICLR 2025 Poster
1. The paper is well-written and easy to follow. 2. The designs are well-motivated. 3. The performance shows a significant improvement.
(1) At line 215, is u^i the fused features or the position? (2) In camera interpolation, why use mix-up to generate a feature map instead of using an off-the-shelf OV model to output one, as previous methods do? (3) A global ablation study showing the importance of 'self-attention', 'semantic indicator', and 'camera interpolation' over the baseline is missing. As it stands, it is unclear how the baseline compares with previous methods and what each of the three proposed designs contributes. (4
1. The motivation of this paper sounds reasonable. Over-fitting on the color modality may have a negative impact on semantic learning, which is consistent with intuition. 2. The proposed techniques, including the smoothed semantic indicator and the mix-up augmentation, are simple but effective. 3. The method outperforms existing methods by a significant margin.
1. Although the motivation sounds reasonable and Figure 3 explains it to some extent, I expect more experiments that further study the differences between the color and semantic modalities. Besides showing quantitative results, qualitative results, and ablations of the proposed strategies, the authors should discuss the deeper mechanisms. For example, the authors could study the relationship between the color over-fitting phenomenon and specific scenes. 2. Figure 2 is hard to understand. T
1. The paper provides a profound insight into the challenges posed by semi-opaque media and intricate light transport effects, highlighting the limitations in translating color opacity to the semantic domain. This observation is interesting. 2. The implementation of a single learnable parameter to replace the conventional shared color opacity parameter is both straightforward and efficient. 3. The paper is commendably structured, presenting a well-motivated narrative and a clear methodology.
1. The paper introduces the camera-view blending technique as a method to enhance cross-view semantic consistency. However, it falls short in fully addressing the challenge of different objects sharing similar colors, potentially leading to indistinguishable semantic representations, which the authors claim to have tackled. Further insights on this critical point are needed. 2. In Section 4.4, the experimental results suggest an increase in rendering speed with the proposed pipeline. However, gi
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
