TL;DR
This paper introduces LUDVIG, a learning-free, efficient method to uplift 2D vision features into 3D Gaussian Splatting representations, enabling fast and effective 3D scene understanding without extensive training.
Contribution
LUDVIG presents a novel, training-free approach that uses feature aggregation and graph diffusion to uplift 2D features into 3D Gaussian Splatting scenes. It is faster than traditional reconstruction-based methods while reaching comparable accuracy.
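The aggregation step can be illustrated with a minimal NumPy sketch. It assumes (this summary does not give the exact weighting) that each view supplies a sparse list of contributions `(gaussian index, pixel index, weight)`, for example the alpha-blending weight of a Gaussian at a pixel, and that the uplifted feature of a Gaussian is the weighted average of the 2D features at the pixels it contributes to. The function name `uplift_features` and its argument layout are hypothetical, not taken from the paper.

```python
import numpy as np

def uplift_features(contribs, feature_maps, num_gaussians):
    """Weighted-average 2D features onto Gaussians (illustrative sketch).

    contribs:     list over views of (gauss_idx, pix_idx, weight) 1D arrays
                  (hypothetical layout; weights assumed non-negative).
    feature_maps: list over views of (num_pixels, C) feature arrays.
    Returns an array of shape (num_gaussians, C).
    """
    dim = feature_maps[0].shape[1]
    feat_sum = np.zeros((num_gaussians, dim))
    weight_sum = np.zeros(num_gaussians)
    for (g, p, w), fmap in zip(contribs, feature_maps):
        # Unbuffered scatter-add: a Gaussian may appear many times in g.
        np.add.at(feat_sum, g, w[:, None] * fmap[p])
        np.add.at(weight_sum, g, w)
    seen = weight_sum > 0  # Gaussians never observed keep a zero feature
    feat_sum[seen] /= weight_sum[seen, None]
    return feat_sum

# Toy usage: one view, two pixels, two Gaussians.
fmap = np.array([[1.0, 0.0], [0.0, 1.0]])
contribs = [(np.array([0, 0, 1]), np.array([0, 1, 1]), np.array([0.75, 0.25, 1.0]))]
uplifted = uplift_features(contribs, [fmap], num_gaussians=2)
```

Because this is a single pass of accumulation followed by a division, no gradient-based optimization is involved, which is the property the summary highlights.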
Findings
Achieves competitive segmentation with DINOv2 features without training on segmentation masks.
Demonstrates strong open-vocabulary object segmentation with CLIP features.
Provides significant speed-ups over traditional 3D reconstruction methods.
Abstract
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like…
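As a rough illustration of the graph diffusion idea described in the abstract, the sketch below refines a coarse per-Gaussian score (such as a segmentation mask) over a graph built from 3D geometry and feature similarity. Everything specific here is an assumption rather than the paper's formulation: the brute-force k-nearest-neighbour graph, the clipped cosine similarity used as edge weight, the row-normalized update with a retained seed term, and the hypothetical names `knn_similarity_graph` and `diffuse`.

```python
import numpy as np

def knn_similarity_graph(positions, features, k=4):
    """Dense adjacency: k nearest neighbours in 3D, weighted by cosine similarity."""
    n = positions.shape[0]
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)             # exclude self-edges
    nbrs = np.argsort(d2, axis=1)[:, :k]     # indices of the k closest points
    unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    sims = (unit[rows] * unit[cols]).sum(-1)
    A = np.zeros((n, n))
    A[rows, cols] = np.clip(sims, 0.0, None)  # drop dissimilar neighbours
    return np.maximum(A, A.T)                 # symmetrize

def diffuse(A, scores, steps=10, alpha=0.5):
    """Iterate out <- (1 - alpha) * scores + alpha * P @ out, with P row-normalized A."""
    P = A / np.maximum(A.sum(1, keepdims=True), 1e-8)
    out = scores.astype(float).copy()
    for _ in range(steps):
        out = (1 - alpha) * scores + alpha * (P @ out)
    return out

# Toy usage: two well-separated clusters with distinct features, one seeded point.
pos = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0],
                [5, 5, 5], [5.1, 5, 5], [5, 5.1, 5]], dtype=float)
feat = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
seed = np.array([1, 0, 0, 0, 0, 0], dtype=float)
refined = diffuse(knn_similarity_graph(pos, feat, k=2), seed)
```

In this toy setup the seed score spreads to the other points of its own cluster and stays at zero in the other cluster, which mirrors the intended effect of using both geometry and feature similarity to sharpen a coarse mask.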
Peer Reviews
Decision·Submitted to ICLR 2025
The proposed approach to lift 2D features to 3D is efficient and avoids expensive iterative optimization schemes. The approach appears to be simple to implement and could be effective for downstream tasks such as 3D segmentation of stationary scenes where multiple views of the scene are available.
My main concerns about this work revolve around: (1) Low novelty: - It was unclear to me to what extent the main idea of lifting features from images to 3DGS point clouds was novel compared to existing approaches in the literature that have explored scene editing given a 3DGS reconstruction of a scene. The weighted averaging and aggregation scheme described here appears to be very similar to what was proposed in prior work such as Chen et al. 2024. - The paper mostly focuses on the 3D segmenta
+ The proposed scheme of connecting per-pixel 2D features and Gaussians is simple and intuitive. + The segmentation can be done directly on a trained Gaussian, without iterative optimization. + The way DINOv2 features are incorporated into segmentation is nice, as it yields results comparable to the variant using SAM, which is more tailored to the task.
- The submission discusses neither limitations and failure cases nor future work. What is the broader impact of the work for the community? - The submission does not report running time.
1. The learning-free feature uplifting method is both simple and effective, achieving strong results without training. 2. Experiments with SAM and DINOv2 demonstrate the method’s efficiency, yielding performance comparable to training-based approaches. 3. High Computational Efficiency: LUDVIG bypasses the costly and time-consuming optimization steps typical in 3D reconstruction methods, making it highly efficient. 4. Versatile Input Adaptability: The proposed method adapts seamlessly to various
1. While the method is straightforward, it relies on hand-crafted processes, such as the segmentation score calculation and the graph diffusion process. These manual strategies may raise concerns about robustness, particularly in complex, real-world scenarios. 2. Certain sections, like Sec. 4.2, are challenging to follow. For example, the construction of 2D feature maps from DINOv2 is not clearly outlined. Including diagrams or visual aids could greatly enhance understanding and clarify complex
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Diffusion
