LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

Juliette Marrie; Romain Menegaux; Michael Arbel; Diane Larlus; Julien Mairal

arXiv:2410.14462·cs.CV·July 29, 2025

TL;DR

This paper introduces LUDVIG, a learning-free, efficient method to uplift 2D vision features into 3D Gaussian Splatting representations, enabling fast and effective 3D scene understanding without extensive training.

Contribution

LUDVIG presents a novel, training-free approach that uses feature aggregation and graph diffusion to uplift 2D features into 3D Gaussian Splatting scenes. It runs faster than traditional reconstruction-based methods while reaching comparable accuracy.
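To make the aggregation idea concrete, here is a minimal sketch of a weighted average of pixel features per Gaussian. It is not the authors' implementation. It assumes the contribution of each Gaussian to each pixel (for example, a rendering weight from the splatting rasterizer) is already available as sparse triplets; the function name `uplift_features` and all parameter names are hypothetical.

```python
import numpy as np

def uplift_features(pixel_feats, pix_idx, gauss_idx, weights, num_gaussians, eps=1e-8):
    """Weighted average of 2D pixel features for each 3D Gaussian.

    pixel_feats : (P, D) features of all pixels, flattened across views.
    pix_idx, gauss_idx, weights : (K,) sparse triplets. weights[k] is the
        ASSUMED contribution of Gaussian gauss_idx[k] to pixel pix_idx[k];
        in practice it would come from the renderer.
    Returns a (num_gaussians, D) array; Gaussians that are never seen get zeros.
    """
    dim = pixel_feats.shape[1]
    num = np.zeros((num_gaussians, dim))
    den = np.zeros(num_gaussians)
    # Accumulate weighted features and total weight per Gaussian.
    # np.add.at handles repeated indices correctly (unbuffered add).
    np.add.at(num, gauss_idx, weights[:, None] * pixel_feats[pix_idx])
    np.add.at(den, gauss_idx, weights)
    return num / np.maximum(den, eps)[:, None]

# Toy usage: 3 pixels, 3 Gaussians, the last Gaussian is never observed.
pixel_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pix_idx = np.array([0, 1, 2])
gauss_idx = np.array([0, 0, 1])
weights = np.array([1.0, 3.0, 2.0])
gauss_feats = uplift_features(pixel_feats, pix_idx, gauss_idx, weights, num_gaussians=3)
```

Because this is a single accumulation pass rather than an iterative fit, no gradient-based optimization loop is involved, which is the intuition behind the claimed speed-up.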

Findings

01

Achieves competitive segmentation with DINOv2 features without training on segmentation masks.

02

Demonstrates strong open-vocabulary object segmentation with CLIP features.

03

Provides significant speed-ups over traditional 3D reconstruction methods.

Abstract

We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like…
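As an illustration of the graph diffusion idea described above, the sketch below propagates a coarse per-Gaussian score over a nearest-neighbor graph whose edges are weighted by feature similarity. It is a generic label-propagation-style scheme under stated assumptions, not the exact procedure from the paper: the brute-force k-nearest-neighbor construction, the exponential similarity kernel, and the names `knn_graph`, `diffuse`, `temperature`, and `alpha` are all hypothetical choices.

```python
import numpy as np

def knn_graph(positions, feats, k=2, temperature=0.1):
    """Row-normalized adjacency: 3D k-nearest neighbors, feature-similarity weights."""
    n = positions.shape[0]
    # Brute-force pairwise squared distances (fine for a toy example only).
    d2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-edges
    nbrs = np.argsort(d2, axis=1)[:, :k]
    # Cosine similarity between unit-normalized features of neighboring nodes.
    f = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-8)
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    sim = (f[rows] * f[cols]).sum(-1)
    A = np.zeros((n, n))
    A[rows, cols] = np.exp((sim - 1.0) / temperature)  # assumed kernel
    A = np.maximum(A, A.T)  # symmetrize
    return A / A.sum(axis=1, keepdims=True)

def diffuse(A, seed, steps=10, alpha=0.5):
    """Iterate g <- alpha * A g + (1 - alpha) * seed."""
    g = seed.astype(float).copy()
    for _ in range(steps):
        g = alpha * (A @ g) + (1.0 - alpha) * seed
    return g

# Toy scene: two well-separated clusters with distinct features.
positions = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0],
                      [5, 5, 5], [5.1, 5, 5], [5, 5.1, 5]], dtype=float)
feats = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)
seed = np.array([1, 0, 0, 0, 0, 0], dtype=float)  # coarse mask: one seeded Gaussian
refined = diffuse(knn_graph(positions, feats, k=2), seed)
```

In this toy case the seed score spreads to the other Gaussians in its own cluster and stays at zero in the distant cluster, since the graph has no edges between the two.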

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The proposed approach to lift 2D features to 3D is efficient and avoids expensive iterative optimization schemes. The approach appears to be simple to implement and could be effective for downstream tasks such as 3D segmentation of stationary scenes where multiple views of the scene are available.

Weaknesses

My main concerns about this work revolve around: (1) Low novelty: - It was unclear to me to what extent the main idea of lifting features from images to 3DGS point clouds was novel compared to existing approaches in the literature that have explored scene editing given a 3DGS reconstruction of a scene. The weighted averaging and aggregation scheme described here appears to be very similar to what was proposed in prior work such as Chen et al. 2024. - The paper mostly focuses on the 3D segmenta

Reviewer 02 · Rating 5 · Confidence 3

Strengths

+ The proposed scheme of connecting per-pixel 2D features and Gaussians is simple and intuitive. + The segmentation can be done directly, without iterative optimization on a trained Gaussian. + The treatment of incorporating DINOv2 features into segmentation is nice, as it yields results comparable to the variant using SAM, which is more tailored to segmentation.

Weaknesses

- The submission discusses neither limitations/failure cases nor future work. What is the broader impact of the work for the community? - The submission lacks a report on running time.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The learning-free feature uplifting method is both simple and effective, achieving strong results without training. 2. Experiments with SAM and DINOv2 demonstrate the method’s efficiency, yielding performance comparable to training-based approaches. 3. High Computational Efficiency: LUDVIG bypasses the costly and time-consuming optimization steps typical in 3D reconstruction methods, making it highly efficient. 4. Versatile Input Adaptability: The proposed method adapts seamlessly to various

Weaknesses

1. While the method is straightforward, it relies on hand-crafted processes, such as the segmentation score calculation and the graph diffusion process. These manual strategies may raise concerns about robustness, particularly in complex, real-world scenarios. 2. Certain sections, like Sec. 4.2, are challenging to follow. For example, the construction of 2D feature maps from DINOv2 is not clearly outlined. Including diagrams or visual aids could greatly enhance understanding and clarify complex

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · self-DIstillation with NO labels · Diffusion