Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

David Shavin; Sagie Benaim

arXiv:2602.06032 · cs.CV · February 12, 2026

TL;DR

This paper introduces Splat and Distill, a framework that enhances 2D vision foundation models with 3D awareness by using a fast, feed-forward 3D reconstruction pipeline to improve downstream 3D tasks.

Contribution

The paper presents a novel feed-forward 3D lifting method that replaces slow optimization, enabling efficient 3D-aware distillation for 2D models.

Findings

1. Significant improvements in monocular depth estimation.
2. Enhanced surface normal and multi-view correspondence accuracy.
3. Better semantic richness in 2D features.

Abstract

Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning…
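
The lift, splat, and distill steps described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: it swaps the 3D Gaussian representation and alpha-blended splatting for plain 3D points with a nearest-depth z-buffer, and it assumes a cosine feature loss, which the abstract does not specify. The function names `splat_features` and `distill_loss`, and the pinhole camera parameters `K`, `R`, `t`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def splat_features(points, feats, K, R, t, hw):
    """Render per-point features into a novel view (toy stand-in for splatting).

    points: (N, 3) world coordinates, feats: (N, C) teacher features,
    K: (3, 3) intrinsics, R: (3, 3) and t: (3,) world-to-camera pose,
    hw: (height, width) of the output feature map.
    Returns the (H, W, C) feature map and an (H, W) validity mask.
    """
    h, w = hw
    cam = points @ R.T + t                    # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                          # drop points behind the camera
    uvw = cam[front] @ K.T                    # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[inside], v[inside]
    z_in, f_in = z[front][inside], feats[front][inside]

    fmap = np.zeros((h, w, feats.shape[1]))
    depth = np.full((h, w), np.inf)
    # Assumption: keep only the nearest point per pixel (no alpha blending).
    for ui, vi, zi, fi in zip(u, v, z_in, f_in):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
            fmap[vi, ui] = fi
    return fmap, np.isfinite(depth)

def distill_loss(student, target, mask, eps=1e-8):
    """Mean (1 - cosine similarity) over pixels covered by the rendered target."""
    if not mask.any():
        return 0.0                            # nothing rendered, nothing to supervise
    s, t = student[mask], target[mask]
    cos = np.sum(s * t, axis=1) / (
        np.linalg.norm(s, axis=1) * np.linalg.norm(t, axis=1) + eps
    )
    return float(np.mean(1.0 - cos))
```

In this sketch the student would be run on the real image at the novel viewpoint, and `distill_loss` would compare its feature map against the output of `splat_features` only where the mask is true. Restricting the loss to covered pixels is a design choice of the sketch, made so that empty regions of the rendered map do not pull student features toward zero.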

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is well-written and easy to follow.
2. Based on MVSplat, a feed-forward 3DGS method, SnD is more efficient than previous methods that require optimization.
3. SnD outperforms baselines in various downstream tasks, including depth estimation, normal estimation, and semantic segmentation.

Weaknesses

1. The authors should add an ablation study without the student-teacher framework. For example, a potential experiment is finetuning a model with a feature rendering loss, similar to Fit3D. Currently, it is unclear to me why the student-teacher framework is necessary.
2. The improvement that mask-aware feature lifting brings is not explicit.
3. Quantitative evaluation of multi-view feature correspondence should be added, instead of using visualization only.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The method demonstrates clear performance gains over DINOv2, the baseline framework it builds upon.
- The use of a feed-forward 3D reconstruction module to inject geometric awareness into 2D foundation models seems a promising and timely research direction.

Weaknesses

The evaluation setup is limited, as DINOv2 is seldom used directly for downstream tasks. It typically serves as a feature backbone for task-specific heads. Hence, direct comparison against DINOv2 provides very limited insight. Demonstrating improvements when replacing DINOv2 with the proposed method within state-of-the-art pipelines (e.g., VGGT) would make the results substantially more compelling.

Many recent methods seem to be missing from the comparisons:
- DINOv3
- Sarıyıldız, M.B., Weinz

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* Creating a 3DGS representation on-the-fly for 3D-aware feature distillation is a very nice idea.
* The experimental scope and the demonstrated results are laudable. The approach demonstrates consistent and often significant improvement across all tasks.
* The approach is relatively simple and leads to representations that can be used standalone (without the need for concatenating them with the baseline features in evaluation).

Weaknesses

* Overall, it is unclear where the gains come from. Conceptually, the approach designs a view-invariant representation and uses segmentation masks. This explains the gains on semantic segmentation, which requires view-invariance, but not the 3D-aware tasks (e.g. depth), where the representation is covariant with the camera pose.
* The ablation study is not very informative, also in the context of the above point. Even the worst configuration here (“without blending”) already outperforms the base

Code & Models

Models

🤗 david-shavin/SnD (model)

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Human Pose and Action Recognition