DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Letian Wang; Seung Wook Kim; Jiawei Yang; Cunjun Yu; Boris Ivanovic; Steven L. Waslander; Yue Wang; Sanja Fidler; Marco Pavone; Peter Karkus

arXiv:2406.12095·cs.CV·November 1, 2024

TL;DR

DistillNeRF is a self-supervised framework that learns 3D scene understanding from sparse 2D images by distilling neural fields and foundation model features, enabling improved scene reconstruction and semantic understanding.

Contribution

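It introduces a generalizable feedforward model for 3D scene perception from sparse, single-frame multi-view images, trained with dense depth and virtual camera targets distilled from per-scene optimized neural radiance fields and with features distilled from pre-trained 2D foundation models.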

Findings

01. Outperforms state-of-the-art self-supervised methods in scene reconstruction and depth estimation.

02. Enables zero-shot 3D semantic occupancy prediction.

03. Supports open-world scene understanding with foundation model features.

Abstract

We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in outdoor autonomous driving scenes. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs with limited view overlap, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets from them, which helps our model to learn enhanced 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream…
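The abstract's training signal (differentiable rendering of RGB, depth, or feature images) can be illustrated with a minimal sketch of standard NeRF-style alpha compositing along a single ray. This is not the authors' implementation: the function names `composite_ray` and `l2_loss`, the pure-Python list representation, and the squared-error loss are all assumptions chosen for illustration. The same compositing weights render any per-sample vector, so `values` may hold RGB colors or features from a foundation model such as CLIP or DINOv2.

```python
import math


def composite_ray(densities, deltas, values, depths):
    """Alpha-composite per-sample values along one ray.

    Hypothetical helper using the standard NeRF quadrature, not the paper's code.
    densities: volume density at each sample
    deltas:    distance covered by each sample along the ray
    values:    per-sample vectors (RGB or distilled features)
    depths:    distance of each sample from the camera
    """
    transmittance = 1.0  # fraction of light that has not been absorbed so far
    dim = len(values[0])
    rendered = [0.0] * dim
    rendered_depth = 0.0
    for sigma, delta, value, t in zip(densities, deltas, values, depths):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha          # contribution to the pixel
        for k in range(dim):
            rendered[k] += weight * value[k]
        rendered_depth += weight * t
        transmittance *= 1.0 - alpha
    return rendered, rendered_depth


def l2_loss(pred, target):
    """Mean squared error between a rendered vector and its target (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# An empty sample followed by an opaque one: the ray stops at the second sample.
rgb, depth = composite_ray(
    densities=[0.0, 1e9],
    deltas=[1.0, 1.0],
    values=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    depths=[1.0, 2.0],
)
loss = l2_loss(rgb, [0.0, 1.0, 0.0])
```

Under these assumptions, a self-supervised objective would compare the rendered color with the input image, the rendered depth with a depth target (for example one generated from a per-scene NeRF), and the rendered feature vector with the 2D foundation model's feature at the same pixel.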

Videos

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features · slideslive

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications · 3D Surveying and Cultural Heritage

Methods: Contrastive Language-Image Pre-training