Improving 2D Feature Representations by 3D-Aware Fine-Tuning
Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, Jan Eric Lenssen

TL;DR
This paper introduces a 3D-aware fine-tuning method for 2D visual models, enhancing their understanding of 3D structures and improving performance on tasks like segmentation and depth estimation across diverse datasets.
Contribution
The authors propose a novel approach to incorporate 3D awareness into 2D foundation models via semantic feature lifting and re-rendering, which improves downstream task performance.
Findings
Enhanced features improve semantic segmentation and depth estimation accuracy.
Improvements transfer to multiple indoor and out-of-domain datasets, despite fine-tuning on a single indoor dataset.
Simple linear probing suffices to leverage the enhanced features.
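As a rough illustration of what linear probing means here, the sketch below fits a single softmax layer on top of frozen features and leaves the features themselves untouched. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `train_linear_probe` and `predict`, the hyperparameters, and the synthetic features standing in for backbone outputs are all hypothetical.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.01, steps=500):
    """Fit a softmax linear classifier on frozen features.

    Only W and b are learned; `features` is treated as a fixed input,
    which is what distinguishes probing from fine-tuning the backbone.
    """
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Return the arg-max class for each feature vector."""
    return np.argmax(features @ W + b, axis=1)

# Synthetic stand-in for per-pixel backbone features (assumption: 3 classes, 16 dims).
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16)) * 3.0
labels = rng.integers(0, 3, size=300)
features = centers[labels] + rng.normal(size=(300, 16))

W, b = train_linear_probe(features, labels, num_classes=3)
accuracy = (predict(features, W, b) == labels).mean()
```

For depth estimation the same idea applies with a linear regression head in place of the softmax classifier; in both cases the quality of the frozen features determines how well such a simple head can perform.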
Abstract
Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in that way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement is transferable to a variety of indoor datasets and out-of-domain datasets. We hope our study…
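To make the re-rendering step more concrete, the sketch below shows front-to-back alpha compositing of per-Gaussian feature vectors along a single ray, the same blending rule that standard Gaussian splatting uses for colors. That the lifted features are blended exactly this way is an assumption made for illustration; the function name `composite_features` and the toy inputs are hypothetical.

```python
import numpy as np

def composite_features(features, alphas):
    """Blend per-Gaussian features along one ray, front to back.

    features: (N, D) feature vectors of the Gaussians hit by the ray,
              sorted from nearest to farthest.
    alphas:   (N,) opacity contribution of each Gaussian at this pixel.
    Returns the rendered (D,) feature and the (N,) blending weights.
    """
    # Transmittance before Gaussian i is the product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ features, weights

# Toy example: two Gaussians carrying orthogonal 2-D features.
toy_features = np.array([[1.0, 0.0],
                         [0.0, 1.0]])
toy_alphas = np.array([0.5, 0.5])
rendered, weights = composite_features(toy_features, toy_alphas)
# weights are [0.5, 0.25], so rendered is [0.5, 0.25]
```

Because the same set of Gaussians is composited for every camera pose, features rendered this way are consistent across views, which is the property the fine-tuning stage then transfers into the 2D model.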
Taxonomy
Topics: Image Processing and 3D Reconstruction · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization
