FeatSharp: Your Vision Model Features, Sharper

Mike Ranzinger; Greg Heinrich; Pavlo Molchanov; Jan Kautz; Bryan Catanzaro; Andrew Tao

arXiv:2502.16025·cs.CV·July 4, 2025

TL;DR

FeatSharp introduces a cost-effective method to upsample low-resolution vision encoder features, enhancing detail preservation for improved performance in perception tasks and model training.

Contribution

The paper presents a novel technique for coherently upsampling vision encoder features, retaining fine detail without high computational cost.

Findings

01 Improves feature map resolution in vision encoders.

02 Enhances performance on perception tasks.

03 Facilitates richer distillation targets.

Abstract

The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g., semantic segmentation, object detection, and depth perception) to multimodal understanding in vision-language models (VLMs). The current frontier of general-purpose vision backbones is the Vision Transformer (ViT), typically trained with a contrastive loss (e.g., CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that they are inflexibly low resolution. Most run at $224 \times 224$ px, and even the "high-resolution" versions run at only around $378$ to $448$ px and remain inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-resolution vision encoders while picking up on fine-grained details that would otherwise be lost due to the limited resolution. We demonstrate the effectiveness of this…
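To make the resolution problem concrete, the following is a minimal sketch of how small a ViT feature map is at the input sizes quoted in the abstract. The patch sizes (14 and 16 px) and the helper name `token_grid` are assumptions chosen for illustration; the abstract does not state a patch size, and this is not the FeatSharp method itself.

```python
def token_grid(image_px: int, patch_px: int) -> tuple[int, int]:
    """Return (grid side, token count) for a square image cut into square patches."""
    if image_px % patch_px != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_px // patch_px
    return side, side * side


# Patch sizes below are assumed for illustration only.
for image_px, patch_px in [(224, 16), (224, 14), (378, 14), (448, 14)]:
    side, tokens = token_grid(image_px, patch_px)
    print(f"{image_px} px input, {patch_px} px patches: {side}x{side} grid, {tokens} tokens")
```

Under these assumptions, each feature vector summarizes a 14 or 16 px square of the image, so even the larger inputs yield a feature map only a few dozen cells wide. That coarse grid is what a feature upsampling method has to refine.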

Code & Models

Repositories

nvlabs/radio
pytorch

Taxonomy

Topics: Geographic Information Systems Studies

Methods: Contrastive Language-Image Pre-training