TL;DR
BRIXEL is a knowledge distillation method that makes dense feature extraction cheaper by training a student to reproduce its own feature maps at higher resolution. At a fixed input resolution it outperforms the baseline DINOv3 models on downstream tasks.
Contribution
Introduces BRIXEL, a simple distillation approach that improves dense feature maps at lower computational cost, applicable across various model families.
Findings
BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the input resolution is kept fixed.
Applying BRIXEL yields substantial performance gains across different dense-feature extractors.
Code and models are publicly available at the provided GitHub link.
Abstract
Vision foundation models achieve strong performance on both global and locally dense downstream tasks. Pretrained on large images, the recent DINOv3 model family is able to produce very fine-grained dense feature maps, enabling state-of-the-art performance. However, computing these feature maps requires the input image to be available at very high resolution, as well as large amounts of compute due to the squared complexity of the transformer architecture. To address these issues, we propose BRIXEL, a simple knowledge distillation approach that has the student learn to reproduce its own feature maps at higher resolution. Despite its simplicity, BRIXEL outperforms the baseline DINOv3 models by large margins on downstream tasks when the resolution is kept fixed. We also apply BRIXEL to other recent dense-feature extractors and show that it yields substantial performance gains across model…
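To make the cost argument in the abstract concrete, the short sketch below counts patch tokens at two input resolutions and compares the number of token pairs a self-attention layer has to score. The patch size of 16 and the resolutions 256 and 1024 are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical. Only the attention term grows quadratically; the remaining layers scale roughly linearly with the token count.

```python
# Illustrative arithmetic only: patch size and resolutions are assumptions,
# not settings reported for BRIXEL or DINOv3 in the text above.

def num_patch_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Patch tokens a ViT-style model sees for one image (extra tokens ignored)."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(num_tokens: int) -> int:
    """Token pairs scored by one self-attention layer (quadratic in tokens)."""
    return num_tokens * num_tokens


low = num_patch_tokens(256, 256)      # low-resolution input
high = num_patch_tokens(1024, 1024)   # high-resolution input

token_ratio = high / low
pair_ratio = attention_pairs(high) / attention_pairs(low)

print(f"tokens: {low} vs {high} ({token_ratio:.0f}x)")
print(f"attention pairs: {pair_ratio:.0f}x more at high resolution")
```

Under these assumptions, a 4x increase in side length gives 16x more tokens and 256x more attention pairs, which is the gap a method that recovers high-resolution features from low-resolution input aims to avoid.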
