Object-level Self-Distillation for Vision Pretraining

Çağlar Hızlı; Çağatay Yıldız; Pekka Marttinen

arXiv:2506.05409 · cs.CV · June 9, 2025

Open Access

TL;DR

This paper introduces Object-level Self-Distillation (ODIS), a vision pretraining method that distills at the level of individual objects rather than whole images, improving representation quality, especially on complex, scene-centric datasets.

Contribution

ODIS shifts self-distillation from the image level to the object level, using object-aware cropping and masked attention to improve transformer-based visual representations.

Findings

01 Achieves 82.6% k-NN accuracy on ImageNet1k with ViT-Large.

02 Improves representations at both image and patch levels.

03 Transforms scene-level tasks into simpler object-level sub-tasks.
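
For readers unfamiliar with the k-NN accuracy metric cited in the first finding, the following is a minimal sketch of how frozen features can be scored with a nearest-neighbour classifier. It is not the paper's evaluation code: the function name `knn_accuracy`, the cosine similarity, the plain majority vote, and the default `k=3` are all assumptions chosen for illustration, and published protocols often use a similarity-weighted vote with a much larger `k`.

```python
import math
from collections import Counter


def _normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=3):
    """Fraction of test items whose k nearest training features
    (by cosine similarity) vote for the correct label.

    Hypothetical helper for illustration; a simple majority vote is assumed.
    """
    train = [_normalize(f) for f in train_feats]
    correct = 0
    for feat, label in zip(test_feats, test_labels):
        query = _normalize(feat)
        # Cosine similarity reduces to a dot product on unit vectors.
        sims = [sum(a * b for a, b in zip(query, t)) for t in train]
        nearest = sorted(range(len(train)), key=lambda i: sims[i], reverse=True)[:k]
        votes = Counter(train_labels[i] for i in nearest)
        correct += votes.most_common(1)[0][0] == label
    return correct / len(test_feats)
```

No encoder weights are updated during this evaluation, so the score reflects only how well the pretrained features cluster by class.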

Abstract

State-of-the-art vision pretraining methods rely on image-level self-distillation from object-centric datasets such as ImageNet, implicitly assuming each image contains a single object. This assumption does not always hold: many ImageNet images already contain multiple objects. Further, it limits scalability to scene-centric datasets that better mirror real-world complexity. We address these challenges by introducing Object-level Self-DIStillation (ODIS), a pretraining approach that shifts the self-distillation granularity from whole images to individual objects. Using object-aware cropping and masked attention, ODIS isolates object-specific regions, guiding the transformer toward semantically meaningful content and transforming a noisy, scene-level task into simpler object-level sub-tasks. We show that this approach improves visual representations both at the image and patch levels.…
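
The masked attention mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes that a single query token (for example a [CLS] token) attends over patch tokens and that patches outside a given object mask are excluded from the softmax. The function name `masked_attention_pool` and the boolean `object_mask` argument are hypothetical, and the paper's actual masking mechanism may differ.

```python
import math


def masked_attention_pool(query, keys, values, object_mask):
    """Single-query scaled dot-product attention restricted to the
    patches selected by object_mask.

    Assumption: off-object patches are removed from the softmax by
    setting their scores to -inf; this is an illustrative choice.
    """
    if not any(object_mask):
        raise ValueError("object_mask must select at least one patch")
    scale = 1.0 / math.sqrt(len(query))
    # Score every patch, then mask out the ones not on the object.
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else -math.inf
        for key, keep in zip(keys, object_mask)
    ]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


# Three patches; the third lies outside the object and is ignored.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pooled, weights = masked_attention_pool([1.0, 0.0], keys, values, [True, True, False])
```

Because the masked patch receives zero weight, the pooled vector depends only on on-object patches, which is the sense in which a scene-level input is reduced to an object-level one in this sketch.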

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications