Self-supervised pretraining for an iterative image size agnostic vision transformer
Nedyalko Prisadnikov, Danda Pani Paudel, Yuqian Fu, Luc Van Gool

TL;DR
This paper introduces a self-supervised learning framework for image-size agnostic vision transformers, enabling efficient large-scale pretraining across various resolutions while maintaining constant computational costs.
Contribution
It presents a novel sequential-to-global SSL method based on DINO's self-distillation objective, supported by an efficient integral-image patch extraction method, to pretrain resolution-agnostic vision transformers.
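As a rough illustration of why an integral image makes multi-zoom patch extraction cheap, here is a minimal NumPy sketch of the generic summed-area-table technique. It is not the paper's implementation: the function names (`integral_image`, `extract_patch`), the average-pooling choice, and the requirement that the region size be divisible by the output grid are all assumptions made for this example. The point it shows is that, once the table is built, a fixed-size patch can be read out from a region of any size using a number of lookups that depends only on the output grid, not on the region.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading zero row and column.

    img has shape (H, W) or (H, W, C); the result has shape (H+1, W+1[, C]).
    """
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    pad = ((1, 0), (1, 0)) + ((0, 0),) * (img.ndim - 2)
    return np.pad(ii, pad)

def extract_patch(ii, top, left, size, out=4):
    """Average-pool a size x size region into an out x out patch.

    Hypothetical helper: assumes `size` is divisible by `out` and that the
    region lies fully inside the image.
    """
    cell = size // out
    ys = top + cell * np.arange(out + 1)
    xs = left + cell * np.arange(out + 1)
    corners = ii[np.ix_(ys, xs)]  # table values at the cell-grid corners
    # Box sum of every cell from its four corners (inclusion-exclusion).
    sums = (corners[1:, 1:] - corners[:-1, 1:]
            - corners[1:, :-1] + corners[:-1, :-1])
    return sums / (cell * cell)

img = np.arange(64, dtype=np.float64).reshape(8, 8)
ii = integral_image(img)
coarse = extract_patch(ii, top=0, left=0, size=8, out=2)  # whole image, zoomed out
fine = extract_patch(ii, top=2, left=2, size=4, out=2)    # smaller region, zoomed in
```

Both `coarse` and `fine` have the same 2 x 2 shape even though they cover regions of different sizes, which is the property a fixed-size context of multi-zoom patches relies on.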
Findings
Achieved competitive ImageNet-1K performance across resolutions.
Maintained constant computational budget regardless of input size.
Enabled large-scale pretraining for resolution-agnostic vision encoders.
Abstract
Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We…
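For readers unfamiliar with the self-distillation objective mentioned in the abstract, the following is a minimal NumPy sketch of the generic DINO loss: a cross-entropy between a centered, sharpened teacher distribution and a student distribution, with the teacher updated as an exponential moving average of the student. How the paper pairs sequential and global views is not stated in this summary, so the sketch takes two arbitrary batches of logits. The function names, the temperature values, and the momentum value are assumptions chosen for illustration.

```python
import numpy as np

def log_softmax(logits, temp):
    """Numerically stable log-softmax with temperature."""
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(teacher, student), averaged over the batch.

    The teacher output is centered and sharpened with a lower temperature;
    in actual training no gradient flows through the teacher branch.
    """
    p_teacher = np.exp(log_softmax(teacher_logits - center, t_teacher))
    log_p_student = log_softmax(student_logits, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_param, student_param, momentum=0.996):
    """Teacher weights follow an exponential moving average of the student."""
    return momentum * teacher_param + (1.0 - momentum) * student_param

rng = np.random.default_rng(0)
student_out = rng.normal(size=(4, 16))   # e.g. logits from one view
teacher_out = rng.normal(size=(4, 16))   # e.g. logits from another view
center = teacher_out.mean(axis=0, keepdims=True)
loss = dino_loss(student_out, teacher_out, center)
```

In a real training loop the center would itself be tracked as a running mean of teacher outputs, and the loss would be summed over pairs of views; both are omitted here to keep the sketch short.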
