Self-supervised pretraining for an iterative image size agnostic vision transformer

Nedyalko Prisadnikov; Danda Pani Paudel; Yuqian Fu; Luc Van Gool

arXiv:2604.20392·cs.CV·April 23, 2026

TL;DR

This paper introduces a self-supervised learning framework for image-size agnostic vision transformers, enabling efficient large-scale pretraining at a constant computational cost regardless of input resolution.

Contribution

The paper presents a novel sequential-to-global SSL method based on DINO's self-distillation objective, supported by an efficient integral-image patch extraction method, to pretrain resolution-agnostic vision transformers.

Findings

01. Achieved competitive ImageNet-1K performance across resolutions.

02. Maintained a constant computational budget regardless of input size.

03. Enabled large-scale pretraining for resolution-agnostic vision encoders.

Abstract

Vision Transformers (ViTs) dominate self-supervised learning (SSL). While they have proven highly effective for large-scale pretraining, they are computationally inefficient and scale poorly with image size. Consequently, foundational models like DINO are constrained to low-resolution processing. A recent foveal-inspired transformer achieves resolution agnosticism by iteratively processing a fixed-size context of multi-zoom patches. This model demonstrated promising results via supervised learning, utilizing a sequential, recurrent-like process without backpropagation through time. To unlock its potential as a foundational backbone, we introduce a novel sequential-to-global SSL framework based on DINO's self-distillation objective. Supported by an efficient integral-image patch extraction method, our approach enables large-scale pretraining for image-size agnostic vision encoders. We…
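The abstract does not describe the integral-image patch extraction in detail, so the following is only a generic sketch of the underlying summed-area-table idea, not the authors' implementation. It assumes a single-channel image, average pooling inside each cell, and a window at least as large as the output grid. The names `integral_image`, `box_sum`, and `extract_patch` are hypothetical and chosen for illustration. The point it illustrates is that once the table is built, a window of any size can be reduced to a fixed-size patch with four lookups per output cell.

```python
def integral_image(img):
    """Summed-area table with a zero top row and left column.

    img is a list of rows of numbers (a single-channel image).
    sat[y][x] holds the sum of all pixels in rows < y and columns < x.
    """
    h, w = len(img), len(img[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, y0, x0, y1, x1):
    """Sum of pixels in rows y0..y1-1 and columns x0..x1-1 (four lookups)."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def extract_patch(sat, top, left, size, out=4):
    """Average-pool a size x size window down to out x out cells.

    Assumes size >= out so that no cell is empty. Each output cell costs
    four lookups, so the work depends on `out`, not on `size`.
    """
    patch = []
    for i in range(out):
        y0 = top + (i * size) // out
        y1 = top + ((i + 1) * size) // out
        row = []
        for j in range(out):
            x0 = left + (j * size) // out
            x1 = left + ((j + 1) * size) // out
            area = (y1 - y0) * (x1 - x0)
            row.append(box_sum(sat, y0, x0, y1, x1) / area)
        patch.append(row)
    return patch


# Toy 8x8 image whose pixel value is its row-major index.
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
sat = integral_image(image)

# Two zoom levels around the same centre, both reduced to a 2x2 patch.
fine = extract_patch(sat, top=2, left=2, size=4, out=2)
coarse = extract_patch(sat, top=0, left=0, size=8, out=2)
```

Here `fine` and `coarse` have the same shape even though they cover windows of different sizes, which is the property a fixed-size context of multi-zoom patches relies on. How the paper selects window locations and zoom levels is not stated in the abstract and is not modelled here.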
