ResFormer: Scaling ViTs with Multi-Resolution Training
Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, Yu-Gang Jiang

TL;DR
ResFormer is a multi-resolution training framework for Vision Transformers. It combines a scale consistency loss with a global-local positional embedding so that a single model performs well across a wide range of input resolutions, including ones not seen during training.
Contribution
The paper presents a multi-resolution training method with a scale consistency loss, together with a global-local positional embedding strategy, to improve the resolution scalability of ViTs.
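To make the idea of a positional embedding that adapts to input size concrete, here is a minimal sketch. It is not the paper's global-local strategy, whose details are not given on this page. It swaps in a plain 2D sinusoidal embedding computed on normalised patch coordinates, so the same relative location gets the same embedding at any grid size. The function name `sincos_position_embedding` and the `coordinate_scale` parameter are hypothetical.

```python
import math


def sincos_position_embedding(grid_size, dim, coordinate_scale=100.0):
    """Sinusoidal embeddings for a grid_size x grid_size patch grid.

    ASSUMPTION: this is a stand-in illustration, not the ResFormer
    global-local embedding. Patch centres are normalised to [0, 1], so
    the embedding depends on relative location rather than patch index.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4 (sin/cos for y and x)")
    quarter = dim // 4
    # Geometric frequency ladder, as in standard sinusoidal embeddings.
    freqs = [1.0 / (10000 ** (i / quarter)) for i in range(quarter)]

    embeddings = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Normalised centre of the patch; independent of grid size.
            y = (row + 0.5) / grid_size
            x = (col + 0.5) / grid_size
            vec = []
            for coord in (y, x):
                scaled = coordinate_scale * coord  # hypothetical scaling
                vec.extend(math.sin(scaled * f) for f in freqs)
                vec.extend(math.cos(scaled * f) for f in freqs)
            embeddings.append(vec)
    return embeddings


# The centre patch of a 3x3 grid and the single patch of a 1x1 grid
# share the normalised location (0.5, 0.5), so their embeddings match.
small = sincos_position_embedding(1, 8)
large = sincos_position_embedding(3, 8)
centre_matches = all(abs(a - b) < 1e-12 for a, b in zip(small[0], large[4]))
```

Because the embedding is a function of relative location, changing the input resolution only changes how densely that function is sampled, which is one simple way to avoid a mismatch between training and testing grid sizes.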
Findings
ResFormer-B-MR achieves 75.86% Top-1 accuracy on ImageNet at a test resolution of 96.
At this low resolution, ResFormer-B-MR outperforms DeiT-B by 48% in Top-1 accuracy.
The framework extends effectively to segmentation, detection, and video tasks.
Abstract
Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., the performance drops drastically when presented with input resolutions that are unseen during training. We introduce ResFormer, a framework that is built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions effectively, especially novel ones in testing, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. We conduct extensive experiments for image classification on ImageNet. The results provide…
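The training objective described in the abstract can be sketched as follows. The abstract does not give the exact form of the scale consistency loss, so this sketch assumes a mean squared difference between the class probabilities predicted at adjacent resolutions, added to the usual cross-entropy at every resolution. The function `multi_resolution_loss`, its `consistency_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one prediction against an integer class label."""
    return -math.log(softmax(logits)[label])


def multi_resolution_loss(logits_by_resolution, label, consistency_weight=1.0):
    """Classification loss at each resolution plus a consistency term.

    logits_by_resolution maps an input resolution (e.g. 96, 224) to the
    logits the model produced for the same image resized to that size.
    ASSUMPTION: the consistency term is a mean squared difference between
    softmax outputs at adjacent resolutions; the paper's loss may differ.
    """
    resolutions = sorted(logits_by_resolution)
    classification = sum(
        cross_entropy(logits_by_resolution[r], label) for r in resolutions
    )
    consistency = 0.0
    for low, high in zip(resolutions, resolutions[1:]):
        p_low = softmax(logits_by_resolution[low])
        p_high = softmax(logits_by_resolution[high])
        consistency += sum((a - b) ** 2 for a, b in zip(p_low, p_high)) / len(p_low)
    return classification + consistency_weight * consistency


# Hypothetical logits for one image replicated at three resolutions.
example_logits = {
    96: [1.0, 0.2, -0.5],
    160: [1.8, 0.1, -0.7],
    224: [2.5, -0.3, -1.0],
}
loss = multi_resolution_loss(example_logits, label=0)
```

When the predictions at all resolutions agree, the consistency term vanishes and the objective reduces to the sum of per-resolution cross-entropies. Disagreement between scales adds a penalty, which is the role the abstract assigns to the scale consistency loss.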
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
