ResFormer: Scaling ViTs with Multi-Resolution Training

Rui Tian; Zuxuan Wu; Qi Dai; Han Hu; Yu Qiao; Yu-Gang Jiang

arXiv:2212.00776 · cs.CV · April 4, 2023

TL;DR

ResFormer introduces a multi-resolution training framework for Vision Transformers, enhancing their ability to perform well across a wide range of input resolutions through scale consistency and adaptive positional embeddings.

Contribution

The paper presents a multi-resolution training method that combines a scale consistency loss with a global-local positional embedding strategy to improve the resolution scalability of ViTs.
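To make concrete what "a positional embedding that adapts to the input size" can mean, here is a minimal sketch of a sine-cosine embedding computed from normalized patch coordinates, so the same function produces a table for any token grid. This is not the paper's global-local strategy, whose details are not given on this page. The function name `sincos_position_embedding`, the patch size of 16, the embedding width of 64, and the normalization to [0, 1] are all assumptions made for illustration.

```python
import math

def sincos_position_embedding(grid_h, grid_w, dim):
    """Return a (grid_h * grid_w) x dim table of 2D sine-cosine embeddings.

    Hypothetical helper: coordinates are normalized to [0, 1], so the
    embedding is defined for any grid size without a learned lookup table.
    """
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos for x and y)"
    quarter = dim // 4
    # Geometric frequency schedule, as in standard sinusoidal embeddings.
    freqs = [1.0 / (10000 ** (i / quarter)) for i in range(quarter)]
    table = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Normalized position of this patch inside the image.
            y = row / max(grid_h - 1, 1)
            x = col / max(grid_w - 1, 1)
            vec = []
            for f in freqs:
                vec += [math.sin(x * f), math.cos(x * f),
                        math.sin(y * f), math.cos(y * f)]
            table.append(vec)
    return table

# Assumed patch size of 16: a 96-pixel input gives a 6x6 token grid,
# a 224-pixel input gives a 14x14 token grid.
emb_96 = sincos_position_embedding(96 // 16, 96 // 16, 64)
emb_224 = sincos_position_embedding(224 // 16, 224 // 16, 64)
```

Because positions are normalized, the corner patches of the 6x6 and 14x14 grids receive identical vectors, and interior positions vary continuously between them as the grid grows.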

Findings

01. ResFormer-B-MR achieves 75.86% Top-1 accuracy on ImageNet when tested at a resolution of 96.

02. ResFormer outperforms DeiT-B by 48% in Top-1 accuracy at low test resolution (96).

03. The framework extends effectively to segmentation, detection, and video tasks.

Abstract

Vision Transformers (ViTs) have achieved overwhelming success, yet they scale poorly across resolutions: performance drops drastically on input resolutions unseen during training. We introduce ResFormer, a framework built on the seminal idea of multi-resolution training to improve performance across a wide spectrum of mostly unseen testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to encourage information exchange across scales. More importantly, to alternate among varying resolutions effectively, especially novel ones in testing, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. We conduct extensive experiments for image classification on ImageNet. The results provide…
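The training objective described in the abstract (the same image replicated at several resolutions, plus a consistency term across scales) can be sketched in a few lines. This page does not specify the exact form of the scale consistency loss, so the sketch below assumes a KL divergence from the higher-resolution prediction to the next lower one. The function names, the `consistency_weight` parameter, and the resolutions 224, 160, and 96 are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def multi_resolution_loss(logits_by_res, label, consistency_weight=1.0):
    """logits_by_res maps resolution -> class logits for the same image.

    Assumed form: per-resolution cross-entropy plus a KL term that pulls
    each lower-resolution prediction toward the next higher-resolution one.
    """
    resolutions = sorted(logits_by_res, reverse=True)  # highest first
    ce = sum(cross_entropy(logits_by_res[r], label) for r in resolutions)
    consistency = 0.0
    for high, low in zip(resolutions, resolutions[1:]):
        teacher = softmax(logits_by_res[high])
        student = softmax(logits_by_res[low])
        consistency += kl_divergence(teacher, student)
    return ce + consistency_weight * consistency

# Hypothetical logits for one image evaluated at three resolutions.
example = {224: [2.0, 0.5, -1.0], 160: [1.8, 0.6, -0.9], 96: [1.0, 0.9, -0.2]}
loss = multi_resolution_loss(example, label=0)
```

When the predictions at every resolution agree, the consistency term vanishes and the loss reduces to the sum of the per-resolution cross-entropies. In a real training loop the higher-resolution prediction would typically be treated as a fixed target, which is a design choice this sketch does not model.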

Code & Models

Repositories

ruitian12/resformer (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection