AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Shoufa Chen; Chongjian Ge; Zhan Tong; Jiangliu Wang; Yibing Song; Jue Wang; Ping Luo

arXiv:2205.13535·cs.CV·October 18, 2022·261 cites

TL;DR

AdaptFormer is a lightweight, plug-and-play method that efficiently adapts pre-trained Vision Transformers to various image and video recognition tasks, significantly improving transferability without full fine-tuning.

Contribution

It introduces lightweight modules that enhance ViT transferability with minimal additional parameters, outperforming fully fine-tuned models across multiple visual recognition benchmarks.

Findings

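01. Achieves about 10% and 19% relative improvement over fully fine-tuned models on Something-Something v2 and HMDB51, respectively.

02. Adds less than 2% extra parameters to a ViT while leaving the pre-trained weights frozen.

03. Substantially improves ViT performance in target domains across image and video recognition tasks.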
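The "less than 2%" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a ViT-B/16-sized backbone (12 blocks, hidden width 768, roughly 86M parameters) and one bottleneck module per block with a hypothetical bottleneck width of 64. These sizes, and the helper names `adapter_params` and `overhead_ratio`, are illustrative assumptions rather than values taken from this page.

```python
# Rough parameter-overhead estimate for bottleneck modules added to a ViT.
# All sizes used here are illustrative assumptions, not values from the page.


def adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Parameters of one down-projection / up-projection pair, with biases."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # weight + bias
    up = bottleneck_dim * hidden_dim + hidden_dim        # weight + bias
    return down + up


def overhead_ratio(backbone_params: int, num_blocks: int,
                   hidden_dim: int, bottleneck_dim: int) -> float:
    """Extra parameters as a fraction of the frozen backbone's parameters."""
    extra = num_blocks * adapter_params(hidden_dim, bottleneck_dim)
    return extra / backbone_params


if __name__ == "__main__":
    ratio = overhead_ratio(
        backbone_params=86_000_000,  # assumed ViT-B/16 size
        num_blocks=12,
        hidden_dim=768,
        bottleneck_dim=64,           # hypothetical bottleneck width
    )
    print(f"extra parameters: {ratio:.2%}")
```

Under these assumptions the overhead stays below the 2% bound quoted above; a wider bottleneck or a smaller backbone would raise the ratio.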

Abstract

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A natural next step is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of its heavy computation and memory cost. Each model needs an independent and complete fine-tuning process to adapt to different tasks, which limits its transferability to different visual domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can adapt pre-trained ViTs to many different image and video tasks efficiently. It has several advantages over prior work. Firstly, AdaptFormer introduces lightweight modules that add less than 2% extra parameters to a ViT, while increasing the ViT's transferability without updating its original pre-trained parameters, significantly…
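The idea of adding a small trainable module while leaving the pre-trained weights untouched can be illustrated with a toy forward pass. The sketch below is a minimal pure-Python illustration under stated assumptions: a bottleneck branch (down-projection, ReLU, up-projection, scale) runs in parallel to a frozen layer and is added to its output, and the up-projection starts at zero so the adapted layer initially reproduces the frozen one. The names `frozen_layer`, `Adapter`, `adapted_layer`, and `scale` are hypothetical, and `frozen_layer` is a stand-in rather than a real ViT block.

```python
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def frozen_layer(x):
    """Stand-in for a frozen pre-trained sub-layer (here: fixed doubling)."""
    return [2.0 * v for v in x]


class Adapter:
    """Toy bottleneck branch: down-project, ReLU, up-project, scale."""

    def __init__(self, hidden_dim, bottleneck_dim, scale=0.1, seed=0):
        rng = random.Random(seed)
        # Trainable down-projection with small random values (assumed init).
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                     for _ in range(bottleneck_dim)]
        # Zero-initialized up-projection: the branch outputs zeros at first.
        self.up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]
        self.scale = scale

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]  # ReLU
        return [self.scale * v for v in matvec(self.up, h)]


def adapted_layer(x, adapter):
    """Frozen output plus the parallel adapter branch."""
    return [f + a for f, a in zip(frozen_layer(x), adapter(x))]


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5, 3.0]
    adapter = Adapter(hidden_dim=4, bottleneck_dim=2)
    # With a zero up-projection the adapted layer matches the frozen one.
    print(adapted_layer(x, adapter) == frozen_layer(x))
```

In this toy setup only `adapter.down` and `adapter.up` would be updated during training; `frozen_layer` never changes, which is what keeps the added parameter count small.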

Code & Models

Repositories

Models

linhuixiao/Awesome-Visual-Grounding (model, 1 like)

Videos

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition· slideslive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Layer Normalization · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Adam