Vision Transformer Adapter for Dense Predictions
Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

TL;DR
This paper introduces ViT-Adapter, a simple adapter module that equips plain Vision Transformers for dense prediction tasks. It reaches state-of-the-art results on COCO without extra detection data and without building vision-specific inductive biases into the backbone itself.
Contribution
The paper proposes ViT-Adapter, a pre-training-free adapter that injects image-related inductive biases into a plain ViT so that it performs competitively on dense prediction tasks.
Findings
Achieved 60.9 box AP and 53.0 mask AP on COCO test-dev.
ViT-Adapter lets a plain ViT reach performance comparable to vision-specific transformers.
No extra detection data needed for state-of-the-art results.
Abstract
This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation.…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · CCD and CMOS Imaging Sensors
Methods: Linear Layer · Adam · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Adapter · Absolute Position Encodings · Softmax · Dropout · Transformer
