Vision Transformer Adapter for Dense Predictions

Zhe Chen; Yuchen Duan; Wenhai Wang; Junjun He; Tong Lu; Jifeng Dai; Yu Qiao

arXiv:2205.08534 · cs.CV · February 14, 2023 · 204 cites

TL;DR

This paper introduces ViT-Adapter, a simple module that equips plain Vision Transformers for dense prediction tasks, achieving state-of-the-art results without extra detection data and without building vision-specific inductive biases into the backbone architecture.

Contribution

The paper proposes ViT-Adapter, a pre-training-free adapter that introduces image-related inductive biases into a plain ViT, enabling it to perform competitively with vision-specific transformers on dense prediction tasks.

Findings

01

ViT-Adapter reaches 60.9 box AP and 53.0 mask AP on COCO test-dev.

02

ViT-Adapter allows a plain ViT to match the performance of vision-specific transformers.

03

State-of-the-art results are reached without extra detection data.

Abstract

This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation.…
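The abstract's central idea, a plain backbone plus a separate adapter that supplies image-related priors at transfer time, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the gradient-based prior, and the scalar gate are hypothetical and do not come from the paper or its code. The sketch only shows the pattern of computing a locality-aware feature outside the backbone and adding it to the backbone's tokens.

```python
# Toy sketch only. Function names, the gradient-based prior, and the scalar
# gate are illustrative assumptions, not the ViT-Adapter implementation.
# Image height and width are assumed to be divisible by the patch size.

def patch_tokens(image, patch):
    """Stand-in for a plain ViT patch embedding: mean intensity per patch."""
    tokens = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(sum(block) / len(block))
    return tokens


def spatial_prior(image, patch):
    """Hypothetical local prior: mean absolute horizontal gradient per patch."""
    width = len(image[0])
    # Difference to the right-hand neighbour; the last column has none, so use 0.
    grad = [[abs(row[c + 1] - row[c]) if c + 1 < width else 0
             for c in range(width)]
            for row in image]
    return patch_tokens(grad, patch)


def inject(tokens, prior, gate=0.0):
    """Add the prior to the backbone tokens, scaled by a gate.

    With gate=0.0 the backbone tokens pass through unchanged.
    """
    return [t + gate * p for t, p in zip(tokens, prior)]


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
tokens = patch_tokens(image, 2)
prior = spatial_prior(image, 2)
print(inject(tokens, prior))            # gate 0: identical to the backbone tokens
print(inject(tokens, prior, gate=1.0))  # prior fully added to the tokens
```

Starting the gate at zero is one plausible design choice: the adapted model initially behaves exactly like the backbone, and the prior's influence can grow during fine-tuning.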

Code & Models

Videos

Vision Transformer Adapter for Dense Predictions · SlidesLive

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · CCD and CMOS Imaging Sensors

Methods: Linear Layer · Adam · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Adapter · Absolute Position Encodings · Softmax · Dropout · Transformer