Refiner: Refining Self-attention for Vision Transformers

Daquan Zhou; Yujun Shi; Bingyi Kang; Weihao Yu; Zihang Jiang; Yuan Li; Xiaojie Jin; Qibin Hou; Jiashi Feng

arXiv:2106.03714·cs.CV·June 8, 2021·41 cites

TL;DR

This paper introduces 'refiner', a simple method that improves self-attention in Vision Transformers by expanding the attention maps to increase their diversity and by applying convolutions to augment their local patterns, reaching 86% top-1 accuracy on ImageNet with 81M parameters.

Contribution

The paper proposes a novel, straightforward scheme called refiner that refines self-attention maps in ViTs, enhancing their data efficiency and accuracy.

Findings

01. Achieves 86% top-1 accuracy on ImageNet with 81M parameters.
02. Refiner improves attention diversity and local pattern recognition.
03. Significantly enhances ViT performance without complex architecture changes.

Abstract

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are…
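The two operations named in the abstract, attention expansion across heads and convolution over each attention map, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `refine_attention`, the weight names `w_expand`, `kernels`, and `w_reduce`, the per-head (depthwise) 3x3 convolution with zero padding, the final projection back to the original head count, and the choice to refine the maps after softmax are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_attention(attn, w_expand, kernels, w_reduce):
    """Hypothetical refinement of multi-head attention maps.

    attn:     (H, N, N)      attention maps, one per head
    w_expand: (H_exp, H)     linear projection across the head dimension
    kernels:  (H_exp, k, k)  one k x k kernel per expanded head (k odd)
    w_reduce: (H, H_exp)     projection back to H heads (an assumption)
    """
    n = attn.shape[-1]
    # Attention expansion: mix the H maps into H_exp maps.
    expanded = np.einsum("eh,hij->eij", w_expand, attn)
    # Local augmentation: depthwise 2D cross-correlation with zero padding.
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(expanded, ((0, 0), (pad, pad), (pad, pad)))
    conv = np.zeros_like(expanded)
    for di in range(k):
        for dj in range(k):
            weight = kernels[:, di, dj, None, None]  # shape (H_exp, 1, 1)
            conv += weight * padded[:, di:di + n, dj:dj + n]
    # Reduce to the original number of heads so the maps can multiply V.
    return np.einsum("he,eij->hij", w_reduce, conv)


# Toy usage: 2 heads, 5 tokens, head dimension 4, expansion to 6 heads.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 5, 4))
key = rng.normal(size=(2, 5, 4))
attn = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(4.0))
refined = refine_attention(
    attn,
    w_expand=rng.normal(size=(6, 2)),
    kernels=rng.normal(size=(6, 3, 3)),
    w_reduce=rng.normal(size=(2, 6)),
)
print(refined.shape)
```

In this sketch the refined maps keep the shape of the input maps, so they could replace the original attention weights when aggregating values. With random weights the rows no longer sum to one; whether and how to renormalize is left open here.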

Code & Models

Repositories

zhoudaquan/Refiner_ViT (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Currency Recognition and Detection · CCD and CMOS Imaging Sensors

Methods: Softmax · Linear Layer