Refiner: Refining Self-attention for Vision Transformers
Daquan Zhou, Yujun Shi, Bingyi Kang, Weihao Yu, Zihang Jiang, Yuan Li, Xiaojie Jin, Qibin Hou, Jiashi Feng

TL;DR
This paper introduces 'refiner', a simple method that improves self-attention in Vision Transformers by expanding the multi-head attention maps to increase their diversity and applying convolutions to augment their local patterns, significantly boosting image classification accuracy.
Contribution
The paper proposes a novel, straightforward scheme called refiner that refines self-attention maps in ViTs, enhancing their data efficiency and accuracy.
Findings
Achieves 86% top-1 accuracy on ImageNet with 81M parameters.
Refiner improves attention diversity and local pattern recognition.
Significantly enhances ViT performance without complex architecture changes.
Abstract
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are…
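The two steps named in the abstract, attention expansion followed by convolution over the attention maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `refine_attention`, the parameter names, the depthwise (per-head) zero-padded convolution, and the final reduction back to the original number of heads are all assumptions made for illustration; the abstract only states that the maps are projected to a higher-dimensional space and then convolved.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_attention(attn, w_expand, kernel, w_reduce):
    """Hypothetical sketch of refining attention maps.

    attn:     (H, N, N) attention maps, one per head.
    w_expand: (E, H) linear map mixing H heads into E > H expanded heads.
    kernel:   (E, k, k) one k x k kernel per expanded head (k odd).
    w_reduce: (H, E) linear map back to H heads (an assumption, so the
              output can be applied to H value heads as usual).
    """
    n = attn.shape[-1]

    # Attention expansion: each expanded map is a linear mix of the heads.
    expanded = np.einsum("eh,hij->eij", w_expand, attn)

    # Local pattern augmentation: depthwise 2D convolution over each
    # expanded N x N map, zero-padded so the map size is preserved.
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(expanded, ((0, 0), (pad, pad), (pad, pad)))
    conv = np.zeros_like(expanded)
    for di in range(k):
        for dj in range(k):
            weight = kernel[:, di, dj, None, None]  # shape (E, 1, 1)
            conv += weight * padded[:, di:di + n, dj:dj + n]

    # Reduction (assumed): project the E refined maps back to H heads.
    return np.einsum("he,eij->hij", w_reduce, conv)


# Usage: 2 heads expanded to 4, over a sequence of 5 tokens.
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(2, 5, 5)))
w_expand = rng.normal(size=(4, 2))
kernel = rng.normal(size=(4, 3, 3))
w_reduce = rng.normal(size=(2, 4))
refined = refine_attention(attn, w_expand, kernel, w_reduce)
print(refined.shape)
```

In this sketch the refined maps keep the shape of the input maps, so they could be multiplied with the value tensor in the usual way. In a trained model the three weight tensors would be learned rather than sampled at random.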
Taxonomy
Topics: Advanced Neural Network Applications · Currency Recognition and Detection · CCD and CMOS Imaging Sensors
Methods: Softmax · Linear Layer
