PSViT: Better Vision Transformer via Token Pooling and Attention Sharing

Boyu Chen; Peixia Li; Baopu Li; Chuming Li; Lei Bai; Chen Lin; Ming Sun; Junjie Yan; Wanli Ouyang

arXiv:2108.03428·cs.CV·August 10, 2021·20 cites

TL;DR

This paper introduces PSViT, a novel vision transformer architecture that reduces redundancy through token pooling and attention sharing, leading to improved accuracy and efficiency in image recognition.

Contribution

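The paper proposes a ViT variant with token pooling and attention sharing. The number of tokens in each layer and the choice of layers that share attention are treated as hyper-parameters learned automatically from data, which improves feature representation and the speed-accuracy trade-off.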

Findings

01. Achieves up to 6.6% accuracy improvement in ImageNet classification compared with DeiT.

02. Reduces redundancy in tokens at the spatial level and in attention maps across transformer layers.

03. Enhances feature representation and improves the speed-accuracy trade-off.
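
To make the efficiency finding concrete, the following is a rough back-of-the-envelope sketch, not the paper's cost model. It counts only the two matrix products inside self-attention (scores and weighted sum) and ignores projections and the feed-forward block. The function name `attention_matmul_cost` and the example sizes (196 tokens, width 384) are illustrative assumptions.

```python
def attention_matmul_cost(num_tokens, dim, shared=False):
    """Approximate multiply-adds for one self-attention layer.

    Counts only the Q @ K^T product and the attention @ V product.
    If `shared` is True, the attention map is assumed to be reused from
    the previous layer, so the Q @ K^T product is skipped.
    """
    score_cost = 0 if shared else num_tokens * num_tokens * dim
    weighted_sum_cost = num_tokens * num_tokens * dim
    return score_cost + weighted_sum_cost


# Illustrative sizes only (assumed, not taken from the source).
full = attention_matmul_cost(196, 384)
pooled = attention_matmul_cost(98, 384)          # half as many tokens
reused = attention_matmul_cost(196, 384, shared=True)

print(full // pooled)   # halving the token count cuts this cost by 4x
print(full // reused)   # reusing the attention map cuts this cost by 2x
```

Under these assumptions the cost is quadratic in the token count, which is why reducing tokens has a larger effect on this term than reusing attention maps.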

Abstract

In this paper, we observe two levels of redundancy when applying vision transformers (ViT) to image recognition. First, fixing the number of tokens throughout the whole network produces redundant features at the spatial level. Second, the attention maps of different transformer layers are redundant. Based on these observations, we propose PSViT, a ViT with token Pooling and attention Sharing, to reduce this redundancy, enhance the feature representation ability, and achieve a better speed-accuracy trade-off. Specifically, in PSViT, token pooling is defined as the operation that decreases the number of tokens at the spatial level. In addition, attention sharing is built between neighboring transformer layers to reuse attention maps that are strongly correlated across adjacent layers. Then, a compact set of the possible combinations for different…
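
The two operations described in the abstract can be illustrated with a minimal pure-Python sketch. This is a toy under stated assumptions, not the authors' implementation: pooling is plain averaging of consecutive tokens (the paper's actual pooling operator is not given in this excerpt), attention uses the tokens directly as queries, keys, and values with no learned projections, and all function names (`pool_tokens`, `attention_map`, `run_layers`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(tokens):
    """Attention weights with tokens used as both queries and keys
    (assumption: no learned projections, single head)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        for q in tokens
    ]


def apply_attention(attn, tokens):
    """Weighted sum of tokens (used as values) for every query row."""
    dim = len(tokens[0])
    return [
        [sum(w * t[d] for w, t in zip(row, tokens)) for d in range(dim)]
        for row in attn
    ]


def pool_tokens(tokens, stride=2):
    """Token pooling stand-in: average each group of `stride` consecutive
    tokens, reducing the token count by a factor of `stride`."""
    pooled = []
    for start in range(0, len(tokens) - stride + 1, stride):
        group = tokens[start:start + stride]
        pooled.append([sum(col) / stride for col in zip(*group)])
    return pooled


def run_layers(tokens, share_flags):
    """Run one attention step per flag. A True flag reuses the previous
    layer's attention map instead of recomputing it.
    Returns the output tokens and how many maps were actually computed."""
    attn = None
    computed = 0
    for share in share_flags:
        if attn is None or not share:
            attn = attention_map(tokens)
            computed += 1
        tokens = apply_attention(attn, tokens)
    return tokens, computed


tokens = [[float(i), float(i % 3)] for i in range(8)]
out, computed = run_layers(tokens, [False, True, False, True])
pooled = pool_tokens(out)
print(len(out), len(pooled), computed)
```

In this sketch a shared attention map can only be reused while the token count is unchanged, so pooling is applied after the group of sharing layers rather than between them.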

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Neural Network Applications · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Dropout · Feedforward Network · Attention Dropout · Data-efficient Image Transformer