You Only Need Less Attention at Each Stage in Vision Transformers

Shuoxi Zhang; Hanpeng Liu; Stephen Lin; Kun He

arXiv:2406.00427·cs.CV·June 4, 2024

TL;DR

This paper introduces LaViT (Less-Attention Vision Transformer), which reduces computational cost by computing only a few attention operations at each stage and deriving attention for the remaining layers through attention transformations, improving efficiency without sacrificing performance.

Contribution

The paper proposes a ViT architecture that computes attention scores in only a few layers of each stage and obtains the rest via attention transformations, addressing both the quadratic cost of self-attention and the attention saturation issue in deep ViTs.

Findings

01. Achieves superior efficiency over traditional ViTs.

02. Performs well across classification, detection, and segmentation tasks.

03. Reduces computational complexity while maintaining high accuracy.

Abstract

The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies, the computational complexity grows quadratically with the number of tokens, which is a major hindrance to the practical application of ViTs. Moreover, the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly, we argue against the necessity of computing the attention scores in every layer, and we propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that…
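The quadratic growth mentioned in the abstract can be illustrated with a rough operation count. The sketch below is not taken from the paper: the function names, the 16-pixel patch size, the 64-dimensional head, and the 12-layer stage are illustrative assumptions, and only the cost of the query-key dot products is counted. The cost of the attention transformations used in the remaining layers is not modeled, since the abstract excerpt does not specify it.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Tokens obtained by splitting a square image into non-overlapping square patches."""
    per_side = image_size // patch_size
    return per_side * per_side


def score_macs(tokens: int, head_dim: int) -> int:
    """Multiply-accumulates for one Q @ K^T score matrix: (N x d) times (d x N)."""
    return tokens * tokens * head_dim


def stage_score_macs(layers_with_attention: int, tokens: int, head_dim: int) -> int:
    """Score cost for a stage in which only some layers compute Q @ K^T.

    Hypothetical helper: whatever replaces attention in the other layers
    is deliberately left out of this count.
    """
    return layers_with_attention * score_macs(tokens, head_dim)


if __name__ == "__main__":
    head_dim = 64  # assumed per-head dimension
    for image_size in (224, 448):
        n = num_tokens(image_size, patch_size=16)  # assumed patch size
        print(f"{image_size}px -> {n} tokens, {score_macs(n, head_dim)} score MACs per layer")

    n = num_tokens(224, 16)
    full = stage_score_macs(12, n, head_dim)   # every layer computes scores
    fewer = stage_score_macs(2, n, head_dim)   # only two layers compute scores
    print(f"score MACs, 12 vs 2 attention layers: {full} vs {fewer}")
```

Doubling the image side quadruples the token count and multiplies the per-layer score cost by sixteen, which is why skipping the score computation in most layers of a stage is attractive.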

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention