Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Luke Melas-Kyriazi

arXiv:2105.02723·cs.CV·May 7, 2021·80 cites

TL;DR

This paper investigates whether attention mechanisms are essential in vision transformers by replacing attention layers with feed-forward layers, revealing that non-attention components significantly contribute to their strong performance on ImageNet.

Contribution

The study demonstrates that a feed-forward-only architecture can achieve competitive accuracy, challenging the assumption that attention is crucial for vision transformer success.

Findings

01. A ViT/DeiT-base-sized feed-forward-only model reaches 74.9% top-1 accuracy on ImageNet, compared to 77.9% for ViT and 79.9% for DeiT.

02. Attention layers may not be the primary factor in vision transformer performance.

03. Other components, such as patch embedding, could be more influential than previously assumed.

Abstract

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch…
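The alternating structure described in the abstract can be illustrated with a minimal NumPy sketch of one block: a feed-forward layer over the patch dimension (taking the place of attention) followed by a feed-forward layer over the feature dimension. This is not the author's implementation. The residual connections, the layer normalization placement, the GELU activation, the initialization, and all sizes (`num_patches`, `dim`, the hidden widths) are assumptions borrowed from a typical ViT block, and the names `init_mlp`, `mlp`, and `feedforward_only_block` are hypothetical.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (activation choice is an assumption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # Normalize over the last axis, without learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_mlp(rng, dim, hidden):
    # Two-layer feed-forward parameters mapping dim -> hidden -> dim.
    return {
        "w1": rng.normal(0.0, 0.02, (dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, (hidden, dim)), "b2": np.zeros(dim),
    }


def mlp(params, x):
    # Feed-forward layer applied over the last axis of x.
    return gelu(x @ params["w1"] + params["b1"]) @ params["w2"] + params["b2"]


def feedforward_only_block(x, patch_mlp, feature_mlp):
    # x has shape (num_patches, dim).
    # Step 1: feed-forward over the patch dimension. Transposing makes the
    # patch axis the last axis, so the MLP mixes information across patches.
    x = x + mlp(patch_mlp, layer_norm(x).T).T
    # Step 2: feed-forward over the feature dimension, applied to each
    # patch independently, as in a standard transformer block.
    x = x + mlp(feature_mlp, layer_norm(x))
    return x


# Illustrative sizes only; these are not the paper's configuration.
num_patches, dim = 16, 8
rng = np.random.default_rng(0)
patch_mlp = init_mlp(rng, num_patches, hidden=32)
feature_mlp = init_mlp(rng, dim, hidden=32)

x = rng.normal(size=(num_patches, dim))
out = feedforward_only_block(x, patch_mlp, feature_mlp)
print(out.shape)  # (16, 8)
```

In this sketch the patch-dimension layer is the only place where patches exchange information, which is the role attention plays in a standard vision transformer. Unlike attention, its mixing weights are fixed after training and tied to a fixed number of patches.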

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Dropout · Softmax · Dense Connections