LocalViT: Analyzing Locality in Vision Transformers
Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, Michele Magno, Luca Benini, Luc Van Gool

TL;DR
This paper systematically investigates the role of locality mechanisms in vision transformers, demonstrating that adding locality improves performance across various architectures with minimal additional computational cost.
Contribution
The paper introduces a simple yet effective locality mechanism into vision transformers and validates its benefits through extensive experiments and architecture generalization.
Findings
Locality mechanisms improve transformer performance on ImageNet.
Proper design choices enhance the effectiveness of locality integration.
Locality-enhanced transformers outperform baseline models with minimal extra cost.
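To give a rough sense of why the extra cost can be small, the sketch below counts weights under stated assumptions: the locality mechanism is taken to be a single 3x3 depth-wise convolution applied to the hidden channels of the feed-forward network, and the embedding dimension (192) and expansion ratio (4) are illustrative values that do not come from this summary. The function names are hypothetical, and biases and normalization layers are ignored.

```python
def ffn_weights(dim, expansion):
    """Weights in the two linear layers of a standard feed-forward network."""
    hidden = dim * expansion
    return dim * hidden + hidden * dim


def depthwise_weights(dim, expansion, kernel=3):
    """Weights in one k x k depth-wise convolution over the hidden channels."""
    # One k x k filter per hidden channel, with no mixing across channels.
    return kernel * kernel * dim * expansion


# Illustrative sizes (assumptions, not values reported in this summary).
dim, expansion = 192, 4
base = ffn_weights(dim, expansion)         # 294912
extra = depthwise_weights(dim, expansion)  # 6912
print(extra / base)                        # 0.0234375
```

Under these assumptions the ratio simplifies to 9 / (2 * dim), so the relative overhead shrinks as the embedding dimension grows.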
Abstract
The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated from machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be modelled well by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated through carefully designed controlled experiments. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) A wide range of design choices (activation function, layer placement,…
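The idea of placing locality inside the feed-forward network can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation. It assumes the locality mechanism is a 3x3 depth-wise convolution between the two per-token linear layers (mirroring an inverted residual block), uses ReLU as a placeholder activation, omits biases and normalization, and assumes the sequence holds only patch tokens with no class token. All function and parameter names are hypothetical.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding.

    x: (H, W, D) feature map; kernels: (3, 3, D), one filter per channel.
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered independently of the others.
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out


def local_feed_forward(tokens, grid_hw, w_expand, dw_kernels, w_project):
    """tokens: (N, C) patch tokens with N == H * W."""
    h, w = grid_hw
    hidden = np.maximum(tokens @ w_expand, 0.0)   # (N, D) per-token expansion
    hidden = hidden.reshape(h, w, -1)             # restore the 2D patch grid
    hidden = np.maximum(depthwise_conv3x3(hidden, dw_kernels), 0.0)
    hidden = hidden.reshape(h * w, -1)            # back to a token sequence
    return hidden @ w_project                     # (N, C) per-token projection


rng = np.random.default_rng(0)
h, w, c, d = 4, 4, 8, 32
tokens = rng.standard_normal((h * w, c))
w_expand = rng.standard_normal((c, d))
dw_kernels = rng.standard_normal((3, 3, d))
w_project = rng.standard_normal((d, c))
out = local_feed_forward(tokens, (h, w), w_expand, dw_kernels, w_project)
print(out.shape)  # (16, 8)
```

Without the depth-wise step, each output token depends only on the input token at the same position. With it, each output token also depends on its immediate neighbours in the patch grid, which is the local information exchange the abstract describes.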
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: LocalViT · Convolution
