LocalViT: Analyzing Locality in Vision Transformers

Yawei Li; Kai Zhang; Jiezhang Cao; Radu Timofte; Michele Magno; Luca Benini; Luc Van Gool

arXiv:2104.05707 · cs.CV · February 13, 2025 · 284 cites

TL;DR

This paper systematically investigates the role of locality mechanisms in vision transformers, demonstrating that adding locality improves performance across various architectures with minimal additional computational cost.

Contribution

The paper introduces a simple yet effective locality mechanism into vision transformers and validates its benefits through extensive experiments and architecture generalization.

Findings

01. Locality mechanisms improve transformer performance on ImageNet.
02. Proper design choices enhance the effectiveness of locality integration.
03. Locality-enhanced transformers outperform baseline models with minimal extra cost.

Abstract

The aim of this paper is to study the influence of locality mechanisms in vision transformers. Transformers originated in machine translation and are particularly good at modelling long-range dependencies within a long sequence. Although the self-attention mechanism of transformers models the global interaction between token embeddings well, it lacks a locality mechanism for information exchange within a local region. In this paper, the locality mechanism is systematically investigated through carefully designed controlled experiments. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) A wide range of design choices (activation function, layer placement,…
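The idea of placing a locality mechanism inside the feed-forward network can be illustrated with a minimal NumPy sketch. It assumes the locality operator is a 3x3 depth-wise convolution placed between the two point-wise projections, as in an inverted residual block. The function names, the ReLU activation, the zero padding, and the omission of the class token and the residual connection are simplifying assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def relu(x):
    # Placeholder activation; the choice of activation is an assumption here.
    return np.maximum(x, 0.0)


def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding.

    x:       (H, W, C) feature map.
    kernels: (3, 3, C), one 3x3 filter per channel.
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered independently: spatial mixing only,
            # no mixing across channels.
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out


def local_feed_forward(tokens, grid_hw, w_expand, dw_kernels, w_project):
    """Feed-forward network with a depth-wise convolution in the hidden layer.

    tokens:     (H*W, C) image tokens in row-major order
                (a class token, if any, is assumed to be handled by the caller).
    grid_hw:    (H, W) shape of the token grid.
    w_expand:   (C, C_hidden) point-wise expansion weights.
    dw_kernels: (3, 3, C_hidden) depth-wise filters.
    w_project:  (C_hidden, C) point-wise projection weights.
    """
    h, w = grid_hw
    hidden = relu(tokens @ w_expand)                      # 1x1 expand, per token
    hidden = hidden.reshape(h, w, -1)                     # sequence -> 2D grid
    hidden = relu(depthwise_conv3x3(hidden, dw_kernels))  # local information exchange
    hidden = hidden.reshape(h * w, -1)                    # 2D grid -> sequence
    return hidden @ w_project                             # 1x1 project, per token
```

In this sketch, a depth-wise kernel that is 1 at the centre and 0 elsewhere reduces the block to an ordinary two-layer feed-forward network, which makes the added convolution the only source of interaction between neighbouring tokens.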

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: LocalViT · Convolution