Locally Shifted Attention With Early Global Integration
Shelly Sheynin, Sagie Benaim, Adam Polyak, Lior Wolf

TL;DR
This paper introduces a novel vision transformer approach that combines local and global attention layers early in the network, enabling efficient long-range interactions and improved image classification performance.
Contribution
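It proposes a vision transformer built from local and global attention layers. The local layer attends over each patch and its local shifts, which supports data-dependent localization already at early layers, while the global layer provides coarse long-range interactions.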
Findings
Outperforms convolutional and transformer-based methods on CIFAR10, CIFAR100, and ImageNet.
Supports data-dependent localization at early layers.
Achieves superior classification accuracy with lower computational cost.
Abstract
Recent work has shown the potential of transformers for computer vision applications. An image is first partitioned into patches, which are then used as input tokens for the attention mechanism. Due to the expensive quadratic cost of the attention mechanism, either a large patch size is used, resulting in coarse-grained global interactions, or alternatively, attention is applied only on a local region of the image, at the expense of long-range interactions. In this work, we propose an approach that allows for both coarse global interactions and fine-grained local interactions already at early layers of a vision transformer. At the core of our method is the application of local and global attention layers. In the local attention layer, we apply attention to each patch and its local shifts, resulting in virtually located local patches, which are not bound to a single, specific location.…
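The local attention layer described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the unshifted patch supplies the query, that the patch and its one-pixel shifts supply the keys and values, and that the projections are random matrices. The names `extract_shifted_patches`, `local_shift_attention`, `Wq`, `Wk`, and `Wv` are hypothetical and chosen only for illustration.

```python
import numpy as np

def extract_shifted_patches(image, patch, shifts):
    """Return (num_patches, num_shifts, patch*patch*C): each patch plus its shifted copies.

    Assumes image height and width are multiples of `patch`.
    """
    H, W, C = image.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in shifts)
    # Zero-pad so that shifted windows near the border stay in bounds.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    out = []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            variants = []
            for dy, dx in shifts:
                y0, x0 = y + pad + dy, x + pad + dx
                variants.append(padded[y0:y0 + patch, x0:x0 + patch].reshape(-1))
            out.append(variants)
    return np.asarray(out)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_shift_attention(tokens, Wq, Wk, Wv):
    """Attention over the shifts of each patch (assumed form, see lead-in).

    tokens: (N, S, D), where shift index 0 is the unshifted patch.
    Returns one mixed token per patch and the attention weights over shifts.
    """
    q = tokens[:, 0] @ Wq                      # (N, d) query from the unshifted patch
    k = tokens @ Wk                            # (N, S, d) keys from all shifts
    v = tokens @ Wv                            # (N, S, d) values from all shifts
    scores = np.einsum("nd,nsd->ns", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)         # (N, S) data-dependent weights over shifts
    mixed = np.einsum("ns,nsd->nd", weights, v)
    return mixed, weights

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))         # toy 8x8 RGB image
# The unshifted patch first, then its eight one-pixel shifts.
shifts = [(0, 0)] + [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
tokens = extract_shifted_patches(image, patch=4, shifts=shifts)

D, d = tokens.shape[-1], 16
Wq, Wk, Wv = (rng.standard_normal((D, d)) * 0.1 for _ in range(3))
out, weights = local_shift_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (4, 16)
```

Because each output token is a weighted mixture over shifted copies of a patch, it is not tied to a single grid position, which is one way to read the abstract's "virtually located local patches". Under this sketch, a global layer would then apply ordinary self-attention across the resulting tokens.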
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
