LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate
Anthony Fuller, Daniel G. Kyrollos, Yousef Yassin, James R. Green

TL;DR
LookHere introduces a novel position encoding method for vision transformers that enhances their ability to generalize and extrapolate to larger images, significantly improving high-resolution image classification performance.
Contribution
We propose LookHere, a new position encoding that restricts attention heads to fixed, directed fields of view, improving extrapolation and generalization of vision transformers to high-resolution images.
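The idea of restricting each head to a directed field of view can be sketched with boolean 2D attention masks. The NumPy sketch below is illustrative only and is not the authors' implementation: the half-plane visibility rule, the function names directed_mask and masked_softmax, and the four example directions are all assumptions made for this example.

```python
import numpy as np

def directed_mask(grid_h, grid_w, direction):
    """Boolean (N, N) mask over an H x W patch grid (hypothetical helper).

    Entry [q, k] is True when key patch k lies in the half-plane that
    query patch q "looks" toward. A patch can always see itself.
    """
    dy, dx = direction
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    pos = np.stack([rows.ravel(), cols.ravel()], axis=1)   # (N, 2) patch coordinates
    offset = pos[None, :, :] - pos[:, None, :]             # offset[q, k] = pos[k] - pos[q]
    projection = offset[..., 0] * dy + offset[..., 1] * dx
    return projection >= 0

def masked_softmax(scores, mask):
    """Softmax over keys, with masked-out keys receiving zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Assumed example: four heads, each pointed in a different direction (dy, dx).
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

if __name__ == "__main__":
    masks = {name: directed_mask(3, 3, d) for name, d in DIRECTIONS.items()}
    # The centre patch (index 4) of a 3 x 3 grid, looking right,
    # sees only patches in its own column or further right.
    print(np.nonzero(masks["right"][4])[0])
```

Because the mask is computed only from the relative offset between query and key patches, the same rule applies unchanged to a larger patch grid, which is one way to read the translation-equivariance claim.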
Findings
Improves classification accuracy by 1.6% on ImageNet.
Reduces adversarial attack success rate by 5.4%.
Outperforms 2D-RoPE by 21.7% on extrapolated ImageNet images.
Abstract
High-resolution images offer more information about scenes that can improve model accuracy. However, the dominant model architecture in computer vision, the vision transformer (ViT), cannot effectively leverage larger images without finetuning: ViTs extrapolate poorly to more patches at test time, although transformers offer sequence length flexibility. We attribute this shortcoming to the current patch position encoding methods, which create a distribution shift when extrapolating. We propose a drop-in replacement for the position encoding of plain ViTs that restricts attention heads to fixed fields of view, pointed in different directions, using 2D attention masks. Our novel method, called LookHere, provides translation-equivariance, ensures attention head diversity, and limits the distribution shift that attention heads face when extrapolating. We demonstrate that LookHere…
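To make "more patches at test time" concrete, the short sketch below counts the patches a ViT sees at two image sizes. The patch size of 16 and the resolutions 224 and 1024 are illustrative assumptions, not values stated in the text above, and num_patches is a hypothetical helper.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side

# Assumed settings for illustration only.
train_patches = num_patches(224, 16)    # 14 x 14 grid
test_patches = num_patches(1024, 16)    # 64 x 64 grid
growth = test_patches / train_patches   # how much longer the test sequence is
```

Under these assumed settings the sequence grows from 196 to 4096 patches, so each attention head would face far more keys, and far larger relative positions, than it saw during training.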
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Sparse Evolutionary Training · Dense Connections · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Residual Connection · Vision Transformer
