LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate

Anthony Fuller; Daniel G. Kyrollos; Yousef Yassin; James R. Green

arXiv:2405.13985 · cs.CV · October 31, 2024

TL;DR

LookHere introduces a novel position encoding method for vision transformers that enhances their ability to generalize and extrapolate to larger images, significantly improving high-resolution image classification performance.

Contribution

We propose LookHere, a new position encoding that restricts attention heads to fixed, directed fields of view, improving extrapolation and generalization of vision transformers to high-resolution images.
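The idea of a "directed field of view" can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified half-plane field of view (a head sees its own patch plus every patch lying strictly in its viewing direction), and the names `DIRECTIONS`, `directed_mask`, and `visible_offsets` are hypothetical. The fields of view actually used in the paper, and any additional biases applied within them, may differ.

```python
# Hypothetical sketch of a directed 2D attention mask over a patch grid.
# Assumption: a half-plane field of view; the paper's masks may differ.

# Unit steps (row, col) for four illustrative viewing directions.
DIRECTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def directed_mask(grid_h, grid_w, direction):
    """Return an N x N boolean mask, N = grid_h * grid_w (row-major patches).

    mask[q][k] is True when query patch q may attend to key patch k:
    either k is q itself, or the offset from q to k has a positive
    component along `direction`.
    """
    d_row, d_col = DIRECTIONS[direction]
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    mask = []
    for q_row, q_col in coords:
        row = []
        for k_row, k_col in coords:
            off_row, off_col = k_row - q_row, k_col - q_col
            is_self = off_row == 0 and off_col == 0
            in_view = off_row * d_row + off_col * d_col > 0
            row.append(is_self or in_view)
        mask.append(row)
    return mask


def visible_offsets(mask, grid_w, q_row, q_col):
    """Set of relative (row, col) offsets the query at (q_row, q_col) can see."""
    q = q_row * grid_w + q_col
    return {
        (k // grid_w - q_row, k % grid_w - q_col)
        for k, allowed in enumerate(mask[q])
        if allowed
    }


if __name__ == "__main__":
    small = directed_mask(3, 3, "right")
    big = directed_mask(6, 6, "right")
    # Render what the centre patch of the 3x3 grid sees ("#" = visible).
    centre = small[1 * 3 + 1]
    for r in range(3):
        print("".join("#" if centre[r * 3 + c] else "." for c in range(3)))
    # Offsets seen on the small grid are also seen on the larger grid.
    print(visible_offsets(small, 3, 1, 1) <= visible_offsets(big, 6, 1, 1))
```

Because visibility in this sketch depends only on the relative offset between query and key patches, the same rule applies unchanged at any grid size. In a real model, such a mask would typically be applied by setting the disallowed attention logits to negative infinity before the softmax, with different heads assigned different directions.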

Findings

01. Improves classification accuracy by 1.6% on ImageNet.

02. Reduces adversarial attack success rate by 5.4%.

03. Outperforms 2D-RoPE by 21.7% on extrapolated ImageNet images.

Abstract

High-resolution images offer more information about scenes that can improve model accuracy. However, the dominant model architecture in computer vision, the vision transformer (ViT), cannot effectively leverage larger images without finetuning -- ViTs poorly extrapolate to more patches at test time, although transformers offer sequence length flexibility. We attribute this shortcoming to the current patch position encoding methods, which create a distribution shift when extrapolating. We propose a drop-in replacement for the position encoding of plain ViTs that restricts attention heads to fixed fields of view, pointed in different directions, using 2D attention masks. Our novel method, called LookHere, provides translation-equivariance, ensures attention head diversity, and limits the distribution shift that attention heads face when extrapolating. We demonstrate that LookHere…

Code & Models

Repositories

greencubic/lookhere (PyTorch, official)

Videos

LookHere: Vision Transformers with Directed Attention Generalize and Extrapolate · SlidesLive

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Infrared Target Detection Methodologies

Methods: Attention Is All You Need · Sparse Evolutionary Training · Dense Connections · Softmax · Layer Normalization · Linear Layer · Multi-Head Attention · Residual Connection · Vision Transformer