SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision

Ankit Vani; Bac Nguyen; Samuel Lavoie; Ranjay Krishna; Aaron Courville

arXiv:2404.15721·cs.CV·September 17, 2024

TL;DR

SPARO is a read-out mechanism that partitions transformer encodings into separately-attended slots, each produced by a single attention head, with the aim of improving robustness, compositionality, and interpretability in vision models such as CLIP and DINO.

Contribution

The paper proposes SPARO, an architectural prior that improves transformer representations by separately attending over individual concepts, leading to better downstream performance and interpretability.

Findings

01. Improves ImageNet accuracy by up to 14% with CLIP

02. Improves results on robustness and compositionality benchmarks

03. Enables targeted intervention on individual concepts for performance gains

Abstract

Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same…
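
The read-out described in the abstract (slots that are each produced by a single attention head, then combined into one encoding) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dimensions, the random parameters standing in for learned weights, and the choice to simply concatenate the slots are all assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_slot(tokens, query, W_k, W_v):
    """One slot: a single attention head with its own query reads the tokens."""
    keys = [matvec(W_k, t) for t in tokens]
    values = [matvec(W_v, t) for t in tokens]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    slot = [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
    return slot, weights


def sparo_like_readout(tokens, heads):
    """Concatenate slots from independent single-head read-outs (assumed layout)."""
    slots = [single_head_slot(tokens, h["query"], h["W_k"], h["W_v"])[0]
             for h in heads]
    return [x for slot in slots for x in slot]


def random_head(rng, d_model, d_slot):
    """Hypothetical random initialisation standing in for learned parameters."""
    def rand_matrix(rows, cols):
        return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
    return {
        "query": [rng.gauss(0, 1) for _ in range(d_slot)],
        "W_k": rand_matrix(d_slot, d_model),
        "W_v": rand_matrix(d_slot, d_model),
    }


rng = random.Random(0)
d_model, d_slot, num_slots, num_tokens = 8, 4, 3, 5  # illustrative sizes only
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]
heads = [random_head(rng, d_model, d_slot) for _ in range(num_slots)]
encoding = sparo_like_readout(tokens, heads)
print(len(encoding))  # num_slots * d_slot = 12
```

Because each slot has its own query, key projection, and value projection, every slot attends over the backbone's tokens independently of the others, which is the separation the abstract emphasizes. Any further processing of the slots in the actual method is outside what this sketch covers.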

Code & Models

Repositories

ankitkv/sparo-clip (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Softmax · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training