SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani, Bac Nguyen, Samuel Lavoie, Ranjay Krishna, Aaron Courville

TL;DR
SPARO is a read-out mechanism that partitions transformer encodings into separately-attended slots, each produced by a single attention head, improving robustness, compositionality, and interpretability in vision models such as CLIP and DINO.
Contribution
The paper proposes SPARO, an architectural prior that improves transformer representations by separately attending over individual concepts, leading to better downstream performance and interpretability.
Findings
Improves ImageNet accuracy by up to 14% with CLIP
Improves results on robustness and compositionality benchmarks
Enables targeted concept intervention for performance gains
Abstract
Selective attention helps us focus on task-relevant aspects in the constant flood of our sensory input. This constraint in our perception allows us to robustly generalize under distractions and to new compositions of perceivable concepts. Transformers employ a similar notion of attention in their architecture, but representation learning models with transformer backbones like CLIP and DINO often fail to demonstrate robustness and compositionality. We highlight a missing architectural prior: unlike human perception, transformer encodings do not separately attend over individual concepts. In response, we propose SPARO, a read-out mechanism that partitions encodings into separately-attended slots, each produced by a single attention head. Using SPARO with CLIP imparts an inductive bias that the vision and text modalities are different views of a shared compositional world with the same…
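The slot read-out described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only assumes that each slot comes from its own single attention head reading the backbone's output tokens, and that the final encoding is the concatenation of the slots. The names `single_head_slot`, `slot_readout`, and `random_matrix`, the per-slot query vector, and all dimensions are hypothetical choices made for illustration, and the weights are random rather than learned.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def single_head_slot(tokens, query, w_key, w_value):
    """One slot: a single attention head with its own query reads the tokens."""
    keys = [matvec(w_key, t) for t in tokens]
    values = [matvec(w_value, t) for t in tokens]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    slot_dim = len(values[0])
    # Attention-weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(slot_dim)]


def slot_readout(tokens, heads):
    """Concatenate the outputs of independent single-head read-outs."""
    encoding = []
    for query, w_key, w_value in heads:
        encoding.extend(single_head_slot(tokens, query, w_key, w_value))
    return encoding


def random_matrix(rng, rows, cols):
    """Random stand-in for learned weights (assumption: illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Hypothetical sizes, chosen only to keep the example small.
num_tokens, token_dim, head_dim, slot_dim, num_slots = 5, 8, 4, 3, 6
tokens = random_matrix(rng, num_tokens, token_dim)
heads = [
    (
        [rng.gauss(0.0, 1.0) for _ in range(head_dim)],  # per-slot query
        random_matrix(rng, head_dim, token_dim),         # key projection
        random_matrix(rng, slot_dim, token_dim),         # value projection
    )
    for _ in range(num_slots)
]
encoding = slot_readout(tokens, heads)
print(len(encoding))  # num_slots * slot_dim = 18
```

Because each slot depends only on its own head, changing one head's parameters alters only that slot's segment of the encoding. That separation is the property the sketch is meant to show, and it is plausibly what makes per-concept intervention possible.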
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Softmax · Vision Transformer · self-DIstillation with NO labels · Contrastive Language-Image Pre-training
