Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects
Michael A. Lepori, Alexa R. Tartaglini, Wai Keen Vong, Thomas Serre, Brenden M. Lake, Ellie Pavlick

TL;DR
This paper investigates how vision transformers perform relational reasoning tasks, revealing a two-stage process involving perception and relation comparison, and highlights their potential to learn abstract visual relations.
Contribution
The study uncovers a two-stage processing mechanism in ViTs for relational tasks, demonstrating their capacity to learn abstract relations and providing insights for improving model interpretability.
Findings
ViTs exhibit perceptual and relational processing stages.
ViTs can learn to represent abstract visual relations.
Failures in either stage hinder generalization.
Abstract
Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This begs the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a…
Taxonomy
Topics: 3D Surveying and Cultural Heritage
Methods: Focus
