Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects

Michael A. Lepori; Alexa R. Tartaglini; Wai Keen Vong; Thomas Serre; Brenden M. Lake; Ellie Pavlick

arXiv:2406.15955 · cs.CV · November 26, 2024

TL;DR

This paper investigates how vision transformers perform relational reasoning tasks, revealing a two-stage process involving perception and relation comparison, and highlights their potential to learn abstract visual relations.

Contribution

The study uncovers a two-stage processing mechanism in ViTs for relational tasks, demonstrating their capacity to learn abstract relations and providing insights for improving model interpretability.

Findings

01. ViTs exhibit perceptual and relational processing stages.

02. ViTs can learn to represent abstract visual relations.

03. Failures in either stage hinder generalization.

Abstract

Though vision transformers (ViTs) have achieved state-of-the-art performance in a variety of settings, they exhibit surprising failures when performing tasks involving visual relations. This begs the question: how do ViTs attempt to perform tasks that require computing visual relations between objects? Prior efforts to interpret ViTs tend to focus on characterizing relevant low-level visual features. In contrast, we adopt methods from mechanistic interpretability to study the higher-level visual algorithms that ViTs use to perform abstract visual reasoning. We present a case study of a fundamental, yet surprisingly difficult, relational reasoning task: judging whether two visual entities are the same or different. We find that pretrained ViTs fine-tuned on this task often exhibit two qualitatively different stages of processing despite having no obvious inductive biases to do so: 1) a…

Code & Models

Repositories

alexatartaglini/relational-circuits (PyTorch, official)

Videos

Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects · SlidesLive

Taxonomy

Topics: 3D Surveying and Cultural Heritage

Methods: Focus