SAViR-T: Spatially Attentive Visual Reasoning with Transformers

Pritish Sahu; Kalliopi Basioti; Vladimir Pavlovic

arXiv:2206.09265 · cs.CV · June 23, 2022 · 1 citation

TL;DR

SAViR-T introduces a transformer-based model that explicitly encodes spatial semantics in visual reasoning tasks, achieving state-of-the-art results on multiple RPM benchmarks and natural image datasets.

Contribution

The paper presents SAViR-T, a novel transformer architecture that models intra- and inter-image dependencies using spatially attentive tokens for improved visual reasoning.

Findings

01 Sets new state-of-the-art on RPM benchmarks

02 Outperforms prior models significantly

03 Effective on both synthetic and natural images

Abstract

We present a novel computational model, "SAViR-T", for the family of visual reasoning problems embodied in the Raven's Progressive Matrices (RPM). Our model considers explicit spatial semantics of visual elements within each image in the puzzle, encoded as spatio-visual tokens, and learns the intra-image as well as the inter-image token dependencies, which are highly relevant for the visual reasoning task. Token-wise relationships, modeled through a transformer-based SAViR-T architecture, extract group (row or column) driven representations by leveraging the group-rule coherence and use this as the inductive bias to extract the underlying rule representations in the top two rows (or columns) per token in the RPM. We use these relation representations to locate the correct choice image that completes the last row or column for the RPM. Extensive experiments across both synthetic RPM benchmarks,…
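The selection step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: SAViR-T learns its rule representations with a transformer, whereas the hypothetical `row_rule` below is a hand-written stand-in that records, per spatial token, how features change across a row. The names `row_rule`, `shared_rule`, `score`, and `pick_choice`, along with the tiny feature vectors, are assumptions made for illustration only.

```python
from typing import List

Token = List[float]   # feature vector of one spatial token (assumed layout)
Image = List[Token]   # an image as a list of K spatial tokens


def row_rule(row: List[Image]) -> List[List[float]]:
    """Toy stand-in for a learned rule extractor: for each token position,
    concatenate the feature changes between consecutive images in the row."""
    first, second, third = row
    rules = []
    for t1, t2, t3 in zip(first, second, third):
        step1 = [b - a for a, b in zip(t1, t2)]
        step2 = [c - b for b, c in zip(t2, t3)]
        rules.append(step1 + step2)
    return rules


def shared_rule(r1: List[List[float]], r2: List[List[float]]) -> List[List[float]]:
    """Average the per-token rules of the two complete rows."""
    return [[(x + y) / 2 for x, y in zip(a, b)] for a, b in zip(r1, r2)]


def score(rule: List[List[float]], reference: List[List[float]]) -> float:
    """Negative squared distance, summed over tokens; higher is better."""
    return -sum((x - y) ** 2
                for tok, ref in zip(rule, reference)
                for x, y in zip(tok, ref))


def pick_choice(context: List[Image], choices: List[Image]) -> int:
    """context holds the 8 given images in row-major order."""
    reference = shared_rule(row_rule(context[0:3]), row_rule(context[3:6]))
    scores = [score(row_rule(context[6:8] + [c]), reference) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


# Two tokens per image, one feature each: token 0 grows by 1 along a row,
# token 1 stays constant.
context = [
    [[1.0], [5.0]], [[2.0], [5.0]], [[3.0], [5.0]],
    [[4.0], [0.0]], [[5.0], [0.0]], [[6.0], [0.0]],
    [[2.0], [7.0]], [[3.0], [7.0]],
]
choices = [
    [[9.0], [7.0]],   # breaks the progression of token 0
    [[4.0], [7.0]],   # consistent with both token rules
    [[4.0], [1.0]],   # breaks the constancy of token 1
]
best = pick_choice(context, choices)
print(best)  # 1
```

Scoring each token position separately is what lets the sketch notice that a candidate violates the rule in one region of the image while obeying it in another, which is the intuition behind the per-token rule representations mentioned above.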

Code & Models

Repositories

kalbasioti/visual-reasoning
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques

Methods: Probability Guided Maxout