SAViR-T: Spatially Attentive Visual Reasoning with Transformers
Pritish Sahu, Kalliopi Basioti, Vladimir Pavlovic

TL;DR
SAViR-T introduces a transformer-based model that explicitly encodes spatial semantics in visual reasoning tasks, achieving state-of-the-art results on multiple RPM benchmarks and natural image datasets.
Contribution
The paper presents SAViR-T, a novel transformer architecture that models intra- and inter-image dependencies using spatially attentive tokens for improved visual reasoning.
Findings
Sets new state-of-the-art on RPM benchmarks
Outperforms prior models significantly
Effective on both synthetic and natural images
Abstract
We present a novel computational model, "SAViR-T", for the family of visual reasoning problems embodied in the Raven's Progressive Matrices (RPM). Our model considers explicit spatial semantics of visual elements within each image in the puzzle, encoded as spatio-visual tokens, and learns the intra-image as well as the inter-image token dependencies, which are highly relevant for the visual reasoning task. Token-wise relationships, modeled through a transformer-based SAViR-T architecture, extract group (row or column) driven representations by leveraging the group-rule coherence and use this as the inductive bias to extract the underlying rule representations in the top two rows (or columns) per token in the RPM. We use these relation representations to locate the correct choice image that completes the last row or column for the RPM. Extensive experiments across both synthetic RPM benchmarks,…
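The two kinds of token dependency named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' architecture: it assumes single-head scaled dot-product attention with identity projections, a hypothetical 2x2 grid of tokens per image, and an arbitrary token dimension of 8. The function names `intra_image` and `inter_image` are made up for illustration. Intra-image attention mixes the spatial tokens of one image; inter-image attention mixes, for each token position, the tokens at that position across the images of one row.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, so this only
    illustrates the mixing pattern, not a trained transformer layer.
    tokens: list of n vectors, each of length d.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def intra_image(image_tokens):
    """Attend among the spatial tokens of a single image."""
    return self_attention(image_tokens)


def inter_image(group):
    """For each token position, attend across the images of one group (row)."""
    n_positions = len(group[0])
    per_position = [
        self_attention([image[p] for image in group]) for p in range(n_positions)
    ]
    # per_position[p][i] is position p of image i; restore image-major order.
    return [
        [per_position[p][i] for p in range(n_positions)] for i in range(len(group))
    ]


random.seed(0)
# Hypothetical row of 3 images, each with 4 spatial tokens of dimension 8.
row = [[[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)] for _ in range(3)]
encoded = [intra_image(image) for image in row]
out = inter_image(encoded)
```

Here `out` keeps the same shape as `row` (3 images, 4 tokens, 8 features); each output token is a convex combination of the tokens it attended to.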
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
Methods: Probability Guided Maxout
