Visual Spatial Reasoning
Fangyu Liu, Guy Emerson, Nigel Collier

TL;DR
This paper introduces a new dataset for visual spatial reasoning that highlights the challenges current vision-and-language models face in understanding spatial relations, and reveals a large gap between model and human performance.
Contribution
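The paper presents VSR, a dataset of more than 10k natural text-image pairs covering 66 types of spatial relations in English along with challenging linguistic phenomena such as varying reference frames, and evaluates how well current vision-and-language models capture relational information.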
Findings
State-of-the-art models achieve only around 70% accuracy, while the human ceiling is above 95%.
Per-relation model performance shows little correlation with the number of training examples for that relation.
Models struggle with orientation-based spatial relations.
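The second finding rests on a by-relation analysis: compute accuracy separately for each spatial relation, then check how it relates to the number of training examples for that relation. The following is a minimal sketch of that kind of analysis, not the authors' evaluation code. The record layout (relation, predicted label, gold label), the function names, and all numbers in the toy data are assumptions made up for illustration; they are not results from the paper.

```python
from collections import defaultdict
from math import sqrt


def per_relation_accuracy(records):
    """Return {relation: accuracy} from (relation, predicted, gold) triples.

    The triple layout is a hypothetical format chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for relation, predicted, gold in records:
        total[relation] += 1
        correct[relation] += int(predicted == gold)
    return {rel: correct[rel] / total[rel] for rel in total}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant sequence
    return cov / sqrt(var_x * var_y)


# Toy predictions, invented for illustration only (NOT data from the paper).
toy_predictions = [
    ("under", 1, 1), ("under", 0, 0), ("under", 1, 0), ("under", 0, 0),
    ("facing", 1, 0), ("facing", 0, 0),
    ("in front of", 1, 1), ("in front of", 0, 1), ("in front of", 1, 1),
]

# Toy training-set sizes per relation, also invented for illustration.
toy_train_counts = {"under": 40, "facing": 120, "in front of": 80}

accuracy = per_relation_accuracy(toy_predictions)
relations = sorted(accuracy)
r = pearson(
    [toy_train_counts[rel] for rel in relations],
    [accuracy[rel] for rel in relations],
)

for rel in relations:
    print(f"{rel}: accuracy={accuracy[rel]:.2f}, train examples={toy_train_counts[rel]}")
print(f"Pearson r between training count and accuracy: {r:.2f}")
```

A coefficient near zero on real per-relation results would match the reported finding that more training examples for a relation do not reliably mean higher accuracy on it. With only a handful of relations, as in this toy data, the coefficient is not meaningful.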
Abstract
Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as: under, in front of, and facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performances have little correlation with the number of training examples and the tested models are in…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Categorization, perception, and language
Methods: Vision-and-Language Transformer · VisualBERT · Learning Cross-Modality Encoder Representations from Transformers · Contrastive Language-Image Pre-training
