Visual Spatial Reasoning

Fangyu Liu; Guy Emerson; Nigel Collier

arXiv:2205.00363 · cs.CL · March 23, 2023 · 1 cites

TL;DR

This paper introduces a dataset for visual spatial reasoning and uses it to show that current vision-and-language models struggle with spatial relations: state-of-the-art models reach only around 70% accuracy against a human ceiling above 95%.

Contribution

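The paper presents VSR, a dataset of more than 10k natural text-image pairs covering 66 types of English spatial relations and challenging linguistic phenomena such as varying reference frames, and uses it to evaluate how well current models capture relational information.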

Findings

01. State-of-the-art models achieve only around 70% accuracy, against a human ceiling above 95%.

02. By-relation performance shows little correlation with the number of training examples for that relation.

03. Models struggle with orientation-based spatial relations.
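The finding that by-relation performance does not track training data size can be checked with a simple analysis: compute accuracy separately for each relation, then correlate those accuracies with per-relation training counts. The sketch below is one way to do that, not the authors' evaluation code. The record layout `(relation, gold_label, predicted_label)`, the function names, and all toy numbers are assumptions made for illustration and are not results from the paper.

```python
from collections import defaultdict
from math import sqrt


def by_relation_accuracy(records):
    """Return {relation: accuracy} from (relation, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for relation, gold, predicted in records:
        total[relation] += 1
        correct[relation] += int(gold == predicted)
    return {relation: correct[relation] / total[relation] for relation in total}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data for illustration only; NOT taken from the paper.
toy_predictions = [
    ("under", True, True),
    ("under", False, False),
    ("under", True, False),
    ("under", True, True),
    ("facing", True, False),
    ("facing", False, False),
    ("in front of", True, True),
    ("in front of", False, True),
]
toy_train_counts = {"under": 400, "facing": 150, "in front of": 300}

accuracy = by_relation_accuracy(toy_predictions)
relations = sorted(accuracy)
r = pearson(
    [toy_train_counts[rel] for rel in relations],
    [accuracy[rel] for rel in relations],
)
print(f"Pearson r between training count and accuracy: {r:.3f}")
```

With real per-relation counts and predictions, a coefficient near zero would be consistent with the finding above. A rank correlation may be a more robust choice when a few relations dominate the training set.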

Abstract

Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as: under, in front of, and facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performances have little correlation with the number of training examples and the tested models are in…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Categorization, perception, and language

Methods: Vision-and-Language Transformer · VisualBERT · Learning Cross-Modality Encoder Representations from Transformers · Contrastive Language-Image Pre-training