Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Navid Rajabi; Jana Kosecka

arXiv:2308.09778·cs.CV·March 7, 2024

Open Access

TL;DR

This paper investigates the limitations of current vision-language models in understanding spatial relations and proposes a compositional approach that improves their ability to reason about spatial clauses by grounding objects and their locations.

Contribution

The work introduces a fine-grained, compositional method for spatial reasoning that enhances grounding and reasoning capabilities in existing vision-language models.
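To make the idea of grounding objects before reasoning about a spatial clause concrete, here is a minimal sketch. It assumes that bounding boxes for the two objects are already available (for example, from an object detector) and ranks a small set of candidate relations by a simple geometric rule on box centers. The names `Box`, `relation_scores`, and `rank_relations`, the four-relation vocabulary, and the center-offset scoring are all illustrative assumptions; they are not the scoring used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in image coordinates (origin top-left, y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        # Midpoint of the box along each axis.
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation_scores(subject: Box, obj: Box) -> dict:
    """Score each candidate relation 'subject <relation> obj'.

    Hypothetical rule: the score is the signed offset between box centers
    along the axis the relation refers to. Larger means more plausible.
    """
    sx, sy = subject.center
    ox, oy = obj.center
    dx, dy = sx - ox, sy - oy
    return {
        "left of": -dx,   # subject center lies to the left of the object center
        "right of": dx,
        "above": -dy,     # smaller y means higher in the image
        "below": dy,
    }


def rank_relations(subject: Box, obj: Box) -> list:
    """Return candidate relations ordered from most to least plausible."""
    scores = relation_scores(subject, obj)
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative boxes only; these are not taken from any dataset.
cat = Box(10, 50, 60, 100)
dog = Box(200, 60, 260, 130)
ranking = rank_relations(cat, dog)
```

Because the ranking is computed from explicit boxes, a wrong answer can be traced back either to a bad detection or to the geometric rule, which is the kind of interpretability that grounding is meant to provide.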

Findings

01

Models ground objects poorly, which degrades their spatial reasoning.

02

The proposed approach improves spatial clause ranking accuracy.

03

Grounding object locations enhances model interpretability.

Abstract

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed that these models lack fine-grained understanding, such as the ability to count and recognize verbs, attributes, or relationships. The focus of this work is to study the understanding of spatial relations. This has been tackled previously using image-text matching (e.g., Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the poor object localization "grounding" ability of the models is a contributing factor to the poor image-text…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Learning Cross-Modality Encoder Representations from Transformers · Focus