Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
Navid Rajabi, Jana Kosecka

TL;DR
This paper investigates the limitations of current vision-language models in understanding spatial relations and proposes a compositional approach that improves their ability to reason about spatial clauses by grounding objects and their locations.
Contribution
The work introduces a fine-grained, compositional method for spatial reasoning that enhances grounding and reasoning capabilities in existing vision-language models.
Findings
Models ground objects poorly, and this weak localization contributes to their poor spatial reasoning.
The proposed approach improves spatial clause ranking accuracy.
Grounding object locations enhances model interpretability.
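To make the idea of ranking spatial clauses from grounded object locations concrete, here is a minimal sketch. It is not the authors' method: it assumes an object detector has already returned one bounding box per object, and it scores only four relations using the offset between box centers. The names Box, relation_scores, and rank_clauses are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (origin at the image's top-left).

    Hypothetical container; a real detector would supply these values.
    """
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation_scores(subject: Box, obj: Box) -> dict[str, float]:
    """Score four relations by the signed offset between box centers.

    Assumption: a larger positive offset means the relation holds more strongly.
    """
    sx, sy = subject.center
    ox, oy = obj.center
    return {
        "left of": ox - sx,   # subject center lies to the left of the object center
        "right of": sx - ox,
        "above": oy - sy,     # image y grows downward, so a smaller y is higher
        "below": sy - oy,
    }


def rank_clauses(subject_name: str, subject: Box,
                 object_name: str, obj: Box) -> list[str]:
    """Return candidate clauses ordered from best to worst geometric fit."""
    scores = relation_scores(subject, obj)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [f"{subject_name} {relation} {object_name}" for relation in ranked]


if __name__ == "__main__":
    cat = Box(10, 50, 60, 100)       # illustrative boxes, not real detections
    table = Box(100, 40, 200, 120)
    for clause in rank_clauses("cat", cat, "table", table):
        print(clause)
```

Because every score comes from explicit box coordinates, a wrong ranking can be traced back to either a bad detection or a bad geometric rule, which is the kind of interpretability that grounding is meant to provide.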
Abstract
Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed that these models lack fine-grained understanding, such as the ability to count and recognize verbs, attributes, or relationships. The focus of this work is to study the understanding of spatial relations. This has been tackled previously using image-text matching (e.g., Visual Spatial Reasoning benchmark) or visual question answering (e.g., GQA or VQAv2), both showing poor performance and a large gap compared to human performance. In this work, we show qualitatively (using explainability tools) and quantitatively (using object detectors) that the poor object localization "grounding" ability of the models is a contributing factor to the poor image-text…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
Methods: Learning Cross-Modality Encoder Representations from Transformers · Focus
