Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
Akshar Tumu, Varad Shinde, Parisa Kordjamshidi

TL;DR
This paper evaluates vision-language models' spatial reasoning abilities using Referring Expression Comprehension, revealing their strengths and weaknesses in understanding complex, ambiguous, and negated spatial expressions.
Contribution
It proposes using the Referring Expression Comprehension task as a platform for analyzing spatial grounding in vision-language models, focusing on ambiguous, complex, and negated spatial language.
Findings
Models struggle with complex spatial expressions.
Performance varies across spatial categories.
Challenges are influenced by model architecture and expression complexity.
Abstract
Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform enables a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at…
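The abstract excerpt does not say how a model's answer is scored. As a rough illustration only, Referring Expression Comprehension is commonly evaluated by comparing the predicted bounding box with the annotated one and counting a prediction as correct when their intersection-over-union (IoU) reaches 0.5. The sketch below assumes that convention; the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Sequence[Box], golds: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the gold box meets the threshold.

    The 0.5 default is a common convention, assumed here rather than
    confirmed by the source.
    """
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(preds)
```

Computing this accuracy separately over subsets of expressions (for example, those containing negation or multiple spatial relations) would be one way to obtain the kind of per-category comparison the abstract describes.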
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
