Exploring Spatial Language Grounding Through Referring Expressions
Akshar Tumu, Parisa Kordjamshidi

TL;DR
This paper evaluates vision-language models' spatial reasoning abilities using Referring Expression Comprehension, revealing their strengths and weaknesses in understanding complex, ambiguous, and negated spatial expressions.
Contribution
It introduces Referring Expression Comprehension as a new platform for assessing spatial grounding in VLMs, focusing on complex and negated spatial language.
Findings
Models struggle with complex spatial expressions
Performance varies across spatial categories
Negation and ambiguity pose significant challenges
Abstract
Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning by VLMs. This platform allows a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at…
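To make the evaluation setup more concrete, here is a minimal sketch of how Referring Expression Comprehension predictions are commonly scored: a predicted box counts as correct when its intersection-over-union (IoU) with the gold box reaches a threshold, and accuracy is reported per expression category. The 0.5 threshold, the `(x1, y1, x2, y2)` box format, the sample dictionary keys, and the category names are all assumptions for illustration; the abstract does not state the paper's exact metric or data format.

```python
from collections import defaultdict


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_category(samples, threshold=0.5):
    """Per-category accuracy; a prediction is a hit when IoU >= threshold.

    Each sample is a dict with hypothetical keys:
    'category', 'pred' (predicted box), and 'gold' (annotated box).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample in samples:
        totals[sample["category"]] += 1
        if iou(sample["pred"], sample["gold"]) >= threshold:
            hits[sample["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Hypothetical toy data: two negated expressions and one simple expression.
samples = [
    {"category": "negation", "pred": (0, 0, 10, 10), "gold": (0, 0, 10, 10)},
    {"category": "negation", "pred": (0, 0, 10, 10), "gold": (20, 20, 30, 30)},
    {"category": "simple", "pred": (0, 0, 10, 10), "gold": (1, 1, 10, 10)},
]
scores = accuracy_by_category(samples)
```

Grouping by category is what lets an analysis like this one separate performance on negated or multi-relation expressions from performance on simpler ones, rather than reporting a single aggregate number.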
Taxonomy
Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · EFL/ESL Teaching and Learning
