Exploring Spatial Language Grounding Through Referring Expressions

Akshar Tumu; Parisa Kordjamshidi

arXiv:2502.04359 · cs.CL · February 10, 2025

TL;DR

This paper evaluates vision-language models' spatial reasoning abilities using Referring Expression Comprehension, revealing their strengths and weaknesses in understanding complex, ambiguous, and negated spatial expressions.

Contribution

It proposes using the Referring Expression Comprehension task as a platform for assessing spatial grounding in VLMs, focusing on ambiguous, complex, and negated spatial language.

Findings

01 Models struggle with complex spatial expressions

02 Performance varies across spatial categories

03 Negation and ambiguity pose significant challenges

Abstract

Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform allows a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at…
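
The abstract does not state how predictions are scored. As a minimal sketch, assuming the convention common in Referring Expression Comprehension benchmarks (a predicted bounding box counts as correct when its intersection-over-union with the annotated box is at least 0.5), accuracy could be computed as below. The names `Box`, `iou`, and `rec_accuracy`, the box layout, and the threshold are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: List[Box], golds: List[Box], threshold: float = 0.5) -> float:
    """Fraction of expressions whose predicted box reaches the IoU threshold."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, golds))
    return hits / len(preds)
```

Grouping the same per-example hits by expression type (for instance, negated versus non-negated expressions) would give the kind of per-category breakdown the findings refer to.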

Taxonomy

Topics: Speech and dialogue systems · Language, Metaphor, and Cognition · EFL/ESL Teaching and Learning