Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee, Freda Shi, Parisa Kordjamshidi, Joyce Chai, Ziqiao Ma

TL;DR
This paper introduces COMFORT, a new evaluation protocol for assessing how well vision-language models understand and reason about spatial language and frames of reference, revealing significant limitations in current models.
Contribution
The paper presents COMFORT, a systematic evaluation method for spatial reasoning in VLMs, and demonstrates their shortcomings in robustness, flexibility, and cross-cultural understanding.
Findings
VLMs show some alignment with English spatial conventions.
Models lack robustness and consistency in spatial reasoning.
Cross-lingual tests reveal models favor English conventions.
Abstract
Spatial expressions in situated communication can be ambiguous, as their meanings vary depending on the frames of reference (FoR) adopted by speakers and listeners. While spatial language understanding and reasoning by vision-language models (VLMs) have gained increasing attention, potential ambiguities in these models are still under-explored. To address this issue, we present the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to systematically assess the spatial reasoning capabilities of VLMs. We evaluate nine state-of-the-art VLMs using COMFORT. Despite showing some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or…
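To make the ambiguity described in the abstract concrete, here is a minimal sketch of how the same scene can satisfy "the target is to the left of the reference" under one frame of reference and fail it under another. It is an illustration only, not COMFORT's implementation: the 2D top-down simplification, the coordinate convention (viewer looks along +y, viewer's right is +x), and all names (`Vec2`, `left_of_relative`, `left_of_intrinsic`) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec2:
    """A point or direction in a top-down 2D scene (hypothetical helper)."""
    x: float
    y: float


def left_of_relative(target: Vec2, reference: Vec2) -> bool:
    """Viewer-centred (relative) frame: 'left' is the viewer's left.

    Assumes the viewer looks along +y, so the viewer's left is -x.
    """
    return target.x < reference.x


def left_of_intrinsic(target: Vec2, reference: Vec2, facing: Vec2) -> bool:
    """Object-centred (intrinsic) frame: 'left' is the reference object's own left.

    `facing` is the direction the reference object's front points to.
    Rotating it 90 degrees counter-clockwise gives the object's left direction.
    """
    left_x, left_y = -facing.y, facing.x
    dx, dy = target.x - reference.x, target.y - reference.y
    # Positive projection onto the left direction means the target is on the left side.
    return dx * left_x + dy * left_y > 0


# Scene: the reference object sits at the origin and faces the viewer (-y);
# the target sits on the viewer's left-hand side of it.
reference = Vec2(0.0, 0.0)
facing_viewer = Vec2(0.0, -1.0)
target = Vec2(-1.0, 0.0)

relative_answer = left_of_relative(target, reference)
intrinsic_answer = left_of_intrinsic(target, reference, facing_viewer)
print(f"relative frame:  left = {relative_answer}")
print(f"intrinsic frame: left = {intrinsic_answer}")
```

In this scene the two frames disagree, so a yes/no answer to "is the target to the left of the reference?" is only well defined once the frame of reference is fixed. If the reference object instead faces away from the viewer (`Vec2(0.0, 1.0)`), the two frames agree.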
Taxonomy
Topics: Categorization, perception, and language
Methods: Softmax · Attention Is All You Need · ALIGN
