Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding

Xuefei Sun, Doncey Albin, Cecilia Mauceri, Dusty Woods, and Christoffer Heckman

arXiv:2505.12194 · cs.RO · May 20, 2025

Open Access

TL;DR

Spatial-LLaVA is a multimodal large language model trained on a new dataset of spatial referring expressions. It improves spatial reasoning in visual understanding tasks, especially in zero-shot scenarios.

Contribution

The paper introduces Spatial-LLaVA and the SUN-Spot v2.0 dataset, which together improve how multimodal models understand spatial referring expressions. Spatial-LLaVA surpasses previous methods by 3.15% on a zero-shot spatial reasoning benchmark.

Findings

01. Outperforms previous methods by 3.15% on zero-shot spatial reasoning
02. Enables precise understanding of spatial referring expressions
03. Applicable to autonomous navigation and robotics tasks

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable abilities in comprehending visual input alongside text input. Typically, these models are trained on extensive data sourced from the internet, which are sufficient for general tasks such as scene understanding and question answering. However, they often underperform on specialized tasks where online data is scarce, such as determining spatial relationships between objects or localizing unique target objects within a group of objects sharing similar features. In response to this challenge, we introduce the SUN-Spot v2.0 dataset, now comprising a total of 90k image-caption pairs and additional annotations on the landmark objects. Each image-caption pair utilizes Set-of-Marks prompting as an additional indicator, mapping each landmark object in the image to the corresponding object mentioned in the caption. Furthermore,…
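To make the mark-to-caption mapping described in the abstract concrete, here is a minimal sketch in Python. It is an illustration only, not the SUN-Spot v2.0 schema: the class names `Landmark` and `MarkedPair`, the function `annotate_caption`, the bracketed `[id]` notation, the file name, and the example caption are all hypothetical. The only idea taken from the abstract is that each landmark object carries a mark that ties it to its mention in the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Landmark:
    mark_id: int  # numeric mark assumed to be overlaid on the object in the image
    phrase: str   # how the caption refers to this landmark object


@dataclass(frozen=True)
class MarkedPair:
    image_path: str    # hypothetical path, not a real dataset file
    caption: str       # spatial referring expression for the target object
    landmarks: tuple   # tuple of Landmark entries for this image


def annotate_caption(pair: MarkedPair) -> str:
    """Insert each landmark's mark ID after its first mention in the caption."""
    text = pair.caption
    for lm in pair.landmarks:
        if lm.phrase not in text:
            raise ValueError(f"landmark phrase not in caption: {lm.phrase!r}")
        # Replace only the first occurrence so repeated words are not re-tagged.
        text = text.replace(lm.phrase, f"{lm.phrase} [{lm.mark_id}]", 1)
    return text


# Hypothetical example record; the caption and marks are made up for illustration.
example = MarkedPair(
    image_path="example.jpg",
    caption="The mug is to the left of the laptop on the desk.",
    landmarks=(Landmark(1, "laptop"), Landmark(2, "desk")),
)
annotated = annotate_caption(example)
print(annotated)
```

In this sketch the target object (the mug) is left unmarked and only the landmarks are tagged, which is one plausible reading of "annotations on the landmark objects"; the actual dataset may encode marks differently.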

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling