Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding
Xuefei Sun, Doncey Albin, Cecilia Mauceri, Dusty Woods, and Christoffer Heckman

TL;DR
Spatial-LLaVA is a multimodal large language model enhanced with spatial referring expressions and trained on a new dataset. It significantly improves spatial reasoning in visual understanding tasks, especially in zero-shot scenarios.
Contribution
The paper introduces Spatial-LLaVA and the SUN-Spot v2.0 dataset, which together enable better understanding of spatial referring expressions in multimodal models and surpass previous methods by 3.15% on a zero-shot spatial reasoning benchmark.
Findings
Outperforms previous methods by 3.15% on zero-shot spatial reasoning
Enables precise understanding of spatial referring expressions
Applicable to autonomous navigation and robotics tasks
Abstract
Multimodal large language models (MLLMs) have demonstrated remarkable abilities in comprehending visual input alongside text input. Typically, these models are trained on extensive data sourced from the internet, which are sufficient for general tasks such as scene understanding and question answering. However, they often underperform on specialized tasks where online data is scarce, such as determining spatial relationships between objects or localizing unique target objects within a group of objects sharing similar features. In response to this challenge, we introduce the SUN-Spot v2.0 dataset, now comprising a total of 90k image-caption pairs and additional annotations on the landmark objects. Each image-caption pair utilizes Set-of-Marks prompting as an additional indicator, mapping each landmark object in the image to the corresponding object mentioned in the caption. Furthermore,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
