Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

Georgios Pantazopoulos; Alessandro Suglia; Oliver Lemon; Arash Eshghi

arXiv:2404.13594·cs.CV·April 23, 2024

TL;DR

This paper probes how much spatial information is encoded in the visual prompts that resamplers produce in vision-language models. It finds that training the resampler jointly with the probe improves spatial encoding, and suggests that object-aware pretraining objectives are needed.

Contribution

The study introduces diagnostic classifiers to evaluate spatial information in visual prompts and demonstrates that joint training improves spatial encoding in resamplers.

Findings

01

Visual prompts largely lack spatial information when the resampler is kept frozen while the diagnostic classifiers are trained.

02

Training the resampler jointly with the classifier significantly improves spatial encoding.

03

Object-aware pretraining could enhance spatial understanding.

Abstract

An effective method for combining frozen large language models (LLM) and visual encoders involves a resampler module that creates a 'visual prompt' which is provided to the LLM, along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks like image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when kept frozen during training of the classifiers. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by the…
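
The two probing regimes described in the abstract (classifier trained on a frozen resampler versus resampler and classifier trained jointly) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and is not the authors' setup: the "resampler" is a hypothetical linear projection from four synthetic features to one scalar, the "spatial" label is simply the sign of the first feature, and the initial projection is deliberately chosen to discard that feature. The sketch shows the mechanics of the two regimes, not the paper's results.

```python
import math
import random


def sigmoid(t):
    """Logistic function, with the input clamped to avoid overflow in exp."""
    t = max(-30.0, min(30.0, t))
    return 1.0 / (1.0 + math.exp(-t))


def make_data(n, seed=0):
    """Synthetic 4-d features; the toy 'spatial' label is the sign of feature 0."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    ys = [1 if x[0] > 0 else 0 for x in xs]
    return xs, ys


def train_probe(xs, ys, train_resampler, epochs=100, lr=0.1):
    """Train a logistic probe on the 1-d output of a toy linear 'resampler'.

    train_resampler=False: only the probe (w, b) is updated (frozen regime).
    train_resampler=True:  the projection r is updated as well (joint regime).
    """
    r = [0.0, 1.0, 0.0, 0.0]  # hypothetical initial projection; ignores feature 0
    w, b = 0.1, 0.0           # probe parameters
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(ri * xi for ri, xi in zip(r, x))  # compressed "visual prompt"
            err = sigmoid(w * z + b) - y              # d(BCE)/d(logit)
            grad_w, grad_b = err * z, err
            if train_resampler:
                # Backpropagate through the projection: d(logit)/d(r_i) = w * x_i
                r = [ri - lr * err * w * xi for ri, xi in zip(r, x)]
            w -= lr * grad_w
            b -= lr * grad_b
    return r, w, b


def accuracy(xs, ys, r, w, b):
    """Fraction of examples the probe classifies correctly at threshold 0.5."""
    correct = 0
    for x, y in zip(xs, ys):
        z = sum(ri * xi for ri, xi in zip(r, x))
        pred = 1 if sigmoid(w * z + b) > 0.5 else 0
        correct += int(pred == y)
    return correct / len(xs)


if __name__ == "__main__":
    xs, ys = make_data(200)
    for name, joint in [("frozen", False), ("joint", True)]:
        params = train_probe(xs, ys, train_resampler=joint)
        print(name, accuracy(xs, ys, *params))
```

In this toy, the frozen probe can only see a feature unrelated to the label, while joint training lets gradients reshape the projection so the compressed output carries the label-relevant feature. A real probing study would substitute the actual resampler outputs and task labels.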

Code & Models

Repositories

gpantaz/probing-resamplers (Official)

Videos

Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers · underline

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Semantic Web and Ontologies