Vision language models have difficulty recognizing virtual objects
Tyler Tran, Sangeet Khemlani, J.G. Trafton

TL;DR
This paper investigates whether vision language models can reason about virtual objects, that is, objects described in a prompt but not depicted in the image, and finds limitations in their scene comprehension and spatial reasoning.
Contribution
The study introduces an evaluation method that uses virtual objects to test VLMs' scene understanding, and it highlights their difficulty in processing objects that are described but not shown.
Findings
VLMs struggle to recognize virtual objects in scenes.
Current models show limited reasoning about spatial relations involving virtual objects.
Evaluation reveals significant gaps in scene comprehension capabilities.
Abstract
Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: "imagine that a kite is stuck in the tree." VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process…
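The probing idea in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code: the `VirtualObjectProbe` class, the `build_prompt` and `is_correct` helpers, the image filename, the yes/no answer format, and the follow-up question are all hypothetical, chosen only to show how an image might be paired with a virtual-object statement and a spatial question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualObjectProbe:
    """One hypothetical test item: an image plus a described, undepicted object."""
    image_path: str         # assumed path to the image shown to the model
    virtual_statement: str  # text introducing an object absent from the image
    question: str           # spatial question involving the virtual object
    expected: str           # assumed gold answer, "yes" or "no"


def build_prompt(probe: VirtualObjectProbe) -> str:
    """Combine the virtual-object statement and the question into one text prompt."""
    return f"{probe.virtual_statement} {probe.question} Answer yes or no."


def is_correct(probe: VirtualObjectProbe, response: str) -> bool:
    """Score a response by checking whether its first word matches the gold answer."""
    words = response.strip().lower().replace(".", " ").replace(",", " ").split()
    return bool(words) and words[0] == probe.expected


# Example item based on the kite scenario from the abstract.
probe = VirtualObjectProbe(
    image_path="person_under_tree.jpg",  # hypothetical filename
    virtual_statement="Imagine that a kite is stuck in the tree.",
    question="Is the kite above the person?",
    expected="yes",
)
```

In use, the text from `build_prompt(probe)` would be sent to a VLM together with the image at `probe.image_path`, and the model's reply would be passed to `is_correct`. Matching on the first word is a deliberately crude scoring rule; a real evaluation would likely need more robust answer parsing.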
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems
