Evaluating Vision-Language Models on Bistable Images
Artemis Panagopoulou, Coby Melkin, Chris Callison-Burch

TL;DR
This paper evaluates vision-language models on bistable images, revealing model biases, differences from human perception, and the influence of prompts and labels. All resources are openly available.
Contribution
It provides the most comprehensive analysis to date of vision-language models' responses to bistable images, including a new dataset and insights into model biases and language influence.
Findings
Most models prefer one interpretation over another
Models show minimal variance under image manipulations, with few exceptions for rotations
Model interpretations differ from human perception and biases
Abstract
Bistable images, also known as ambiguous or reversible images, present visual stimuli that can be seen in two distinct interpretations, though not simultaneously by the observer. In this study, we conduct the most extensive examination of vision-language models using bistable images to date. We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation. We evaluated twelve different models in both classification and generative tasks across six model architectures. Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another among the models, and minimal variance under image manipulations, with few exceptions on image rotations. Additionally, we compared the model…
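As a rough illustration of what "preference for one interpretation" and "variance under image manipulations" could mean in practice, the sketch below turns a pair of raw scores (one per interpretation) into a probability and then summarizes it across manipulated variants of one image. This is not the authors' code: the function names, the softmax normalization, and the example scores are all assumptions made for illustration only.

```python
import math


def preference(score_a: float, score_b: float) -> float:
    """Softmax probability of interpretation A given two raw scores.

    Assumption: the model exposes one scalar score per candidate label
    (e.g. an image-text similarity or a label log-likelihood).
    """
    m = max(score_a, score_b)  # subtract the max for numerical stability
    ea = math.exp(score_a - m)
    eb = math.exp(score_b - m)
    return ea / (ea + eb)


def summarize(scores):
    """Mean and population variance of the preference for interpretation A.

    scores: list of (score_a, score_b) pairs, one per manipulated variant
    (brightness, tint, rotation, ...) of the same bistable image.
    """
    prefs = [preference(a, b) for a, b in scores]
    mean = sum(prefs) / len(prefs)
    var = sum((p - mean) ** 2 for p in prefs) / len(prefs)
    return mean, var


# Hypothetical, made-up scores for one image under four manipulations.
example = [(2.0, 1.0), (2.1, 1.0), (1.9, 1.0), (2.0, 1.1)]
mean_pref, var_pref = summarize(example)
print(f"mean preference for A: {mean_pref:.3f}, variance: {var_pref:.5f}")
```

Under these assumptions, a mean far from 0.5 would indicate a pronounced preference for one interpretation, and a small variance would indicate that the manipulations barely change it.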
Taxonomy
Topics: Geographic Information Systems Studies · Religious Tourism and Spaces
