Visual Persuasion: What Influences Decisions of Vision-Language Models?
Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

TL;DR
This paper introduces a framework for analyzing the visual preferences of vision-language models by systematically perturbing images and observing decision changes, revealing their vulnerabilities and biases.
Contribution
It develops a method to infer VLMs' visual utility through choice-based perturbations and visual prompt optimization, enabling interpretability and safety auditing.
Findings
Optimized visual edits significantly influence model choices.
Consistent visual themes affect model preferences.
The framework enables proactive safety and bias detection.
Abstract
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or…
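One way to picture the revealed-preference idea is a small sketch that turns pairwise choices into per-image utility scores. This is only an illustration: it assumes a Bradley-Terry model, which the abstract does not name, and the variant labels and choice counts below are hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict


def fit_bradley_terry(choices, n_iter=200):
    """Estimate a latent utility per item from (winner, loser) pairs.

    Assumption: choices follow a Bradley-Terry model, where
    P(i chosen over j) = s_i / (s_i + s_j). Returns strengths that sum to 1.
    """
    items = sorted({item for pair in choices for item in pair})
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    for winner, loser in choices:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # Sum over every opponent j that i was actually compared with.
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Clamp so an item with zero wins never reaches exactly 0.
            updated[i] = max(wins[i] / denom, 1e-9) if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {item: s / total for item, s in updated.items()}
    return strength


# Hypothetical choice log: each tuple is (image the model picked, image it rejected).
choices = (
    [("warm_lighting", "original")] * 8 + [("original", "warm_lighting")] * 2
    + [("original", "cluttered_background")] * 7 + [("cluttered_background", "original")] * 3
    + [("warm_lighting", "cluttered_background")] * 9 + [("cluttered_background", "warm_lighting")] * 1
)

utilities = fit_bradley_terry(choices)
for name, score in sorted(utilities.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Under these assumptions, a higher score means the variant was chosen more often than its competitors, so ranking the scores gives a rough ordering of which edits the model favors.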
Taxonomy
Topics: Multimodal Machine Learning Applications · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
