Do Vision-Language Models Understand Visual Persuasiveness?

Gyuwon Park

arXiv:2511.17036 · cs.CL · November 24, 2025

Open Access

TL;DR

This paper investigates whether vision-language models truly understand visual persuasion by analyzing their ability to predict human judgments, revealing limitations in linking objects to communicative intent and proposing strategies to improve reasoning.

Contribution

The paper introduces a high-consensus dataset and a taxonomy of Visual Persuasive Factors, providing new insights into VLMs' understanding of visual persuasion and testing intervention strategies.

Findings

01

High-level semantic cues are the strongest predictor of persuasiveness.

02

VLMs tend to over-predict high persuasiveness and struggle with low/mid-level features.

03

Concise, object-grounded rationales improve model performance.

Abstract

Recent advances in vision-language models (VLMs) have enabled impressive multi-modal reasoning and understanding. Yet whether these models truly grasp visual persuasion (how visual cues shape human attitudes and decisions) remains unclear. To probe this question, we construct a high-consensus dataset for binary persuasiveness judgment and introduce the taxonomy of Visual Persuasive Factors (VPFs), encompassing low-level perceptual, mid-level compositional, and high-level semantic cues. We also explore cognitive steering and knowledge injection strategies for persuasion-relevant reasoning. Empirical analysis across VLMs reveals a recall-oriented bias (models over-predict high persuasiveness) and weak discriminative power for low/mid-level features. In contrast, high-level semantic alignment between message and object presence emerges as the strongest predictor of human judgment. Among…
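
To make the "recall-oriented bias" in the abstract concrete, here is a minimal sketch of how precision and recall behave when a binary classifier over-predicts the positive ("high persuasiveness") class. The labels below are hypothetical toy values, not data from the paper, and the function name `precision_recall` is an illustrative assumption rather than anything the authors describe.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels: 1 = high persuasiveness, 0 = low persuasiveness.
human = [1, 0, 0, 1, 0, 1, 0, 0]  # assumed human consensus judgments
model = [1, 1, 1, 1, 0, 1, 1, 0]  # assumed model output that over-predicts 1

precision, recall = precision_recall(human, model)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the model catches every image humans rated as persuasive, so recall is perfect, but half of its positive predictions are wrong, so precision is low. That gap between recall and precision is the kind of pattern a recall-oriented bias would produce.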

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Visual Attention and Saliency Detection