Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning
Naeun Lee, Hyunjong Kim, Sunghwan Choi, Injin Kong, and Yohan Jo

TL;DR
This paper investigates how Multimodal Large Language Models (MLLMs) can be trained to predict visual persuasiveness and explain those predictions, and how to evaluate whether the rationales they produce faithfully support their decisions.
Contribution
The paper introduces a three-dimensional framework for evaluating rationale faithfulness and demonstrates that supervised fine-tuning on diverse teacher-generated rationales improves prediction and explanation quality.
Findings
Diverse rationales improve persuasiveness prediction.
Prediction performance alone does not ensure rationale faithfulness.
Rationale-to-decision sensitivity aligns with human preferences.
Abstract
Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness,…
