Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu

TL;DR
This paper introduces VisArgs, a dataset and benchmark for evaluating AI's ability to understand visual arguments, emphasizing the challenge of selective vision in interpreting images within argumentative contexts.
Contribution
The paper presents a new dataset, VisArgs, with annotated visual and commonsense premises, and proposes three tasks to assess AI understanding of visual arguments, highlighting current model limitations.
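To make the annotation structure concrete, here is a minimal sketch of how one annotated image could be represented: visual premises with regions, commonsense premises, and a reasoning tree whose root is the conclusion. All class names, field names, and the box format are assumptions for illustration, not the dataset's actual schema, and the example argument is a made-up toy, not a VisArgs entry.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class VisualPremise:
    text: str
    # Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
    region: Tuple[float, float, float, float]


@dataclass
class CommonsensePremise:
    text: str


@dataclass
class ReasoningNode:
    # An intermediate or final statement, plus the premises that support it.
    statement: str
    visual: List[int] = field(default_factory=list)       # indices into visual_premises
    commonsense: List[int] = field(default_factory=list)  # indices into commonsense_premises
    children: List["ReasoningNode"] = field(default_factory=list)


@dataclass
class VisualArgument:
    image_id: str
    visual_premises: List[VisualPremise]
    commonsense_premises: List[CommonsensePremise]
    root: ReasoningNode  # the root statement plays the role of the conclusion


def collect_visual_premises(node: ReasoningNode) -> Set[int]:
    """Return the indices of all visual premises used anywhere in the tree."""
    used = set(node.visual)
    for child in node.children:
        used |= collect_visual_premises(child)
    return used


# Made-up toy argument (not taken from the dataset).
toy = VisualArgument(
    image_id="toy-0001",
    visual_premises=[
        VisualPremise("A cigarette is shaped like a gun barrel.", (10, 20, 200, 80)),
        VisualPremise("Smoke rises from the tip.", (180, 0, 260, 40)),
    ],
    commonsense_premises=[CommonsensePremise("Guns can kill people.")],
    root=ReasoningNode(
        statement="Smoking kills.",
        children=[
            ReasoningNode(
                statement="The cigarette is being compared to a weapon.",
                visual=[0, 1],
                commonsense=[0],
            )
        ],
    ),
)
```

In this toy layout, `collect_visual_premises(toy.root)` gathers the image regions the argument actually depends on, which is the "selective" subset that the localization and identification tasks ask a model to find.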
Findings
AI models struggle with visual premise localization and identification.
Providing relevant visual premises improves model accuracy.
Humans outperform AI in understanding visual arguments.
Abstract
Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI capable of similar understanding? We present VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises (with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction. Experiments show that 1) machines struggle to capture visual cues: GPT-4-O achieved 78.5%…
Taxonomy
Topics: Language, Metaphor, and Cognition
