Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge
Brendan Park, Madeline Janecek, Naser Ezzati-Jivan, Yifeng Li, and Ali Emami

TL;DR
This paper introduces WinoVis, a new multimodal dataset for testing pronoun disambiguation in text-to-image models, revealing current limitations and guiding future research in visual reasoning.
Contribution
The paper presents WinoVis, a novel dataset and evaluation framework for assessing multimodal pronoun disambiguation in text-to-image models, using GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis.
Findings
Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis
Models show only marginal improvement over random guessing
Error analysis highlights key challenges for future research
Abstract
Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' ability in pronoun disambiguation from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally…
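The heatmap-based evaluation mentioned in the abstract can be pictured with a small sketch. Everything specific here is an assumption for illustration and not the paper's procedure: the function names (`binarize`, `iou`, `resolve_pronoun`), the 0.4 threshold, the use of intersection-over-union as the overlap score, and the toy 4x4 arrays standing in for per-token attribution heatmaps such as those DAAM produces. The idea is only that a pronoun is attributed to whichever candidate entity's heatmap overlaps its own heatmap most.

```python
import numpy as np

def binarize(heatmap, threshold=0.4):
    """Min-max scale a heatmap to [0, 1] and keep pixels at or above the threshold."""
    h = np.asarray(heatmap, dtype=float)
    span = h.max() - h.min()
    if span == 0:
        # A flat heatmap carries no localisation signal.
        return np.zeros(h.shape, dtype=bool)
    return (h - h.min()) / span >= threshold

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks (0.0 when both are empty)."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

def resolve_pronoun(pronoun_map, candidate_maps, threshold=0.4):
    """Pick the candidate whose mask overlaps the pronoun mask most.

    Returns (name or None, scores); None means no candidate overlaps at all.
    """
    pronoun_mask = binarize(pronoun_map, threshold)
    scores = {
        name: iou(pronoun_mask, binarize(m, threshold))
        for name, m in candidate_maps.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores

# Toy heatmaps (hypothetical data) for a trophy/suitcase-style schema.
pronoun = np.zeros((4, 4)); pronoun[0:2, 0:2] = 1.0
trophy = np.zeros((4, 4)); trophy[0:2, 0:3] = 1.0
suitcase = np.zeros((4, 4)); suitcase[2:4, 2:4] = 1.0

choice, scores = resolve_pronoun(pronoun, {"trophy": trophy, "suitcase": suitcase})
print(choice, scores)
```

In this toy case the pronoun mask lies entirely inside the trophy mask and shares no pixels with the suitcase mask, so the sketch attributes the pronoun to "trophy". Returning `None` when nothing overlaps keeps cases where the image shows no clear referent separate from wrong resolutions.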
Taxonomy
Topics: Art, Technology, and Culture
Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
