Picturing Ambiguity: A Visual Twist on the Winograd Schema Challenge

Brendan Park, Madeline Janecek, Naser Ezzati-Jivan, Yifeng Li, and Ali Emami

arXiv:2405.16277·cs.CL·June 4, 2024

Open Access

TL;DR

This paper introduces WinoVis, a new multimodal dataset for testing pronoun disambiguation in text-to-image models, revealing current limitations and guiding future research in visual reasoning.

Contribution

The paper presents WinoVis, a novel dataset and evaluation framework for assessing multimodal pronoun disambiguation in text-to-image models. The framework uses GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis.

Findings

01. Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis

02. Despite incremental advances across successive model versions, results are only marginally better than random guessing

03. Error analysis highlights key challenges for future research

Abstract

Large Language Models (LLMs) have demonstrated remarkable success in tasks like the Winograd Schema Challenge (WSC), showcasing advanced textual common-sense reasoning. However, applying this reasoning to multimodal domains, where understanding text and images together is essential, remains a substantial challenge. To address this, we introduce WinoVis, a novel dataset specifically designed to probe text-to-image models on pronoun disambiguation within multimodal contexts. Utilizing GPT-4 for prompt generation and Diffusion Attentive Attribution Maps (DAAM) for heatmap analysis, we propose a novel evaluation framework that isolates the models' ability in pronoun disambiguation from other visual processing challenges. Evaluation of successive model versions reveals that, despite incremental advancements, Stable Diffusion 2.0 achieves a precision of 56.7% on WinoVis, only marginally…
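The abstract describes using DAAM heatmaps to judge which entity a pronoun is visually associated with. The following is a minimal sketch of one way such a comparison could work, not the paper's confirmed procedure: it assumes each word's heatmap is a 2D grid of values in [0, 1], binarizes the grids at a hypothetical threshold of 0.5, and picks the candidate whose region overlaps most with the pronoun's region by intersection over union. The function names, the threshold, the decision rule, and the toy data are all illustrative assumptions.

```python
from typing import Dict, List

Heatmap = List[List[float]]
Mask = List[List[bool]]


def binarize(heatmap: Heatmap, threshold: float = 0.5) -> Mask:
    """Turn a heatmap into a boolean mask (threshold is an assumed value)."""
    return [[value >= threshold for value in row] for row in heatmap]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized boolean masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            intersection += int(x and y)
            union += int(x or y)
    return intersection / union if union else 0.0


def resolve_pronoun(
    pronoun_map: Heatmap,
    candidate_maps: Dict[str, Heatmap],
    threshold: float = 0.5,
) -> str:
    """Return the candidate whose mask overlaps most with the pronoun's mask.

    Assumed decision rule: highest IoU wins; ties go to the first candidate.
    """
    pronoun_mask = binarize(pronoun_map, threshold)
    return max(
        candidate_maps,
        key=lambda name: iou(pronoun_mask, binarize(candidate_maps[name], threshold)),
    )


# Toy 2x3 heatmaps (made-up values, not real DAAM output).
pronoun = [[0.9, 0.8, 0.1], [0.7, 0.2, 0.0]]
bee = [[0.9, 0.7, 0.0], [0.6, 0.1, 0.0]]
flower = [[0.0, 0.1, 0.9], [0.0, 0.2, 0.8]]

chosen = resolve_pronoun(pronoun, {"bee": bee, "flower": flower})
print(chosen)
```

In this toy case the pronoun's high-attention cells coincide with the first candidate's, so that candidate is selected. A real pipeline would obtain the heatmaps from a DAAM trace of the diffusion model and would likely need a tuned threshold.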

Taxonomy

Topics: Art, Technology, and Culture

Methods: Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections