Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
Jiwan Chung, Seungwon Lim, Jaehyun Jeon, Seungbeen Lee, Youngjae Yu

TL;DR
This paper introduces UNPIE, a benchmark of puns paired with explanatory images, to evaluate how well multimodal models resolve lexical ambiguity using visual cues. Models given visual context outperform text-only baselines.
Contribution
The paper presents UNPIE, a novel multimodal benchmark with 1,000 puns and visual explanations, to assess models' ability to resolve lexical ambiguity using visual context.
Findings
Visual-language models and Socratic Models outperform text-only models on pun disambiguation.
The benefit of visual cues grows as task complexity increases.
UNPIE enables systematic evaluation of multimodal literacy in NLP.
Abstract
Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy: Pun Grounding, Disambiguation, and Reconstruction. The results indicate…
Taxonomy
Topics: Language, Metaphor, and Cognition · Subtitles and Audiovisual Media
