Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

Jiwan Chung; Seungwon Lim; Jaehyun Jeon; Seungbeen Lee; Youngjae Yu

arXiv:2410.01023·cs.CV·October 24, 2024

TL;DR

This paper introduces UNPIE, a benchmark of visual puns for evaluating how well multimodal models resolve lexical ambiguity using visual cues. Models given visual context outperform text-only models.

Contribution

The paper presents UNPIE, a novel multimodal benchmark with 1,000 puns and visual explanations, to assess models' ability to resolve lexical ambiguity using visual context.

Findings

01 Visual and Socratic models outperform text-only models on pun disambiguation.

02 The benefit of visual cues grows as task complexity increases.

03 UNPIE enables systematic evaluation of multimodal literacy in NLP.

Abstract

Humans possess multimodal literacy, allowing them to actively integrate information from various modalities to form reasoning. Faced with challenges like lexical ambiguity in text, we supplement this with other modalities, such as thumbnail images or textbook illustrations. Is it possible for machines to achieve a similar multimodal understanding capability? In response, we present Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities. Puns serve as the ideal subject for this evaluation due to their intrinsic ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings. We pose three multimodal challenges with the annotations to assess different aspects of multimodal literacy: Pun Grounding, Disambiguation, and Reconstruction. The results indicate…
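The comparison the abstract sets up (the same model scored with and without visual input) can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the `Prediction` schema, the field names, the `sense_a`/`sense_b` labels, and the toy values are all hypothetical, and exact-match accuracy is an assumed metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One model answer on one benchmark item (hypothetical schema)."""
    item_id: str
    gold: str        # reference answer for the item
    text_only: str   # answer produced from the pun text alone
    with_image: str  # answer produced from the pun text plus the image


def accuracy(preds, field):
    """Fraction of items whose `field` answer exactly matches the gold answer."""
    if not preds:
        raise ValueError("no predictions to score")
    hits = sum(getattr(p, field) == p.gold for p in preds)
    return hits / len(preds)


def visual_gain(preds):
    """Accuracy with the image minus accuracy without it."""
    return accuracy(preds, "with_image") - accuracy(preds, "text_only")


# Toy, made-up predictions for illustration only; not results from the paper.
toy = [
    Prediction("p1", "sense_a", "sense_b", "sense_a"),
    Prediction("p2", "sense_b", "sense_b", "sense_b"),
    Prediction("p3", "sense_a", "sense_b", "sense_b"),
    Prediction("p4", "sense_b", "sense_a", "sense_b"),
]

if __name__ == "__main__":
    print(f"text-only accuracy:  {accuracy(toy, 'text_only'):.2f}")
    print(f"with-image accuracy: {accuracy(toy, 'with_image'):.2f}")
    print(f"visual gain:         {visual_gain(toy):.2f}")
```

Scoring both conditions over the same items keeps the comparison paired, so any difference reflects the added image rather than a change in the item set.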

Code & Models

Repositories

jiwanchung/visualpun_unpie (official)

Datasets

jiwan-chung/VisualPun_UNPIE (dataset, 24 downloads)

Videos

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!

Taxonomy

Topics: Language, Metaphor, and Cognition · Subtitles and Audiovisual Media