"I See What You Did There": Can Large Vision-Language Models Understand Multimodal Puns?

Naen Xu; Jiayi Sheng; Changjiang Li; Chunyi Zhou; Yuyuan Li; Tianyu Du; Jun Wang; Zhihui Fu; Jinbao Li; Shouling Ji

arXiv:2604.05930·cs.CL·April 8, 2026

TL;DR

This paper introduces MultiPun, a dataset of multimodal puns paired with adversarial non-pun distractors. It shows that most vision-language models struggle to tell the two apart, and it proposes prompt-level and model-level strategies that improve F1 scores by 16.5% on average.

Contribution

The paper presents the first systematic study of multimodal pun understanding in VLMs, including a new dataset, evaluation methods, and strategies to enhance pun comprehension.

Findings

01 Most models struggle to distinguish genuine puns from adversarial non-pun distractors.

02 Prompt-level and model-level strategies improve pun comprehension by an average of 16.5% in F1 scores.

03 The MultiPun dataset enables rigorous benchmarking of multimodal pun comprehension.

Abstract

Puns are a common form of rhetorical wordplay that exploits polysemy and phonetic similarity to create humor. In multimodal puns, visual and textual elements synergize to ground the literal sense and evoke the figurative meaning simultaneously. Although Vision-Language Models (VLMs) are widely used in multimodal understanding and generation, their ability to understand puns has not been systematically studied due to a scarcity of rigorous benchmarks. To address this, we first propose a multimodal pun generation pipeline. We then introduce MultiPun, a dataset comprising diverse types of puns alongside adversarial non-pun distractors. Our evaluation reveals that most models struggle to distinguish genuine puns from these distractors. Moreover, we propose both prompt-level and model-level strategies to enhance pun comprehension, with an average improvement of 16.5% in F1 scores. Our…
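As a rough illustration of the metric named in the abstract, the sketch below computes a binary F1 score for pun detection, treating "genuine pun" as the positive class and "non-pun distractor" as the negative class. The labels, the function name `f1_score`, and the toy predictions are assumptions made for illustration; the excerpt does not describe the paper's actual evaluation protocol or say whether the 16.5% figure is an absolute or relative gain.

```python
from typing import Sequence


def f1_score(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Binary F1 with 'pun' (True) as the positive class.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))

    # With no true positives, precision and recall are both zero (or undefined).
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumed): True = genuine pun, False = non-pun distractor.
gold = [True, True, True, False, False, False]
# A model that answers "pun" for nearly every input, distractors included.
pred = [True, True, True, True, True, False]

score = f1_score(gold, pred)
print(f"F1 = {score:.2f}")
```

In this toy case the model finds every genuine pun but also accepts two of the three distractors, so recall is perfect while precision drops. That is the kind of failure a distractor-heavy benchmark is meant to expose.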
