PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
Kun Ouyang, Yuanxin Liu, Shicheng Li, Yi Liu, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

TL;DR
This paper introduces PunchBench, a benchmark for evaluating how well multimodal large language models (MLLMs) understand humor and sarcasm in image-caption pairs. It targets three limitations of existing benchmarks (language shortcuts, limited question diversity, and narrow domain coverage) and proposes SC-CoQ, a strategy for improving punchline comprehension.
Contribution
It presents PunchBench, a novel benchmark with diverse questions and domain coverage, and introduces SC-CoQ, a strategy to improve punchline comprehension in MLLMs.
Findings
A significant gap remains between MLLMs and humans in punchline comprehension.
The SC-CoQ strategy improves MLLMs' performance on PunchBench.
Synonymous and antonymous captions are generated to enhance evaluation accuracy.
Abstract
Multimodal punchlines, which involve humor or sarcasm conveyed in image-caption pairs, are a popular means of communication on online multimedia platforms. With the rapid development of multimodal large language models (MLLMs), it is essential to assess their ability to comprehend these punchlines effectively. However, existing benchmarks on punchline comprehension suffer from three major limitations: 1) language shortcuts that allow models to rely solely on text, 2) a lack of question diversity, and 3) a narrow focus on a specific domain of multimodal content (e.g., cartoons). To address these limitations, we introduce a multimodal Punchline comprehension Benchmark, named PunchBench, which is tailored for accurate and comprehensive evaluation of punchline comprehension. To enhance the evaluation accuracy, we generate synonymous and antonymous captions by modifying…
Taxonomy
Topics: Subtitles and Audiovisual Media · Speech and dialogue systems
