I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes
Shijia Zhou, Saif M. Mohammad, Barbara Plank, Diego Frassinelli

TL;DR
This paper benchmarks multimodal large language models on their ability to interpret figurative meaning in memes, revealing biases and challenges in faithful explanation generation.
Contribution
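It evaluates eight state-of-the-art generative MLLMs across three datasets on detecting and explaining six types of figurative meaning in memes, and adds a human evaluation of whether each generated explanation supports the predicted label and stays faithful to the meme content.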
Findings
All models show a strong bias toward figurative interpretation, associating memes with figurative meaning even when none is present.
Correct predictions often lack faithful explanations.
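One way to make the over-prediction finding concrete is to compare how often a model predicts the figurative label with how often that label actually occurs, and to measure how many literal memes are wrongly flagged. The sketch below is a minimal illustration under stated assumptions: it treats the task as a binary figurative/literal decision, the function name figurative_bias is hypothetical, and the labels are made-up toy values. It is not the paper's evaluation protocol or data.

```python
from typing import Dict, Sequence


def figurative_bias(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Summarize over-prediction of the figurative label.

    gold[i] and pred[i] are True when meme i is labeled figurative.
    (Hypothetical helper for illustration; not from the paper.)
    """
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and the same length")

    n = len(gold)
    literal_total = sum(1 for g in gold if not g)
    # Literal memes that the model nevertheless labeled figurative.
    false_pos = sum(1 for g, p in zip(gold, pred) if p and not g)

    return {
        "gold_figurative_rate": sum(gold) / n,
        "pred_figurative_rate": sum(pred) / n,
        "false_positive_rate": (
            false_pos / literal_total if literal_total else float("nan")
        ),
    }


# Toy labels for illustration only (assumed, not data from the paper).
gold = [True, False, False, True, False]
pred = [True, True, True, True, False]
stats = figurative_bias(gold, pred)
```

A predicted figurative rate well above the gold rate, together with a high false positive rate on literal memes, would be the pattern the finding describes.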
Abstract
Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative…
Taxonomy
Topics: Humor Studies and Applications · Language, Metaphor, and Cognition · Misinformation and Its Impacts
