Understanding Figurative Meaning through Explainable Visual Entailment
Arkadiy Saakyan, Shreyas Kulkarni, Tuhin Chakrabarty, Smaranda Muresan

TL;DR
This paper introduces a new explainable visual entailment task to assess how well large vision-language models understand figurative language in images and captions, highlighting their current limitations and error types.
Contribution
It presents the V-FLUTE dataset for figurative meaning understanding and analyzes the challenges faced by models in generalizing from literal to figurative content.
Findings
VLMs struggle with figurative meaning, especially in images.
Models often hallucinate or reason incompletely on figurative tasks.
The dataset enables systematic evaluation of figurative language understanding.
Abstract
Large Vision-Language Models (VLMs) have demonstrated strong capabilities in tasks requiring a fine-grained understanding of literal meaning in images and text, such as visual question-answering or visual entailment. However, there has been little exploration of the capabilities of these models when presented with images and captions containing figurative meaning, such as metaphors or humor. To close this gap, we propose a new task framing the figurative meaning understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a caption (hypothesis) and justify the predicted label with a textual explanation. The figurative phenomena can be present in the image, in the caption, or both. Using a human-AI collaboration approach, we build the accompanying expert-verified dataset V-FLUTE, containing 6,027 {image, caption,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsNatural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
