Understanding Figurative Meaning through Explainable Visual Entailment

Arkadiy Saakyan; Shreyas Kulkarni; Tuhin Chakrabarty; Smaranda Muresan

arXiv:2405.01474 · cs.CL · February 18, 2025

TL;DR

This paper introduces an explainable visual entailment task that tests how well large vision-language models understand figurative meaning in images and captions, and it characterizes the models' current limitations and error types.

Contribution

It presents the V-FLUTE dataset for figurative meaning understanding and analyzes the challenges faced by models in generalizing from literal to figurative content.

Findings

1. VLMs struggle with figurative meaning, especially when it is present in the image.
2. Models often hallucinate or reason incompletely on figurative tasks.
3. The dataset enables systematic evaluation of figurative language understanding.

Abstract

Large Vision-Language Models (VLMs) have demonstrated strong capabilities in tasks requiring a fine-grained understanding of literal meaning in images and text, such as visual question-answering or visual entailment. However, there has been little exploration of the capabilities of these models when presented with images and captions containing figurative meaning, such as metaphors or humor. To close this gap, we propose a new task framing the figurative meaning understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a caption (hypothesis) and justify the predicted label with a textual explanation. The figurative phenomena can be present in the image, in the caption, or both. Using a human-AI collaboration approach, we build the accompanying expert-verified dataset V-FLUTE, containing 6,027 {image, caption,…
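The task format described in the abstract can be sketched as a small data structure. This is only an illustrative sketch, not the authors' code: the class names, field names, the two-way label set, and the `label_accuracy` helper are all assumptions introduced here, and the check covers only the predicted label, not the quality of the explanation.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed label set: the abstract only says the model predicts whether
# the image (premise) entails the caption (hypothesis).
LABELS = ("entailment", "contradiction")


@dataclass(frozen=True)
class Instance:
    """One task instance (hypothetical field names)."""
    image_path: str   # premise: the image
    caption: str      # hypothesis: the caption
    label: str        # gold label, one of LABELS
    explanation: str  # textual justification of the label

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


@dataclass(frozen=True)
class Prediction:
    """A model output: a label plus a free-text explanation."""
    label: str
    explanation: str


def label_accuracy(gold: Sequence[Instance], preds: Sequence[Prediction]) -> float:
    """Fraction of predictions whose label matches the gold label.

    Explanations are ignored here; scoring them would need a separate metric.
    """
    if len(gold) != len(preds):
        raise ValueError("gold and preds must have the same length")
    if not gold:
        return 0.0
    correct = sum(g.label == p.label for g, p in zip(gold, preds))
    return correct / len(gold)
```

A model would be called once per instance to produce a `Prediction`, and `label_accuracy` then summarizes label agreement over the whole set.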

Code & Models

Repositories

asaakyan/V-FLUTE
Official

Models

asaakyan/LLaVA-1.5-7b-eViL-VFLUTE-lora
model · 3 downloads

Datasets

ColumbiaNLP/V-FLUTE
dataset · 32 downloads

Videos

Understanding Figurative Meaning through Explainable Visual Entailment · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling