Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls

Elena Pitta; Tom Kouwenhoven; Tessa Verhoef

arXiv:2507.17467·cs.CV·July 24, 2025

TL;DR

This paper critically examines the Visual Entailment task as a probe for vision-language understanding in multimodal models, revealing its strengths and limitations through extensive experiments and analysis.

Contribution

The paper provides a comprehensive evaluation of Visual Entailment (VE) as a diagnostic tool and highlights factors that influence model performance, such as prompt design and access to visual information.

Findings

01

Three-shot inference outperforms the zero-shot baseline.

02

Additional examples can introduce noise, affecting performance.

03

Fine-tuning achieves 83.3% accuracy on e-SNLI-VE and produces meaningful explanations.

Abstract

This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples, and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we use explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the…
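The zero-shot and few-shot settings described in the abstract differ in how many labeled in-context examples precede the query, and in what order. The following is a minimal illustrative sketch of that setup, not the authors' prompt or code. The `<image>` placeholder, the instruction wording, the example hypotheses, and all function names are assumptions made for illustration; only the three labels (entailment, neutral, contradiction) come from the SNLI-VE family of datasets.

```python
from itertools import permutations

# The three VE labels used by SNLI-VE / e-SNLI-VE.
LABELS = ("entailment", "neutral", "contradiction")


def format_example(hypothesis, label=None):
    """Render one VE item. "<image>" is a hypothetical placeholder for the
    premise image; a real model would need its own image token or input."""
    answer = label if label is not None else ""
    return f"<image>\nHypothesis: {hypothesis}\nAnswer: {answer}".rstrip()


def build_prompt(shots, query_hypothesis):
    """Join an instruction, k labeled shots (k may be 0), and the unlabeled query."""
    instruction = (
        "Does the image entail, contradict, or remain neutral to the hypothesis? "
        f"Answer with one of: {', '.join(LABELS)}."
    )
    parts = [instruction]
    parts += [format_example(h, y) for h, y in shots]
    parts.append(format_example(query_hypothesis))
    return "\n\n".join(parts)


def prompts_for_all_orders(shots, query_hypothesis):
    """Build one prompt per ordering of the shots, to study order sensitivity."""
    return [build_prompt(list(order), query_hypothesis)
            for order in permutations(shots)]


# Made-up examples, one per label.
shots = [
    ("A dog is running outside.", "entailment"),
    ("The dog belongs to a child.", "neutral"),
    ("A cat is sleeping indoors.", "contradiction"),
]
query = "Two people are sitting on a bench."

zero_shot = build_prompt([], query)
three_shot_variants = prompts_for_all_orders(shots, query)
```

Comparing model accuracy across `three_shot_variants` would be one way to measure sensitivity to example order; the paper's actual protocol may differ.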

Taxonomy

Topics: Language, Metaphor, and Cognition