Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls
Elena Pitta, Tom Kouwenhoven, Tessa Verhoef

TL;DR
This paper critically examines the Visual Entailment task as a probe for vision-language understanding in multimodal models, revealing its strengths and limitations through extensive experiments and analysis.
Contribution
It provides a comprehensive evaluation of VE's effectiveness as a diagnostic tool and highlights factors influencing model performance, such as prompt design and visual information access.
Findings
Three-shot inference outperforms the zero-shot baseline.
Additional in-context examples introduce more noise than benefit.
Fine-tuning achieves 83.3% accuracy on e-SNLI-VE and produces meaningful explanations.
Abstract
This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples, and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we use explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the…
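To make the zero-shot versus few-shot comparison concrete, the following is a minimal sketch of how a k-shot VE prompt and an accuracy score could be assembled. It is not the authors' prompt or evaluation code: the `VEExample` type, the `build_prompt` and `accuracy` helpers, the instruction wording, and the `<image_ref>` placeholder format are all hypothetical and chosen only for illustration. The three labels are the standard VE classes.

```python
from dataclasses import dataclass
from typing import Sequence

# Standard three-way label set of the Visual Entailment task.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass
class VEExample:
    image_ref: str   # hypothetical stand-in for an image (path or id)
    hypothesis: str  # text hypothesis judged against the image premise
    label: str = ""  # gold label; left empty for the query item


def build_prompt(shots: Sequence[VEExample], query: VEExample) -> str:
    """Build a k-shot prompt; an empty `shots` gives the zero-shot case."""
    instruction = (
        "Treat the image as the premise. Decide whether the hypothesis is "
        + ", ".join(LABELS[:-1]) + ", or " + LABELS[-1] + "."
    )
    blocks = [instruction]
    # The order of `shots` is preserved, so reordering the list changes the prompt.
    for ex in shots:
        blocks.append(
            f"Image: <{ex.image_ref}>\nHypothesis: {ex.hypothesis}\nAnswer: {ex.label}"
        )
    # The query block ends with an open "Answer:" for the model to complete.
    blocks.append(
        f"Image: <{query.image_ref}>\nHypothesis: {query.hypothesis}\nAnswer:"
    )
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions matching gold labels after normalisation."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip().lower() == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

Passing zero, three, or more items in `shots`, or permuting them, would reproduce the kind of conditions the abstract describes (number and order of in-context examples); how images are actually passed to the model depends on the inference library and is left out here.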
Taxonomy
Topics: Language, Metaphor, and Cognition
