VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt

TL;DR
VALSE is a new benchmark for evaluating vision and language models on their ability to understand and ground specific linguistic phenomena in visual data, enabling more detailed assessments of their linguistic and visual reasoning capabilities.
Contribution
The paper introduces VALSE, a benchmark of six tests, each targeting specific linguistic phenomena. It is built with methods that support the construction of valid foils, and it enables finer-grained evaluation of V&L models than task-centred benchmarks allow.
Findings
The five widely-used V&L models evaluated have considerable difficulty with most of the tested phenomena.
VALSE reveals gaps in models' visio-linguistic grounding abilities.
The authors expect VALSE to help measure future progress of pretrained V&L models from a linguistic perspective.
Abstract
We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L…
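One common way to score a foil-based test of this kind can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each example pairs an image with a correct caption and a foil (a caption altered so that it no longer describes the image), and that some model has already produced an image-text alignment score for both. The function name `pairwise_accuracy` and the example scores are hypothetical.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of examples where the caption scores strictly higher than its foil.

    Each element is (caption_score, foil_score) for the same image.
    Ties count as failures, since the model did not prefer the caption.
    """
    if not score_pairs:
        raise ValueError("need at least one (caption_score, foil_score) pair")
    wins = sum(
        1 for caption_score, foil_score in score_pairs if caption_score > foil_score
    )
    return wins / len(score_pairs)


# Made-up scores for illustration only; these are not results from the paper.
example_scores = [(0.91, 0.12), (0.40, 0.65), (0.77, 0.30), (0.55, 0.55)]
print(pairwise_accuracy(example_scores))
```

Under this scoring, a model that cannot tell captions from foils would be expected to land near 0.5, which is why a pairwise comparison is a convenient lens for phenomenon-specific grounding.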
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
