Probing Conceptual Understanding of Large Visual-Language Models

Madeline Schiappa; Raiyaan Abdullah; Shehreen Azad; Jared Claypoole; Michael Cogswell; Ajay Divakaran; Yogesh Rawat

arXiv:2304.03659 · cs.CV · April 29, 2024 · 1 citation

TL;DR

This paper introduces new benchmarks to evaluate whether large visual-language models truly understand visual content conceptually, revealing their current limitations and potential improvements.

Contribution

It proposes novel cognitive science-inspired benchmarks for assessing conceptual understanding in V+L models and analyzes their performance and insights for future enhancements.

Findings

01 Most models fail to demonstrate conceptual understanding.
02 Cross-attention improves conceptual learning.
03 Transformers excel at color and shape, CNNs at texture and patterns.

Abstract

In recent years, large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work, we focus on the conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding: 1) relations, 2) composition, and 3) context. Our probes are grounded in cognitive science and help determine whether a V+L model can, for example, recognize that snow garnished with a man is implausible, or identify beach furniture by knowing it is located on a beach. We experimented with many recent state-of-the-art V+L models and observed that these models mostly fail to demonstrate a conceptual understanding. This study reveals several…
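The plausibility example in the abstract can be illustrated with a small sketch. This is not the paper's benchmark, data, or metric. It is a toy probe under stated assumptions: a text string stands in for the image, the plausible caption "a man garnished with snow" is an assumed counterpart to the abstract's implausible phrase, and the names `bag_of_words_score`, `bigram_score`, and `probe_accuracy` are hypothetical. It shows why a scorer that ignores word order cannot pass a probe that swaps the two entities, while an order-aware scorer can.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Each item: (scene, plausible caption, implausible caption).
# The scene is a text stand-in for an image (an assumption of this sketch).
ProbeItem = Tuple[str, str, str]
Scorer = Callable[[str, str], float]


def bag_of_words_score(scene: str, caption: str) -> float:
    """Toy scorer: count shared words, ignoring word order."""
    a = Counter(scene.lower().split())
    b = Counter(caption.lower().split())
    return float(sum((a & b).values()))


def _bigrams(text: str) -> Counter:
    """Adjacent word pairs, so that word order affects the score."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def bigram_score(scene: str, caption: str) -> float:
    """Toy scorer: count shared adjacent word pairs (order-aware)."""
    return float(sum((_bigrams(scene) & _bigrams(caption)).values()))


def probe_accuracy(items: List[ProbeItem], score: Scorer) -> float:
    """Fraction of items where the plausible caption strictly outscores
    the implausible one. A tie counts as a failure."""
    passed = sum(
        1 for scene, good, bad in items if score(scene, good) > score(scene, bad)
    )
    return passed / len(items)


ITEMS: List[ProbeItem] = [
    (
        "a man garnished with snow",   # scene stand-in (assumed)
        "a man garnished with snow",   # plausible caption (assumed)
        "snow garnished with a man",   # implausible caption from the abstract
    ),
]

if __name__ == "__main__":
    print("bag of words:", probe_accuracy(ITEMS, bag_of_words_score))
    print("bigrams:", probe_accuracy(ITEMS, bigram_score))
```

In a real evaluation, the scorer would be a V+L model's image-text similarity rather than a string comparison. The strict inequality is a design choice of this sketch, so a model that assigns both captions the same score gets no credit.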

Code & Models

Repositories

Maddy12/UnderstandingVisualTextModels
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition

Methods: Contrastive Language-Image Pre-training