Probing Conceptual Understanding of Large Visual-Language Models
Madeline Schiappa, Raiyaan Abdullah, Shehreen Azad, Jared Claypoole, Michael Cogswell, Ajay Divakaran, Yogesh Rawat

TL;DR
This paper introduces new benchmarks to evaluate whether large visual-language models truly understand visual content conceptually, revealing their current limitations and potential improvements.
Contribution
It proposes novel cognitive-science-inspired benchmarks for assessing conceptual understanding in V+L models, analyzes model performance on them, and draws insights for future improvements.
Findings
Most models fail to demonstrate conceptual understanding.
Cross-attention improves conceptual learning.
Transformers excel at color and shape, CNNs at texture and patterns.
Abstract
In recent years, large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work, we focus on the conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding: 1) relations, 2) composition, and 3) context. Our probes are grounded in cognitive science and help determine if a V+L model can, for example, determine if snow garnished with a man is implausible, or if it can identify beach furniture by knowing it is located on a beach. We experimented with many recent state-of-the-art V+L models and observe that these models mostly fail to demonstrate a conceptual understanding. This study reveals several…
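The kind of plausibility probe the abstract describes can be sketched roughly as follows. This is a minimal illustration and not the paper's benchmark or evaluation code: the probe items, the `bag_of_words_score` stand-in scorer, and the `probe_accuracy` helper are all hypothetical names made up for this example. The scorer deliberately ignores word order, so it cannot tell a plausible caption from its role-swapped, implausible counterpart.

```python
from typing import Callable, List, Tuple

# Hypothetical probe item: (image tags, plausible caption, implausible caption).
# The image tags stand in for real image features; they are not from the paper.
ProbeItem = Tuple[List[str], str, str]


def bag_of_words_score(image_tags: List[str], caption: str) -> float:
    """Toy stand-in for an image-text similarity score.

    Returns the fraction of caption words that match an image tag.
    Word order is ignored on purpose.
    """
    words = caption.lower().split()
    tags = {tag.lower() for tag in image_tags}
    return sum(word in tags for word in words) / len(words)


def probe_accuracy(
    items: List[ProbeItem],
    score: Callable[[List[str], str], float],
) -> float:
    """Fraction of items where the plausible caption scores strictly higher."""
    correct = sum(
        score(tags, plausible) > score(tags, implausible)
        for tags, plausible, implausible in items
    )
    return correct / len(items)


# Made-up items in the spirit of the abstract's examples.
items: List[ProbeItem] = [
    (["man", "snow"], "a man covered with snow", "snow covered with a man"),
    (["chair", "beach"], "a chair on a beach", "a beach on a chair"),
]

accuracy = probe_accuracy(items, bag_of_words_score)
print(accuracy)
```

Because both captions in each pair contain the same words, the order-blind scorer assigns them equal scores and is credited with no correct answers. Any real V+L model could be plugged in as the `score` argument, assuming it exposes some image-text similarity function.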
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition
Methods: Contrastive Language-Image Pre-training
