Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

TL;DR
Pre-trained vision-language models learn visual concepts such as color and texture. These concepts can be extracted through the models' vision-language interface using text-based concept prompts, which supports better interpretability and reasoning.
Contribution
The paper introduces a new framework for identifying and ranking the visual concepts learned by VLMs, motivated by the conflicting conclusions that arise when prior works define and evaluate concepts in different ways.
Findings
VLMs learn diverse visual concepts that describe objects accurately.
The proposed CDL framework effectively discovers concepts based on mutual information.
Quantitative and human evaluations confirm the quality of discovered concepts.
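To make the mutual-information idea concrete, here is a minimal sketch of ranking candidate concepts by how much information their presence carries about the object category. It is not the paper's CDL implementation: the function names, the binary concept flags, and the toy data are all assumptions chosen for illustration, and in practice the flags would come from a VLM's responses to concept prompts.

```python
import math
from collections import Counter


def mutual_information(concept_present, labels):
    """Estimate I(C; Y) in bits from paired (concept flag, label) samples."""
    n = len(labels)
    joint = Counter(zip(concept_present, labels))
    count_c = Counter(concept_present)
    count_y = Counter(labels)
    mi = 0.0
    for (c, y), count in joint.items():
        p_cy = count / n
        p_c = count_c[c] / n
        p_y = count_y[y] / n
        mi += p_cy * math.log2(p_cy / (p_c * p_y))
    return mi


def rank_concepts(concept_table, labels):
    """Sort concepts from most to least informative about the labels."""
    scores = {
        name: mutual_information(flags, labels)
        for name, flags in concept_table.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one row per image, 1 means the concept was detected.
labels = ["durian", "durian", "apple", "apple"]
concept_table = {
    "spiky": [1, 1, 0, 0],  # separates the two classes perfectly
    "round": [1, 1, 1, 1],  # present everywhere, so uninformative
    "brown": [1, 0, 0, 0],  # only partially informative
}

ranking = rank_concepts(concept_table, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

In this toy setting a concept that appears on every image scores zero, while a concept that cleanly separates the categories scores highest, which is the intuition behind using mutual information as a ranking signal.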
Abstract
Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: First, certain concept prompts include shortcuts that recognize correct concepts for wrong reasons; Second, multimodal information (e.g. visual…
Taxonomy
Topics: Multimodal Machine Learning Applications
