When are Lemons Purple? The Concept Association Bias of Vision-Language Models
Yutaro Yamada, Yingtian Tang, Yoyo Zhang, Ilker Yildirim

TL;DR
This paper investigates the Concept Association Bias (CAB) in vision-language models like CLIP, revealing how it affects zero-shot classification and VQA performance, and showing that training methods influence the presence of CAB.
Contribution
It identifies and characterizes the Concept Association Bias in vision-language models and analyzes how different training losses impact this bias.
Findings
Models with CAB treat the input as a bag of concepts and fill in a missing concept across modalities, which leads to unexpected zero-shot predictions.
Strong concept associations reduce zero-shot classification accuracy.
Autoregressive training reduces or eliminates CAB.
Abstract
Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, such performance does not realize in tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). As a potential cause of the difficulty of applying these models to VQA and similar tasks, we report an interesting phenomenon of vision-language models, which we call the Concept Association Bias (CAB). We find that models with CAB tend to treat input as a bag of concepts and attempt to fill in the other missing concept crossmodally, leading to an unexpected zero-shot prediction. We demonstrate CAB by showing that CLIP's zero-shot classification performance greatly suffers when there is a strong concept association between an object (e.g. eggplant) and an attribute (e.g. color…
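The bag-of-concepts behaviour described in the abstract can be illustrated with a small toy sketch. It is not the paper's method and does not use CLIP. Everything in it is an assumption made for illustration: the four hand-built concept axes, the hypothetical scene (a yellow lemon next to a purple eggplant), the weak color weight of 0.5, and the `assoc` parameter, which models how strongly a color word leaks onto its associated object. The sketch only shows that, under cosine-similarity scoring, such an association can tip a zero-shot color prediction toward the wrong answer.

```python
import math

# Toy concept axes (hypothetical; not real CLIP embedding dimensions).
AXES = ["lemon", "eggplant", "yellow", "purple"]
# Assumed color -> object associations.
ASSOCIATED_OBJECT = {"yellow": "lemon", "purple": "eggplant"}


def bag(weights):
    """Embed a bag of weighted concepts as a vector over AXES."""
    return [weights.get(axis, 0.0) for axis in AXES]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prompt_embedding(obj, color, assoc):
    """Toy text embedding for 'the color of the <obj> is <color>'.

    The color word adds `assoc` units to the axis of its associated object.
    """
    weights = {obj: 1.0, color: 1.0}
    linked = ASSOCIATED_OBJECT[color]
    weights[linked] = weights.get(linked, 0.0) + assoc
    return bag(weights)


def color_scores(image, obj, assoc):
    """Score every candidate color prompt against the image embedding."""
    return {
        color: cosine(image, prompt_embedding(obj, color, assoc))
        for color in ASSOCIATED_OBJECT
    }


# Hypothetical scene: a yellow lemon and a purple eggplant,
# with both colors encoded more weakly than the objects.
image = bag({"lemon": 1.0, "eggplant": 1.0, "yellow": 0.5, "purple": 0.5})

biased = color_scores(image, "lemon", assoc=0.8)
unbiased = color_scores(image, "lemon", assoc=0.0)

print(max(biased, key=biased.get))  # prints "purple", although the lemon is yellow
```

In this toy, the prompt already names the lemon, so the "yellow" prompt only duplicates a concept the text has covered, while the "purple" prompt also covers the eggplant that the text leaves out. With `assoc=0.0` the two prompts score the same, so the wrong preference comes entirely from the assumed association.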
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout
