When are Lemons Purple? The Concept Association Bias of Vision-Language Models

Yutaro Yamada; Yingtian Tang; Yoyo Zhang; Ilker Yildirim

arXiv:2212.12043·cs.CV·April 16, 2024·6 cites

Open Access

TL;DR

This paper investigates the Concept Association Bias (CAB) in vision-language models like CLIP, revealing how it affects zero-shot classification and VQA performance, and showing that training methods influence the presence of CAB.

Contribution

It identifies and characterizes the Concept Association Bias in vision-language models and analyzes how different training losses impact this bias.

Findings

1. CAB causes models to treat concepts as interchangeable, affecting predictions.
2. Strong concept associations reduce zero-shot classification accuracy.
3. Autoregressive training reduces or eliminates CAB.

Abstract

Large-scale vision-language models such as CLIP have shown impressive performance on zero-shot image classification and image-to-text retrieval. However, this performance does not carry over to tasks that require a finer-grained correspondence between vision and language, such as Visual Question Answering (VQA). As a potential cause of the difficulty of applying these models to VQA and similar tasks, we report a phenomenon in vision-language models that we call the Concept Association Bias (CAB). We find that models with CAB tend to treat the input as a bag of concepts and attempt to fill in the missing concept crossmodally, leading to unexpected zero-shot predictions. We demonstrate CAB by showing that CLIP's zero-shot classification performance suffers greatly when there is a strong concept association between an object (e.g. eggplant) and an attribute (e.g. color…
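For readers unfamiliar with the zero-shot classification setup the abstract refers to, the following is a minimal sketch of how CLIP-style models are typically scored: the image embedding is compared with the text embedding of each candidate prompt, and the most similar prompt wins. The function names (`cosine`, `zero_shot_classify`) and the three-dimensional vectors are hypothetical and hand-made for illustration; they are not outputs of CLIP and do not reproduce any result from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_vec, prompt_vecs):
    """Return the best-matching prompt and the similarity score of every prompt."""
    scores = {prompt: cosine(image_vec, vec) for prompt, vec in prompt_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hand-made toy embeddings (assumption): NOT produced by CLIP or any real encoder.
toy_image = [0.9, 0.1, 0.2]
toy_prompts = {
    "a photo of a lemon": [1.0, 0.0, 0.1],
    "a photo of an eggplant": [0.0, 1.0, 0.1],
}

best, scores = zero_shot_classify(toy_image, toy_prompts)
print(best)
```

Because the prediction depends only on which prompt embedding lies closest to the image embedding, anything that pulls an associated concept's embedding toward the image (as the abstract describes for CAB) can change the chosen prompt.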

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout