Color in Visual-Language Models: CLIP deficiencies
Guillem Arias, Ramon Baldrich, Maria Vanrell

TL;DR
This paper investigates how CLIP encodes color, revealing biases and deficiencies in its color representation, and analyzes neuron-level features to understand and improve its color understanding capabilities.
Contribution
It identifies specific biases and neuron-level mechanisms in CLIP's color encoding, proposing directions for refining multimodal color representations.
Findings
CLIP is biased against achromatic stimuli in color labeling: white, gray and black are rarely assigned as color labels.
CLIP tends to prioritize text over visual information in color recognition.
Neuron analysis reveals a dominance of text-selective neurons and fewer multi-modal color neurons.
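The color-labeling setup behind these findings can be illustrated with a minimal sketch of CLIP-style zero-shot labeling: an image embedding is compared with the text embedding of each candidate color name by cosine similarity, and the similarities are turned into probabilities with a softmax. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the three-dimensional toy embeddings (real CLIP embeddings come from its image and text encoders and have hundreds of dimensions), and the temperature value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(image_emb, text_embs, temperature=0.01):
    """Pick the color label whose text embedding is closest to the image.

    text_embs maps each label to its embedding. temperature is an assumed
    value for this sketch, not a setting reported by the authors.
    """
    sims = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(sims.values())
    exps = {label: math.exp((s - peak) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy embeddings standing in for real encoder outputs.
toy_text_embs = {
    "red": [1.0, 0.0, 0.0],
    "green": [0.0, 1.0, 0.0],
    "blue": [0.0, 0.0, 1.0],
}
toy_image_emb = [0.9, 0.1, 0.2]  # a mostly "red" stimulus in this toy space

label, probs = zero_shot_label(toy_image_emb, toy_text_embs)
print(label)
```

In a Stroop-style variant of this setup, the image would contain a color word rendered in a different ink color, and one would check whether the predicted label follows the word or the ink.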
Abstract
This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), which is currently the most influential VLM (Visual Language Model) in Artificial Intelligence. After performing different experiments on synthetic datasets created for this task, we conclude that CLIP is able to attribute correct color labels to colored visual stimuli, but we come across two main deficiencies: (a) a clear bias on achromatic stimuli, which are poorly related to the color concept, so that white, gray and black are rarely assigned as color labels; and (b) a tendency to prioritize text over other visual information, which we prove to be highly significant in color labelling through an exhaustive Stroop-effect test. With the aim of finding the causes of these color deficiencies, we analyse the internal representation at the neuron level. We conclude that CLIP presents an important amount of…
Taxonomy
Methods: Contrastive Language-Image Pre-training
