Color in Visual-Language Models: CLIP deficiencies

Guillem Arias; Ramon Baldrich; Maria Vanrell

arXiv:2502.04470 · cs.CV · February 10, 2025

TL;DR

This paper investigates how CLIP encodes color, revealing biases and deficiencies in its color representation, and analyzes neuron-level features to understand and improve its color understanding capabilities.

Contribution

It identifies specific biases and neuron-level mechanisms in CLIP's color encoding, proposing directions for refining multimodal color representations.

Findings

01

CLIP shows bias against achromatic stimuli in color labeling.

02

CLIP tends to prioritize text over visual information in color recognition.

03

Neuron analysis reveals a dominance of text-selective neurons and fewer multi-modal color neurons.

Abstract

This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), which is currently the most influential VLM (Visual Language Model) in Artificial Intelligence. After performing different experiments on synthetic datasets created for this task, we conclude that CLIP is able to attribute correct color labels to colored visual stimuli, but we come across two main deficiencies: (a) a clear bias on achromatic stimuli, which are poorly related to the color concept, so that white, gray and black are rarely assigned as color labels; and (b) a tendency to prioritize text over other visual information, which we prove to be highly significant in color labelling through an exhaustive Stroop-effect test. With the aim of finding the causes of these color deficiencies, we analyse the internal representation at the neuron level. We conclude that CLIP presents an important amount of…
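To make the Stroop-effect test concrete, here is a minimal sketch of how such a test could be scored once a model has labelled each stimulus. It is not the authors' code: the function name `stroop_summary`, the triple layout `(word, ink, predicted)`, and the toy trials are all assumptions for illustration, and the toy values are placeholders, not results from the paper.

```python
from collections import Counter


def stroop_summary(trials):
    """Summarise Stroop-style trials.

    trials: iterable of (word, ink, predicted) color-name triples, where
    `word` is the color name written in the image, `ink` is the color it
    is rendered in, and `predicted` is the label the model assigned.
    (Hypothetical layout, chosen for this sketch.)
    """
    counts = Counter()
    for word, ink, predicted in trials:
        if word == ink:
            # Congruent trials cannot separate the text cue from the visual cue.
            continue
        if predicted == ink:
            counts["ink"] += 1      # model followed the visual color
        elif predicted == word:
            counts["text"] += 1     # model followed the written word
        else:
            counts["other"] += 1    # model chose neither cue
    total = sum(counts.values())
    if total == 0:
        return {"ink": 0.0, "text": 0.0, "other": 0.0}
    return {key: counts[key] / total for key in ("ink", "text", "other")}


# Placeholder trials for illustration only (not data from the paper).
toy_trials = [
    ("red", "blue", "red"),      # incongruent, label follows the text
    ("green", "red", "green"),   # incongruent, label follows the text
    ("blue", "green", "green"),  # incongruent, label follows the ink
    ("red", "red", "red"),       # congruent, skipped
]
summary = stroop_summary(toy_trials)
```

A `text` fraction well above the `ink` fraction on incongruent trials would indicate the kind of text-over-vision preference the abstract describes.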

Taxonomy

Methods: Contrastive Language-Image Pre-training