ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee

TL;DR
This paper introduces ColorFoil, a benchmark to evaluate large vision and language models' ability to perceive colors, revealing significant gaps in their robustness and color discrimination capabilities in zero-shot settings.
Contribution
The paper presents a new benchmark, ColorFoil, for assessing color perception in V&L models and evaluates seven models, highlighting their strengths and weaknesses in color recognition.
Findings
ViLT and BridgeTower outperform others in color perception.
CLIP-based models and GroupViT struggle with distinct color differentiation.
Consistent with prior studies, V&L models show limited robustness when handling complex linguistic and visual attributes.
Abstract
Built on the Transformer architecture, large Vision and Language (V&L) models have shown promising performance even in zero-shot settings. Several studies, however, indicate that these models lack robustness when dealing with complex linguistic and visual attributes. In this work, we introduce a novel V&L benchmark, ColorFoil, which uses color-related foils to assess the models' ability to perceive colors such as red, white, and green. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception than CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually…
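The foil idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The color vocabulary, the choice to swap every color word in a caption, and the names make_color_foil, foil_accuracy, and score_fn are all assumptions made for illustration; score_fn stands in for any image-text matching score a V&L model might provide.

```python
import random
import re

# Assumed color vocabulary; the benchmark's actual list is not given in this summary.
COLORS = ["red", "white", "green", "blue", "black", "yellow"]
_COLOR_RE = re.compile(r"\b(" + "|".join(COLORS) + r")\b", re.IGNORECASE)


def make_color_foil(caption, rng):
    """Swap each color word in `caption` for a different color.

    Returns None when the caption contains no color word.
    Replacement words are always lowercase.
    """
    if not _COLOR_RE.search(caption):
        return None

    def swap(match):
        original = match.group(0).lower()
        return rng.choice([c for c in COLORS if c != original])

    return _COLOR_RE.sub(swap, caption)


def foil_accuracy(examples, score_fn):
    """Fraction of (image, caption, foil) triples where the caption outscores the foil.

    `score_fn(image, text)` is a hypothetical placeholder for a model's
    zero-shot image-text matching score.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        score_fn(image, caption) > score_fn(image, foil)
        for image, caption, foil in examples
    )
    return correct / len(examples)
```

Under these assumptions, a model counts as correct on an example only when it scores the original caption strictly above its foil, so a model that ignores color words would be expected to score at or near zero under this strict comparison, since caption and foil would tie.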
Taxonomy
Topics: Categorization, perception, and language
Methods: Attention Is All You Need · Dense Connections · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding · Adam · Dropout
