Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR
This paper investigates the limitations of CLIP models in multi-object scenarios, revealing biases towards object size and order, and demonstrates how these biases affect performance in image-caption matching and generation tasks.
Contribution
It introduces controlled datasets to analyze CLIP's biases and extends the analysis to Stable Diffusion, providing new insights into training-induced biases in vision-language models.
Findings
CLIP's image encoder favors larger objects
CLIP's text encoder prioritizes first-mentioned objects
Biases impact image-caption matching and generation
Abstract
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in complex multi-object scenarios remains limited. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize that these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing…
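One way to make the text-encoder order bias concrete is to compare a two-object caption's embedding against the embeddings of each object named alone. The sketch below is illustrative only and is not the paper's evaluation protocol: the function names, the "A and B" caption template, and the toy encoder are all assumptions. In practice `embed` would wrap a real CLIP text encoder; here a hypothetical position-weighted stand-in keeps the example self-contained.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def order_bias(embed, first, second):
    """Similarity gap between the first- and second-mentioned object.

    A positive value means the caption embedding sits closer to the
    object mentioned first. The caption template is an assumption.
    """
    caption = embed(f"{first} and {second}")
    return cosine(caption, embed(first)) - cosine(caption, embed(second))


def toy_embed(text):
    """Hypothetical stand-in encoder (NOT CLIP).

    Each known word contributes to its own dimension, and earlier
    words receive a larger weight, mimicking a first-mention bias.
    """
    vocab = {"dog": 0, "cat": 1}
    vec = np.zeros(len(vocab))
    weight = 1.0
    for word in text.split():
        if word in vocab:
            vec[vocab[word]] += weight
            weight *= 0.5  # later mentions count for less
    return vec


gap_dog_first = order_bias(toy_embed, "dog", "cat")
gap_cat_first = order_bias(toy_embed, "cat", "dog")
print(gap_dog_first, gap_cat_first)
```

With the toy encoder both gaps are positive regardless of which object comes first, which is the signature of an order effect rather than a preference for a particular object. Swapping in a real text encoder for `toy_embed` would let the same comparison be run on actual CLIP embeddings.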
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Subtitles and Audiovisual Media
Methods: Diffusion · Contrastive Language-Image Pre-training
