Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study

Reza Abbasi; Ali Nazari; Aminreza Sefid; Mohammadali Banayeeanzade; Mohammad Hossein Rohban; Mahdieh Soleymani Baghshah

arXiv:2502.19828·cs.CV·February 28, 2025

Open Access

TL;DR

This paper investigates the limitations of CLIP models in multi-object scenarios, revealing biases towards object size and order, and demonstrates how these biases affect performance in image-caption matching and generation tasks.

Contribution

It introduces controlled datasets to analyze CLIP's biases and extends the analysis to Stable Diffusion, providing new insights into training-induced biases in vision-language models.

Findings

01 CLIP's image encoder favors larger objects

02 Text encoder prioritizes first-mentioned objects

03 Biases impact image-caption matching and generation

Abstract

Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in complex multi-object scenarios remains limited. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize that these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing…
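
To make the first-mention bias described above concrete, here is a minimal sketch of one way such a preference could be quantified. It is not the paper's metric. The function name `first_mention_preference` and the three-dimensional toy vectors are hypothetical stand-ins; in practice the embeddings would come from a CLIP text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def first_mention_preference(multi_emb, first_emb, second_emb):
    """Similarity gap between a multi-object caption and its two objects.

    A positive value means the multi-object caption embedding lies closer
    to the first-mentioned object's embedding than to the second's.
    """
    return cosine(multi_emb, first_emb) - cosine(multi_emb, second_emb)


# Toy embeddings (assumptions, not real CLIP outputs).
cat = [1.0, 0.0, 0.0]          # stands in for "a cat"
dog = [0.0, 1.0, 0.0]          # stands in for "a dog"
cat_and_dog = [0.8, 0.6, 0.0]  # stands in for "a cat and a dog"

gap = first_mention_preference(cat_and_dog, cat, dog)
print(round(gap, 3))  # 0.2
```

With real encoders, averaging this gap over many caption pairs with the object order swapped would help separate an order effect from differences between the objects themselves.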

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Subtitles and Audiovisual Media

Methods: Diffusion · Contrastive Language-Image Pre-training