CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

Reza Abbasi; Ali Nazari; Aminreza Sefid; Mohammadali Banayeeanzade; Mohammad Hossein Rohban; Mahdieh Soleymani Baghshah

arXiv:2502.19842·cs.CV·March 4, 2025

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper critically examines CLIP's multi-object representations, revealing biases towards object size and order, and analyzes their origins and impact on zero-shot classification and image generation.

Contribution

The paper provides a detailed analysis of CLIP's limitations in multi-object scenarios, introduces the ComCO dataset, and traces the observed biases to CLIP's training data and training process.

Findings

01 The text encoder favors first-mentioned objects.

02 The image encoder prefers larger objects.

03 Performance drops with caption rephrasing.

Abstract

Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending…
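To make the text-encoder finding concrete, here is a minimal sketch of one way an order bias could be measured: embed a multi-object caption, compare it with the embedding of each object on its own, and count how often the first-mentioned object is strictly the closest. This is an illustration under stated assumptions, not the paper's protocol. The names `first_position_win_rate` and `make_toy_embed` are hypothetical, and the toy embedder is a deliberately simple stand-in rather than CLIP. In practice, `embed` would wrap a real CLIP text encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def first_position_win_rate(
    embed: Callable[[str], Vector],
    object_lists: Sequence[Sequence[str]],
) -> float:
    """Fraction of captions whose embedding is strictly closest to the
    embedding of the first-mentioned object."""
    wins = 0
    for objects in object_lists:
        caption = " and ".join(objects)
        caption_vec = embed(caption)
        sims = [cosine(caption_vec, embed(obj)) for obj in objects]
        # Strict comparison, so ties do not count as a first-position win.
        if sims[0] > max(sims[1:]):
            wins += 1
    return wins / len(object_lists)

# Hypothetical toy vocabulary; not taken from the paper or its dataset.
VOCAB = ["cat", "dog", "car"]

def make_toy_embed(decay: float) -> Callable[[str], Vector]:
    """Toy stand-in for a text encoder (NOT CLIP). The i-th object word
    contributes weight decay**i, so decay < 1 mimics a first-position
    bias and decay == 1 is order-invariant."""
    def embed(text: str) -> Vector:
        vec = [0.0] * len(VOCAB)
        position = 0
        for word in text.split():
            if word in VOCAB:
                vec[VOCAB.index(word)] += decay ** position
                position += 1
        return vec
    return embed

captions = [["cat", "dog", "car"], ["dog", "car", "cat"], ["car", "cat", "dog"]]
biased_rate = first_position_win_rate(make_toy_embed(0.5), captions)
unbiased_rate = first_position_win_rate(make_toy_embed(1.0), captions)
print(biased_rate, unbiased_rate)  # 1.0 0.0
```

The rate is 1.0 for the position-weighted toy encoder and 0.0 for the order-invariant one, because every object ties in the latter case. Rotating each object through every position, as in `captions`, keeps object identity from being confounded with position.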

Code & Models

Repositories

clip-oscope/clip-oscope
PyTorch · Official

Datasets

clip-oscope/simco-comco
dataset · 10 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Diffusion · Contrastive Language-Image Pre-training