Investigating the Limitation of CLIP Models: The Worst-Performing Categories

Jie-Jing Shao; Jiang-Xin Shi; Xiao-Wen Yang; Lan-Zhe Guo; Yu-Feng Li

arXiv:2310.03324 · cs.CV · October 6, 2023 · 2 cites

TL;DR

This paper investigates the poor performance of CLIP models in certain categories, introduces a new metric to identify these categories, and proposes a prompt ensemble method that significantly improves worst-category accuracy without manual tuning.

Contribution

It proposes the Class-wise Matching Margin (CMM) to measure inference confusion and uses large language models to enrich category descriptions, enabling automatic prompt ensemble to boost worst-category accuracy.
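To make the idea of a prompt ensemble concrete, here is a minimal sketch of one common way to ensemble prompts for zero-shot classification: average the normalized text embeddings of several descriptions per category, then classify an image by cosine similarity. This is not confirmed to be the authors' exact procedure, and it does not implement CMM. The function names (`l2_normalize`, `ensemble_text_embedding`, `classify`) and the toy 2-D vectors are hypothetical stand-ins for real CLIP encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_text_embedding(prompt_embeddings):
    """Average the unit-normalized embeddings of several prompts for one class."""
    normalized = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    # Re-normalize so the dot product with a unit image vector is a cosine similarity.
    return l2_normalize(mean)


def classify(image_embedding, class_embeddings):
    """Return the class whose ensembled text embedding is most similar to the image."""
    img = l2_normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, emb))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get)


# Toy 2-D vectors standing in for text-encoder outputs (assumed values).
class_embeddings = {
    "heron": ensemble_text_embedding([[1.0, 0.2], [0.8, 0.4]]),
    "crane": ensemble_text_embedding([[0.1, 1.0], [0.3, 0.9]]),
}
prediction = classify([0.9, 0.3], class_embeddings)
```

In a real pipeline the per-prompt vectors would come from a CLIP text encoder applied to the enriched category descriptions, and the image vector from the matching image encoder.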

Findings

01. Worst categories can have class-wise accuracy as low as 0% despite high overall accuracy (64.1% on ImageNet).

02. The proposed CMM effectively identifies the worst-performing categories.

03. Prompt ensembles built from enriched descriptions raise worst-category accuracy from 0% to 5.2%.

Abstract

Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts, enabling zero-shot recognition on downstream tasks. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that their performance in the worst categories is significantly inferior to the overall performance. For example, on ImageNet, there are a total of 10 categories with class-wise accuracy as low as 0%, even though the overall performance has achieved 64.1%. This phenomenon reveals the potential risks associated with using CLIP models, particularly in risk-sensitive applications where specific categories hold significant importance. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise…
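The gap between overall and worst-category accuracy described in the abstract can be illustrated with a small sketch. It computes class-wise accuracy and the mean accuracy over the k worst classes from labels and predictions. The function names and the toy data are assumptions for illustration only and are not taken from the paper.

```python
from collections import defaultdict


def class_wise_accuracy(labels, preds):
    """Return a dict mapping each class to the fraction of its samples predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in total}


def overall_and_worst_k(labels, preds, k=1):
    """Return (overall accuracy, mean accuracy over the k worst classes)."""
    per_class = class_wise_accuracy(labels, preds)
    overall = sum(y == p for y, p in zip(labels, preds)) / len(labels)
    worst = sorted(per_class.values())[:k]
    return overall, sum(worst) / len(worst)


# Toy data (assumed): every "fox" image is misclassified as "dog".
labels = ["cat"] * 4 + ["dog"] * 4 + ["fox"] * 2
preds = ["cat"] * 4 + ["dog"] * 4 + ["dog"] * 2
overall, worst = overall_and_worst_k(labels, preds, k=1)
```

On this toy data the overall accuracy is 0.8 while the worst class sits at 0.0, which mirrors the pattern the abstract reports: a respectable aggregate number can hide categories that are never recognized.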

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Contrastive Language-Image Pre-training