Investigating the Limitation of CLIP Models: The Worst-Performing Categories
Jie-Jing Shao, Jiang-Xin Shi, Xiao-Wen Yang, Lan-Zhe Guo, Yu-Feng Li

TL;DR
This paper investigates the poor performance of CLIP models in certain categories, introduces a new metric to identify these categories, and proposes a prompt ensemble method that significantly improves worst-category accuracy without manual tuning.
Contribution
It proposes the Class-wise Matching Margin (CMM) to measure inference confusion and uses large language models to enrich category descriptions, enabling automatic prompt ensemble to boost worst-category accuracy.
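To make the idea of a prompt ensemble concrete, here is a minimal sketch of the generic technique: embed several text prompts per class, average the normalized embeddings, and classify an image by cosine similarity. It is not the authors' implementation. The embeddings are hand-written toy vectors standing in for CLIP text and image features, and the helper names (`normalize`, `ensemble_embedding`, `classify`) are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ensemble_embedding(prompt_embeddings):
    """Average the unit-normalized prompt embeddings of one class,
    then re-normalize the mean so it can be used for cosine similarity."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def classify(image_embedding, class_embeddings):
    """Return the class whose ensembled text embedding has the highest
    cosine similarity with the image embedding, plus all scores."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, emb))
        for name, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-ins (assumption): two prompt embeddings per class, as if produced
# by a text encoder from two different descriptions of the same category.
prompt_embeddings = {
    "cat": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.8, 0.0]],
}
class_embeddings = {
    name: ensemble_embedding(embs) for name, embs in prompt_embeddings.items()
}

# Toy image embedding (assumption) that lies closer to the "cat" prompts.
predicted, scores = classify([0.8, 0.2, 0.1], class_embeddings)
print(predicted, scores)
```

In a real pipeline the toy vectors would be replaced by encoder outputs, and the per-class prompts would come from the enriched category descriptions mentioned above.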
Findings
Worst categories can have accuracy as low as 0% despite high overall performance.
The proposed CMM effectively identifies the worst-performing categories.
Ensemble prompts based on enriched descriptions improve worst-category accuracy to 5.2%.
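The gap between overall and worst-category accuracy can be illustrated with a small sketch. The labels and predictions below are made-up toy data, not results from the paper, and the function names (`classwise_accuracy`, `worst_k_accuracy`) are hypothetical.

```python
from collections import defaultdict

def classwise_accuracy(labels, predictions):
    """Return a dict mapping each class to its own accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in total}

def worst_k_accuracy(per_class, k=1):
    """Mean accuracy over the k classes with the lowest accuracy."""
    worst = sorted(per_class.values())[:k]
    return sum(worst) / len(worst)

# Toy data (assumption): classes 0 and 1 are always right, class 2 never is.
labels      = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
predictions = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

per_class = classwise_accuracy(labels, predictions)
overall = sum(y == p for y, p in zip(labels, predictions)) / len(labels)

print("overall:", overall)
print("per class:", per_class)
print("worst-1:", worst_k_accuracy(per_class, k=1))
```

With this toy data the overall accuracy is high while one class is never predicted correctly, which is the kind of disparity an overall metric hides.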
Abstract
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts, enabling zero-shot recognition on downstream tasks. It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts. However, we found that their performance in the worst categories is significantly inferior to the overall performance. For example, on ImageNet, there are a total of 10 categories with class-wise accuracy as low as 0%, even though the overall performance has achieved 64.1%. This phenomenon reveals the potential risks associated with using CLIP models, particularly in risk-sensitive applications where specific categories hold significant importance. To address this issue, we investigate the alignment between the two modalities in the CLIP model and propose the Class-wise…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Contrastive Language-Image Pre-training
