CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination
Kaicheng Yang, Tiancheng Gu, Xiang An, Haiqiang Jiang, Xiangzi Dai, Ziyong Feng, Weidong Cai, Jiankang Deng

TL;DR
CLIP-CID is an efficient distillation method that transfers knowledge from a large CLIP model to a smaller one. It combines image semantic balancing, which reduces transfer learning bias, with cluster-instance discrimination, and reports state-of-the-art results.
Contribution
The paper presents a novel distillation mechanism combining image semantic balancing and cluster-instance discrimination for vision-language models.
Findings
Filters out 43.7% of the image-text pairs in LAION400M while maintaining performance.
Achieves state-of-the-art results on downstream tasks.
Enhances semantic understanding in smaller models.
Abstract
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, the effectiveness of CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. Although knowledge distillation has been widely applied in single modality models, how to efficiently expand knowledge distillation to vision-language foundation models with extensive data remains relatively unexplored. In this paper, we introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model. We initially propose a simple but efficient image semantic balance method to reduce transfer learning bias and improve distillation efficiency. This method filters out 43.7% of image-text pairs from the LAION400M while maintaining superior…
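The abstract does not say how the image semantic balance step selects which pairs to drop. As a minimal sketch of one plausible reading, the Python below treats it as near-duplicate removal: images whose embeddings exceed a cosine-similarity threshold are merged into one group, and one representative per group is kept. The function name `semantic_balance_filter`, the `threshold` value, and the union-find grouping are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def semantic_balance_filter(embeddings, threshold=0.9):
    """Hypothetical filter: keep one sample per group of near-duplicate images.

    embeddings: (n, d) array of image embeddings (assumed to come from a
    pretrained image encoder). Returns the sorted indices of kept samples.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    n = len(x)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair whose similarity reaches the (assumed) threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Smaller index becomes the root, so it is the one kept.
                    parent[max(ri, rj)] = min(ri, rj)

    return sorted({find(i) for i in range(n)})

# Toy usage: two pairs of nearly identical directions.
toy = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
kept = semantic_balance_filter(toy, threshold=0.9)
```

The pairwise loop is quadratic in the number of samples, so at LAION400M scale an approximate nearest-neighbour index would be needed in place of the dense similarity matrix; the sketch only illustrates the grouping idea.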
Taxonomy
Topics: Advanced Control Systems Optimization · Chemical Synthesis and Reactions · Phytochemical Studies and Bioactivities
Methods: Knowledge Distillation · Contrastive Language-Image Pre-training
