CLIP-KD: An Empirical Study of CLIP Model Distillation

Chuanguang Yang; Zhulin An; Libo Huang; Junyu Bi; Xinqiang Yu; Han Yang; Boyu Diao; Yongjun Xu

arXiv:2307.12732 · cs.CV · May 8, 2024 · 2 cites

TL;DR

This paper investigates various strategies for distilling smaller CLIP models from larger ones, demonstrating that simple feature mimicry and contrastive learning significantly enhance performance on image classification and retrieval tasks.

Contribution

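The paper proposes several distillation strategies for CLIP models (relation, feature, gradient, and contrastive paradigms) and highlights feature mimicry and interactive contrastive learning as effective ways to compress a large teacher CLIP into a smaller student.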

Findings

01. Feature mimicry with a Mean Squared Error (MSE) loss is highly effective.
02. Interactive contrastive learning across teacher and student encoders improves student model performance.
03. CLIP-KD consistently improves student CLIP models on zero-shot ImageNet classification.
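
The first finding can be illustrated with a minimal NumPy sketch of feature mimicry. This is not the authors' code: the function name `feature_mimicry_loss` is hypothetical, and the sketch assumes the student and teacher embeddings already share the same dimensionality (if they did not, some projection of the student features would be needed first, which is not shown here).

```python
import numpy as np

def feature_mimicry_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher embeddings.

    Both inputs are (batch, dim) arrays computed on the same batch.
    Assumption: the two embedding spaces have the same dimension.
    """
    student_feats = np.asarray(student_feats, dtype=np.float64)
    teacher_feats = np.asarray(teacher_feats, dtype=np.float64)
    if student_feats.shape != teacher_feats.shape:
        raise ValueError("student and teacher features must have the same shape")
    # Average the squared differences over every element of the batch.
    return float(np.mean((student_feats - teacher_feats) ** 2))

# Example: the same loss applied to image and text embeddings and summed.
rng = np.random.default_rng(0)
teacher_img, teacher_txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
student_img = teacher_img + 0.1 * rng.normal(size=(4, 8))
student_txt = teacher_txt + 0.1 * rng.normal(size=(4, 8))
total = (feature_mimicry_loss(student_img, teacher_img)
         + feature_mimicry_loss(student_txt, teacher_txt))
```

Minimizing this quantity pulls each student embedding toward the teacher embedding of the same input.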

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification…
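
One way to read "interactive contrastive learning across teacher and student encoders" is that student embeddings from one modality are contrasted against teacher embeddings from the other. The NumPy sketch below follows that reading under stated assumptions: the function names, the InfoNCE form of the loss, and the temperature of 0.07 are illustrative choices, not details confirmed by the abstract.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length (rows are assumed to be non-zero)."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(anchors, targets, temperature=0.07):
    """Cross-entropy in which anchor i should match target i within the batch."""
    logits = l2_normalize(anchors) @ l2_normalize(targets).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each correct pairing.
    return float(-np.mean(np.diag(log_probs)))

def interactive_contrastive_loss(s_img, s_txt, t_img, t_txt, temperature=0.07):
    """Assumed form: student image vs. teacher text plus student text vs. teacher image."""
    return (info_nce(s_img, t_txt, temperature)
            + info_nce(s_txt, t_img, temperature))

# Toy batch where pair i is perfectly aligned across all four encoders.
eye = np.eye(3)
aligned = interactive_contrastive_loss(eye, eye, eye, eye)
# Rolling the teacher rows breaks the pairing, so the loss should increase.
shuffled = interactive_contrastive_loss(eye, eye, np.roll(eye, 1, axis=0),
                                        np.roll(eye, 1, axis=0))
```

In the toy batch, the aligned case yields a loss near zero and the shuffled case a much larger one, which matches the abstract's explanation that distillation succeeds by maximizing teacher-student feature similarity.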

Code & Models

Repositories

winycg/clip-kd (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning · Contrastive Language-Image Pre-training