CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation
Jeannie Chung, Hanna Jang, Ingyeong Yang, Uiwon Hwang, Jaehyeong Sim

TL;DR
This paper introduces CLIP-RD, a relational knowledge distillation framework that enhances lightweight CLIP models by explicitly modeling multi-directional relational dependencies, leading to improved structural alignment and performance.
Contribution
The paper proposes two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD), to better preserve the teacher's relational embedding structure during distillation.
Findings
CLIP-RD outperforms existing distillation methods by 0.8 percentage points (%p).
Joint modeling of relational structures improves student embedding fidelity.
The framework enhances zero-shot generalization of lightweight CLIP models.
Abstract
CLIP aligns image and text embeddings via contrastive learning and demonstrates strong zero-shot generalization. Its large-scale architecture requires substantial computational and memory resources, motivating the distillation of its capabilities into lightweight student models. However, existing CLIP distillation methods do not explicitly model multi-directional relational dependencies between teacher and student embeddings, limiting the student's ability to preserve the structural relationships encoded by the teacher. To address this, we propose a relational knowledge distillation framework that introduces two novel methods, Vertical Relational Distillation (VRD) and Cross Relational Distillation (XRD). VRD enforces consistency of teacher-student distillation strength across modalities at the distribution level, while XRD imposes bidirectional symmetry on cross-modal teacher-student…
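As a rough illustration of what distilling relational structure at the distribution level can look like, the sketch below matches a student's in-batch image-text similarity distributions to a teacher's with a KL divergence in both directions. It is a generic relational distillation loss and is not the paper's VRD or XRD formulation, whose details are not given in this summary. The function names, the temperature of 0.07, and the equal weighting of the two directions are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(logits):
    # Row-wise softmax with max subtraction for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def similarity_distribution(queries, keys, temperature=0.07):
    # Each row is a distribution over the batch of keys for one query.
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    return softmax(logits)

def kl_divergence(p, q, eps=1e-12):
    # Mean over rows of KL(p || q); eps guards against log(0).
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def relational_kd_loss(t_img, t_txt, s_img, s_txt, temperature=0.07):
    # Hypothetical generic loss (an assumption, not the paper's method):
    # align student similarity distributions with the teacher's,
    # image-to-text and text-to-image, weighted equally.
    p_i2t = similarity_distribution(t_img, t_txt, temperature)
    q_i2t = similarity_distribution(s_img, s_txt, temperature)
    p_t2i = similarity_distribution(t_txt, t_img, temperature)
    q_t2i = similarity_distribution(s_txt, s_img, temperature)
    return 0.5 * (kl_divergence(p_i2t, q_i2t) + kl_divergence(p_t2i, q_t2i))
```

Because the loss compares batch-by-batch similarity matrices instead of raw embeddings, the teacher and student may use different embedding widths, which is the usual situation when the student is a lightweight model.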
