TL;DR
This paper introduces DARK, a contrastive knowledge distillation method that improves extreme model compression for vision-language tasks by encouraging structured decorrelation and repulsion of non-target similarities.
Contribution
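DARK decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment, while the off-diagonal weight is annealed from positive to negative, moving the student from imitation to repulsion. This enables effective compression of large models into smaller, more efficient ones.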
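The diagonal/off-diagonal split with an annealed off-diagonal weight can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not give the exact loss, so the squared-error form, the linear schedule from +1 to -1, and the names `offdiag_weight` and `decomposed_kd_loss` are all hypothetical.

```python
def offdiag_weight(step, total_steps, start=1.0, end=-1.0):
    """Linearly anneal the off-diagonal weight from `start` to `end`.

    Assumption: the linear shape and the +1 / -1 endpoints are illustrative;
    the source only says the weight moves from positive to negative.
    """
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def decomposed_kd_loss(student_sim, teacher_sim, weight):
    """Split a distillation loss over a B x B similarity matrix into two parts.

    Entry [i][j] is the similarity between image i and text j, so the
    diagonal holds matched pairs and the rest holds non-target pairs.
    Assumption: squared error stands in for the unspecified per-entry loss.
    Returns (total, diagonal_term, off_diagonal_term).
    """
    n = len(student_sim)
    diag, off = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            err = (student_sim[i][j] - teacher_sim[i][j]) ** 2
            if i == j:
                diag += err  # matched image-text pair
            else:
                off += err   # non-target similarity
    diag /= n
    off /= max(n * (n - 1), 1)
    # The diagonal term always pulls toward the teacher. The off-diagonal
    # term imitates the teacher when weight > 0 and is pushed away from it
    # when weight < 0.
    return diag + weight * off, diag, off


if __name__ == "__main__":
    teacher = [[1.0, 0.5], [0.5, 1.0]]
    student = [[0.9, 0.2], [0.4, 0.8]]
    for step in (0, 50, 100):
        w = offdiag_weight(step, total_steps=100)
        total, d, o = decomposed_kd_loss(student, teacher, w)
        print(f"step={step} weight={w:+.1f} diag={d:.3f} off={o:.3f} total={total:+.3f}")
```

Early in the schedule both terms reward matching the teacher; once the weight turns negative, only the diagonal term still does, and the off-diagonal term rewards departing from the teacher's non-target similarities. A negatively weighted squared error is unbounded below, so a practical version would likely need a bounded or clipped off-diagonal term; that detail is not covered in this summary.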
Findings
The student matches or exceeds the teacher on zero-shot benchmarks.
DARK induces structured decorrelation, reducing inter-class confusion.
DARK compresses a 427M-parameter model into a 75M-parameter model.
Abstract
Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose Diagonal-Anchored Repulsive Knowledge Distillation (DARK), a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from…
