Wasserstein Contrastive Representation Distillation
Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, Lawrence Carin

TL;DR
WCoRD is a knowledge distillation method that uses the Wasserstein distance to improve feature transfer and generalization. It outperforms existing approaches on privileged information distillation, model compression, and cross-modal transfer.
Contribution
The paper proposes Wasserstein Contrastive Representation Distillation (WCoRD), a KD technique that uses the dual form of the Wasserstein distance for global knowledge transfer and the primal form for local feature transfer.
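For orientation, the two forms of the Wasserstein distance can be written as follows. These are the standard textbook definitions, not formulas quoted from the paper; the symbols p, q, c, π, and g are notation assumed here. The dual form shown is the Kantorovich-Rubinstein form, which holds when the cost c is a metric (the 1-Wasserstein case).

```latex
% Primal form: minimum expected transport cost over all couplings \pi
% whose marginals are p and q.
W(p, q) = \inf_{\pi \in \Pi(p, q)} \mathbb{E}_{(x, y) \sim \pi}\big[ c(x, y) \big]

% Dual (Kantorovich-Rubinstein) form: supremum over 1-Lipschitz critics g.
W(p, q) = \sup_{\|g\|_{L} \le 1} \mathbb{E}_{x \sim p}\big[ g(x) \big] - \mathbb{E}_{y \sim q}\big[ g(y) \big]
```

The primal form works with an explicit coupling between samples, while the dual form replaces the coupling with a constrained critic function.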
Findings
WCoRD outperforms state-of-the-art methods in privileged information distillation.
WCoRD achieves better model compression and cross-modal transfer results.
The method effectively captures structural knowledge and improves feature generalization.
Abstract
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former. Existing work, e.g., using Kullback-Leibler divergence for distillation, may fail to capture important structural knowledge in the teacher network and often lacks the ability for feature generalization, particularly in situations when teacher and student are built to address different classification tasks. We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for KD. The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks. The primal form is used for local contrastive…
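The contrastive idea in the abstract can be illustrated with a generic InfoNCE loss between paired teacher and student features. This is a minimal sketch, not the WCoRD objective itself: the cosine-similarity critic, the temperature value, and the function names (`info_nce_loss`, `mi_lower_bound`) are all assumptions made for illustration. It uses the well-known bound that mutual information is at least log N minus the InfoNCE loss for a batch of N pairs.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(teacher_feats, student_feats, temperature=0.1):
    """Average InfoNCE loss over a batch of paired features.

    Row i of teacher_feats and row i of student_feats form the positive
    pair; every other teacher row in the batch serves as a negative.
    The critic is a temperature-scaled cosine similarity (an assumption).
    """
    t = [l2_normalize(v) for v in teacher_feats]
    s = [l2_normalize(v) for v in student_feats]
    n = len(t)
    total = 0.0
    for i in range(n):
        logits = [dot(s[i], t[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / n


def mi_lower_bound(loss, batch_size):
    """InfoNCE bound on mutual information: I >= log(N) - loss."""
    return math.log(batch_size) - loss


# Toy data: a student that closely tracks the teacher vs. an unrelated one.
random.seed(0)
teacher = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
aligned = [[x + 0.05 * random.gauss(0, 1) for x in row] for row in teacher]
unrelated = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]

loss_aligned = info_nce_loss(teacher, aligned)
loss_unrelated = info_nce_loss(teacher, unrelated)
print("aligned loss:", loss_aligned, "bound:", mi_lower_bound(loss_aligned, 16))
print("unrelated loss:", loss_unrelated, "bound:", mi_lower_bound(loss_unrelated, 16))
```

A student whose features track the teacher's yields a lower loss and therefore a larger lower bound. In a Wasserstein dual formulation the critic would additionally be constrained to be 1-Lipschitz, which this sketch does not enforce.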
Taxonomy
Methods: Contrastive Learning · Knowledge Distillation
