Comparative Knowledge Distillation
Alex Wilf, Alex Tianyi Xu, Paul Pu Liang, Alexander Obolenskiy, Daniel Fried, Louis-Philippe Morency

TL;DR
This paper introduces Comparative Knowledge Distillation (CKD), a novel method that reduces reliance on teacher model inferences in knowledge distillation, improving efficiency and performance, especially when teacher calls are limited.
Contribution
The paper proposes CKD, a new approach inspired by educational comparison principles, which enhances student learning without frequent teacher inferences and extends to group comparisons.
Findings
CKD outperforms existing KD methods in limited teacher inference settings.
CKD effectively utilizes comparison principles to improve student model performance.
Extending CKD to groups of samples further enhances learning efficiency.
Abstract
In the era of large-scale pretrained models, Knowledge Distillation (KD) plays an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large-scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state-of-the-art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation…
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The proposed method is simple and efficient, and is shown to outperform the baselines in the few-shot scenarios by a good margin (Table 1/2). 2. The authors have done a nice analysis experiment to investigate the representations learnt by the student, showing that the correlations between the student's outputs are very similar to those between the teacher's outputs, even though this was never explicitly part of the objective. (Table 4)
1. The authors have motivated the Few-Teacher-Inference (FTI) setting and have proposed a method for that setting. However, there is no explanation for why the proposed distillation objective is better suited for it compared to the other baselines. This is especially relevant because there have been works which have tried to distill the *relationship* between the samples; beyond RKD, which the paper has cited and compared, there are also [1, 2, 3] (I am not asking the authors to compare to these
1. This paper is well-organized and clearly written. 2. Comparisons with different methods are reported.
1. Some statements are over-claimed in this paper. For example, it claims "a novel learning paradigm ... mimic the teacher’s difference in representation between the same samples." Actually, it is just relation-based knowledge distillation, which distills the relations between samples from the teacher. It looks to me that this learning paradigm is not novel. 2. Experiments are conducted in special settings where the baselines are pretty low, e.g., CRD has the performance of 29.37 on CIFAR with
The empirical results reported by CKD are impressive in the few-teacher inference setting considering how straightforward the loss formulation is. I found that the manuscript is well-written and does a good job of explaining the CKD method. This paper provides further evidence that KD approaches that mimic teacher outputs are not an effective distillation method, and supports relation-based/contrastive KD efforts for distillation. I similarly find this idea of few-teacher-inference knowledge di
My biggest concern is that this work is eerily similar to the relational knowledge distillation work (Park et al., 2019) barring the high-dimensional KD loss in CKD. I can appreciate that having a higher-dimensional loss formulation might be effective in tightly capturing relationships between two samples as compared to a single number as is the case in [1], however, I would have expected there to be further analysis on why this single differentiating factor elicits such a big change in empirica
* Improving knowledge distillation in the setting of a constrained number of teacher calls is well-motivated. * The idea to obtain more information from a limited number of points through k-wise comparisons is interesting.
* The paper lacks clarity in key parts of the presentation. For example, the method itself is not clear: for $k=2$ in the stochastic optimization setting, are the pairwise comparisons done on a per-batch setting? * The method is not well-motivated or explained in context of prior works in Relational Knowledge Distillation. The loss here is KL divergence between $\mathrm{softmax}(\hat z_i - \hat z_j)$ and $\mathrm{softmax}(z_i - z_j)$, and the subtraction is motivated in Sec. 5.3. But, why does
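The pairwise loss quoted in the last review, a KL divergence between $\mathrm{softmax}(\hat z_i - \hat z_j)$ and $\mathrm{softmax}(z_i - z_j)$, can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the inputs are assumed to be per-sample output vectors (logits or representations), and the KL direction (teacher distribution as the target) is an assumption borrowed from common KD practice, since the review does not state it.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def comparative_loss(teacher_i, teacher_j, student_i, student_j):
    """Pairwise comparative loss for one pair of samples (i, j).

    Assumption: the teacher's difference distribution is the target, and
    the student is penalized for deviating from it.
    """
    # Element-wise difference between the two samples' outputs.
    t_diff = [a - b for a, b in zip(teacher_i, teacher_j)]
    s_diff = [a - b for a, b in zip(student_i, student_j)]
    return kl_divergence(softmax(t_diff), softmax(s_diff))
```

Because only differences enter the loss, a student whose outputs are offset from the teacher's by the same vector on both samples incurs zero loss under this sketch; the student is asked to match how the teacher's outputs change between samples, not the outputs themselves.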
Taxonomy
Topics: Machine Learning and Data Classification · Online Learning and Analytics · Neural Networks and Applications
Methods: Knowledge Distillation
