TL;DR
This paper introduces VRM, a relation-based knowledge distillation method that uses virtual relation matching with affinity graphs to provide richer guidance and regularization, achieving state-of-the-art results across multiple datasets.
Contribution
The paper revives relation-based KD by proposing virtual relation matching with affinity graphs and dynamic pruning, significantly improving performance over existing methods.
Findings
VRM achieves 74.0% accuracy for ResNet50-to-MobileNetV2 on ImageNet.
VRM improves DeiT-T by 14.44% on CIFAR-100.
VRM outperforms previous relation-based KD methods across datasets.
Abstract
Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as their instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several key issues in relation-based methods, including their susceptibility to overfitting and spurious responses. Specifically, we transfer novelly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations by exploiting virtual views and relations as a new kind of knowledge. As a result, the student has access to richer guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity…
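The abstract's core idea, matching teacher and student affinity graphs built over real and virtual views and then pruning edges, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the cosine-similarity affinity, and above all the pruning rule (dropping the edges with the largest teacher-student discrepancy), since the abstract is cut off before it states the paper's actual criterion. Only inter-sample and inter-view relations are covered; inter-class relations are omitted.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one prediction vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def affinity_graph(real_logits, virtual_logits):
    # Nodes: B predictions on real views followed by B on virtual views.
    # Edges: cosine similarity between normalised predictions (assumed metric).
    preds = [l2_normalise(softmax(row)) for row in real_logits + virtual_logits]
    return [[sum(a * b for a, b in zip(p, q)) for q in preds] for p in preds]

def pruned_relation_loss(student_graph, teacher_graph, keep_ratio=0.8):
    # Squared discrepancy per edge, sorted ascending.
    diffs = sorted(
        (s - t) ** 2
        for s_row, t_row in zip(student_graph, teacher_graph)
        for s, t in zip(s_row, t_row)
    )
    # Hypothetical pruning rule: keep only the smallest-discrepancy edges.
    k = max(1, int(keep_ratio * len(diffs)))
    return sum(diffs[:k]) / k

# Toy batch: 2 samples, 3 classes, one virtual view per sample.
teacher_real = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
teacher_virtual = [[1.8, 0.7, -0.9], [0.0, 1.2, 0.6]]
student_real = [[1.0, 0.9, -0.2], [0.4, 0.8, 0.5]]
student_virtual = [[0.7, 1.0, 0.0], [0.3, 0.6, 0.9]]

t_graph = affinity_graph(teacher_real, teacher_virtual)
s_graph = affinity_graph(student_real, student_virtual)
loss = pruned_relation_loss(s_graph, t_graph, keep_ratio=0.8)
```

With one virtual view per sample, each graph has 2B nodes, so a batch of B samples yields a 2B x 2B affinity matrix. This is the sense in which virtual views enlarge the set of relations available to the student.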
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The section on pilot study is interesting and insightful. I hope the authors add more large-scale experiments (e.g., ImageNet experiments) to further validate the claims in this section. 2. The proposed ideas are technically sound, although they are simple and straightforward. 3. The further analysis section (Section 4.4) is interesting to read.
1. My biggest concern regarding this paper is whether it delivers a meaningful contribution to the community in 2024 given its current experimental setup. Specifically: - The proposed methods have only been validated on weak baselines, whose training configurations (described in Appendix A.3) are from several years ago. In recent years, lots of advancements have been made in developing better training methodologies for training high-performing image classification models, including improved optim
1. The integration of virtual views to enrich relational information represents a fresh perspective on enhancing relation-based KD. 2. The extensive experiments across various datasets and architectures, along with ablation studies, strongly support the validity of VRM’s design choices. 3. The edge pruning mechanism effectively mitigates the impact of spurious relations, allowing VRM to generalize well to test sets. 4. VRM performs well on heterogeneous teacher-student pairs, indicating versatili
1. Previous GNN-based KD methods, such as [1][2], have shown strong results on various tasks. Without comparisons in the related work or experimental sections, it is difficult to fully assess VRM's effectiveness. 2. Currently, each prediction is assigned only one virtual view. Further analysis of how the number of virtual views affects KD performance and computational cost would provide a more complete evaluation of VRM’s design. 3. Although VRM includes a strategy for pruning unreliable edges,
1. The authors propose two reasons to explain the unsatisfactory performance of previous relation distillation methods, which then leads to the development of the view-augmented version to address the two drawbacks. The motivation is solid and convincing. 2. The current method greatly advances the performance of relation-matching-based distillation methods and performs on par with or even surpasses some of the recent instance-matching methods. The potential of VRM on heterogeneous distillation
1. While this study has the potential to bring the relation-matching distillation back into attention, this method sacrifices the elegance of simplicity and instead introduces numerous new modules and hyperparameters, including the balancing loss weight of each module, the pruning criterion and the number of augmentation views. The additional hyperparameters would impair the simplicity and hinder its application in many other tasks. 2. Although the authors state that the current method is not a
This work proposes a new relation-based KD method, VRM. Its strengths are listed below: (1) It combines important techniques and provides a practical framework for relation-based KD. (2) Extensive experiments are conducted to verify the effectiveness of VRM. VRM shows the best performance under most configurations. (3) The paper is easy to follow and the idea is clearly stated.
The weaknesses of the work are listed below: (1) There are many techniques and details in the VRM framework. This makes VRM seem complex and hard to reproduce. For instance, it has many hyper-parameters to tune. (2) As the main contribution of the work is "Virtual Relation Matching", we expected virtual relations to be an important component of the final performance gain. However, according to the ablation study, other techniques including ZSNorm and L2Norm contribute significantly. It weakens the contr
