Graph-based Knowledge Distillation by Multi-head Attention Network
Seunghyun Lee, Byung Cheol Song

TL;DR
This paper introduces a graph-based knowledge distillation method that uses multi-head attention to transfer dataset-level knowledge from a teacher network to a student network, raising student accuracy on CIFAR100 by 7.05%.
Contribution
The paper proposes a KD approach that distills dataset-level knowledge via multi-head attention, capturing intra-data relations to improve student performance.
Findings
Achieved 7.05% higher accuracy on CIFAR100 than the student network alone.
Outperformed state-of-the-art methods by 2.46%.
Demonstrated effectiveness of dataset-level knowledge distillation.
Abstract
Knowledge distillation (KD) is a technique to derive optimal performance from a small student network (SN) by distilling the knowledge of a large teacher network (TN) and transferring the distilled knowledge to the SN. Since the role of a convolutional neural network (CNN) in KD is to embed a dataset so as to perform a given task well, it is very important to acquire knowledge that considers intra-data relations. Conventional KD methods have concentrated on distilling knowledge in data units. To our knowledge, no KD method for distilling information in dataset units has yet been proposed. Therefore, this paper proposes a novel method that enables distillation of dataset-based knowledge from the TN using an attention network. The knowledge of the embedding procedure of the TN is distilled into a graph by multi-head attention (MHA), and multi-task learning is performed to give relational…
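To make the idea of distilling an embedding procedure into a graph more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a batch of features is taken at two depths of a network (`front` and `back`), that each attention head projects them with its own matrices (`w_q`, `w_k`, both hypothetical names), and that the row-normalized attention map over the batch serves as a relation graph. The `graph_kl` comparison between teacher and student graphs is likewise an assumption made for illustration; the truncated abstract does not state the actual loss.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mha_relation_graph(front, back, w_q, w_k):
    """Build one (N, N) relation graph per attention head.

    front: (N, Df) batch of features from an earlier layer (assumed input)
    back:  (N, Db) batch of features from a later layer (assumed input)
    w_q:   (H, Db, d) per-head query projections (hypothetical parameters)
    w_k:   (H, Df, d) per-head key projections (hypothetical parameters)
    Returns an (H, N, N) array whose rows each sum to 1.
    """
    q = np.einsum("nb,hbd->hnd", back, w_q)   # queries from the later features
    k = np.einsum("nf,hfd->hnd", front, w_k)  # keys from the earlier features
    d = q.shape[-1]
    scores = np.einsum("hnd,hmd->hnm", q, k) / np.sqrt(d)
    return softmax(scores, axis=-1)           # edge weights between samples


def graph_kl(teacher_graph, student_graph, eps=1e-8):
    """Mean row-wise KL divergence between two sets of relation graphs.

    This choice of loss is an assumption for the sketch only.
    """
    t = teacher_graph + eps
    s = student_graph + eps
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))
```

Because each graph is built over the samples of a batch rather than over a single input, the student would be pushed to reproduce how the teacher relates samples to one another, which is the dataset-level view the abstract describes.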
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
