Graph-based Knowledge Distillation by Multi-head Attention Network

Seunghyun Lee; Byung Cheol Song

arXiv:1907.02226·cs.LG·July 10, 2019·39 cites

Open Access · 2 Repos

TL;DR

This paper introduces a graph-based knowledge distillation method that uses multi-head attention to transfer dataset-level knowledge from a teacher network to a student network. The reported gain on CIFAR100 is 7.05% in accuracy over the student trained alone.

Contribution

It proposes a new KD approach that distills dataset-based knowledge via multi-head attention, capturing intra-data relations for better student performance.

Findings

01 Achieved 7.05% higher accuracy on CIFAR100 than the student network trained alone.

02 Outperformed state-of-the-art methods by 2.46%.

03 Demonstrated the effectiveness of dataset-level knowledge distillation.

Abstract

Knowledge distillation (KD) is a technique for deriving optimal performance from a small student network (SN) by distilling the knowledge of a large teacher network (TN) and transferring it to the SN. Since the role of a convolutional neural network (CNN) in KD is to embed a dataset so as to perform a given task well, it is very important to acquire knowledge that considers intra-data relations. Conventional KD methods have concentrated on distilling knowledge in data units. To our knowledge, no KD method for distilling information in dataset units has yet been proposed. Therefore, this paper proposes a novel method that enables distillation of dataset-based knowledge from the TN using an attention network. The knowledge of the embedding procedure of the TN is distilled into a graph by multi-head attention (MHA), and multi-task learning is performed to give relational…
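The abstract is truncated here, so the exact formulation is not available on this page. As a rough illustration of the idea it describes (using multi-head attention to turn the relations among the samples of a batch into a graph, then matching the student's graph to the teacher's), a minimal NumPy sketch might look like the following. Everything specific in it is an assumption made for illustration and not the authors' implementation: the function names `mha_relation_graph` and `graph_distillation_loss`, the choice of a "front" and a "back" feature map as inputs, the random projection weights, and the KL-divergence matching loss.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mha_relation_graph(front, back, w_q, w_k):
    """Build one attention graph per head over the samples of a batch.

    front: (N, d_f) features of N samples at an earlier layer (assumed input)
    back:  (N, d_b) features of the same samples at a later layer (assumed input)
    w_q:   (H, d_b, d_k) per-head query projections
    w_k:   (H, d_f, d_k) per-head key projections
    Returns an (H, N, N) array; row i of head h is a distribution over the
    N samples, i.e. the weighted edges leaving node i in that head's graph.
    """
    q = np.einsum("nd,hdk->hnk", back, w_q)   # (H, N, d_k)
    k = np.einsum("nd,hdk->hnk", front, w_k)  # (H, N, d_k)
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # scaled dot product
    return softmax(scores, axis=-1)


def graph_distillation_loss(teacher_graph, student_graph, eps=1e-12):
    """Mean row-wise KL(teacher || student); an assumed matching loss."""
    log_ratio = np.log(teacher_graph + eps) - np.log(student_graph + eps)
    return (teacher_graph * log_ratio).sum(axis=-1).mean()


# Hypothetical sizes: 8 samples, 4 heads, 16-d and 32-d features.
rng = np.random.default_rng(0)
n, heads, d_f, d_b, d_k = 8, 4, 16, 32, 8
w_q = rng.normal(size=(heads, d_b, d_k))
w_k = rng.normal(size=(heads, d_f, d_k))

teacher_graph = mha_relation_graph(
    rng.normal(size=(n, d_f)), rng.normal(size=(n, d_b)), w_q, w_k
)
student_graph = mha_relation_graph(
    rng.normal(size=(n, d_f)), rng.normal(size=(n, d_b)), w_q, w_k
)
loss = graph_distillation_loss(teacher_graph, student_graph)
```

In this sketch the graph is computed across the batch dimension, so each entry relates one sample to another; that is one way to read "dataset-based" knowledge as opposed to per-sample knowledge. In practice the student and teacher features would usually have different widths and would need their own projections.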

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention