Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students
Chenglin Yang, Lingxi Xie, Siyuan Qiao, Alan Yuille

TL;DR
This paper argues that, when a deep network is trained in generations, a more tolerant teacher trained with softer supervision produces more accurate students, even though the teacher itself is often less accurate.
Contribution
A simple method that makes the teacher network more tolerant by adding one extra loss term to its training objective, which improves what the student learns and its final accuracy.
Findings
Tolerant teachers produce better students in generational training.
Students outperform competing approaches even though their teachers are less accurate.
Method improves accuracy on CIFAR100 and ILSVRC2012.
Abstract
We focus on the problem of training a deep neural network in generations. The flowchart is that, in order to optimize the target network (student), another network (teacher) with the same architecture is first trained, and used to provide part of supervision signals in the next stage. While this strategy leads to a higher accuracy, many aspects (e.g., why teacher-student optimization helps) still need further explorations. This paper studies this problem from a perspective of controlling the strictness in training the teacher network. Existing approaches mostly used a hard distribution (e.g., one-hot vectors) in training, leading to a strict teacher which itself has a high accuracy, but we argue that the teacher needs to be more tolerant, although this often implies a lower accuracy. The implementation is very easy, with merely an extra loss term added to the teacher network,…
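The teacher-student step described in the abstract, where the teacher provides part of the supervision for the student, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the mixing weight `lam`, and the choice of cross-entropy against the teacher's softmax output are all assumptions. The extra loss term that makes the teacher tolerant is not shown, because the abstract is cut off before it is specified.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) between two distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def student_loss(student_logits, teacher_logits, label, lam=0.5):
    """Hypothetical student objective (an assumption, not the paper's formula).

    Mixes a hard-label term with a term that matches the teacher's output.
    lam = 0 reduces to ordinary supervised training with one-hot labels.
    """
    num_classes = len(student_logits)
    one_hot = [1.0 if k == label else 0.0 for k in range(num_classes)]
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    hard_term = cross_entropy(one_hot, student_probs)
    soft_term = cross_entropy(teacher_probs, student_probs)
    return (1.0 - lam) * hard_term + lam * soft_term


# Illustrative logits only: a strict teacher is nearly one-hot,
# a tolerant teacher leaves visible mass on a secondary class.
strict_teacher = [10.0, 0.0, 0.0]
tolerant_teacher = [3.0, 2.0, 0.0]
student = [1.0, 0.5, -0.5]

print(softmax(strict_teacher))
print(softmax(tolerant_teacher))
print(student_loss(student, tolerant_teacher, label=0))
```

With a strict teacher the soft term is almost identical to the hard-label term, so the teacher adds little beyond the ground truth. A tolerant teacher's output carries information about which other classes are plausible, which is the kind of softer signal the abstract argues for.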
Taxonomy
Topics: Online Learning and Analytics · Teaching and Learning Programming · Educational Leadership and Innovation
