Recurrent knowledge distillation
Silvia L. Pintea, Yue Liu, Jan C. van Gemert

TL;DR
This paper introduces a recurrent knowledge distillation method that compresses deep networks by recasting multiple residual layers of the teacher into a single recurrent layer in the student, reducing the number of parameters at little loss in accuracy.
Contribution
The paper proposes three variants of adding recurrent connections to the student network, each reducing the number of parameters at little loss in accuracy.
Findings
Reduced parameter count on the CIFAR-10, Scenes, and MiniPlaces datasets
Little loss in accuracy despite the smaller student network
Recurrent student layers shown to be effective for knowledge distillation
Abstract
Knowledge distillation compacts deep networks by letting a small student network learn from a large teacher network. The accuracy of knowledge distillation recently benefited from adding residual layers. We propose to reduce the size of the student network even further by recasting multiple residual layers in the teacher network into a single recurrent student layer. We propose three variants of adding recurrent connections into the student network, and show experimentally on CIFAR-10, Scenes, and MiniPlaces that we can reduce the number of parameters at little loss in accuracy.
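The parameter saving described in the abstract can be illustrated with a minimal sketch. It assumes a toy fully connected residual update, x + relu(Wx), in place of the convolutional layers a real network would use. All names here (residual_step, teacher_forward, student_forward, count_params) are hypothetical, and the plain weight sharing shown is only one possible reading of "a single recurrent student layer"; it is not one of the paper's three variants and it omits the distillation training itself.

```python
import random


def make_weights(dim, rng):
    """Return a dim x dim weight matrix with small random entries."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def residual_step(x, w):
    """One residual update: x + relu(W x). Toy stand-in for a residual layer."""
    out = []
    for i, row in enumerate(w):
        pre = sum(w_ij * x_j for w_ij, x_j in zip(row, x))
        out.append(x[i] + max(0.0, pre))
    return out


def teacher_forward(x, blocks):
    """Teacher: a stack of residual layers, each with its own weights."""
    for w in blocks:
        x = residual_step(x, w)
    return x


def student_forward(x, shared_w, steps):
    """Student: one residual layer applied `steps` times with shared weights."""
    for _ in range(steps):
        x = residual_step(x, shared_w)
    return x


def count_params(matrices):
    """Total number of scalar weights in a list of matrices."""
    return sum(len(row) for w in matrices for row in w)


rng = random.Random(0)
dim, depth = 8, 4  # assumed toy sizes, not taken from the paper

teacher_blocks = [make_weights(dim, rng) for _ in range(depth)]
student_block = make_weights(dim, rng)

teacher_params = count_params(teacher_blocks)   # depth * dim * dim
student_params = count_params([student_block])  # dim * dim

x = [1.0] * dim
teacher_out = teacher_forward(x, teacher_blocks)
student_out = student_forward(x, student_block, steps=depth)

print(teacher_params, student_params)
```

In this toy setting the student performs the same number of update steps as the teacher but stores the weights of only one layer, so its parameter count is smaller by a factor equal to the depth. Whether accuracy is preserved depends on training the student against the teacher, which this sketch does not cover.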
