Moonshine: Distilling with Cheap Convolutions
Elliot J. Crowley, Gavin Gray, Amos Storkey

TL;DR
This paper introduces a method for reducing neural network memory usage through structural distillation, creating a student model that is a simple transformation of the teacher, with minimal accuracy loss.
Contribution
It presents a structural distillation approach for building memory-efficient models: the student architecture is derived from the teacher by a simple transformation, so no architecture redesign is needed and the same hyperparameters can be reused.
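This summary does not spell out the transformation itself, but the title points to replacing standard convolutions with cheaper ones. As a rough sketch of why such a swap saves memory, the snippet below counts weights for a standard convolution and for one possible substitute, a grouped convolution followed by a 1x1 pointwise convolution. The choice of substitute, the layer sizes, and the helper names `conv_params` and `cheap_block_params` are assumptions made for illustration, not details taken from the paper.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution (bias terms ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel * kernel


def cheap_block_params(c_in, c_out, kernel, groups):
    """Grouped k x k convolution followed by a 1 x 1 pointwise convolution.

    Assumed substitute block, used here only to illustrate the arithmetic.
    """
    grouped = conv_params(c_in, c_in, kernel, groups=groups)
    pointwise = conv_params(c_in, c_out, 1)
    return grouped + pointwise


teacher = conv_params(64, 64, 3)                   # standard 3x3 convolution
student = cheap_block_params(64, 64, 3, groups=8)  # hypothetical substitute
ratio = teacher / student                          # weight reduction factor
```

Because the block keeps the same input and output channel counts, it can be dropped in wherever the original convolution sat, which is consistent with the claim that the rest of the architecture and its hyperparameters stay unchanged.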
Findings
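Substantial memory savings are possible with very little loss of accuracy.
A distilled student outperforms the same student architecture trained directly on data.
Pareto curves and tables quantify the memory versus accuracy trade-off for residual networks on four benchmark datasets.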
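To make the Pareto idea concrete, here is a minimal sketch of how a set of candidate students could be filtered down to the Pareto-optimal ones, meaning those for which no other candidate is at least as good on both memory and error and strictly better on one. The function name `pareto_front` and the numbers are placeholders invented for illustration; they are not results from the paper.

```python
def pareto_front(models):
    """Return the models that no other model dominates.

    Each model is a (name, memory, error) tuple, lower being better for both.
    A model is dominated when another model is no worse on memory and error
    and strictly better on at least one of them.
    """
    front = []
    for name, mem, err in models:
        dominated = any(
            (m <= mem and e <= err) and (m < mem or e < err)
            for _, m, e in models
        )
        if not dominated:
            front.append((name, mem, err))
    return sorted(front, key=lambda item: item[1])  # order by memory


# Placeholder values for illustration only; not taken from the paper.
candidates = [
    ("A", 10.0, 5.0),
    ("B", 4.0, 5.5),
    ("C", 5.0, 6.5),  # dominated by B: more memory and higher error
    ("D", 1.0, 8.0),
]
front = pareto_front(candidates)
```

Plotting the surviving points with memory on one axis and error on the other gives the kind of curve the findings refer to.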
Abstract
Many engineers wish to deploy modern neural networks in memory-limited settings; but the development of flexible methods for reducing memory use is in its infancy, and there is little knowledge of the resulting cost-benefit. We propose structural model distillation for memory reduction using a strategy that produces a student architecture that is a simple transformation of the teacher architecture: no redesign is needed, and the same hyperparameters can be used. Using attention transfer, we provide Pareto curves/tables for distillation of residual networks with four benchmark datasets, indicating the memory versus accuracy payoff. We show that substantial memory savings are possible with very little loss of accuracy, and confirm that distillation provides student network performance that is better than training that student architecture directly on data.
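The abstract names attention transfer as the distillation signal without defining it. A minimal NumPy sketch of the commonly used activation-based formulation follows: each layer's activations are collapsed into a normalised spatial attention map, and the student is penalised for the distance between its maps and the teacher's. The function names, the tensor shapes, and the choice of which layers to pair are assumptions for illustration; this summary does not state the paper's exact settings.

```python
import numpy as np


def attention_map(activation, eps=1e-12):
    """Collapse a (C, H, W) activation into a unit-norm spatial map.

    Squared activations are averaged over channels, flattened, and
    L2-normalised, so layers with different channel counts can be compared.
    """
    spatial = np.mean(activation ** 2, axis=0).ravel()
    return spatial / (np.linalg.norm(spatial) + eps)


def attention_transfer_loss(student_acts, teacher_acts):
    """Sum of squared distances between paired student/teacher attention maps."""
    return sum(
        float(np.sum((attention_map(s) - attention_map(t)) ** 2))
        for s, t in zip(student_acts, teacher_acts)
    )


# Hypothetical activations from one paired layer (random, for shape only).
rng = np.random.default_rng(0)
teacher_acts = [rng.normal(size=(16, 8, 8))]
student_acts = [rng.normal(size=(4, 8, 8))]
loss = attention_transfer_loss(student_acts, teacher_acts)
```

In training, a term like this is typically added to the ordinary classification loss with a weighting coefficient; the weighting is left out here because the summary does not give one.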
Taxonomy
Topics: Industrial Vision Systems and Defect Detection
