Model compression via distillation and quantization
Antonio Polino, Razvan Pascanu, Dan Alistarh

TL;DR
This paper introduces two novel methods for compressing deep neural networks by combining quantization and distillation, enabling efficient deployment on resource-limited devices without significant accuracy loss.
Contribution
The paper proposes two techniques, quantized distillation and differentiable quantization, that combine weight quantization with knowledge transfer from a larger teacher network into a smaller student network.
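As a rough illustration of what quantizing weights to a limited set of levels means, here is a minimal NumPy sketch of uniform quantization. The function name `uniform_quantize`, the min/max scaling, and the 4-bit setting are assumptions made for this example, not the exact scheme used in the paper.

```python
import numpy as np

def uniform_quantize(weights, bits):
    """Round each weight to the nearest of 2**bits evenly spaced levels in [min, max].

    Illustrative only: the scaling choice (global min/max) is an assumption.
    """
    levels = 2 ** bits
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (levels - 1)
    if scale == 0:
        # All weights are identical, so there is nothing to quantize.
        return weights.copy()
    # Map each weight to an integer level index, then back to a real value.
    indices = np.round((weights - w_min) / scale)
    return w_min + indices * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)            # stand-in for a layer's weights
w_q = uniform_quantize(w, bits=4)    # at most 16 distinct values remain
print("distinct values:", len(np.unique(w_q)))
print("max abs error:", np.max(np.abs(w - w_q)))
```

Storing only the level index per weight (4 bits here instead of 32) is where the memory saving comes from; the rounding error is bounded by half the spacing between levels.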
Findings
Quantized shallow students reach accuracy similar to full-precision teacher models
Order-of-magnitude compression, with inference speedup linear in the depth reduction
Effective deployment in resource-constrained environments such as mobile or embedded devices
Abstract
Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points…
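The distillation loss mentioned in the abstract can be sketched as follows. This is the commonly used formulation (a temperature-softened cross-entropy against the teacher's outputs, mixed with the ordinary cross-entropy on the true labels), written in NumPy. The names `distillation_loss`, `temperature`, and `alpha`, and their default values, are assumptions for illustration and may differ from the paper's exact setup. In quantized distillation, the student logits would come from a forward pass through the quantized student.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher) term and a hard (label) term.

    temperature and alpha are hypothetical hyperparameters for this sketch.
    """
    # Soft term: cross-entropy between the teacher's softened distribution
    # and the student's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student_soft = np.log(softmax(student_logits, temperature))
    soft = -(p_teacher * log_p_student_soft).sum(axis=-1).mean()
    # Hard term: standard cross-entropy against the ground-truth labels.
    log_p_student = np.log(softmax(student_logits))
    hard = -log_p_student[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1.0 - alpha) * hard

teacher = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
labels = np.array([0, 1])
agreeing_student = teacher.copy()
disagreeing_student = np.array([[0.0, 5.0, 0.0], [5.0, 0.0, 0.0]])

loss_agree = distillation_loss(agreeing_student, teacher, labels)
loss_disagree = distillation_loss(disagreeing_student, teacher, labels)
print(loss_agree, loss_disagree)
```

A student whose predictions agree with the teacher and the labels gets a lower loss than one that disagrees, which is the signal that drives the student toward the teacher's behavior during training.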
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Adversarial Robustness in Machine Learning
