Natural Statistics of Network Activations and Implications for Knowledge Distillation
Michael Rotman, Lior Wolf

TL;DR
This paper investigates the natural statistical properties of neural network activations, revealing power law behaviors, and introduces a spectral-based knowledge distillation method that achieves state-of-the-art results.
Contribution
It uncovers the power law nature of activation statistics and proposes a novel spectral loss for improved knowledge distillation.
Findings
Activation statistics follow a power law whose exponent increases linearly with depth.
Spectral loss improves knowledge distillation performance.
Achieves state-of-the-art results on multiple benchmarks.
Abstract
In a manner analogous to the study of natural image statistics, we study the natural statistics of deep neural network activations at various layers. As we show, these statistics, similar to image statistics, follow a power law. We also show, both analytically and empirically, that the exponent of this power law increases with depth at a linear rate. As a direct implication of our discoveries, we present a method for performing Knowledge Distillation (KD). While classical KD methods consider the logits of the teacher network, more recent methods obtain a leap in performance by considering the activation maps. These methods, however, use metrics that are suitable for comparing images. We propose to employ two additional loss terms that are based on the spectral properties of the intermediate activation maps. The proposed method obtains state-of-the-art results on multiple image…
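To make the power-law claim concrete, here is a minimal NumPy sketch of one common way to measure such a law: take the radially averaged Fourier power spectrum of a single 2D activation map and fit a line in log-log space. This is an illustration under assumptions, not the authors' procedure: the abstract does not say which statistic is measured or how the two spectral loss terms are defined. The function names (`radial_power_spectrum`, `fit_power_law_exponent`, `spectral_loss`) and the log-spectrum mean-squared penalty are hypothetical choices made for this example.

```python
import numpy as np


def radial_power_spectrum(act):
    """Radially averaged power spectrum of one 2D activation map (assumed statistic)."""
    h, w = act.shape
    # Shift the zero frequency to the centre so radius equals frequency magnitude.
    power = np.abs(np.fft.fftshift(np.fft.fft2(act))) ** 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)  # integer radius bins
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)


def fit_power_law_exponent(spectrum, k_min=1, k_max=None):
    """Estimate alpha in S(k) ~ k^(-alpha) by least squares in log-log space."""
    k_max = k_max or len(spectrum) // 2
    k = np.arange(k_min, k_max)
    slope, _intercept = np.polyfit(np.log(k), np.log(spectrum[k]), 1)
    return -slope


def spectral_loss(student, teacher, eps=1e-12):
    """Hypothetical spectral penalty: MSE between log radial spectra.

    Assumes both maps have the same spatial shape. This is NOT the loss
    defined in the paper, only an example of a spectrum-based comparison.
    """
    s = np.log(radial_power_spectrum(student) + eps)
    t = np.log(radial_power_spectrum(teacher) + eps)
    return float(np.mean((s - t) ** 2))
```

Applied to maps taken from successive layers of a trained network, `fit_power_law_exponent` would give one exponent per layer, and plotting those exponents against layer index is one way to check the reported linear trend. Integer radius binning biases the estimate slightly at low frequencies, so the fitted value is approximate.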
Taxonomy
Methods: Knowledge Distillation
