Should Under-parameterized Student Networks Copy or Average Teacher Weights?
Berfin Şimşek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea

TL;DR
This paper investigates how an under-parameterized student network should best approximate a larger teacher network. It shows that configurations in which student neurons copy or average teacher neurons are critical points, identifies the best such configurations, and reports similar structure across activation functions.
Contribution
The paper provides theoretical analysis of copy versus average strategies for under-parameterized neural networks and characterizes optimal configurations for shallow networks with erf and ReLU activations.
Findings
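Copy-average configurations are critical points for erf activation and standard Gaussian inputs when the teacher's incoming vectors are orthonormal and its outgoing weights are unitary.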
Optimal configurations involve copying most teacher neurons and averaging the rest.
Gradient flow converges to these critical points in empirical experiments.
Abstract
Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n < k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such…
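The copy-or-average setup described in the abstract can be made concrete with a small numerical sketch. It is an illustration under stated assumptions, not the authors' code. It assumes the activation is scaled as g(z) = erf(z / sqrt(2)), for which the Gaussian expectation E[g(u·x) g(v·x)] has a well-known arcsine closed form. The helper names (`erf_kernel`, `population_loss`, `copy_average_student`) are hypothetical, and the norm and outgoing weight of the averaging neuron are placeholder values, not the optimal ones derived in the paper.

```python
import numpy as np

def erf_kernel(U, V):
    """E[g(u.x) g(v.x)] for x ~ N(0, I), assuming g(z) = erf(z / sqrt(2)).

    Rows of U and V are incoming weight vectors; uses the arcsine closed form.
    """
    gram = U @ V.T
    nu = 1.0 + np.sum(U**2, axis=1)
    nv = 1.0 + np.sum(V**2, axis=1)
    return (2.0 / np.pi) * np.arcsin(gram / np.sqrt(np.outer(nu, nv)))

def population_loss(W_s, a_s, W_t, a_t):
    """Mean squared error E[(student(x) - teacher(x))^2] under Gaussian inputs."""
    return (a_s @ erf_kernel(W_s, W_s) @ a_s
            - 2.0 * a_s @ erf_kernel(W_s, W_t) @ a_t
            + a_t @ erf_kernel(W_t, W_t) @ a_t)

def copy_average_student(W_t, a_t, n, avg_norm=1.0, avg_out=None):
    """Build a student with n neurons from a teacher with k >= n neurons.

    The first n - 1 student neurons copy teacher neurons 0..n-2. The last
    student neuron points along the mean direction of the remaining teacher
    vectors. avg_norm and avg_out are illustrative placeholders (assumption).
    """
    group = W_t[n - 1:]
    direction = group.sum(axis=0)
    direction = direction / np.linalg.norm(direction)
    if avg_out is None:
        avg_out = a_t[n - 1:].sum()  # placeholder choice, not the optimum
    W_s = np.vstack([W_t[:n - 1], avg_norm * direction])
    a_s = np.append(a_t[:n - 1], avg_out)
    return W_s, a_s

# Teacher with orthonormal incoming vectors and unit outgoing weights.
k, n = 5, 3
W_t, a_t = np.eye(k), np.ones(k)
W_s, a_s = copy_average_student(W_t, a_t, n)
print("copy-average loss:", population_loss(W_s, a_s, W_t, a_t))
```

Treating `avg_norm` and `avg_out` as free parameters keeps the sketch honest: scanning them (or running gradient descent on `population_loss`) is one way to explore how close a given copy-average configuration is to a critical point.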
Taxonomy
Topics: Neural Networks and Applications · Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques
