Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Berfin Şimşek; Amire Bendjeddou; Wulfram Gerstner; Johanni Brea

arXiv:2311.01644 · cs.LG · January 17, 2024 · 2 cites

TL;DR

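This paper studies how an under-parameterized student network with $n$ neurons should approximate a teacher network with $k > n$ neurons. It shows that "copy-average" configurations, in which each student neuron either copies one teacher neuron or averages a group of them, are critical points under certain conditions, and that the best such configurations copy most teacher neurons and average the rest. Optimal configurations are characterized for shallow networks with erf and ReLU activations.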

Contribution

The paper provides theoretical analysis of copy versus average strategies for under-parameterized neural networks and characterizes optimal configurations for shallow networks with erf and ReLU activations.

Findings

01. Copy-average configurations are critical points under certain conditions.

02. Optimal configurations involve copying most teacher neurons and averaging the rest.

03. Gradient flow converges to these critical points in empirical experiments.

Abstract

Any continuous function $f^{*}$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^{*}$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^{*}$ with a neural network with $n < k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with $k$ neurons. As the student has fewer neurons than the teacher, it is unclear whether each of the $n$ student neurons should copy one of the teacher neurons or rather average a group of teacher neurons. For shallow neural networks with erf activation function and for the standard Gaussian input distribution, we prove that "copy-average" configurations are critical points if the teacher's incoming vectors are orthonormal and its outgoing weights are unitary. Moreover, the optimum among such…
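To make the copy-versus-average question concrete, the following is a minimal sketch in plain Python of a teacher with orthonormal incoming vectors and a student built from a partition of the teacher neurons. The function names (`make_teacher`, `copy_average_student`, `forward`, `mc_loss`) are hypothetical, and the way an averaging neuron is parameterized here (mean of the group's incoming vectors, sum of its outgoing weights) is an illustrative assumption, not the values derived in the paper.

```python
import math
import random


def make_teacher(k):
    """Teacher with k neurons in R^k: standard-basis incoming vectors
    (orthonormal) and outgoing weights all equal to 1."""
    W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    a = [1.0] * k
    return W, a


def copy_average_student(W, a, groups):
    """Build one student neuron per group of teacher indices.

    A singleton group copies its teacher neuron exactly. For a larger group,
    the incoming vector is the mean of the group's teacher vectors and the
    outgoing weight is the sum of the group's outgoing weights. This
    parameterization is an illustrative assumption, not the paper's result.
    """
    dim = len(W[0])
    V, b = [], []
    for g in groups:
        V.append([sum(W[i][j] for i in g) / len(g) for j in range(dim)])
        b.append(sum(a[i] for i in g))
    return V, b


def forward(W, a, x):
    """One-hidden-layer network with erf activation."""
    return sum(
        a_i * math.erf(sum(w_j * x_j for w_j, x_j in zip(w, x)))
        for w, a_i in zip(W, a)
    )


def mc_loss(teacher, student, n_samples=2000, seed=0):
    """Monte Carlo estimate of the squared error under standard Gaussian inputs."""
    rng = random.Random(seed)
    dim = len(teacher[0][0])
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += (forward(*teacher, x) - forward(*student, x)) ** 2
    return total / n_samples
```

For example, with `teacher = make_teacher(4)`, the call `copy_average_student(*teacher, [[0], [1], [2, 3]])` gives a student with $n = 3$ neurons: two copy teacher neurons 0 and 1, and the third averages teacher neurons 2 and 3. Passing the result to `mc_loss(teacher, student)` estimates how much is lost relative to the full teacher; other partitions can be compared the same way.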

Code & Models

Repositories

berfinsimsek/neural-net-regression
Official

Videos

Should Under-parameterized Student Networks Copy or Average Teacher Weights? · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Advanced Memory and Neural Computing · Stochastic Gradient Optimization Techniques