Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think
Christian H.X. Ali Mehmeti-Göpel, Jan Disselhoff

TL;DR
This paper empirically investigates how far deep networks can be simplified by linearizing nonlinear units during training. It finds that much of a network's expressivity goes unused after training but helps gradient descent early on, and that the remaining nonlinear units form structured core-networks.
Contribution
It introduces a method to linearize network units during training, analyzes the impact on performance, and proposes a measure called average path length to characterize network depth after linearization.
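As a rough illustration of what linearizing a unit can mean, the sketch below uses a PReLU-style activation whose negative slope `alpha` can move toward 1, at which point the unit becomes the identity. This is an assumption made for illustration and not necessarily the paper's exact parameterization; the names `prelu`, `sparsity_penalty`, `count_nonlinear`, `linearize`, and the tolerance `tol` are all hypothetical.

```python
from __future__ import annotations

# Hypothetical sketch, not the authors' implementation.
# A PReLU-style unit: alpha == 0.0 behaves like ReLU,
# alpha == 1.0 is the identity, i.e. a fully linear unit.


def prelu(x: float, alpha: float) -> float:
    """Return x for non-negative inputs and alpha * x otherwise."""
    return x if x >= 0.0 else alpha * x


def sparsity_penalty(alphas: list[float], weight: float = 1.0) -> float:
    """L1-style penalty pulling every slope toward 1 (an assumed prior)."""
    return weight * sum(abs(1.0 - a) for a in alphas)


def count_nonlinear(alphas: list[float], tol: float = 1e-3) -> int:
    """Count units whose slope still differs from 1 by more than tol."""
    return sum(1 for a in alphas if abs(1.0 - a) > tol)


def linearize(alphas: list[float], tol: float = 1e-3) -> list[float]:
    """Snap nearly linear slopes to exactly 1.0, leaving the rest unchanged."""
    return [1.0 if abs(1.0 - a) <= tol else a for a in alphas]


# Four illustrative per-channel slopes (made-up values).
alphas = [0.0, 0.25, 0.9995, 1.0]
remaining = count_nonlinear(alphas)
snapped = linearize(alphas)
```

Adding `sparsity_penalty` to the training loss would push slopes toward 1, and `count_nonlinear` then reports how many units remain effectively nonlinear, which is the kind of quantity the summary describes being reduced.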
Findings
Linearizing early in training causes a significant performance drop.
Many nonlinear units can be linearized after training while maintaining high accuracy.
Remaining nonlinear units form structured core-networks depending on task difficulty.
Abstract
We perform an empirical study of the behaviour of deep networks when fully linearizing some of their feature channels through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine translation tasks, we investigate how much we can simplify the network function towards linearity before performance collapses. First, we observe a significant performance gap when reducing nonlinearity in the network function early on as opposed to late in training, in line with recent observations on the time-evolution of the data-dependent NTK. Second, we find that after training, we are able to linearize a significant number of nonlinear units while maintaining a high performance, indicating that much of a network's expressivity remains unused but helps gradient descent in early stages of training. To characterize the depth of the…
Taxonomy
Topics: Neural Networks and Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: Neural Tangent Kernel
