Experiments with Rich Regime Training for Deep Learning
Xinyan Li, Arindam Banerjee

TL;DR
This paper empirically investigates the rich regime in deep learning, revealing the importance of active parameters in the bottom layers and proposing efficient layer-wise sparse training methods that maintain performance while reducing computation.
Contribution
It introduces static and probabilistic layer-wise sparse SGD methods, demonstrating their effectiveness and efficiency in training deep neural networks in the rich regime.
Findings
Active parameters are concentrated in bottom layers.
Re-initializing active parameters worsens generalization.
Probabilistic LWS-SGD matches vanilla SGD performance.
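One way to make the first two findings concrete is to compare each layer's weights before and after training. The sketch below is illustrative only: this summary does not give the paper's criterion for an "active" parameter, so the absolute-change `threshold`, the function names, and the toy weights are all assumptions.

```python
def active_fraction(initial, final, threshold=0.1):
    """Fraction of parameters whose absolute change exceeds `threshold`.

    The threshold rule is an assumed stand-in for "active"; it is not
    taken from the paper.
    """
    if len(initial) != len(final):
        raise ValueError("initial and final must have the same length")
    if not initial:
        return 0.0
    moved = sum(1 for w0, w1 in zip(initial, final) if abs(w1 - w0) > threshold)
    return moved / len(initial)


def active_fraction_per_layer(init_layers, final_layers, threshold=0.1):
    """Apply active_fraction to every named layer."""
    return {
        name: active_fraction(init_layers[name], final_layers[name], threshold)
        for name in init_layers
    }


def reset_active(initial, final, threshold=0.1):
    """Return trained weights with the active ones reset to their initial values."""
    return [w0 if abs(w1 - w0) > threshold else w1 for w0, w1 in zip(initial, final)]


# Toy weights (hypothetical): the bottom layer moves a lot, the top layer barely moves.
init_layers = {"bottom": [0.0, 0.5, -0.5, 1.0], "top": [0.2, -0.2, 0.4, -0.4]}
final_layers = {"bottom": [0.9, 0.5, -1.5, 1.0], "top": [0.2, -0.2, 0.45, -0.4]}

fractions = active_fraction_per_layer(init_layers, final_layers)
bottom_after_reset = reset_active(init_layers["bottom"], final_layers["bottom"])
```

On these toy values half of the bottom-layer weights count as active and none of the top-layer weights do. Evaluating a model after `reset_active` is the kind of check that the re-initialization finding refers to.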
Abstract
In spite of advances in understanding lazy training, recent work attributes the practical success of deep learning to the rich regime with complex inductive bias. In this paper, we study rich regime training empirically with benchmark datasets, and find that while most parameters are lazy, there is always a small number of active parameters which change substantially during training. We show that re-initializing the active parameters (resetting them to their initial random values) leads to worse generalization. Further, we show that most of the active parameters are in the bottom layers, close to the input, especially as the networks become wider. Based on these observations, we study static Layer-Wise Sparse (LWS) SGD, which only updates some subsets of layers. We find that only updating the top and bottom layers yields good generalization and, as expected, only updating the top layers yields a…
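The static LWS-SGD idea described in the abstract, updating only a fixed subset of layers, can be sketched as a plain SGD step that skips frozen layers. This is a minimal sketch under stated assumptions, not the authors' implementation: the layer names, the dict-of-lists parameter layout, and the `lws_sgd_step` helper are hypothetical.

```python
def lws_sgd_step(params, grads, trainable_layers, lr=0.1):
    """One SGD step that updates only the layers named in `trainable_layers`.

    `params` and `grads` map a layer name to a flat list of floats
    (an assumed toy layout). Frozen layers are copied unchanged.
    """
    new_params = {}
    for name, weights in params.items():
        if name in trainable_layers:
            new_params[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            new_params[name] = list(weights)
    return new_params


# Toy three-layer model (hypothetical values): train the bottom and top layers only.
params = {"layer1": [1.0, 2.0], "layer2": [3.0, 4.0], "layer3": [5.0, 6.0]}
grads = {"layer1": [1.0, 1.0], "layer2": [1.0, 1.0], "layer3": [1.0, 1.0]}

updated = lws_sgd_step(params, grads, trainable_layers={"layer1", "layer3"}, lr=0.5)
```

Here `layer2` keeps its original weights while `layer1` and `layer3` move against their gradients. In a real framework the saving would come from not computing gradients for the frozen layers at all; this sketch only skips the update.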
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Stochastic Gradient Descent
