The Early Phase of Neural Network Training
Jonathan Frankle, David J. Schwab, Ari S. Morcos

TL;DR
This paper investigates the early phase of deep neural network training. It reports rapid changes in weight distributions that do not inherently depend on labels, and examines how pre-training on blurred inputs or self-supervised tasks can approximate those changes.
Contribution
The paper provides a detailed quantitative analysis of how networks change during early training, showing that they are not robust to reinitialization with random weights even when signs are maintained, and measuring the effect of pre-training strategies.
Findings
Deep networks are not robust to reinitialization with random weights, even when the sign of each weight is maintained.
Weight distributions become highly non-independent within hundreds of iterations.
Pre-training with blurred inputs or self-supervised tasks approximates the changes that networks undergo during early training.
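The first finding refers to replacing a network's weights with fresh random values while keeping each weight's sign. The sketch below illustrates that operation on a single weight array with NumPy. The function name `reinit_preserving_signs`, the Gaussian sampling distribution, and the choice to match the standard deviation of the original weights are all assumptions made for illustration; this summary does not state the exact procedure used in the paper.

```python
import numpy as np


def reinit_preserving_signs(weights, rng, std=None):
    """Resample weight magnitudes at random while keeping each weight's sign.

    Hypothetical helper: the Gaussian distribution and the default std
    (the empirical std of `weights`) are illustrative assumptions.
    """
    weights = np.asarray(weights, dtype=float)
    if std is None:
        std = weights.std()
    # Draw fresh random values with the same shape as the original weights.
    fresh = rng.normal(loc=0.0, scale=std, size=weights.shape)
    # Keep only the new magnitudes and reattach the original signs.
    # Note: np.sign(0) is 0, so exactly-zero weights stay zero.
    return np.sign(weights) * np.abs(fresh)


rng = np.random.default_rng(0)
w_k = rng.normal(size=(4, 3))          # stand-in for weights at some early iteration
w_new = reinit_preserving_signs(w_k, rng)
```

Here `w_new` shares the sign pattern of `w_k` but none of its magnitudes. The finding is that a network reinitialized in this manner does not train as well as one that keeps its actual early-training weights, so the signs alone do not carry the relevant information.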
Abstract
Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here, we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning
