Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence
Aditya Golatkar, Alessandro Achille, Stefano Soatto

TL;DR
This paper reveals that the timing of regularization in training deep neural networks critically influences their generalization, with a transient 'critical period' during which regularization has the most impact on final performance.
Contribution
It demonstrates that regularization effects are primarily determined by when during training they are applied, highlighting a transient 'critical period' that shapes generalization in deep networks.
Findings
Regularization applied only after the initial transient has little effect on generalization, and removing it after that transient changes little.
Interrupting regularization can sometimes improve generalization.
The timing of regularization is more important than its presence at convergence.
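The timing experiments behind these findings can be pictured as a switch on the weight-decay coefficient. The following is a minimal sketch, not the authors' code: the function names, the cutoff epoch `t0`, and the base coefficient `base_wd` are hypothetical values chosen for illustration, and weight decay is written as a plain L2 term inside an SGD step.

```python
def weight_decay_at(epoch, base_wd=5e-4, t0=50, mode="early"):
    """Return the weight-decay coefficient to use at `epoch`.

    base_wd and t0 are illustrative assumptions, not values from the paper.
    mode:
      "always" - regularize for the whole run
      "never"  - no regularization
      "early"  - regularize only during the initial transient (epoch < t0)
      "late"   - regularize only after the initial transient (epoch >= t0)
    """
    if mode == "always":
        return base_wd
    if mode == "never":
        return 0.0
    if mode == "early":
        return base_wd if epoch < t0 else 0.0
    if mode == "late":
        return base_wd if epoch >= t0 else 0.0
    raise ValueError(f"unknown mode: {mode!r}")


def sgd_step(weights, grads, lr, wd):
    """One plain SGD step with L2 weight decay: w <- w - lr * (g + wd * w)."""
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    # Show which epochs are regularized under each schedule.
    for mode in ("always", "never", "early", "late"):
        schedule = [weight_decay_at(e, mode=mode) for e in (0, 49, 50, 99)]
        print(mode, schedule)
```

Under the findings listed above, the "early" schedule would be expected to generalize about as well as "always", while "late" would behave much like "never"; the sketch only encodes the schedules and makes no claim about the resulting accuracy.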
Abstract
Regularization is typically understood as improving generalization by altering the landscape of local extrema to which the model eventually converges. Deep neural networks (DNNs), however, challenge this view: We show that removing regularization after an initial transient period has little effect on generalization, even if the final loss landscape is the same as if there had been no regularization. In some cases, generalization even improves after interrupting regularization. Conversely, if regularization is applied only after the initial transient, it has no effect on the final solution, whose generalization gap is as bad as if regularization never happened. This suggests that what matters for training deep networks is not just whether or how, but when to regularize. The phenomena we observe are manifest in different datasets (CIFAR-10, CIFAR-100), different architectures (ResNet-18,…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Gaussian Processes and Bayesian Inference
