Which Minimizer Does My Neural Network Converge To?
Manuel Nonnenmacher, David Reeb, Ingo Steinwart

TL;DR
This paper investigates how different training procedures and hyperparameters influence the specific minima a neural network converges to, highlighting the effects of initialization size, adaptive optimizers, and overparameterization.
Contribution
It makes explicit how initialization size, adaptive optimization, and overparameterization change the minimizer a neural network converges to, and proposes a strategy to limit the negative effect of initialization size on test performance.
Findings
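In a strongly overparameterized NN, the size of the initialization affects which minimizer is reached and can deteriorate final test performance.
Adaptive optimizers such as AdaGrad generally reach a different minimizer than gradient descent (GD). Stochastic mini-batch training changes the adaptive minimizer further, whereas GD and stochastic GD result in essentially the same minimizer.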
Overparameterization introduces unique sources of error.
Abstract
The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work…
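The first two claims of the abstract can be illustrated on a toy problem. The sketch below is not the paper's setup: it replaces the neural network with an underdetermined linear least-squares model, and the function names `gd` and `adagrad`, the problem sizes, and all hyperparameters are assumptions chosen for illustration. In this linear setting, GD started at `w0` converges to `w0 + pinv(X) @ (y - X @ w0)`, so the part of the initialization that the data cannot see is never removed and grows with the initialization scale. AdaGrad, started from zero, generally ends at a different zero-loss solution than GD.

```python
import numpy as np

def gd(X, y, w0, steps=2000):
    """Plain gradient descent on 0.5 * ||X w - y||^2."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of X^T X
    w = w0.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

def adagrad(X, y, w0, lr=0.1, steps=5000, eps=1e-8):
    """Diagonal AdaGrad on the same loss (hyperparameters are illustrative)."""
    w = w0.copy()
    g2 = np.zeros_like(w)  # running sum of squared gradients
    for _ in range(steps):
        g = X.T @ (X @ w - y)
        g2 += g * g
        w -= lr * g / (np.sqrt(g2) + eps)
    return w

rng = np.random.default_rng(0)
n, d = 5, 50  # 5 samples, 50 parameters: many zero-loss solutions exist
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution
direction = rng.normal(size=d)      # fixed random initialization direction

# GD: the distance to the minimum-norm solution grows with the init scale.
dist = {}
for scale in (0.0, 0.1, 1.0):
    w = gd(X, y, scale * direction)
    dist[scale] = np.linalg.norm(w - w_min_norm)
    residual = np.linalg.norm(X @ w - y)
    print(f"GD, init scale {scale}: residual {residual:.1e}, "
          f"distance to min-norm {dist[scale]:.3f}")

# AdaGrad from zero initialization: compare its endpoint to GD's.
w_ada = adagrad(X, y, np.zeros(d))
dist_ada = np.linalg.norm(w_ada - w_min_norm)
print(f"AdaGrad: residual {np.linalg.norm(X @ w_ada - y):.1e}, "
      f"distance to min-norm {dist_ada:.3f}")
```

With zero initialization GD recovers the minimum-norm solution, and the reported distance scales linearly with the initialization scale. How this carries over to actual neural networks, and which strategy limits the effect, is the subject of the paper itself and is not reproduced here.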
Taxonomy
Methods: AdaGrad
