Geometrical structures of digital fluctuations in parameter space of neural networks trained with adaptive momentum optimization
Igor V. Netay

TL;DR
This paper investigates the geometric behavior of neural network parameters during training with adaptive momentum optimization, revealing how numerical artifacts lead to instability and divergence in both large and small models.
Contribution
It provides a detailed analysis of the geometric structures and numerical artifacts causing instability in adaptive momentum optimization during neural network training.
Findings
Numerical artifacts cause divergence in long-term training.
Parameter trajectories form double twisted spirals in parameter space.
Instability occurs in both large-scale and shallow networks.
Abstract
We present results of numerical experiments on neural networks trained by stochastic gradient-based optimization with adaptive momentum. This widely used optimization method has proven convergence and is efficient in practice, but it becomes numerically unstable in long-run training. We show that numerical artifacts are observable not only in large-scale models: they eventually lead to divergence even in shallow, narrow networks. We support this claim with experiments on more than 1600 neural networks trained for 50000 epochs. Local observations show the same behavior of network parameters in both stable and unstable training segments. Geometrically, the parameters trace double twisted spirals in parameter space, caused by numerical perturbations alternating with subsequent relaxation oscillations in the values of the first and second moment estimates.
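For readers unfamiliar with the two quantities the abstract refers to, the following is a minimal sketch of one adaptive-momentum update, assuming the optimizer is standard Adam (the abstract does not name the exact variant). The toy objective f(theta) = theta**2, the hyperparameter values, and the function names `adam_step` and `run` are illustrative assumptions, not taken from the paper, and the sketch does not attempt to reproduce the reported instability.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update (assumed variant); returns (theta, m, v)."""
    # First moment: exponential moving average of gradients.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Parameter step scaled by the square root of the second moment.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def run(steps=500, theta0=1.0):
    """Track (theta, m, v) on the toy objective f(theta) = theta**2."""
    theta, m, v = theta0, 0.0, 0.0
    history = []
    for t in range(1, steps + 1):
        grad = 2.0 * theta  # gradient of the hypothetical toy objective
        theta, m, v = adam_step(theta, grad, m, v, t)
        history.append((theta, m, v))
    return history
```

Logging `m` and `v` alongside `theta` in this way is one plausible starting point for inspecting how the two moment estimates evolve over a long run; the division by `sqrt(v_hat) + eps` is where small second-moment values can amplify floating-point perturbations.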
Taxonomy
Topics: Neural Networks and Applications
