Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
Atli Kosson, Bettina Messmer, Martin Jaggi

TL;DR
This paper explores how weight decay induces a rotational equilibrium in neural network training, balancing updates across neurons and layers, and offers insights into optimizer behavior and normalization techniques.
Contribution
It introduces the concept of rotational equilibrium caused by weight decay, providing a new perspective on training dynamics and optimizer effectiveness in deep learning.
Findings
Weight decay leads to a steady state called rotational equilibrium.
Balanced rotation plays a key role in the effectiveness of normalization methods such as Weight Standardization and of AdamW relative to Adam.
Controlling rotation reduces warmup needs and enhances training stability.
Abstract
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation -- a proxy for the effective learning rate -- across different layers and neurons. Our work analyzes these dynamics across optimizers like Adam, Lion, and SGD with momentum, offering a new simple perspective on training that elucidates the efficacy of widely used but poorly understood methods in deep learning. We demonstrate how balanced rotation plays a key role in the effectiveness of normalization like Weight Standardization, as well as that of AdamW over Adam with…
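The steady state described in the abstract can be illustrated with a small toy simulation. The sketch below is not the authors' code. It assumes an idealized scale-invariant neuron trained with plain SGD (no momentum) and weight decay, where each gradient is a random direction orthogonal to the weight vector with magnitude proportional to 1/||w||. The function name `simulate` and all hyperparameter values are hypothetical choices for illustration. Under these assumptions, the algebra of the update gives an equilibrium angular update of roughly sqrt(2 * lr * wd) per step, regardless of the initial weight norm.

```python
import math
import numpy as np


def simulate(init_norm, lr=0.1, wd=0.01, dim=256, steps=5000, seed=0):
    """Toy model: SGD with weight decay on one scale-invariant neuron.

    Returns the final weight norm and the mean angular update (radians)
    over the last 100 steps.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(dim)
    w *= init_norm / np.linalg.norm(w)  # set the starting norm

    angles = []
    for _ in range(steps):
        # Assumption: random gradient direction, orthogonal to w,
        # as is the case for a scale-invariant weight vector.
        g = rng.standard_normal(dim)
        g -= (g @ w) / (w @ w) * w
        g /= np.linalg.norm(g)
        # Assumption: gradient magnitude scales as 1 / ||w||.
        g /= np.linalg.norm(w)

        # SGD step with L2-style weight decay (no momentum).
        w_new = w - lr * (g + wd * w)

        # Angle between consecutive weight vectors.
        cos = (w @ w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
        angles.append(math.acos(min(1.0, max(-1.0, cos))))
        w = w_new

    return float(np.linalg.norm(w)), float(np.mean(angles[-100:]))


if __name__ == "__main__":
    for init_norm in (0.2, 5.0):
        norm, angle = simulate(init_norm)
        print(f"init_norm={init_norm}: final norm={norm:.4f}, "
              f"angular update={angle:.5f}, "
              f"sqrt(2*lr*wd)={math.sqrt(2 * 0.1 * 0.01):.5f}")
```

In this toy setting, a neuron that starts with a small norm grows and one that starts with a large norm shrinks until both reach the same norm and the same per-step rotation. Momentum, Adam, and Lion are not modeled here; their equilibria are analyzed in the paper itself.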
Taxonomy
Topics: Advanced Thermodynamics and Statistical Mechanics
Methods: Weight Standardization · Weight Decay · Adam · Evolved Sign Momentum · Stochastic Gradient Descent · AdamW
