Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers
Hao Chen, Jinghui Yuan, Hanmin Zhang

TL;DR
This paper introduces AdamO, a novel optimizer that decouples norm control from feature learning, reducing radial oscillations and improving generalization and stability in deep networks.
Contribution
It proposes Orthogonal Dynamics Decoupling, a new approach that separates magnitude and direction updates, enhancing optimizer performance over AdamW.
Findings
AdamO improves generalization on vision and language tasks.
AdamO enhances stability without added complexity.
Decoupling dynamics reduces radial oscillations.
Abstract
Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push-pull interaction induces radial oscillations, injecting noise into Adam's second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam's adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates…
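The abstract is truncated here, so the exact AdamO update is not available on this page. As a rough illustration of the idea it describes (an SGD-style step on the norm, Adam-style preconditioning restricted to the tangential component), the following is a minimal NumPy sketch for a single weight vector. The function names `split_radial_tangential`, `init_state`, and `decoupled_step` are hypothetical. So are the separate learning rates, the re-projection of the preconditioned step, and the final renormalization. All of these are assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np

def split_radial_tangential(w, g, eps=1e-12):
    """Split gradient g into a scalar radial part (along w) and a tangential vector."""
    u = w / (np.linalg.norm(w) + eps)  # unit direction of the weights
    g_rad = float(np.dot(g, u))        # component of g along w (scalar)
    g_tan = g - g_rad * u              # remainder, orthogonal to w
    return u, g_rad, g_tan

def init_state(w):
    """Adam moment buffers for the tangential gradient (hypothetical layout)."""
    return {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}

def decoupled_step(w, g, state, lr_tan=1e-3, lr_rad=1e-2, weight_decay=1e-2,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update: SGD on the norm, Adam on the direction.

    Assumption: this is a sketch of the decoupling idea only; the
    hyperparameters and the exact form of each step are not from the paper.
    """
    u, g_rad, g_tan = split_radial_tangential(w, g)
    r = np.linalg.norm(w)

    # Adam moments see only the tangential gradient, so radial
    # oscillations do not enter the second-moment estimate.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_tan
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_tan ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    step = m_hat / (np.sqrt(v_hat) + eps)

    # Elementwise preconditioning can reintroduce a radial component,
    # so project the step back onto the tangent space (assumption).
    step_tan = step - np.dot(step, u) * u

    # Radial part: plain SGD on the scalar norm, with weight decay acting
    # only on the norm. The floor keeps the norm strictly positive.
    r_new = max(r - lr_rad * (g_rad + weight_decay * r), 1e-12)

    # Tangential part: move the direction, then rescale to the new norm.
    w_moved = w - lr_tan * step_tan
    return r_new * w_moved / np.linalg.norm(w_moved)
```

A short usage example under the same assumptions:

```python
rng = np.random.default_rng(0)
w = rng.normal(size=5)
g = rng.normal(size=5)
state = init_state(w)
w = decoupled_step(w, g, state)
```

In this sketch the new norm depends only on the radial gradient and the decay term, while the Adam buffers never see the radial component. That is one way to read the separation the abstract argues for.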
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks
