Never Saddle for Reparameterized Steepest Descent as Mirror Flow
Tom Jacobs, Chao Zhou, and Rebekka Burkholz

TL;DR
This paper introduces steepest mirror flows to unify and analyze how optimization geometry influences learning dynamics, implicit bias, and sparsity, explaining why Adam variants often outperform SGD in fine-tuning.
Contribution
It provides a theoretical framework for steepest descent methods, revealing their advantages in saddle-point escape and feature learning, especially in the context of Adam and AdamW.
Findings
Steepest descent facilitates saddle-point escape and feature learning.
Gradient descent requires unrealistically large learning rates to escape saddles, a regime that is uncommon in fine-tuning.
Decoupled weight decay stabilizes feature learning by enforcing new balance equations.
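To make the coupled/decoupled distinction concrete, here is a minimal sketch of the two sign-descent update rules on a single scalar weight. It does not reproduce the paper's balance equations; the constant gradient, the helper names (`coupled_step`, `decoupled_step`, `settle`), and all numeric values are assumptions chosen purely for illustration.

```python
def sign(g):
    # Returns -1, 0, or 1.
    return (g > 0) - (g < 0)

def coupled_step(w, grad, lr, wd):
    # Coupled: the decay term is added to the gradient before the sign is taken.
    return w - lr * sign(grad + wd * w)

def decoupled_step(w, grad, lr, wd):
    # Decoupled: the sign sees only the loss gradient; decay shrinks w separately.
    return w - lr * sign(grad) - lr * wd * w

def settle(step_fn, w=2.0, grad=-0.5, lr=0.1, wd=0.1, steps=2000):
    # Toy assumption: the loss gradient is held constant, which no real loss does.
    for _ in range(steps):
        w = step_fn(w, grad, lr, wd)
    return w

w_coupled = settle(coupled_step)      # hovers where grad + wd * w changes sign
w_decoupled = settle(decoupled_step)  # settles where sign(grad) = -wd * w
print(w_coupled, w_decoupled)
```

Under this toy assumption the coupled rule hovers near `-grad / wd`, which depends on the gradient's magnitude, while the decoupled rule settles at `1 / wd`, which depends only on the gradient's sign. The two rules therefore balance different quantities even though they share the same hyperparameters.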
Abstract
How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods (including sign descent, which is closely related to Adam), we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning.…
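The saddle-escape claim can be illustrated with a toy discrete-time experiment. This is a sketch under stated assumptions, not the paper's setup: a single scalar `x` is reparameterized as a product of `depth` weights, the loss is `0.5 * (x - target)**2`, and every weight starts at the same small value so that the iterate begins near the saddle at the origin. The function names, depth, initialization, learning rate, and escape threshold are all hypothetical choices.

```python
import math

def sign(g):
    # Returns -1, 0, or 1.
    return (g > 0) - (g < 0)

def grads(ws, target):
    # Gradient of 0.5 * (prod(ws) - target)**2 with respect to each weight.
    residual = math.prod(ws) - target
    return [residual * math.prod(ws[:i] + ws[i + 1:]) for i in range(len(ws))]

def steps_to_escape(update, depth=3, init=0.01, lr=0.01,
                    target=1.0, threshold=0.5, max_steps=100_000):
    # Counts updates until the product prod(ws) exceeds `threshold`.
    ws = [init] * depth
    for step in range(1, max_steps + 1):
        g = grads(ws, target)
        ws = [update(w, gi, lr) for w, gi in zip(ws, g)]
        if math.prod(ws) > threshold:
            return step
    return None  # did not escape within max_steps

# Gradient descent: the step scales with the (tiny) gradient near the saddle.
gd_steps = steps_to_escape(lambda w, g, lr: w - lr * g)
# Sign descent: the step has fixed size lr regardless of gradient magnitude.
sign_steps = steps_to_escape(lambda w, g, lr: w - lr * sign(g))
print(gd_steps, sign_steps)
```

With these settings, gradient descent needs far more updates than sign descent at the same learning rate, because near the origin each partial derivative is a product of the other small weights. Sign descent ignores that magnitude and moves at a constant rate. The sketch says nothing about behavior near the minimum, where fixed-size sign steps would oscillate.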
Peer Reviews
Decision · ICLR 2026 Poster
The paper provides a separation result for signGD with coupled and decoupled weight decay and shows that they have different regularization properties, which is interesting.
a) The paper focuses on deep diagonal reparameterizations, which are products of one-dimensional variables with a particular initialization shape. The setting is too restrictive for the paper's results to generalize.
The message that Adam escapes saddles better than SGD is believable and may be cool. However, it is not well established. The rest of the claims are not substantiated.
How is this about fine-tuning transformers? There is not even a linear diagonal attention there; no one ever claimed that a diagonal network is a good model for a transformer, because it is not. How do you argue this? Mirror flow study is incremental and does not adequately support the thesis. While the diagonal-network analysis is neat, it is very similar to existing ones and does not bring any real novelty to the community. *I believe it is extremely incremental.* It is way too l
This is an interesting study with insightful findings. Combining mirror descent with reparameterization is a great idea.
The statement of the first contribution is misleading. This is not the first work studying GF and reparametrization. It is probably meant with respect to the family. For this type of work, assumptions are always problematic. They are very simplistic (diagonal networks) but it is very hard to potentially show more general results. In experiments, rather than studying the networks considered in the analysis, they should see if the results hold when the assumptions are not fulfilled (for exampl
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Neural Networks and Reservoir Computing
