Never Saddle for Reparameterized Steepest Descent as Mirror Flow

Tom Jacobs, Chao Zhou, and Rebekka Burkholz

arXiv:2603.02064·cs.LG·March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces steepest mirror flows to unify and analyze how optimization geometry influences learning dynamics, implicit bias, and sparsity, explaining why Adam variants often outperform SGD in fine-tuning.

Contribution

It provides a theoretical framework for steepest descent methods, revealing their advantages in saddle-point escape and feature learning, especially in the context of Adam and AdamW.

Findings

01

Steepest descent facilitates saddle-point escape and feature learning.

02

Gradient descent requires large learning rates to escape saddles, a regime that is uncommon in fine-tuning.

03

Decoupled weight decay stabilizes feature learning by enforcing new balance equations.
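
To make the coupled versus decoupled distinction concrete, here is a minimal sketch of the two update rules for sign descent on a single scalar parameter. It assumes the standard convention (as in Adam with L2 regularization versus AdamW): coupled decay adds the penalty to the gradient before the sign is taken, while decoupled decay shrinks the parameter outside the sign. The function names and the numeric values are hypothetical and are not taken from the paper.

```python
def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def coupled_step(p, g, lr, wd):
    # Coupled: the L2 term wd * p is folded into the gradient,
    # so it can only influence the direction of the step.
    return p - lr * sign(g + wd * p)

def decoupled_step(p, g, lr, wd):
    # Decoupled: the decay term is applied outside the sign,
    # so its magnitude is preserved in the update.
    return p - lr * (sign(g) + wd * p)

# Hypothetical values: a small negative gradient and a positive weight.
p, g, lr, wd = 1.0, -0.05, 0.1, 0.1
p_coupled = coupled_step(p, g, lr, wd)
p_decoupled = decoupled_step(p, g, lr, wd)
print(p_coupled, p_decoupled)
```

With these values the coupled penalty flips the sign of the step, whereas the decoupled rule follows the loss gradient and only slightly shrinks the weight. This shows only that the two rules can move a parameter in opposite directions. It does not reproduce the paper's balance equations.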

Abstract

How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods (including sign descent, which is closely related to Adam), we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning.…
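
The saddle-escape claim can be illustrated with a toy experiment. The sketch below is not the paper's setup: it assumes a scalar two-factor reparameterization w = u * v with loss 0.5 * (u * v - 1)^2, a small balanced initialization u = v = eps near the saddle at the origin, and hypothetical hyperparameters. It counts how many steps gradient descent and sign descent need to bring the loss below a tolerance.

```python
def grad(u, v, y=1.0):
    # Gradient of 0.5 * (u * v - y) ** 2 with respect to (u, v).
    r = u * v - y
    return r * v, r * u

def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)

def steps_to_fit(update, eps=1e-3, lr=1e-2, tol=1e-2, max_steps=100_000):
    # Start close to the saddle at (0, 0), where the gradient is tiny.
    u = v = eps
    for step in range(max_steps):
        if 0.5 * (u * v - 1.0) ** 2 < tol:
            return step
        gu, gv = grad(u, v)
        u, v = update(u, gu, lr), update(v, gv, lr)
    return max_steps

# Gradient descent: the step size scales with the (small) gradient norm.
gd_steps = steps_to_fit(lambda p, g, lr: p - lr * g)
# Sign descent: the step size is lr per coordinate, regardless of gradient norm.
sign_steps = steps_to_fit(lambda p, g, lr: p - lr * sign(g))
print(gd_steps, sign_steps)
```

In this toy model, gradient descent grows the parameters only geometrically from eps, so its escape time increases as eps shrinks or as the learning rate decreases. Sign descent moves a fixed distance per step, so its escape time does not depend on the gradient's magnitude near the saddle.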

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper provides a separation result for signGD with coupled and decoupled weight decay and shows that they have different regularization properties, which is interesting.

Weaknesses

a) The paper focuses on deep diagonal reparameterizations, which are products of one-dimensional variables with a particular initialization shape. The setting is too restrictive for the paper's results to generalize.

Reviewer 02 · Rating 0 · Confidence 5

Strengths

The message that Adam escapes saddles better than SGD is believable and may be cool. However, it is not well established. The rest of the claims are not substantiated.

Weaknesses

How is this about fine-tuning transformers? There is not even a linear diagonal attention there. No one ever claimed that a diagonal network is a good model for a transformer, because it is not. How do you argue this?

Mirror flow study is incremental and does not adequately support the thesis. While the diagonal-network analysis is neat, it is very similar to existing ones and does not bring any real novelty to the community. I believe it is extremely incremental. It is way too l

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This is an interesting study with insightful findings. Combining mirror descent with reparameterization is a great idea.

Weaknesses

The statement of the first contribution is misleading. This is not the first work studying GF and reparametrization. It is probably meant with respect to the family. For this type of work, assumptions are always problematic. They are very simplistic (diagonal networks) but it is very hard to potentially show more general results. In experiments, rather than studying the networks considered in the analysis, they should see if the results hold when the assumptions are not fulfilled (for exampl

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Neural Networks and Reservoir Computing