Optimization Insights into Deep Diagonal Linear Networks
Hippolyte Labarrière, Cesare Molinari, Lorenzo Rosasco, Cristian Vega, Silvia Villa

TL;DR
This paper analyzes how gradient flow behaves in deep diagonal linear networks, revealing explicit convergence guarantees and the influence of parametrization and initialization on training efficiency.
Contribution
It provides a tractable analysis of deep diagonal linear networks, showing how their structure leads to well-behaved optimization dynamics and explicit convergence guarantees.
Findings
Gradient flow induces mirror-flow dynamics in effective parameters.
Exponential decay of loss under Polyak-Lojasiewicz condition.
Initialization and scaling influence training speed.
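For reference, the objects named in these findings can be written in their standard textbook forms. The notation below is an assumption for illustration ($u_\ell$ for the layer vectors, $L$ for the depth, $\mathcal{L}$ for the loss, $\Phi$ for the mirror potential, $\mu$ for the PL constant) and is not taken from the paper, which may use different symbols and normalizations. The elementwise-product form of the effective parameter is the usual definition of a diagonal linear network and is likewise assumed here.

```latex
% Effective parameter of a depth-L diagonal linear network (assumed standard form)
\theta = u_1 \odot u_2 \odot \cdots \odot u_L

% Gradient flow on each layer
\dot{u}_\ell(t) = -\nabla_{u_\ell}\, \mathcal{L}\bigl(u_1(t) \odot \cdots \odot u_L(t)\bigr),
\qquad \ell = 1, \dots, L

% Mirror flow on the effective parameter with a convex potential \Phi
\frac{d}{dt}\, \nabla \Phi\bigl(\theta(t)\bigr) = -\nabla \mathcal{L}\bigl(\theta(t)\bigr)

% Polyak-Lojasiewicz (PL) condition with constant \mu > 0
\tfrac{1}{2}\, \bigl\| \nabla \mathcal{L}(\theta) \bigr\|^2
  \;\ge\; \mu \bigl( \mathcal{L}(\theta) - \inf \mathcal{L} \bigr)
```

As a point of comparison only: for plain Euclidean gradient flow, the PL condition gives $\mathcal{L}(\theta(t)) - \inf \mathcal{L} \le e^{-2\mu t}\,(\mathcal{L}(\theta(0)) - \inf \mathcal{L})$. The rate for the mirror flow studied in the paper depends on the potential and the initialization and is not reproduced here.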
Abstract
Gradient-based methods successfully train highly overparameterized models in practice, even though the associated optimization problems are markedly nonconvex. Understanding the mechanisms that make such methods effective has become a central problem in modern optimization. To investigate this question in a tractable setting, we study Deep Diagonal Linear Networks. These are multilayer architectures with a reparameterization that preserves convexity in the effective parameter, while inducing a nontrivial geometry in the optimization landscape. Under mild initialization conditions, we show that gradient flow on the layer parameters induces a mirror-flow dynamic in the effective parameter space. This structural insight yields explicit convergence guarantees, including exponential decay of the loss under a Polyak-Lojasiewicz condition, and clarifies how the parametrization and…
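The setting described in the abstract can be explored numerically with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' setup: the function name `train_deep_diagonal`, the least-squares loss, the elementwise-product parametrization `theta = u_1 * ... * u_L`, the identical positive initialization of every layer, and the replacement of continuous-time gradient flow by small explicit gradient-descent steps. No attempt is made to check the paper's initialization condition.

```python
import numpy as np


def train_deep_diagonal(X, y, depth=3, init_scale=0.5, lr=1e-2, steps=5000):
    """Gradient descent on the layers of a deep diagonal linear network.

    Hypothetical helper: the effective parameter is assumed to be the
    elementwise product of the layer vectors, and the loss is least squares.
    Small steps stand in for continuous-time gradient flow.
    """
    n, d = X.shape
    U = np.full((depth, d), float(init_scale))  # one row per layer u_l
    losses = np.empty(steps)
    for t in range(steps):
        theta = U.prod(axis=0)                  # effective parameter
        residual = X @ theta - y
        losses[t] = 0.5 * np.mean(residual ** 2)
        grad_theta = X.T @ residual / n         # gradient w.r.t. theta
        grads = np.empty_like(U)
        for l in range(depth):
            # Chain rule: dLoss/du_l = grad_theta * (product of the other layers)
            grads[l] = grad_theta * np.delete(U, l, axis=0).prod(axis=0)
        U -= lr * grads
    return U.prod(axis=0), losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))           # more samples than features
    theta_star = np.array([1.0, 2.0, 0.5, 1.5, 1.0])
    y = X @ theta_star                          # noiseless targets
    theta, losses = train_deep_diagonal(X, y)
    print("final loss:", losses[-1])
    print("recovered theta:", np.round(theta, 3))
```

The demo uses more samples than features so that the least-squares loss is strongly convex in the effective parameter, which implies the PL condition in `theta`. Plotting `losses` on a logarithmic axis is one way to look for the eventually linear trend that exponential decay would produce. The observed rate in this sketch depends on `init_scale` and `depth`.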
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
- The paper introduces a mild technical assumption $\mathcal{A}$ on the initialization which holds almost surely for a random initialization. Under this assumption, the gradient flow for the parameterization of a deep diagonal network can be rewritten as a mirror flow with a *convex* potential on the linear predictor $\theta$, which is an interesting technical observation.
- Under the same assumption, the linear convergence for any loss function $L$ which satisfies the PL condition in $\theta$ is est
- The major weakness is that the mirror potential is not explicitly defined. Even without an explicit definition, the limiting behavior in the case of large depth or small initialization is not discussed or analyzed, which is a major drawback.
- The implicit bias and the optimization benefits or drawbacks of depth are not discussed, which weakens the motivation for studying deep diagonal linear networks.
The presentation in this paper is good, derivations are clear, and theoretical results are well explained.
This paper lacks novelty in the following ways:
1. Theorem 1, showing that GF on $L$-layer diagonal linear networks induces a mirror flow, is, as the authors have acknowledged, an application of Li et al., 2022. So the contribution of this theorem is rather weak.
2. Theorem 2 is not new. Min et al., 2023 (see their Section 4.2) have shown the exponential convergence of GF under the same condition as described in equation $(\mathcal{A})$ with a better lower bound on the rate.
**References**: Z L
The study brings a relatively fresh perspective of implicit bias.
In general, this paper is well organized, e.g., the authors clearly demonstrate their motivation and contribution. They also clearly develop their notations, definitions, and theorems to support their claims. These efforts make the understanding of this paper fairly straightforward. In addition, the characterization of certain properties of the learning dynamics of deep diagonal linear networks might also be interesting, e.g., the second point of Proposition 1.
Unfortunately, both the technical and theoretical contributions of this paper are rather limited, as discussed below.
1. The ultimate goal of this paper is to reveal the implicit regularization effect of GF for deep diagonal linear networks. However, the explicit form of the corresponding entropy function for the induced mirror flow dynamics is completely absent. There is not even a suggestion about possible properties that the entropy function should have. In addition, the der
Taxonomy
Topics: Neural Networks and Applications
