Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks
Nikolaos Tsilivis, Eitan Gronich, Julia Kempe, Gal Vardi

TL;DR
This paper investigates how steepest descent algorithms implicitly bias the training of deep homogeneous neural networks, revealing geometric margin dynamics and connections to adaptive methods like Adam.
Contribution
It characterizes the implicit bias of steepest descent with infinitesimal learning rate, linking training trajectories to margin maximization and KKT points.
Findings
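An algorithm-dependent geometric margin starts increasing once the network reaches perfect training accuracy.
Any limit point of the training trajectory is a KKT point of the corresponding margin-maximization problem.
Experiments on steepest descent trajectories highlight connections to the implicit bias of adaptive methods such as Adam and Shampoo.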
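To make the first two findings concrete, the quantities involved can be sketched as follows. This follows the standard formulation for homogeneous networks from Lyu & Li (2020), which the reviews cite. The notation (f, θ, L, γ, and the choice to write the constraint with threshold 1) is an assumption for illustration, not quoted from the paper, and the paper's exact statements may use a smoothed margin or a relaxed notion of KKT point.

```latex
% Assumed notation: f(x;\theta) is an L-homogeneous network, (x_i, y_i) are
% training points with labels y_i \in \{-1, +1\}, and \|\cdot\| is the norm
% that defines the steepest descent algorithm.

% Homogeneity of the network in its parameters:
f(x; c\,\theta) = c^{L} f(x; \theta) \quad \text{for all } c > 0.

% Algorithm-dependent geometric margin:
\gamma(\theta) = \frac{\min_{i} \; y_i f(x_i; \theta)}{\|\theta\|^{L}}.

% Corresponding margin-maximization problem, whose KKT points are the
% candidates for limit points of the training trajectory:
\min_{\theta} \; \tfrac{1}{2}\|\theta\|^{2}
\quad \text{subject to} \quad y_i f(x_i; \theta) \ge 1 \;\; \text{for all } i.
```

Changing the norm changes both the margin and the constrained problem, which is why different steepest descent algorithms can favor different solutions.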
Abstract
We study the implicit bias of the general family of steepest descent algorithms with infinitesimal learning rate in deep homogeneous neural networks. We show that: (a) an algorithm-dependent geometric margin starts increasing once the networks reach perfect training accuracy, and (b) any limit point of the training trajectory corresponds to a KKT point of the corresponding margin-maximization problem. We experimentally zoom into the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of popular adaptive methods (Adam and Shampoo).
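As a rough illustration of the "general family of steepest descent algorithms", the sketch below computes one discrete update step for three common norms. It is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function names (`steepest_direction`, `dual_norm`, `steepest_descent_step`, `memoryless_adam_direction`) are hypothetical, the paper analyzes the infinitesimal learning rate limit rather than discrete steps, and the gradient is assumed to be nonzero.

```python
import numpy as np

def steepest_direction(grad, norm):
    """Return u with ||u|| <= 1 that maximizes <grad, u> (assumes grad != 0)."""
    if norm == "l2":      # gradient descent direction
        return grad / np.linalg.norm(grad)
    if norm == "linf":    # sign descent direction
        return np.sign(grad)
    if norm == "l1":      # coordinate descent: move only the largest coordinate
        u = np.zeros_like(grad)
        j = np.argmax(np.abs(grad))
        u[j] = np.sign(grad[j])
        return u
    raise ValueError(f"unknown norm: {norm}")

def dual_norm(grad, norm):
    """Dual norm of grad, equal to <grad, steepest_direction(grad, norm)>."""
    if norm == "l2":
        return np.linalg.norm(grad)
    if norm == "linf":    # the dual of l_inf is l_1
        return np.abs(grad).sum()
    if norm == "l1":      # the dual of l_1 is l_inf
        return np.abs(grad).max()
    raise ValueError(f"unknown norm: {norm}")

def steepest_descent_step(theta, grad, norm, lr):
    """One (unnormalized) steepest descent step with respect to the given norm."""
    return theta - lr * dual_norm(grad, norm) * steepest_direction(grad, norm)

def memoryless_adam_direction(grad, eps=1e-12):
    """Adam's update direction with beta1 = beta2 = 0: grad / (sqrt(grad**2) + eps)."""
    return grad / (np.sqrt(grad ** 2) + eps)

grad = np.array([3.0, -4.0])
for name in ("l2", "linf", "l1"):
    print(name, steepest_direction(grad, name), dual_norm(grad, name))
```

With no momentum and a vanishing epsilon, Adam's per-coordinate normalization reduces to the sign of the gradient, which matches the ℓ∞ case above. This is one simple way to see why sign descent is a natural proxy when relating steepest descent to Adam.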
Peer Reviews
Decision: ICLR 2025 Poster
**Strong Points:** 1. The paper provides rigorous theoretical analysis of the implicit bias of steepest descent algorithms. The results and proofs extend the ℓ2-norm case (gradient descent) by Lyu & Li to general norms. The theoretical results are clean and some of the extensions are nontrivial. 2. The class of optimization algorithms studied in this paper is very general and includes several important optimization algorithms as special cases. 3. The paper connects theoretical insights to practi
**Weak Points:** The authors briefly discuss some prior work on implicit bias of optimization algorithms other than GD in the related work section. But I found the inclusion of related work and the discussion here somewhat insufficient. For example, Gunasekar et al. (2018) already studied the implicit bias of several classes of optimization algorithms. However, the citation here is very superficial and it is unclear to me how the results in this paper supersede (or compare with) their results. It
1. Understanding the implicit bias of optimization methods is an important problem in deep learning theory, and the paper takes a margin-based view of implicit bias and sheds light on how different methods may optimize different notions of margin for deep models. 2. It is a bit surprising that such an implicit bias of steepest descent can be rigorously proved in the general case. Previously, it was known that this can be proved for linear models, and for deep homogeneous networks, Lyu & Li (2020)
1. While I really appreciate that the authors worked out every technical detail to generalize the proof of Lyu & Li (2020), I also noted that the overall proof outline has not changed much from Lyu & Li (2020), which means this paper does not actually bring a brand new high-level proof idea. This is reasonable since any implicit bias analysis of steepest flow automatically implies an analysis of gradient flow, but the fact that the paper fails to prove the real KKT conditions and only manages to
1. The convergence of general steepest descent to the general maximum-margin solution is novel. 2. The connection between the implicit bias of sign steepest descent and that of Adam is interesting.
The set of results and proofs seems to be a straightforward generalization of [Lyu and Li, 2019]. The technical contribution is not strong.
Taxonomy
Topics: Neural Networks and Applications
Methods: Adam
