Intrinsic training dynamics of deep neural networks
Sibylle Marcotte, Gabriel Peyré, Rémi Gribonval

TL;DR
This paper investigates the intrinsic training dynamics of deep neural networks, establishing conditions under which gradient flows can be represented in lower-dimensional spaces, and extends these insights to various network architectures and initializations.
Contribution
It introduces a criterion for intrinsic dynamic properties in neural networks, generalizes balanced initializations, and characterizes intrinsic dynamics for deep linear networks and neural ODEs.
Findings
The intrinsic dynamic property is tied to the conservation laws associated with the factorization.
The gradient flow of ReLU networks can be rewritten as lower-dimensional intrinsic dynamics for many initializations.
Relaxed balanced initializations are necessary and sufficient for certain intrinsic dynamics in linear networks.
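The link between conservation laws and intrinsic dynamics can be illustrated on a classical toy case that is not taken from the paper: a scalar two-layer linear model with parameter $\theta = (u, v)$ and lifted variable $z = uv$. Along the gradient flow of a loss $f(uv)$, the quantity $c = u^2 - v^2$ is conserved, and since $(u^2 + v^2)^2 = c^2 + 4z^2$ the lifted variable follows $\dot z = -\sqrt{c^2 + 4z^2}\, f'(z)$, which depends on $\theta$ only through $z$ and the initialization. The sketch below assumes a quadratic loss and an explicit Euler discretization; the function name `simulate` and the parameters `target`, `dt`, and `steps` are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(u0, v0, target=1.0, dt=1e-4, steps=20000):
    """Euler-discretized gradient flow for the toy loss f(u*v) = 0.5 * (u*v - target)**2.

    Runs the parameter-space flow on (u, v) and, side by side, the
    intrinsic flow on z = u*v that only uses z and the conserved value c.
    """
    u, v = u0, v0
    c = u0**2 - v0**2  # conserved by the exact (continuous-time) flow
    z = u0 * v0        # lifted variable at initialization
    for _ in range(steps):
        g = u * v - target  # f'(u*v)
        # Parameter-space step: du/dt = -g*v, dv/dt = -g*u (simultaneous update).
        u, v = u - dt * g * v, v - dt * g * u
        # Intrinsic step: dz/dt = -sqrt(c^2 + 4 z^2) * f'(z).
        z = z - dt * np.sqrt(c**2 + 4 * z**2) * (z - target)
    return u, v, z, c


u, v, z, c = simulate(u0=0.3, v0=1.2)
print("drift of u^2 - v^2:", abs((u**2 - v**2) - c))
print("gap between u*v and intrinsic z:", abs(u * v - z))
```

Both printed quantities should be small, up to the discretization error of the Euler scheme, which only approximates the continuous-time flow. The initialization here is deliberately unbalanced ($c \neq 0$) to show that the reduction does not require $u^2 = v^2$ in this scalar case.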
Abstract
A fundamental challenge in the theory of deep learning is to understand whether gradient-based training can promote parameters belonging to certain lower-dimensional structures (e.g., sparse or low-rank sets), leading to so-called implicit bias. As a stepping stone, motivated by the proof structure of existing implicit bias analyses, we study when a gradient flow on a parameter $\theta$ implies an intrinsic gradient flow on a "lifted" variable $z = \phi(\theta)$, for an architecture-related function $\phi$. We express a so-called intrinsic dynamic property and show how it is related to the study of conservation laws associated with the factorization $\phi$. This leads to a simple criterion based on the inclusion of kernels of linear maps, which yields a necessary condition for this property to hold. We then apply our theory to general ReLU networks of arbitrary depth and show that,…
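The setting described in the abstract can be sketched with the chain rule. The notation below is an assumption made for illustration, since the truncated abstract does not fix it: $\theta$ is the parameter, $\phi$ the architecture-related function, $z = \phi(\theta)$ the lifted variable, $f$ a loss expressed in terms of $z$, and $\partial\phi$ the Jacobian of $\phi$.

```latex
\begin{aligned}
\dot{\theta}(t) &= -\nabla (f \circ \phi)(\theta(t))
                 = -\partial\phi(\theta(t))^{\top}\, \nabla f(z(t)),
                 \qquad z(t) = \phi(\theta(t)), \\
\dot{z}(t) &= \partial\phi(\theta(t))\, \dot{\theta}(t)
            = -\partial\phi(\theta(t))\, \partial\phi(\theta(t))^{\top}\, \nabla f(z(t)).
\end{aligned}
```

Under this reading, the flow on $z$ is intrinsic when the matrix $\partial\phi(\theta)\,\partial\phi(\theta)^{\top}$ can be written as a function of $z$ alone (possibly depending on the initialization), so that the second equation closes without reference to $\theta$.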
Peer Reviews
Decision·ICLR 2026 Poster
This paper makes a solid contribution to the line of work studying the implicit bias of NNs. The reasons are multifold: * The proposed framework is novel and provides a unified approach for establishing that a reparametrization of the weights admits intrinsic dynamics, which is essential to implicit bias analyses. This result is very useful as, to my knowledge, existing literature relies on ad-hoc approaches that come with a difficult-to-track fine-grained variation in assumptions. While the di
A balancing discussion on the limitations of this framework / current results is mostly missing. For example, * What is the main difficulty in extending Theorem 4.3 to arbitrary $r$? * The ReLU architecture and $\phi_{\mathrm{ReLU}}$ pair seems to allow for stronger properties leading to intrinsic dynamics, i.e., intrinsic recoverability, compared to linear neural networks with relaxed balanced $\theta_0$ and $\phi_{\mathrm{Lin}}$. Is it coincidental, or does it point to a deeper-rooted limit
The paper presents a number of ideas that encompass many previous results in the same framework, and seems to also show that one cannot do better than these previous results (there are no more invariants than those already known). Some of the intrinsic dynamics provided are to my knowledge new, though they are all very close to already known works.
This paper is a bit of a typical example of "proof by abstract nonsense": it mainly proves already existing results using very abstract (and arguably quite complex) tools. I appreciate that this type of approach can tell us that there are not more invariants and therefore simplifications than the ones we already knew, but I am not convinced that I need all these tools to find the next invariants on a new model, because it seems that people have been able to identify these invariants naiv
This paper has demonstrated a high level of mathematical rigor. The authors clearly define the involved properties formally, and use a series of lemmas and theorems to formally establish their connections. I find the results provided in Theorem 3.3 technically sound, which provides a viable way to investigate the possibility of studying the low-dimensional intrinsic Riemannian flow. The scope of the framework is somewhat general. In addition, the application to deep linear networks (the relaxed
1. However, in my view, this paper does not provide fundamentally new insight. The authors indeed broaden the scope of the connection between conservation laws and intrinsic metric for low-dimensional flow; however, this idea has been broadly discussed by prior works (e.g., Bah et al. (2022); Marcotte et al. (2023)) and the authors do not simplify the description. Hence, the paper lacks an adequate motivation for what this new framework can provide. As a result, I think the scope of contribution of
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Advanced Graph Neural Networks
