Implicit regularization of deep residual networks towards neural ODEs
Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau

TL;DR
This paper shows that deep residual networks trained with gradient flow are implicitly regularized towards neural ODEs: a network initialized as a discretization of a neural ODE remains one throughout training, which gives a mathematical link between discrete networks and their continuous-depth counterparts.
Contribution
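It establishes that residual networks initialized as discretizations of neural ODEs remain close to these ODEs throughout training. The result holds for any finite training time, and also as the training time tends to infinity when the network satisfies a Polyak-Lojasiewicz condition.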
Findings
Residual networks initialized as neural ODE discretizations remain close to those ODEs throughout training.
Gradient flow converges to a global minimum under a Polyak-Lojasiewicz condition.
Numerical experiments support the theoretical results.
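The role of the Polyak-Lojasiewicz (PL) condition in the convergence finding can be illustrated with the standard textbook argument. This is a sketch under assumptions: it uses the global form of the condition, with a hypothetical constant $\mu > 0$ and a loss $\ell$ with infimum $\ell^\star$. The paper's exact formulation is not reproduced here and may differ (for instance, it may be local rather than global).

```latex
% Assumed (textbook) PL condition with constant \mu > 0:
\|\nabla \ell(\theta)\|^2 \;\ge\; \mu \,\bigl(\ell(\theta) - \ell^\star\bigr).
% Gradient flow: \dot\theta(t) = -\nabla \ell(\theta(t)). By the chain rule,
\frac{d}{dt}\bigl(\ell(\theta(t)) - \ell^\star\bigr)
  = -\|\nabla \ell(\theta(t))\|^2
  \;\le\; -\mu \,\bigl(\ell(\theta(t)) - \ell^\star\bigr),
% and Gronwall's inequality gives exponential decay of the loss gap:
\ell(\theta(t)) - \ell^\star \;\le\; e^{-\mu t}\,\bigl(\ell(\theta(0)) - \ell^\star\bigr).
```

Under these assumptions the loss gap vanishes as $t \to \infty$, which is the sense in which gradient flow reaches a global minimum.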
Abstract
Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Lojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with…
Peer Reviews
Decision · ICLR 2024 spotlight
The derived results and the neural ODE limits seem rather interesting. The theoretical analyses seem rigorous and their discussions are quite thorough, though I haven't read the proofs in detail. It could be that I have missed a similar related work in the literature, but I do think this seems to be an important contribution, and thus could spark interesting future work on implicit regularization as well as furthering understanding of residual networks via the toolbox of ODEs. Besides, the paper
- One of the key drawbacks is the requirement of weight-tying (weight sharing). While the results with weight-tied networks don't look bad, it would have been nice to handle the general case. Can the authors elaborate on where exactly it interferes with the proof strategy?
- The other issue is that the linear overparameterization $m > c_1 n$ would, in practice, be somewhat unreasonable and is thus a bit of a stretch. Maybe this is the norm in the literature, but do the authors have some ideas
The paper is very well written and easy to follow, and the results and their implications are clearly stated. The authors show that the weights of a ResNet converge to an ODE while also considering the dynamics of the network, which is new and interesting. Given that the result shows that the dynamics of the ResNet architecture converge to those of an ODE, and therefore, as the authors mention, one can utilize existing generalization bounds for neural ODEs etc can under certain conditions now be app
The results are under the condition that the resnet is trained under clipped gradient flow, which is often not used in practice. However, I do understand that the main contribution of the paper is to characterize the behaviour of the model in the limit, so it shouldn’t affect the main contribution of the work.
1. Most prior works in this domain make simplifying assumptions that prevent their analysis from being directly applicable to practical models. The authors overcome those assumptions
N/A I am giving a score of 6, given my inability to appropriately verify all the proofs in the main text and appendices.
Taxonomy
Topics: Model Reduction and Neural Networks · Advanced Numerical Analysis Techniques · Tribology and Lubrication Engineering
