ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit
Louis-Pierre Chaintron, Lénaïc Chizat, Javier Maass

TL;DR
This paper proves that the training dynamics of residual neural networks (ResNets) converge to a well-defined large-scale limit as depth, hidden width, and embedding dimension grow jointly, with explicit bounds on the error between the finite network and its limit.
Contribution
It establishes the convergence of ResNet training dynamics to a large-scale limit in the joint depth, width, and embedding dimension regime, with explicit error rates.
Findings
Error between ResNet and its limit after a bounded number of training steps is O(1/L + sqrt(D/(L M)) + 1/sqrt(D))
Convergence rate of O(P^(-1/6)) for parameter budget P=Theta(L M D)
Analysis applies to a broad class of architectures including Transformers
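The first two findings can be checked numerically. The sketch below evaluates the three terms of the stated bound with all hidden constants set to 1 (an assumption, since the abstract gives only the rate), and estimates the log-log slope of the bound against the budget P = L M D. The function names and the two allocations (`balanced_shape`, `wide_embedding_shape`) are hypothetical choices for illustration, not taken from the paper.

```python
import math


def error_bound(L, M, D):
    """Sum of the three terms in the stated rate; hidden constants set to 1 (assumption)."""
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)


def balanced_shape(P):
    """Hypothetical allocation L = M = D = P**(1/3), so that L * M * D = P."""
    s = P ** (1.0 / 3.0)
    return s, s, s


def wide_embedding_shape(P):
    """Hypothetical allocation D = P**(1/2), L = M = P**(1/4), also with L * M * D = P."""
    return P ** 0.25, P ** 0.25, P ** 0.5


def fitted_exponent(P1, P2, shape=balanced_shape):
    """Log-log slope of the bound between two parameter budgets P1 < P2."""
    e1 = error_bound(*shape(P1))
    e2 = error_bound(*shape(P2))
    return (math.log(e2) - math.log(e1)) / (math.log(P2) - math.log(P1))


if __name__ == "__main__":
    # Slope for the balanced allocation; it approaches -1/6 as the budgets grow.
    print(fitted_exponent(1e12, 1e15))
    # With D = P**(1/2) the middle term sqrt(D / (L M)) equals 1 and does not decay.
    print(error_bound(*wide_embedding_shape(1e12)))
```

Under the balanced allocation the terms sqrt(D/(L M)) and 1/sqrt(D) both scale as P^(-1/6) while 1/L decays faster, which is consistent with the stated rate. The second allocation shows that a budget-respecting shape can still leave the bound non-vanishing when D grows as fast as L M.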
Abstract
We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint infinite depth L, hidden width M, and embedding dimension D limit. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). This error rate is empirically tight when measured in embedding space. For a budget of P = Theta(L M D) parameters, this yields a convergence rate O(P^(-1/6)) for the scalings of (L, M, D) that minimize the bound. Our analysis exploits in an essential way the depth-two structure of residual blocks and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this…
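The O(P^(-1/6)) rate can be recovered from the error bound by a short balancing argument. The following is a sketch that ignores constants and assumes power-law scalings with exponents a, b, c; this parametrization is introduced here for illustration and is not taken from the paper.

```latex
\begin{align*}
L &= P^{a},\quad M = P^{b},\quad D = P^{c},\qquad a + b + c = 1,\\
\frac{1}{L} &= P^{-a},\qquad
\sqrt{\frac{D}{LM}} = P^{-(a+b-c)/2} = P^{-(1-2c)/2},\qquad
\frac{1}{\sqrt{D}} = P^{-c/2},\\
\text{decay exponent} &= \min\Bigl(a,\ \tfrac{1-2c}{2},\ \tfrac{c}{2}\Bigr).
\end{align*}
```

The last two entries of the minimum move in opposite directions as c varies and coincide at c = 1/3, where both equal 1/6. Any a between 1/6 and 2/3 (with b = 2/3 - a) keeps the 1/L term from dominating, so the best achievable exponent under this parametrization is 1/6, matching the stated rate.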
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
