ResNets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-scale Limit

Louis-Pierre Chaintron; Lénaïc Chizat; Javier Maass

arXiv:2603.18168 · stat.ML · March 23, 2026

Open Access

TL;DR

This paper proves that the training dynamics of residual neural networks (ResNets) converge to a well-defined large-scale limit as depth, hidden width, and embedding dimension grow jointly, with an explicit bound on the error between the finite network and its limit.

Contribution

It establishes the convergence of ResNet training dynamics to a large-scale limit in the joint depth, width, and embedding dimension regime, with explicit error rates.

Findings

01

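After a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + sqrt(D/(L M)) + 1/sqrt(D)), where L is the depth, M the hidden width, and D the embedding dimension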

02

For a parameter budget P = Theta(L M D), the scalings of (L, M, D) that minimize the bound give a convergence rate of O(P^(-1/6))

03

The analysis applies formally to a broad class of architectures, including Transformers with bounded key-query dimension
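
The arithmetic linking the error bound to the O(P^(-1/6)) rate can be checked numerically. The sketch below is illustrative only: it sets every hidden constant in the O(.) bound to 1 and takes P = L M D exactly, both of which are assumptions, and the helper names `error_bound` and `balanced_shape` are hypothetical. Substituting L M = P/D turns the middle term into D/sqrt(P), which balances 1/sqrt(D) at D = P^(1/3); choosing L = P^(1/6) and M = P^(1/2) then makes all three terms equal to P^(-1/6). This is one allocation that attains the rate, not necessarily the one used in the paper.

```python
import math


def error_bound(L: float, M: float, D: float) -> float:
    """Evaluate 1/L + sqrt(D/(L*M)) + 1/sqrt(D).

    Assumption: all constants hidden in the O(.) bound are set to 1.
    """
    return 1.0 / L + math.sqrt(D / (L * M)) + 1.0 / math.sqrt(D)


def balanced_shape(P: float) -> tuple[float, float, float]:
    """Hypothetical allocation of a budget P = L*M*D.

    Returns (L, M, D) = (P^(1/6), P^(1/2), P^(1/3)), the choice that
    makes the three terms of the bound equal to P^(-1/6).
    """
    return P ** (1 / 6), P ** (1 / 2), P ** (1 / 3)


if __name__ == "__main__":
    for exponent in (12, 18, 24):
        P = 2.0 ** exponent
        L, M, D = balanced_shape(P)
        bound = error_bound(L, M, D)
        # With unit constants the bound is three equal terms: 3 * P^(-1/6).
        print(f"P=2^{exponent}: L={L:.1f}, M={M:.1f}, D={D:.1f}, "
              f"bound={bound:.4f}, 3*P^(-1/6)={3 * P ** (-1 / 6):.4f}")
```

Under these assumptions, increasing the budget by a factor of 64 only halves the bound, which is what a P^(-1/6) rate means in practice.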

Abstract

We establish convergence of the training dynamics of residual neural networks (ResNets) to their joint infinite depth L, hidden width M, and embedding dimension D limit. Specifically, we consider ResNets with two-layer perceptron blocks in the maximal local feature update (MLU) regime and prove that, after a bounded number of training steps, the error between the ResNet and its large-scale limit is O(1/L + sqrt(D/(L M)) + 1/sqrt(D)). This error rate is empirically tight when measured in embedding space. For a budget of P = Theta(L M D) parameters, this yields a convergence rate O(P^(-1/6)) for the scalings of (L, M, D) that minimize the bound. Our analysis exploits in an essential way the depth-two structure of residual blocks and applies formally to a broad class of state-of-the-art architectures, including Transformers with bounded key-query dimension. From a technical viewpoint, this…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis