Implicit Regularization of Large Neural Networks via Mean-Field Formulation
Beatrice Acciaio, Jakob Heiss, Gudmund Pammer, Qinxin Yan

TL;DR
This paper introduces a mathematical framework, based on mean-field theory and stochastic control, that explains how early stopping acts as an implicit regularizer when training overparametrized neural networks. It links the training dynamics to a new metric on probability measures.
Contribution
The paper develops a mean-field, control-based formulation of neural network training dynamics and shows that early stopping induces implicit regularization through a new metric on probability measures.
Findings
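In the mean-field limit, the parameter distribution follows a gradient flow on the space of probability measures.
A new metric on probability measures is introduced; it is a mean-field generalization of the control representation of the Wasserstein-2 distance.
Non-asymptotic bounds describe how the induced regularization depends on the stopping time.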
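For orientation, the two standard objects these findings build on can be written as follows. This is a sketch in assumed notation, not the paper's own statement: the symbols \mu_t, F, \alpha_t and X_t are placeholders introduced here, and the summary does not give the exact form of the paper's objective or of its new metric.

```latex
% Assumed notation (not from the source): \mu_t is the law of the parameters \theta
% at training time t, and F is a placeholder for the training objective viewed as a
% functional of \mu. A Wasserstein gradient flow of F then reads:
\partial_t \mu_t
  = \nabla_\theta \cdot \Big( \mu_t \, \nabla_\theta \frac{\delta F}{\delta \mu}(\mu_t, \theta) \Big)

% Control (Benamou-Brenier type) representation of the Wasserstein-2 distance
% on the time interval [0, 1]:
W_2^2(\mu, \nu)
  = \inf_{\alpha} \Big\{ \mathbb{E} \int_0^1 |\alpha_t|^2 \, dt \;:\;
      dX_t = \alpha_t \, dt,\ \mathrm{Law}(X_0) = \mu,\ \mathrm{Law}(X_1) = \nu \Big\}
```

The second formula is the classical case: each particle is steered independently at a quadratic cost. The metric described in the findings is presented as a mean-field generalization of this representation.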
Abstract
We propose a mathematical framework to explain implicit regularization from early stopping during the training of overparametrized neural networks. In the mean-field limit, the parameter distribution evolves according to a gradient flow on the space of probability measures. We show that these dynamics admit an equivalent McKean-Vlasov stochastic control formulation through the corresponding Hamilton-Jacobi-Bellman (HJB) equation. The control viewpoint yields a Dynamic Programming Principle (DPP), which we use to define a new metric on probability measures. This metric can be viewed as a mean-field generalization of the control representation of the Wasserstein-2 distance, and it naturally appears as a regularization term selected by early stopping. We further obtain non-asymptotic bounds describing how the induced regularization depends on the stopping time.
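The dependence of the regularization on the stopping time can be illustrated with a much simpler finite-dimensional analogy. The sketch below is not the paper's mean-field construction: it runs plain gradient descent on an overparametrized least-squares problem from zero initialization and tracks the parameter norm, which grows with the number of steps and stays below the norm of the minimum-norm interpolant. The function name `gd_path`, the data, and the step-size choice are all assumptions made for illustration.

```python
import numpy as np


def gd_path(X, y, steps):
    """Gradient descent on 0.5 * mean((Xw - y)^2) from w = 0.

    Returns the parameter norm and training loss after each step.
    Hypothetical helper for illustration only.
    """
    n, d = X.shape
    # Step size 1 / lambda_max of the Hessian X^T X / n, so that every
    # eigen-component of w moves monotonically toward its limit.
    lam_max = np.linalg.eigvalsh(X.T @ X / n).max()
    lr = 1.0 / lam_max
    w = np.zeros(d)
    norms, losses = [], []
    for _ in range(steps):
        residual = X @ w - y
        w = w - lr * (X.T @ residual) / n
        norms.append(np.linalg.norm(w))
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return np.array(norms), np.array(losses)


rng = np.random.default_rng(0)
n, d = 20, 100  # more parameters than samples (overparametrized)
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

norms, losses = gd_path(X, y, steps=50)

# Minimum-norm interpolating solution, the limit of gradient descent from zero.
w_min_norm = np.linalg.pinv(X) @ y

for k in (1, 5, 20, 50):
    print(f"step {k:3d}: ||w|| = {norms[k - 1]:.4f}, loss = {losses[k - 1]:.6f}")
print(f"min-norm interpolant: ||w|| = {np.linalg.norm(w_min_norm):.4f}")
```

Stopping earlier yields a smaller-norm solution with a larger training loss, so in this toy setting the stopping time plays the role of a regularization strength. The paper's result concerns the analogous trade-off for distributions over network parameters, measured by its new metric rather than the Euclidean norm.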
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · Adversarial Robustness in Machine Learning
