On Lazy Training in Differentiable Programming
Lenaic Chizat (CNRS, UP11), Edouard Oyallon, Francis Bach (LIENS, SIERRA)

TL;DR
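This paper shows that the 'lazy training' phenomenon in neural networks is a consequence of a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, which is equivalent to learning with positive-definite kernels. Its experiments indicate that this regime has limitations in practical deep learning tasks.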
Contribution
The work demonstrates that lazy training is not exclusive to over-parameterized networks and provides theoretical bounds and analysis for its occurrence in non-convex optimization.
Findings
Lazy training arises from a choice of scaling, often implicit, that makes a model behave as its linearization around the initialization.
Theoretical bounds are established for the difference between lazy and linearized training paths.
Lazy training degrades performance in practical deep convolutional neural networks.
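The scaling effect described in the findings above can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the paper's experiments: the two-layer tanh network, the synthetic data, the step size, and the helper names `forward` and `train_scaled`. It runs gradient descent on a scaled objective of the form R(alpha * (f(w) - f(w0))) / alpha**2 with a squared loss R, and reports how far the parameters move from their initialization for several values of the scale alpha.

```python
import numpy as np

def forward(a, B, X):
    """Two-layer network f(w)(x) = (1/sqrt(m)) * sum_j a_j * tanh(b_j . x)."""
    return np.tanh(X @ B.T) @ a / np.sqrt(a.size)

def train_scaled(alpha, X, y, a0, B0, lr=0.2, steps=500):
    """Gradient descent on F(w) = R(alpha * (f(w) - f(w0))) / alpha**2,
    with R(p) = 0.5 * mean((p - y)**2). Hypothetical helper for illustration."""
    a, B = a0.copy(), B0.copy()
    m, n = a.size, len(y)
    f0 = forward(a0, B0, X)                      # centre the model: output is 0 at init
    for _ in range(steps):
        H = np.tanh(X @ B.T)                     # hidden activations, shape (n, m)
        resid = alpha * (H @ a / np.sqrt(m) - f0) - y
        # Chain rule: dF/dw = (1 / (alpha * n)) * sum_i resid_i * df_i/dw
        g = resid / (alpha * n * np.sqrt(m))     # per-sample weights, shape (n,)
        grad_a = H.T @ g                         # shape (m,)
        grad_B = (g[:, None] * (1.0 - H**2) * a[None, :]).T @ X  # shape (m, d)
        a -= lr * grad_a
        B -= lr * grad_B
    resid = alpha * (forward(a, B, X) - f0) - y
    moved = np.sqrt(np.sum((a - a0) ** 2) + np.sum((B - B0) ** 2))
    init_norm = np.sqrt(np.sum(a0 ** 2) + np.sum(B0 ** 2))
    return moved / init_norm, 0.5 * np.mean(resid ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))         # synthetic inputs (assumption)
y = np.sin(3.0 * X[:, 0]) * X[:, 1]              # synthetic targets (assumption)
a0 = rng.normal(size=50)                         # output weights at initialization
B0 = rng.normal(size=(50, 2))                    # hidden weights at initialization

results = {alpha: train_scaled(alpha, X, y, a0, B0) for alpha in (1, 10, 100)}
for alpha, (rel_change, loss) in results.items():
    print(f"alpha={alpha:>3}  relative parameter change={rel_change:.4f}  loss={loss:.4f}")
```

In this toy setting, a larger alpha should let the training loss decrease while the parameters stay closer to their initial values, which is the qualitative signature of the lazy regime. It says nothing about generalization or about the convolutional networks studied in the paper.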
Abstract
In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Sparse and Compressive Sensing Techniques
