Multi-scale Feature Learning Dynamics: Insights for Double Descent
Mohammad Pezeshki, Amartya Mitra, Yoshua Bengio, Guillaume Lajoie

TL;DR
This paper investigates epoch-wise double descent in deep learning and shows that different features are learned at different times, which explains the non-monotonic behavior of the test error during training.
Contribution
The paper uses tools from statistical physics to analyze epoch-wise double descent theoretically, and validates the findings with numerical experiments and observations on deep neural networks.
Findings
Double descent arises from features learned at different scales.
Slower-learning features cause the second descent in test error.
The theory's predictions match numerical experiments and the behavior observed in deep neural networks.
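The idea that features at different scales are learned at different speeds can be explored with a small numerical sketch. The code below is not the authors' code or their exact model; it is a minimal illustration under stated assumptions: a linear teacher-student regression trained by gradient flow from zero weights, with Gaussian inputs split into a "fast" group (large variance) and a "slow" group (small variance). The function name `simulate` and all parameter names and default values are hypothetical. Whether the resulting test-error curve shows a visible second descent depends on the chosen scales, noise level, and sample size.

```python
import numpy as np


def simulate(n=120, d_fast=50, d_slow=50, scale_fast=1.0, scale_slow=0.1,
             noise_std=0.5, times=None, seed=0):
    """Train/test error along gradient flow for a two-scale linear model.

    All names and defaults are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    if times is None:
        times = np.logspace(-2, 5, 60)
    d = d_fast + d_slow

    # Per-feature standard deviations: one fast group, one slow group.
    scales = np.concatenate([np.full(d_fast, scale_fast),
                             np.full(d_slow, scale_slow)])
    X = rng.standard_normal((n, d)) * scales
    w_star = rng.standard_normal(d) / np.sqrt(d)          # teacher weights
    y = X @ w_star + noise_std * rng.standard_normal(n)   # noisy labels

    # Gradient flow on L(w) = ||Xw - y||^2 / (2n), started at w(0) = 0:
    #   w(t) = V diag((1 - exp(-lam * t)) / lam) V^T b,
    # where X^T X / n = V diag(lam) V^T and b = X^T y / n.
    H = X.T @ X / n
    b = X.T @ y / n
    lam, V = np.linalg.eigh(H)
    lam = np.clip(lam, 0.0, None)   # guard against tiny negative round-off
    b_eig = V.T @ b
    sigma_diag = scales ** 2        # diagonal of the population covariance

    positive = lam > 1e-12
    safe_lam = np.where(positive, lam, 1.0)  # avoid division by zero

    train_err, test_err = [], []
    for t in times:
        # (1 - exp(-lam t)) / lam, using its limit t as lam -> 0.
        coef = np.where(positive, -np.expm1(-lam * t) / safe_lam, t)
        w = V @ (coef * b_eig)
        train_err.append(np.mean((X @ w - y) ** 2))
        diff = w - w_star
        # Expected squared error on a fresh noisy sample.
        test_err.append(diff @ (sigma_diag * diff) + noise_std ** 2)
    return np.asarray(times), np.array(train_err), np.array(test_err)


if __name__ == "__main__":
    t, train, test = simulate()
    best = int(np.argmin(test))
    print(f"lowest test error {test[best]:.4f} at t = {t[best]:.3g}")
```

Each eigendirection of the empirical covariance converges at a rate set by its eigenvalue, so large-variance directions are fitted early and small-variance directions much later. Plotting `test` against `t` on a logarithmic time axis, for several values of `scale_slow`, is one way to see how the separation of scales shapes the curve.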
Abstract
A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting from the high-dimensional interactions between the large number of network parameters. Such non-trivial dynamics lead to intriguing behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent, where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent, in which the test error undergoes two non-monotonic transitions, or descents, as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to…
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Gaussian Processes and Bayesian Inference
