Gradient Starvation: A Learning Proclivity in Neural Networks
Mohammad Pezeshki, Sékou-Oumar Kaba, Yoshua Bengio, Aaron Courville, Doina Precup, Guillaume Lajoie

TL;DR
This paper uncovers a fundamental phenomenon called Gradient Starvation in neural networks, explaining how gradient descent can lead to incomplete feature learning and proposing a regularization method to mitigate this issue.
Contribution
It provides a theoretical framework for understanding Gradient Starvation and introduces a novel regularization technique to improve feature diversity and model robustness.
Findings
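Gradient Starvation causes neural networks to rely on a subset of the predictive features while other predictive features go unlearned.
Theoretical analysis links this feature imbalance to the statistical structure of the training data and to the learning dynamics of gradient descent.
The proposed regularization improves accuracy and robustness in out-of-distribution scenarios.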
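The first finding can be illustrated with a toy experiment. The sketch below is not taken from the paper: it assumes a linear model trained with logistic loss on data where every feature is a scaled copy of the label, and the names `train`, `scales`, `steps`, and `lr` are hypothetical. It compares the weight a weak feature receives when trained next to a strong feature with the weight it receives when trained alone.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(scales, steps=200, lr=0.1):
    """Gradient descent on logistic loss for a linear model.

    Assumed toy data: every example is x = y * scales with y in {-1, +1},
    so each feature alone predicts the label and only its magnitude differs.
    By symmetry, a single example with y = +1 is representative.
    """
    w = [0.0] * len(scales)
    for _ in range(steps):
        margin = sum(wi * si for wi, si in zip(w, scales))
        # Gradient of log(1 + exp(-margin)) w.r.t. w is -sigmoid(-margin) * scales.
        residual = sigmoid(-margin)
        w = [wi + lr * residual * si for wi, si in zip(w, scales)]
    return w

w_joint = train([3.0, 0.5])  # strong and weak feature together
w_alone = train([0.5])       # weak feature on its own
weak_joint, weak_alone = w_joint[1], w_alone[0]
print(f"weak-feature weight, trained jointly: {weak_joint:.3f}")
print(f"weak-feature weight, trained alone:   {weak_alone:.3f}")
```

Both weights share the same `residual` factor, so once the strong feature has driven the loss down, the gradient reaching the weak feature shrinks as well. In this setup the weak feature ends with a smaller weight than it reaches when trained alone.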
Abstract
We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks. Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by…
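This summary does not spell out the form of the regularizer. The full paper presents it under the name Spectral Decoupling, an L2 penalty applied to the network's logits rather than to its weights. The sketch below assumes that form for binary labels in {-1, +1}. The function names, the coefficient name `lam`, its default value, and the averaging over examples are illustrative assumptions, not the authors' code.

```python
import math

def logistic_loss(y, logit):
    # Numerically stable log(1 + exp(-y * logit)) for y in {-1, +1}.
    z = -y * logit
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def penalized_loss(ys, logits, lam=0.01):
    """Mean logistic loss plus (lam / 2) * mean squared logit.

    Assumption: the penalty acts on the model outputs (logits),
    not on the weights as ordinary weight decay would.
    """
    n = len(ys)
    data_term = sum(logistic_loss(y, s) for y, s in zip(ys, logits)) / n
    penalty = 0.5 * lam * sum(s * s for s in logits) / n
    return data_term + penalty

ys = [1, -1, 1]
logits = [4.0, -3.0, 0.5]
plain = penalized_loss(ys, logits, lam=0.0)
regularized = penalized_loss(ys, logits, lam=0.1)
print(f"plain: {plain:.4f}  regularized: {regularized:.4f}")
```

With `lam=0.0` the function reduces to the plain logistic loss. A positive `lam` makes large-magnitude logits costly, so confident examples keep contributing a gradient instead of saturating.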
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning
