Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks
Zhiwei Bai, Tao Luo, Zhi-Qin John Xu, Yaoyu Zhang

TL;DR
This paper introduces an embedding principle in depth that reveals how the loss landscape of deep neural networks contains all critical points of shallower networks, providing new insights into training dynamics and the effects of batch normalization.
Contribution
The work proposes a critical lifting operator and uncovers an embedding principle that links the loss landscapes of networks of different depths, advancing theoretical understanding of deep learning.
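To make the idea of an output-preserving lift concrete, here is a minimal NumPy sketch. It is not the paper's critical lifting operator: it only illustrates one simple way to insert a hidden layer into a ReLU network without changing the network's outputs. The function names (`shallow_forward`, `lift_one_layer`, `deep_forward`) and the `shift` parameter are hypothetical and chosen for illustration. The sketch checks output preservation only and does not verify that critical points map to critical points.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def shallow_forward(x, W1, b1, W2, b2):
    # Two-layer network: one ReLU hidden layer, linear output.
    h = relu(x @ W1 + b1)
    return h @ W2 + b2

def lift_one_layer(W1, b1, W2, b2, shift=0.0):
    # Hypothetical lift: insert an identity-weight hidden layer with a
    # non-negative bias `shift`. The first hidden layer's output is >= 0,
    # so the inserted ReLU receives non-negative inputs and acts as the
    # identity there. The output bias is adjusted to cancel the shift.
    if shift < 0:
        raise ValueError("shift must be non-negative for this sketch")
    width = W1.shape[1]
    W_mid = np.eye(width)
    b_mid = np.full(width, shift)
    b2_new = b2 - b_mid @ W2
    return W1, b1, W_mid, b_mid, W2, b2_new

def deep_forward(x, W1, b1, W_mid, b_mid, W2, b2):
    # Three-layer network: two ReLU hidden layers, linear output.
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W_mid + b_mid)
    return h2 @ W2 + b2

# Small demo with random parameters (assumed shapes, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

y_shallow = shallow_forward(x, W1, b1, W2, b2)
for shift in (0.0, 1.0, 2.5):
    y_deep = deep_forward(x, *lift_one_layer(W1, b1, W2, b2, shift=shift))
    print(shift, np.max(np.abs(y_shallow - y_deep)))
```

Each non-negative value of `shift` gives a different parameter setting of the deeper network with the same outputs, which hints at why a single point of the shallower network can correspond to a whole manifold in the deeper one.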
Findings
A local minimum of a network can be lifted to strict saddle points of a deeper network
Batch normalization helps avoid the critical manifolds lifted from shallower networks
Increasing training data shrinks the lifted critical manifolds, which can speed up training
Abstract
Understanding the relation between deep and shallow neural networks is extremely important for the theoretical study of deep learning. In this work, we discover an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs. The key tool for our discovery is the critical lifting operator proposed in this work, which maps any critical point of a network to critical manifolds of any deeper network while preserving the outputs. This principle provides new insights into many widely observed behaviors of DNNs. Regarding the easy training of deep networks, we show that a local minimum of an NN can be lifted to strict saddle points of a deeper NN. Regarding the acceleration effect of batch normalization, we demonstrate that batch normalization helps avoid the critical manifolds lifted from shallower NNs by suppressing layer…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Advanced Neural Network Applications
Methods: Batch Normalization
