A Theory of Saddle Escape in Deep Nonlinear Networks
Divit Rawal, Michael R. DeWeese

TL;DR
This paper develops a theoretical framework for understanding saddle escape in deep nonlinear networks, revealing how bottleneck layers influence training dynamics and escape times.
Contribution
It introduces an exact identity for layer weight imbalance, classifies activation functions into universality classes, and derives a critical-depth escape time law based on bottleneck layers.
Findings
The escape time scales with exponent -(r-2), where r is the number of bottleneck layers.
The theory agrees well with numerical simulations.
Activation functions can be classified into four universality classes based on the derived identity.
Abstract
In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss, and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law governed by the number of layers r at the bottleneck scale rather than the total depth. We find that this same exponent is recovered under…
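As a rough illustration of how a scalar ODE can give rise to an exponent of the form -(r-2), the sketch below uses a toy flow da/dt = a^(r-1) started from a small value eps and measures the time needed to reach order one. The toy ODE, the names `eps`, `escape_time`, and `fitted_exponent`, and the reading of the power law's base as an initialization scale are all assumptions made for illustration; none of them is taken from the paper, whose actual reduced equation is not shown here.

```python
import math

def escape_time(eps, r, target=1.0, n=20_000):
    """Time for the toy flow da/dt = a**(r-1) to go from a=eps to a=target.

    Assumption: this toy ODE stands in for the reduced scalar dynamics.
    With u = ln(a) the flow becomes du/dt = exp((r-2)*u), so
    T = integral of exp(-(r-2)*u) du from ln(eps) to ln(target).
    The integral is evaluated with a midpoint rule.
    """
    lo, hi = math.log(eps), math.log(target)
    h = (hi - lo) / n
    k = r - 2
    return h * sum(math.exp(-k * (lo + (i + 0.5) * h)) for i in range(n))

def fitted_exponent(r, eps_values=(1e-2, 1e-3, 1e-4)):
    """Slope of log(T) against log(eps) between the largest and smallest eps."""
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(escape_time(e, r)) for e in eps_values]
    return (ys[-1] - ys[0]) / (xs[-1] - xs[0])

if __name__ == "__main__":
    for r in (3, 4, 5):
        # The fitted slope should be close to -(r-2) for this toy flow.
        print(f"r={r}: fitted exponent {fitted_exponent(r):.3f}, -(r-2) = {-(r - 2)}")
```

For r > 2 the toy flow has the closed form T = (eps^(-(r-2)) - target^(-(r-2))) / (r-2), so the fitted slope approaches -(r-2) as eps shrinks. In this toy setting only r enters the exponent, which is meant to mirror the claim that the number of bottleneck layers, not the total depth, sets the scaling.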
