TL;DR
This paper investigates how gradient descent escapes from the initial saddle at the origin in deep ReLU networks, revealing a low-rank bias in the escape directions and saddle-to-saddle dynamics.
Contribution
It characterizes the low-rank bias in the escape directions from the saddle at the origin and proposes that deep ReLU networks follow saddle-to-saddle dynamics.
Findings
Optimal escape directions have a low-rank bias in deeper layers.
In the optimal escape direction, the first singular value of each deeper layer's weight matrix is larger than every other singular value of that matrix.
The authors suggest that GD in deep ReLU networks visits a sequence of saddles with increasing bottleneck rank.
Abstract
When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions along which GD leaves the origin, which play a role similar to that of the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is at least $\ell^{1/4}$ times larger than any other singular value. We also prove a number of related results about these escape directions. We suggest that deep ReLU networks exhibit saddle-to-saddle dynamics, with GD visiting a sequence of saddles with increasing bottleneck rank (Jacot, 2023).
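The low-rank bias described in the abstract can be probed numerically by comparing the two largest singular values of a layer's weight matrix. The following is a minimal pure-Python sketch, not the authors' code: the helper names (`top_singular_triplet`, `singular_value_ratio`) are hypothetical, and it assumes a small dense matrix given as a list of rows, using power iteration with deflation rather than a full SVD.

```python
import math


def matvec(A, v):
    # Multiply matrix A (list of rows) by vector v.
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def top_singular_triplet(A, iters=1000):
    """Estimate the largest singular value of A and its singular vectors
    by power iteration on A^T A (illustrative helper, not from the paper)."""
    At = transpose(A)
    # Fixed, non-uniform start vector; it could in principle be orthogonal
    # to the top singular vector, in which case the estimate would be wrong.
    v = [1.0 + 0.1 * i for i in range(len(A[0]))]
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # A maps the current vector to zero: report a zero singular value.
            return 0.0, [0.0] * len(A), v
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av] if sigma > 0 else Av
    return sigma, u, v


def singular_value_ratio(A):
    """Return sigma_1 / sigma_2 for A, or infinity if sigma_2 is zero."""
    s1, u, v = top_singular_triplet(A)
    # Deflate: remove the top singular component, then repeat.
    deflated = [
        [A[i][j] - s1 * u[i] * v[j] for j in range(len(v))]
        for i in range(len(u))
    ]
    s2, _, _ = top_singular_triplet(deflated)
    return s1 / s2 if s2 > 0 else math.inf


if __name__ == "__main__":
    # Toy weight matrix with singular values 3 and 1.
    W = [[3.0, 0.0], [0.0, 1.0]]
    print(singular_value_ratio(W))  # approximately 3.0
```

A large ratio for the deeper layers of a trained or escaping network would be consistent with the low-rank bias; for real weight matrices a library SVD routine is the more robust choice.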