A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
Carlos Couto, José Mourão, Mário A. T. Figueiredo, Pedro Ribeiro

TL;DR
This paper investigates the Hessian eigenspectrum near optimal points in neural networks, revealing how eigenvalues influence learning dynamics and how the Hessian's properties vary with network type and activation functions.
Contribution
It characterizes the Hessian eigenspectrum in teacher-student models, deriving analytical results for linear networks and empirical insights for non-linear networks, highlighting the Hessian's role in learning performance.
Findings
The smaller Hessian eigenvalues determine long-time learning performance.
For large linear networks, the Hessian spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution.
Hessian rank acts as an effective number of parameters for polynomial activation networks.
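To make the rank finding concrete, here is a minimal NumPy sketch on a toy quadratic-activation network. The architecture f(x; W) = sum_k (w_k . x)^2, the sizes d, m, n, and the rank tolerance are assumptions chosen for illustration, not the paper's setup. It relies on the standard fact that at zero residual (student weights equal to teacher weights) the Hessian of the mean-squared-error loss reduces to J^T J / n, where J is the Jacobian of the network outputs with respect to the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 3, 200               # input dim, hidden units, samples (arbitrary toy sizes)
W = rng.standard_normal((m, d))   # weights shared by teacher and student (hypothetical)
X = rng.standard_normal((n, d))   # Gaussian inputs

# Toy network: f(x; W) = sum_k (w_k . x)^2, so df/dW[k, j] = 2 * (w_k . x) * x[j].
pre = X @ W.T                                             # pre-activations, shape (n, m)
J = (2.0 * pre[:, :, None] * X[:, None, :]).reshape(n, m * d)

# With matching weights the residuals vanish, so the Hessian of the loss
# (1 / 2n) * sum_i r_i^2 is exactly the Gauss-Newton matrix J^T J / n.
H = J.T @ J / n

eigs = np.linalg.eigvalsh(H)
# Numerical rank: count eigenvalues above a relative tolerance (an assumed cutoff).
rank = int(np.sum(eigs > 1e-8 * eigs.max()))
n_params = m * d

print(f"parameters: {n_params}, Hessian rank: {rank}")
```

In this toy model the output depends on W only through W^T W, so rotating the hidden units leaves the function unchanged. Those m(m-1)/2 rotation directions are flat directions of the loss, which is why the Hessian rank falls below the raw parameter count.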
Abstract
Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a…
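The claim that the smaller eigenvalues govern long-time performance can be illustrated with the standard quadratic approximation of the loss near an optimum. The sketch below is an illustration under assumed values, not the paper's experiment: the eigenvalues, learning rate, and step count are hypothetical. In the Hessian eigenbasis, gradient descent with learning rate lr shrinks the error along the direction with eigenvalue lambda_i by a factor (1 - lr * lambda_i) per step, so directions with small eigenvalues decay slowest.

```python
import numpy as np

def gd_error_on_quadratic(eigvals, err0, lr, steps):
    """Gradient descent on L(e) = 0.5 * sum(eigvals * e**2), written in the Hessian eigenbasis."""
    err = err0.copy()
    for _ in range(steps):
        err = err - lr * eigvals * err   # the gradient of the quadratic is eigvals * err
    return err

eigvals = np.array([1.0, 0.1, 0.001])    # hypothetical Hessian eigenvalues
err0 = np.ones(3)                        # initial distance from the optimum per direction
lr = 0.5                                 # assumed learning rate, stable since lr * max(eigvals) < 2
steps = 1000

err = gd_error_on_quadratic(eigvals, err0, lr, steps)

# Closed form for comparison: each component decays geometrically.
closed_form = (1.0 - lr * eigvals) ** steps * err0

# Index of the direction that still carries the largest error.
slowest = int(np.argmax(np.abs(err)))
print("remaining error per direction:", err)
print("slowest direction has eigenvalue:", eigvals[slowest])
```

After many steps the residual error sits almost entirely in the direction with the smallest eigenvalue, which is the sense in which the low end of the spectrum controls late-stage learning.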
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Neural dynamics and brain function
