Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian
Jack Parker-Holder, Luke Metz, Cinjon Resnick, Hengyuan Hu, Adam Lerer, Alistair Letcher, Alex Peysakhovich, Aldo Pacchiano, Jakob Foerster

TL;DR
Ridge Rider (RR) is a novel method that explores the loss surface of neural networks by following Hessian eigenvectors, enabling the discovery of diverse solutions beyond those found by standard gradient descent.
Contribution
The paper introduces Ridge Rider, a new approach that follows Hessian eigenvectors to find qualitatively different solutions in neural network training.
Findings
RR can find diverse solutions by following Hessian eigenvectors.
Theoretical analysis shows RR effectively spans the loss surface.
Experimental results demonstrate RR's ability to discover solutions SGD may miss.
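To make the idea of following Hessian eigenvectors concrete, here is a minimal sketch on a toy 2-D loss. It is an illustration under stated assumptions, not the authors' algorithm: the loss function, the step sizes, the choice to start at a saddle point, the rule of stopping once the followed eigenvalue turns non-negative, and the final gradient-descent polish are all assumptions made for this example, and every function name (`follow_ridge`, `descend`, `find_solutions`) is hypothetical.

```python
import numpy as np

# Toy loss (an assumption for illustration): saddle at the origin,
# minima at (1, 0) and (-1, 0).
def loss(p):
    x, y = p
    return 0.25 * x**4 - 0.5 * x**2 + 0.5 * y**2

def grad(p):
    x, y = p
    return np.array([x**3 - x, y])

def hessian(p):
    x, _ = p
    return np.array([[3.0 * x**2 - 1.0, 0.0], [0.0, 1.0]])

def follow_ridge(start, direction, step=0.05, max_steps=200):
    """Step along the Hessian eigenvector that best matches `direction`,
    re-computing the eigenvectors after every step."""
    p = np.asarray(start, dtype=float)
    d = np.asarray(direction, dtype=float)
    for _ in range(max_steps):
        eigvals, eigvecs = np.linalg.eigh(hessian(p))
        overlaps = eigvecs.T @ d              # alignment with current direction
        k = int(np.argmax(np.abs(overlaps)))  # closest eigenvector
        if eigvals[k] >= 0.0:                 # curvature no longer negative: stop
            break
        d = np.sign(overlaps[k]) * eigvecs[:, k]  # keep a consistent orientation
        p = p + step * d
    return p

def descend(p, lr=0.1, steps=500):
    """Plain gradient descent, used here only to polish the end point."""
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

def find_solutions(saddle=(0.0, 0.0)):
    """Follow every negative-curvature eigenvector in both directions."""
    saddle = np.asarray(saddle, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(hessian(saddle))
    solutions = []
    for k in np.where(eigvals < 0)[0]:
        for sign in (1.0, -1.0):
            ridge_end = follow_ridge(saddle, sign * eigvecs[:, k])
            solutions.append(descend(ridge_end))
    return solutions
```

Calling `find_solutions()` returns two points, one close to (1, 0) and one close to (-1, 0), in some order. Gradient descent started exactly at the saddle would not move at all, since the gradient is zero there, which is the kind of situation where branching along eigenvectors yields distinct solutions.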
Abstract
Over the last decade, a single algorithm has changed many facets of our lives: Stochastic Gradient Descent (SGD). In the era of ever-decreasing loss functions, SGD and its various offspring have become the go-to optimization tool in machine learning and are a key component of the success of deep neural networks (DNNs). While SGD is guaranteed to converge to a local optimum (under loose assumptions), in some cases it may matter which local optimum is found, and this is often context-dependent. Examples frequently arise in machine learning, from shape-versus-texture features to ensemble methods and zero-shot coordination. In these settings, there are desired solutions which SGD on 'standard' loss functions will not find, since it instead converges to the 'easy' solutions. In this paper, we present a different approach. Rather than following the gradient, which corresponds to a locally…
Taxonomy
Topics: Computability, Logic, AI Algorithms · Artificial Intelligence in Games · Distributed and Parallel Computing Systems
Methods: Stochastic Gradient Descent
