Two-Level K-FAC Preconditioning for Deep Learning

Nikolaos Tselepidis; Jonas Kohler; Antonio Orvieto

arXiv:2011.00573 · cs.LG · December 8, 2020 · 1 citation

TL;DR

This paper introduces a two-level K-FAC preconditioning method that incorporates global curvature information via a coarse-space correction, aiming to improve convergence in deep learning optimization.

Contribution

It extends the K-FAC optimizer by adding a coarse-space correction to include global Fisher information, inspired by domain decomposition methods.
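
To make the idea of a coarse-space correction concrete, here is a minimal NumPy sketch of the generic additive two-level form used in domain decomposition: a block-diagonal solve plus a small global solve on a coarse space. It is an illustration under stated assumptions, not the paper's implementation. The restriction matrix R (one normalized indicator row per block), the dense per-block solves standing in for K-FAC's Kronecker-factored blocks, and the function names are all hypothetical choices made for this example.

```python
import numpy as np

def build_restriction(block_sizes):
    """Hypothetical coarse space: one normalized indicator row per block."""
    n = sum(block_sizes)
    R = np.zeros((len(block_sizes), n))
    start = 0
    for k, size in enumerate(block_sizes):
        R[k, start:start + size] = 1.0 / np.sqrt(size)
        start += size
    return R

def block_diagonal_solve(F, block_sizes, v):
    """Level 1: invert each diagonal block separately (local curvature only)."""
    out = np.zeros_like(v)
    start = 0
    for size in block_sizes:
        sl = slice(start, start + size)
        out[sl] = np.linalg.solve(F[sl, sl], v[sl])
        start += size
    return out

def coarse_correction(F, R, v):
    """Level 2: R^T (R F R^T)^{-1} R v, a small solve that sees off-diagonal blocks."""
    F_coarse = R @ F @ R.T  # (num_blocks x num_blocks) coarse matrix
    return R.T @ np.linalg.solve(F_coarse, R @ v)

def two_level_apply(F, block_sizes, v):
    """Additive two-level preconditioner applied to the vector v."""
    R = build_restriction(block_sizes)
    return block_diagonal_solve(F, block_sizes, v) + coarse_correction(F, R, v)

# Toy "Fisher" matrix for two layers with 3 and 4 parameters.
rng = np.random.default_rng(0)
J = rng.standard_normal((50, 7))
F_d = J.T @ J / 50 + 1e-3 * np.eye(7)  # damped so every solve is well posed
block_sizes = [3, 4]
grad = rng.standard_normal(7)

R = build_restriction(block_sizes)
step = two_level_apply(F_d, block_sizes, grad)
```

The coarse matrix here has one row and column per block, so the extra solve is cheap compared with the block solves while still using the off-diagonal entries of F that a purely block-diagonal preconditioner ignores.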

Findings

01 Improved convergence behavior observed in experiments

02 Enhanced preconditioning with global curvature information

03 Potential for faster training in deep learning models

Abstract

In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Stochastic Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research advocates the use of diagonal approximations of the so-called empirical Fisher matrix in stochastic gradient-based algorithms, with the most prominent one arguably being Adam. However, in recent years, several works cast doubt on the theoretical basis of preconditioning with the empirical Fisher matrix, and it has been shown that more sophisticated approximations of the actual Fisher matrix more closely resemble the theoretically well-motivated Natural Gradient Descent. One particularly successful variant of such methods is the so-called K-FAC optimizer, which uses a Kronecker-factored block-diagonal Fisher approximation as preconditioner. In this…
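
The Kronecker-factored preconditioner mentioned in the abstract can be illustrated for a single fully connected layer. The sketch below assumes the usual K-FAC setup in which the layer's Fisher block is approximated as A ⊗ G, with A the second moment of the layer inputs and G the second moment of the back-propagated output gradients. It relies on the identity (A ⊗ G)^{-1} vec(X) = vec(G^{-1} X A^{-1}) for symmetric A, so the large Kronecker product is never formed. The function name, the damping value, and the random toy data are assumptions made for this example.

```python
import numpy as np

def kfac_precondition(grad_W, A, G, damping=1e-3):
    """Apply ((A + dI) kron (G + dI))^{-1} to grad_W using only the small factors."""
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])
    left = np.linalg.solve(G_d, grad_W)        # G_d^{-1} grad_W
    return np.linalg.solve(A_d.T, left.T).T    # (G_d^{-1} grad_W) A_d^{-1}

# Toy layer: 4 inputs, 3 outputs, statistics estimated from 64 samples.
rng = np.random.default_rng(0)
n_in, n_out, batch = 4, 3, 64
acts = rng.standard_normal((batch, n_in))      # layer inputs
back = rng.standard_normal((batch, n_out))     # gradients w.r.t. layer outputs
A = acts.T @ acts / batch                      # (n_in x n_in) factor
G = back.T @ back / batch                      # (n_out x n_out) factor
grad_W = rng.standard_normal((n_out, n_in))    # gradient of the weight matrix

fast = kfac_precondition(grad_W, A, G)

# Dense reference: build the Kronecker product and solve on the
# column-stacked gradient (feasible only at toy sizes).
A_d = A + 1e-3 * np.eye(n_in)
G_d = G + 1e-3 * np.eye(n_out)
dense = np.linalg.solve(np.kron(A_d, G_d), grad_W.flatten(order="F"))
dense = dense.reshape((n_out, n_in), order="F")

kfac_matches_dense = np.allclose(fast, dense)
```

The factored version needs two small solves of sizes n_out and n_in, whereas the dense reference solves a system of size n_in · n_out, which is the reason the Kronecker structure makes this kind of preconditioning practical for large layers.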

Code & Models

Repositories

Abdoulaye-Koroko/natural-gradients
pytorch

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Sparse and Compressive Sensing Techniques

Methods: Adam · Natural Gradient Descent