A Mini-Block Fisher Method for Deep Neural Networks

Achraf Bahamou; Donald Goldfarb; Yi Ren

arXiv:2202.04124·cs.LG·October 28, 2022

Open Access

TL;DR

The paper introduces a mini-block Fisher (MBF) method for training deep neural networks, combining second-order information with computational efficiency, and demonstrates its effectiveness and convergence properties.

Contribution

It proposes a novel mini-block Fisher preconditioning method that balances second-order accuracy with computational efficiency using GPU parallelism.

Findings

01. MBF improves training efficiency over first-order methods.

02. MBF enhances generalization in autoencoder and CNN tasks.

03. An idealized MBF converges linearly.

Abstract

Deep neural networks (DNNs) are currently trained predominantly with first-order methods. Some of these methods (e.g., Adam, AdaGrad, RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs; these precondition the stochastic gradient with layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned gradient method that lies between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number…
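To make the idea of a block-diagonal empirical Fisher preconditioner concrete, the sketch below preconditions the gradient of a single fully connected layer. It is an illustration under stated assumptions, not the authors' implementation: the abstract above is cut off before it says how the mini-blocks are chosen, so the choice of one mini-block per output neuron (one row of the weight matrix), the function name `mini_block_precondition`, the `damping` parameter, and the use of NumPy on CPU rather than a GPU framework are all hypothetical.

```python
import numpy as np


def mini_block_precondition(per_example_grads, damping=1e-3):
    """Precondition the mean gradient of one fully connected layer.

    per_example_grads has shape (batch, out_features, in_features): the
    gradient of each example's loss with respect to the weight matrix.

    Assumption (not taken from the abstract): one mini-block per output
    neuron, i.e. per row of the weight matrix.
    """
    g = np.asarray(per_example_grads, dtype=float)
    batch, n_out, n_in = g.shape

    # Mini-batch gradient, shape (n_out, n_in).
    mean_grad = g.mean(axis=0)

    # Empirical Fisher mini-block for row i: the batch average of the outer
    # products g_i g_i^T. Result has shape (n_out, n_in, n_in).
    blocks = np.einsum("bij,bik->ijk", g, g) / batch

    # Damping keeps every mini-block invertible (hypothetical choice).
    blocks = blocks + damping * np.eye(n_in)

    # Solve all n_out small systems in one batched call; the blocks are
    # independent of one another, which is what makes them easy to parallelize.
    return np.linalg.solve(blocks, mean_grad[..., None])[..., 0]


# Usage: random per-example gradients for a layer with 3 outputs and 4 inputs.
rng = np.random.default_rng(0)
example_grads = rng.standard_normal((8, 3, 4))
direction = mini_block_precondition(example_grads, damping=0.1)
```

A gradient step would then subtract a learning rate times `direction` from the weights. With 1-by-1 blocks this reduces to a diagonal preconditioner, and with a single block per layer it becomes a full layer-wise block, which is the sense in which a mini-block scheme sits between the two classes of methods described in the abstract.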

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Face and Expression Recognition

Methods: AdaGrad · Adam · Transformer in Transformer