A Mini-Block Fisher Method for Deep Neural Networks
Achraf Bahamou, Donald Goldfarb, Yi Ren

TL;DR
The paper introduces a mini-block Fisher (MBF) method for training deep neural networks, combining second-order information with computational efficiency, and demonstrates its effectiveness and convergence properties.
Contribution
It proposes a novel mini-block Fisher preconditioning method that balances second-order accuracy with computational efficiency using GPU parallelism.
Findings
MBF improves training efficiency over first-order methods.
MBF enhances generalization in autoencoder and CNN tasks.
An idealized version of MBF converges linearly.
Abstract
Deep neural networks (DNNs) are currently trained predominantly with first-order methods. Some of these methods (e.g., Adam, AdaGrad, RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs; these precondition the stochastic gradient with layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned gradient method that lies between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number…
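To make the idea of a block-diagonal layer block more concrete, here is a minimal NumPy sketch of mini-block preconditioning for one fully connected layer. It is an illustration under stated assumptions, not the authors' implementation: the abstract is truncated before it defines the mini-blocks, so the sketch assumes one mini-block per output neuron (one row of the weight matrix), builds each block from per-sample gradients as an empirical Fisher estimate, and adds a damping term. The function name `mini_block_precondition` and the `damping` parameter are hypothetical.

```python
import numpy as np

def mini_block_precondition(per_sample_grads, damping=1e-3):
    """Precondition a layer gradient with per-row empirical Fisher mini-blocks.

    per_sample_grads: array of shape (N, n_out, n_in), one weight gradient
    per training sample. ASSUMPTION: one mini-block per output neuron (row).
    Returns an array of shape (n_out, n_in).
    """
    n_samples, n_out, n_in = per_sample_grads.shape
    mean_grad = per_sample_grads.mean(axis=0)  # (n_out, n_in)

    # blocks[j] = (1/N) * sum_n g[n, j, :] g[n, j, :]^T, shape (n_out, n_in, n_in)
    blocks = np.einsum("nji,njk->jik", per_sample_grads, per_sample_grads)
    blocks = blocks / n_samples

    # Damping keeps every mini-block invertible (an assumed regularizer).
    blocks = blocks + damping * np.eye(n_in)

    # Batched solve: one small (n_in x n_in) linear system per output neuron.
    direction = np.linalg.solve(blocks, mean_grad[..., None])[..., 0]
    return direction

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 3, 4))   # 8 samples, 3 output neurons, 4 inputs
step = mini_block_precondition(grads, damping=1e-2)
print(step.shape)
```

All the small systems are independent, so they can be solved as a single batched call, which is the kind of structure that maps well onto GPU parallelism. A diagonal preconditioner such as the one in Adam corresponds to the limiting case where every mini-block is 1 × 1.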
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Face and Expression Recognition
Methods: AdaGrad · Adam · Transformer in Transformer
