AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien Martins Gomes; Yanlei Zhang; Eugene Belilovsky; Guy Wolf; Mahdi S. Hosseini

arXiv:2405.16397·cs.LG·March 12, 2025

TL;DR

AdaFisher is an adaptive second-order optimizer that uses a diagonal block-Kronecker approximation of the Fisher information matrix, achieving better convergence and accuracy in training deep neural networks while maintaining computational efficiency.

Contribution

This paper introduces AdaFisher, a novel second-order optimizer that efficiently approximates the Fisher information matrix for improved training of DNNs.

Findings

1. AdaFisher outperforms state-of-the-art optimizers in accuracy.
2. AdaFisher converges faster than traditional methods.
3. It demonstrates robustness across different tasks.

Abstract

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by applying diagonal matrix preconditioning to the stochastic gradient during training. Despite the widespread use of first-order methods, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts, e.g. Adam and SGD. However, their practicality in training DNNs is still limited due to increased per-iteration computation compared to first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a diagonal block-Kronecker approximation of the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence/generalization capabilities and computational efficiency in…
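The diagonal block-Kronecker idea described in the abstract can be illustrated with a minimal NumPy sketch for a single fully connected layer. This is not the authors' implementation: the function name `diag_kron_precondition`, the `damping` parameter, and the use of batch means of squared activations and squared backpropagated gradients as the two diagonal factors are assumptions made for illustration. The actual AdaFisher update may differ in several respects.

```python
import numpy as np


def diag_kron_precondition(grad_w, acts, back_grads, damping=1e-3):
    """Precondition one layer's gradient with a diagonal Kronecker-factored
    Fisher estimate (illustrative sketch, hypothetical helper).

    grad_w:     (out, in)    gradient of the loss w.r.t. the layer weights
    acts:       (batch, in)  inputs to the layer
    back_grads: (batch, out) gradients backpropagated to the layer outputs
    damping:    assumed constant added for numerical stability
    """
    # Diagonal of the activation factor E[a a^T]
    diag_a = np.mean(acts ** 2, axis=0)        # shape (in,)
    # Diagonal of the gradient factor E[g g^T]
    diag_g = np.mean(back_grads ** 2, axis=0)  # shape (out,)
    # The Kronecker product of two diagonal matrices is diagonal; its
    # entries, reshaped to the weight layout, form an outer product.
    fisher_diag = np.outer(diag_g, diag_a)     # shape (out, in)
    # Element-wise inverse scaling of the gradient
    return grad_w / (fisher_diag + damping)


# Usage on random data
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 3))
back_grads = rng.normal(size=(8, 2))
grad_w = back_grads.T @ acts / acts.shape[0]
step = diag_kron_precondition(grad_w, acts, back_grads)
print(step.shape)
```

Because both factors are diagonal, the Kronecker product collapses to one scalar per weight, so the preconditioner needs no more memory than the weight matrix and no matrix inversion.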

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. A Kronecker-factored preconditioner with diagonal factors hasn't been used with the empirical Fisher matrix before.
2. Thorough experimentation.

Weaknesses

1. The preconditioner still requires layer inputs and gradients backpropagated through the layer, which is not always feasible for large training systems.
2. Adafactor already uses a Kronecker-factored preconditioner with diagonal factors, yet there is no comparison against Adafactor.
3. The regret bound $O(\log(T)\sqrt{T})$ is sub-optimal compared to Shampoo, which achieves the optimal $O(\sqrt{T})$.
4. There are low-rank approaches with complexity similar to the proposed method, such as EVA [1]. There should be c

Reviewer 02 · Rating 5 · Confidence 5

Strengths

Using the Fisher information matrix as a preconditioner is a good idea.

Weaknesses

The authors appear to be unaware of a previous work, "Kronecker-Factored Second-Order Optimizers Perform First-Order Descent on Neurons" by Frederik Benzing, which is very similar to this work.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper observes that the Kronecker factors are diagonally dominant and proposes a diagonal block-Kronecker approximation for the FIM. The resulting AdaFisher optimizer shows good performance. The paper is also well organized and easy to follow.

Weaknesses

Please refer to Questions.

Code & Models

Repositories

AtlasAnalyticsLab/AdaFisher · PyTorch · Official

Videos

AdaFisher: Adaptive Second Order Optimization via Fisher Information · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Metaheuristic Optimization Algorithms Research

Methods: Adaptive Second Order Optimization via Fisher Information · Stochastic Gradient Descent · Adam