MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature
Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko

TL;DR
MAC is a computationally efficient second-order optimization method for neural networks that approximates the curvature information used by KFAC. The authors report that it outperforms existing methods such as KFAC in accuracy and training efficiency.
Contribution
To the authors' knowledge, MAC is the first algorithm to apply Kronecker factorization to the FIM of transformer attention layers and to explicitly incorporate attention scores into the preconditioning.
Findings
MAC outperforms KFAC in accuracy and training speed.
MAC reduces memory usage compared to existing methods.
MAC converges to global minima under certain conditions.
Abstract
Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of the loss landscape. However, this comes at the expense of a high computational burden. In this work, we analyze the two components that constitute the layer-wise Fisher information matrix (FIM) used in KFAC: the Kronecker factors related to activations and pre-activation gradients. Based on empirical observations on their eigenspectra, we propose efficient approximations for them, resulting in a computationally efficient optimization method called MAC. To the best of our knowledge, MAC is the first algorithm to apply the Kronecker factorization to the FIM of attention layers used in transformers and explicitly integrate attention scores into the preconditioning. We also study the convergence property of MAC on nonlinear neural networks and provide two…
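For readers unfamiliar with the two Kronecker factors the abstract refers to, the following is a minimal NumPy sketch of standard KFAC-style preconditioning for a single fully connected layer. It is not MAC itself, whose approximations are not described in this excerpt. The function names, the damping value, and the random data are illustrative assumptions. The sketch builds one factor from the layer's input activations and one from the pre-activation gradients, then applies the inverse of their Kronecker product to the weight gradient without ever forming the full layer-wise FIM.

```python
import numpy as np

def kronecker_factors(acts, grads, damping=1e-3):
    """Build the two KFAC factors for one linear layer.

    acts:  (batch, d_in)  inputs to the layer (activations)
    grads: (batch, d_out) gradients w.r.t. the layer's pre-activations
    damping is an assumed Tikhonov term that keeps both factors invertible.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n + damping * np.eye(acts.shape[1])     # activation factor
    G = grads.T @ grads / n + damping * np.eye(grads.shape[1])  # gradient factor
    return A, G

def precondition(grad_W, A, G):
    """Return G^{-1} grad_W A^{-1} for grad_W of shape (d_out, d_in).

    With column-stacking vec, this equals (A kron G)^{-1} vec(grad_W),
    so only the two small factors are ever solved against.
    """
    X = np.linalg.solve(G, grad_W)    # G^{-1} grad_W
    return np.linalg.solve(A, X.T).T  # (G^{-1} grad_W) A^{-1}, using A = A^T

# Illustrative random data (hypothetical sizes).
rng = np.random.default_rng(0)
acts = rng.standard_normal((32, 5))
grads = rng.standard_normal((32, 3))

A, G = kronecker_factors(acts, grads)
grad_W = grads.T @ acts / acts.shape[0]  # mini-batch weight gradient
update = precondition(grad_W, A, G)
```

The cost of this step is dominated by the two linear solves, one of size d_in and one of size d_out, which is the kind of overhead that approximations to the factors aim to reduce.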
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and ELM
Methods: Softmax · Attention Is All You Need
