Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures
Runa Eschenhagen, Alexander Immer, Richard E. Turner, Frank Schneider, Philipp Hennig

TL;DR
This paper extends Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimization method, to modern neural networks with weight-sharing layers, demonstrating speedups and efficiency improvements in training diverse architectures.
Contribution
It introduces two variants of K-FAC tailored for linear weight-sharing layers, K-FAC-expand and K-FAC-reduce, shows that each is exact for deep linear networks with weight-sharing in its respective setting, and reports practical speedups when training modern architectures.
Findings
K-FAC-reduce is generally faster than K-FAC-expand.
Both variants reach target validation metrics in fewer steps.
K-FAC variants achieve comparable performance with reduced training time.
Abstract
The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with weight-sharing. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC -- expand and reduce. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via…
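The difference between K-FAC-expand and K-FAC-reduce can be illustrated with a minimal NumPy sketch. The abstract on this page does not spell out the formulas, so the following are assumptions: the function names are hypothetical, the layer inputs `a` have shape (N, R, D_in) and the output gradients `g` have shape (N, R, D_out), where R is the weight-sharing dimension (for example, sequence length), and the scaling constants are one plausible convention rather than the authors' definitive choice.

```python
import numpy as np


def kfac_expand_factors(a, g):
    """Sketch of 'expand': treat the weight-sharing dimension as extra batch entries.

    a: layer inputs, shape (N, R, D_in); g: output gradients, shape (N, R, D_out).
    The scaling below is an assumption made for illustration.
    """
    n, r, d_in = a.shape
    a_flat = a.reshape(n * r, d_in)       # (N*R, D_in)
    g_flat = g.reshape(n * r, -1)         # (N*R, D_out)
    A = a_flat.T @ a_flat / (n * r)       # input factor, (D_in, D_in)
    G = g_flat.T @ g_flat / n             # gradient factor, (D_out, D_out)
    return A, G


def kfac_reduce_factors(a, g):
    """Sketch of 'reduce': aggregate over the weight-sharing dimension first.

    Inputs are averaged and gradients are summed over R (an assumed convention),
    so the outer products are taken over N rows instead of N*R rows.
    """
    n = a.shape[0]
    a_red = a.mean(axis=1)                # (N, D_in)
    g_red = g.sum(axis=1)                 # (N, D_out)
    A = a_red.T @ a_red / n               # input factor, (D_in, D_in)
    G = g_red.T @ g_red / n               # gradient factor, (D_out, D_out)
    return A, G


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(8, 16, 4))       # N=8, R=16, D_in=4
    g = rng.normal(size=(8, 16, 3))       # D_out=3
    for name, fn in [("expand", kfac_expand_factors), ("reduce", kfac_reduce_factors)]:
        A, G = fn(a, g)
        print(name, A.shape, G.shape)
```

Under these assumptions, the reduce variant forms its factors from N rows rather than N·R rows, which is consistent with the abstract's remark that K-FAC-reduce is generally faster. When R = 1 (no weight-sharing) the two sketches coincide.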
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification
Methods: Residual Connection · Convolution · 1x1 Convolution · Max Pooling · Kaiming Initialization · Bottleneck Residual Block · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Batch Normalization · Average Pooling
