Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures

Runa Eschenhagen; Alexander Immer; Richard E. Turner; Frank Schneider; Philipp Hennig

arXiv:2311.00636·cs.LG·January 12, 2024·2 cites

TL;DR

This paper extends Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimization method, to modern neural networks with weight-sharing layers, demonstrating speedups and efficiency improvements in training diverse architectures.

Contribution

It introduces two variants of K-FAC tailored for weight-sharing layers, providing exact solutions for deep linear networks and practical speedups for training modern architectures.

Findings

01. K-FAC-reduce is generally faster than K-FAC-expand.

02. Both variants reach target validation metrics in fewer steps.

03. K-FAC variants achieve comparable performance with reduced training time.

Abstract

The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with weight-sharing. Kronecker-Factored Approximate Curvature (K-FAC), a second-order optimisation method, has shown promise to speed up neural network training and thereby reduce computational costs. However, there is currently no framework to apply it to generic architectures, specifically ones with linear weight-sharing layers. In this work, we identify two different settings of linear weight-sharing layers which motivate two flavours of K-FAC: expand and reduce. We show that they are exact for deep linear networks with weight-sharing in their respective setting. Notably, K-FAC-reduce is generally faster than K-FAC-expand, which we leverage to speed up automatic hyperparameter selection via…
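To make the difference between the two flavours concrete, here is a minimal pure-Python sketch of how one Kronecker factor might be estimated for a single linear weight-sharing layer. It assumes that "expand" treats every (example, shared position) pair as its own data point, while "reduce" first averages over the shared dimension and then forms one outer product per example. The function names, the nested-list layout `vectors[n][r]`, and the normalisation constants are all illustrative assumptions and are not taken from the paper.

```python
# Hypothetical sketch: estimating one Kronecker factor (e.g. the input
# covariance) for a linear layer whose weights are shared across R positions.
# vectors[n][r] is a length-D list: example n, shared position r (e.g. a token).
# The 1/(N*R) and 1/N scalings below are assumptions made for illustration.

def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def accumulate(acc, mat, scale):
    """In-place acc += scale * mat."""
    for i, row in enumerate(mat):
        for j, value in enumerate(row):
            acc[i][j] += scale * value


def factor_expand(vectors):
    """Expand: N*R outer products, one per (example, position) pair."""
    n, r, d = len(vectors), len(vectors[0]), len(vectors[0][0])
    acc = [[0.0] * d for _ in range(d)]
    for example in vectors:
        for v in example:
            accumulate(acc, outer(v, v), 1.0 / (n * r))
    return acc


def factor_reduce(vectors):
    """Reduce: average over positions first, then N outer products."""
    n, r, d = len(vectors), len(vectors[0]), len(vectors[0][0])
    acc = [[0.0] * d for _ in range(d)]
    for example in vectors:
        mean = [sum(v[k] for v in example) / r for k in range(d)]
        accumulate(acc, outer(mean, mean), 1.0 / n)
    return acc


# One example (N=1) with two shared positions (R=2) in two dimensions (D=2).
x = [[[1.0, 0.0], [0.0, 1.0]]]
print(factor_expand(x))  # [[0.5, 0.0], [0.0, 0.5]]
print(factor_reduce(x))  # [[0.25, 0.25], [0.25, 0.25]]
```

Under these assumptions the reduce variant forms N outer products instead of N·R, which gives an intuition for why it can be cheaper. When R = 1 (no weight-sharing) the two estimates coincide.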

Videos

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification

Methods: Residual Connection · Convolution · 1x1 Convolution · Max Pooling · Kaiming Initialization · Bottleneck Residual Block · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Batch Normalization · Average Pooling