Tensor Normal Training for Deep Learning Models
Yi Ren, Donald Goldfarb

TL;DR
Tensor Normal Training (TNT) is a second-order optimization method for deep learning that uses a tensor normal distribution assumption to approximate the Fisher matrix efficiently, with the reported effect of faster training and better generalization.
Contribution
TNT is the first method to use tensor normal distribution assumptions for efficient natural gradient approximation in deep learning training.
Findings
TNT outperforms first-order methods in optimization speed.
TNT matches the performance of state-of-the-art second-order methods.
TNT requires only slightly more memory and computation than first-order methods.
Abstract
Despite the predominant use of first-order methods for training deep learning models, second-order methods, and in particular natural gradient methods, remain of interest because of their potential for accelerating training through the use of curvature information. Several methods with non-diagonal preconditioning matrices, including KFAC, Shampoo, and K-BFGS, have been proposed and shown to be effective. Based on the so-called tensor normal (TN) distribution, we propose and analyze a new approximate natural gradient method, Tensor Normal Training (TNT), which, like Shampoo, only requires knowledge of the shape of the training parameters. By approximating the probabilistically based Fisher matrix, as opposed to the empirical Fisher matrix, our method uses the block-wise covariance of the sampling-based gradient as the preconditioning matrix. Moreover, the assumption that the…
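The abstract is truncated here, so the exact TNT update is not given on this page. As a rough illustration of the general idea of a Kronecker-structured (matrix normal) covariance used as a preconditioner, the following is a minimal sketch for a single weight matrix. The function names `estimate_factors` and `precondition`, the damping value, and the normalization of the factors are assumptions of this sketch, not the authors' algorithm.

```python
import numpy as np

def estimate_factors(grads):
    """Estimate row and column second-moment factors from sampled gradients.

    grads: array of shape (n, m, k), n sampled gradients of one m x k weight
    matrix. Hypothetical helper; the scaling is a convenience choice, since
    Kronecker factors are only defined up to a scalar.
    """
    n, m, k = grads.shape
    # Row factor: average of G @ G.T over samples, divided by the column count.
    U = np.einsum('nij,nkj->ik', grads, grads) / (n * k)
    # Column factor: average of G.T @ G over samples, divided by the row count.
    V = np.einsum('nji,njk->ik', grads, grads) / (n * m)
    return U, V

def precondition(G, U, V, damping=1e-3):
    """Return (U + damping*I)^{-1} @ G @ (V + damping*I)^{-1}.

    This applies the inverse of the Kronecker product of the damped factors
    to the flattened gradient without ever forming the (m*k) x (m*k) matrix.
    """
    m, k = G.shape
    U_d = U + damping * np.eye(m)
    V_d = V + damping * np.eye(k)
    X = np.linalg.solve(U_d, G)              # U_d^{-1} @ G
    return np.linalg.solve(V_d.T, X.T).T     # X @ V_d^{-1}
```

Only the two small factors (m x m and k x k) are stored and inverted, which is why a preconditioner of this shape needs nothing beyond the shape of the parameter, consistent with the abstract's comparison to Shampoo.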
Taxonomy
Topics: Tensor decomposition and applications · Advanced Neural Network Applications · Model Reduction and Neural Networks
Methods: Distributed Shampoo · Transformer in Transformer
