Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz

TL;DR
This paper introduces Deep Kernel Shaping (DKS), a method to train deep neural networks without skip connections or normalization layers by controlling the kernel shape at initialization, enabling fast training and good generalization.
Contribution
The paper develops DKS, an approach that combines precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class, so that networks can train quickly and generalize well without skip connections or normalization layers.
Findings
Enables SGD training of residual networks without normalization on ImageNet and CIFAR-10.
Achieves training speeds comparable to standard ResNetV2 and Wide-ResNet.
Works effectively with various activation functions, including sigmoid.
Abstract
Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class. In our experiments we show that DKS enables SGD training of residual networks without normalization layers on ImageNet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization…
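To make the Q/C map idea from the abstract more concrete, here is a minimal numerical sketch. It is not the authors' implementation. It assumes a simplified mean-field setting in which the Q map is q ↦ E[φ(√q z)²] for z ~ N(0, 1) and the C map sends the cosine similarity of two inputs to the cosine similarity of their post-activation outputs, with weight and bias variances ignored and the squared length q held at 1 from layer to layer. The names `q_map`, `c_map`, and `iterate_c_map`, and the quadrature order of 80, are hypothetical choices made for this example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes and weights for the weight function exp(-x^2 / 2).
# Dividing by sqrt(2*pi) turns the rule into an expectation over N(0, 1).
_NODES, _WEIGHTS = hermegauss(80)  # quadrature order is an arbitrary choice
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)


def q_map(phi, q):
    """Estimate E[phi(sqrt(q) * z)^2] for z ~ N(0, 1)."""
    return float(np.sum(_WEIGHTS * phi(np.sqrt(q) * _NODES) ** 2))


def c_map(phi, c, q=1.0):
    """Estimate the output cosine similarity for input cosine similarity c.

    Assumes both inputs have squared length q (simplified setting, see lead-in).
    """
    z1 = _NODES[:, None]
    z2 = _NODES[None, :]
    # Build two Gaussians with variance q and correlation c.
    u1 = np.sqrt(q) * z1
    u2 = np.sqrt(q) * (c * z1 + np.sqrt(max(0.0, 1.0 - c * c)) * z2)
    w = _WEIGHTS[:, None] * _WEIGHTS[None, :]
    return float(np.sum(w * phi(u1) * phi(u2)) / q_map(phi, q))


def iterate_c_map(phi, c0, depth):
    """Compose the C map `depth` times, holding q = 1 (an assumption)."""
    c = c0
    for _ in range(depth):
        c = c_map(phi, c)
    return c


if __name__ == "__main__":
    for depth in (1, 10, 100):
        print(depth, iterate_c_map(np.tanh, 0.9, depth))
```

Under these assumptions, repeatedly applying the C map of an unmodified tanh moves the similarity of two distinct inputs away from its starting value as depth grows, which is one way to see the kind of degenerate kernel "shape" that the abstract says must be controlled at initialization.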
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Stochastic Gradient Descent
