Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

James Martens; Andy Ballard; Guillaume Desjardins; Grzegorz Swirszcz; Valentin Dalibard; Jascha Sohl-Dickstein; Samuel S. Schoenholz

arXiv:2110.01765·cs.LG·October 6, 2021

TL;DR

This paper introduces Deep Kernel Shaping (DKS), a method to train deep neural networks without skip connections or normalization layers by controlling the kernel shape at initialization, enabling fast training and good generalization.

Contribution

The paper develops DKS, an approach that combines precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class, to improve training speed and generalization without skip connections or normalization layers.

Findings

01. Enables SGD training of residual networks without normalization layers on ImageNet and CIFAR-10 classification tasks.

02. Achieves training speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization.

03. Works effectively with various activation functions, including sigmoid.

Abstract

Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class. In our experiments we show that DKS enables SGD training of residual networks without normalization layers on ImageNet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization…
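
To make the Q/C map idea in the abstract concrete, here is a minimal sketch of how such maps can be estimated numerically for a single fully connected layer. It is an illustration in the spirit of the Poole et al. (2016) analysis, not the paper's implementation. Several things are assumptions of this sketch: unit weight variance, no biases, the input variance held fixed at q = 1 at every layer, a simple grid approximation of the Gaussian expectations, and the helper names `gauss_grid`, `q_map`, `c_map`, and `iterate_c`, which are hypothetical.

```python
import math


def gauss_grid(n=101, limit=6.0):
    """Evenly spaced nodes on [-limit, limit] with weights proportional to the
    standard normal density, normalized to sum to 1 (a crude quadrature rule)."""
    step = 2.0 * limit / (n - 1)
    zs = [-limit + i * step for i in range(n)]
    ws = [math.exp(-0.5 * z * z) for z in zs]
    total = sum(ws)
    return zs, [w / total for w in ws]


ZS, WS = gauss_grid()


def q_map(phi, q=1.0):
    """Approximate Q(q) = E[phi(sqrt(q) * z)^2] for z ~ N(0, 1).
    Assumes unit weight variance and no bias term."""
    s = math.sqrt(q)
    return sum(w * phi(s * z) ** 2 for z, w in zip(ZS, WS))


def c_map(phi, c, q=1.0):
    """Approximate C(c) = E[phi(u1) * phi(u2)] / Q(q), where u1 and u2 are
    jointly Gaussian with variance q and correlation c. Dividing by Q(q)
    normalizes the map so that C(1) = 1."""
    s = math.sqrt(q)
    r = math.sqrt(max(0.0, 1.0 - c * c))
    total = 0.0
    for z1, w1 in zip(ZS, WS):
        # u1 = s * z1 and u2 = s * (c * z1 + r * z2) have correlation c.
        inner = sum(w2 * phi(s * (c * z1 + r * z2)) for z2, w2 in zip(ZS, WS))
        total += w1 * phi(s * z1) * inner
    return total / q_map(phi, q)


def iterate_c(phi, c0, depth, q=1.0):
    """Compose the single-layer C map `depth` times, starting from c0.
    Holding q fixed across layers is a simplification of this sketch."""
    c = c0
    for _ in range(depth):
        # Clamp to [-1, 1] to guard against rounding error.
        c = min(1.0, max(-1.0, c_map(phi, c, q)))
    return c


def relu(x):
    return max(0.0, x)


if __name__ == "__main__":
    for name, phi in [("relu", relu), ("tanh", math.tanh)]:
        values = [round(iterate_c(phi, 0.5, d), 3) for d in (1, 5, 20)]
        print(name, values)
```

Running the script prints how an input correlation of 0.5 evolves after 1, 5, and 20 layers. Under these simplifying assumptions the ReLU correlations drift toward 1 while the tanh correlations shrink toward 0, which gives a rough picture of the kind of depth-induced degeneracy in the initialization-time kernel that the abstract says should be avoided.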

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning

Methods: Stochastic Gradient Descent