On weight initialization in deep neural networks

Siddharth Krishna Kumar

arXiv:1704.08863·cs.LG·May 4, 2017·157 cites

TL;DR

This paper develops a theoretical framework for weight initialization in deep neural networks with non-linear activations, providing new strategies especially for RELU, and explaining limitations of existing methods like Xavier initialization.

Contribution

The paper introduces a general weight initialization strategy for activation functions differentiable at 0, and separately derives the strategy for RELU, explaining why Xavier initialization is a poor choice with RELU activations.

Findings

01 Derived a general initialization strategy for activation functions differentiable at 0

02 Provided theoretical reasons why Xavier initialization is a poor choice for RELU

03 Demonstrated the role of non-linearities in determining the proper weight initialization

Abstract

A proper initialization of the weights in a neural network is critical to its convergence. Current insights into weight initialization come primarily from linear activation functions. In this paper, I develop a theory for weight initializations with non-linear activations. First, I derive a general weight initialization strategy for any neural network using activation functions differentiable at 0. Next, I derive the weight initialization strategy for the Rectified Linear Unit (RELU), and provide theoretical insights into why the Xavier initialization is a poor choice with RELU activations. My analysis provides a clear demonstration of the role of non-linearities in determining the proper weight initializations.
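The abstract's claim about Xavier initialization and RELU can be checked numerically. The sketch below is an illustration only and is not taken from the paper: it pushes one random input through a stack of fully connected RELU layers and reports the mean squared activation at the last layer for two weight variances, 1/N (Xavier-style, with N the fan-in) and 2/N (the scaling commonly used for RELU, assumed here for comparison). The function name `final_mean_square` and the width, depth, and seed values are hypothetical choices.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def final_mean_square(weight_var_scale, width=200, depth=20, seed=0):
    """Forward one random input through `depth` RELU layers.

    Weights are drawn i.i.d. from N(0, weight_var_scale / width), biases
    are zero. Returns the mean squared activation of the last layer.
    """
    rng = random.Random(seed)
    std = math.sqrt(weight_var_scale / width)
    # Random input with roughly unit mean square.
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a fresh random weighted sum of the previous
        # layer's activations, followed by RELU.
        h = [relu(sum(rng.gauss(0.0, std) * x for x in h)) for _ in range(width)]
    return sum(x * x for x in h) / width


xavier_like = final_mean_square(1.0)  # Var(w) = 1 / N (assumed Xavier-style scaling)
he_like = final_mean_square(2.0)      # Var(w) = 2 / N (assumed RELU-oriented scaling)
print(f"Var(w) = 1/N: mean square after 20 layers = {xavier_like:.3e}")
print(f"Var(w) = 2/N: mean square after 20 layers = {he_like:.3e}")
```

For a zero-mean, symmetric pre-activation z, E[relu(z)^2] = Var(z)/2, so in expectation each layer multiplies the mean squared activation by (N * Var(w))/2. With Var(w) = 1/N the signal is roughly halved at every layer and all but vanishes after 20 layers, while Var(w) = 2/N keeps it on the order of its input scale.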

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Neural Networks and Applications · Stochastic Gradient Optimization Techniques