Gradient Descent Quantizes ReLU Network Features

Hartmut Maennel; Olivier Bousquet; Sylvain Gelly

arXiv:1803.08367·stat.ML·March 23, 2018·27 cites

Open Access

TL;DR

This paper analyzes why over-parametrized ReLU networks trained with gradient descent tend to concentrate their weight vectors in a small number of directions. As a result, only finitely many simple functions can be realized for given input data, which may help explain why such networks generalize.

Contribution

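It uncovers a quantization effect in ReLU networks trained with small initialization and a small learning rate: the weight vectors concentrate at a small number of directions determined by the input data, so that only finitely many simple functions can be realized for a given data set.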

Findings

01. Weights tend to concentrate at a small number of directions.

02. Finitely many simple functions can be realized for given data.

03. Potential explanation for generalization in over-parametrized networks.

Abstract

Deep neural networks are often trained in the over-parametrized regime (i.e. with far more parameters than training examples), and understanding why the training converges to solutions that generalize remains an open problem. Several studies have highlighted the fact that the training procedure, i.e. mini-batch Stochastic Gradient Descent (SGD), leads to solutions that have specific properties in the loss landscape. However, even with plain Gradient Descent (GD) the solutions found in the over-parametrized regime are pretty good, and this phenomenon is poorly understood. We propose an analysis of this behavior for feedforward networks with a ReLU activation function under the assumption of small initialization and learning rate and uncover a quantization effect: the weight vectors tend to concentrate at a small number of directions determined by the input data. As a consequence, we show…
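The setting described in the abstract can be explored with a small numerical sketch. The code below is not taken from the paper: it is a hypothetical toy setup that trains a two-layer ReLU network without biases on four hand-picked 2D points using plain full-batch gradient descent, starting from a small random initialization. The function names, the data, the width of 50, and all hyperparameters are assumptions chosen only for illustration. After training, it bins the directions of the first-layer weight vectors, weighting each neuron by |a_j| * ||w_j||, so one can inspect whether the mass sits in a few angular bins.

```python
import numpy as np


def train_two_layer_relu(X, y, hidden=50, init_scale=1e-2, lr=0.05,
                         steps=3000, seed=0):
    """Full-batch GD on f(x) = sum_j a_j * relu(w_j . x) with squared loss.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = init_scale * rng.standard_normal((hidden, d))  # first-layer weights
    a = init_scale * rng.standard_normal(hidden)       # output weights
    losses = []
    for _ in range(steps):
        pre = X @ W.T                    # pre-activations, shape (n, hidden)
        act = np.maximum(pre, 0.0)       # ReLU
        resid = act @ a - y              # residuals, shape (n,)
        losses.append(0.5 * np.mean(resid ** 2))
        # Gradients of L = (1 / 2n) * sum_i resid_i^2
        grad_a = act.T @ resid / n
        mask = (pre > 0).astype(float)   # ReLU derivative
        grad_W = ((mask * resid[:, None]) * a[None, :]).T @ X / n
        a -= lr * grad_a
        W -= lr * grad_W
    return W, a, np.array(losses)


def direction_mass(W, a, bins=12):
    """Share of total |a_j| * ||w_j|| falling into each angular bin."""
    angles = np.arctan2(W[:, 1], W[:, 0])
    mass = np.abs(a) * np.linalg.norm(W, axis=1)
    hist, edges = np.histogram(angles, bins=bins, range=(-np.pi, np.pi),
                               weights=mass)
    return hist / hist.sum(), edges


# Hypothetical toy data: four 2D inputs with scalar targets.
X = np.array([[1.0, 0.2], [0.3, 1.0], [-1.0, 0.5], [-0.4, -1.0]])
y = np.array([1.0, -1.0, 0.5, -0.5])

W, a, losses = train_two_layer_relu(X, y)
hist, edges = direction_mass(W, a)
print("initial loss:", losses[0], "final loss:", losses[-1])
print("mass per angular bin:", np.round(hist, 3))
```

If the concentration effect shows up in this toy problem, most of the printed mass should fall into a small number of bins. Varying `init_scale` and `lr` is one way to probe how much the outcome depends on the small-initialization and small-learning-rate assumptions; the sketch makes no claim to reproduce the paper's experiments.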

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Advanced Neural Network Applications