Generalization in Deep Networks: The Role of Distance from Initialization

Vaishnavh Nagarajan; J. Zico Kolter

arXiv:1901.01672 · cs.LG · January 15, 2019 · 58 cites

TL;DR

This paper asks why deep neural networks generalize well despite their size. It argues that their effective capacity is constrained by the $\ell_2$ distance of the learned parameters from their random initialization, a distance that SGD training appears to regularize implicitly.

Contribution

It introduces an initialization-dependent notion of model capacity and provides empirical and theoretical evidence linking it to generalization in deep networks.

Findings

01

Model capacity is restricted by the distance from initialization.

02

Empirical evidence shows implicit regularization of the $\ell_2$ distance from initialization.

03

Theoretical arguments support initialization-dependent capacity notions.

Abstract

Why does training deep neural networks using stochastic gradient descent (SGD) result in a generalization error that does not worsen with the number of parameters in the network? To answer this question, we advocate a notion of effective model capacity that is dependent on a given random initialization of the network and not just the training algorithm and the data distribution. We provide empirical evidence demonstrating that the model capacity of SGD-trained deep networks is in fact restricted through implicit regularization of the $\ell_2$ distance from the initialization. We also provide theoretical arguments that further highlight the need for initialization-dependent notions of model capacity. We leave as open questions how and why distance from initialization is regularized, and whether it is sufficient to explain generalization.
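
The central quantity in the abstract, the $\ell_2$ distance from initialization, is the norm of the difference between the trained parameters and their initial values, taken over all weights together. The sketch below shows one way to measure it. It is not the authors' experimental setup: the one-hidden-layer ReLU network, the synthetic data, the hyperparameters, and the function names `l2_distance_from_init` and `train_two_layer_net` are all illustrative assumptions.

```python
import numpy as np


def l2_distance_from_init(params, init_params):
    """l2 norm of (params - init_params), taken over all parameter arrays jointly."""
    squared = sum(np.sum((p - p0) ** 2) for p, p0 in zip(params, init_params))
    return float(np.sqrt(squared))


def train_two_layer_net(width, steps=200, lr=0.01, seed=0):
    """Train a toy one-hidden-layer ReLU net with mini-batch SGD.

    Everything here (data, target, sizes, learning rate) is a hypothetical
    choice for illustration. Returns (trained_params, initial_params).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(64, 5))          # synthetic inputs
    y = np.sin(X[:, :1])                  # synthetic regression target, shape (64, 1)

    W1 = rng.normal(size=(5, width)) / np.sqrt(5)
    W2 = rng.normal(size=(width, 1)) / np.sqrt(width)
    init = [W1.copy(), W2.copy()]         # snapshot of the random initialization

    for _ in range(steps):
        idx = rng.choice(64, size=16, replace=False)
        xb, yb = X[idx], y[idx]

        # Forward pass.
        h_pre = xb @ W1
        h = np.maximum(h_pre, 0.0)
        err = h @ W2 - yb                 # residual of the prediction

        # Gradients of 0.5 * mean squared error.
        grad_W2 = h.T @ err / len(idx)
        grad_h = (err @ W2.T) * (h_pre > 0)
        grad_W1 = xb.T @ grad_h / len(idx)

        # Plain SGD update.
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    return [W1, W2], init


# Report the distance travelled from initialization at a few widths.
for width in (16, 64, 256):
    params, init = train_two_layer_net(width)
    print(width, l2_distance_from_init(params, init))
```

Tracking this number while varying the width is the kind of measurement the abstract's empirical claim refers to. This toy run only demonstrates how to compute the distance and should not be read as reproducing the paper's results.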

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Domain Adaptation and Few-Shot Learning