TL;DR
This paper explains why over-parameterized neural networks continue to improve in generalization as their size increases, using a new framework based on the Neural Tangent Kernel and finite-size fluctuations.
Contribution
It introduces a novel theoretical framework connecting neural network size, fluctuations, and generalization error, resolving the paradox of improving generalization in over-parameterized models.
Findings
Generalization error decays as a power law with network size.
A jamming transition at a critical network size is accompanied by a divergence of the norm of the network output function.
Empirical validation on MNIST and CIFAR datasets supports the theory.
Scaling description of generalization with number of parameters in deep learning
Mario Geiger
Institute of Physics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Arthur Jacot
Institute of Mathematics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Stefano Spigler
Institute of Physics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Franck Gabriel
Institute of Mathematics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Levent Sagun
Institute of Physics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Stéphane d’Ascoli
Laboratoire de Physique Statistique, École Normale Supérieure, PSL Research University, 75005 Paris, France
Giulio Biroli
Laboratoire de Physique Statistique, École Normale Supérieure, PSL Research University, 75005 Paris, France
Clément Hongler
Institute of Mathematics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
Matthieu Wyart
Institute of Physics, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
M.G. and A.J. contributed equally to this work.
E-mail: [email protected], [email protected]
Abstract
Supervised deep learning involves the training of neural networks with a large number of parameters. For large enough , in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as grows past a certain threshold . Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with . We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations of the neural net output function around its expectation . These affect the generalization error for classification: under natural assumptions, it decays to a plateau value in a power-law fashion . This description breaks down at a so-called jamming transition . At this threshold, we argue that diverges. This result leads to a plausible explanation for the cusp in test error known to occur at . Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond , and averaging their outputs.
Introduction
Deep neural networks (DNNs) have proven very successful at a wide range of tasks. For supervised learning in particular, they have yielded breakthroughs in image classification [1, 2], speech recognition [3], and automatic translation [4]. Yet a theoretical framework to understand the remarkable successes of DNNs remains to be constructed, and central questions need to be clarified.
First, supervised learning for a DNN corresponds to adjusting parameters which describe an output function to fit training data points with . In practice, it is done by initializing the parameters randomly and minimizing a (non-convex) loss function using a first-order method (e.g. gradient descent). The dynamics of the training of DNNs, and the question of whether a global minimum is attained are thus a priori delicate, involving the understanding of a complex loss landscape.
Second, DNNs are in practice trained in the so-called over-parametrized regime, where the number of parameters is much larger than the number of data points. DNNs are thus used in a regime where their capacity is very large (they can still classify the data even if all the labels are randomized). Surprisingly, from the point of view of traditional statistical learning theory [5], DNNs generalize very well in practice, even without explicit regularization. This raises the question of an appropriate framework to understand the generalization of DNNs.
Recent works suggest that the two questions above are closely connected. Numerical and theoretical studies [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] show that in the over-parametrized regime, the loss landscape of DNNs is not rough with isolated minima as initially thought [18, 19], but instead has connected level sets and presents many flat directions, even near its global minimum. In particular, recent works on the over-parametrized regime of DNNs [20, 21, 22, 23] have shown that the landscape around a typical initialization point becomes essentially convex, allowing for convergence to a global minimum during training.
In [16, 17], it has been observed that when optimizing DNNs (using the so-called hinge loss), there is a sharp phase transition — whose location can depend on the chosen dynamics — at some such that for the dynamic process reaches a global minimum of the loss. In particular whenever , the training error (i.e. the total of the loss on the training set) reaches its global minimum. A counter-intuitive aspect of deep learning is that increasing above does not destroy the predictive power by over-fitting the data, but instead appears to improve the generalization performance (i.e. the probability that a data point outside of the training set is correctly classified) [24, 25, 26, 27]. Indeed the test error (the probability of an incorrect classification for an unseen data point) has been observed to decrease as in a slow power-law fashion [17]. In contrast, as , the test error blows up [27, 28, 17] (a phenomenon shown by the blue curve in Fig. 2). In the context of least-squares regression, the improvement of performance with has been linked to the observed diminishing fluctuations of the DNN function after training [29], a result consistent with the notion of stronger implicit regularization with increasing [30, 31]. This raises the question of understanding what controls these fluctuations and how they affect the test error in a classification task.
In this work, we address these questions in the context of classification tasks for fully-connected DNNs with a fixed number of layers , with wide hidden layers. We develop a framework based on a new connection between the limit of DNNs and kernel methods [20]. More precisely, the training of DNNs can be recast as a kernel gradient descent associated with the so-called Neural Tangent Kernel (NTK). In the limit, the NTK becomes deterministic and constant in time. This result explains why the generalization performance converges as , a result previously obtained for single hidden layer neural networks using a different approach [32, 33, 34, 35].
We consider a binary classification task; the DNN output function is used to predict whether a data point belongs to the class depending on the sign of .
First, we introduce an NTK-based framework to study the random fluctuations of the output function at the end of training due to the random initialization of the parameters. We find that (in the over-parametrized regime) the key finite- effect is that the NTK at initialization has random fluctuations around its mean of order , leading to similar fluctuations for .
Second, we consider the fluctuations of the decision boundary (the level set ): we argue that a variation of yields an increase to the test error. We use this asymptotic result to predict the increase in generalization performance yielded by an ensemble averaging on samples of the function (each trained on the data separately) as becomes large, as well as the increase in generalization performance as grows.
Finally, this description breaks down at the transition point , where the random fluctuations of appear to diverge as a power law. We study this divergence through a simple argument on non-linear networks, suggesting that .
Overall, our work introduces a conceptual framework to describe how generalization error in deep learning evolves with the number of parameters. A practical consequence of our analysis is that performing an ensemble average of (both fully-connected and convolutional) DNNs with independent initializations can improve performance significantly: for a given computational envelope, it appears to be best to use several nets of intermediate sizes and to average their outputs.
Related works
After the electronic submission of the present work, and following on [17, 36], other articles have been written on the nature of the “double descent” curve in the generalization error (Fig. 2) [37, 38, 39] and on the asymptotic behavior of wide networks [40, 41, 42, 43]. Very recently in [38], a rigorous derivation of the double descent curve was obtained for the mean square regression of simple functions using random features models. Although the scaling arguments proposed here are not mathematical proofs, they provide a quantitative explanation of the double descent curve in a more general setting, including the regression and classification of empirical data by fully connected deep networks. Our predictions are tested empirically in that setting. Finally, our analysis is based on a scaling estimate of the fluctuations of the NTK at initialization, recently supported by more detailed analysis based on Feynman diagrams and path numbering [40, 41].
1 Setting
1.1 DNN Model and Training
We consider DNNs defining a real-valued output function for , where we aggregate the parameters into . We first consider fully-connected DNNs of layers, where each layer is made of neurons, as in Fig. 1. The output function is constructed recursively as
[TABLE]
is the weight of the synapse from neuron in layer to neuron in layer , and is the bias of neuron in layer , as depicted in Fig. 1. The vector contains all weights and biases. is a non-linear activation function. Empirically we will use the standard ReLU , but any other common nonlinear functions can be used (e.g. the softplus function). Polynomial functions must be avoided, as they do not lead to positive definite kernels, see discussion in [20].
The DNN function is used for binary classification: we aim to find such that for a data point , correctly predicts the label . To do so, we minimize on a dataset the square-hinge cost function
[TABLE]
where and is the so-called margin, fixed to in our numerical tests.
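As a concrete illustration of the cost described above, here is a minimal Python sketch. The displayed equation is not reproduced in this text, so the exact form is an assumption: we take a margin violation equal to the margin minus the product of label and output, and average half its squared positive part over the data set. The names `square_hinge_loss`, `outputs`, `labels` and `margin`, as well as the default margin value, are hypothetical.

```python
def square_hinge_loss(outputs, labels, margin=1.0):
    """Mean square-hinge cost over a data set.

    Assumed form (not taken verbatim from the paper):
        delta_i = margin - y_i * f(x_i)
        C = (1/P) * sum_i 0.5 * max(0, delta_i)**2
    """
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    total = 0.0
    for f_x, y in zip(outputs, labels):
        delta = margin - y * f_x          # positive when the margin is violated
        total += 0.5 * max(0.0, delta) ** 2
    return total / len(outputs)


# Labels are +1 / -1; outputs are real-valued network predictions.
fitted = square_hinge_loss([2.0, -1.5, 1.0], [1, -1, 1], margin=1.0)
violated = square_hinge_loss([0.5, -1.5, 1.0], [1, -1, 1], margin=1.0)
```

In this sketch the cost vanishes as soon as every pattern is classified with at least the required margin, which is the property that lets training stop in finite time.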
The network is then trained using a first-order method, such as gradient descent, up to a maximum running time, and training is stopped as soon as the training loss hits its lowest possible value (typically zero, unless two identical data points have different labels). The jamming transition point is defined as the smallest number of parameters for which we reach the lowest possible loss at the end of training.
Note that the hinge loss leads to results that are very similar to the ones relying on the more commonly used cross-entropy loss [17]. It has the advantage, however, that training stops in finite time in the over-parametrized regime.
1.2 Numerical Setting
We first consider the task of classifying the parity of digits on the MNIST database [44]. For this architecture we consider only the first ten PCA components of the images. We then test our findings with a CNN architecture on the full images in the CIFAR10 dataset.
The DNNs are trained using a full-batch procedure (as opposed to stochastic gradient descent) described in Appendix A, up to a fixed maximum number of steps.
2 Numerical Results on MNIST
Fig. 2 demonstrates the performance of the above setup for the MNIST dataset: we find that at the end of training, the test error (i.e. the empirical generalization error) reaches a local maximum in a cusp-like fashion near the jamming transition and then slowly decreases as becomes larger. We denote by the average of samples of the function taken with independent initial conditions. Remarkably, in our experiments, ensemble-averaging with leads to a nearly flat test error for ; this supports the hypothesis that the improvement of generalization performance with originates from reduced variance of when gets large, as recently observed for mean-square regression [29]. In addition to this leading finite-size effect, an interesting sub-leading finite-size effect can be observed, as discussed in Section 7.
3 Relationship Between Variance and Generalization in Classification Tasks
(Footnote: In spirit, this section shares some similarity with the bias-variance decomposition developed in [45], except that we consider averaging over initial conditions instead of training sets, and that we use the average output function as predictor, rather than applying the majority rule to a set of predictions.)
3.1 Regression task
For mean square regression of some target function , the increase of the mean square test error implied by the fluctuations of the output function is readily computed. Let us write , where is the output of the learnt function, averaged over runs with different initial condition. is the relative distance between a single output and this average. Then
[TABLE]
is the contribution to the generalization error due to the fluctuations of the output function. The bar represents averages over different runs or initial conditions. For a measure on , we set . The measure could be for instance the empirical measure on the training set or on the test set.
Our results below apply directly to mean square regression. In the next paragraphs we will argue that a similar quadratic relationship between test error and fluctuations also holds for classification under mild assumptions on the data; so that our results extend to that case as well.
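The decomposition used for regression can be checked numerically. The sketch below assumes the standard identity that the run-averaged squared error equals the squared error of the ensemble-average predictor plus the mean squared fluctuation around that average. The "networks" are synthetic stand-ins (target plus a bias plus noise), and all names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def decompose_mse(runs, target):
    """Split the run-averaged squared error into ensemble error plus fluctuations.

    runs   : list of runs, each a list of predictions on the same points
    target : list of target values on those points
    """
    n_points = len(target)
    # Ensemble-average predictor: mean over runs at each point.
    f_bar = [mean([run[j] for run in runs]) for j in range(n_points)]
    avg_single_error = mean(
        [mean([(run[j] - target[j]) ** 2 for j in range(n_points)]) for run in runs]
    )
    ensemble_error = mean([(f_bar[j] - target[j]) ** 2 for j in range(n_points)])
    fluctuation = mean(
        [mean([(run[j] - f_bar[j]) ** 2 for j in range(n_points)]) for run in runs]
    )
    return avg_single_error, ensemble_error, fluctuation


rng = random.Random(0)
target = [rng.uniform(-1, 1) for _ in range(50)]
# Toy "trained networks": target plus a fixed bias plus run-dependent noise.
runs = [[t + 0.1 + rng.gauss(0, 0.3) for t in target] for _ in range(20)]
avg_single_error, ensemble_error, fluctuation = decompose_mse(runs, target)
```

The cross term vanishes because the fluctuations average to zero over runs, so the first returned value equals the sum of the other two up to rounding.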
3.2 Classification task
We now provide a heuristic argument relating fluctuations of the output function to generalization performance. For a random function (e.g. a DNN function with random initialization), we denote by the expectation with respect to .
Consider a random smooth function with expectation , and set . Let denote the decision boundaries , and consider a point that is being classified differently by and , i.e. , as illustrated in Figure 3. Imagine drawing the shortest segment passing through that starts from a point in and ends in . If its length is small, then the signed distance between and is . Note that for smooth activation functions, the smoothness of DNN output function is guaranteed and for ReLU-based DNNs, the output function is smooth outside of the training points (see S.I.). We show direct measurements of in Section A of S.I., supporting that this estimate still holds and becomes more and more accurate as .
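The linear estimate of the boundary displacement can be illustrated in one dimension. The sketch below assumes the usual first-order linearization: a small change of the function at the boundary, divided by the gradient magnitude there, gives the shift of the zero crossing. The functions `f_bar` and `delta_f` are arbitrary smooth stand-ins, not network outputs.

```python
import math


def bisect_root(func, lo, hi, tol=1e-12):
    """Find a root of func in [lo, hi], assuming func(lo) and func(hi) differ in sign."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def f_bar(x):            # stand-in for the mean output function
    return math.tanh(x) - 0.3


def delta_f(x):          # stand-in for a small fluctuation around the mean
    return 0.01 * math.cos(3.0 * x)


def f_bar_prime(x):      # derivative of f_bar
    return 1.0 - math.tanh(x) ** 2


x_mean = bisect_root(f_bar, 0.0, 1.0)
x_pert = bisect_root(lambda x: f_bar(x) + delta_f(x), 0.0, 1.0)
actual_shift = x_pert - x_mean
# First-order estimate of the signed displacement of the zero crossing.
linear_estimate = -delta_f(x_mean) / f_bar_prime(x_mean)
```

The two quantities agree up to corrections that are second order in the fluctuation, which is why the estimate is expected to improve as the fluctuations shrink.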
Next, we introduce the typical distance along the boundary:
[TABLE]
where the average is taken over all the test data classified differently by and . As numerically shown in S.I., is very well estimated by where is the uniform measure on all the test set.
We then denote by the difference between the true test error of and that of . Under reasonable assumptions it can be expanded by considering a small perturbation of the decision boundary of (that can consist of unconnected parts). (Footnote: We assume that the true test error is a smooth function of the decision boundary. This holds true if the probability distributions to find data of different labels are themselves smooth functions of the input; this is the case, for instance, if the input data have Gaussian noise.) The expansion reads:
[TABLE]
The fact that , suggests that . This suggests in turn that on average the true test error increases quadratically with the norm of fluctuations :
[TABLE]
Note that if displays a minimal true test error, the decision boundary is optimal: and for all , implying that the prefactor in Eq. (5) must be positive. (Footnote: The prefactor could be zero if the optimal boundary is degenerate, a situation that will not occur generically if the data have, e.g., Gaussian noise.) If the true test error is small, the decision boundary will tend to be close to the ideal one, so that the prefactor in Eq. (5) will still be positive. (Footnote: We expect this to be the case for the MNIST model we consider, for which the test error is a few percent.)
Eq. (5) is a result on the ensemble average of the true test error. Yet, our data in Fig. 2 supports that the test error is a self-averaging quantity: the test error of a given output function (blue points) lies close to its average (blue line).
4 Asymptotic generalization as
Using the tools of the previous section, we can now study how an ensemble average of networks behaves in the limit. The central limit theorem and the law of large numbers imply that while converges to a constant. Thus and for the true test errors and of and , we have . These predictions are confirmed in Fig. 4.
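The central-limit argument above can be checked on a toy model. The sketch below replaces the fluctuation of each network output by an independent Gaussian variable (an assumption made purely for illustration) and measures how the fluctuation of an average over n samples shrinks with n. All names and parameter values are hypothetical.

```python
import math
import random


def rms_fluctuation_of_average(n, trials=4000, sigma=1.0, seed=0):
    """RMS deviation from zero of the mean of n i.i.d. Gaussian fluctuations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        total += avg * avg
    return math.sqrt(total / trials)


rms_by_n = {n: rms_fluctuation_of_average(n) for n in (1, 4, 16)}
# If the excess test error grows quadratically with the fluctuations
# (the argument of Section 3), its proxy is the squared RMS fluctuation.
excess_error_proxy = {n: r * r for n, r in rms_by_n.items()}
```

In this toy setting the fluctuation of the average decays as the inverse square root of n, so the quadratic proxy for the excess test error decays as 1/n.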
5 Asymptotic Generalization as
We now study the fluctuations of throughout training for large networks using the NTK [20]. At initialization , is a random function whose limiting distribution as is an explicit Gaussian [46, 47, 48]. These types of fluctuations do not vanish as : the variance of at initialization is essentially constant in . (Footnote: In our setup, the output variance at initialization is smaller than one. It is possible to suppress the randomness of at initialization by training . We have observed that it does not qualitatively affect our results.)
However, during the DNN training, the fluctuations of will shrink around the training points [20]. At the end of training, outside of the training points, the fluctuations due to the random initialization of the parameters manifest themselves in two ways: from the randomness of the initialization point in function space and from the randomness of the learning dynamics. The first one is essentially independent of . Hence, to understand the way the fluctuations of the function at convergence decrease with , we must thus study the random fluctuations of the training process. The gradient descent dynamics of is described by the NTK :
[TABLE]
where is the derivative of the output of the network with respect to one parameter and the sum is over all the network’s parameters. For a general cost , the function follows the kernel gradient of the cost during training
[TABLE]
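To make the definition of the NTK concrete (a sum, over all parameters, of products of output derivatives evaluated at two inputs), here is a minimal sketch for a toy one-hidden-layer ReLU network. The architecture, the Gaussian initialization and the inverse-square-root-of-width output scaling are assumptions chosen for illustration; they are not the networks trained in this work.

```python
import math
import random


def init_params(width, dim, rng):
    """Gaussian initialization of a one-hidden-layer ReLU network (toy model)."""
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, a


def grad_output(params, x):
    """Gradient of f(x) = width**-0.5 * sum_k a_k * relu(w_k . x) w.r.t. all parameters."""
    w, a = params
    scale = 1.0 / math.sqrt(len(a))
    grads = []
    for w_k, a_k in zip(w, a):
        z = sum(w_kj * x_j for w_kj, x_j in zip(w_k, x))
        active = 1.0 if z > 0 else 0.0
        grads.append(scale * max(0.0, z))                        # d f / d a_k
        grads.extend(scale * a_k * active * x_j for x_j in x)    # d f / d w_kj
    return grads


def empirical_ntk(params, x1, x2):
    """Sum over parameters of the product of output derivatives at x1 and x2."""
    g1, g2 = grad_output(params, x1), grad_output(params, x2)
    return sum(u * v for u, v in zip(g1, g2))


rng = random.Random(0)
params = init_params(width=64, dim=3, rng=rng)
xa, xb = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
k_ab = empirical_ntk(params, xa, xb)
k_ba = empirical_ntk(params, xb, xa)
k_aa = empirical_ntk(params, xa, xa)
k_bb = empirical_ntk(params, xb, xb)
```

Because the kernel is an inner product of gradient vectors, it is symmetric and positive semi-definite by construction, whatever the initialization.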
The NTK is random at initialization and varies during training. However as the number of neurons in each hidden layer goes to infinity, the NTK converges to a deterministic limit which stays constant throughout training [20]. In this limit, the training corresponds to that of a kernel method (i.e. the output evolves along the vector space spanned by the functions ). The random fluctuations of the training process have now themselves two sources: the random fluctuations of the NTK at initialization, and the evolution of the NTK during training. On the one hand, we have that the variation of the NTK during training is of order , as is suggested by [49]:
[TABLE]
( is the Frobenius norm of the Gram matrix computed over the training set). On the other hand, the random fluctuations of the NTK at initialization are of order
[TABLE]
Eq. (8) can be readily obtained by re-writing Eq. (6) as a sum over neurons and using the central limit theorem, as sketched in the S.I. and tested empirically in [49]. From the above, we see that the dominant source of random fluctuations during training is the randomness of the NTK at initialization, which is of order .
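The central-limit scaling of the NTK fluctuations at initialization can be probed on a toy model. The sketch below uses a one-hidden-layer ReLU network with an inverse-square-root-of-width output scaling (an assumption for illustration, not the deep networks studied here) and measures the standard deviation of one kernel entry over random initializations at several widths. Function names and parameter values are hypothetical.

```python
import math
import random


def ntk_entry(width, x1, x2, rng):
    """Empirical NTK entry at a random initialization of the toy network
    f(x) = width**-0.5 * sum_k a_k * relu(w_k . x)."""
    dot_x = sum(u * v for u, v in zip(x1, x2))
    total = 0.0
    for _ in range(width):
        w_k = [rng.gauss(0.0, 1.0) for _ in x1]
        a_k = rng.gauss(0.0, 1.0)
        z1 = sum(w * u for w, u in zip(w_k, x1))
        z2 = sum(w * v for w, v in zip(w_k, x2))
        total += max(0.0, z1) * max(0.0, z2)      # output-weight gradients
        if z1 > 0 and z2 > 0:
            total += a_k * a_k * dot_x            # hidden-weight gradients
    return total / width


def std_over_inits(width, n_inits=400, seed=0):
    """Sample standard deviation of one NTK entry over random initializations."""
    rng = random.Random(seed)
    x1, x2 = [1.0, 0.0, 0.0], [0.6, 0.8, 0.0]
    samples = [ntk_entry(width, x1, x2, rng) for _ in range(n_inits)]
    mean = sum(samples) / n_inits
    var = sum((s - mean) ** 2 for s in samples) / (n_inits - 1)
    return math.sqrt(var)


std_by_width = {h: std_over_inits(h) for h in (16, 64, 256)}
# For a sum of independent per-neuron terms, a 16-fold increase in width
# should reduce the standard deviation by a factor of about 4.
ratio = std_by_width[16] / std_by_width[256]
```

In this toy model each kernel entry is an average of independent per-neuron contributions, so its fluctuations shrink as the inverse square root of the width while its mean stays of order one.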
Because the NTK describes the behaviour of the function during training, and because the time to converge to a minimum of the loss converges to a constant as , from Eq. (7) we expect the variance of the NTK to induce a variance of the same order in the function at the end of training: this is proven in the case of the mean square loss in the S.I. Hence, the random fluctuations of the kernel lead to fluctuations of of order , and we predict:
[TABLE]
where the residual variance $\big\langle \| f_{\infty} - \bar{f}_{\infty} \|_{\mu} \big\rangle$ is due to the fact that we consider a finite dataset. In our setting, since our dataset is large, this residual term is negligible, leading to:
[TABLE]
as checked in Fig.5.
We expect the fluctuations of to be of the same size as those of , leading to . This result is consistent with our observations, as shown in Fig. 6.A, in which we find empirically that is much larger than . For the true test errors of , from the decision boundary discussion, we get
[TABLE]
where indicates the typical distance between the decision boundaries and , as supported by Fig. 6.B. The fluctuations of the decision boundary can be approximated by , as supported by Fig. 6.C, leading to . We then obtain the key prediction
[TABLE]
Since we measure both and independently, we can test the prediction for the leading exponent without any fitting parameters, and indeed confirm that asymptotically is of order as shown in Fig. 6.D.
Finally we estimate the evolution of test error with . We have:
[TABLE]
where denotes the true test error of as (notice that is still random, due to the random initialization and the fact that we have a finite dataset). The first term was estimated above, and turns out to be the dominant one for large datasets. The last term is independent of , and cancels the first term for asymptotically large (inaccessible in our numerics).
We provide a scaling argument to estimate the size of the second term. For large , we expect the difference between and to stem from (i) the evolution of the kernel with time (which corresponds to learning features) and (ii) the fact that the relationship between the kernel and the function at infinite time is not linear, as described for the mean square loss in Eq. (17) of the S.I. Both effects are , i.e. much smaller than the fluctuations of around its mean. The typical distance between the interfaces and is thus small and . According to Eq. (4) we get:
[TABLE]
Thus cannot be neglected a priori. Overall, we get:
[TABLE]
a form indeed consistent with observation as shown in Fig. 2.
For MNIST, both for FC and CNN (below), we always find , consistent with the notion that the dominant effect of finite is the increase in fluctuations of the output.
Note that a direct fit of the test error vs gives an apparent exponent smaller than [17], reflecting that (i) power-law fits are less precise when the value for the asymptote (here the value of ) is a fitting parameter and (ii) that correction to scaling needs to be incorporated for a good comparison with the theory (a fact that ultimately stems from the large correction to scaling of shown in Fig. 6.A).
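Point (i) above can be illustrated with synthetic data. The sketch below generates test errors that decay to a plateau as a power law and compares the log-log slope obtained when the plateau is subtracted with the slope obtained when it is ignored. The plateau, amplitude and exponent values are hypothetical and chosen only for illustration.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Synthetic test errors: plateau plus a power-law decay (all values hypothetical).
plateau, amplitude, exponent = 0.02, 1.0, 0.5
sizes = [10 ** k for k in range(3, 8)]
errors = [plateau + amplitude * n ** (-exponent) for n in sizes]

slope_known_plateau = loglog_slope(sizes, [e - plateau for e in errors])
slope_ignoring_plateau = loglog_slope(sizes, errors)
```

With the plateau subtracted, the fit recovers the exponent used to generate the data; ignoring the plateau yields a markedly smaller apparent exponent, so any uncertainty on the asymptote feeds directly into the fitted exponent.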
6 Vicinity of the jamming transition
The asymptotic description for generalization in the large limit is not qualitatively useful for , where a cusp in test error is found. In the perceptron, the simplest network without hidden layers, the cusp in the test error at the jamming point is also observed and predicted analytically [50, 51, 52, 53, 54, 55]. Here instead, we argue that this cusp is induced by a divergence of at when no regularization is used, as apparent in Fig. 7.A (no such divergence happens in the perceptron where is generally imposed). Indeed following our argument of Section 3, this effect must lead to singular fluctuations of the decision boundary at , suggesting a singular behavior for the true test error. This phenomenon shares some similarity with the norm divergence that occurs in linear networks with mean square loss for which [27, 28]. Yet, for losses better suited for classification such as the hinge loss, we argue that this explosion occurs at a different location with a different exponent.
Consider the hinge loss defined in Eq. (1). For , the DNN is able to reach the global minimum of the loss, therefore all must be negative, i.e. all patterns must satisfy . The parameter plays the role of a margin above which we are confident about the network’s prediction. Because we do not use regularization on the norm , the precise choice of does not affect . Indeed the weights can always adjust during learning so as to multiply by any scalar , effectively reducing the margin by a factor , making the data easier to fit. By contrast, if a regularization is imposed to fix (which may be hard to implement in practice), then must be an increasing function of . We assume that this function is differentiable in its argument around zero, a fact known to be true for the perceptron [56, 57], thus . Now consider our learning scheme (no regularization) for a network with , with initial conditions such that before learning . Initially, the effective margin is large with . Yet, all data can be fitted and the loss brought to zero if the norm increases so that , corresponding to where . At later times, the loss is zero and the dynamics stops.
This predicted inverse relation is tested in Fig. 7.B. It is important to note that, as is the case for any critical point, working at finite times cuts off a true singularity: as illustrated in Fig. 7.B becomes more and more singular as grows. This effect also causes a shift of the transition where the loss vanishes, that converges asymptotically to a well-defined value in the limit as documented in [16]. is therefore defined when displays a power law as function of .
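The rescaling step of the argument can be made explicit with a small sketch: multiplying the output by a scalar is equivalent to dividing the margin by the same scalar, so a function that classifies every pattern correctly, but with a small margin, fits the data once its norm is large enough. The data, the margin value and the function names below are hypothetical.

```python
def max_violation(outputs, labels, margin):
    """Largest value of margin - y * f(x) over the data; <= 0 means every pattern is fitted."""
    return max(margin - y * f_x for f_x, y in zip(outputs, labels))


def scale_needed_to_fit(outputs, labels, margin):
    """Smallest factor lam such that lam * f satisfies y * lam * f(x) >= margin for all x.

    Requires every pattern to be on the correct side already (y * f(x) > 0).
    """
    smallest = min(y * f_x for f_x, y in zip(outputs, labels))
    if smallest <= 0:
        raise ValueError("some pattern is misclassified; rescaling cannot fit it")
    return margin / smallest


labels = [1, -1, 1, -1]
outputs = [0.2, -0.05, 0.4, -0.1]      # correctly signed but with small margins
margin = 1.0
lam = scale_needed_to_fit(outputs, labels, margin)
scaled = [lam * f_x for f_x in outputs]
```

The required factor is inversely proportional to the smallest margin achieved before rescaling, which is the mechanism behind the inverse relation between the norm and the distance to the transition discussed above.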
Note that for other losses like the cross-entropy, the dynamics never stops completely but becomes extremely slow [15]. In such cases, we expect that asymptotically as soon as , although this singularity should build up logarithmically slowly in time. For finite learning times we expect that a singularity will occur near , but will be blurred as for the hinge loss if .
7 Subleading Finite-Size Effect
For a given computational envelope, it appears to be more efficient to take a value of slightly bigger than , and to perform ensemble-averaging to reduce the variance. Quite remarkably, as shown in Figure 2, an additional effect appears to take place after ensemble-averaging: taking only slightly bigger than is not only more efficient from a computational point of view, but also yields a slightly better generalization performance than . This corresponds to the middle term in Equation 13.
This could be viewed as supporting the classical intuition that keeping the models sparse by controlling the number of parameters is useful, when one averages over differently initialized networks and once the network is large enough. This effect appears stronger for the CNN architecture, as confirmed in Section 8.
This effect could be explained by an evolution of the NTK during training. It suggests the possibility that (with ensembling) DNNs at finite perform better than their kernel method counterparts. It hence appears to be both a very promising direction for future theoretical research and to be of practical interest.
8 Extension to Convolutional Networks
In this section, we test the generality of our findings for Convolutional Networks (CNNs) used for classification. We train the CNN on the CIFAR10 dataset which consists of 50,000 training and 10,000 test images of 32 by 32 resolution. Each image is labeled by one of the ten possible classes. The architecture is a vanilla model with 3 convolutional and 1 fully-connected layers. Each convolutional layer has channels and the output of the CNN is a -dimensional vector (see S.I. for more details). The loss function is linear-hinge . We vary from to . For each value of , we train models with independent random initial conditions. For each , the learning rate throughout is fixed at . The jamming transition occurs just before . Soon after the transition, at , the mean performances are between . The performance of the ensemble averaging is , and the average accuracy of the widest models is a little bit less than . Peak performance is achieved by ensembling with , yielding a value of , while the average performance without ensembling is lowest at with a value of .
9 Conclusion
We have provided a description for the evolution of the generalization performance of fixed-depth fully-connected deep neural networks, as a function of their number of parameters . In the asymptotic regime of very large , we find empirically that the network output displays reduced fluctuations with . We have argued that this scaling behavior is expected from the finite fluctuations of the Neural Tangent Kernel known to control the dynamics at . Next we have provided a general argument relating fluctuations of the network output function to decreasing generalization performance, from which we predicted for the test error , consistent with our observation on MNIST. Overall this approach explains the surprising finding that generalization keeps improving with the number of parameters.
We have then argued that this description breaks down at below which the training set is not fitted. For the hinge loss where this jamming transition is akin to a critical point, and in the case where no regularization (such as early stopping) is used, we observe the apparent divergence . We have argued, based on reasonable assumptions, that , consistent with our observations. This predicted blow up of the norm of explains the spike in the error observed at .
Our analysis furthermore suggests that optimal generalization does not require taking much larger than : since improvement of generalization with stems from reduced variance in the output function, near-optimal generalization is readily obtained by performing an ensemble average of networks with fixed, e.g. taken to be a few times . The usefulness of averaging breaks down near , where the variance of is too large. This suggests that, given a computational envelope, it is best from a generalization performance point of view to ensemble slightly beyond the jamming transition point. This is a result of practical importance which needs to be tested in a wide range of architectures and datasets.
Acknowledgements
We thank Marco Baity-Jesi, Carolina Brito, Chiara Cammarota, Taco S. Cohen, Silvio Franz, Yann LeCun, Florent Krzakala, Riccardo Ravasio, Andrew Saxe, Pierfrancesco Urbani and Lenka Zdeborova for helpful discussions.
This work was partially supported by the grant from the Simons Foundation (#454935 Giulio Biroli, #454953 Matthieu Wyart). M.W. thanks the Swiss National Science Foundation for support under Grant No. 200021-165509. C.H. acknowledges support from the ERC SG Constamis, the NCCR SwissMAP, the Blavatnik Family Foundation and the Latsis Foundation. We thank the KITP and the National Science Foundation under Grant No. NSF PHY-1748958 for hosting us while this manuscript was written.
Appendix A Materials and methods
Here follow some details on the initialization and training dynamics used for the fully-connected networks. The weights of the network are initialized according to the random orthogonal scheme [58] and all biases are initialized to zero. The network is not optimized using vanilla gradient descent, as learning was then too slow to acquire appropriate statistics. Instead we used ADAM [59] with full batch and learning rate set to in order to have smooth dynamics for all values of . The exponent has been empirically chosen so that the number of steps to converge is independent of [20]. The excellent match between theory and predictions supports the conclusion that our results are robust for a range of choices of learning dynamics.
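As a minimal sketch of what a random orthogonal initialization can look like, the snippet below uses a common QR-based construction. It is not necessarily the implementation used by the authors or the exact scheme of [58]; the function name `orthogonal_init` and the `gain` parameter are hypothetical.

```python
import numpy as np

def orthogonal_init(n_out, n_in, gain=1.0, seed=None):
    """Sample an (n_out, n_in) weight matrix with orthonormal rows or columns."""
    rng = np.random.default_rng(seed)
    rows, cols = max(n_out, n_in), min(n_out, n_in)
    a = rng.normal(size=(rows, cols))
    q, r = np.linalg.qr(a)           # q has shape (rows, cols), orthonormal columns
    q = q * np.sign(np.diag(r))      # fix column signs so the result is not biased by QR conventions
    w = q if n_out >= n_in else q.T  # transpose when the layer is wider than it is tall
    return gain * w

if __name__ == "__main__":
    w = orthogonal_init(4, 6, seed=0)
    b = np.zeros(4)                  # biases initialized to zero, as stated in the text
    print(np.round(w @ w.T, 6))      # close to the 4x4 identity matrix
```

When `n_out < n_in` the rows are orthonormal, otherwise the columns are.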
For convolutional networks the parameters are initialized with the standard Xavier initialization and training minimizes a linear-hinge loss (as in Eq. (1) without the square) with stochastic gradient descent, with learning rate equal to ( being the number of channels) and batch size . Momentum, weight decay, and data augmentation were not used.
Appendix B Robustness of the boundaries distance estimate
Fig. 9 shows that the linear estimate for the distance between two decision boundaries, , holds for the ReLU nonlinearity and improves as .
Fig. 10 illustrates the validity of the estimate of the typical distance between two decision boundaries presented in the main text , where corresponds to the uniform measure on all the test points.
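The formula for the linear estimate is not legible in this text. The sketch below therefore assumes the standard first-order Taylor argument: if an output function $f$ vanishes at $x_0$ and is perturbed by a small $\delta f$, the zero moves by roughly $|\delta f(x_0)| / |f'(x_0)|$ (with the gradient norm replacing $|f'|$ in higher dimension). The one-dimensional functions and all names are toy choices, not taken from the paper.

```python
import numpy as np

def bisect_root(func, lo, hi, tol=1e-12):
    """Root of func in [lo, hi] by bisection; func(lo) and func(hi) must differ in sign."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_lo < 0) == (f_mid < 0):
            lo, f_lo = mid, f_mid    # root lies in the upper half
        else:
            hi = mid                 # root lies in the lower half
    return 0.5 * (lo + hi)

def boundary_shift(eps):
    """Return (true shift, linear estimate) of the decision boundary for perturbation size eps."""
    x0 = 0.3
    f = lambda x: np.tanh(x - x0)                # toy output function, boundary at x0
    delta_f = lambda x: eps * np.cos(3.0 * x)    # toy perturbation of the output
    g = lambda x: f(x) + delta_f(x)
    true_shift = abs(bisect_root(g, -1.0, 1.5) - x0)
    slope = 1.0 - np.tanh(0.0) ** 2              # f'(x0) = 1 for this toy function
    linear_estimate = abs(delta_f(x0)) / abs(slope)
    return true_shift, linear_estimate

if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(eps, boundary_shift(eps))
```

In this toy setting the relative gap between the two numbers shrinks as the perturbation gets smaller, which is the sense in which a linear estimate improves when output fluctuations decrease.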
Appendix C Central limit theorem of the NTK
In this section, we present a heuristic for the finite-size effects that are displayed by the NTK at initialization: informally, this is the Central Limit Theorem counterpart to the NTK asymptotic result, which can be viewed as a law of large numbers. A rigorous derivation, including the behavior during training, is beyond the scope of this paper.
The NTK can be re-written as:
[TABLE]
where is the activity of neuron when data is shown, while is its pre-activity and is the set of neurons in the layer preceding . The first bracket converges to a well-defined limit described by a so-called activation kernel, see [46, 47, 20]. The second bracket has fluctuations of size comparable to its mean. The normalization is chosen such that each layer contributes a finite amount to the kernel, so that the mean is of order . For a given hidden layer, the contributions of two neurons can be shown to have a covariance that is positive and decays as , and thus does not affect the scaling expected from the Central Limit Theorem for uncorrelated variables. For a rectangular network (i.e. one where all hidden layers have the same size), this suggests that the fluctuations associated with the contribution of one layer to the kernel are of order .
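The Central Limit Theorem scaling can be checked numerically on a much simpler object than the full NTK. The sketch below uses a toy one-hidden-layer random-feature kernel, $(1/h)\sum_j \mathrm{ReLU}(w_j \cdot x)\,\mathrm{ReLU}(w_j \cdot x')$ with i.i.d. Gaussian weights, and measures how its standard deviation over initializations depends on the width $h$. This is an illustrative stand-in chosen here, not the kernel decomposition of the paper, and the function names are hypothetical.

```python
import numpy as np

def kernel_std(h, x1, x2, n_init=1000, seed=0):
    """Std over random initializations of a toy width-h random-feature kernel."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_init, h, x1.size))   # n_init independent weight draws
    a1 = np.maximum(w @ x1, 0.0)                # ReLU activities for input x1, shape (n_init, h)
    a2 = np.maximum(w @ x2, 0.0)                # ReLU activities for input x2
    k = (a1 * a2).mean(axis=1)                  # (1/h) * sum over the h neurons
    return float(k.std())

def fitted_exponent(widths, d=5, seed=0):
    """Log-log slope of the kernel std as a function of the width."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.normal(size=(2, d)) / np.sqrt(d)   # two arbitrary inputs
    stds = [kernel_std(h, x1, x2, seed=seed + 1 + i) for i, h in enumerate(widths)]
    slope, _ = np.polyfit(np.log(widths), np.log(stds), 1)
    return float(slope)

if __name__ == "__main__":
    print(fitted_exponent((16, 64, 256, 1024)))
```

Because the neuron contributions are independent in this toy model, the fitted slope should sit close to $-1/2$. If the parameter count of a fixed-depth network grows like $h^2$, a $1/\sqrt{h}$ scaling corresponds to $N^{-1/4}$.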
Appendix D Fluctuations of output function for the mean square error loss
In this section, we discuss the fluctuations of the output function after training for the mean square error loss. We first investigate the variance of the output function in the infinite-width limit, then explain the deviations due to finite-size effects, and finally discuss the hinge loss case.
D.1 Infinite width
Let us first study the variance of the output function in the infinite-width limit. In this limit, the function at initialization is a centered Gaussian process described by a covariance kernel. During training, the dynamics of the output function is described by a deterministic kernel (the infinite-width limit of the NTK):
[TABLE]
If the NTK is positive definite (which is proven when the inputs all lie on the unit circle and the non-linearity is not a polynomial function), the network reaches a global minimum at the end of training . In particular the values of the function on the training set are deterministic: . The values of the function outside the training set can be studied using the vector of values of on the training set . Denoting by the empirical Gram matrix:
[TABLE]
so that
[TABLE]
These two terms represent the fact that the network needs to learn the labels and forget the random initialization. We can therefore give a formula for the values outside the training set, using the vector
[TABLE]
The first two terms are random, but they partly cancel each other: their sum is a centered Gaussian with zero variance on the training set and a small variance for points close to the training set. The more training data points are used, the lower the variance due to initialization. The last term is equal to the kernel regression on the training labels with respect to the NTK; it is not random.
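The cancellation on the training set can be checked numerically. The sketch below assumes the standard infinite-width form of the trained output for the mean square error loss, $f_\infty(x) = f_0(x) - \Theta(x, X)\,\Theta(X, X)^{-1} f_0(X) + \Theta(x, X)\,\Theta(X, X)^{-1} Y$, which may differ in notation from the equations of this appendix. A Gaussian kernel is used as a stand-in for both the NTK and the initialization covariance; this choice and all names are hypothetical.

```python
import numpy as np

def rbf(a, b, length=0.5):
    """Gaussian kernel matrix between 1-D point sets a and b (stand-in for the NTK)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def trained_outputs(x_train, y_train, x_test, n_init=2000, seed=0):
    """Sample the trained output on [train, test] points over random initializations."""
    rng = np.random.default_rng(seed)
    x_all = np.concatenate([x_train, x_test])
    p = x_train.size
    # Assumed covariance of the output at initialization (toy choice, with jitter).
    cov0 = rbf(x_all, x_all, length=0.3) + 1e-8 * np.eye(x_all.size)
    f0 = rng.multivariate_normal(np.zeros(x_all.size), cov0, size=n_init)
    gram = rbf(x_train, x_train)                           # Theta(X, X)
    proj = np.linalg.solve(gram, rbf(x_train, x_all)).T    # Theta(x, X) Theta(X, X)^{-1}
    random_part = f0 - f0[:, :p] @ proj.T   # first two terms: f_0(x) - proj f_0(X)
    regression = proj @ y_train             # last term: deterministic kernel regression
    return random_part + regression, p

if __name__ == "__main__":
    x_train = np.linspace(-2.0, 2.0, 8)
    y_train = np.sign(np.sin(2.0 * x_train))
    x_test = np.linspace(-1.9, 1.9, 5)
    out, p = trained_outputs(x_train, y_train, x_test)
    print("std on training points:", out[:, :p].std(axis=0))
    print("std on test points:", out[:, p:].std(axis=0))
```

On the training points the standard deviation over initializations vanishes up to numerical error and the output equals the labels, while test points retain a nonzero spread that comes entirely from the random initialization.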
This shows that even in the infinite-width limit, the trained output function has some variance, which is due to the variance of the output function at initialization. Yet, in the setup where the number of data points is large enough, the variance due to initialization almost vanishes during training, and the scaling of the variance due to finite-size effects will appear in the last term.
Finally, note that Eq. 16 of this S.M. implies that is smooth if both and are smooth functions of (this implication holds true for other choices of loss function). is smooth if the activation function is smooth [20], and so is , which is then a Gaussian process with smooth covariance . For ReLU neurons, displays a cusp at while is smooth, so is smooth except on the training set, as supported by Figure 1 of this S.M.
D.2 Finite width
For a finite width , the training is also described by the NTK which is random at initialization and varies during training because it depends on the parameters. The integral formula becomes
[TABLE]
However the noise at initialization is of order , whereas the rate of change is only of order . We can therefore make the approximation
[TABLE]
Assuming that there are enough parameters such that the Gram matrix is invertible, we can again decompose the integral into two terms:
[TABLE]
giving that
[TABLE]
Here again the first two terms almost cancel each other, but the third term is random due to the randomness of the NTK which is of order , as needed.
D.3 Hinge Loss
For the hinge loss setup, we do not have such a strong constraint on the values of the function on the training set as for regression, but we still know that they must satisfy the margin constraints
[TABLE]
The vector is therefore random for the hinge loss as a result of the random initialization of and the fluctuations of the NTK. Again it is natural to assume the first type of fluctuations to be subdominant and the second type to be of order .
References
- [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- [3] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
- [4] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
- [5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.
- [6] C Daniel Freeman and Joan Bruna. Topology and geometry of deep rectified network optimization landscapes. International Conference on Learning Representations, 2017.
- [7] Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys. arXiv preprint arXiv:1802.06384, 2018.
- [8] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.
