TL;DR
This paper develops a theoretical framework to quantify deep neural network generalization error based on data complexity and network smoothness, validated through experiments on image datasets.
Contribution
It introduces the cover complexity measure and inverse modulus of continuity to analyze neural network generalization, linking theoretical bounds with empirical observations.
Findings
The expected error, normalized by the square root of the number of classes, is approximately linear in the cover complexity.
Test loss correlates with neural network smoothness during training.
Network size affects smoothness, but dataset size does not.
| Data Set | Variants | Input dim | Output dim (number of classes) | TC | CD | CC | Best error | Best error / √(number of classes) |
| MNIST | Original | 784 | 10 | .8480 | .1053 | 1.442 | .01 | .0032 |
| CIFAR-10 | Original | 3072 | 10 | .8332 | .0163 | 10.23 | .45 | .1423 |
| CIFAR-10 | Grey | 1024 | 10 | .8486 | .0125 | 12.11 | .53 | .1676 |
| CIFAR-10 | Conv | 1024 | 10 | .9505 | .0094 | 5.280 | .18 | .0569 |
| SVHN | Original | 3072 | 10 | .9034 | .0076 | 12.68 | .49 | .1550 |
| SVHN | Grey | 1024 | 10 | .9117 | .0084 | 10.48 | .56 | .1771 |
| SVHN | Conv | 1024 | 10 | .9632 | .0123 | 2.995 | .23 | .0727 |
| CIFAR-100 | Original (coarse) | 3072 | 20 | .8337 | .0185 | 9.012 | .62 | .1386 |
| CIFAR-100 | Grey (coarse) | 1024 | 20 | .8541 | .0132 | 11.08 | .72 | .1610 |
| CIFAR-100 | Conv (coarse) | 1024 | 20 | .9626 | .0070 | 5.326 | .40 | .0894 |
| COIL-20 | Original | 16384 | 20 | .9176 | .2385 | .3453 | .03 | .0067 |
| CIFAR-100 | Original (fine) | 3072 | 100 | .8337 | .0270 | 6.149 | .73 | .0730 |
| CIFAR-100 | Grey (fine) | 1024 | 100 | .8541 | .0198 | 7.380 | .81 | .0810 |
| CIFAR-100 | Conv (fine) | 1024 | 100 | .9457 | .0136 | 4.000 | .52 | .0520 |
| COIL-100 | Original | 49152 | 100 | .9430 | .1944 | .2930 | .01 | .0010 |
| 10 | .285 | .972 | .045 | 1.0 | 0.80 | 0.38 |
| 20 | .246 | .988 | .041 | 1.0 | 1.00 | 0.69 |
| 40 | .182 | .994 | .041 | 1.0 | 1.00 | 0.85 |
| 80 | .127 | .997 | .038 | 1.0 | 1.00 | 0.92 |
| Data Set | Version | Best Error | ||||||||||||
| MNIST | Original | .02 | .02 | .05 | .02 | .02 | .04 | .02 | .02 | .04 | .01 | .02 | .03 | .01 |
| CIFAR-10 | Original | .47 | .46 | .52 | .48 | .46 | .51 | .47 | .45 | .50 | .47 | .45 | .49 | .45 |
| CIFAR-10 | Grey | .55 | .55 | .63 | .55 | .54 | .62 | .54 | .53 | .61 | .54 | .53 | .59 | .53 |
| CIFAR-10 | Conv | .18 | .18 | .19 | .19 | .18 | .19 | .18 | .18 | .18 | .18 | .18 | .18 | .18 |
| SVHN | Original | .80 | .59 | .49 | .80 | .73 | .60 | .80 | .69 | .51 | .80 | .72 | .64 | .49 |
| SVHN | Grey | .80 | .64 | .56 | .80 | .76 | .66 | .80 | .64 | .58 | .80 | .75 | .66 | .56 |
| SVHN | Conv | .27 | .23 | .23 | .31 | .24 | .23 | .31 | .24 | .23 | .69 | .25 | .24 | .23 |
| CIFAR-100 | Original(coarse) | .64 | .64 | .69 | .64 | .63 | .68 | .63 | .62 | .67 | .63 | .62 | .66 | .62 |
| CIFAR-100 | Grey(coarse) | .74 | .73 | .79 | .74 | .73 | .78 | .74 | .72 | .78 | .74 | .72 | .77 | .72 |
| CIFAR-100 | Conv(coarse) | .40 | .41 | .45 | .41 | .41 | .44 | .40 | .41 | .44 | .41 | .40 | .43 | .40 |
| COIL-20 | Original | .08 | .05 | .05 | .07 | .06 | .05 | .08 | .05 | .05 | .07 | .05 | .03 | .03 |
| CIFAR-100 | Original(fine) | .75 | .75 | .82 | .75 | .74 | .81 | .74 | .73 | .80 | .75 | .73 | .79 | .73 |
| CIFAR-100 | Grey(fine) | .83 | .83 | .90 | .83 | .83 | .90 | .82 | .81 | .88 | .83 | .81 | .87 | .81 |
| CIFAR-100 | Conv(fine) | .53 | .54 | .64 | .53 | .54 | .63 | .52 | .52 | .61 | .53 | .52 | .59 | .52 |
| COIL-100 | Original | .02 | .01 | .01 | .03 | .02 | .01 | .03 | .02 | .01 | .02 | .02 | .01 | .01 |
| Input 256 256 Output | |||
| Input 256 256 256 Output | |||
| Input 512 512 Output | |||
| Input 512 512 512 Output |
Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness
Pengzhan Jin (equal contribution)
Lu Lu (equal contribution)
Yifa Tang
George Em Karniadakis
LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
Division of Applied Mathematics, Brown University, Providence, RI 02912, USA
Abstract
The accuracy of deep learning, i.e., deep neural networks, can be characterized by dividing the total error into three main types: approximation error, optimization error, and generalization error. Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization. Most existing theoretical works for generalization fail to explain the performance of neural networks in practice. To derive a meaningful bound, we study the generalization error of neural networks for classification problems in terms of data distribution and neural network smoothness. We introduce the cover complexity (CC) to measure the difficulty of learning a data set and the inverse of the modulus of continuity to quantify neural network smoothness. A quantitative bound for expected accuracy/error is derived by considering both the CC and neural network smoothness. Although most of the analysis is general and not specific to neural networks, we validate our theoretical assumptions and results numerically for neural networks by several data sets of images. The numerical results confirm that the expected error of trained networks scaled with the square root of the number of classes has a linear relationship with respect to the CC. We also observe a clear consistency between test loss and neural network smoothness during the training process. In addition, we demonstrate empirically that the neural network smoothness decreases when the network size increases whereas the smoothness is insensitive to training dataset size.
keywords:
Neural networks, Generalization error, Learnability, Data distribution, Cover complexity, Neural network smoothness
1 Introduction
In the last 15 years, deep learning, i.e., deep neural networks (NNs), has been used very effectively in diverse applications, such as image classification (Krizhevsky et al., 2012), natural language processing (Maas et al., 2013), and game playing (Silver et al., 2016). Despite this remarkable success, our theoretical understanding of deep learning is lagging behind. The accuracy of NNs can be characterized by dividing the expected error into three main types: approximation (also called expressivity), optimization, and generalization (Bottou and Bousquet, 2008; Bottou, 2010), see Fig. 1. The well-known universal approximation theorem was obtained by Cybenko (1989) and Hornik et al. (1989) almost three decades ago stating that feed-forward neural nets can approximate essentially any function if their size is sufficiently large. In the past several years, there have been numerous studies that analyze the landscape of the non-convex objective functions, and the optimization process by stochastic gradient descent (SGD) (Lee et al., 2016; Liao and Poggio, 2017; Lu et al., 2018; Allen-Zhu et al., 2018b; Du et al., 2018; Lu et al., 2019). Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization, which is the focus of this study.
The classical analysis of generalization is based on controlling the complexity of the function class, i.e., model complexity, by managing the bias-variance trade-off (Friedman et al., 2001). However, this type of analysis is not able to explain the small generalization gap between training and test performance of neural networks learned by SGD in practice, considering the fact that deep neural networks often have far more model parameters than the number of samples they are trained on, and have sufficient capacity to memorize random labels (Neyshabur et al., 2014; Zhang et al., 2016). To explain this phenomenon, several approaches have been recently developed by many researchers. The first approach is characterizing neural networks with some other low “complexity” instead of the traditional Vapnik-Chervonenkis (VC) dimension (Bartlett et al., 2017b) or Rademacher complexity (Bartlett and Mendelson, 2002), such as path-norm (Neyshabur et al., 2015), margin-based bounds (Sokolić et al., 2017; Bartlett et al., 2017a; Neyshabur et al., 2017b), Fisher-Rao norm (Liang et al., 2017), and more (Neyshabur et al., 2019; Wei and Ma, 2019). The second approach is to analyze some good properties of SGD or its variants, including its stability (Hardt et al., 2015; Kuzborskij and Lampert, 2017; Gonen and Shalev-Shwartz, 2017; Chen et al., 2018), robustness (Sokolic et al., 2016; Sokolić et al., 2017), implicit biases/regularization (Poggio et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018; Nagarajan and Kolter, 2019b), and the structural properties (e.g., sharpness) of the obtained minimizers (Keskar et al., 2016; Dinh et al., 2017; Zhang et al., 2018). The third approach relies on overparameterization, e.g., sufficiently overparameterized networks can learn the ground truth with a small generalization error using SGD from random initialization (Li and Liang, 2018; Allen-Zhu et al., 2018a; Arora et al., 2019; Cao and Gu, 2019). 
There are also other approaches, such as compression (Arora et al., 2018; Baykal et al., 2018; Zhou et al., 2018; Cheng et al., 2018), Fourier analysis (Rahaman et al., 2018; Xu et al., 2019), “double descent” risk curve (Belkin et al., 2018), PAC-Bayesian framework (Neyshabur et al., 2017b; Nagarajan and Kolter, 2019a), and information bottleneck (Shwartz-Ziv and Tishby, 2017; Saxe et al., 2019).
However, most theoretical bounds fail to explain the performance of neural networks in practice (Neyshabur et al., 2017a; Arora et al., 2018). To get non-vacuous and tight enough bounds to be practically meaningful, some problem-specific factors should be taken into consideration, such as the low complexity (i.e., data-dependent analysis) (Dziugaite and Roy, 2017; Kawaguchi et al., 2017), or properties of the trained neural networks (Sokolić et al., 2017; Arora et al., 2018; Wei and Ma, 2019). In this study, to achieve a practically meaningful bound, our analysis relies on the data distribution and the smoothness of the trained neural network. The analysis proposed in this study provides guarantees on the generalization error, and theoretical insights to guide the practical application.
As shown in Fig. 1, the optimization error is correlated with the loss value (for notational simplicity, the term “loss” indicates “empirical loss”), while the approximation error depends on the network size. In addition, a small loss requires a sufficient approximation ability, i.e., a large network size, which in turn leads to a small approximation error. If we assume a sufficiently small loss, which usually holds in practice, then the expected error mainly depends on the generalization error. Hence, we study the expected error/accuracy directly. In particular, we propose a mathematical framework to analyze the expected accuracy of neural networks for classification problems. We introduce the concepts of total cover (TC), self cover (SC), mutual cover (MC) and cover difference (CD) to represent the data distribution, and then we use the concept of cover complexity (CC) as a measure of the complexity of classification problems. On the other hand, the smoothness of a neural network is characterized by the inverse of the modulus of continuity. Because computing this quantity is not tractable in general, we propose an estimate using the spectral norm of the weight matrices of the neural network. The main terminology is illustrated in Fig. 2. By combining the properties of the data distribution and the smoothness of neural networks, we derive a lower bound for the expected accuracy, i.e., an upper bound for the expected classification error.
Subsequently, we test our theoretical bounds on several data sets, including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CIFAR-100 (Krizhevsky and Hinton, 2009), COIL-20 (Nene et al., 1996b), COIL-100 (Nene et al., 1996a), and SVHN (Netzer et al., 2011). Our numerical results not only confirm our theoretical bounds, but also provide insights into the optimization process and the learnability of neural networks. In particular, we find that:
1. The best accuracy that can be achieved in practice (i.e., optimized by stochastic gradient descent) by fully-connected networks is approximately linear with respect to the cover complexity of the data set.
2. The trend of the expected accuracy is consistent with the smoothness of the neural network, which suggests a new “early stopping” strategy based on monitoring the smoothness of the neural network.
3. The neural network smoothness decreases when the network depth and width increase, with the effect of depth more significant than that of width.
4. The neural network smoothness is insensitive to the training dataset size, and is bounded from below by a positive constant. This point makes our theoretical result (Theorem 3.20) specifically pertinent to deep neural networks.
The paper is organized as follows. After setting up notation and terminology in Section 2, we present the main theoretical bounds for the accuracy based on the data distribution and the smoothness of neural networks in Section 3, while all proofs are deferred to the appendix. In Section 4, we provide the numerical results for several data sets. In Section 5 we include a discussion, and in Section 6 we summarize our findings.
2 Preliminaries
Before giving the main results, we introduce the necessary notation and terminology. Without loss of generality, we assume that the space we need to classify is
[TABLE]
where is the dimensionality, and the points in this space are classified into categories, i.e., there are labels . We denote the probability measure on by , i.e., for a measurable set , is the probability of a random sample belonging to .
2.1 Ideal label function
For the problem setup, we assume that every sample has at least one true label, and one sample may have multiple true labels. Taking image classification as an example, each image has at least one correct label. A fuzzy image or an image with more than one object in it may have multiple possible correct labels, and as long as the prediction is one of these labels, we consider the prediction to be correct.
It is intuitive that when two samples are close enough, they should have similar labels, which means that the ideal label function should be continuous. The continuity of a mapping depends on the topology of both domain and image space. For the domain of the ideal label function, we choose the standard topology induced by the Euclidean metric. As for the topology of the image space, we define it as follows. We first define the label set and the topology on it.
Definition 2.1 (Topology).
Let
[TABLE]
be the label set. Define the topology on to be
[TABLE]
where for , and thus constitutes a topological space.
Analogous to the Euclidean metric topology, is viewed as the open “ball” centered at , and arbitrary unions of the “balls” are defined as the open sets, see Appendix A for an example. With this choice of the topology, a function is continuous if and only if
[TABLE]
Next we give the definition of the ideal label function according to this topological space.
Definition 2.2 (Ideal label function).
An ideal label function is a continuous function
[TABLE]
where is equipped with the Euclidean metric topology and with the topology from Definition 2.1. This continuity holds if and only if
[TABLE]
Eq. (1) means that two neighboring points would have some common labels. Based on the topological space defined above, it is easy to show that Eq. (1) is equivalent to continuity. The reason why we consider a multi-label setup for classification problems is that it allows for the continuity property in Eq. (1), which is impossible in the setup of a single label set, unless the label function is constant. In addition, the multi-label setup introduces a smooth transition, i.e., a buffer domain, between two domains of different labels, while the transition is sharp in the single label setup. In the following proposition, we show that if two samples are close enough, they must share at least one common label.
Proposition 2.3 (Separation gap).
We denote the supremum of as the separation gap , which is used in the sequel.
Proof.
The proof can be found in Appendix B. ∎
To understand the geometric interpretation of , we consider the following special case: the label of each sample is either a single label set, such as , or the full label set if it is not uniquely identifiable.
Proposition 2.4 (Geometric interpretation of separation gap).
If the label of each sample is either a single label set or the full label set , then is the smallest distance between two different single label points, i.e.,
[TABLE]
Proof.
The proof can be found in Appendix C. ∎
2.2 Cover complexity of data set
In this subsection, we introduce a quantity to measure the difficulty of learning a training data set
[TABLE]
First, we give some notations and propositions.
With the measure , the probability of the neighborhood of the training set with radius of is defined as
[TABLE]
where is the open ball centered at with radius of , see Fig. 2A. Obviously, is a monotone non-decreasing function, (since ), and when , see Fig. 2B. To represent the global behavior of , we use the integral of with respect to :
[TABLE]
Hence, this integral accounts for both the number and location of the data points, and also for the probability distribution of the space. Its value is larger if the number of data points is increased and also if the probability distribution is more concentrated around the data points, which we call the “coverability” of the data set. We can increase it by adding more data points or by redistributing their locations. Next, we introduce the formal definition of the “coverability”.
Definition 2.5 (Coverability).
Let be a data set from a domain with probability measure . We define the following for the coverability of .
- (i) The total cover (TC) is
[TABLE]
Thus, .
- (ii) The cover difference (CD) is
[TABLE]
where is the number of categories, and and represent the subset and probability measure of the label respectively, i.e.,
[TABLE]
with . Here, is called self cover (SC), and is called mutual cover (MC).
- (iii) The cover complexity (CC) is
[TABLE]
for .
Remark 2.6.
CD is defined as the difference between the mean of SC and the mean of MC, since each category occurs with the same probability () in the data sets mostly used in practice. If there are some categories occurring more frequently than others, then it is straightforward to extend this definition by using the mean weighted by the probability of each category.
In image classification, the dimension of the image space is very high, and thus the data points are quite sparse. However, due to the fact that images actually live on a manifold of low dimension, the probability density around is actually high, which makes the TC to be meaningful. In our next result, we derive a lower bound of by .
Proposition 2.7.
Let be a data set. and are defined as above. Then we have
[TABLE]
Proof.
The proof can be found in Appendix D. ∎
From this proposition, we know that for a fixed , is close to 1 if is large enough. However, the probability distribution is usually given in practice, and we can only control the number of samples. The following theorem shows that the TC can be arbitrarily close to 1 when enough samples are available.
Theorem 2.8.
Let be a data set of size drawn according to . Then there exists a non-increasing function satisfying , and for any , there exists an
[TABLE]
such that
[TABLE]
holds with probability at least when .
Proof.
The proof and some other results regarding TC can be found in Appendix E. ∎
The reason why CD is introduced is that TC does not consider the labels of the data points. However, data points of the same label should be clustered in an easily learnable data set. The CD is the difference between the self cover and the mutual cover, which takes the distribution of each label into account. By normalizing TC with CD, the cover complexity is able to measure the difficulty of learning a data set. The difficulty of a problem should be translation-independent and scale-independent. It is easy to see that the CC is independent of translation, and the following proposition shows that it is also scale-independent.
Proposition 2.9 (Scale independence).
The CC is scale-independent, i.e., if all the data points are scaled by the same factor less than 1, then the CC is unchanged.
Proof.
The proof can be found in Appendix F. ∎
2.3 Setup for accuracy analysis
The setup for accuracy analysis is as follows.
Definition 2.10.
If is a continuous mapping, then the mapping
[TABLE]
is still continuous, where represents the i-th component of , and is the exponential function applied componentwise to . We have , and . For convenience, we directly consider the case that and , and we call such mapping the normalized continuous positive mapping.
Remark 2.11.
A neural network with softmax nonlinearity is a normalized continuous positive mapping.
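As a small illustration of Definition 2.10 and Remark 2.11, the sketch below applies a softmax to a vector of raw network outputs and checks that the result is componentwise positive and sums to one. The function name `softmax` and the sample values are illustrative choices rather than part of the paper's code, and subtracting the maximum is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(z):
    """Map raw outputs z to a positive vector whose components sum to 1."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of a 3-class network.
probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

Because the exponential is positive and the normalization is a division by a positive sum, any continuous network followed by this map is a normalized continuous positive mapping in the sense used above.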
Different from the accuracy usually used in classification problems, we define a stronger accuracy called -accuracy as follows.
Definition 2.12 (-accuracy at ).
Let be a normalized continuous positive mapping. For , we say that is -accurate at point if
[TABLE]
Definition 2.13 (-accuracy on ).
Let be a normalized continuous positive mapping. The -accuracy of on a sample space is defined as
[TABLE]
where is -accurate at x$$\}.
Definition 2.14 (-accuracy on ).
Let be a normalized continuous positive mapping. The -accuracy of on a data set is defined as
[TABLE]
where is -accurate at , and and are the TC of and , respectively.
Definition 2.15 (Expected accuracy).
Let be a normalized continuous positive mapping. The expected accuracy of on a sample space is defined as
[TABLE]
where .
We note that the -accuracy of on represents the expected -accuracy, and the -accuracy of on represents the empirical -accuracy.
Finally, we define a non-decreasing function to describe the smoothness of .
Definition 2.16 (Smoothness).
Let be a continuous mapping. Then is uniformly continuous due to the compactness of , i.e.
[TABLE]
We denote the supremum of satisfying the above requirement by . It is easy to see that is equal to the inverse of the modulus of continuity of .
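For a one-dimensional input, the supremum in Definition 2.16 can be approximated on a grid: the largest admissible radius is the smallest distance between two grid points whose outputs differ by at least the tolerance. The sketch below is only an illustration under assumptions that are not confirmed by the text here: the output difference is measured in the max norm, the input is a scalar, and the names `brute_force_smoothness`, `grid`, and `eps` are hypothetical.

```python
import math


def brute_force_smoothness(f, grid, eps):
    """Grid estimate of the inverse modulus of continuity of f at tolerance eps.

    Returns the smallest distance between two grid points whose outputs
    differ by at least eps in the max norm (an assumed choice of norm),
    or math.inf if no such pair exists on the grid.
    """
    outputs = [f(x) for x in grid]
    best = math.inf
    for i in range(len(grid)):
        for j in range(i + 1, len(grid)):
            diff = max(abs(a - b) for a, b in zip(outputs[i], outputs[j]))
            if diff >= eps:
                best = min(best, abs(grid[i] - grid[j]))
    return best


# Toy example: f(x) = 2x on [0, 1], sampled at 101 equispaced points.
grid = [k / 100 for k in range(101)]
print(brute_force_smoothness(lambda x: [2.0 * x], grid, eps=0.09))
```

The double loop costs quadratic time in the number of grid points, which is one way to see why this approach is limited to low-dimensional problems.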
For low dimensional problems, we can directly compute the smoothness by brute force. However, for high dimensional problems, this computation is intractable, and thus we give the following lower bound for a fully-connected ReLU network with softmax as the activation function in the last layer, which is also the main network structure considered throughout this work.
A fully-connected neural network is defined as follows:
[TABLE]
[TABLE]
[TABLE]
where is the number of neurons in the layer ( and ), and is the activation function. Then for the ReLU activation function, we have
[TABLE]
where is the spectral norm, and represent the Lipschitz constants of and mapping from to , respectively. is a constant less than , and thus is ignored in our numerical examples. We note that although the lower bound of depends exponentially on the neural net depth, itself does not necessarily scale exponentially with the network depth.
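A bound of this kind can be evaluated from the weight matrices alone. The sketch below assumes, as a reading of the passage rather than a quotation of Eq. (3), that the lower bound is the tolerance divided by the product of the spectral norms of the weight matrices, with the softmax Lipschitz constant dropped as the text says is done in the numerical examples. The spectral norm is computed with plain power iteration; all function names and the toy weights are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def spectral_norm(W, iters=200, seed=0):
    """Largest singular value of W via power iteration on W^T W."""
    Wt = [list(col) for col in zip(*W)]
    rng = random.Random(seed)
    # Random start so that it is not orthogonal to the top singular vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        u = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
    return math.sqrt(sum(x * x for x in matvec(W, v)))


def smoothness_lower_bound(weights, eps):
    """Assumed form of the bound: eps / product of layer spectral norms."""
    product = 1.0
    for W in weights:
        product *= spectral_norm(W)
    return eps / product


# Toy two-layer network with spectral norms 4 and 0.5.
weights = [[[3.0, 0.0], [0.0, 4.0]], [[0.5, 0.0]]]
print(smoothness_lower_bound(weights, eps=0.5))
```

Since the product has one factor per layer, the bound can shrink geometrically with depth even when the true smoothness does not, which matches the caveat in the preceding paragraph.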
3 Lower bounds for the expected accuracy
In this section, we present a theoretical analysis of the lower bound for the expected accuracy as well as an upper bound for the expected error.
Proposition 3.17.
Let be a normalized continuous positive mapping. Suppose that is a single label training set, i.e. . For any , we have
[TABLE]
where .
Proof.
The proof can be found in Appendix G. ∎
Proposition 3.17 shows that the expected -accuracy of can be bounded by the empirical -accuracy and the TC of the training set. We can see that tends to 1 when and tend to 1. Next we derive a bound for the accuracy by taking into account the loss function.
Theorem 3.18 (Lower bound of -accuracy).
Let be a normalized continuous positive mapping. Suppose that is a single label training set, and . For any , if the maximum cross entropy loss
[TABLE]
then we have
[TABLE]
where is the cross entropy loss that , , and is defined in Proposition 2.3.
Proof.
The proof can be found in Appendix H. ∎
Theorem 3.18 reveals that the expected accuracy is related to the total cover, the separation gap, the neural network smoothness, and the loss value. We will show numerically in Section 4 that the smoothness first increases and then decreases during the training of neural networks. The following theorem states that the maximum value of the smoothness is bounded by the empirical separation gap.
Theorem 3.19 (Empirical separation gap).
Let be a normalized continuous positive mapping. Suppose that is a single label training set. For any , if , then we have
[TABLE]
where
[TABLE]
is called the empirical separation gap, i.e., the smallest distance between two differently labeled training points.
Proof.
The proof can be found in Appendix I. ∎
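The empirical separation gap, described above as the smallest distance between two differently labeled training points, can be computed directly from a labeled data set. The following sketch uses the Euclidean distance; the function name and the toy data are hypothetical.

```python
import math


def empirical_separation_gap(points, labels):
    """Smallest Euclidean distance between two points with different labels.

    Returns math.inf if all points share the same label.
    """
    gap = math.inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] != labels[j]:
                gap = min(gap, math.dist(points[i], points[j]))
    return gap


# Toy data: two points of class 0 and one point of class 1.
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
labels = [0, 1, 0]
print(empirical_separation_gap(points, labels))
```

The pairwise loop is quadratic in the number of training points, so for large data sets a spatial index or a subsample would be a natural substitute.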
Besides the upper bound, the lower bound of the smoothness is also important to the accuracy. We have observed that in practice such a lower bound exists (Figs. 5, 6, 7 and 8), which indicates the existence of the constant assumed in the following theorem. Based on this observation, we have the following theorem for the accuracy.
Theorem 3.20 (Lower bound of accuracy).
Assume that there exists a constant , such that
[TABLE]
holds for any single label training set and a corresponding suitable trained network on such that , then we have the following conclusions for the expected accuracy and the expected error :
- (i)
*with the same condition of Theorem 3.18, *
, 2. (ii)
** 3. (iii)
**
where , and .
Proof.
(i) is the conclusion of Theorem 3.18. The proof of (ii) and (iii) can be found in Appendix J. ∎
Remark 3.21.
We have the following remarks for this main theorem:
1. In (ii), we rewrite the bound to emphasize the cover complexity by introducing a coefficient term. Although this seems artificial, our numerical experiments empirically show that in practice the linear scaling between the expected error and the cover complexity is indeed satisfied.
2. This theorem is not specific to neural networks, but rather holds for any trained model satisfying the assumptions required in the theorem. While the assumptions are not true for a general machine learning algorithm, we show numerically that in practice neural networks satisfy the assumptions.
3. The current proof of the theorem relies on the assumption on the maximum cross entropy loss, which could be hard to satisfy in practice, since it will be large even if only a single training sample is misclassified. However, our experiments suggest that this assumption can be relaxed.
Here, the cover complexity consists of two parts: one represents the richness of the whole training set, while the other describes the degree of separation between differently labeled subsets. As for the coefficient term, both its denominator and numerator seem to have a positive correlation with the separation level. What we hope is that the coefficient is close to a constant with high probability, so that the expected error is mainly determined by the cover complexity, which approximately represents the complexity level of the data set. We provide more detail in the section on numerical results.
4 Numerical results
In this section, we use numerical simulations to test the accuracy of neural networks in terms of the data distribution (cover complexity) and the neural network smoothness. In addition, we study the effects of the network size and training dataset size on the smoothness. The code is published on GitHub (https://github.com/jpzxshi/generalization).
4.1 Data distribution
In this subsection, we explore how the cover complexity affects the expected error. In our experiments, we test several data sets, including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CIFAR-100 (Krizhevsky and Hinton, 2009), COIL-20 (Nene et al., 1996b), COIL-100 (Nene et al., 1996a), and SVHN (Netzer et al., 2011). In addition to the original data sets, we also create some variants: (1) grey-scale images, (2) images extracted from a convolutional layer after training a convolutional neural network (CNN) on the original data set, and (3) data sets in which several categories are combined into one to reduce the total number of categories; see Table 1 and the details in Appendix K.
For a training data set, we estimate the probability of its neighborhood with radius $r$ by the proportion of the test data points within the balls with radius $r$ centered at training data points, i.e.,
[TABLE]
and then the TC is obtained by Definition 2.5. Similarly, we estimate the CD and then compute the CC. Next, for each data set, we train fully-connected neural networks with different hyperparameters, and record the best error we observed; see the details in Appendix K. The cover complexity and the best error for each data set are shown in Table 1.
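The estimation procedure described above can be sketched as follows. The coverage at a given radius is the fraction of test points lying in an open ball around some training point, and the total cover is then obtained by integrating the coverage over the radius. Normalizing the integral by the largest radius, so that the result lies between 0 and 1, is an assumption made here for illustration, as are the function names and the trapezoid rule.

```python
import math


def coverage(train, test, r):
    """Fraction of test points within distance r (open ball) of a training point."""
    hits = sum(1 for x in test if any(math.dist(x, t) < r for t in train))
    return hits / len(test)


def total_cover(train, test, r_max, steps=100):
    """Trapezoid-rule average of the coverage over radii in [0, r_max].

    Dividing by r_max is an assumed normalization, not taken from the text.
    """
    h = r_max / steps
    vals = [coverage(train, test, k * h) for k in range(steps + 1)]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return integral / r_max


# Toy one-dimensional example with a single training point.
train = [(0.0,)]
test = [(0.0,), (0.5,), (1.0,)]
print(coverage(train, test, 0.6), total_cover(train, test, r_max=1.0))
```

Applying the same two functions to the training and test points of one class at a time is one plausible way to obtain the per-class quantities that enter the CD.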
These data sets are divided into three groups according to their output dimensions. For each group of the same output dimension, the error is almost linearly correlated with the CC, see Fig. 3A, regardless of the input dimension. In addition, we find that all the cases collapse onto a single line when the error is normalized by the square root of the number of classes, see Fig. 3B.
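The numbers in Table 1 can be cross-checked with a few lines of code. The sketch below copies four rows from the table and verifies two relations that the tabulated values appear to satisfy up to rounding: the cover complexity equals one minus the total cover divided by the cover difference, and the last column equals the best error divided by the square root of the number of classes. Both relations are inferred here from the numbers themselves and should be read as a consistency check, not as a restatement of Definition 2.5.

```python
import math

# (data set, classes, TC, CD, CC, best error, normalized error), from Table 1.
ROWS = [
    ("MNIST", 10, 0.8480, 0.1053, 1.442, 0.01, 0.0032),
    ("CIFAR-10", 10, 0.8332, 0.0163, 10.23, 0.45, 0.1423),
    ("CIFAR-100 (coarse)", 20, 0.8337, 0.0185, 9.012, 0.62, 0.1386),
    ("COIL-100", 100, 0.9430, 0.1944, 0.2930, 0.01, 0.0010),
]


def cover_complexity(tc, cd):
    """Inferred relation: CC = (1 - TC) / CD."""
    return (1.0 - tc) / cd


def normalized_error(err, num_classes):
    """Error divided by the square root of the number of classes."""
    return err / math.sqrt(num_classes)


for name, k, tc, cd, cc, err, norm_err in ROWS:
    print(name, round(cover_complexity(tc, cd), 3), cc,
          round(normalized_error(err, k), 4), norm_err)
```

The small discrepancies in the recomputed cover complexity are consistent with the table entries having been rounded to four significant digits.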
It is noteworthy that the CC of the convolutional variants of the data sets is much smaller than that of the original data sets, and hence the expected accuracy increases. The results confirm the importance of the data distribution.
Next, we consider the most difficult data set, i.e., data with random labels. We choose MNIST and then assign each image a random label. We repeat this process 50 times, and compute the CC each time. The distribution of the CC is shown in Fig. 4. The smallest CC is 300, which is much larger than the value of 1.442 for the original MNIST data set (Table 1). This extreme example again confirms that the CC is a proper measure of the difficulty of classifying a data set.
4.2 Neural network smoothness
In this subsection, we investigate the relationship between the neural network smoothness and the accuracy, and the effects of network size (depth and width) on the smoothness. We first show results for one- and two-dimensional problems, where the smoothness can be computed accurately by brute force. Subsequently, we consider the high dimensional setting of the MNIST data set, where we estimate the smoothness by Eq. (3).
4.2.1 One- and two-dimensional problems
We first consider a one-dimensional case and a two-dimensional case. For the one-dimensional case, we choose the sample space , , and the ideal label function as
[TABLE]
with separation gap . We use equispaced points ( is an even number) on as the training set, i.e., , where
[TABLE]
We choose 10000 equispaced points on as the test data.
For the two-dimensional case, we choose the sample space , , and the ideal label function as
[TABLE]
with . For the training set, we first choose equispaced points, i.e., , and then remove the points with label to ensure that all samples are of single label. We choose equispaced points on as the test data.
In our experiments, we use a 3-layer fully-connected NN with ReLU activation and 30 neurons per layer. The neural network is trained for 1000 iterations by the Adam optimizer (Kingma and Ba, 2015) for the one-dimensional problem, and 2000 iterations for the two-dimensional problem. For the one-dimensional problem, the -accuracy with and lower bounds for different numbers of training points are listed in Table 2. We can see that the bounds become tighter when is larger.
During the training process of the neural network, the test loss first decreases and then increases, while first increases and then decreases, see Fig. 5A for the one-dimensional problem () and Fig. 5B for the two-dimensional problem (). is bounded by , as proved in Theorem 3.19. We also observe that the trends of test loss and coincide, and thus we should stop the training when begins to decrease to prevent overfitting.
4.2.2 High-dimensional problem
In the high-dimensional problem of MNIST, we consider the average loss instead of the maximum loss , which is very sensitive to extreme points. As shown in Eq. (3), we use the following quantity to bound :
[TABLE]
Because we use the -accuracy to approximate the true accuracy, for the classification problems with two categories, and are equivalent. However, they are not equal for problems with more than two categories, where the best depends on the properties of the data set, such as the easiness of learning to classify the data set. If the data set is easy to classify, such as MNIST, the best should be close to 1. In our example, we choose . We train MNIST using a 3-layer fully-connected NN with ReLU activation and 100 neurons per layer for 100 epochs. In Fig. 6, we can also see the consistency between the test loss and neural network smoothness, as we observed in the low-dimensional problems.
4.2.3 Effects of the network size and training dataset size on the smoothness
We have demonstrated that network smoothness is an important factor for the accuracy. Next, we investigate the effects of network size (depth and width) on the smoothness for binary classification problems, set up as follows. We consider the one-dimensional sample space , and choose equispaced points on as the training data locations. To avoid the effects of the choice of the true target function, we always repeat experiments with different target functions, and in each experiment we generate a random target function. Specifically, to generate a random target function, we first sample two random functions and from a Gaussian process with the radial basis function kernel of length scale 0.2, and then assign a point to category 1 if , and to category 2 otherwise. When training neural networks, we monitor the value of and stop the training once begins to decrease, as shown in Figs. 5 and 6. We first choose the dataset size , and we show that the normalized smoothness decreases as the network depth or width increases (Fig. 7). We also show that the effect of depth is more significant than that of width.
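The construction of a random target function can be sketched as follows. This is a minimal sketch under stated assumptions: the interval [0, 1], the number of points, the jitter term added for numerical stability, and the comparison rule g1(x) > g2(x) for category 1 are our assumptions; only the RBF kernel with length scale 0.2 is taken from the text.

```python
import numpy as np

def rbf_kernel(x, length_scale):
    """Radial basis function kernel matrix for one-dimensional inputs."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def sample_random_target(x, length_scale=0.2, seed=0, jitter=1e-6):
    """Sample two GP paths on x and label each point 1 or 2 by comparing them."""
    rng = np.random.default_rng(seed)
    # The jitter keeps the covariance numerically positive definite (assumption).
    cov = rbf_kernel(x, length_scale) + jitter * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    g1 = chol @ rng.standard_normal(len(x))
    g2 = chol @ rng.standard_normal(len(x))
    # Assumed rule: category 1 where g1 exceeds g2, category 2 otherwise.
    return np.where(g1 > g2, 1, 2)

x_train = np.linspace(0.0, 1.0, 20)  # hypothetical equispaced training locations
labels = sample_random_target(x_train, seed=0)
```

Repeating the call with different seeds gives the different target functions over which the experiments are repeated.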
Our main theorem (Theorem 3.20) requires the assumption (Eq. 4), which would not hold for a general machine learning algorithm. Here, we verify this assumption for neural networks by numerical experiments. Specifically, we train fully-connected neural networks using training datasets of different size . We show that is insensitive to the training dataset size, and is always bounded from below by a positive constant (Fig. 8). This result reveals that neural networks fit a dataset in a relatively smooth way during the training process.
5 Discussion
When neural networks are used to solve classification problems, we expect that the accuracy is dependent on some properties of the data set. It is still quite surprising, however, that there is a linear relationship between the accuracy and the cover complexity of the data set, as we have seen in Section 4.1. Theorem 3.20(ii) provides an upper bound of the error, but a lower bound is missing. To fully explain this observation, two conjectures of the learnability of fully-connected neural networks are proposed: when a neural network is trained on a data set in such a way that and , then we have
1. , where is a constant depending only on .
2. , where is a constant.
On the other hand, the theoretical and numerical results provide a better understanding of the generalization of neural network from the training procedure. The smoothness of neural networks plays a key role, where is the maximum cross entropy loss or the average cross entropy loss . We can see that:
1. depends on both the regularity of and the loss value (which also depends on ). Large requires good regularity and large , i.e., small . However, small could correspond to bad regularity of . Thus, there is a trade-off between the loss value and the regularity of .
2. Due to this trade-off, increases first and then decreases during the training process. Hence, we should not optimize neural networks excessively. Instead, we should stop the training early when begins to decrease, which leads to another “early stopping” strategy to prevent overfitting.
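The stopping rule suggested above can be sketched as a small monitor on the recorded smoothness values. This is an illustration only: the function name `stopping_iteration` and the `patience` parameter (how many non-improving checks to tolerate) are our assumptions and are not part of the experiments in Section 4.

```python
def stopping_iteration(smoothness_history, patience=1):
    """Return the index of the peak after which the monitored value decreases.

    Training is considered stopped once `patience` consecutive checks fail to
    exceed the best value seen so far; the index of that best value is returned.
    """
    best_index = 0
    for i in range(1, len(smoothness_history)):
        if smoothness_history[i] > smoothness_history[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index
    return best_index

# Hypothetical smoothness values recorded at successive checkpoints.
history = [0.10, 0.20, 0.35, 0.40, 0.38, 0.30]
stop_at = stopping_iteration(history)
```

In practice one would keep the network weights saved at the returned checkpoint; a larger `patience` makes the rule less sensitive to noise in the monitored quantity.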
We also note that the lower bound of in Eq. (3) relates to the norm of weight matrices of neural networks:
[TABLE]
There have been some works to study the norm-based complexity of neural networks (see the Introduction), and these bounds typically scale with the product of the norms of the weight matrices, e.g., (Neyshabur et al., 2017a)
[TABLE]
where and are the number of nodes and the weight matrix in layer of a network with -layers, and is the margin quantity, which describes the goodness of fit of the trained network to the data. The product of the matrix norms depends exponentially on the depth, while some recent works show that the generalization bound could scale polynomially in depth under some assumptions (Nagarajan and Kolter, 2019a; Wei and Ma, 2019). The exploration of the dependence of on depth is left for future work.
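The exponential dependence on depth of a product of matrix norms can be seen in a toy computation. The sketch below is only an illustration: we use the spectral norm as one example of a matrix norm and hypothetical weight matrices with norm exactly 2; it does not reproduce the bound of Neyshabur et al. (2017a).

```python
import numpy as np

def product_of_spectral_norms(weights):
    """Product of the largest singular values of the given weight matrices."""
    return float(np.prod([np.linalg.norm(w, ord=2) for w in weights]))

# Hypothetical layers: each weight matrix is 2 * identity, so each norm is 2
# and the product grows as 2 ** depth.
products = {
    depth: product_of_spectral_norms([2.0 * np.eye(4) for _ in range(depth)])
    for depth in range(1, 6)
}
```

Here `products[5]` is 32 up to rounding, i.e., doubling with every added layer, which is the kind of growth that polynomial-in-depth bounds avoid.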
6 Conclusion
In this paper, we study the generalization error of neural networks for classification problems in terms of the data distribution and neural network smoothness. We first establish a new framework for classification problems. We introduce the cover complexity (CC) to measure the difficulty of learning a data set, an accuracy measure called -accuracy which is stronger than the standard classification accuracy, and the inverse of the modulus of continuity to quantify neural network smoothness. Subsequently, we derive a quantitative bound for the expected accuracy/error in Theorem 3.20, which considers both the cover complexity and neural network smoothness.
We validate our theoretical results on several image data sets. Our numerical results demonstrate that CC is a reliable measure for the difficulty of learning to classify a data set. On the other hand, we observe a clear consistency between test loss and neural network smoothness during the training process. We also show that neural network smoothness decreases as the network depth and width increase, that the effect of depth is more significant than that of width, and that the smoothness is insensitive to the training dataset size.
Acknowledgements
This work is supported by the DOE PhILMs project (No. DE-SC0019453), the AFOSR grant FA9550-17-1-0013, and the DARPA AIRA grant HR00111990025. The work of P. Jin and Y. Tang is partially supported by the Major Project on New Generation of Artificial Intelligence from MOST of China (Grant No. 2018AAA0101002), and the National Natural Science Foundation of China (Grant No. 11771438).
Appendix A Example of topology
Example A.22.
Given
[TABLE]
[TABLE]
then is the topology generated by .
In this example, is an open set, since it consists of all elements containing label , and is also an open set with common part . Besides open sets from base , is still an open set as the union of the two shown above.
Appendix B Proof of Proposition 2.3
Proof.
We use the proof by contradiction. Assume that the result does not hold, then
, and
Because is compact, there exist and a subsequence of such that As , also Choose any , then there exists a sufficiently large such that Therefore , which contradicts the assumption. ∎
Appendix C Proof of Proposition 2.4
Proof.
Let as defined in this proposition. For any two different points with distance less than , either , or at least one of the two is a full label point, in both cases . For any , according to the definition of , there exist two points satisfying
[TABLE]
The two facts imply that is the supremum of satisfying Proposition 2.3. ∎
Appendix D Proof of Proposition 2.7
Proof.
According to the definition,
[TABLE]
thus
[TABLE]
∎
Appendix E Estimate of total cover
In this section, we estimate the TC by the number of samples in the training set. The notations, such as , , , , as well as training set
[TABLE]
are the same as before. Note that samples in are drawn according to . Before presenting the analysis, we first collect the following auxiliary notions and results (Definitions E.23-E.26, Theorem E.27) which are easily found in Mitzenmacher and Upfal (2017) (Definitions 14.1-14.3, Definition 14.5, and Theorem 14.8):
Definition E.23.
A range space is a pair where:
1. is a (finite or infinite) set of points;
2. is a family of subsets of , called ranges.
Definition E.24.
Let be a range space and let . The projection of on is
[TABLE]
Definition E.25.
Let be a range space. A set is shattered by if . The Vapnik-Chervonenkis (VC) dimension of a range space is the maximum cardinality of a set that is shattered by . If there are arbitrarily large finite sets that are shattered by , then the VC dimension is infinite.
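For a finite range space, the projection, shattering, and the VC dimension can be checked by direct enumeration. The sketch below is a minimal illustration under the standard convention that a set is shattered when its projection contains all of its subsets; the helper names and the interval example are hypothetical and are not taken from Mitzenmacher and Upfal (2017).

```python
from itertools import combinations

def projection(ranges, subset):
    """All distinct intersections of the ranges with the given subset."""
    subset = frozenset(subset)
    return {frozenset(r) & subset for r in ranges}

def is_shattered(ranges, subset):
    """A set is shattered when every one of its subsets appears in the projection."""
    return len(projection(ranges, subset)) == 2 ** len(set(subset))

def vc_dimension(points, ranges):
    """Largest size of a shattered subset of a finite point set."""
    best = 0
    for size in range(1, len(points) + 1):
        if any(is_shattered(ranges, s) for s in combinations(points, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

# Hypothetical example: points 1..5 on a line, ranges are all integer intervals
# together with the empty set.
points = [1, 2, 3, 4, 5]
intervals = [set(range(i, j + 1)) for i in points for j in range(i, 6)] + [set()]
dim = vc_dimension(points, intervals)
```

Intervals shatter any two points but no three points, since the outer two of three ordered points cannot be selected without the middle one; hence `dim` equals 2 here.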
Definition E.26.
Let be a range space, and let be a probability distribution on . A set is an -net for with respect to if for any set such that , the set contains at least one point from , i.e.,
[TABLE]
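On a finite range space with an explicit distribution, the net condition can be verified directly. The following sketch is an illustration under our own assumptions: the points, the uniform distribution, and the interval ranges are hypothetical, and exact fractions are used to avoid rounding issues at the threshold.

```python
from fractions import Fraction

def is_epsilon_net(sample, ranges, prob, epsilon):
    """Check that every range of probability at least epsilon meets the sample."""
    sample = set(sample)
    for r in ranges:
        mass = sum(prob[p] for p in r)
        if mass >= epsilon and not (set(r) & sample):
            return False
    return True

# Hypothetical example: ten points with uniform probability, interval ranges.
points = list(range(10))
prob = {p: Fraction(1, 10) for p in points}
intervals = [set(range(i, j + 1)) for i in points for j in range(i, 10)]
sample = {2, 5, 8}
hits_heavy_ranges = is_epsilon_net(sample, intervals, prob, Fraction(3, 10))
```

The sample {2, 5, 8} meets every interval of at least three points, so it is a net at level 3/10, but it misses the interval {0, 1} and therefore is not a net at level 2/10.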
Theorem E.27.
Let be a range space with VC dimension and let be a probability distribution on . For any , there is an
[TABLE]
such that a random sample from of size greater than or equal to is an -net for with probability at least .
Now let
[TABLE]
we first show is a range space with VC dimension .
Lemma E.28.
The VC dimension of range space is .
Proof.
The proof can be found in Dudley (1979). ∎
Set
[TABLE]
and
[TABLE]
we have the following lemmas.
Lemma E.29.
when is an -net for .
Proof.
For any , we have for certain , with . Since is an -net and , we know . Thus there exists such that . Therefore
[TABLE]
The above inequality shows that
[TABLE]
∎
Lemma E.30.
.
Proof.
For any positive decreasing sequence which satisfies , it leads to an increasing chain
[TABLE]
where for . Let us consider a series of open balls of radius at most that cover , and we divide them into two parts such that and . Then , and thus . Therefore, we have .
Since
[TABLE]
by dominated convergence theorem, we have
[TABLE]
∎
According to the aforementioned lemmas, we deduce the following theorem.
Theorem E.31.
Let be the training set drawn according to , then for any , there exists an
[TABLE]
such that
[TABLE]
holds with probability at least when . Note that when .
Proof.
Theorem E.27 shows that is an -net for range space with probability at least when . By Lemma E.29, we have
[TABLE]
∎
From this theorem, we know that a large number of samples leads to a sufficiently large with high probability.
In the previous sections, there is an assumption that every training point has only one single correct label, so we will naturally consider this special case in the sequel.
Denote
[TABLE]
[TABLE]
[TABLE]
and is a range space with VC dimension at most . Let
[TABLE]
that is, the samples in are drawn according to . As before, denote
[TABLE]
[TABLE]
We have the following lemmas.
Lemma E.32.
when is an -net for .
Lemma E.33.
, here
[TABLE]
[TABLE]
From these two lemmas we deduce the following theorem.
Theorem E.34.
Let be the training set drawn according to , then for any , there exists an
[TABLE]
such that
[TABLE]
holds with probability at least when . Note that when for .
The proofs of Lemmas E.32-E.33 and Theorem E.34 are very similar to those of Lemmas E.29-E.30 and Theorem E.31, respectively, so we omit them here. It is noteworthy that is intuitively very close to 1, or even equal to 1. At worst, is at least greater than , which may be quite large in practice, and the proof is similar to that of Lemma E.30.
Appendix F Proof of Proposition 2.9
Proof.
Let be the training set and be a positive constant greater than 1, , , then
[TABLE]
For the same reason,
[TABLE]
therefore
[TABLE]
∎
Appendix G Proof of Proposition 3.17
Proof.
Denote by is accurate at x_{i}$$\}, . For any , choose such that . From the definition of we know that .
For any , from Proposition 2.3 we know that , and hence . On the other hand, , so we have , therefore
[TABLE]
which means that is accurate at , that is to say
[TABLE]
Then
[TABLE]
The second inequality can be derived from Proposition 2.7. ∎
Appendix H Proof of Theorem 3.18
Proof.
Define . Note that if , and , then , and hence , so that .
Because of
[TABLE]
we have
[TABLE]
so that . Therefore
[TABLE]
so that is c-accurate at . Overall we obtain
[TABLE]
The second inequality can be derived from Proposition 2.7. ∎
Appendix I Proof of Theorem 3.19
Proof.
We will prove it by contradiction. Consider two points and with different labels, and .
If , then by the definition of , , since . Similarly, . Then we have
[TABLE]
On the other hand, by the definition of , and , so , and , therefore
[TABLE]
∎
Appendix J Proof of Theorem 3.20
Proof.
From Theorem 3.18 and assumption we know that
[TABLE]
which implies . Note that is less than and is only determined by the classification problem itself. The above inequality is easy to convert into form (ii). ∎
Appendix K Detailed information of data and parameters for training
First, we list the information concerning the data selection.
1. MNIST: Last 55000 samples of the training set for training and all the 10000 samples of the test set for testing.
2. CIFAR-10: First 49000 samples of the training set for training and all the 10000 samples of the test set for testing.
3. CIFAR-100: First 49000 samples of the training set for training and all the 10000 samples of the test set for testing.
4. COIL-20: 1200 samples whose end numbers of the figure names are not multiples of 6 for training and 240 samples whose end numbers are multiples of 6 for testing.
5. COIL-100: 6000 samples whose end numbers of the figure names are not multiples of 30 for training and 1200 samples whose end numbers are multiples of 30 for testing.
6. SVHN: First 50000 samples of the training set for training and first 10000 samples of the test set for testing.
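The COIL splits above can be checked with a short sketch. The numbering of the figure names is an assumption on our part: we assume that each COIL-20 object has end numbers 0 to 71 and each COIL-100 object has end numbers 0, 5, ..., 355; the helper `split_by_multiple` is hypothetical.

```python
def split_by_multiple(end_numbers, modulus):
    """Send end numbers that are multiples of `modulus` to the test split."""
    train = [n for n in end_numbers if n % modulus != 0]
    test = [n for n in end_numbers if n % modulus == 0]
    return train, test

# Assumed numbering: COIL-20 has 20 objects with end numbers 0..71.
coil20_train, coil20_test = split_by_multiple(range(72), 6)
coil20_sizes = (20 * len(coil20_train), 20 * len(coil20_test))

# Assumed numbering: COIL-100 has 100 objects with end numbers 0, 5, ..., 355.
coil100_train, coil100_test = split_by_multiple(range(0, 360, 5), 30)
coil100_sizes = (100 * len(coil100_train), 100 * len(coil100_test))
```

Under these assumptions the split sizes agree with the counts listed above: 1200 and 240 samples for COIL-20, and 6000 and 1200 samples for COIL-100.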
Parameters for networks are listed in Table 3.
For generating convolution data, we choose the following structure
[TABLE]
with kernel size (strides ) and pool size (strides ), and then train this CNN with batch size , learning rate and the RMSProp optimizer for 5 epochs. After that, we extract the new data at the location indicated in the above structure by feeding the data to the trained network.
References
- Allen-Zhu et al. (2018a)
Allen-Zhu, Z., Li, Y., Liang, Y., 2018a.
Learning and generalization in overparameterized neural networks, going beyond two layers.
arXiv preprint arXiv:1811.04918 .
- Allen-Zhu et al. (2018b)
Allen-Zhu, Z., Li, Y., Song, Z., 2018b.
A convergence theory for deep learning via over-parameterization.
arXiv preprint arXiv:1811.03962 .
- Arora et al. (2019)
Arora, S., Du, S., Hu, W., Li, Z., Wang, R., 2019.
Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks.
arXiv preprint arXiv:1901.08584 .
- Arora et al. (2018)
Arora, S., Ge, R., Neyshabur, B., Zhang, Y., 2018.
Stronger generalization bounds for deep nets via a compression approach.
arXiv preprint arXiv:1802.05296 .
- Bartlett et al. (2017a)
Bartlett, P., Foster, D., Telgarsky, M., 2017a.
Spectrally-normalized margin bounds for neural networks, in: Advances in Neural Information Processing Systems, pp. 6240–6249.
- Bartlett et al. (2017b)
Bartlett, P., Harvey, N., Liaw, C., Mehrabian, A., 2017b.
Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks.
arXiv preprint arXiv:1703.02930 .
- Bartlett and Mendelson (2002)
Bartlett, P., Mendelson, S., 2002.
Rademacher and Gaussian Complexities: Risk bounds and structural results.
Journal of Machine Learning Research 3, 463–482.
- Baykal et al. (2018)
Baykal, C., Liebenwein, L., Gilitschenski, I., Feldman, D., Rus, D., 2018.
Data-dependent coresets for compressing neural networks with applications to generalization bounds.
arXiv preprint arXiv:1804.05345 .
- Belkin et al. (2018)
Belkin, M., Hsu, D., Ma, S., Mandal, S., 2018.
Reconciling modern machine learning and the bias-variance trade-off.
arXiv preprint arXiv:1812.11118 .
- Bottou (2010)
Bottou, L., 2010.
Large-scale machine learning with stochastic gradient descent, in: Proceedings of COMPSTAT’2010. Springer, pp. 177–186.
- Bottou and Bousquet (2008)
Bottou, L., Bousquet, O., 2008.
The tradeoffs of large scale learning, in: Advances in neural information processing systems, pp. 161–168.
- Cao and Gu (2019)
Cao, Y., Gu, Q., 2019.
A generalization theory of gradient descent for learning over-parameterized deep ReLU networks.
arXiv preprint arXiv:1902.01384 .
- Chen et al. (2018)
Chen, Y., Jin, C., Yu, B., 2018.
Stability and convergence trade-off of iterative optimization algorithms.
arXiv preprint arXiv:1804.01619 .
- Cheng et al. (2018)
Cheng, Y., Wang, D., Zhou, P., Zhang, T., 2018.
Model compression and acceleration for deep neural networks: The principles, progress, and challenges.
IEEE Signal Processing Magazine 35, 126–136.
- Cybenko (1989)
Cybenko, G., 1989.
Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems 2, 303–314.
- Dinh et al. (2017)
Dinh, L., Pascanu, R., Bengio, S., Bengio, Y., 2017.
Sharp minima can generalize for deep nets, in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org. pp. 1019–1028.
- Du et al. (2018)
Du, S., Lee, J., Li, H., Wang, L., Zhai, X., 2018.
Gradient descent finds global minima of deep neural networks.
arXiv preprint arXiv:1811.03804 .
- Dudley (1979)
Dudley, R., 1979.
Balls in R^k do not cut all subsets of k + 2 points.
Advances in Mathematics 31, 306–308.
- Dziugaite and Roy (2017)
Dziugaite, G., Roy, D., 2017.
Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data.
arXiv preprint arXiv:1703.11008 .
- Friedman et al. (2001)
Friedman, J., Hastie, T., Tibshirani, R., 2001.
The elements of statistical learning. volume 1.
Springer series in statistics New York.
- Gonen and Shalev-Shwartz (2017)
Gonen, A., Shalev-Shwartz, S., 2017.
Fast rates for empirical risk minimization of strict saddle problems.
arXiv preprint arXiv:1701.04271 .
- Gunasekar et al. (2018)
Gunasekar, S., Lee, J., Soudry, D., Srebro, N., 2018.
Implicit bias of gradient descent on linear convolutional networks, in: Advances in Neural Information Processing Systems, pp. 9461–9471.
- Hardt et al. (2015)
Hardt, M., Recht, B., Singer, Y., 2015.
Train faster, generalize better: Stability of stochastic gradient descent.
arXiv preprint arXiv:1509.01240 .
- Hornik et al. (1989)
Hornik, K., Stinchcombe, M., White, H., 1989.
Multilayer feedforward networks are universal approximators.
Neural networks 2, 359–366.
- Kawaguchi et al. (2017)
Kawaguchi, K., Kaelbling, L., Bengio, Y., 2017.
Generalization in deep learning.
arXiv preprint arXiv:1710.05468 .
- Keskar et al. (2016)
Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P., 2016.
On large-batch training for deep learning: Generalization gap and sharp minima.
arXiv preprint arXiv:1609.04836 .
- Kingma and Ba (2015)
Kingma, D., Ba, J., 2015.
Adam: A method for stochastic optimization.
International Conference on Learning Representations .
- Krizhevsky and Hinton (2009)
Krizhevsky, A., Hinton, G., 2009.
Learning multiple layers of features from tiny images.
Technical Report. Citeseer.
- Krizhevsky et al. (2012)
Krizhevsky, A., Sutskever, I., Hinton, G., 2012.
Imagenet classification with deep convolutional neural networks.
Neural Information Processing Systems 25.
doi:10.1145/3065386.
- Kuzborskij and Lampert (2017)
Kuzborskij, I., Lampert, C., 2017.
Data-dependent stability of stochastic gradient descent.
arXiv preprint arXiv:1703.01678 .
- LeCun et al. (1998)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al., 1998.
Gradient-based learning applied to document recognition.
Proceedings of the IEEE 86, 2278–2324.
- Lee et al. (2016)
Lee, J., Simchowitz, M., Jordan, M., Recht, B., 2016.
Gradient descent converges to minimizers.
arXiv preprint arXiv:1602.04915 .
- Li and Liang (2018)
Li, Y., Liang, Y., 2018.
Learning overparameterized neural networks via stochastic gradient descent on structured data, in: Advances in Neural Information Processing Systems, pp. 8157–8166.
- Liang et al. (2017)
Liang, T., Poggio, T., Rakhlin, A., Stokes, J., 2017.
Fisher-Rao metric, geometry, and complexity of neural networks.
arXiv preprint arXiv:1711.01530 .
- Liao and Poggio (2017)
Liao, Q., Poggio, T., 2017.
Theory II: Landscape of the empirical risk in deep learning.
arXiv preprint arXiv:1703.09833 .
- Lu et al. (2019)
Lu, L., Shin, Y., Su, Y., Karniadakis, G., 2019.
Dying ReLU and initialization: Theory and numerical examples.
arXiv preprint arXiv:1903.06733 .
- Lu et al. (2018)
Lu, L., Su, Y., Karniadakis, G., 2018.
Collapse of deep and narrow neural nets.
arXiv preprint arXiv:1808.04947 .
- Maas et al. (2013)
Maas, A., Hannun, A., Ng, A., 2013.
Rectifier nonlinearities improve neural network acoustic models, in: Proc. icml, p. 3.
- Mitzenmacher and Upfal (2017)
Mitzenmacher, M., Upfal, E., 2017.
Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis.
2nd ed., Cambridge University Press, New York, NY, USA.
- Nagarajan and Kolter (2019a)
Nagarajan, V., Kolter, J., 2019a.
Deterministic PAC-bayesian generalization bounds for deep networks via generalizing noise-resilience, in: International Conference on Learning Representations.
- Nagarajan and Kolter (2019b)
Nagarajan, V., Kolter, J., 2019b.
Generalization in deep networks: The role of distance from initialization.
arXiv preprint arXiv:1901.01672 .
- Nene et al. (1996a)
Nene, S., Nayar, S., Murase, H., 1996a.
Columbia object image library (coil-100).
Citeseer .
- Nene et al. (1996b)
Nene, S., Nayar, S., Murase, H., et al., 1996b.
Columbia object image library (coil-20).
Technical report CUCS-005-96 .
- Netzer et al. (2011)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A., 2011.
Reading digits in natural images with unsupervised feature learning, in: Advances in Neural Information Processing Systems.
- Neyshabur et al. (2017a)
Neyshabur, B., Bhojanapalli, S., McAllester, D., Srebro, N., 2017a.
Exploring generalization in deep learning, in: Advances in Neural Information Processing Systems, pp. 5947–5956.
- Neyshabur et al. (2017b)
Neyshabur, B., Bhojanapalli, S., Srebro, N., 2017b.
A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks.
arXiv preprint arXiv:1707.09564 .
- Neyshabur et al. (2019)
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., Srebro, N., 2019.
The role of over-parametrization in generalization of neural networks, in: International Conference on Learning Representations.
- Neyshabur et al. (2015)
Neyshabur, B., Salakhutdinov, R., Srebro, N., 2015.
Path-SGD: Path-normalized optimization in deep neural networks, in: Advances in Neural Information Processing Systems, pp. 2422–2430.
- Neyshabur et al. (2014)
Neyshabur, B., Tomioka, R., Srebro, N., 2014.
In search of the real inductive bias: On the role of implicit regularization in deep learning.
arXiv preprint arXiv:1412.6614 .
- Poggio et al. (2017)
Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., Mhaskar, H., 2017.
Theory of deep learning III: explaining the non-overfitting puzzle.
arXiv preprint arXiv:1801.00173 .
- Rahaman et al. (2018)
Rahaman, N., Arpit, D., Baratin, A., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A., 2018.
On the spectral bias of deep neural networks.
arXiv preprint arXiv:1806.08734 .
- Saxe et al. (2019)
Saxe, A.M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B.D., Cox, D.D., 2019.
On the information bottleneck theory of deep learning.
Journal of Statistical Mechanics: Theory and Experiment 2019, 124020.
- Shwartz-Ziv and Tishby (2017)
Shwartz-Ziv, R., Tishby, N., 2017.
Opening the black box of deep neural networks via information.
arXiv preprint arXiv:1703.00810 .
- Silver et al. (2016)
Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al., 2016.
Mastering the game of go with deep neural networks and tree search.
Nature 529, 484.
- Sokolic et al. (2016)
Sokolic, J., Giryes, R., Sapiro, G., Rodrigues, M., 2016.
Generalization error of invariant classifiers.
arXiv preprint arXiv:1610.04574 .
- Sokolić et al. (2017)
Sokolić, J., Giryes, R., Sapiro, G., Rodrigues, M., 2017.
Robust large margin deep neural networks.
IEEE Transactions on Signal Processing 65, 4265–4280.
- Soudry et al. (2018)
Soudry, D., Hoffer, E., Nacson, M., Gunasekar, S., Srebro, N., 2018.
The implicit bias of gradient descent on separable data.
The Journal of Machine Learning Research 19, 2822–2878.
- Wei and Ma (2019)
Wei, C., Ma, T., 2019.
Data-dependent sample complexity of deep neural networks via Lipschitz augmentation.
arXiv preprint arXiv:1905.03684 .
- Xu et al. (2019)
Xu, Z., Zhang, Y., Luo, T., Xiao, Y., Ma, Z., 2019.
Frequency principle: Fourier analysis sheds light on deep neural networks.
arXiv preprint arXiv:1901.06523 .
- Zhang et al. (2016)
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O., 2016.
Understanding deep learning requires rethinking generalization.
arXiv preprint arXiv:1611.03530 .
- Zhang et al. (2018)
Zhang, C., Liao, Q., Rakhlin, A., Miranda, B., Golowich, N., Poggio, T., 2018.
Theory of deep learning IIb: Optimization properties of SGD.
arXiv preprint arXiv:1801.02254 .
- Zhou et al. (2018)
Zhou, W., Veitch, V., Austern, M., Adams, R., Orbanz, P., 2018.
Compressibility and generalization in large-scale deep learning.
arXiv preprint arXiv:1804.05862 .
