Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness

Pengzhan Jin; Lu Lu; Yifa Tang; George Em Karniadakis

arXiv:1905.11427·stat.ML·November 3, 2021


TL;DR

This paper develops a theoretical framework to quantify deep neural network generalization error based on data complexity and network smoothness, validated through experiments on image datasets.

Contribution

It introduces the cover complexity measure and inverse modulus of continuity to analyze neural network generalization, linking theoretical bounds with empirical observations.

Findings

01

The expected error, normalized by the square root of the number of classes, is approximately linear in the cover complexity of the data set.

02

Test loss correlates with neural network smoothness during training.

03

Neural network smoothness decreases as the network size increases, but it is insensitive to the training dataset size.


Tables

Table 1: Cover complexity $CC(\mathcal{T})$, best error $\mathcal{E}(f)$, and normalized error $\frac{\mathcal{E}(f)}{\sqrt{K}}$ of different data sets. Different variants of data sets are used, including the original RGB or grey images (Original), grey images (Grey), and images extracted from a CNN (Conv). Images in CIFAR-100 have two variants: 100 categories (fine) and 20 categories (coarse). See Appendix K for details.

Data Set	Variants	Input dim ($d$)	Output dim ($K$)	$\rho_{\mathcal{T}}$	$CD(\mathcal{T})$	$CC(\mathcal{T})$	$\mathcal{E}(f)$	$\frac{\mathcal{E}(f)}{\sqrt{K}}$
MNIST	Original	784	10	.8480	.1053	1.442	.01	.0032
CIFAR-10	Original	3072	10	.8332	.0163	10.23	.45	.1423
CIFAR-10	Grey	1024	10	.8486	.0125	12.11	.53	.1676
CIFAR-10	Conv	1024	10	.9505	.0094	5.280	.18	.0569
SVHN	Original	3072	10	.9034	.0076	12.68	.49	.1550
SVHN	Grey	1024	10	.9117	.0084	10.48	.56	.1771
SVHN	Conv	1024	10	.9632	.0123	2.995	.23	.0727
CIFAR-100	Original (coarse)	3072	20	.8337	.0185	9.012	.62	.1386
CIFAR-100	Grey (coarse)	1024	20	.8541	.0132	11.08	.72	.1610
CIFAR-100	Conv (coarse)	1024	20	.9626	.0070	5.326	.40	.0894
COIL-20	Original	16384	20	.9176	.2385	.3453	.03	.0067
CIFAR-100	Original (fine)	3072	100	.8337	.0270	6.149	.73	.0730
CIFAR-100	Grey (fine)	1024	100	.8541	.0198	7.380	.81	.0810
CIFAR-100	Conv (fine)	1024	100	.9457	.0136	4.000	.52	.0520
COIL-100	Original	49152	100	.9430	.1944	.2930	.01	.0010
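
The derived columns of Table 1 can be cross-checked from the raw columns. The following is a minimal sketch, assuming $CC(\mathcal{T})=(1-\rho_{\mathcal{T}})/CD(\mathcal{T})$ and a normalized error of $\mathcal{E}(f)/\sqrt{K}$ as stated in the caption; the helper names `cover_complexity` and `normalized_error` are illustrative and not taken from the authors' repository. Only four rows are copied here to keep the example short.

```python
import math

# (name, K, rho_T, CD, CC, best_error, normalized_error), copied from Table 1.
rows = [
    ("MNIST / Original", 10, 0.8480, 0.1053, 1.442, 0.01, 0.0032),
    ("CIFAR-10 / Original", 10, 0.8332, 0.0163, 10.23, 0.45, 0.1423),
    ("COIL-20 / Original", 20, 0.9176, 0.2385, 0.3453, 0.03, 0.0067),
    ("CIFAR-100 / Conv (fine)", 100, 0.9457, 0.0136, 4.000, 0.52, 0.0520),
]


def cover_complexity(rho_t, cd):
    """CC = (1 - total cover) / cover difference."""
    return (1.0 - rho_t) / cd


def normalized_error(err, k):
    """Best error divided by the square root of the number of classes."""
    return err / math.sqrt(k)


for name, k, rho_t, cd, cc, err, norm_err in rows:
    # Recomputed values differ slightly from the table because the inputs are rounded.
    print(name, round(cover_complexity(rho_t, cd), 3), round(normalized_error(err, k), 4))
```

Because the tabulated $\rho_{\mathcal{T}}$ and $CD(\mathcal{T})$ are rounded to four digits, the recomputed CC values match the table only to within a fraction of a percent.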

Table 2: Comparison between $c$-accuracy $p_{c}(f)$ with $c=0.5$ and the lower bounds for different training set sizes for the one-dimensional problem. Here $\delta$ is indicated in Theorem 3.18. The neural network is trained for 1000 iterations by the Adam optimizer.

$n$	$L_{f}^{max}$	$\rho_{\mathcal{T}}$	$\delta$	$p_{c}(f)$	$h_{\mathcal{T}}(\delta)$	$1-\frac{\sqrt{d}}{\delta}(1-\rho_{\mathcal{T}})$
10	.285	.972	.045	1.0	0.80	0.38
20	.246	.988	.041	1.0	1.00	0.69
40	.182	.994	.041	1.0	1.00	0.85
80	.127	.997	.038	1.0	1.00	0.92

Table 3: The results in this table were obtained as follows: We used the ReLU as the activation function and Adam as the optimizer. Furthermore, we chose several different network structures and learning rates. We then trained the networks for 10000 iterations with a batch size of 300, and selected the best test error as the final result for each training procedure. Each of the columns $i$-$xii$ stands for a specific choice of network architecture and learning rate, as described in Table 4.

Data Set	Version	$i$	$ii$	$iii$	$iv$	$v$	$vi$	$vii$	$viii$	$ix$	$x$	$xi$	$xii$	Best Error
MNIST	Original	.02	.02	.05	.02	.02	.04	.02	.02	.04	.01	.02	.03	.01
CIFAR-10	Original	.47	.46	.52	.48	.46	.51	.47	.45	.50	.47	.45	.49	.45
CIFAR-10	Grey	.55	.55	.63	.55	.54	.62	.54	.53	.61	.54	.53	.59	.53
CIFAR-10	Conv	.18	.18	.19	.19	.18	.19	.18	.18	.18	.18	.18	.18	.18
SVHN	Original	.80	.59	.49	.80	.73	.60	.80	.69	.51	.80	.72	.64	.49
SVHN	Grey	.80	.64	.56	.80	.76	.66	.80	.64	.58	.80	.75	.66	.56
SVHN	Conv	.27	.23	.23	.31	.24	.23	.31	.24	.23	.69	.25	.24	.23
CIFAR-100	Original(coarse)	.64	.64	.69	.64	.63	.68	.63	.62	.67	.63	.62	.66	.62
CIFAR-100	Grey(coarse)	.74	.73	.79	.74	.73	.78	.74	.72	.78	.74	.72	.77	.72
CIFAR-100	Conv(coarse)	.40	.41	.45	.41	.41	.44	.40	.41	.44	.41	.40	.43	.40
COIL-20	Original	.08	.05	.05	.07	.06	.05	.08	.05	.05	.07	.05	.03	.03
CIFAR-100	Original(fine)	.75	.75	.82	.75	.74	.81	.74	.73	.80	.75	.73	.79	.73
CIFAR-100	Grey(fine)	.83	.83	.90	.83	.83	.90	.82	.81	.88	.83	.81	.87	.81
CIFAR-100	Conv(fine)	.53	.54	.64	.53	.54	.63	.52	.52	.61	.53	.52	.59	.52
COIL-100	Original	.02	.01	.01	.03	.02	.01	.03	.02	.01	.02	.02	.01	.01

Table 4: Detailed setup for each case.

Architecture \ Learning rate	$10^{-3}$	$10^{-4}$	$10^{-5}$
[Input 256 256 Output]	$i$	$ii$	$iii$
[Input 256 256 256 Output]	$iv$	$v$	$vi$
[Input 512 512 Output]	$vii$	$viii$	$ix$
[Input 512 512 512 Output]	$x$	$xi$	$xii$

Equations

D=[0,1]^{d},

T=2^{\{1,2,\cdots,K\}}\setminus\{\varnothing\}

\tau_{T}=\left\{\bigcup_{k\in I}U_{k}\,\Big|\,I\subseteq T\right\},

\forall x\in D\ \forall k\in f(x)\ \exists\delta>0\ \text{s.t.}\ k\in f(y)\ \text{if}\ \|y-x\|<\delta.

tag:D\to T

\forall x\in D\ \forall k\in tag(x)\ \exists\delta>0\ \text{s.t.}\ k\in tag(y)\ \text{if}\ \|y-x\|<\delta.

\delta_{0}=\inf\{\|x-y\|\ \mid\ tag(x)\neq tag(y),\ \text{and}\ tag(x),tag(y)\in\{\{1\},\{2\},\cdots,\{K\}\}\}.

\mathcal{T}=\{x_{1},x_{2},\cdots,x_{n}\}\subseteq D.

h_{\mathcal{T}}^{\mu}(r):=\mu\Big(D\cap\bigcup_{x_{i}\in\mathcal{T}}B(x_{i},r)\Big),

\rho(\mathcal{T},\mu):=\frac{1}{\sqrt{d}}\int_{0}^{\sqrt{d}}h_{\mathcal{T}}^{\mu}(r)\,dr.

\rho_{\mathcal{T}}:=\rho(\mathcal{T},\mu).

CD(\mathcal{T}):=\frac{1}{K}\sum_{i}\rho(\mathcal{T}_{i},\mu_{i})-\frac{1}{K(K-1)}\sum_{i\neq j}\rho(\mathcal{T}_{i},\mu_{j}),

\mathcal{T}_{i}:=\{x\in\mathcal{T}\mid i\in tag(x)\},\quad\mu_{i}(A):=\frac{\mu(A\cap D_{i})}{\mu(D_{i})}

CC(\mathcal{T}):=\frac{1-\rho_{\mathcal{T}}}{CD(\mathcal{T})}

h_{\mathcal{T}}^{\mu}(r)\geq 1-\frac{\sqrt{d}}{r}(1-\rho_{\mathcal{T}}),\quad 0<r\leq\sqrt{d}.

m=O\Big(\frac{d+1}{\epsilon}\ln\frac{d+1}{\epsilon}+\frac{1}{\epsilon}\ln\frac{1}{\eta}\Big),

\rho_{\mathcal{T}}\geq\varrho(\epsilon)

\hat{f}:x\mapsto\frac{e^{f(x)}}{\sum_{i}e^{f_{i}(x)}}

\exists\,i_{max}\in tag(x),\ \text{s.t.}\ f_{i_{max}}(x)>c.

p_{c}(f):=\frac{\mu(H_{c}^{f})}{\mu(D)}=\mu(H_{c}^{f}),

p_{c}^{\mathcal{T}}:=\frac{\rho_{\widetilde{\mathcal{T}}_{c}}}{\rho_{\mathcal{T}}},

p(f):=\frac{\mu(H^{f})}{\mu(D)}=\mu(H^{f}),

\forall\epsilon>0,\ \exists\delta>0,\ \text{s.t.}\ \|f(x)-f(y)\|_{\infty}<\epsilon\ \text{when}\ \|x-y\|<\delta.

\phi_{i}(x)=\sigma(W_{i}x+b_{i}),\quad 1\leq i\leq l-1,

f(x)=softmax(W_{l}(\phi_{l-1}\circ\cdots\circ\phi_{1}(x))+b_{l}),

W_{i}\in\mathbb{R}^{n_{i}\times n_{i-1}},\quad b_{i}\in\mathbb{R}^{n_{i}},\quad 1\leq i\leq l,

\delta_{f}(\epsilon)\geq\frac{\epsilon}{Lip(f)}\geq\frac{\epsilon}{\|W_{1}\|_{2}\cdots\|W_{l}\|_{2}\cdot Lip(softmax)},

p_{c_{2}}(f)\geq 1-\frac{\sqrt{d}}{\delta}(1-p_{c_{1}}^{\mathcal{T}}\rho_{\mathcal{T}}),

L_{f}^{max}:=\max_{1\leq i\leq n}\ell(f(x_{i}),k_{i})=-\min_{1\leq i\leq n}\ln(f_{k_{i}}(x_{i}))<-\ln c,

p_{c}(f)\geq 1-\frac{\sqrt{d}}{\delta}(1-\rho_{\mathcal{T}}),


Code & Models

Repositories

jpzxshi/generalization (PyTorch, official)


Full text

Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness

Pengzhan Jin*

Lu Lu*

Yifa Tang

George Em Karniadakis

*Pengzhan Jin and Lu Lu contributed equally to this work.

LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China

Division of Applied Mathematics, Brown University, Providence, RI 02912, USA

Abstract

The accuracy of deep learning, i.e., deep neural networks, can be characterized by dividing the total error into three main types: approximation error, optimization error, and generalization error. Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization. Most existing theoretical works for generalization fail to explain the performance of neural networks in practice. To derive a meaningful bound, we study the generalization error of neural networks for classification problems in terms of data distribution and neural network smoothness. We introduce the cover complexity (CC) to measure the difficulty of learning a data set and the inverse of the modulus of continuity to quantify neural network smoothness. A quantitative bound for expected accuracy/error is derived by considering both the CC and neural network smoothness. Although most of the analysis is general and not specific to neural networks, we validate our theoretical assumptions and results numerically for neural networks by several data sets of images. The numerical results confirm that the expected error of trained networks scaled with the square root of the number of classes has a linear relationship with respect to the CC. We also observe a clear consistency between test loss and neural network smoothness during the training process. In addition, we demonstrate empirically that the neural network smoothness decreases when the network size increases whereas the smoothness is insensitive to training dataset size.

Keywords: Neural networks, Generalization error, Learnability, Data distribution, Cover complexity, Neural network smoothness

1 Introduction

In the last 15 years, deep learning, i.e., deep neural networks (NNs), has been used very effectively in diverse applications, such as image classification (Krizhevsky et al., 2012), natural language processing (Maas et al., 2013), and game playing (Silver et al., 2016). Despite this remarkable success, our theoretical understanding of deep learning is lagging behind. The accuracy of NNs can be characterized by dividing the expected error into three main types: approximation (also called expressivity), optimization, and generalization (Bottou and Bousquet, 2008; Bottou, 2010), see Fig. 1. The well-known universal approximation theorem was obtained by Cybenko (1989) and Hornik et al. (1989) almost three decades ago stating that feed-forward neural nets can approximate essentially any function if their size is sufficiently large. In the past several years, there have been numerous studies that analyze the landscape of the non-convex objective functions, and the optimization process by stochastic gradient descent (SGD) (Lee et al., 2016; Liao and Poggio, 2017; Lu et al., 2018; Allen-Zhu et al., 2018b; Du et al., 2018; Lu et al., 2019). Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization, which is the focus of this study.

The classical analysis of generalization is based on controlling the complexity of the function class, i.e., model complexity, by managing the bias-variance trade-off (Friedman et al., 2001). However, this type of analysis is not able to explain the small generalization gap between training and test performance of neural networks learned by SGD in practice, considering the fact that deep neural networks often have far more model parameters than the number of samples they are trained on, and have sufficient capacity to memorize random labels (Neyshabur et al., 2014; Zhang et al., 2016). To explain this phenomenon, several approaches have been recently developed by many researchers. The first approach is characterizing neural networks with some other low “complexity” instead of the traditional Vapnik-Chervonenkis (VC) dimension (Bartlett et al., 2017b) or Rademacher complexity (Bartlett and Mendelson, 2002), such as path-norm (Neyshabur et al., 2015), margin-based bounds (Sokolić et al., 2017; Bartlett et al., 2017a; Neyshabur et al., 2017b), Fisher-Rao norm (Liang et al., 2017), and more (Neyshabur et al., 2019; Wei and Ma, 2019). The second approach is to analyze some good properties of SGD or its variants, including its stability (Hardt et al., 2015; Kuzborskij and Lampert, 2017; Gonen and Shalev-Shwartz, 2017; Chen et al., 2018), robustness (Sokolic et al., 2016; Sokolić et al., 2017), implicit biases/regularization (Poggio et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018; Nagarajan and Kolter, 2019b), and the structural properties (e.g., sharpness) of the obtained minimizers (Keskar et al., 2016; Dinh et al., 2017; Zhang et al., 2018). The third approach relies on overparameterization, e.g., sufficiently overparameterized networks can learn the ground truth with a small generalization error using SGD from random initialization (Li and Liang, 2018; Allen-Zhu et al., 2018a; Arora et al., 2019; Cao and Gu, 2019). 
There are also other approaches, such as compression (Arora et al., 2018; Baykal et al., 2018; Zhou et al., 2018; Cheng et al., 2018), Fourier analysis (Rahaman et al., 2018; Xu et al., 2019), “double descent” risk curve (Belkin et al., 2018), PAC-Bayesian framework (Neyshabur et al., 2017b; Nagarajan and Kolter, 2019a), and information bottleneck (Shwartz-Ziv and Tishby, 2017; Saxe et al., 2019).

However, most theoretical bounds fail to explain the performance of neural networks in practice (Neyshabur et al., 2017a; Arora et al., 2018). To get non-vacuous and tight enough bounds to be practically meaningful, some problem-specific factors should be taken into consideration, such as the low complexity (i.e., data-dependent analysis) (Dziugaite and Roy, 2017; Kawaguchi et al., 2017), or properties of the trained neural networks (Sokolić et al., 2017; Arora et al., 2018; Wei and Ma, 2019). In this study, to achieve a practically meaningful bound, our analysis relies on the data distribution and the smoothness of the trained neural network. The analysis proposed in this study provides guarantees on the generalization error, and theoretical insights to guide the practical application.

As shown in Fig. 1, the optimization error is correlated with the loss value (for notational simplicity, the term “loss” indicates “empirical loss”), while the approximation error depends on the network size. In addition, a small loss requires a sufficient approximation ability, i.e., a large network size, which in turn leads to a small approximation error. If we assume a sufficiently small loss, which usually holds in practice, then the expected error mainly depends on the generalization error. Hence, we study the expected error/accuracy directly. In particular, we propose a mathematical framework to analyze the expected accuracy of neural networks for classification problems. We introduce the concepts of total cover (TC), self cover (SC), mutual cover (MC) and cover difference (CD) to represent the data distribution, and then we use the concept of cover complexity (CC) as a measure of the complexity of classification problems. On the other hand, the smoothness of a neural network $f$ is characterized by the inverse of the modulus of continuity $\delta_{f}$. Because computing $\delta_{f}$ is not tractable in general, we propose an estimation using the spectral norm of the weight matrices of the neural network. The main terminologies are illustrated in Fig. 2. By combining the properties of the data distribution and the smoothness of neural networks, we derive a lower bound for the expected accuracy, i.e., an upper bound for the expected classification error.

Subsequently, we test our theoretical bounds on several data sets, including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CIFAR-100 (Krizhevsky and Hinton, 2009), COIL-20 (Nene et al., 1996b), COIL-100 (Nene et al., 1996a), and SVHN (Netzer et al., 2011). Our numerical results not only confirm our theoretical bounds, but also provide insights into the optimization process and the learnability of neural networks. In particular, we find that:

1. The best accuracy that can be achieved in practice (i.e., optimized by stochastic gradient descent) by fully-connected networks is approximately linear with respect to the cover complexity of the data set.

2. The trend of the expected accuracy is consistent with the smoothness of the neural network, which provides a new “early stopping” strategy by monitoring the smoothness of the neural network.

3. The neural network smoothness decreases when the network depth and width increase, with the effect of depth more significant than that of width.

4. The neural network smoothness is insensitive to the training dataset size, and is bounded from below by a positive constant. This point makes our theoretical result (Theorem 3.20) specifically pertinent to deep neural networks.

The paper is organized as follows. After setting up notation and terminology in Section 2, we present the main theoretical bounds for the accuracy based on the data distribution and the smoothness of neural networks in Section 3, while all proofs are deferred to the appendix. In Section 4, we provide the numerical results for several data sets. In Section 5 we include a discussion, and in Section 6 we summarize our findings.

2 Preliminaries

Before giving the main results, we introduce the necessary notation and terminology. Without loss of generality, we assume that the space we need to classify is

$$D=[0,1]^{d},$$

where $d$ is the dimensionality, and the points in this space are classified into $K$ categories, i.e., there are $K$ labels $\{1,2,\cdots,K\}$ . We denote the probability measure on $D$ by $\mu$ , i.e., for a measurable set $A\subseteq D$ , $\mu(A)$ is the probability of a random sample belonging to $A$ .

2.1 Ideal label function

For the problem setup, we assume that every sample has at least one true label, and one sample may have multiple true labels. Taking image classification as an example, each image has at least one correct label. A fuzzy image or an image with more than one object in it may have multiple possible correct labels, and as long as the prediction is one of these labels, we consider the prediction to be correct.

It is intuitive that when two samples are close enough, they should have similar labels, which means that the ideal label function should be continuous. The continuity of a mapping depends on the topology of both domain and image space. For the domain of the ideal label function, we choose the standard topology induced by the Euclidean metric. As for the topology of the image space, we define it as follows. We first define the label set and the topology on it.

Definition 2.1 (Topology).

Let

$$T=2^{\{1,2,\cdots,K\}}\setminus\{\varnothing\}$$

be the label set. Define the topology on $T$ to be

$$\tau_{T}=\left\{\bigcup_{k\in I}U_{k}\,\Big|\,I\subseteq T\right\},$$

where $U_{k}=\{V\in T|V\supseteq k\}$ for $k\in T$ , and thus $(T,\tau_{T})$ constitutes a topological space.

Analogous to the Euclidean metric topology, $U_{k}$ is viewed as the open “ball” centered at $k$, and arbitrary unions of the “balls” $U_{k}$ are defined as the open sets; see Appendix A for an example. With this choice of topology, a function $f:D\to T$ is continuous if and only if

$$\forall x\in D\ \forall k\in f(x)\ \exists\delta>0\ \text{s.t.}\ k\in f(y)\ \text{if}\ \|y-x\|<\delta.$$
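
To make Definition 2.1 concrete, the label set and the “balls” $U_{k}$ can be enumerated for a small number of categories. This is an illustrative sketch only; the function names `label_set` and `open_ball` are hypothetical, and label sets are represented as Python `frozenset` objects.

```python
from itertools import combinations


def label_set(K):
    """All non-empty subsets of {1, ..., K}, i.e. the label set T."""
    labels = range(1, K + 1)
    return [frozenset(c) for r in range(1, K + 1) for c in combinations(labels, r)]


def open_ball(k, T):
    """U_k = {V in T : V is a superset of k}."""
    return {V for V in T if V >= k}


T = label_set(2)
for k in T:
    # Print each element of T together with the members of its ball U_k.
    print(sorted(k), "->", sorted(sorted(V) for V in open_ball(k, T)))
```

For $K=2$, the ball around $\{1\}$ contains $\{1\}$ and $\{1,2\}$, while the ball around $\{1,2\}$ contains only $\{1,2\}$ itself.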

Next we give the definition of the ideal label function according to this topological space.

Definition 2.2 (Ideal label function).

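An ideal label function is a continuous function

$$tag:D\to T$$

where $D$ is equipped with the Euclidean metric topology and $T$ with the topology $\tau_{T}$ from Definition 2.1. This continuity holds if and only if

$$\forall x\in D\ \forall k\in tag(x)\ \exists\delta>0\ \text{s.t.}\ k\in tag(y)\ \text{if}\ \|y-x\|<\delta.\tag{1}$$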

Eq. (1) means that two neighboring points would have some common labels. Based on the topological space defined above, it is easy to show that Eq. (1) is equivalent to continuity. The reason why we consider a multi-label setup for classification problems is that it allows for the continuity property in Eq. (1), which is impossible in the setup of a single label set, unless the label function is constant. In addition, the multi-label setup introduces a smooth transition, i.e., a buffer domain, between two domains of different labels, while the transition is sharp in the single label setup. In the following proposition, we show that if two samples are close enough, they must share at least one common label.

Proposition 2.3 (Separation gap).

$\exists\delta>0,\ \text{s.t.}\ tag(x)\cap tag(y)\neq\varnothing\ \text{when}\ \|x-y\|<\delta.$ We denote the supremum of $\delta$ as the separation gap $\delta_{0}$, which is used in the sequel.

Proof.

The proof can be found in Appendix B. ∎

To understand the geometric interpretation of $\delta_{0}$ , we consider the following special case: the label of each sample is either a single label set, such as $\{1\}$ , or the full label set $\{1,2,\cdots,K\}$ if it is not uniquely identifiable.

Proposition 2.4 (Geometric interpretation of separation gap).

If the label of each sample is either a single label set or the full label set $\{1,2,\cdots,K\}$ , then $\delta_{0}$ is the smallest distance between two different single label points, i.e.,

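$$\delta_{0}=\inf\{\|x-y\|\ \mid\ tag(x)\neq tag(y),\ \text{and}\ tag(x),tag(y)\in\{\{1\},\{2\},\cdots,\{K\}\}\}.$$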

Proof.

The proof can be found in Appendix C. ∎
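
Under the special case of Proposition 2.4, a finite-sample analogue of $\delta_{0}$ can be computed directly from labeled points. The sketch below is an assumption-laden illustration: the helper `empirical_gap` is hypothetical, every sample is assumed to carry exactly one label, and the result only upper-bounds $\delta_{0}$ because the infimum in Proposition 2.4 runs over all of $D$ rather than over a finite sample.

```python
import math


def empirical_gap(points, labels):
    """Smallest Euclidean distance between two sampled points with different single labels."""
    best = math.inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] != labels[j]:
                best = min(best, math.dist(points[i], points[j]))
    return best


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
labels = [1, 1, 2]
print(empirical_gap(points, labels))
```

The double loop is quadratic in the number of samples, which is fine for a small illustration but would need a spatial index for large image data sets.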

2.2 Cover complexity of data set

In this subsection, we introduce a quantity to measure the difficulty of learning a training data set

$$\mathcal{T}=\{x_{1},x_{2},\cdots,x_{n}\}\subseteq D.$$

First, we give some notations and propositions.

With the measure $\mu$ , the probability of the neighborhood of the training set with radius of $r$ is defined as

$$h_{\mathcal{T}}^{\mu}(r):=\mu\Big(D\cap\bigcup_{x_{i}\in\mathcal{T}}B(x_{i},r)\Big),$$

where $B(x_{i},r)$ is the open ball centered at $x_{i}$ with radius of $r$ , see Fig. 2A. Obviously, $h_{\mathcal{T}}^{\mu}(r)$ is a monotone non-decreasing function, $h_{\mathcal{T}}^{\mu}(0)=0$ (since $B(x,0)=\varnothing$ ), and $h_{\mathcal{T}}^{\mu}(r)=1$ when $r>\sqrt{d}$ , see Fig. 2B. To represent the global behavior of $h_{\mathcal{T}}^{\mu}(r)$ , we use the integral of $h_{\mathcal{T}}^{\mu}(r)$ with respect to $r$ :

$$\rho(\mathcal{T},\mu):=\frac{1}{\sqrt{d}}\int_{0}^{\sqrt{d}}h_{\mathcal{T}}^{\mu}(r)\,dr.$$

Hence, $\rho(\mathcal{T},\mu)$ considers both the number and location of the data points, and also the probability distribution of the space. The value $\rho(\mathcal{T},\mu)$ is larger if the number of data points is increased and also if the probability distribution is more concentrated around $\mathcal{T}$, which we call the “coverability” of $\mathcal{T}$. We can increase $\rho(\mathcal{T},\mu)$ by adding more data points or redistributing their locations. Next, we introduce the formal definition for the “coverability”.
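
As a computational aside, both $h_{\mathcal{T}}^{\mu}(r)$ and $\rho(\mathcal{T},\mu)$ can be estimated by Monte Carlo when samples from $\mu$ (for example, held-out data) are available. The sketch below assumes the integral is normalized by $\sqrt{d}$, the diameter of $D=[0,1]^{d}$, consistent with $h_{\mathcal{T}}^{\mu}(r)=1$ for $r>\sqrt{d}$. It uses the identity $\int_{0}^{\sqrt{d}}\mathbf{1}[s<r]\,dr=\sqrt{d}-s$ for a sample at nearest-neighbor distance $s\leq\sqrt{d}$, so that $\rho$ is one minus the mean normalized nearest-neighbor distance. The function names are hypothetical, and this is not necessarily how the authors computed the values in Table 1.

```python
import math


def nearest_distance(y, train):
    """Euclidean distance from y to its nearest training point."""
    return min(math.dist(y, x) for x in train)


def h_estimate(r, train, samples):
    """Fraction of mu-samples lying in an open ball B(x_i, r) around some training point."""
    return sum(nearest_distance(y, train) < r for y in samples) / len(samples)


def rho_estimate(train, samples):
    """Estimate of (1/sqrt(d)) * integral of h(r) over r in [0, sqrt(d)]."""
    diam = math.sqrt(len(train[0]))  # sqrt(d), the diameter of the unit cube
    mean_dist = sum(min(nearest_distance(y, train), diam) for y in samples) / len(samples)
    return 1.0 - mean_dist / diam


train = [(0.5,)]
samples = [(0.0,), (0.5,), (1.0,)]
print(h_estimate(0.25, train, samples), rho_estimate(train, samples))
```

The strict inequality in `h_estimate` mirrors the open balls in the definition, so the estimate is 0 at $r=0$.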

Definition 2.5 (Coverability).

Let $\mathcal{T}$ be a data set from a domain $D$ with probability measure $\mu$ . We define the following for the coverability of $\mathcal{T}$ .

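(i) The total cover (TC) is

$$\rho_{\mathcal{T}}:=\rho(\mathcal{T},\mu).$$

Thus, $0\leq\rho_{\mathcal{T}}\leq 1$.

(ii) The cover difference (CD) is

$$CD(\mathcal{T}):=\frac{1}{K}\sum_{i}\rho(\mathcal{T}_{i},\mu_{i})-\frac{1}{K(K-1)}\sum_{i\neq j}\rho(\mathcal{T}_{i},\mu_{j}),$$

where $K$ is the number of categories, and $\mathcal{T}_{i}\subset\mathcal{T}$ and $\mu_{i}$ represent the subset and probability measure of the label $i$ respectively, i.e.,

$$\mathcal{T}_{i}:=\{x\in\mathcal{T}\mid i\in tag(x)\},\quad\mu_{i}(A):=\frac{\mu(A\cap D_{i})}{\mu(D_{i})}$$

with $D_{i}:=\{x\in D\mid i\in tag(x)\}$. Here, $\rho(\mathcal{T}_{i},\mu_{i})$ is called self cover (SC), and $\rho(\mathcal{T}_{i},\mu_{j})$ is called mutual cover (MC).

(iii) The cover complexity (CC) is

$$CC(\mathcal{T}):=\frac{1-\rho_{\mathcal{T}}}{CD(\mathcal{T})}$$

for $CD(\mathcal{T})\neq 0$.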
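
The three quantities of Definition 2.5 can be combined in a short sketch. It makes several assumptions that are not in the text: every point has a single label, samples from each class-conditional measure $\mu_{i}$ are available as `samples_by_class[i]`, pooling those samples approximates $\mu$ (reasonable only when classes are equally likely and equally sampled, as in Remark 2.6), and $\rho$ is estimated as one minus the mean nearest-neighbor distance divided by $\sqrt{d}$. All function names are hypothetical.

```python
import math


def rho_estimate(train, samples):
    """Monte Carlo estimate of rho(train, mu) from samples of mu."""
    diam = math.sqrt(len(train[0]))
    total = sum(min(min(math.dist(y, x) for x in train), diam) for y in samples)
    return 1.0 - total / (len(samples) * diam)


def cover_difference(train_by_class, samples_by_class):
    """Mean self cover minus mean mutual cover."""
    classes = sorted(train_by_class)
    K = len(classes)
    self_cover = sum(rho_estimate(train_by_class[i], samples_by_class[i]) for i in classes) / K
    mutual_cover = sum(
        rho_estimate(train_by_class[i], samples_by_class[j])
        for i in classes for j in classes if i != j
    ) / (K * (K - 1))
    return self_cover - mutual_cover


def cover_complexity(train_by_class, samples_by_class):
    """CC = (1 - total cover) / cover difference; undefined when CD is zero."""
    all_train = [x for pts in train_by_class.values() for x in pts]
    all_samples = [y for pts in samples_by_class.values() for y in pts]
    rho_t = rho_estimate(all_train, all_samples)
    return (1.0 - rho_t) / cover_difference(train_by_class, samples_by_class)


train = {0: [(0.1,)], 1: [(0.9,)]}
samples = {0: [(0.2,)], 1: [(0.8,)]}
print(cover_difference(train, samples), cover_complexity(train, samples))
```

In this toy one-dimensional example the classes are well separated, so the self cover is much larger than the mutual cover and the resulting CC is small, matching the intuition that clustered classes are easier to learn.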

Remark 2.6.

CD is defined as the difference between the mean of SC and the mean of MC, since each category occurs with the same probability ( $\sim 1/K$ ) in the data sets mostly used in practice. If there are some categories occurring more frequently than others, then it is straightforward to extend this definition by using the mean weighted by the probability of each category.

In image classification, the dimension of the image space is very high, and thus the data points are quite sparse. However, because images actually live on a manifold of low dimension, the probability density around $\mathcal{T}$ is actually high, which makes the TC meaningful. In our next result, we derive a lower bound of $h_{\mathcal{T}}^{\mu}(r)$ by $\rho_{\mathcal{T}}$.

Proposition 2.7.

Let $\mathcal{T}$ be a data set. $h_{\mathcal{T}}^{\mu}(r)$ and $\rho_{\mathcal{T}}$ are defined as above. Then we have

$$h_{\mathcal{T}}^{\mu}(r)\geq 1-\frac{\sqrt{d}}{r}(1-\rho_{\mathcal{T}}),\quad 0<r\leq\sqrt{d}.$$

Proof.

The proof can be found in Appendix D. ∎
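
One way to see why a bound of this form holds is sketched below; it is a short argument under the assumption that $\rho_{\mathcal{T}}$ is the integral of $h_{\mathcal{T}}^{\mu}$ over $[0,\sqrt{d}]$ normalized by $\sqrt{d}$, and it may differ from the proof given in the appendix. The first step uses the definition of $\rho_{\mathcal{T}}$, the second uses $1-h_{\mathcal{T}}^{\mu}\geq 0$, and the third uses that $1-h_{\mathcal{T}}^{\mu}$ is non-increasing.

```latex
\begin{aligned}
\sqrt{d}\,(1-\rho_{\mathcal{T}})
  &= \int_{0}^{\sqrt{d}} \bigl(1-h_{\mathcal{T}}^{\mu}(s)\bigr)\,ds
   \;\geq\; \int_{0}^{r} \bigl(1-h_{\mathcal{T}}^{\mu}(s)\bigr)\,ds \\
  &\geq\; r\,\bigl(1-h_{\mathcal{T}}^{\mu}(r)\bigr),
   \qquad 0<r\leq\sqrt{d}.
\end{aligned}
```

Dividing by $r$ and rearranging gives $h_{\mathcal{T}}^{\mu}(r)\geq 1-\frac{\sqrt{d}}{r}(1-\rho_{\mathcal{T}})$.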

From this proposition, we know that for a fixed $r$, $h_{\mathcal{T}}^{\mu}(r)$ is close to 1 if $\rho_{\mathcal{T}}$ is large enough. However, the probability distribution is usually given in practice, and we can only control the number of samples. The following theorem shows that $\rho_{\mathcal{T}}$ can be arbitrarily close to 1 when enough samples are available.

Theorem 2.8.

Let $\mathcal{T}$ be a data set of size $n$ drawn according to $\mu$ . Then there exists a non-increasing function $\varrho(\epsilon)$ satisfying $\lim\limits_{\epsilon\to 0}\varrho(\epsilon)=1$ , and for any $0<\eta,\epsilon\leq\frac{1}{2}$ , there exists an

$$m=O\Big(\frac{d+1}{\epsilon}\ln\frac{d+1}{\epsilon}+\frac{1}{\epsilon}\ln\frac{1}{\eta}\Big),$$

such that

$$\rho_{\mathcal{T}}\geq\varrho(\epsilon)$$

holds with probability at least $1-\eta$ when $n\geq m$ .

Proof.

The proof and some other results regarding TC can be found in Appendix E. ∎

CD is introduced because TC does not consider the labels of the data points. However, data points of the same label should be clustered in an easily learnable data set. $CD(\mathcal{T})$ is the difference of self cover and mutual cover, which considers the distributions of each label. By normalizing TC with CD, the cover complexity $CC(\mathcal{T})$ is able to measure the difficulty of learning a data set. The difficulty of a problem should be translation-independent and scale-independent. It is easy to see that $CC(\mathcal{T})$ is independent of translation, and the following proposition shows that it is also scale-independent.

Proposition 2.9 (Scale independence).

$CC(\mathcal{T})$ is scale-independent, i.e., if all the data points are scaled by the same factor less than 1, then $CC(\mathcal{T})$ is unchanged.

Proof.

The proof can be found in Appendix F. ∎

2.3 Setup for accuracy analysis

The setup for accuracy analysis is as follows.

Definition 2.10.

If $f:D\rightarrow\mathbb{R}^{K}$ is a continuous mapping, then the mapping

$$\hat{f}:x\mapsto\frac{e^{f(x)}}{\sum_{i}e^{f_{i}(x)}}$$

is still continuous, where $f_{i}(x)$ represents the $i$-th component of $f(x)$, and $e^{f(x)}\in\mathbb{R}^{K}$ is the exponential function applied componentwise to $f(x)$. We have $\hat{f}_{i}(x)>0$ and $\sum_{i}\hat{f}_{i}(x)=1$. For convenience, we directly consider the case that $f_{i}(x)>0$ and $\sum_{i}f_{i}(x)=1$, and we call such a mapping a normalized continuous positive mapping.

Remark 2.11.

A neural network with softmax nonlinearity is a normalized continuous positive mapping.
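
The normalization in Definition 2.10 is the usual softmax. A minimal sketch follows; subtracting the maximum before exponentiating is a standard numerical-stability step that is not mentioned in the text, and it leaves the result unchanged.

```python
import math


def softmax(z):
    """Map a real vector z to a positive vector whose entries sum to 1."""
    m = max(z)  # shifting by the max avoids overflow in exp without changing the output
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

Every output component is strictly positive and the components sum to 1 up to floating-point error, which is what makes the output a normalized positive mapping.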

Different from the accuracy usually used in classification problems, we define a stronger accuracy called $c$ -accuracy as follows.

Definition 2.12 ( $c$ -accuracy at $x$ ).

Let $f$ be a normalized continuous positive mapping. For $0.5\leq c<1$ , we say that $f$ is $c$ -accurate at point $x$ if

$$\exists\,i_{max}\in tag(x),\ \text{s.t.}\ f_{i_{max}}(x)>c.$$

Definition 2.13 ( $c$ -accuracy on $D$ ).

Let $f$ be a normalized continuous positive mapping. The $c$ -accuracy of $f$ on a sample space $D$ is defined as

$$p_{c}(f):=\frac{\mu(H_{c}^{f})}{\mu(D)}=\mu(H_{c}^{f}),$$

where $H_{c}^{f}:=\{x\in D\mid f\text{ is }c\text{-accurate at }x\}$.

Definition 2.14 ( $c$ -accuracy on $\mathcal{T}$ ).

Let $f$ be a normalized continuous positive mapping. The $c$ -accuracy of $f$ on a data set $\mathcal{T}$ is defined as

$$p_{c}^{\mathcal{T}}:=\frac{\rho_{\widetilde{\mathcal{T}}_{c}}}{\rho_{\mathcal{T}}},$$

where $\widetilde{\mathcal{T}}_{c}:=\{x\in\mathcal{T}\mid f\text{ is }c\text{-accurate at }x\}$, and $\rho_{\widetilde{\mathcal{T}}_{c}}$ and $\rho_{\mathcal{T}}$ are the TC of $\widetilde{\mathcal{T}}_{c}$ and $\mathcal{T}$, respectively.

Definition 2.15 (Expected accuracy).

Let $f$ be a normalized continuous positive mapping. The expected accuracy of $f$ on a sample space $D$ is defined as

$$p(f):=\frac{\mu(H^{f})}{\mu(D)}=\mu(H^{f}),$$

where $H^{f}:=\{x\in D|\exists i_{max}\in tag(x),\ s.t.\ f_{i_{max}}(x)>f_{i}(x)\ \forall 1\leq i\leq K,\ i\neq i_{max}\}$ .

We note that the $c$ -accuracy of $f$ on $D$ represents the expected $c$ -accuracy, and the $c$ -accuracy of $f$ on $\mathcal{T}$ represents the empirical $c$ -accuracy.
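
The pointwise conditions behind Definitions 2.12 and 2.15 are easy to state in code. In this sketch the helper names are hypothetical, labels are 0-based indices into the output vector (an implementation convenience, whereas the text numbers labels from 1), and `tag` is the set of true labels of the point.

```python
def is_c_accurate(probs, tag, c):
    """True if some true label i in `tag` has probs[i] > c."""
    if not 0.5 <= c < 1:
        raise ValueError("c must satisfy 0.5 <= c < 1")
    return any(probs[i] > c for i in tag)


def is_correct(probs, tag):
    """True if the strict argmax of `probs` is one of the true labels."""
    top = max(range(len(probs)), key=probs.__getitem__)
    strict = all(probs[top] > p for i, p in enumerate(probs) if i != top)
    return strict and top in tag


print(is_c_accurate([0.7, 0.2, 0.1], {0}, 0.5), is_correct([0.45, 0.35, 0.2], {0}))
```

Since the outputs are positive and sum to 1, a component larger than $c\geq 0.5$ is automatically the unique maximum, so $c$-accuracy at a point implies a correct prediction there, while the converse can fail, as the second call illustrates.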

Finally, we define a non-decreasing function $\delta_{f}$ to describe the smoothness of $f$ .

Definition 2.16 (Smoothness).

Let $f:D\to\mathbb{R}^{K}$ be a continuous mapping. Then $f$ is uniformly continuous due to the compactness of $D$ , i.e.

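$$\forall\epsilon>0,\ \exists\delta>0,\ \text{s.t.}\ \|f(x)-f(y)\|_{\infty}<\epsilon\ \text{when}\ \|x-y\|<\delta.$$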

We denote the supremum of $\delta$ satisfying the above requirement by $\delta_{f}(\epsilon)$ . It is easy to see that $\delta_{f}(\epsilon)$ is equal to the inverse of the modulus of continuity of $f$ .

For low dimensional problems, we can directly compute $\delta_{f}$ by brute force. However, for high dimensional problems, it is intractable to compute $\delta_{f}$, and thus we give the following lower bound of $\delta_{f}$ for a fully-connected ReLU network $f$ with softmax as the activation function in the last layer, which is also the main network structure considered throughout this work.
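
The brute-force computation for low dimensional problems can be sketched for $d=1$ as follows. It relies on the observation that the supremum in Definition 2.16 equals the smallest distance $|x-y|$ at which $\|f(x)-f(y)\|_{\infty}\geq\epsilon$. The function name, the grid resolution `n_grid`, and the use of a uniform grid are assumptions made for illustration, and the result is only accurate up to the grid spacing.

```python
def delta_f_bruteforce(f, eps, n_grid=201):
    """Grid approximation of delta_f(eps) for a mapping f from [0, 1] to R^K.

    Returns the smallest grid distance at which the sup-norm difference of f
    reaches eps, or None if no pair of grid points reaches it.
    """
    xs = [i / (n_grid - 1) for i in range(n_grid)]
    values = [f(x) for x in xs]
    for gap in range(1, n_grid):  # gaps are scanned in increasing order
        for i in range(n_grid - gap):
            diff = max(abs(a - b) for a, b in zip(values[i], values[i + gap]))
            if diff >= eps:
                return xs[i + gap] - xs[i]
    return None


print(delta_f_bruteforce(lambda x: [x, 1.0 - x], 0.25))
```

The cost is quadratic in the number of grid points, which is why this approach is limited to low dimensions.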

A fully-connected neural network is defined as follows:

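$$\phi_{i}(x)=\sigma(W_{i}x+b_{i}),\quad 1\leq i\leq l-1,$$

$$f(x)=softmax(W_{l}(\phi_{l-1}\circ\cdots\circ\phi_{1}(x))+b_{l}),$$

$$W_{i}\in\mathbb{R}^{n_{i}\times n_{i-1}},\quad b_{i}\in\mathbb{R}^{n_{i}},\quad 1\leq i\leq l,$$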

where $n_{i}$ is the number of neurons in the layer $i$ ( $n_{0}=d$ and $n_{l}=K$ ), and $\sigma$ is the activation function. Then for the ReLU activation function, we have

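$$\delta_{f}(\epsilon)\geq\frac{\epsilon}{Lip(f)}\geq\frac{\epsilon}{\|W_{1}\|_{2}\cdots\|W_{l}\|_{2}\cdot Lip(softmax)},$$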

where $\|\cdot\|_{2}$ is the spectral norm, $Lip(f)$ and $Lip(softmax)$ represent the Lipschitz constants of $f$ and $softmax$ mapping from $(D,\|\cdot\|_{2})$ to $(\mathbb{R}^{K},\|\cdot\|_{\infty})$ , respectively. $Lip(softmax)$ is a constant less than $1/2$ , and thus is ignored in our numerical examples. We note that although the lower bound of $\delta_{f}(\epsilon)$ depends exponentially on the neural net depth, $\delta_{f}(\epsilon)$ itself does not necessarily scale exponentially with the network depth.
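
The spectral-norm lower bound can be evaluated from the weight matrices alone. The sketch below is dependency-free and purely illustrative: `spectral_norm` uses power iteration on $W^{\top}W$ with a seeded random start (an implementation choice, not the authors' method), weight matrices are plain lists of rows, and `lip_softmax` defaults to 1, mirroring the statement that $Lip(softmax)$ is ignored in the numerical examples.

```python
import math
import random


def spectral_norm(W, n_iter=200, seed=0):
    """Largest singular value of W (a list of rows), via power iteration on W^T W."""
    rng = random.Random(seed)
    n_rows, n_cols = len(W), len(W[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]  # random start avoids unlucky orthogonality
    for _ in range(n_iter):
        Wv = [sum(w * x for w, x in zip(row, v)) for row in W]
        WtWv = [sum(W[r][c] * Wv[r] for r in range(n_rows)) for c in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in WtWv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in WtWv]
    Wv = [sum(w * x for w, x in zip(row, v)) for row in W]
    return math.sqrt(sum(x * x for x in Wv))


def delta_lower_bound(weights, eps, lip_softmax=1.0):
    """eps divided by the product of the spectral norms and Lip(softmax)."""
    product = lip_softmax
    for W in weights:
        product *= spectral_norm(W)
    return eps / product


weights = [[[3.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [1.0, 0.0]]]
print(delta_lower_bound(weights, 0.6))
```

In practice a numerical library routine for singular values would replace the hand-written power iteration; the product form of the bound relies on ReLU being 1-Lipschitz.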

3 Lower bounds for the expected accuracy

In this section, we present a theoretical analysis of the lower bound for the expected accuracy as well as an upper bound for the expected error.

Proposition 3.17.

Let $f$ be a normalized continuous positive mapping. Suppose that $\mathcal{T}$ is a single label training set, i.e. $tag(\mathcal{T})\subseteq\{\{1\},\{2\},\cdots,\{K\}\}$ . For any $0.5\leq c_{2}<c_{1}\leq 1$ , we have

[TABLE]

where $\delta=\min(\delta_{0},\delta_{f}(c_{1}-c_{2}))$ .

Proof.

The proof can be found in Appendix G. ∎

Proposition 3.17 shows that the expected $c_{2}$ -accuracy of $f$ can be bounded by the empirical $c_{1}$ -accuracy and the TC of the training set. We can see that $p_{c_{2}}(f)$ tends to 1 when $\rho_{\mathcal{T}}$ and $p_{c_{1}}^{\mathcal{T}}$ tend to 1. Next we derive a bound for the accuracy by taking into account the loss function.

Theorem 3.18 (Lower bound of $c$ -accuracy).

Let $f$ be a normalized continuous positive mapping. Suppose that $\mathcal{T}=\{x_{1},\cdots,x_{n}\}$ is a single label training set, and $tag(x_{i})=\{k_{i}\}$ . For any $0.5\leq c<1$ , if the maximum cross entropy loss

$$L_{f}^{max}:=\max_{1\leq i\leq n}\ell(f(x_{i}),k_{i})<-\ln c,$$

then we have

[TABLE]

where $\ell$ is the cross entropy loss that $\ell(f(x),k)=-\ln(f_{k}(x))$ , $\delta=\min(\delta_{0},\delta_{f}(e^{-L_{f}^{max}}-c))$ , and $\delta_{0}$ is defined in Proposition 2.3.

Proof.

The proof can be found in Appendix H. ∎
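To make the quantities in Theorem 3.18 concrete, the sketch below computes the maximum cross entropy loss of a set of predicted probability vectors and the argument $e^{-L_{f}^{max}}-c$ at which $\delta_{f}$ is evaluated. The function names and the zero-based label convention are assumptions made for illustration.

```python
import math
from typing import Sequence


def max_cross_entropy(
    probs: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Maximum of -ln f_k(x) over the training samples (labels are 0-based)."""
    return max(-math.log(p[k]) for p, k in zip(probs, labels))


def smoothness_argument(
    probs: Sequence[Sequence[float]], labels: Sequence[int], c: float = 0.5
) -> float:
    """Return e^{-L_max} - c, requiring the condition L_max < -ln c."""
    l_max = max_cross_entropy(probs, labels)
    if l_max >= -math.log(c):
        raise ValueError("condition L_max < -ln(c) is violated")
    return math.exp(-l_max) - c


# Toy predictions (assumption): the least confident correct output is 0.8.
toy_probs = [[0.9, 0.1], [0.2, 0.8]]
toy_labels = [0, 1]
arg = smoothness_argument(toy_probs, toy_labels, c=0.5)
```

Since $e^{-L_{f}^{max}}$ equals the smallest probability assigned to a correct label, a single poorly fitted training sample is enough to violate the condition.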

Theorem 3.18 reveals that the expected accuracy is related to the total cover $\rho_{\mathcal{T}}$ , separation gap $\delta_{0}$ , neural network smoothness $\delta_{f}$ , and loss value $L_{f}^{max}$ . We will show numerically in Section 4 that $\delta_{f}(e^{-L_{f}^{max}}-c)$ increases first and then decreases during the training of neural networks. The following theorem states that the maximum value of $\delta_{f}(e^{-L_{f}^{max}}-c)$ is bounded by the empirical separation gap.

Theorem 3.19 (Empirical separation gap).

Let $f$ be a normalized continuous positive mapping. Suppose that $\mathcal{T}=\{x_{1},\cdots,x_{n}\}$ is a single label training set. For any $0.5\leq c<1$ , if $L_{f}^{max}<-\ln c$ , then we have

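$$\delta_{f}(e^{-L_{f}^{max}}-c)\leq\frac{\delta_{\mathcal{T}}}{2},$$

where

$$\delta_{\mathcal{T}}:=\min_{tag(x_{i})\neq tag(x_{j})}\|x_{i}-x_{j}\|$$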

is called the empirical separation gap, i.e., the smallest distance between two differently labeled training points.

Proof.

The proof can be found in Appendix I. ∎
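The empirical separation gap is straightforward to compute for a small training set. The following sketch scans all pairs of differently labeled points; the function name and the toy data are assumptions, and the quadratic cost would need a spatial index for large data sets.

```python
import math
from typing import Sequence


def empirical_separation_gap(
    points: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Smallest Euclidean distance between two differently labeled points."""
    best = float("inf")
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if labels[i] != labels[j]:
                best = min(best, math.dist(points[i], points[j]))
    return best


# Toy training set (assumption): two points of class 0 and one of class 1.
toy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
toy_labels = [0, 0, 1]
gap = empirical_separation_gap(toy_points, toy_labels)
```

By Theorem 3.19, half of this value caps $\delta_{f}(e^{-L_{f}^{max}}-c)$ whenever $L_{f}^{max}<-\ln c$.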

Besides the upper bound, the lower bound of $\delta_{f}$ is also important to the accuracy. We have observed that in practice $\delta_{f}$ is bounded from below (Figs. 5, 6, 7 and 8), which indicates the existence of $\kappa$ in the following theorem. Based on this observation, we have the following theorem for the accuracy.

Theorem 3.20 (Lower bound of accuracy).

Assume that there exists a constant $\kappa>0$ , such that

[TABLE]

holds for any single label training set $\mathcal{T}$ and a corresponding suitable trained network $f$ on $\mathcal{T}$ such that $L_{f}^{max}(\mathcal{T})<\ln 2$ , then we have the following conclusions for the expected accuracy $p(f)$ and the expected error $\mathcal{E}(f)=1-p(f)$ :

(i)

under the same conditions as in Theorem 3.18, $p(f)\geq 1-\frac{\sqrt{d}}{\delta}(1-\rho_{\mathcal{T}})$ ,

(ii)

$\mathcal{E}(f)\leq\frac{\sqrt{d}\cdot(1-\rho_{\mathcal{T}})}{\min(\delta_{0},\kappa\delta_{\mathcal{T}})}=\alpha(\mathcal{T})\cdot CC(\mathcal{T}),$

(iii)

$\lim\limits_{\rho_{\mathcal{T}}\to 1}p(f)=1,$

where $\delta=\min(\delta_{0},\delta_{f}(e^{-L_{f}^{max}}-c))$ , and $\alpha(\mathcal{T})=\frac{\sqrt{d}\cdot CD(\mathcal{T})}{\min(\delta_{0},\kappa\delta_{\mathcal{T}})}$ .

Proof.

(i) is the conclusion of Theorem 3.18. The proof of (ii) and (iii) can be found in Appendix J. ∎
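As a small worked illustration of the error bound in (ii), the sketch below evaluates $\frac{\sqrt{d}\cdot(1-\rho_{\mathcal{T}})}{\min(\delta_{0},\kappa\delta_{\mathcal{T}})}$ for given inputs. All numerical values are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
import math


def error_upper_bound(
    d: int, rho: float, delta_0: float, kappa: float, delta_T: float
) -> float:
    """Evaluate sqrt(d) * (1 - rho) / min(delta_0, kappa * delta_T)."""
    return math.sqrt(d) * (1.0 - rho) / min(delta_0, kappa * delta_T)


# Hypothetical inputs: d = 4, total cover 0.9, delta_0 = 0.5,
# kappa = 0.25, empirical separation gap 1.0.
bound = error_upper_bound(d=4, rho=0.9, delta_0=0.5, kappa=0.25, delta_T=1.0)
```

The bound is informative only when it falls below 1, which requires the total cover $\rho_{\mathcal{T}}$ to be close to 1 relative to $\sqrt{d}/\min(\delta_{0},\kappa\delta_{\mathcal{T}})$; it vanishes as $\rho_{\mathcal{T}}\to 1$, in line with (iii).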

Remark 3.21.

We have the following remarks for this main theorem:

1.

In (ii), we rewrite $\frac{\sqrt{d}\cdot(1-\rho_{\mathcal{T}})}{\min(\delta_{0},\kappa\delta_{\mathcal{T}})}$ to emphasize $CC(\mathcal{T})$ by introducing a coefficient term $\alpha(\mathcal{T})$ . Although this seems artificial, our numerical experiments empirically show that in practice the linear scaling between $\mathcal{E}(f)$ and $CC(\mathcal{T})$ is indeed satisfied.

2.

This theorem is not specific to neural networks, but rather holds for any trained model $f$ satisfying the assumptions required in the theorem. While the assumptions are not true for a general machine learning algorithm, we show numerically that in practice neural networks satisfy the assumptions.

3.

The current proof of the theorem relies on the assumption on the maximum cross entropy $L_{f}^{max}$ , which could be hard to satisfy in practice, since $L_{f}^{max}$ is large even if only a single training sample is misclassified. However, our experiments suggest that this assumption can be relaxed.

Here, the cover complexity $CC(\mathcal{T})$ consists of two parts: one represents the richness of the whole training set, while the other describes the degree of separation between differently labeled subsets. As for $\alpha(\mathcal{T})$ , both the denominator and the numerator seem to be positively correlated with the separation level. Ideally, $\alpha(\mathcal{T})$ is close to a constant with high probability, so that the expected error $\mathcal{E}(f)$ is mainly determined by $CC(\mathcal{T})$ , which approximately represents the complexity level of the data set. We provide more details in the section on numerical results.

4 Numerical results

In this section, we use numerical simulations to test the accuracy of neural networks in terms of the data distribution (cover complexity) and neural network smoothness. In addition, we study the effects of the network size and training dataset size on the smoothness. The code is published on GitHub (https://github.com/jpzxshi/generalization).

4.1 Data distribution

In this subsection, we explore how $CC(\mathcal{T})$ affects the expected error $\mathcal{E}(f)$ . In our experiments, we test several data sets, including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CIFAR-100 (Krizhevsky and Hinton, 2009), COIL-20 (Nene et al., 1996b), COIL-100 (Nene et al., 1996a), and SVHN (Netzer et al., 2011). In addition to the original data sets, we also create some variants: (1) grey images, (2) images extracted from a convolutional layer after training a convolutional neural network (CNN) on the original data set, and (3) data sets in which several categories are combined into one to reduce the total number of categories; see Table 1 and the details in Appendix K.

For a training data set $\mathcal{T}$ , we estimate $h_{\mathcal{T}}^{\mu}(r)$ by the proportion of the test data points within the balls with radius $r$ centered at training data points, i.e.,

[TABLE]

and then $\rho_{\mathcal{T}}$ is obtained by Definition 2.5. Similarly, we estimate $CD(\mathcal{T})$ and then compute $CC(\mathcal{T})$ . Next, for each data set, we train fully-connected neural networks with different hyperparameters and record the best error we observed; see the details in Appendix K. The cover complexity and the best error for each data set are shown in Table 1.
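The estimate of $h_{\mathcal{T}}^{\mu}(r)$ described above can be sketched as follows: count the fraction of test points that lie within distance $r$ of at least one training point. The function name, the use of open balls, and the toy points are assumptions for illustration; the subsequent step from $h_{\mathcal{T}}^{\mu}$ to $\rho_{\mathcal{T}}$ follows Definition 2.5 and is not reproduced here.

```python
import math
from typing import Sequence

Point = Sequence[float]


def coverage_fraction(
    train: Sequence[Point], test: Sequence[Point], r: float
) -> float:
    """Fraction of test points inside an open ball of radius r
    centered at some training point."""
    covered = sum(
        1 for x in test if any(math.dist(x, t) < r for t in train)
    )
    return covered / len(test)


# Toy data (assumption): one training point, one near and one far test point.
toy_train = [(0.0, 0.0)]
toy_test = [(0.5, 0.0), (2.0, 0.0)]
h_r = coverage_fraction(toy_train, toy_test, r=1.0)
```

This direct scan costs one distance evaluation per train-test pair, so for data sets of the size used here a vectorized or tree-based nearest-neighbor search would be the practical choice.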

These data sets are divided into three groups according to their output dimensions. For each group of the same output dimension, the error is almost linearly correlated with $CC(\mathcal{T})$ , see Fig. 3A, regardless of the input dimension. In addition, we find that all the cases collapse into a single line when normalizing the error $\mathcal{E}(f)$ by a factor of $1/\sqrt{K}$ , see Fig. 3B.

It is noteworthy that the $CC(\mathcal{T})$ of convolutional variants of data sets is much smaller than that of the original data sets, and hence the expected accuracy increases. The results confirm the importance of data distribution.

Next, we consider the most difficult data set, i.e., data with random labels. We choose MNIST and then assign each image a random label. We repeat this process 50 times, and compute each $CC(\mathcal{T})$ . The distribution of $|CC(\mathcal{T})|$ is shown in Fig. 4. The smallest $|CC(\mathcal{T})|$ is $\sim$ 300, which is much larger than that of the original data sets with $CC(\mathcal{T})<20$ . This extreme example again confirms that $CC(\mathcal{T})$ is a proper measure of the difficulty of classifying a data set.

4.2 Neural network smoothness

In this subsection, we will investigate the relationship between the neural network smoothness $\delta_{f}(e^{-L_{f}^{max}}-c)$ and the accuracy, and the effects of network size (depth and width) on the smoothness. We first show results for one- and two-dimensional problems, where $\delta_{f}(e^{-L_{f}^{max}}-c)$ can be computed accurately by brute force. Subsequently, we consider the high dimensional setting of the MNIST data set, where we estimate $\delta_{f}(e^{-L_{f}^{max}}-c)$ by Eq. (3).

4.2.1 One- and two-dimensional problems

We first consider a one-dimensional case and a two-dimensional case. For the one-dimensional case, we choose the sample space $D=[0,1]$ , $K=2$ , and the ideal label function as

[TABLE]

with separation gap $\delta_{0}=0.1$ . We use $n$ equispaced points ( $n\geq 4$ is an even number) on $D\setminus(0.5-\frac{\delta_{0}}{2},0.5+\frac{\delta_{0}}{2})$ as the training set, i.e., $\mathcal{T}=\mathcal{T}_{1}\cup\mathcal{T}_{2}$ , where

[TABLE]

We choose 10000 equispaced points on $D$ as the test data.

For the two-dimensional case, we choose the sample space $D=[0,1]^{2}$ , $K=2$ , and the ideal label function as

[TABLE]

with $\delta_{0}=0.1$ . For the training set, we first choose $n=m^{2}$ equispaced points, i.e., $\mathcal{T}=\{0,\frac{1}{m-1},\frac{2}{m-1},\dots,\frac{m-2}{m-1},1\}^{2}$ , and then remove the points with label $\{1,2\}$ to ensure that all samples are of single label. We choose $1000^{2}$ equispaced points on $D$ as the test data.

In our experiments, we use a 3-layer fully-connected NN with ReLU activation and 30 neurons per layer. The neural network is trained for 1000 iterations by the Adam optimizer (Kingma and Ba, 2015) for the one-dimensional problem, and 2000 iterations for the two-dimensional problem. For the one-dimensional problem, the $c$ -accuracy $p_{c}(f)$ with $c=0.5$ and lower bounds for different numbers of training points are listed in Table 2. We can see that the bounds become tighter when $n$ is larger.

During the training process of the neural network, the test loss first decreases and then increases, while $\delta_{f}$ first increases and then decreases, see Fig. 5A for the one-dimensional problem ( $n=20$ ) and Fig. 5B for the two-dimensional problem ( $n=400$ ). $\delta_{f}$ is bounded by $\delta_{\mathcal{T}}/2$ , as proved in Theorem 3.19. We also observe that the trends of test loss and $\delta_{f}$ coincide, and thus we should stop the training when $\delta_{f}$ begins to decrease to prevent overfitting.

4.2.2 High-dimensional problem

In the high-dimensional problem of MNIST, we consider the average loss $L_{f}$ instead of the maximum loss $L_{f}^{max}$ , which is very sensitive to extreme points. As shown in Eq. (3), we use the following quantity to bound $\delta_{f}(e^{-L_{f}}-c)$ :

[TABLE]

Because we use the $c$ -accuracy to approximate the true accuracy, for the classification problems with two categories, $p_{0.5}(f)$ and $p(f)$ are equivalent. However, they are not equal for problems with more than two categories, where the best $c$ depends on the properties of the data set, such as the easiness of learning to classify the data set. If the data set is easy to classify, such as MNIST, the best $c$ should be close to 1. In our example, we choose $c=0.9$ . We train MNIST using a 3-layer fully-connected NN with ReLU activation and 100 neurons per layer for 100 epochs. In Fig. 6, we can also see the consistency between the test loss and neural network smoothness, as we observed in the low-dimensional problems.

4.2.3 Effects of the network size and training dataset size on the smoothness

We have demonstrated that network smoothness $\delta_{f}(e^{-L_{f}^{max}}-c)$ is an important factor for the accuracy. Next, we investigate the effects of network size (depth and width) on the smoothness for binary classification problems, set up as follows. We consider the one-dimensional sample space $D=[0,1]$ , and choose $n$ equispaced points on $D$ as the training data locations. To avoid the effects of the choice of target true functions, we always repeat experiments with different target functions, and in each experiment we generate a random target function. Specifically, to generate a random target function, we first sample two random functions $g_{1}(x)$ and $g_{2}(x)$ from a Gaussian process with a radial basis function kernel of length scale 0.2, and then assign a point $x$ to category 1 if $g_{1}(x)>g_{2}(x)$ , and to category 2 otherwise. When training neural networks, we monitor the value of $\delta_{f}(e^{-L_{f}^{max}}-c)$ and stop the training once $\delta_{f}$ begins to decrease, as shown in Figs. 5 and 6. We first choose the dataset size $n=10$ , and we show that the normalized smoothness $\delta_{f}/\delta_{\mathcal{T}}$ decreases as the network depth or width increases (Fig. 7). We also show that the effect of depth is more significant than that of width.
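The random target functions can be sketched with the Python standard library alone. The snippet below draws two Gaussian process samples on the training locations through a Cholesky factor of the kernel matrix and labels each point by comparing them. The kernel parametrization $\exp(-(x-y)^{2}/(2\ell^{2}))$, the diagonal jitter, the seed handling, and all function names are assumptions for illustration and may differ from the authors' implementation.

```python
import math
import random
from typing import List, Sequence


def rbf_kernel(
    xs: Sequence[float], length_scale: float, jitter: float = 1e-6
) -> List[List[float]]:
    """RBF kernel matrix with a small diagonal jitter for numerical stability."""
    n = len(xs)
    return [
        [
            math.exp(-0.5 * ((xs[i] - xs[j]) / length_scale) ** 2)
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def cholesky(A: List[List[float]]) -> List[List[float]]:
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def sample_gp(
    xs: Sequence[float], length_scale: float, rng: random.Random
) -> List[float]:
    """One zero-mean GP sample at xs, computed as L z with z standard normal."""
    L = cholesky(rbf_kernel(xs, length_scale))
    z = [rng.gauss(0.0, 1.0) for _ in xs]
    return [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(len(xs))]


def random_binary_labels(
    xs: Sequence[float], length_scale: float = 0.2, seed: int = 0
) -> List[int]:
    """Category 1 where g1(x) > g2(x), category 2 otherwise."""
    rng = random.Random(seed)
    g1 = sample_gp(xs, length_scale, rng)
    g2 = sample_gp(xs, length_scale, rng)
    return [1 if a > b else 2 for a, b in zip(g1, g2)]


# n = 10 equispaced training locations on [0, 1].
xs = [i / 9 for i in range(10)]
labels = random_binary_labels(xs, seed=0)
```

Repeating the call with different seeds gives the different target functions over which the experiments are averaged.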

Our main theorem (Theorem 3.20) requires the assumption (Eq. 4), which would not be true for a general machine learning algorithm. Here, we verify this assumption for neural networks by numerical experiments. Specifically, we train fully-connected neural networks using training datasets of different size $n$ . We show that $\delta_{f}/\delta_{\mathcal{T}}$ is insensitive to the training dataset size, and is always bounded from below by a positive constant $\kappa$ (Fig. 8). This result reveals that neural networks fit a dataset in a relatively smooth way during the training process.

5 Discussion

When neural networks are used to solve classification problems, we expect that the accuracy is dependent on some properties of the data set. It is still quite surprising, however, that there is a linear relationship between the accuracy and the cover complexity of the data set, as we have seen in Section 4.1. Theorem 3.20(ii) provides an upper bound of the error, but a lower bound is missing. To fully explain this observation, two conjectures of the learnability of fully-connected neural networks are proposed: when a neural network $f$ is trained on a data set $\mathcal{T}$ in such a way that $L_{f}^{max}(\mathcal{T})<\ln 2$ and $\delta_{f}(e^{-L}-c)\asymp\delta_{\mathcal{T}}$ , then we have

1.

$\mathcal{E}(f)\approx c(K)\cdot CC(\mathcal{T})$ , where $c(K)$ is a constant depending only on $K$ .

2.

$\frac{\mathcal{E}(f)}{\sqrt{K}}\approx c\cdot CC(\mathcal{T})$ , where $c\approx 0.014$ is a constant.

On the other hand, the theoretical and numerical results provide a better understanding of the generalization of neural network from the training procedure. The smoothness $\delta_{f}(e^{-L}-c)$ of neural networks plays a key role, where $L$ is the maximum cross entropy loss $L_{f}^{max}$ or the average cross entropy loss $L_{f}$ . We can see that:

1.

$\delta_{f}(e^{-L}-c)$ depends on both the regularity of $f$ and the loss value $L$ (which also depends on $f$ ). Large $\delta_{f}(e^{-L}-c)$ requires good regularity and large $e^{-L}-c$ , i.e., small $L$ . However, small $L$ could correspond to bad regularity of $f$ . Thus, there is a trade-off between the loss value $L$ and the regularity of $f$ .

2.

Due to this trade-off, $\delta_{f}(e^{-L}-c)$ first increases and then decreases during the training process. Hence, we should not optimize neural networks excessively. Instead, we should stop the training early when $\delta_{f}(e^{-L}-c)$ begins to decrease, which leads to another “early stopping” strategy to prevent overfitting.
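One way to turn this observation into a stopping rule is sketched below: track the monitored smoothness at each evaluation and keep the checkpoint with the largest value once it has failed to improve for a fixed number of evaluations. The `patience` parameter and the function name are assumptions borrowed from common early-stopping practice and are not part of the paper.

```python
from typing import Sequence


def stop_index(deltas: Sequence[float], patience: int = 1) -> int:
    """Index of the checkpoint to keep when training stops once the
    monitored smoothness has not improved for `patience` evaluations."""
    best_idx = 0
    waited = 0
    for i in range(1, len(deltas)):
        if deltas[i] > deltas[best_idx]:
            best_idx, waited = i, 0  # new maximum, reset the counter
        else:
            waited += 1
            if waited >= patience:
                break  # smoothness has started to decrease
    return best_idx


# Hypothetical smoothness trace: rises, then falls.
trace = [0.10, 0.20, 0.30, 0.25, 0.20]
keep = stop_index(trace, patience=1)
```

A patience larger than 1 guards against stopping on noise in the estimate of $\delta_{f}$, at the cost of a few extra training steps.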

We also note that the lower bound of $\delta_{f}(e^{-L}-c)$ in Eq. (3) relates to the norm of weight matrices of neural networks:

[TABLE]

There have been some works to study the norm-based complexity of neural networks (see the Introduction), and these bounds typically scale with the product of the norms of the weight matrices, e.g., (Neyshabur et al., 2017a)

[TABLE]

where $h_{i}$ and $W_{i}$ are the number of nodes and the weight matrix in layer $i$ of a network with $l$ -layers, and $\gamma_{margin}$ is the margin quantity, which describes the goodness of fit of the trained network to the data. The product of the matrix norms depends exponentially on the depth, while some recent works show that the generalization bound could scale polynomially in depth under some assumptions (Nagarajan and Kolter, 2019a; Wei and Ma, 2019). The exploration of the dependence of $\delta_{f}(e^{-L}-c)$ on depth is left for future work.

6 Conclusion

In this paper, we study the generalization error of neural networks for classification problems in terms of the data distribution and neural network smoothness. We first establish a new framework for classification problems. We introduce the cover complexity (CC) to measure the difficulty of learning a data set, an accuracy measure called $c$ -accuracy which is stronger than the standard classification accuracy, and the inverse of the modulus of continuity to quantify neural network smoothness. Subsequently, we derive a quantitative bound for the expected accuracy/error in Theorem 3.20, which considers both the cover complexity and neural network smoothness.

We validate our theoretical results on several image data sets. Our numerical results demonstrate that CC is a reliable measure for the difficulty of learning to classify a data set. On the other hand, we observe a clear consistency between test loss and neural network smoothness during the training process. We also show that neural network smoothness decreases when the network depth and width increase, and that the effect of depth is more significant than that of width, while the smoothness is insensitive to the training dataset size.

Acknowledgements

This work is supported by the DOE PhILMs project (No. DE-SC0019453), the AFOSR grant FA9550-17-1-0013, and the DARPA AIRA grant HR00111990025. The work of P. Jin and Y. Tang is partially supported by the Major Project on New Generation of Artificial Intelligence from MOST of China (Grant No. 2018AAA0101002), and the National Natural Science Foundation of China (Grant No. 11771438).

Appendix A Example of topology

Example A.22.

Given

[TABLE]

then $\tau_{T}$ is the topology generated by $U$ .

In this example, $\{\{1\},\{1,2\},\{1,3\},\{1,2,3\}\}$ is an open set, since it consists of all elements containing label $\{1\}$ , and $\{\{2,3\},\{1,2,3\}\}$ is also an open set with common part $\{2,3\}$ . Besides open sets from base $U$ , $\{\{1\},\{1,2\},\{1,3\},\{2,3\},\{1,2,3\}\}$ is still an open set as the union of the two shown above.

Appendix B Proof of Proposition 2.3

Proof.

We use the proof by contradiction. Assume that the result does not hold, then

$\exists\{x_{n}\},\{y_{n}\}\subseteq D,\ s.t.\ \|x_{n}-y_{n}\|\rightarrow 0(n\rightarrow\infty)$ , and $tag(x_{n})\cap tag(y_{n})=\varnothing.$

Because $D$ is compact, there exist $x^{*}\in D$ and a subsequence $\{x_{k_{n}}\}$ of $\{x_{n}\}$ such that $\ x_{k_{n}}\rightarrow x^{*}.$ As $\|x_{n}-y_{n}\|\rightarrow 0$ , also $\ y_{k_{n}}\rightarrow x^{*}.$ Choose any $i_{0}\in tag(x^{*})$ , then there exists a sufficiently large $k_{n_{0}}$ such that $\ i_{0}\in tag(x_{k_{n_{0}}}),\ i_{0}\in tag(y_{k_{n_{0}}}).$ Therefore $tag(x_{k_{n_{0}}})\cap tag(y_{k_{n_{0}}})\neq\varnothing$ , which contradicts the assumption. ∎

Appendix C Proof of Proposition 2.4

Proof.

Let $\delta_{0}$ be as defined in this proposition. For any two different points $x,y$ with distance less than $\delta_{0}$ , either $tag(x)=tag(y)$ , or at least one of the two is a full label point; in both cases $tag(x)\cap tag(y)\neq\varnothing$ . For any $\delta>\delta_{0}$ , according to the definition of $\delta_{0}$ , there exist two points $x_{0},y_{0}$ satisfying

[TABLE]

The two facts imply that $\delta_{0}$ is the supremum of $\delta$ satisfying Proposition 2.3. ∎

Appendix D Proof of Proposition 2.7

Proof.

According to the definition,

[TABLE]

thus

[TABLE]

∎

Appendix E Estimate of total cover

In this section, we estimate the TC by the number of samples in the training set. The notations, such as $D$ , $d$ , $\mu$ , $\rho_{\mathcal{T}}$ , as well as training set

[TABLE]

are the same as before. Note that samples in $\mathcal{T}$ are drawn according to $\mu$ . Before presenting the analysis, we first collect the following auxiliary notions and results (Definitions E.23-E.26, Theorem E.27) which are easily found in Mitzenmacher and Upfal (2017) (Definitions 14.1-14.3, Definition 14.5, and Theorem 14.8):

Definition E.23.

A range space is a pair $(X,\mathcal{R})$ where:

1.

$X$ is a (finite or infinite) set of points;

2.

$\mathcal{R}$ is a family of subsets of $X$ , called ranges.

Definition E.24.

Let $(X,\mathcal{R})$ be a range space and let $Y\subseteq X$ . The projection of $\mathcal{R}$ on $Y$ is

$$\mathcal{R}_{Y}=\{R\cap Y\ |\ R\in\mathcal{R}\}.$$

Definition E.25.

Let $(X,\mathcal{R})$ be a range space. A set $Y\subseteq X$ is shattered by $\mathcal{R}$ if $|\mathcal{R}_{Y}|=2^{|Y|}$ . The Vapnik-Chervonenkis (VC) dimension of a range space $(X,\mathcal{R})$ is the maximum cardinality of a set $Y\subseteq X$ that is shattered by $\mathcal{R}$ . If there are arbitrarily large finite sets that are shattered by $\mathcal{R}$ , then the VC dimension is infinite.

Definition E.26.

Let $(X,\mathcal{R})$ be a range space, and let $\mathcal{F}$ be a probability distribution on $X$ . A set $N\subseteq X$ is an $\epsilon$ -net for $X$ with respect to $\mathcal{F}$ if for any set $R\in\mathcal{R}$ such that $Pr_{\mathcal{F}}(R)\geq\epsilon$ , the set $R$ contains at least one point from $N$ , i.e.,

$$\forall R\in\mathcal{R},\quad Pr_{\mathcal{F}}(R)\geq\epsilon\ \Rightarrow\ R\cap N\neq\varnothing.$$

Theorem E.27.

Let $(X,\mathcal{R})$ be a range space with VC dimension $d_{vc}$ and let $\mathcal{F}$ be a probability distribution on $X$ . For any $0<\eta,\epsilon\leq 1/2$ , there is an

$$m=O\left(\frac{d_{vc}}{\epsilon}\ln\frac{d_{vc}}{\epsilon}+\frac{1}{\epsilon}\ln\frac{1}{\eta}\right)$$

such that a random sample from $\mathcal{F}$ of size greater than or equal to $m$ is an $\epsilon$ -net for $X$ with probability at least $1-\eta$ .

Now let

[TABLE]

we first show $(D,B_{D})$ is a range space with VC dimension $d+1$ .

Lemma E.28.

The VC dimension of range space $(D,B_{D})$ is $d+1$ .

Proof.

The proof can be found in Dudley (1979). ∎
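For intuition about Definitions E.24 and E.25 in the simplest case $d=1$ , where balls are intervals, the following sketch enumerates the projection of all intervals on a finite point set and checks whether the set is shattered. It relies on the fact that the trace of an interval on sorted points is always a contiguous run. The function names are illustrative assumptions.

```python
from typing import FrozenSet, Sequence, Set


def interval_projection(points: Sequence[float]) -> Set[FrozenSet[float]]:
    """All subsets of `points` that can be cut out by some interval."""
    pts = sorted(set(points))
    ranges: Set[FrozenSet[float]] = {frozenset()}  # an interval missing all points
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            ranges.add(frozenset(pts[i : j + 1]))  # contiguous run pts[i..j]
    return ranges


def is_shattered(points: Sequence[float]) -> bool:
    """True if intervals realize all 2^n subsets of the n distinct points."""
    n = len(set(points))
    return len(interval_projection(points)) == 2 ** n


two_points_shattered = is_shattered([0.2, 0.7])
three_points_shattered = is_shattered([0.1, 0.5, 0.9])
```

Two points are shattered, while three points are not, because no interval contains the two outer points without the middle one. This matches the value $d+1=2$ in Lemma E.28 for $d=1$ .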

Set

[TABLE]

and

[TABLE]

we have the following lemmas.

Lemma E.29.

$\bigcup\limits_{A\in B_{D}^{\epsilon}(\frac{1}{2}r)}A\subseteq D\cap\bigcup\limits_{x_{i}\in\mathcal{T}}B(x_{i},r)$ when $\mathcal{T}$ is an $\epsilon$ -net for $(D,B_{D})$ .

Proof.

For any $x\in\bigcup\limits_{A\in B_{D}^{\epsilon}(\frac{1}{2}r)}A$ , we have $x\in A^{*}=B(x^{*},s^{*})\cap D$ for certain $s^{*}\leq\frac{1}{2}r$ , with $\mu(A^{*})\geq\epsilon$ . Since $\mathcal{T}$ is an $\epsilon$ -net and $\mu(A^{*})\geq\epsilon$ , we know $\mathcal{T}\cap A^{*}\neq\varnothing$ . Thus there exists $x_{t}\in\mathcal{T}$ such that $\|x_{t}-x^{*}\|<s^{*}\leq\frac{1}{2}r$ . Therefore

[TABLE]

The above inequality shows that

[TABLE]

∎

Lemma E.30.

$\lim\limits_{\epsilon\to 0}\varrho(\epsilon)=1$ .

Proof.

For any positive decreasing sequence $\{\epsilon_{i}\}$ which satisfies $\lim_{i\to\infty}\epsilon_{i}=0$ , it leads to an increasing chain

[TABLE]

where $\widetilde{D}=\bigcup_{A\in B_{D}^{+}(\frac{1}{2}r)}A$ for $B_{D}^{+}(\frac{1}{2}r)=\{A\in B_{D}|A=B(x,s)\cap D,0<s\leq\frac{1}{2}r,\mu(A)>0\}$ . Let us consider a series of open balls $\{B_{i}\}$ of radius at most $\frac{1}{2}r$ that cover $D$ , and we divide them into two parts $\{B_{i}\}=\{B_{i}^{+}\}\cup\{B_{i}^{0}\}$ such that $\mu(B_{i}^{+}\cap D)>0$ and $\mu(B_{i}^{0}\cap D)=0$ . Then $\mu(\widetilde{D})\geq\mu(D\cap\bigcup_{i}B_{i}^{+})\geq\mu(D)-\mu(D\cap\bigcup_{i}B_{i}^{0})=1-0=1$ , and thus $\mu(\widetilde{D})=1$ . Therefore, we have $\lim_{\epsilon\rightarrow 0}\mu(\bigcup_{A\in B_{D}^{\epsilon}(\frac{1}{2}r)}A)=\mu(\widetilde{D})=1$ .

Since

[TABLE]

by dominated convergence theorem, we have

[TABLE]

∎

According to the aforementioned lemmas, we deduce the following theorem.

Theorem E.31.

Let $\mathcal{T}=\{x_{1},x_{2},\dots,x_{n}\}$ be the training set drawn according to $\mu$ , then for any $0<\eta,\epsilon\leq\frac{1}{2}$ , there exists an

[TABLE]

such that

[TABLE]

holds with probability at least $1-\eta$ when $n\geq m$ . Note that $\varrho(\epsilon)\rightarrow 1$ when $\epsilon\rightarrow 0$ .

Proof.

Theorem E.27 shows that $\mathcal{T}$ is an $\epsilon$ -net for range space $(D,B_{D})$ with probability at least $1-\eta$ when $n\geq m$ . By Lemma E.29, we have

[TABLE]

∎

From this theorem, we know that a large number of samples leads to a sufficiently large $\rho_{\mathcal{T}}$ with high probability.

In the previous sections, there is an assumption that every training point has only one single correct label, so we will naturally consider this special case in the sequel.

Denote

[TABLE]

and $(D_{sin},B_{D_{sin}})$ is a range space with VC dimension at most $d+1$ . Let

[TABLE]

that is, the samples in $\mathcal{T}$ are drawn according to $\mu_{sin}$ . As before, denote

[TABLE]

We have the following lemmas.

Lemma E.32.

$\bigcup\limits_{A\in B_{D_{sin}}^{\epsilon}(\frac{1}{2}r)}A\subseteq D\cap\bigcup\limits_{x_{i}\in\mathcal{T}}B(x_{i},r)$ when $\mathcal{T}$ is an $\epsilon$ -net for $(D_{sin},B_{D_{sin}})$ .

Lemma E.33.

$\lim\limits_{\epsilon\to 0}\varrho_{sin}(\epsilon)=c_{sin}$ , here

[TABLE]

From these two lemmas we deduce the following theorem.

Theorem E.34.

Let $\mathcal{T}=\{x_{1},x_{2},\dots,x_{n}\}$ be the training set drawn according to $\mu_{sin}$ , then for any $0<\eta,\epsilon\leq\frac{1}{2}$ , there exists an

[TABLE]

such that

[TABLE]

holds with probability at least $1-\eta$ when $n\geq m$ . Note that $\varrho_{sin}(\epsilon)\rightarrow c_{sin}$ when $\epsilon\rightarrow 0$ for $c_{sin}\geq\mu(D_{sin})$ .

The proofs of Lemmas E.32-E.33 and Theorem E.34 are very similar to those of Lemmas E.29-E.30 and Theorem E.31, respectively, so we omit them here. It is noteworthy that $c_{sin}$ is intuitively very close to 1, or even equal to 1. At worst, $c_{sin}$ is at least $\mu(D_{sin})$ , which may be quite large in practice; the proof is similar to that of Lemma E.30.

Appendix F Proof of Proposition 2.9

Proof.

Let $\mathcal{T}$ be the training set and $\lambda$ be a positive constant greater than 1, $\widetilde{\mathcal{T}}=\mathcal{T}/\lambda$ , $\widetilde{\mu}(A)=\mu(\lambda A)$ , then

[TABLE]

For the same reason,

[TABLE]

therefore

[TABLE]

∎

Appendix G Proof of Proposition 3.17

Proof.

Let $\widetilde{\mathcal{T}}_{c_{1}}=\{x_{i}\in\mathcal{T}\ |\ f\text{ is }c_{1}\text{-accurate at }x_{i}\}$ and $\delta=\min(\delta_{0},\delta_{f}(c_{1}-c_{2}))$ . For any $x_{i}\in\widetilde{\mathcal{T}}_{c_{1}}$ , choose $k_{i}\in\{1,\cdots,K\}$ such that $tag(x_{i})=\{k_{i}\}$ . From the definition of $\widetilde{\mathcal{T}}_{c_{1}}$ we know that $f_{k_{i}}(x_{i})>c_{1}$ .

For any $x\in B(x_{i},\delta)\cap D$ , from Proposition 2.3 we know that $tag(x)\cap tag(x_{i})\neq\varnothing$ , and hence $k_{i}\in tag(x)$ . On the other hand, $\|f(x)-f(x_{i})\|_{\infty}<c_{1}-c_{2}$ , so we have $|f_{k_{i}}(x)-f_{k_{i}}(x_{i})|<c_{1}-c_{2}$ , therefore

[TABLE]

which means that $f$ is $c_{2}$ -accurate at $x$ , that is to say

[TABLE]

Then

[TABLE]

The second inequality can be derived from Proposition 2.7. ∎

Appendix H Proof of Theorem 3.18

Proof.

Define $\delta:=\min(\delta_{0},\delta_{f}(e^{-L_{f}^{max}}-c))$ . Note that if $x_{i}\in\mathcal{T}$ , and $x\in B(x_{i},\delta)\cap D$ , then $\|x-x_{i}\|<\delta_{0}$ , and hence $tag(x)\cap tag(x_{i})\neq\varnothing$ , so that $k_{i}\in tag(x)$ .

Because of

[TABLE]

we have

[TABLE]

so that $|f_{k_{i}}(x)-f_{k_{i}}(x_{i})|<e^{-L_{f}^{max}}-c$ . Therefore

[TABLE]

so that $f$ is $c$ -accurate at $x$ . Overall we obtain

[TABLE]

The second inequality can be derived from Proposition 2.7. ∎

Appendix I Proof of Theorem 3.19

Proof.

We will prove it by contradiction. Consider two points $x_{1}$ and $x_{2}$ with different labels, and $\|x_{1}-x_{2}\|=\delta_{\mathcal{T}}$ .

If $\delta_{f}(e^{-L_{f}^{max}}-c)>\delta_{\mathcal{T}}/2$ , then by the definition of $\delta_{f}(e^{-L_{f}^{max}}-c)$ , $\|f(\frac{x_{1}+x_{2}}{2})-f(x_{1})\|_{\infty}<e^{-L_{f}^{max}}-c$ , since $\|\frac{x_{1}+x_{2}}{2}-x_{1}\|=\|\frac{x_{2}-x_{1}}{2}\|=\delta_{\mathcal{T}}/2$ . Similarly, $\|f(\frac{x_{1}+x_{2}}{2})-f(x_{2})\|_{\infty}<e^{-L_{f}^{max}}-c$ . Then we have

[TABLE]

On the other hand, by the definition of $L_{f}^{max}$ , $-\ln f_{k_{1}}(x_{1})\leq L_{f}^{max}$ and $-\ln f_{k_{2}}(x_{2})\leq L_{f}^{max}$ , so $f_{k_{1}}(x_{1})\geq e^{-L_{f}^{max}}$ , $f_{k_{2}}(x_{2})\geq e^{-L_{f}^{max}}$ and $f_{k_{1}}(x_{2})\leq 1-f_{k_{2}}(x_{2})\leq 1-e^{-L_{f}^{max}}$ , therefore

[TABLE]

∎

Appendix J Proof of Theorem 3.20

Proof.

From Theorem 3.18 and assumption we know that

[TABLE]

which implies $\lim\limits_{\rho_{\mathcal{T}}\to 1}p(f)=1$ . Note that $\kappa$ is less than $0.5$ and $\delta_{0}$ is only determined by the classification problem itself. The above inequality is easy to convert into form (ii). ∎

Appendix K Detailed information of data and parameters for training

First, we list the information concerning the data selection.

1.

MNIST: Last 55000 samples of the training set for training and all the 10000 samples of the test set for testing.

2.

CIFAR-10: First 49000 samples of the training set for training and all the 10000 samples of the test set for testing.

3.

CIFAR-100: First 49000 samples of the training set for training and all the 10000 samples of the test set for testing.

4.

COIL-20: 1200 samples whose end numbers of the figure names are not multiples of 6 for training and 240 samples whose end numbers are multiples of 6 for testing.

5.

COIL-100: 6000 samples whose end numbers of the figure names are not multiples of 30 for training and 1200 samples whose end numbers are multiples of 30 for testing.

6.

SVHN: First 50000 samples of the training set for training and first 10000 samples of the test set for testing.

Parameters for networks are listed in Table 3.

For generating convolution data, we choose the following structure

[TABLE]

with kernel size $3\times 3$ (strides $1$ ) and pool size $2\times 2$ (strides $2$ ). We then train this CNN with batch size $64$ , learning rate $0.001$ , and the RMSProp optimizer for 5 epochs. After that, we extract the new data at the location indicated in the above structure by feeding the original data to the trained network.

