Fast convergence rates of deep neural networks for classification
Yongdai Kim, Ilsang Ohn, Dongha Kim

TL;DR
This paper establishes that deep neural networks with ReLU activation and hinge or cross-entropy loss can achieve fast convergence rates in classification tasks under various conditions, highlighting their flexibility and effectiveness.
Contribution
The paper provides theoretical convergence rate results for DNN classifiers with ReLU and hinge loss across different data conditions, and compares hinge loss with cross-entropy in practice.
Findings
DNN classifiers with ReLU and hinge loss achieve fast convergence under smooth decision boundary and margin conditions.
DNN classifiers with cross-entropy converge quickly when class probabilities are near 0 or 1.
Numerical experiments support the theoretical convergence rates and compare hinge loss and cross-entropy performance.
| Data | # of training data | # of test data | Input dimension | Selected classes |
|---|---|---|---|---|
| MNIST | 60,000 | 10,000 | | ‘5’ vs. ‘7’ |
| SVHN | 73,257 | 26,032 | | ‘4’ vs. ‘9’ |
| CIFAR10 | 60,000 | 50,000 | | ‘cat’ vs. ‘dog’ |
| Data | # of training samples per class | Hinge loss (Mean) | Hinge loss (SE) | Logistic loss (Mean) | Logistic loss (SE) |
|---|---|---|---|---|---|
| MNIST | 50 | 0.9318 | 0.0078 | 0.9359 | 0.0100 |
| MNIST | 500 | 0.9806 | 0.0031 | 0.9799 | 0.0024 |
| MNIST | 5000 | 0.9929 | 0.0006 | 0.9925 | 0.0005 |
| SVHN | 50 | 0.7877 | 0.0698 | 0.7851 | 0.0798 |
| SVHN | 500 | 0.9500 | 0.0061 | 0.9545 | 0.0063 |
| SVHN | 5000 | 0.9796 | 0.0011 | 0.9801 | 0.0014 |
| CIFAR10 | 50 | 0.6628 | 0.0123 | 0.6698 | 0.0096 |
| CIFAR10 | 500 | 0.7758 | 0.0090 | 0.7804 | 0.0081 |
| CIFAR10 | 5000 | 0.8760 | 0.0064 | 0.8788 | 0.0047 |
| SVHN | CIFAR10 |
|---|---|
| RGB images | |
| conv. 64 ReLU | conv. 96 ReLU |
| conv. 64 ReLU | conv. 96 ReLU |
| conv. 64 ReLU | conv. 96 ReLU |
| max-pool, stride 2 | |
| dropout, | |
| conv. 128 ReLU | conv. 192 ReLU |
| conv. 128 ReLU | conv. 192 ReLU |
| conv. 128 ReLU | conv. 192 ReLU |
| max-pool, stride 2 | |
| dropout, | |
| conv. 128 ReLU | conv. 192 ReLU |
| conv. 128 ReLU | conv. 192 ReLU |
| conv. 128 ReLU | conv. 192 ReLU |
| global average pool, | |
| FC | FC |
Fast convergence rates of deep neural networks for classification
Yongdai Kim, Ilsang Ohn, and Dongha Kim
Department of Statistics, Seoul National University, Seoul, Korea
Abstract
We derive the fast convergence rates of a deep neural network (DNN) classifier with the rectified linear unit (ReLU) activation function learned using the hinge loss. We consider three cases for a true model: (1) a smooth decision boundary, (2) smooth conditional class probability, and (3) the margin condition (i.e., the probability of inputs near the decision boundary is small). We show that the DNN classifier learned using the hinge loss achieves fast convergence rates for all three cases provided that the architecture (i.e., the number of layers, number of nodes, and sparsity) is carefully selected. An important implication is that DNN architectures are very flexible for use in various cases without much modification. In addition, we consider a DNN classifier learned by minimizing the cross-entropy, and show that the DNN classifier achieves a fast convergence rate under the condition that the conditional class probabilities of most data are sufficiently close to either 1 or 0. This assumption is not unusual for image recognition because human beings are extremely good at recognizing most images. To confirm our theoretical explanation, we present the results of a small numerical study conducted to compare the hinge loss and cross-entropy.
Keywords: Classification, Deep neural network, Excess risk, Fast convergence rate
1 Introduction
Deep learning (Hinton and Salakhutdinov, 2006; Larochelle et al., 2007; Goodfellow et al., 2016) has received much attention for dimension reduction and classification of objects, such as images, speech, and language. Various supervised/unsupervised deep learning architectures, such as deep belief network (Hinton et al., 2006), have been developed and applied to large scale real data with great success. A key ingredient for the success of deep learning is to discover multiple levels of representation of the given dataset with higher levels of representation defined hierarchically in terms of lower level representations. The central motivation is that higher-level representations can potentially capture relevant higher-level abstractions. See Goodfellow et al. (2016) for details.
Theoretical explanations regarding the success of deep learning have been recently studied. Many researchers have demonstrated that deep neural networks (DNNs) are much more efficient in representing certain complex functions than their shallow counterparts (Montufar et al., 2014; Raghu et al., 2016; Eldan and Shamir, 2016), which has been reconfirmed by Yarotsky (2017) and Petersen and Voigtlaender (2018), who showed that DNNs can approximate a large class of functions, including even discontinuous functions, with a parsimonious number of parameters. In turn, using this efficient approximation property of a DNN, Schmidt-Hieber (2017) and Imaizumi and Fukumizu (2018) proved that, for regression problems, we can estimate a complex function, including a discontinuous function, using a DNN with the minimax optimal convergence rate. A surprising result is that any linear estimator, including the ridge-penalized kernel estimator, is sub-optimal in estimating a discontinuous function while the DNN is optimal.
In this paper, we consider classification problems. It is known that estimating the classifier directly instead of estimating the conditional class probability (i.e., $\mathrm{P}(Y = 1 \mid X = x)$) can help achieve fast convergence rates (Mammen and Tsybakov, 1999; Tsybakov, 2004; Tsybakov and van de Geer, 2005; Audibert and Tsybakov, 2007) under Tsybakov’s low noise condition. We prove that the estimation of a classifier based on the DNN with the hinge loss can achieve fast convergence rates under various situations.
In practice, estimating the classifier directly is difficult because the classifier itself is discontinuous. Mammen and Tsybakov (1999); Tsybakov (2004); Tsybakov and van de Geer (2005) considered estimating the classifier directly, which may be computationally infeasible in practice. Under the smoothness assumption on the conditional class probability, Audibert and Tsybakov (2007) estimated the conditional class probability using a local polynomial estimator and obtained a plug-in classifier. Finding the best plug-in classifier, however, requires searching in a given sieve, which is computationally demanding. In contrast, learning a DNN is relatively straightforward owing to the gradient descent algorithm, despite a risk of arriving at bad local minima.
We consider three cases regarding a true classifier: (1) a smooth boundary, (2) smooth conditional class probability, and (3) the margin condition (i.e., the probability of the inputs near the decision boundary is small). We prove that the DNN classifier can achieve fast convergence rates for all of these three cases if the architecture (i.e., the number of layers, number of nodes, and sparsity of the weights) of the DNN is carefully selected. In particular, the DNN classifier is minimax optimal for a smooth conditional class probability, and achieves faster convergence rates under the margin condition. To the best of the authors’ knowledge, no other estimator achieves fast convergence rates for these three cases simultaneously.
The cross-entropy is the standard objective function used in learning a DNN, and is an empirical risk with respect to the logistic loss (i.e., the negative log-likelihood of the logistic model). It is well known that the logistic loss estimates the conditional class probability rather than the classifier, and hence will be sub-optimal. However, learning a DNN with the cross-entropy performs quite well in practice. We justify the use of the cross-entropy in learning a DNN by showing that the corresponding classifier also achieves a fast convergence rate when most data have a conditional class probability close to 1 or zero. Note that this assumption is reasonable for image recognition because human beings recognize most real world images quite well.
The remainder of this paper is organized as follows. Section 2 describes the hinge loss and DNN classifier. Section 3 derives the convergence rates of the excess risk of a DNN classifier for the aforementioned three cases regarding a true model. The fast convergence rate of the DNN classifier with the cross-entropy is derived in Section 4, and concluding remarks follow in Section 5.
1.1 Notations
For a function , where denotes the domain of the function, let . For a given subset of , we let .
For two given sequences and of real numbers, we write if there exists a constant such that for all sufficiently large . In addition, we write if and . For , we let .
Let be a multiple index, where . We define and for a multiple index . For and , let
[TABLE]
and for , let
[TABLE]
We denote by and , the space of times differentiable functions on whose partial derivatives of order with are continuous. For a positive real value , we write , where and . The Hölder space of order is defined as , where denotes the Hölder norm defined by
[TABLE]
We let
[TABLE]
which is a closed ball in the Hölder space of radius with respect to the Hölder norm.
2 Estimation of the classifier with DNNs
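We consider a binary classification problem. The data are given as $(x_1, y_1), \dots, (x_n, y_n)$, where $x_i \in \mathcal{X}$ are input vectors, and $y_i \in \{-1, 1\}$ are class labels. Here, for simplicity, we set $\mathcal{X} = [0, 1]^d$; however, this can be extended to any compact subset of $\mathbb{R}^d$. We assume that $(x_1, y_1), \dots, (x_n, y_n)$ are independent copies of a random vector $(X, Y) \sim \mathrm{P}$ for a certain probability measure $\mathrm{P}$. We let $\mathrm{P}_X$ be the marginal distribution of $X$ induced by the joint distribution $\mathrm{P}$.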
2.1 Necessity of the hinge loss
Before going further, we will first review why we consider the hinge loss instead of the logistic loss to achieve fast convergence rates. Let $\mathcal{C}$ be the class of all classifiers (i.e., all measurable mappings from $\mathcal{X}$ to $\{-1, 1\}$). The objective of classification is to find the optimal classifier (called the Bayes classifier) $C^*$, which is defined as
$$C^* = \operatorname*{argmin}_{C \in \mathcal{C}} \mathrm{E}\left[\mathbb{I}(C(X) \neq Y)\right],$$
where $\mathbb{I}(\cdot)$ is 1 if $\cdot$ is true, and is 0 otherwise.
Because we do not know the probability measure $\mathrm{P}$ generating data, we cannot find $C^*$. Instead, we estimate $C^*$ based on the training data. The most popular method for estimating $C^*$ is the empirical risk minimization approach, where we estimate $C^*$ by minimizing the empirical risk. That is, we estimate $C^*$ using $\hat{C}_n$, where
$$\hat{C}_n = \operatorname*{argmin}_{C \in \mathcal{C}_n} \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}(C(x_i) \neq y_i), \qquad (2.1)$$
where $\mathcal{C}_n$ is a given class of classifiers depending on the sample size $n$.
In practice, $\hat{C}_n$ is not computationally feasible because minimizing the empirical risk with the 0-1 loss over $\mathcal{C}_n$ is NP hard (Bartlett et al., 2006). An alternative approach is to replace the 0-1 loss with other computationally easier losses, the so-called surrogate losses. In addition, instead of a class of classifiers $\mathcal{C}_n$, we consider a class of real-valued functions $\mathcal{F}_n$. For a given surrogate loss $\phi$, we obtain an estimator $\hat{f}_n$ by minimizing the surrogate empirical risk (or empirical $\phi$-risk)
$$\frac{1}{n} \sum_{i=1}^{n} \phi(y_i f(x_i)) \qquad (2.2)$$
on $\mathcal{F}_n$, and construct a classifier by $\operatorname{sign}(\hat{f}_n)$.
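The empirical risks discussed above can be written out in a few lines. The following is a minimal sketch, assuming labels coded as -1/+1 and the usual margin-based forms of the hinge and logistic losses; the function names and the toy scores are illustrative and do not come from the paper.

```python
import numpy as np

def zero_one_risk(scores, labels):
    """Empirical 0-1 risk of the classifier sign(scores); labels are in {-1, +1}."""
    predictions = np.where(scores >= 0, 1, -1)  # assumed convention: sign(0) = +1
    return float(np.mean(predictions != labels))

def hinge_risk(scores, labels):
    """Empirical phi-risk for the hinge loss phi(z) = max(0, 1 - z)."""
    return float(np.mean(np.maximum(0.0, 1.0 - labels * scores)))

def logistic_risk(scores, labels):
    """Empirical phi-risk for the logistic loss phi(z) = log(1 + exp(-z))."""
    # logaddexp(0, -z) evaluates log(1 + exp(-z)) without overflow
    return float(np.mean(np.logaddexp(0.0, -labels * scores)))

# Toy real-valued outputs f(x_i) and labels y_i (made up for illustration)
scores = np.array([2.0, 0.5, -0.3, -1.5])
labels = np.array([1, 1, 1, -1])

print(zero_one_risk(scores, labels))
print(hinge_risk(scores, labels))
print(logistic_risk(scores, labels))
```

Both surrogate risks are convex in the scores, which is what makes them easier to minimize than the 0-1 risk.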
A question in using a convex surrogate loss is the relation between the minimizer of the 0-1 empirical risk (2.1) and that of the empirical $\phi$-risk (2.2). Because the empirical $\phi$-risk converges to the population $\phi$-risk for a given $f$ by the law of large numbers, we can consider $\hat{f}_n$ as an estimator of $f^*_\phi$, which is defined as
$$f^*_\phi = \operatorname*{argmin}_{f \in \mathcal{F}} \mathrm{E}\left[\phi(Y f(X))\right],$$
where $\mathcal{F}$ is the limit of $\mathcal{F}_n$ in a certain sense. When $\mathcal{F}$ is the set of all measurable functions, we say that the surrogate loss $\phi$ is Fisher consistent if $\operatorname{sign}(f^*_\phi) = C^*$.
It is known (Lin, 2004; Bartlett et al., 2006) that the Fisher consistency holds under very mild conditions on $\phi$. In particular, $f^*_\phi$ is known for various surrogate losses. For example, when $\phi$ is the logistic loss (i.e., $\phi(z) = \log(1 + e^{-z})$), we have $f^*_\phi(x) = \log\{\eta(x) / (1 - \eta(x))\}$, where $\eta(x) = \mathrm{P}(Y = 1 \mid X = x)$ (Friedman et al., 2000). Hence, the logistic loss satisfies the Fisher consistency, which justifies the use of the cross-entropy when learning a deep neural network. That is, deep learning with the cross-entropy essentially estimates the log odds of the conditional class probability.
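The log-odds form of the logistic-loss minimizer can be checked with a short pointwise calculation. The derivation below is a sketch in standard notation, writing $\eta = \mathrm{P}(Y = 1 \mid X = x)$ at a fixed point $x$ and $z$ for the candidate value $f(x)$; these symbols are assumptions of this sketch rather than a quotation of the paper's displays.

```latex
% Conditional logistic risk at a fixed x, with \eta = P(Y = 1 | X = x) and z = f(x):
R_\eta(z) = \eta \log\left(1 + e^{-z}\right) + (1 - \eta) \log\left(1 + e^{z}\right)

% Differentiate with respect to z:
R_\eta'(z) = -\frac{\eta}{1 + e^{z}} + \frac{(1 - \eta)\, e^{z}}{1 + e^{z}}
           = \frac{(1 - \eta)\, e^{z} - \eta}{1 + e^{z}}

% R_\eta is convex, so the minimizer solves R_\eta'(z) = 0:
e^{z^*} = \frac{\eta}{1 - \eta}
\quad\Longrightarrow\quad
z^* = \log\frac{\eta}{1 - \eta}

% In particular, z^* > 0 exactly when \eta > 1/2, so sign(z^*) agrees with the Bayes rule.
```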
As we explained in the Introduction, it would be better to estimate the Bayes classifier directly, which is realized conceptually if $f^*_\phi$ is the Bayes classifier. The hinge loss (i.e., $\phi(z) = \max\{1 - z, 0\}$) has such a property (Lin, 2002), which is why we consider the hinge loss. Note that there are other losses that have $f^*_\phi = C^*$. An example is the $\psi$-loss (Shen et al., 2003), which is also known as the ramp loss (Collobert et al., 2006). Although the $\psi$-loss has many advantages over the hinge loss, the $\psi$-loss is nonconvex, and learning a DNN classifier using the $\psi$-loss would be extremely difficult because the DNN classifier is nonconvex as well.
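The contrast between the two losses can be seen numerically. The sketch below minimizes the pointwise conditional risk over a grid for a few hypothetical values of the conditional class probability `eta`. Under the standard definitions assumed here, the hinge minimizer lands on +1 or -1 (the Bayes label), while the logistic minimizer lands on the log odds. The grid and the chosen values of `eta` are illustrative.

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    return np.logaddexp(0.0, -z)  # log(1 + exp(-z))

def conditional_risk(loss, eta, z):
    """Expected loss at a point where P(Y = 1 | x) = eta and f(x) = z."""
    return eta * loss(z) + (1.0 - eta) * loss(-z)

def pointwise_minimizer(loss, eta, grid):
    """Grid point with the smallest conditional risk."""
    return float(grid[np.argmin(conditional_risk(loss, eta, grid))])

grid = np.linspace(-3.0, 3.0, 601)  # step 0.01

for eta in (0.2, 0.6, 0.9):
    z_hinge = pointwise_minimizer(hinge, eta, grid)
    z_logit = pointwise_minimizer(logistic, eta, grid)
    log_odds = np.log(eta / (1.0 - eta))
    print(f"eta={eta}: hinge {z_hinge:+.2f}, logistic {z_logit:+.2f}, log odds {log_odds:+.2f}")
```

The hinge minimizer depends on `eta` only through which side of 1/2 it falls on, which is the sense in which the hinge loss targets the classifier rather than the class probability.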
2.2 Learning DNN with the hinge loss
We consider DNNs that take -dimensional inputs and produce one-dimensional outputs. A DNN with many layers, and many nodes at each layer, is defined as
[TABLE]
and
[TABLE]
for and
[TABLE]
with and . We consider the ReLU activation function . We denote as , where is the parameter set including all weights and biases.
For the given , let be the number of layers in . Let be the maximum number of nodes, that is, has at most nodes at each layer. We define as the number of nonzero parameters in ,
[TABLE]
where transforms the matrix into the corresponding vector by concatenating the column vectors. Similarly, we define as the largest absolute value of the parameters in ,
[TABLE]
For a given , let be
[TABLE]
where the positive constants , , , , and are specified later.
We let be the minimizer of over for a given surrogate loss , i.e.,
[TABLE]
In the following section, we prove the fast convergence rates of for various cases of the true model when is the hinge loss and , , , and are carefully selected. For detailed formulas of , and in terms of the sample size , see the proofs of the corresponding theorems in the Appendix.
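The four architecture quantities used above (number of layers, maximum number of nodes, number of nonzero parameters, and largest absolute parameter value) can be computed directly from a list of weight matrices and bias vectors. The following is a minimal sketch; the parameter layout, the convention that the depth counts hidden layers, and the helper names are assumptions of this illustration, not the paper's exact conventions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, x):
    """Evaluate a ReLU network; params is a list of (W, b), ReLU after all but the last layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)
    W, b = params[-1]
    return W @ h + b

def complexity(params):
    """Depth, width, sparsity, and parameter bound of the network."""
    hidden = params[:-1]
    return {
        "layers": len(hidden),  # assumed convention: count hidden layers
        "max_nodes": max((W.shape[0] for W, _ in hidden), default=0),
        "nonzero": int(sum(np.count_nonzero(W) + np.count_nonzero(b) for W, b in params)),
        "max_abs": float(max(max(np.abs(W).max(), np.abs(b).max()) for W, b in params)),
    }

# A tiny sparse network with 2 inputs, 3 hidden nodes, and 1 output
params = [
    (np.array([[1.0, 0.0], [0.0, -2.0], [0.0, 0.0]]), np.array([0.0, 0.5, 0.0])),
    (np.array([[1.0, 1.0, 0.0]]), np.array([-0.25])),
]

print(complexity(params))
print(forward(params, np.array([1.0, 1.0])))
```

A constrained class of networks then corresponds to keeping only those parameter lists whose four summaries fall below prescribed bounds.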
3 Fast convergence rates of DNN classifiers with the hinge loss
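In this section, we consider the hinge loss and derive the convergence rates of the excess risk of $\hat{f}_n$. For a given function $f$, the excess risk of $f$ is defined as
$$\mathcal{E}(f, C^*) = \mathrm{P}\left(\operatorname{sign}(f(X)) \neq Y\right) - \mathrm{P}\left(C^*(X) \neq Y\right),$$
and the excess $\phi$-risk of $f$ is defined by
$$\mathcal{E}_\phi(f, f^*_\phi) = \mathrm{E}\left[\phi(Y f(X))\right] - \mathrm{E}\left[\phi(Y f^*_\phi(X))\right].$$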
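Throughout this paper, we always assume the Tsybakov noise condition (Mammen and Tsybakov, 1999; Tsybakov, 2004).
- (N)
There exist $C > 0$ and $q \in [0, \infty]$ such that for any $t > 0$,
$$\mathrm{P}\left(|2\eta(X) - 1| \le t\right) \le C t^{q},$$
where $\eta(x) = \mathrm{P}(Y = 1 \mid X = x)$.
We call the parameter $q$ appearing in assumption (N) the noise exponent.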
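As a quick sanity check on what the noise exponent measures, the sketch below evaluates the probability that the conditional class probability is within `t/2` of 1/2, assuming the usual form of the Tsybakov condition, P(|2η(X) − 1| ≤ t) ≤ C t^q. The toy model (X uniform on [0, 1] with η(x) = x) and the function name are hypothetical; in this model the probability equals t, so the condition holds with C = 1 and q = 1.

```python
import numpy as np

def noise_mass(eta_values, t):
    """Fraction of points whose conditional class probability satisfies |2*eta - 1| <= t."""
    return float(np.mean(np.abs(2.0 * eta_values - 1.0) <= t))

# Deterministic stand-in for X ~ Uniform[0, 1]: midpoints of a fine grid
n = 100_000
x = (np.arange(n) + 0.5) / n
eta = x  # toy model: eta(x) = x

for t in (0.05, 0.1, 0.2):
    print(t, noise_mass(eta, t))
```

A larger exponent corresponds to less mass near η = 1/2, that is, fewer inputs whose label is close to a coin flip.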
We consider three cases regarding a true model: (1) a smooth decision boundary, (2) smooth class conditional probability, and (3) the margin condition. We derive the fast convergence rates of the DNN classifier using the hinge loss for all three cases.
3.1 Case 1: Smooth boundary
To describe the smooth Bayes classifier, we introduce the notion of piecewise constant functions with smooth boundaries. We adopt the notations and definitions from Petersen and Voigtlaender (2018) and Imaizumi and Fukumizu (2018). For and , we define a horizon function as
[TABLE]
where . For each horizon function, we define the corresponding basis piece as
[TABLE]
We define a piece by the intersection of basis pieces. The set of pieces is denoted by
[TABLE]
Let be the set of classifiers of the form
[TABLE]
for , and disjoint subsets of in . In this subsection, we assume that the Bayes classifier belongs to .
The following theorem proves the convergence rate of the DNN classifier with the hinge loss.
Theorem 1.
Assume (N) using the noise exponent . If the surrogate loss is the hinge loss, the classifier defined by (2.3) with carefully selected , and satisfies
[TABLE]
where the expectation is taken over the training data.
Tsybakov (2004) showed that the minimax lower bound is given by
[TABLE]
where the infimum is taken over all classifiers , where is the set of all measurable functions. Unfortunately, the convergence rate (3.1) is not optimal in the minimax sense. However, the difference becomes small when the noise exponent is large. Note that the estimators in Mammen and Tsybakov (1999) and Tsybakov (2004) have slower convergence rates than that in (3.1) when . However, the estimator of Tsybakov and van de Geer (2005) achieves the minimax lower bound for any . At this point, we do not know whether the sub-optimal convergence rate (3.1) is inevitable owing to the use of the hinge loss rather than the 0-1 loss. We will pursue this issue in the near future.
3.2 Case 2: Smooth conditional class probability
We assume that is smooth. That is, for some and . The following theorem provides the convergence rate of the DNN classifier.
Theorem 2.
Assume (N) with the noise exponent . If the surrogate loss is the hinge loss, the classifier defined by (2.3) with carefully selected , and satisfies
[TABLE]
Audibert and Tsybakov (2007) showed that when , the minimax lower bound of the excess risk is given by
[TABLE]
Hence, the convergence rate (3.2) is minimax optimal up to a logarithmic factor.
3.3 Case 3: Margin condition
The convergence rate can be improved if we assume that the density of an input vector is small around the decision boundary. Let , where and , where denotes the Euclidean norm. We introduce the following condition on the probability measure .
- (M)
There exist , , and such that for any ,
[TABLE]
The condition (M) was considered by Steinwart and Christmann (2008), where the parameter in (M) is called the margin exponent. Steinwart and Christmann (2008) prove that the support vector machine with the Gaussian kernel achieves a fast convergence rate under the condition (M). The following theorem proves that a similar convergence rate can be achieved using the DNN classifier.
Theorem 3.
Assume (N) with the noise exponent , and (M) with the margin exponent . If the surrogate loss is the hinge loss, the classifier defined by (2.3) with carefully selected , and satisfies
[TABLE]
An interesting feature of the convergence rate (3.3) is that its dependency on the input dimension diminishes as increases. In the extreme case where , the convergence rate becomes up to a logarithmic factor, which depends on neither the smoothness of the boundary nor the dimension of the input. This partly explains why the DNN classifier works well with high-dimensional inputs such as images.
To investigate the validity of the margin condition (M), we explore the area near the decision boundary obtained by the cat and dog images of the CIFAR10 dataset. We first fit the decision boundary using a convolutional neural network (CNN) with cat and dog images in the CIFAR10 dataset. We then randomly select two images, one from dog and the other from cat, and take convex combinations of them to obtain a sequence of images between the two selected images. Figure 1 shows five sequences of images from five randomly selected pairs of dog and cat images. The images in the red box, which are the interpolated images with weights of the dog images ranging from to , are visually unrealistic, which suggests that the image classification has a large margin exponent.
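The interpolation step described above amounts to taking convex combinations of two image arrays. A minimal sketch follows, with random arrays standing in for a cat and a dog image; the array shape, the number of steps, and the function name are assumptions, since the text does not specify them.

```python
import numpy as np

def interpolate_images(img_a, img_b, n_steps):
    """Return n_steps convex combinations (1 - w) * img_a + w * img_b, w from 0 to 1."""
    weights = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - w) * img_a + w * img_b for w in weights])

# Random stand-ins for two 32x32 RGB images with pixel values in [0, 1]
rng = np.random.default_rng(0)
cat_image = rng.uniform(size=(32, 32, 3))
dog_image = rng.uniform(size=(32, 32, 3))

sequence = interpolate_images(cat_image, dog_image, n_steps=11)
print(sequence.shape)
```

Feeding each interpolated image through a fitted classifier would then show where along the path the predicted label switches.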
3.4 Remarks regarding adaptive estimation
In practice, we know neither nor , which affect the choice of the DNN architecture parameters , and . We may select them data-adaptively. General tool kits used to find an adaptive classifier have been developed by Tsybakov (2004) and Audibert and Tsybakov (2007). These tools can be applied to a DNN classifier with minor modification.
For example, the model selection approach with a data split proposed by Audibert and Tsybakov (2007) can be applied with little difficulty. We first split the training data into two parts, and , with the sample sizes and . We then choose various values of , and , select the corresponding DNN architectures, and learn the architectures on data . Finally, among the learned DNN architectures, we choose the best DNN architecture based on the data . Because there exist model selection algorithms for which the difference between the selected model and true model is (for example, see Juditsky et al. (2008) and Audibert and Tsybakov (2007)), the selected model achieves the best possible convergence rate as long as and . We plan to report the detailed results of this soon.
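The data-split procedure can be sketched generically: fit every candidate on the first part of the data and keep the one with the smallest error on the second part. In the illustration below a histogram classifier on [0, 1] stands in for a DNN and the number of bins plays the role of the architecture. All names (`select_by_holdout`, `fit_histogram`), the toy data, and the split sizes are hypothetical and are not the authors' procedure.

```python
import numpy as np

def select_by_holdout(candidates, fit, X, y, n_fit):
    """Fit each candidate on the first n_fit points; pick the lowest error on the rest."""
    X_fit, y_fit = X[:n_fit], y[:n_fit]
    X_val, y_val = X[n_fit:], y[n_fit:]
    best = None
    for candidate in candidates:
        predict = fit(candidate, X_fit, y_fit)
        error = float(np.mean(predict(X_val) != y_val))
        if best is None or error < best[1]:  # ties keep the earlier candidate
            best = (candidate, error, predict)
    return best

def fit_histogram(n_bins, X_fit, y_fit):
    """Stand-in learner: majority vote within each of n_bins equal-width bins on [0, 1]."""
    bins = np.minimum((X_fit * n_bins).astype(int), n_bins - 1)
    votes = np.zeros(n_bins)
    np.add.at(votes, bins, y_fit)
    bin_labels = np.where(votes >= 0, 1, -1)

    def predict(X_new):
        return bin_labels[np.minimum((X_new * n_bins).astype(int), n_bins - 1)]

    return predict

# Toy data: noiseless labels with the decision boundary at 0.5
rng = np.random.default_rng(0)
X = rng.uniform(size=400)
y = np.where(X > 0.5, 1, -1)

best_bins, best_error, best_predict = select_by_holdout([1, 2, 8], fit_histogram, X, y, n_fit=200)
print(best_bins, best_error)
```

With a real DNN, `fit` would train one architecture and return its prediction function; the selection logic itself would stay the same.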
4 Use of cross-entropy
The logistic loss does not estimate the classifier directly, and hence the convergence rate is sub-optimal in general. However, in practice, a DNN with the logistic loss (i.e., learned by minimizing the cross-entropy) works quite well. In this section, we investigate when the logistic loss works well with a DNN. We prove that the convergence rate of the excess risk of the DNN estimator with the logistic loss can be fast when the true conditional class probabilities of most of the data are close to 1 or 0. This condition is expected to hold in most image recognition problems because human beings, who are thought to be a Bayes classifier, are very good at recognizing most images. The formal statement of this condition is given as follows:
- (E)
For a given positive sequence with , there exists a positive sequence with such that
[TABLE]
Theorem 4.
Assume (M) with the margin exponent . Let . Assume (E) with and . If is the logistic loss, then the classifier defined by (2.3) with and carefully selected , and satisfies
[TABLE]
The convergence rate in Theorem 4 is equivalent to that in Theorem 3 for up to a logarithmic factor.
To investigate the validity of the condition (E), Figure 2 shows a histogram of the estimated conditional class probabilities of the test data of the CIFAR10 data using the DNN classifier with the logistic loss. Note that most of the conditional class probabilities are very close to either 1 or 0.
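A simple numerical companion to such a histogram is the fraction of predicted probabilities that lie within a small distance of 0 or 1. The sketch below is only an illustration: the probabilities are made up, the threshold `delta` is an arbitrary choice, and the function name is hypothetical.

```python
import numpy as np

def extreme_fraction(probs, delta):
    """Fraction of probabilities within delta of either 0 or 1."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean(np.minimum(probs, 1.0 - probs) <= delta))

# Made-up estimated conditional class probabilities for five test inputs
estimated = [0.01, 0.98, 0.5, 0.999, 0.03]
print(extreme_fraction(estimated, delta=0.05))
```

Note that estimated probabilities from a trained network may be overconfident, so a high fraction is suggestive rather than conclusive evidence about the true conditional class probabilities.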
We compare the performance of the two DNN classifiers learned using the two surrogate losses: the logistic loss and the hinge loss. We analyze three benchmark datasets for image recognition, that is, MNIST, SVHN, and CIFAR10, where for each dataset we select the two classes that are most difficult to distinguish. The data descriptions and selected classes are summarized in Table 1. The detailed DNN architectures for the three datasets are given in Appendix A.10. The Adam optimizer is used with the learning rate . Table 2 summarizes the test accuracies for various sizes of training data. The results are the averages (and standard errors) over 100 randomly selected training sets, which show that the two estimators perform comparably.
5 Concluding Remarks
We showed that a DNN is very flexible in the sense that it achieves fast convergence rates for various cases regarding a true model. It is interesting to note that a DNN is not only good at estimating a smooth decision boundary but also a smooth conditional class probability. In addition, a DNN can fully utilize the margin condition.
We showed that using the cross-entropy is also promising when the true conditional class probability is close to either 0 or 1 for most data. However, we conjecture that learning a DNN by minimizing the cross-entropy would be sub-optimal when the conditional class probability is not extreme.
Our theoretical results could be used to develop model selection procedures, particularly for the optimal selection of and . Moreover, it will be interesting to develop an online learning algorithm that can select and data adaptively.
We did not consider a computational issue in this paper. Learning a DNN with a sparsity constraint has not been fully studied, although some methods have been proposed (e.g., Liu et al. (2015), Han et al. (2015), and Wen et al. (2016)). A learning algorithm that supports our theoretical results will be worth pursuing.
Acknowledgement
This work was supported by the Samsung Science and Technology Foundation under Project Number SSTF-BA1601-02.
Appendix A
A.1 Complexity measures of a class of functions
We introduce the complexity measures of a given class of functions. Let for be defined as , where denotes the Lebesgue measure and .
Let be a given class of real-valued functions defined on . Let and . A collection is called a -covering set of with respect to the norm if, for all , there exists in the collection such that . The cardinality of the minimal -covering set is called the -covering number of with respect to the norm, and is denoted by , that is,
[TABLE]
where .
A collection of pairs is called a -bracketing set of with respect to the norm if for all , and for any , there is a pair in the collection such that . The cardinality of the minimal -bracketing set is called the -bracketing number of with respect to the norm, and is denoted by . The -bracketing entropy, denoted by is the logarithm of the -bracketing number, i.e., .
For any , it is known (see, for example, Lemma 2.1 of van de Geer (2000)) that
[TABLE]
for any , and
[TABLE]
if .
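For intuition about covering numbers, the sketch below builds a covering set greedily for a finite family of functions evaluated on a grid, using the supremum norm. The size of the greedy cover is only an upper bound on the covering number with centers restricted to the family, and the example family (constant functions) is a toy assumption.

```python
import numpy as np

def greedy_sup_norm_cover(functions, delta):
    """Greedy delta-cover in the sup norm.

    functions: array of shape (m, k), the values of m functions at k grid points.
    Returns the indices of the chosen centers.
    """
    uncovered = np.ones(len(functions), dtype=bool)
    centers = []
    while uncovered.any():
        center = int(np.argmax(uncovered))  # first function not yet covered
        centers.append(center)
        distances = np.max(np.abs(functions - functions[center]), axis=1)
        uncovered &= distances > delta  # everything within delta is now covered
    return centers

# Toy family: constant functions with values 0, 0.01, ..., 1 on a 5-point grid
values = np.linspace(0.0, 1.0, 101)
family = np.tile(values[:, None], (1, 5))

centers = greedy_sup_norm_cover(family, delta=0.1)
print(len(centers))
```

The logarithm of such a count is the kind of entropy quantity that enters the convergence-rate conditions in the following subsections.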
A.2 Convergence rate of the excess -risk for general surrogate losses
In this subsection, we derive the convergence rate of the excess -risk under regularity conditions, which is used repeatedly in the following subsections. The regularity conditions and techniques of the proof are minor modifications of those in Park (2009); however, we present the complete conditions and proof for the sake of readers’ convenience.
We assume the following regularity conditions.
- (A1)
is Lipschitz, i.e., there exists a constant such that for any .
- (A2)
For a positive sequence as for some , there exists a sequence of function classes such that
[TABLE]
for some .
- (A3)
There exists a sequence with such that .
- (A4)
There exists a constant such that for any and any ,
[TABLE]
for a constant depending only on and .
- (A5)
For a positive constant , there exists a sequence such that
[TABLE]
for in (A2), in (A3), and in (A4).
For a proof of the general convergence result, we apply the large deviation inequality of Shen and Wong (1994) presented in Lemma 1.
Lemma 1 (Theorem 3 of Shen and Wong (1994)).
Let be the class of functions bounded above by . Assume that for any and for some . Suppose that there exists such that
- (C1)
,
- (C2)
, ,
- (C3)
if ,
[TABLE]
Then,
[TABLE]
where denotes the outer probability measure.
The following Theorem is the main result of this section, which gives the convergence rate of the excess -risk.
Theorem 5.
Suppose that the conditions (A1)-(A5) are met. Let . Then, the empirical -risk minimizer over satisfies
[TABLE]
for some universal constant .
Proof.
Let , , and be constants appearing in assumptions (A1), (A4), and (A5), respectively. Let . We define the following empirical process
[TABLE]
where is a function such that .
Since minimizes , it follows that
[TABLE]
We define
[TABLE]
Note that for such that , is an empty set. This is because for any , , and thus . Therefore, , where . Thus, we only deal with using . Because , we have
[TABLE]
We introduce the notation for a concise expression. Through the triangle inequality and (A4), we obtain the following variance bound
[TABLE]
Now, we have
[TABLE]
To bound the right-hand side, we apply Lemma 1 to the class of functions
[TABLE]
with , , , and where we let
[TABLE]
Note that for any , , and by (A.2). Since and , , and . Now we will check (C1)-(C3) of Lemma 1. Because for any and ,
[TABLE]
and
[TABLE]
Therefore, (C2) in Lemma 1 holds. For (C3), we first note that
[TABLE]
where the first inequality follows from (A1), and the second inequality follows from . Because is non-increasing in ,
[TABLE]
where the fourth inequality is due to (A5). By taking , (C3) of Lemma 1 is satisfied. Furthermore, (A.4) implies that
[TABLE]
where the last inequality is due to that . On the other hand, since ,
[TABLE]
which is larger than Hence (C1) of Lemma 1 is met.
Applying Lemma 1 to each , (A.3) is further bounded as
[TABLE]
for certain positive constants , and , which leads to the desired result.
A.3 Generic convergence rate for the hinge loss
We derive the convergence rate of the excess risk of the hinge loss under the conditions (A2), (A3), and (A5). Note that (A1) holds with for the hinge loss. We adopt the following lemma for the variance bound (A4).
Lemma 2 (Lemma 6.1 of Steinwart and Scovel (2007)).
Assume (N) with the noise exponent . Assume for any . For the hinge loss , we have that, for any ,
[TABLE]
where and is defined by
[TABLE]
Theorem 6.
Let be the hinge loss. Assume (N) with the noise exponent , and that (A2), (A3), and (A5) are met. Let . Assume that for an arbitrarily small constant . Then, the empirical -risk minimizer over satisfies
[TABLE]
where the expectation is taken over the training data.
Proof.
By Zhang’s inequality (Theorem 2.31 of Steinwart and Christmann (2008)), we have . Since (A4) is satisfied with by Lemma 2, Theorem 5 implies that
[TABLE]
for some universal constant . Since is bounded above by 1, the preceding display and the assumption imply the desired result.
A.4 Entropy of the class of DNNs
The following proposition states the upper bound of the -entropy of a neural network function space.
Proposition 1 (Lemma 3 of Suzuki (2018), Lemma 5 of Schmidt-Hieber (2017)).
For any ,
[TABLE]
where .
A.5 Proof of Theorem 1
The following proposition given by Petersen and Voigtlaender (2018) proves that DNNs are good at approximating piecewise constant functions with smooth boundaries.
Proposition 2 (Corollary 3.7 of Petersen and Voigtlaender (2018)).
Let , , and . For any and any sufficiently small , there exists a neural network
[TABLE]
where the positive constants , and depend only on , and , such that
[TABLE]
Proof of Theorem 1.
We will check the conditions (A2), (A3), and (A5) in Section A.2, and apply Theorem 6 to complete the proof. For (A2), let be a positive sequence such that . Through Proposition 2, there exists such that with , and . Thus,
[TABLE]
and hence (A2) and (A3) hold with and .
For (A5), let . Then, by Proposition 1,
[TABLE]
In turn, (A.1) implies that (A5) is satisfied if we choose satisfying
[TABLE]
which leads to the best possible convergence rate
[TABLE]
and completes the proof by Theorem 6.
A.6 Proof of Theorem 2
We first introduce the smooth function approximation result of DNNs.
Proposition 3.
For any function and any sufficiently small , there exists a neural network
[TABLE]
such that
[TABLE]
where the constants , and depend only on and .
Proof.
Theorem 5 of Schmidt-Hieber (2017) proves that for any and any integers and , there exists a neural network such that
[TABLE]
where , , and . By letting and , we have , and . Finally, because , we have , and hence we complete the proof with .
Proof of Theorem 2.
For a given , by Proposition 3, there exists such that with at most layers, nodes in each layer, and nonzero parameters for some positive constants , and . We construct the neural network by adding one layer to to achieve
[TABLE]
where denotes the ReLU activation function. Note that is equal to if , if , and otherwise. Let . Then, for , because when . Similarly, we can show that when . Therefore, by (N) we have
[TABLE]
where the inequality in the last line holds since .
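As a minimal sketch of how one additional ReLU layer can produce a three-piece output of this kind, the following Python code uses a difference of two ReLU units. The exact scaling and threshold used in the proof are not reproduced here; the function name `clipped_sign` and the parameters `score`, `eps`, and `threshold` are illustrative assumptions.

```python
def relu(t):
    """ReLU activation: max(t, 0)."""
    return max(t, 0.0)


def clipped_sign(score, eps, threshold=0.0):
    """Map a real-valued score to [-1, 1] with a single extra ReLU layer.

    Assumed form (not taken verbatim from the paper):
        2 * (relu(z) - relu(z - 1)) - 1,  with z = (score - threshold) / eps.
    The output is -1 when z <= 0, +1 when z >= 1, and linear in between.
    """
    z = (score - threshold) / eps
    return 2.0 * (relu(z) - relu(z - 1.0)) - 1.0


if __name__ == "__main__":
    for s in (-1.0, 0.0, 0.125, 0.25, 1.0):
        print(s, clipped_sign(s, eps=0.25))
```

With this form, the network output agrees with a sign-type classifier wherever the score is at least `eps` away from the threshold on the positive side or below it on the negative side, which is the kind of region on which the argument above compares the two functions.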
Note that is also a DNN in which the last layer of has a finite number of parameters, and the maximum of the parameters is bounded above by . Hence, we can construct the DNN class containing with , , , and . Now, take and observe that
[TABLE]
through Proposition 1. Since , (A5) is satisfied if we choose satisfying
[TABLE]
which leads to the best possible convergence rate
[TABLE]
and completes the proof based on Theorem 6.
A.7 Proof of Theorem 3
The main technique of the proof is to approximate a piecewise constant function using a DNN with respect to the supremum norm on a specific subset of the domain, where this subset depends on the function to be approximated.
Let , , and . Let be a disjoint with the form
[TABLE]
Let , and let
[TABLE]
For a given , define such that
[TABLE]
It turns out that any point in lies at a supremum-norm distance from the decision boundary of that is larger than . The following proposition proves that a DNN approximates well on .
**Proposition 4.**
Let , , and . For any and a sufficiently small , there exists a neural network
[TABLE]
where the positive constants , and depend only on , and , such that
[TABLE]
where is the function defined in (A.6), and is defined in (A.7).
Proof.
The proof is deferred to Section A.9.
Proof of Theorem 3.
Let be a positive sequence such that . By Proposition 4, there exists such that
[TABLE]
with , , , and for some .
We now show that , where . Suppose that . Then, there are and such that . Let be the -dimensional vector where the -th component is equal to and the other components are the same as the corresponding components of , i.e., and . Clearly, is on the decision boundary . Since , it follows that , which implies that for any since . Therefore, through the condition (M),
[TABLE]
for some constant , and hence (A2) and (A3) hold with and .
For (A5), if we take , it follows that
[TABLE]
Since , (A5) is satisfied if we choose satisfying
[TABLE]
which leads to the best possible convergence rate
[TABLE]
and completes the proof by Theorem 6.
A.8 Proof of Theorem 4
For the logistic loss, the following two lemmas are needed. The first lemma states that the -risks of both the -risk minimizer and the Bayes classifier are bounded. The second lemma provides the variance bound of the logistic loss.
**Lemma 3.**
Let be the logistic loss. Assume (E) with . Then there exist constants and such that
[TABLE]
where .
Proof.
Recall that . We let
[TABLE]
. It then follows that
[TABLE]
Let
[TABLE]
We divide into two disjoint sets and . On , we have
[TABLE]
Similarly, we can show that on , which implies .
We use an argument similar to the one above for . For
[TABLE]
and similarly we obtain the same upper bound on .
**Lemma 4 (Lemma 6.1 of Park (2009)).**
Assume (N) with the noise exponent . Assume for any . Then, for the logistic loss , we have that, for any ,
[TABLE]
for some constant .
Proof of Theorem 4.
Let . As in the proof of Theorem 3, for a positive sequence approaching zero, we can find such that
[TABLE]
and
[TABLE]
where is defined in (A.7), with being the Bayes classifier. Because , the condition (M) implies that
[TABLE]
for some constant . By Lemma 3 and the Lipschitz property of the logistic loss, we have
[TABLE]
for some positive constants and . Recall that we have defined
[TABLE]
We now take and such that , and thus the conditions (A2) and (A3) in Section A.2 hold with and .
For (A5), let . Because , it follows that
[TABLE]
which implies (A5) and completes the proof through Theorem 5 with Lemma 4, which proves the condition (A4).
A.9 Proof of Proposition 4
Before we provide the proof of Proposition 4, we introduce some useful definitions and techniques for the construction of DNNs, which are mostly from Petersen and Voigtlaender (2018).
For matrices , we let denote a block diagonal matrix whose diagonal matrices are . When have the same number of rows, we let denote a concatenated matrix along the column, and when have the same number of columns, we let denote a concatenated matrix along the row.
For an index set , a masking neural network with layers, denoted by , where , in which and for , and , where denotes a identity matrix and is a diagonal matrix where the -th diagonal entry is equal to 1 if , and is zero otherwise. If , we define . The output of the masking neural network is equal to the masked input , of which the -th element is equal to if , and is zero otherwise. Note that .
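The masking network relies on the fact that a ReLU network can pass a signed value through unchanged, since x = ReLU(x) - ReLU(-x). The following Python sketch illustrates that idea under assumptions: indices are 0-based, the function name `masking_network` and its arguments are illustrative, and the layer layout (a split layer, identity layers, a recombining layer) is one plausible reading of the definition rather than the paper's exact weights.

```python
def relu_vec(v):
    """Apply ReLU elementwise to a list of floats."""
    return [max(t, 0.0) for t in v]


def masking_network(x, index_set, num_layers=2):
    """Return x with every coordinate outside index_set set to zero.

    Layer 1 applies the diagonal mask D and splits each coordinate into its
    positive and negative parts via weights [D; -D] followed by ReLU.
    Middle layers act as the identity on the nonnegative hidden vector.
    The last layer recombines the parts with weights [I, -I].
    """
    d = len(x)
    masked = [x[i] if i in index_set else 0.0 for i in range(d)]
    hidden = relu_vec(masked + [-t for t in masked])
    for _ in range(num_layers - 2):
        hidden = relu_vec(hidden)  # identity weights; ReLU leaves values >= 0 unchanged
    return [hidden[i] - hidden[d + i] for i in range(d)]


if __name__ == "__main__":
    print(masking_network([1.5, -2.0, 3.0], {0, 1}, num_layers=4))
```

The number of layers only changes how many identity layers are inserted, which is what allows a masking network to be matched in depth with another network before the two are combined.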
Let and be two neural networks such that the input layer of has the same dimension as the output layer of . Then, a stacked neural network of and denoted by , where is defined by
[TABLE]
The stacked neural network has layers and satisfies
[TABLE]
for any input . In addition, we have that , where is the constant equal to the multiplication of the input dimension of and the output dimension of .
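The composition property of a stacked network can be checked numerically. The sketch below represents a network as a list of `(W, b)` layers with ReLU between layers and no activation after the last one, and joins two networks using the identity y = ReLU(y) - ReLU(-y). This joining rule follows the style of Petersen and Voigtlaender (2018) but is an assumption here, as are the names `forward` and `stack`; in this sketch the stacked network has as many layers as the two networks combined.

```python
def relu_vec(v):
    return [max(t, 0.0) for t in v]


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(net, x):
    """Evaluate a ReLU network; no activation after the final layer."""
    for k, (W, b) in enumerate(net):
        x = affine(W, b, x)
        if k < len(net) - 1:
            x = relu_vec(x)
    return x


def stack(second, first):
    """Build a network computing second(first(x))."""
    W_last, b_last = first[-1]
    W_head, b_head = second[0]
    # Emit both y and -y so that the following ReLU keeps all information.
    split_W = W_last + [[-w for w in row] for row in W_last]
    split_b = b_last + [-t for t in b_last]
    # Recombine: W_head (relu(y) - relu(-y)) + b_head = W_head y + b_head.
    merged_W = [row + [-w for w in row] for row in W_head]
    return first[:-1] + [(split_W, split_b), (merged_W, b_head)] + second[1:]


if __name__ == "__main__":
    first = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [-1.0])]  # |x| - 1
    second = [([[1.0]], [0.0]), ([[3.0]], [1.0])]                    # 3 relu(y) + 1
    stacked = stack(second, first)
    for x in ([-3.0], [0.5]):
        print(x, forward(stacked, x), forward(second, forward(first, x)))
```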
Let and be two neural networks with the same number of layers and -dimensional inputs. A concatenated neural network of the two networks and denoted by , where is defined by , where for , for , and . The concatenated neural network satisfies
[TABLE]
for any input , as well as .
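The concatenated network can be sketched in the same list-of-layers representation: the first layer stacks the two weight matrices vertically because both networks read the same input, and every later layer is block diagonal so that the two computations never mix. The helper names `block_diag`, `concatenate`, and `forward` are illustrative assumptions, not notation from the paper.

```python
def relu_vec(v):
    return [max(t, 0.0) for t in v]


def forward(net, x):
    """Evaluate a ReLU network given as a list of (W, b); no final activation."""
    for k, (W, b) in enumerate(net):
        x = [sum(w * t for w, t in zip(row, x)) + bi for row, bi in zip(W, b)]
        if k < len(net) - 1:
            x = relu_vec(x)
    return x


def block_diag(A, B):
    """Block diagonal matrix with diagonal blocks A and B."""
    cols_a, cols_b = len(A[0]), len(B[0])
    top = [row + [0.0] * cols_b for row in A]
    bottom = [[0.0] * cols_a + row for row in B]
    return top + bottom


def concatenate(net_a, net_b):
    """Network whose output is (net_a(x), net_b(x)) for a shared input x."""
    assert len(net_a) == len(net_b), "both networks need the same depth"
    layers = []
    for k, ((Wa, ba), (Wb, bb)) in enumerate(zip(net_a, net_b)):
        W = Wa + Wb if k == 0 else block_diag(Wa, Wb)
        layers.append((W, ba + bb))
    return layers


if __name__ == "__main__":
    net_a = [([[1.0, 0.0]], [0.0]), ([[2.0]], [1.0])]                      # 2 relu(x1) + 1
    net_b = [([[0.0, 1.0], [0.0, -1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]  # |x2|
    print(forward(concatenate(net_a, net_b), [3.0, -4.0]))
```

When the two networks differ in depth, the shallower one can first be padded with identity-type layers, which is the role the masking networks play in the constructions that follow.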
We are now ready to prove Proposition 4. We divide the proof into two steps. First, we prove an approximation result for horizon functions; then, using this result, we prove Proposition 4.
**Lemma 5 (Approximation of horizon functions).**
Let , , and . For a horizon function , where , and , define
[TABLE]
Then there exists a neural network
[TABLE]
where the positive constants , and depend only on , and , such that
[TABLE]
Proof.
Without loss of generality, assume . By Proposition 3, we can construct a neural network on such that with , , , , and . Define the map by
[TABLE]
Let and . Let . Then, consider the network
[TABLE]
where and are masking neural networks and . Clearly, we have
[TABLE]
with .
Let . We now construct a neural network that approximates . Let , where with and for , and for , , and . It can be shown that for every with . See Lemma A.2 of Petersen and Voigtlaender (2018) for details.
Finally, define
[TABLE]
so that . We then have
[TABLE]
We show that both terms on the right-hand side of the preceding display are zero.
For the first term, we note that
[TABLE]
Note also that, on , , and by the construction of , for any . Combining these two facts, we obtain
[TABLE]
which implies that the first term is equal to zero.
For the second term, we have that, for every
[TABLE]
where the second inequality holds since if and if . Thus, , which completes the proof.
Proof of Proposition 4.
We give the proof only for the case of . The extension to the cases is straightforward. Thus, we omit the subscript in all expressions.
Let be a neural network such that
[TABLE]
for any , as in Lemma 5. Define the neural network with -dimensional inputs as
[TABLE]
where denotes the ReLU activation function, and define
[TABLE]
We now show that
[TABLE]
If , then there is such that . Hence, and thus . If , then for all , and hence .
A.10 DNN architectures used for the experiments
For the MNIST dataset, we used a DNN with five hidden layers, whose numbers of nodes were 1200, 600, 300, 150, and 150, respectively. Each hidden layer was followed by batch normalization (Ioffe and Szegedy, 2015). For the SVHN and CIFAR10 datasets, we used the CNN models whose architectures are provided in Table 3.
References
- Audibert and Tsybakov [2007] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
- Bartlett et al. [2006] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
- Collobert et al. [2006] Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7(Aug):1687–1712, 2006.
- Eldan and Shamir [2016] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
- Friedman et al. [2000] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.
- Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
- Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
- Hinton and Salakhutdinov [2006] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
