Fast convergence rates of deep neural networks for classification

Yongdai Kim; Ilsang Ohn; Dongha Kim

arXiv:1812.03599·stat.ML·June 19, 2019

TL;DR

This paper establishes that deep neural networks with ReLU activation and hinge or cross-entropy loss can achieve fast convergence rates in classification tasks under various conditions, highlighting their flexibility and effectiveness.

Contribution

The paper provides theoretical convergence rate results for DNN classifiers with ReLU and hinge loss across different data conditions, and compares hinge loss with cross-entropy in practice.

Findings

01

DNN classifiers with ReLU and hinge loss achieve fast convergence under smooth decision boundary and margin conditions.

02

DNN classifiers with cross-entropy converge quickly when class probabilities are near 0 or 1.

03

Numerical experiments support the theoretical convergence rates and compare hinge loss and cross-entropy performance.

Abstract

We derive the fast convergence rates of a deep neural network (DNN) classifier with the rectified linear unit (ReLU) activation function learned using the hinge loss. We consider three cases for a true model: (1) a smooth decision boundary, (2) smooth conditional class probability, and (3) the margin condition (i.e., the probability of inputs near the decision boundary is small). We show that the DNN classifier learned using the hinge loss achieves fast convergence rates for all three cases provided that the architecture (i.e., the number of layers, number of nodes, and sparsity) is carefully selected. An important implication is that DNN architectures are very flexible for use in various cases without much modification. In addition, we consider a DNN classifier learned by minimizing the cross-entropy, and show that the DNN classifier achieves a fast convergence rate under the condition…

Tables

Table 1: Data summary

Data	# of training data	# of test data	Input dimension	Selected classes
MNIST	60,000	10,000	$28 \times 28$	‘5’ vs. ‘7’
SVHN	73,257	26,032	$3 \times 32 \times 32$	‘4’ vs. ‘9’
CIFAR10	60,000	50,000	$3 \times 32 \times 32$	‘cat’ vs. ‘dog’

Table 2: Test errors of the DNN classifiers learned using the hinge and logistic losses with various training data sizes.

Data	# of training samples per class	Hinge loss (mean)	Hinge loss (SE)	Logistic loss (mean)	Logistic loss (SE)
MNIST	50	0.9318	0.0078	0.9359	0.0100
	500	0.9806	0.0031	0.9799	0.0024
	5000	0.9929	0.0006	0.9925	0.0005
SVHN	50	0.7877	0.0698	0.7851	0.0798
	500	0.9500	0.0061	0.9545	0.0063
	5000	0.9796	0.0011	0.9801	0.0014
CIFAR10	50	0.6628	0.0123	0.6698	0.0096
	500	0.7758	0.0090	0.7804	0.0081
	5000	0.8760	0.0064	0.8788	0.0047

Table 3: CNN models used in our experiments over SVHN and CIFAR-10. All convolutional (conv.) and fully connected (FC) layers are followed by batch normalization.

SVHN	CIFAR10
$32 \times 32$ RGB images
$3 \times 3$ conv. 64 ReLU	$3 \times 3$ conv. 96 ReLU
$3 \times 3$ conv. 64 ReLU	$3 \times 3$ conv. 96 ReLU
$3 \times 3$ conv. 64 ReLU	$3 \times 3$ conv. 96 ReLU
$2 \times 2$ max-pool, stride 2
dropout, $p = 0.5$
$3 \times 3$ conv. 128 ReLU	$3 \times 3$ conv. 192 ReLU
$3 \times 3$ conv. 128 ReLU	$3 \times 3$ conv. 192 ReLU
$3 \times 3$ conv. 128 ReLU	$3 \times 3$ conv. 192 ReLU
$2 \times 2$ max-pool, stride 2
dropout, $p = 0.5$
$3 \times 3$ conv. 128 ReLU	$3 \times 3$ conv. 192 ReLU
$1 \times 1$ conv. 128 ReLU	$1 \times 1$ conv. 192 ReLU
$1 \times 1$ conv. 128 ReLU	$1 \times 1$ conv. 192 ReLU
global average pool, $6 \times 6 \to 1 \times 1$
FC $128 \to 1$	FC $192 \to 1$

Equations

$$\partial^{{\bf m}}f=\frac{\partial^{|{\bf m}|}f}{\partial{\bf x}^{{\bf m}}}=\frac{\partial^{|{\bf m}|}f}{\partial x_{1}^{m_{1}}\cdots\partial x_{d}^{m_{d}}},$$

$$[f]_{s,{\cal X}}=\sup_{{\bf x},{\bf y}\in{\cal X},\,{\bf x}\neq{\bf y}}\frac{|f({\bf x})-f({\bf y})|}{|{\bf x}-{\bf y}|^{s}}.$$

$$\|f\|_{{\cal H}^{\alpha}({\cal X})}=\max_{|{\bf m}|\leq[\alpha]^{-}}\|\partial^{{\bf m}}f\|_{\infty,{\cal X}}+\max_{|{\bf m}|=[\alpha]^{-}}[\partial^{{\bf m}}f]_{\{\alpha\}^{+},{\cal X}}.$$

$${\cal H}^{\alpha,r}({\cal X})=\left\{f\in{\cal C}^{[\alpha]^{-}}({\cal X}):\|f\|_{{\cal H}^{\alpha}({\cal X})}\leq r\right\},$$

$$C^{*}=\mathop{\rm argmin}_{C\in{\cal C}}\mbox{E}\left[\mathbf{1}\{C({\bf X})\neq Y\}\right],$$

$$\widehat{C}=\mathop{\rm argmin}_{C\in{\cal C}_{n}}\sum_{i=1}^{n}\mathbf{1}\{C({\bf x}_{i})\neq y_{i}\}/n,$$

$${\cal E}_{\phi,n}(f)=\sum_{i=1}^{n}\phi(y_{i}f({\bf x}_{i}))/n$$

$$f_{\phi}^{*}=\mathop{\rm argmin}_{f\in{\cal F}_{\infty}}{\cal E}_{\phi}(f),$$

$$z_{j}^{(l)}({\bf x})=b_{j}^{(l)}+\sum_{k=1}^{N^{(l-1)}}W_{j,k}^{(l)}h_{k}^{(l-1)}({\bf x})$$

$$h_{j}^{(l)}({\bf x})=\sigma(z_{j}^{(l)}({\bf x}))$$

$$f({\bf x})=b^{(L+1)}+\sum_{k=1}^{N^{(L)}}W_{1,k}^{(L+1)}h_{k}^{(L)}({\bf x})$$

$$\|\Theta\|_{0}=\sum_{l=1}^{L+1}\left(\|\text{vec}({\bf W}^{(l)})\|_{0}+\|{\bf b}^{(l)}\|_{0}\right),$$

$$\|\Theta\|_{\infty}=\max\left\{\max_{1\leq l\leq L+1}\|\text{vec}({\bf W}^{(l)})\|_{\infty},\ \max_{1\leq l\leq L+1}\|{\bf b}^{(l)}\|_{\infty}\right\}.$$

$${\cal F}_{n}=\big\{f({\bf x}|\Theta):|\Theta|\leq L_{n},\ N_{\max}(\Theta)\leq N_{n},\ \|\Theta\|_{0}\leq S_{n},\ \|\Theta\|_{\infty}\leq B_{n},\ \|f(\cdot|\Theta)\|_{\infty}\leq F_{n}\big\}$$

$$\widehat{f}^{\textup{DNN}}_{\phi,n}=\mathop{\rm argmin}_{f\in{\cal F}_{n}}\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i}f({\bf x}_{i})).$$

$${\cal E}(f,C^{*})={\cal E}(f)-{\cal E}(C^{*})=\mbox{E}\left[\mathbf{1}(Yf({\bf X})<0)\right]-\mbox{E}\left[\mathbf{1}(YC^{*}({\bf X})<0)\right],$$

$${\cal E}_{\phi}(f,f_{\phi}^{*})={\cal E}_{\phi}(f)-{\cal E}_{\phi}(f_{\phi}^{*})=\mbox{E}\left[\phi(Yf({\bf X}))\right]-\mbox{E}\left[\phi(Yf_{\phi}^{*}({\bf X}))\right].$$

$$\Pr\left(\{{\bf X}:|2\eta({\bf X})-1|\leq t\}\right)\leq Ct^{q}.$$

$$\Psi_{g,j}({\bf x})=\mathbf{1}(x_{j}\geq g({\bf x}_{-j})),$$

$$I_{g,j}=\left\{{\bf x}\in[0,1]^{d}:\Psi_{g,j}({\bf x})=1\right\}.$$

$${\cal A}^{\alpha,r,K}=\left\{A\subset[0,1]^{d}:A=\bigcap_{k=1}^{K}I_{g_{k},j_{k}},\ g_{k}\in{\cal H}^{\alpha,r}([0,1]^{d-1}),\ j_{k}\in[d]\right\}.$$

$$C({\bf x})=2\sum_{t=1}^{T}\mathbf{1}({\bf x}\in A_{t})-1,$$

$$\sup_{C^{*}\in{\cal C}^{\alpha,r,K,T}}\mbox{E}\left[{\cal E}(\widehat{f}^{\textup{DNN}}_{\phi,n},C^{*})\right]\lesssim\left(\frac{\log^{3}n}{n}\right)^{\frac{\alpha(q+1)}{\alpha(q+2)+(d-1)(q+1)}}$$

$$\inf_{\widehat{f}_{n}}\sup_{C^{*}\in{\cal C}^{\alpha,r,1,1}}\mbox{E}\left[{\cal E}(\widehat{f}_{n},C^{*})\right]\gtrsim n^{-\frac{\alpha(q+1)}{\alpha(q+2)+(d-1)q}},$$

$$\sup_{\eta\in{\cal H}^{\beta,r}}\mbox{E}\left[{\cal E}(\widehat{f}^{\textup{DNN}}_{\phi,n},C^{*})\right]\lesssim\left(\frac{\log^{3}n}{n}\right)^{\frac{\beta(q+1)}{\beta(q+2)+d}}.$$

$$\inf_{\widehat{f}_{n}}\sup_{\eta\in{\cal H}^{\beta,r}}\mbox{E}\left[{\cal E}(\widehat{f}_{n},C^{*})\right]\gtrsim n^{-\frac{\beta(q+1)}{\beta(q+2)+d}}.$$

$$\Pr\left(\{{\bf X}:\text{dist}({\bf X},D^{*})\leq\epsilon\}\right)\leq C\epsilon^{\gamma}.$$

$$\sup_{C^{*}\in{\cal C}^{\alpha,r,K,T}}\mbox{E}\left[{\cal E}(\widehat{f}^{\textup{DNN}}_{\phi,n},C^{*})\right]\lesssim\left(\frac{\log^{3}n}{n}\right)^{\frac{\alpha(q+1)}{\alpha(q+2)+(d-1)(q+1)/\gamma}}.$$

$$\Pr\left\{{\bf X}:|f_{\phi}^{*}({\bf X})|>\tilde{F}_{n}\right\}\geq 1-\lambda_{n}.$$

Full text

Fast convergence rates of deep neural networks for classification

Yongdai Kim, Ilsang Ohn, and Dongha Kim

Department of Statistics, Seoul National University, Seoul, Korea

Abstract

We derive the fast convergence rates of a deep neural network (DNN) classifier with the rectified linear unit (ReLU) activation function learned using the hinge loss. We consider three cases for a true model: (1) a smooth decision boundary, (2) smooth conditional class probability, and (3) the margin condition (i.e., the probability of inputs near the decision boundary is small). We show that the DNN classifier learned using the hinge loss achieves fast convergence rates for all three cases provided that the architecture (i.e., the number of layers, number of nodes, and sparsity) is carefully selected. An important implication is that DNN architectures are very flexible for use in various cases without much modification. In addition, we consider a DNN classifier learned by minimizing the cross-entropy, and show that the DNN classifier achieves a fast convergence rate under the condition that the conditional class probabilities of most data are sufficiently close to either 1 or zero. This assumption is not unusual for image recognition because human beings are extremely good at recognizing most images. To confirm our theoretical explanation, we present the results of a small numerical study conducted to compare the hinge loss and cross-entropy.

Keywords: Classification, Deep neural network, Excess risk, Fast convergence rate

1 Introduction

Deep learning (Hinton and Salakhutdinov, 2006; Larochelle et al., 2007; Goodfellow et al., 2016) has received much attention for dimension reduction and classification of objects, such as images, speech, and language. Various supervised/unsupervised deep learning architectures, such as deep belief network (Hinton et al., 2006), have been developed and applied to large scale real data with great success. A key ingredient for the success of deep learning is to discover multiple levels of representation of the given dataset with higher levels of representation defined hierarchically in terms of lower level representations. The central motivation is that higher-level representations can potentially capture relevant higher-level abstractions. See Goodfellow et al. (2016) for details.

Theoretical explanations regarding the success of deep learning have been recently studied. Many researchers have demonstrated that deep neural networks (DNNs) are much more efficient in representing certain complex functions than their shallow counterparts (Montufar et al., 2014; Raghu et al., 2016; Eldan and Shamir, 2016), which has been reconfirmed by Yarotsky (2017) and Petersen and Voigtlaender (2018), who showed that DNNs can approximate a large class of functions, including even discontinuous functions with a parsimonious number of parameters. In turn, using this efficient approximation property of a DNN, Schmidt-Hieber (2017) and Imaizumi and Fukumizu (2018) proved that, for regression problems, we can estimate a complex function including a discontinuous function using a DNN with the (in the minimax sense) optimal convergence rate. A surprising result is that any linear estimators, which include the ridge penalized kernel estimator, are sub-optimal in estimating a discontinuous function while the DNN is optimal.

In this paper, we consider classification problems. It is known that estimating the classifier directly instead of estimating the conditional class probability (i.e., $\eta({\bf x})=\Pr(Y=1|{\bf X}={\bf x})$ ) can help achieve fast convergence rates (Mammen and Tsybakov, 1999; Tsybakov, 2004; Tsybakov and van de Geer, 2005; Audibert and Tsybakov, 2007) under Tsybakov’s low noise condition. We prove that the estimation of a classifier based on the DNN with the hinge loss can achieve fast convergence rates under various situations.

In practice, estimating the classifier directly is difficult because the classifier itself is discontinuous. Mammen and Tsybakov (1999); Tsybakov (2004); Tsybakov and van de Geer (2005) considered estimating the classifier directly, which may be computationally infeasible in practice. Under the smoothness assumption on the conditional class probability, Audibert and Tsybakov (2007) estimated the conditional class probability using a local polynomial estimator and obtained a plug-in classifier. Finding the best plug-in classifier, however, requires searching in a given sieve, which is computationally demanding. In contrast, learning a DNN is relatively straightforward owing to the gradient descent algorithm, despite a risk of arriving at bad local minima.

We consider three cases regarding a true classifier: (1) a smooth boundary, (2) smooth conditional class probability, and (3) the margin condition (i.e., the probability of the inputs near the decision boundary is small). We prove that the DNN classifier can achieve fast convergence rates for all of these three cases if the architecture (i.e., the number of layers, number of nodes, and sparsity of the weights) of the DNN is carefully selected. In particular, the DNN classifier is minimax optimal for a smooth conditional class probability, and achieves faster convergence rates under the margin condition. To the best of the authors’ knowledge, no other estimator achieves fast convergence rates for these three cases simultaneously.

The cross-entropy is the standard objective function used in learning a DNN, and is an empirical risk with respect to the logistic loss (i.e., the negative log-likelihood of the logistic model). It is well known that the logistic loss estimates the conditional class probability rather than the classifier, and hence will be sub-optimal. However, learning a DNN with the cross-entropy performs quite well in practice. We justify the use of the cross-entropy in learning a DNN by showing that the corresponding classifier also achieves a fast convergence rate when most data have a conditional class probability close to 1 or zero. Note that this assumption is reasonable for image recognition because human beings recognize most real world images quite well.

The remainder of this paper is organized as follows. Section 2 describes the hinge loss and DNN classifier. Section 3 derives the convergence rates of the excess risk of a DNN classifier for the aforementioned three cases regarding a true model. The fast convergence rate of the DNN classifier with the cross-entropy is derived in Section 4, and concluding remarks follow in Section 5.

1.1 Notations

For a function $f:{\cal X}\to\mathbb{R}$ , where ${\cal X}$ denotes the domain of the function, let $\|f\|_{\infty}=\sup_{{\bf x}\in{\cal X}}|f({\bf x})|$ . For a given subset $B$ of ${\cal X}$ , we let $\|f\|_{\infty,B}=\sup_{{\bf x}\in B}|f({\bf x})|$ .

For two given sequences $\{a_{n}\}_{n\in\mathbb{N}}$ and $\{b_{n}\}_{n\in\mathbb{N}}$ of real numbers, we write $a_{n}\lesssim b_{n}$ if there exists a constant $C>0$ such that $a_{n}\leq Cb_{n}$ for all sufficiently large $n$ , and $a_{n}\gtrsim b_{n}$ if $b_{n}\lesssim a_{n}$ . In addition, we write $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $a_{n}\gtrsim b_{n}$ . For $N\in\mathbb{N}$ , we let $[N]=\{1,\dots,N\}$ .

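Let ${\bf m}=(m_{1},\dots,m_{d})\in\mathbb{N}_{0}^{d}$ be a multi-index, where $\mathbb{N}_{0}=\mathbb{N}\cup\{0\}$ . We define $|{\bf m}|=m_{1}+\cdots+m_{d}$ and ${\bf x}^{{\bf m}}=x_{1}^{m_{1}}\cdots x_{d}^{m_{d}}$ for a multi-index ${\bf m}$ . For $f:{\cal X}\to\mathbb{R}$ and ${\bf m}\in\mathbb{N}_{0}^{d}$ , let

$$\partial^{{\bf m}}f=\frac{\partial^{|{\bf m}|}f}{\partial{\bf x}^{{\bf m}}}=\frac{\partial^{|{\bf m}|}f}{\partial x_{1}^{m_{1}}\cdots\partial x_{d}^{m_{d}}},$$

and for $s\in(0,1]$ , let

$$[f]_{s,{\cal X}}=\sup_{{\bf x},{\bf y}\in{\cal X},\,{\bf x}\neq{\bf y}}\frac{|f({\bf x})-f({\bf y})|}{|{\bf x}-{\bf y}|^{s}}.$$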

For $m\in\mathbb{N}$ , we denote by ${\cal C}^{m}({\cal X})$ the space of $m$ times differentiable functions on ${\cal X}$ whose partial derivatives of order ${\bf m}$ with $|{\bf m}|\leq m$ are continuous. For a positive real value $\alpha$ , we write $\alpha=[\alpha]^{-}+\{\alpha\}^{+}$ , where $[\alpha]^{-}=\lceil\alpha-1\rceil\in\mathbb{N}_{0}$ and $\{\alpha\}^{+}=\alpha-[\alpha]^{-}\in(0,1]$ . The Hölder space of order $\alpha$ is defined as ${\cal H}^{\alpha}({\cal X})=\left\{f\in{\cal C}^{[\alpha]^{-}}({\cal X}):\|f\|_{{\cal H}^{\alpha}({\cal X})}<\infty\right\}$ , where $\|f\|_{{\cal H}^{\alpha}({\cal X})}$ denotes the Hölder norm defined by

$$\|f\|_{{\cal H}^{\alpha}({\cal X})}=\max_{|{\bf m}|\leq[\alpha]^{-}}\|\partial^{{\bf m}}f\|_{\infty,{\cal X}}+\max_{|{\bf m}|=[\alpha]^{-}}[\partial^{{\bf m}}f]_{\{\alpha\}^{+},{\cal X}}.$$

We let

$${\cal H}^{\alpha,r}({\cal X})=\left\{f\in{\cal C}^{[\alpha]^{-}}({\cal X}):\|f\|_{{\cal H}^{\alpha}({\cal X})}\leq r\right\},$$

which is a closed ball in the Hölder space of radius $r$ with respect to the Hölder norm.
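As a small illustration of the decomposition $\alpha=[\alpha]^{-}+\{\alpha\}^{+}$ used above, the following sketch computes both parts for a few values of $\alpha$. The function name `holder_split` is hypothetical and only serves this example; note that an integer $\alpha$ is split with fractional part 1 rather than 0.

```python
import math


def holder_split(alpha):
    """Split alpha > 0 into ([alpha]^-, {alpha}^+), with {alpha}^+ in (0, 1]."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    integer_part = math.ceil(alpha - 1)     # [alpha]^- = ceil(alpha - 1)
    fractional_part = alpha - integer_part  # {alpha}^+ = alpha - [alpha]^-
    return integer_part, fractional_part


print(holder_split(2.5))  # (2, 0.5)
print(holder_split(2.0))  # (1, 1.0): an integer alpha gets fractional part 1
print(holder_split(0.3))  # (0, 0.3)
```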

2 Estimation of the classifier with DNNs

We consider a binary classification problem. The data are given as $({\bf x}_{1},y_{1}),\ldots,({\bf x}_{n},y_{n})$ , where ${\bf x}_{i}\in{\cal X}\subset\mathbb{R}^{d}$ are input vectors, and $y_{i}\in\{-1,1\}$ are class labels. Here, for simplicity, we set ${\cal X}=[0,1]^{d}$ ; however, this can be extended to any compact subset of $\mathbb{R}^{d}$ . We assume that $({\bf x}_{i},y_{i})$ are independent copies of a random vector $({\bf X},Y)\sim\Pr$ for a certain probability measure $\Pr$ . We let $P_{X}$ be the marginal distribution of ${\bf X}$ induced by the joint distribution $\Pr$ .

2.1 Necessity of the hinge loss

Before going further, we will first review why we consider the hinge loss instead of the logistic loss to achieve fast convergence rates. Let ${\cal C}$ be the class of all classifiers (i.e., all measurable mappings from ${\cal X}$ to $\{-1,1\}$ ). The objective of classification is to find the optimal classifier (called the Bayes classifier) $C^{*}$ , which is defined as

$$C^{*}=\mathop{\rm argmin}_{C\in{\cal C}}\mbox{E}\left[\mathbf{1}\{C({\bf X})\neq Y\}\right],$$

where $\mathbf{1}\{\cdot\}$ is 1 if $\{\cdot\}$ is true, and is 0 otherwise.

Because we do not know the probability measure $\Pr$ generating data, we cannot find $C^{*}$ . Instead, we estimate $C^{*}$ based on the training data. The most popular method for estimating $C^{*}$ is the empirical risk minimization approach, where we estimate $C^{*}$ by minimizing the empirical risk. That is, we estimate $C^{*}$ using $\widehat{C}$ , where

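$$\widehat{C}=\mathop{\rm argmin}_{C\in{\cal C}_{n}}\sum_{i=1}^{n}\mathbf{1}\{C({\bf x}_{i})\neq y_{i}\}/n,\tag{2.1}$$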

where ${\cal C}_{n}$ is a given class of classifiers depending on the sample size $n$ .

In practice, $\widehat{C}$ is not computationally feasible because minimizing the empirical risk with the 0-1 loss over ${\cal C}_{n}$ is NP-hard (Bartlett et al., 2006). An alternative approach is to replace the 0-1 loss with computationally easier losses, the so-called surrogate losses. In addition, instead of a class of classifiers ${\cal C}_{n}$ , we consider a class of real-valued functions ${\cal F}_{n}$ . For a given surrogate loss $\phi$ , we obtain $\widehat{f}$ by minimizing the surrogate empirical risk (or empirical $\phi$ -risk)

$${\cal E}_{\phi,n}(f)=\sum_{i=1}^{n}\phi(y_{i}f({\bf x}_{i}))/n\tag{2.2}$$

on ${\cal F}_{n}$ , and construct a classifier by $\widehat{C}({\bf x})={\rm sign}\widehat{f}({\bf x})$ .
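To make the two empirical risks concrete, the following minimal sketch evaluates the empirical $\phi$ -risk with the hinge loss and the empirical 0-1 risk of the induced sign classifier on a toy one-dimensional sample. The data, the score function `f`, and the convention ${\rm sign}(0)=+1$ are assumptions made only for this illustration.

```python
def hinge(z):
    """Hinge loss phi(z) = max(1 - z, 0)."""
    return max(1.0 - z, 0.0)


def empirical_phi_risk(f, data, phi):
    """Average of phi(y_i * f(x_i)) over the sample."""
    return sum(phi(y * f(x)) for x, y in data) / len(data)


def empirical_01_risk(f, data):
    """Fraction of points misclassified by sign(f); sign(0) is taken as +1 (assumption)."""
    errors = sum(1 for x, y in data if (1 if f(x) >= 0 else -1) != y)
    return errors / len(data)


# Toy sample with inputs in [0, 1] and labels in {-1, +1}.
data = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1), (0.45, 1)]


def f(x):
    """An arbitrary real-valued score function, not a fitted model."""
    return 4.0 * (x - 0.5)


print(empirical_phi_risk(f, data, hinge))
print(empirical_01_risk(f, data))
```

The hinge risk is never smaller than the 0-1 risk here, which reflects the fact that the hinge loss upper-bounds the 0-1 loss.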

A question in using a convex surrogate loss is the relation between the minimizer of the 0-1 empirical risk (2.1) and that of the empirical $\phi$ -risk (2.2). Because the empirical $\phi$ -risk converges to the population $\phi$ -risk ${\cal E}_{\phi}(f)=\mbox{E}[\phi(Yf({\bf X}))]$ for a given $f$ by the law of large numbers, we can consider $\widehat{f}$ as an estimator of $f^{*}_{\phi}$ , which is defined as

$$f_{\phi}^{*}=\mathop{\rm argmin}_{f\in{\cal F}_{\infty}}{\cal E}_{\phi}(f),$$

where ${\cal F}_{\infty}$ is the limit of ${\cal F}_{n}$ in a certain sense. When ${\cal F}_{\infty}$ is the set of all measurable functions, we say that the surrogate loss $\phi$ is Fisher consistent if ${\rm sign}(f_{\phi}^{*}({\bf x}))=C^{*}({\bf x})$ .

It is known (Lin, 2004; Bartlett et al., 2006) that the Fisher consistency holds under very mild conditions on $\phi$ . In particular, $f_{\phi}^{*}$ is known for various surrogate losses. For example, when $\phi$ is the logistic loss (i.e., $\phi(z)=\log(1+\exp(-z))$ ), we have $f_{\phi}^{*}({\bf x})=\log\{\eta({\bf x})/(1-\eta({\bf x}))\}$ , where $\eta({\bf x})=\Pr(Y=1|{\bf X}={\bf x})$ (Friedman et al., 2000). Hence, the logistic loss satisfies the Fisher consistency, which justifies the use of the cross-entropy when learning a deep neural network. That is, deep learning with the cross-entropy essentially estimates the log odds of the conditional class probability.

As we explained in the Introduction, it would be better to estimate the Bayes classifier directly, which is realized conceptually if $f^{*}_{\phi}$ is the Bayes classifier. The hinge loss $\phi(z)=(1-z)_{+}=\max\{1-z,0\}$ has such a property (Lin, 2002), which is why we consider the hinge loss. Note that there are other losses that have $f^{*}_{\phi}=C^{*}$ . An example is the $\psi$ -loss (Shen et al., 2003), which is also known as the ramp loss (Collobert et al., 2006). Although the $\psi$ -loss has many advantages over the hinge loss, the $\psi$ -loss is nonconvex, and learning a DNN classifier using the $\psi$ -loss would be extremely difficult because the DNN classifier is nonconvex as well.
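The difference between the two population minimizers can be checked numerically. The sketch below minimizes the conditional $\phi$ -risk $\eta\phi(f)+(1-\eta)\phi(-f)$ at a single point over a grid of candidate values $f$ ; the grid range and resolution are arbitrary choices for this illustration. Under these assumptions, the hinge minimizer lands on the Bayes label ${\rm sign}(2\eta-1)$ , while the logistic minimizer lands near the log odds.

```python
import math


def hinge(z):
    return max(1.0 - z, 0.0)


def logistic(z):
    return math.log(1.0 + math.exp(-z))


def conditional_risk(phi, eta, f):
    """Expected loss at a point where Pr(Y = 1 | x) = eta and the score is f."""
    return eta * phi(f) + (1.0 - eta) * phi(-f)


def grid_minimizer(phi, eta, lo=-3.0, hi=3.0, steps=601):
    """Brute-force minimizer of the conditional risk over an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda f: conditional_risk(phi, eta, f))


for eta in (0.8, 0.3):
    log_odds = math.log(eta / (1.0 - eta))
    print(eta, grid_minimizer(hinge, eta), grid_minimizer(logistic, eta), log_odds)
```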

2.2 Learning DNN with the hinge loss

We consider DNNs that take $d$ -dimensional inputs and produce one-dimensional outputs. A DNN with $L$ many layers, and $\{N^{(l)},l\in[L]\}$ many nodes at each layer, is defined as

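$$z_{j}^{(l)}({\bf x})=b_{j}^{(l)}+\sum_{k=1}^{N^{(l-1)}}W_{j,k}^{(l)}h_{k}^{(l-1)}({\bf x})$$

and

$$h_{j}^{(l)}({\bf x})=\sigma(z_{j}^{(l)}({\bf x}))$$

for $l=1,\ldots,L$ and

$$f({\bf x})=b^{(L+1)}+\sum_{k=1}^{N^{(L)}}W_{1,k}^{(L+1)}h_{k}^{(L)}({\bf x})$$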

with $N^{(0)}=d$ and $h_{k}^{(0)}({\bf x})=x_{k}$ . We consider the ReLU activation function $\sigma(z)=(z)_{+}$ . We denote $f({\bf x})$ as $f({\bf x}|\Theta)$ , where $\Theta=(({\bf W}^{(l)},{\bf b}^{(l)}))_{l=1,\dots,L+1}$ is the parameter set including all weights and biases.
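The recursion above can be written as a short forward pass. This is a minimal sketch in plain Python, assuming that `weights[l]` stores the matrix ${\bf W}^{(l+1)}$ as a list of rows and `biases[l]` stores ${\bf b}^{(l+1)}$ ; the example network, which computes $|x_{1}-x_{2}|-0.5$ , is made up for illustration.

```python
def relu(z):
    """ReLU activation sigma(z) = max(z, 0)."""
    return max(z, 0.0)


def forward(x, weights, biases):
    """Evaluate f(x | Theta): ReLU on the L hidden layers, linear output layer."""
    h = list(x)  # h^(0)(x) = x
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        # z_j = b_j + sum_k W[j][k] * h_k
        z = [b_j + sum(w * h_k for w, h_k in zip(row, h)) for row, b_j in zip(W, b)]
        h = z if l == last else [relu(v) for v in z]
    return h[0]  # one-dimensional output


# Hypothetical network with d = 2, L = 1, and N^(1) = 2 hidden nodes.
weights = [[[1.0, -1.0], [-1.0, 1.0]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [-0.5]]

print(forward((0.9, 0.2), weights, biases))
print(forward((0.5, 0.5), weights, biases))
```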

For the given $\Theta$ , let $|\Theta|$ be the number of layers in $\Theta$ . Let $N_{\max}(\Theta)$ be the maximum number of nodes, that is, $f(\cdot|\Theta)$ has at most $N_{\max}(\Theta)$ nodes at each layer. We define $\|\Theta\|_{0}$ as the number of nonzero parameters in $\Theta$ ,

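$$\|\Theta\|_{0}=\sum_{l=1}^{L+1}\left(\|\text{vec}({\bf W}^{(l)})\|_{0}+\|{\bf b}^{(l)}\|_{0}\right),$$

where $\text{vec}({\bf W}^{(l)})$ transforms the matrix ${\bf W}^{(l)}$ into the corresponding vector by concatenating the column vectors. Similarly, we define $\|\Theta\|_{\infty}$ as the largest absolute value of the parameters in $\Theta$ ,

$$\|\Theta\|_{\infty}=\max\left\{\max_{1\leq l\leq L+1}\|\text{vec}({\bf W}^{(l)})\|_{\infty},\ \max_{1\leq l\leq L+1}\|{\bf b}^{(l)}\|_{\infty}\right\}.$$

For a given $n$ , let ${\cal F}_{n}$ be

$${\cal F}_{n}=\big\{f({\bf x}|\Theta):|\Theta|\leq L_{n},\ N_{\max}(\Theta)\leq N_{n},\ \|\Theta\|_{0}\leq S_{n},\ \|\Theta\|_{\infty}\leq B_{n},\ \|f(\cdot|\Theta)\|_{\infty}\leq F_{n}\big\}$$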

where the positive constants $L_{n}$ , $N_{n}$ , $S_{n}$ , $B_{n}$ , and $F_{n}$ are specified later.
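The parameter-level constraints that define ${\cal F}_{n}$ are easy to evaluate for a given $\Theta$ . The sketch below assumes $\Theta$ is stored as a list of `(W, b)` pairs, interprets $|\Theta|$ as the number $L$ of hidden layers, and takes $N_{\max}(\Theta)$ over the hidden layers only; both interpretations are assumptions of this example. The sup-norm constraint $\|f(\cdot|\Theta)\|_{\infty}\leq F_{n}$ is not checked because it requires bounding the network output over all of ${\cal X}$ .

```python
def flatten(theta):
    """Yield every weight and bias in theta, a list of (W, b) pairs."""
    for W, b in theta:
        for row in W:
            yield from row
        yield from b


def l0_norm(theta):
    """Number of nonzero parameters."""
    return sum(1 for v in flatten(theta) if v != 0)


def linf_norm(theta):
    """Largest absolute value among the parameters."""
    return max(abs(v) for v in flatten(theta))


def num_layers(theta):
    """Number of hidden layers L; theta holds L + 1 (W, b) pairs (assumption)."""
    return len(theta) - 1


def max_width(theta):
    """Largest number of nodes over the hidden layers (assumption)."""
    return max(len(b) for _, b in theta[:-1])


def satisfies_parameter_constraints(theta, L_n, N_n, S_n, B_n):
    return (num_layers(theta) <= L_n and max_width(theta) <= N_n
            and l0_norm(theta) <= S_n and linf_norm(theta) <= B_n)


# Hypothetical network with d = 2, one hidden layer of 2 nodes, zero hidden biases.
theta = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [-0.5])]
print(l0_norm(theta), linf_norm(theta), num_layers(theta), max_width(theta))
print(satisfies_parameter_constraints(theta, L_n=2, N_n=4, S_n=10, B_n=1.0))
```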

We let $\widehat{f}^{\textup{DNN}}_{\phi,n}$ be the minimizer of ${\cal E}_{\phi,n}(f)$ over ${\cal F}_{n}$ for a given surrogate loss $\phi$ , i.e.,

$$\widehat{f}^{\textup{DNN}}_{\phi,n}=\mathop{\rm argmin}_{f\in{\cal F}_{n}}\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i}f({\bf x}_{i})).\tag{2.3}$$

In the following section, we prove the fast convergence rates of $\widehat{f}^{\textup{DNN}}_{\phi,n}$ for various cases of the true model when $\phi$ is the hinge loss and $L_{n}$ , $N_{n}$ , $S_{n}$ , $B_{n}$ , and $F_{n}$ are carefully selected. For detailed formulas of $L_{n},N_{n},S_{n},B_{n}$ , and $F_{n}$ in terms of the sample size $n$ , see the proofs of the corresponding theorems in the Appendix.

3 Fast convergence rates of DNN classifiers with the hinge loss

In this section, we consider the hinge loss and derive the convergence rates of the excess risk of $\widehat{f}^{\textup{DNN}}_{\phi,n}$ . For a given function $f$ , the excess risk of $f$ is defined as

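$${\cal E}(f,C^{*})={\cal E}(f)-{\cal E}(C^{*})=\mbox{E}\left[\mathbf{1}(Yf({\bf X})<0)\right]-\mbox{E}\left[\mathbf{1}(YC^{*}({\bf X})<0)\right],$$

and the excess $\phi$ -risk of $f$ is defined by

$${\cal E}_{\phi}(f,f_{\phi}^{*})={\cal E}_{\phi}(f)-{\cal E}_{\phi}(f_{\phi}^{*})=\mbox{E}\left[\phi(Yf({\bf X}))\right]-\mbox{E}\left[\phi(Yf_{\phi}^{*}({\bf X}))\right].$$

Throughout this paper, we always assume the Tsybakov noise condition (Mammen and Tsybakov, 1999; Tsybakov, 2004).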

(N)

There exist $C>0$ and $q\in[0,\infty]$ such that for any $t>0$

$$\Pr\left(\{{\bf X}:|2\eta({\bf X})-1|\leq t\}\right)\leq Ct^{q}.$$

We call the parameter $q$ appearing in assumption (N) the noise exponent.
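As a hedged illustration of assumption (N), the sketch below approximates $\Pr(|2\eta(X)-1|\leq t)$ for $X$ uniform on $[0,1]$ with a midpoint grid. Both choices of $\eta$ are made up for this example: for $\eta(x)=x$ the probability equals $t$ for $t\leq 1$ , so (N) holds with $q=1$ and $C=1$ ; for a step function taking only the values 0.1 and 0.9 the probability is zero for every $t<0.8$ , which corresponds to $q=\infty$ .

```python
def mass_near_half(eta, t, n_grid=100_000):
    """Approximate Pr(|2*eta(X) - 1| <= t) for X ~ Uniform[0, 1] on a midpoint grid."""
    hits = sum(
        1 for i in range(n_grid)
        if abs(2.0 * eta((i + 0.5) / n_grid) - 1.0) <= t
    )
    return hits / n_grid


def eta_linear(x):
    """Conditional class probability that crosses 1/2 smoothly (q = 1)."""
    return x


def eta_step(x):
    """Conditional class probability bounded away from 1/2 (q = infinity)."""
    return 0.9 if x >= 0.5 else 0.1


for t in (0.1, 0.2, 0.5):
    print(t, mass_near_half(eta_linear, t), mass_near_half(eta_step, t))
```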

We consider three cases regarding a true model: (1) a smooth decision boundary, (2) smooth class conditional probability, and (3) the margin condition. We derive the fast convergence rates of the DNN classifier using the hinge loss for all three cases.

3.1 Case 1: Smooth boundary

To describe the smooth Bayes classifier, we introduce the notion of piecewise constant functions with smooth boundaries. We adopt the notations and definitions from Petersen and Voigtlaender (2018) and Imaizumi and Fukumizu (2018). For $g\in{\cal H}^{\alpha,r}([0,1]^{d-1})$ and $j\in[d]$ , we define a horizon function $\Psi_{g,j}:[0,1]^{d}\to\{0,1\}$ as

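$$\Psi_{g,j}({\bf x})=\mathbf{1}(x_{j}\geq g({\bf x}_{-j})),$$

where ${\bf x}_{-j}=(x_{1},\dots,x_{j-1},x_{j+1},\dots,x_{d})$ . For each horizon function, we define the corresponding basis piece $I_{g,j}$ as

$$I_{g,j}=\left\{{\bf x}\in[0,1]^{d}:\Psi_{g,j}({\bf x})=1\right\}.$$

We define a piece by the intersection of $K$ basis pieces. The set of pieces is denoted by

$${\cal A}^{\alpha,r,K}=\left\{A\subset[0,1]^{d}:A=\bigcap_{k=1}^{K}I_{g_{k},j_{k}},\ g_{k}\in{\cal H}^{\alpha,r}([0,1]^{d-1}),\ j_{k}\in[d]\right\}.$$

Let ${\cal C}^{\alpha,r,K,T}$ be the set of classifiers of the form

$$C({\bf x})=2\sum_{t=1}^{T}\mathbf{1}({\bf x}\in A_{t})-1,$$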

for $T\in\mathbb{N}$ , and disjoint subsets $A_{1},\ldots,A_{T}$ of ${\cal X}$ in ${\cal A}^{\alpha,r,K}$ . In this subsection, we assume that the Bayes classifier belongs to ${\cal C}^{\alpha,r,K,T}$ .
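The construction of horizon functions, pieces, and the resulting classifier can be sketched as follows. The boundary function `g_sine` and the choice $d=2$ , $K=1$ , $T=1$ are assumptions made only for illustration; the index `j` is 1-based to match the notation in the text.

```python
import math


def horizon(g, j, x):
    """Psi_{g,j}(x) = 1 if x_j >= g(x_{-j}) and 0 otherwise (j is 1-based)."""
    x_minus_j = x[:j - 1] + x[j:]
    return 1 if x[j - 1] >= g(x_minus_j) else 0


def in_piece(basis, x):
    """A piece is the intersection of basis pieces, given as a list of (g, j) pairs."""
    return all(horizon(g, j, x) == 1 for g, j in basis)


def classifier(pieces, x):
    """C(x) = 2 * sum_t 1(x in A_t) - 1; the pieces A_t are assumed disjoint."""
    return 2 * sum(1 for basis in pieces if in_piece(basis, x)) - 1


def g_sine(u):
    """Hypothetical smooth boundary on [0, 1]^(d-1) with d = 2."""
    return 0.5 + 0.2 * math.sin(2.0 * math.pi * u[0])


pieces = [[(g_sine, 2)]]  # T = 1 piece built from K = 1 basis piece

print(classifier(pieces, (0.25, 0.9)))  # point above the boundary
print(classifier(pieces, (0.25, 0.6)))  # point below the boundary
```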

The following theorem proves the convergence rate of the DNN classifier with the hinge loss.

Theorem 1.

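Assume (N) with the noise exponent $q\in[0,\infty]$ . If the surrogate loss $\phi$ is the hinge loss, the classifier $\widehat{f}_{\phi,n}^{\textup{DNN}}$ defined by (2.3) with carefully selected $L_{n},N_{n},S_{n},B_{n}$ , and $F_{n}$ satisfies

$$\sup_{C^{*}\in{\cal C}^{\alpha,r,K,T}}\mbox{E}\left[{\cal E}(\widehat{f}^{\textup{DNN}}_{\phi,n},C^{*})\right]\lesssim\left(\frac{\log^{3}n}{n}\right)^{\frac{\alpha(q+1)}{\alpha(q+2)+(d-1)(q+1)}}\tag{3.1}$$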

where the expectation is taken over the training data.

Tsybakov (2004) showed that the minimax lower bound is given by

$$\inf_{\widehat{f}_{n}}\sup_{C^{*}\in{\cal C}^{\alpha,r,1,1}}\mbox{E}\left[{\cal E}(\widehat{f}_{n},C^{*})\right]\gtrsim n^{-\frac{\alpha(q+1)}{\alpha(q+2)+(d-1)q}},$$

where the infimum is taken over all classifiers $\widehat{f}_{n}:({\cal X}\times{\cal Y})^{n}\mapsto{\cal F}$ , where ${\cal F}$ is the set of all measurable functions. Unfortunately, the convergence rate (3.1) is not optimal in the minimax sense. However, the difference becomes small when the noise exponent $q$ is large. Note that the estimators in Mammen and Tsybakov (1999) and Tsybakov (2004) have slower convergence rates than that in (3.1) when $\alpha<d-1$ . However, the estimator of Tsybakov and van de Geer (2005) achieves the minimax lower bound for any $\alpha>0$ . At this point, we do not know whether the sub-optimal convergence rate (3.1) is inevitable owing to the use of the hinge loss rather than the 0-1 loss. We will pursue this issue in the near future.
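The claim that the gap to the minimax lower bound shrinks as $q$ grows can be checked by comparing the two exponents directly: the upper bound has $(d-1)(q+1)$ in the denominator, whereas the lower bound has $(d-1)q$ . The sketch below ignores the logarithmic factor, and the values $\alpha=2$ and $d=3$ are arbitrary choices for illustration.

```python
def upper_exponent(alpha, q, d):
    """Exponent of the upper bound for a smooth boundary, without the log factor."""
    return alpha * (q + 1) / (alpha * (q + 2) + (d - 1) * (q + 1))


def lower_exponent(alpha, q, d):
    """Exponent of the minimax lower bound for a smooth boundary."""
    return alpha * (q + 1) / (alpha * (q + 2) + (d - 1) * q)


alpha, d = 2.0, 3
for q in (0, 1, 10, 100):
    up = upper_exponent(alpha, q, d)
    low = lower_exponent(alpha, q, d)
    print(q, round(up, 4), round(low, 4), round(low - up, 4))
```

A larger exponent means a faster rate, so the upper-bound exponent is always the smaller of the two for finite $q$ .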

3.2 Case 2: Smooth conditional class probability

We assume that $\eta({\bf x})$ is smooth. That is, $\eta(\cdot)\in{\cal H}^{\beta,r}([0,1]^{d})$ for some $\beta>0$ and $r>0$ . The following theorem provides the convergence rate of the DNN classifier.

Theorem 2.

Assume (N) with the noise exponent $q\in[0,\infty]$ . If the surrogate loss $\phi$ is the hinge loss, the classifier $\widehat{f}_{\phi,n}^{\textup{DNN}}$ defined by (2.3) with carefully selected $L_{n},N_{n},S_{n},B_{n}$ , and $F_{n}$ satisfies

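$$\sup_{\eta\in{\cal H}^{\beta,r}}\mbox{E}\left[{\cal E}(\widehat{f}^{\textup{DNN}}_{\phi,n},C^{*})\right]\lesssim\left(\frac{\log^{3}n}{n}\right)^{\frac{\beta(q+1)}{\beta(q+2)+d}}.\tag{3.2}$$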

Audibert and Tsybakov (2007) showed that when $\eta(\cdot)\in{\cal H}^{\beta}([0,1]^{d})$ , the minimax lower bound of the excess risk is given by

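$$\inf_{\widehat{f}_{n}}\sup_{\eta\in{\cal H}^{\beta,r}}\mbox{E}\left[{\cal E}(\widehat{f}_{n},C^{*})\right]\gtrsim n^{-\frac{\beta(q+1)}{\beta(q+2)+d}}.$$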

Hence, the convergence rate (3.2) is minimax optimal up to a logarithmic factor.

3.3 Case 3: Margin condition

The convergence rate can be improved if we assume that the density of an input vector is small around the decision boundary. Let $B_{\epsilon}^{*}=\{{\bf x}:\text{dist}({\bf x},D^{*})\leq\epsilon\}$ , where $D^{*}=\{{\bf x}:\eta({\bf x})=1/2\}$ and $\text{dist}({\bf x},D^{*})=\inf_{{\bf x}^{\prime}\in D^{*}}\|{\bf x}-{\bf x}^{\prime}\|_{2}$ , where $\|\cdot\|_{2}$ denotes the Euclidean norm. We introduce the following condition on the probability measure $P_{X}$ .

(M)

There exist $C>0$ , $\epsilon_{0}>0$ , and $\gamma\in[1,\infty]$ such that for any $\epsilon\in(0,\epsilon_{0}]$ ,

$$\Pr\left(\{{\bf X}:\text{dist}({\bf X},D^{*})\leq\epsilon\}\right)\leq C\epsilon^{\gamma}.$$

The condition (M) is considered by Steinwart and Christmann (2008), where the parameter $\gamma$ in (M) is called the margin exponent. Steinwart and Christmann (2008) prove that the support vector machine with the Gaussian kernel achieves a fast convergence rate under the condition (M). The following theorem proves that a similar convergence rate can be achieved using the DNN classifier.

Theorem 3.

Assume (N) with the noise exponent $q\in[0,\infty]$ , and (M) with the margin exponent $\gamma\in[1,\infty]$ . If the surrogate loss $\phi$ is the hinge loss, the classifier $\widehat{f}_{\phi,n}^{\textup{DNN}}$ defined by (2.3) with carefully selected $L_{n},N_{n},S_{n},B_{n}$ , and $F_{n}$ satisfies

$$\sup_{C^{*}\in{\cal C}^{\alpha,r,K,T}}\mbox{E}\left[{\cal E}(\widehat{f}^{\textup{DNN}}_{\phi,n},C^{*})\right]\lesssim\left(\frac{\log^{3}n}{n}\right)^{\frac{\alpha(q+1)}{\alpha(q+2)+(d-1)(q+1)/\gamma}}.\tag{3.3}$$

An interesting feature of the convergence rate (3.3) is that the dependence on the input dimension $d$ diminishes as $\gamma$ increases. In the extreme case where $\gamma\rightarrow\infty$ , the convergence rate becomes $n^{-(q+1)/(q+2)}$ up to a logarithmic factor, which depends on neither the smoothness of the boundary nor the dimension of the input. This partly explains why the DNN classifier works well with high-dimensional inputs such as images.
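To see how the margin exponent removes the dependence on the dimension, the sketch below evaluates the exponent of the rate under the margin condition for increasing $\gamma$ . The values $\alpha=2$ , $q=1$ , and $d=3072$ (the number of pixels in a $3\times 32\times 32$ image) are assumptions chosen only for illustration, and the logarithmic factor is ignored.

```python
def margin_rate_exponent(alpha, q, d, gamma):
    """Exponent alpha(q+1) / (alpha(q+2) + (d-1)(q+1)/gamma), without the log factor."""
    return alpha * (q + 1) / (alpha * (q + 2) + (d - 1) * (q + 1) / gamma)


alpha, q, d = 2.0, 1.0, 3072
limit = (q + 1) / (q + 2)  # exponent as gamma tends to infinity

for gamma in (1.0, 10.0, 1e3, 1e6):
    print(gamma, margin_rate_exponent(alpha, q, d, gamma), limit)
```

For $\gamma=1$ the exponent is close to zero in this high-dimensional setting, while for large $\gamma$ it approaches $(q+1)/(q+2)$ regardless of $\alpha$ and $d$ .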

To investigate the validity of the margin condition (M), we explore the area near the decision boundary obtained by the cat and dog images of the CIFAR10 dataset. We first fit the decision boundary using a convolutional neural network (CNN) with cat and dog images in the CIFAR10 dataset. We then randomly select two images, one from dog and the other from cat, and take convex combinations of them to obtain a sequence of images between the two selected images. Figure 1 shows five sequences of images from five randomly selected pairs of dog and cat images. The images in the red box, which are the interpolated images with weights of the dog images ranging from $0.3$ to $0.7$ , are visually unrealistic, which suggests that the image classification has a large margin exponent.
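The interpolation step described above amounts to taking pixel-wise convex combinations. A minimal sketch, assuming each image is stored as a flat list of pixel intensities; the two tiny "images" below are placeholders and not CIFAR10 data.

```python
def interpolate(image_a, image_b, weights):
    """Return the convex combinations (1 - w) * image_a + w * image_b for each weight w."""
    return [
        [(1.0 - w) * a + w * b for a, b in zip(image_a, image_b)]
        for w in weights
    ]


# Placeholder 2x2 grayscale images flattened to length-4 lists.
cat_image = [0.0, 0.2, 0.4, 1.0]
dog_image = [1.0, 0.8, 0.4, 0.0]

weights = [i / 10 for i in range(11)]  # weight of the second image: 0.0, 0.1, ..., 1.0
sequence = interpolate(cat_image, dog_image, weights)

print(sequence[0])   # equals the first image
print(sequence[5])   # midpoint between the two images
print(sequence[10])  # equals the second image
```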

3.4 Remarks regarding adaptive estimation

In practice, we know none of $q,\alpha,\beta$ , and $\gamma$ , which affect the choice of the DNN architecture parameters $L_{n},N_{n},S_{n},B_{n}$ , and $F_{n}$ . We may select them data-adaptively. General tools for finding an adaptive classifier have been developed by Tsybakov (2004) and Audibert and Tsybakov (2007). These tools can be applied to a DNN classifier with minor modification.

For example, the model selection approach with a data split proposed by Audibert and Tsybakov (2007) can be applied without much difficulty. We first split the training data into two parts, $D_{1}$ and $D_{2}$ , with the sample sizes $n_{1}$ and $n_{2}$ . We then choose various values of $q,\alpha,\beta$ , and $\gamma$ , select the corresponding DNN architectures, and learn the architectures on data $D_{1}$ . Finally, among the learned DNN architectures, we choose the best DNN architecture based on the data $D_{2}$ . Because there exist model selection algorithms for which the difference between the selected model and the true model is $O_{p}(1/n)$ (for example, see Juditsky et al. (2008) and Audibert and Tsybakov (2007)), the selected model achieves the best possible convergence rate $r_{n}^{*}$ as long as $n_{1}/n\rightarrow 1$ and $r_{n}^{*}/n_{2}\rightarrow 0$ . We plan to report the detailed results of this soon.
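The data-split procedure can be sketched generically: fit one candidate per configuration on $D_{1}$ and keep the candidate with the smallest 0-1 error on $D_{2}$ . In the sketch below, `fit_threshold` is a toy stand-in for training a DNN with a given architecture (it fits a one-dimensional threshold rule on a grid whose resolution plays the role of the architecture), and the data are made up; none of this is the authors' procedure in detail.

```python
def holdout_error(classifier, data):
    """Empirical 0-1 error of a classifier on a list of (x, y) pairs."""
    return sum(1 for x, y in data if classifier(x) != y) / len(data)


def select_by_holdout(fit, configs, d1, d2):
    """Fit one model per configuration on d1; return the (config, model) best on d2."""
    fitted = [(config, fit(config, d1)) for config in configs]
    return min(fitted, key=lambda pair: holdout_error(pair[1], d2))


def fit_threshold(resolution, data):
    """Toy learner (assumption): best rule x >= i / resolution on the training data."""
    def make(threshold):
        return lambda x: 1 if x >= threshold else -1
    candidates = [make(i / resolution) for i in range(resolution + 1)]
    return min(candidates, key=lambda c: holdout_error(c, data))


d1 = [(0.05, -1), (0.2, -1), (0.35, -1), (0.55, 1), (0.7, 1), (0.95, 1)]
d2 = [(0.1, -1), (0.3, -1), (0.42, -1), (0.6, 1), (0.8, 1)]

best_config, best_classifier = select_by_holdout(fit_threshold, [1, 2, 10], d1, d2)
print(best_config, holdout_error(best_classifier, d2))
```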

4 Use of cross-entropy

The logistic loss does not estimate the classifier directly, and hence the convergence rate is sub-optimal in general. However, in practice, a DNN with the logistic loss (i.e., learned by minimizing the cross-entropy) works quite well. In this section, we investigate when the logistic loss works well with a DNN. We prove that the convergence rate of the excess risk of the DNN estimator with the logistic loss can be fast when the true conditional class probabilities of most of the data are close to 1 or 0. This condition is expected to hold in most image recognition problems because human beings, who are thought to be a Bayes classifier, are very good at recognizing most images. The formal statement of this condition is given as follows:

(E)

For a given positive sequence $\{\tilde{F}_{n}\}_{n\in\mathbb{N}}$ with $\tilde{F}_{n}\to\infty$ , there exists a positive sequence $\{\lambda_{n}\}_{n\in\mathbb{N}}$ with $\lambda_{n}\downarrow 0$ such that

[TABLE]

Theorem 4.

Assume (M) with the margin exponent $\gamma\in[1,\infty]$ . Let $\kappa=\alpha/(\alpha+(d-1)/\gamma)$ . Assume (E) with $\tilde{F}_{n}\asymp\kappa(\log n-3\log(\log n))$ and $\lambda_{n}\asymp e^{-\tilde{F}_{n}}$ . If $\phi$ is the logistic loss, then the classifier $\widehat{f}_{\phi,n}^{\textup{DNN}}$ defined by (2.3) with $F_{n}=\tilde{F}_{n}$ and carefully selected $L_{n},N_{n},S_{n}$ , and $B_{n}$ satisfies

[TABLE]

The convergence rate in Theorem 4 is equivalent to that in Theorem 3 for $q=\infty$ up to a logarithmic factor.
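To give a feel for the quantities in Theorem 4, the following sketch evaluates $\kappa$ , $\tilde{F}_{n}$ , and $\lambda_{n}$ numerically. The theorem only fixes these sequences up to constants ( $\asymp$ ), so taking every implicit constant equal to 1 is an assumption made here for illustration, as are the function names and the example values of $\alpha$ , $d$ , $\gamma$ , and $n$ .

```python
import math

def kappa(alpha, d, gamma):
    """kappa = alpha / (alpha + (d - 1) / gamma); gamma may be math.inf."""
    return alpha / (alpha + (d - 1) / gamma)

def f_tilde(n, k):
    """F_n = kappa * (log n - 3 log log n), implicit constant taken as 1."""
    return k * (math.log(n) - 3.0 * math.log(math.log(n)))

def lam(n, k):
    """lambda_n = exp(-F_n), implicit constant taken as 1."""
    return math.exp(-f_tilde(n, k))

alpha, d, gamma, n = 2.0, 3, 1.0, 10**6      # illustrative values only
k = kappa(alpha, d, gamma)                   # 2 / (2 + 2)
print(k, f_tilde(n, k), lam(n, k))

# exp(-F_n) can be rewritten as n^(-kappa) * (log n)^(3 kappa).
print(n ** (-k) * math.log(n) ** (3 * k))
```

With $\gamma=\infty$ the term $(d-1)/\gamma$ vanishes and $\kappa=1$ .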

To investigate the validity of the condition (E), Figure 2 shows a histogram of the estimated conditional class probabilities of the test data of the CIFAR10 data using the DNN classifier with the logistic loss. Note that most of the conditional class probabilities are very close to either 1 or 0.
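A rough numerical counterpart of the histogram check above is to count how many estimated conditional class probabilities fall within a small distance of 0 or 1. The sketch below is only an empirical proxy and not the formal condition (E); the function name `extreme_fraction`, the threshold, and the probability values are all made up for illustration.

```python
def extreme_fraction(probs, lam):
    """Fraction of estimated probabilities that lie within lam of 0 or of 1."""
    return sum(1 for p in probs if p <= lam or p >= 1.0 - lam) / len(probs)

# Made-up estimated probabilities; in practice these would come from a fitted classifier.
probs = [0.001, 0.999, 0.98, 0.5, 0.02, 0.9995]
print(extreme_fraction(probs, lam=0.05))
```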

We compare the performance of the two DNN classifiers learned using the two surrogate losses, the logistic loss and the hinge loss. We analyze three benchmark datasets for image recognition, that is, MNIST, SVHN, and CIFAR10, where for each dataset we select the two classes that are most difficult to distinguish. The data descriptions and selected classes are summarized in Table 1. The detailed DNN architectures for the three datasets are given in Appendix A.10. The Adam optimizer is used with the learning rate $10^{-3}$ . Table 2 summarizes the test data error rates for various sizes of training data. The results are the averages (and standard errors) over 100 randomly selected training sets, and they show that the two estimators compete well with each other.

5 Concluding Remarks

We showed that a DNN is very flexible in the sense that it achieves fast convergence rates for various cases regarding a true model. It is interesting to note that a DNN is not only good at estimating a smooth decision boundary but also a smooth conditional class probability. In addition, a DNN can fully utilize the margin condition.

We showed that using the cross-entropy is also promising when the true conditional class probability is close to either 0 or 1 for most data. However, we conjecture that learning a DNN by minimizing the cross-entropy would be sub-optimal when the conditional class probability is not extreme.

Our theoretical results could be used to develop model selection procedures, particularly for the optimal selection of $L_{n}$ and $N_{n}$ . Moreover, it will be interesting to develop an online learning algorithm that can select $L_{n}$ and $N_{n}$ data-adaptively.

We did not consider a computational issue in this paper. Learning a DNN with a sparsity constraint has not been fully studied, although some methods have been proposed (e.g., Liu et al. (2015), Han et al. (2015), and Wen et al. (2016)). A learning algorithm that supports our theoretical results will be worth pursuing.

Acknowledgement

This work was supported by the Samsung Science and Technology Foundation under Project Number SSTF-BA1601-02.

Appendix A

A.1 Complexity measures of a class of functions

We introduce the complexity measures of a given class of functions. Let $\|\cdot\|_{p}$ for $1\leq p<\infty$ be defined as $\|f\|_{p}=\left(\int_{\cal X}|f({\bf x})|^{p}\text{d}\mu({\bf x})\right)^{1/p}$ , where $\mu$ denotes the Lebesgue measure and $\|f\|_{\infty}=\sup_{{\bf x}\in{\cal X}}|f({\bf x})|$ .

Let ${\cal F}$ be a given class of real-valued functions defined on ${\cal X}$ . Let $\delta>0$ and $p\in[1,\infty]$ . A collection $\{f_{i}\in{\cal F}:i\in[N]\}$ is called a $\delta$ -covering set of ${\cal F}$ with respect to the $L_{p}$ norm if, for all $f\in{\cal F}$ , there exists $f_{i}$ in the collection such that $\|f-f_{i}\|_{p}\leq\delta$ . The cardinality of the minimal $\delta$ -covering set is called the $\delta$ -covering number of ${\cal F}$ with respect to the $L_{p}$ norm, and is denoted by ${\cal N}(\delta,{\cal F},\|\cdot\|_{p})$ , that is,

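$${\cal N}(\delta,{\cal F},\|\cdot\|_{p})=\inf\left\{N\in\mathbb{N}:\exists f_{1},\dots,f_{N}\in{\cal F}\text{ such that }{\cal F}\subset\bigcup_{i=1}^{N}B_{p}(f_{i},\delta)\right\},$$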

where $B_{p}(f_{i},\delta)=\{f\in{\cal F}:\|f-f_{i}\|_{p}\leq\delta\}$ .

A collection of pairs $\{(f_{i}^{L},f_{i}^{U})\in{\cal F}\times{\cal F}:i\in[N]\}$ is called a $\delta$ -bracketing set of ${\cal F}$ with respect to the $L_{p}$ norm if $\|f_{i}^{U}-f_{i}^{L}\|_{p}\leq\delta$ for all $i\in[N]$ , and for any $f\in{\cal F}$ , there is a pair $(f_{i}^{L},f_{i}^{U})$ in the collection such that $f_{i}^{L}\leq f\leq f_{i}^{U}$ . The cardinality of the minimal $\delta$ -bracketing set is called the $\delta$ -bracketing number of ${\cal F}$ with respect to the $L_{p}$ norm, and is denoted by ${\cal N}_{B}(\delta,{\cal F},\|\cdot\|_{p})$ . The $\delta$ -bracketing entropy, denoted by $H_{B}(\delta,{\cal F},\|\cdot\|_{p})$ , is the logarithm of the $\delta$ -bracketing number, i.e., $H_{B}(\delta,{\cal F},\|\cdot\|_{p})=\log{\cal N}_{B}(\delta,{\cal F},\|\cdot\|_{p})$ .

For any $\delta>0$ , it is known (see, for example, Lemma 2.1 of van de Geer (2000)) that

[TABLE]

for any $p\in[1,\infty)$ , and

[TABLE]

if $\mu({\cal X})=1$ .
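The covering-number definition can be made concrete with a greedy construction on a finite function class. The sketch below is an illustration under assumptions: functions are represented by their values on a finite grid, so the supremum norm is only evaluated on that grid, and the class of linear functions $x\mapsto\theta x$ is a made-up example. A greedy cover is a valid $\delta$ -covering set whose size upper-bounds the covering number, but it need not be minimal.

```python
def sup_dist(f, g):
    """Supremum-norm distance between two functions given by their grid values."""
    return max(abs(a - b) for a, b in zip(f, g))

def greedy_cover(functions, delta):
    """Return centers such that every function is within delta of some center."""
    centers = []
    for f in functions:
        # f becomes a new center only if no existing center already covers it.
        if all(sup_dist(f, c) > delta for c in centers):
            centers.append(f)
    return centers

# Hypothetical class: x -> theta * x on a grid over [0, 1], theta in {0, 1/50, ..., 1}.
grid = [i / 10 for i in range(11)]
functions = [tuple(k / 50 * x for x in grid) for k in range(51)]

delta = 0.1
centers = greedy_cover(functions, delta)
print(len(centers), "centers cover", len(functions), "functions")
```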

A.2 Convergence rate of the excess $\phi$ -risk for general surrogate losses

In this subsection, we derive the convergence rate of the excess $\phi$ -risk under regularity conditions, which is used repeatedly in the following subsections. The regularity conditions and techniques of the proof are minor modifications of those in Park (2009); however, we present the complete conditions and proof for the sake of readers’ convenience.

We assume the following regularity conditions.

(A1)

$\phi$ is Lipschitz, i.e., there exists a constant $C_{1}>0$ such that $|\phi(z_{1})-\phi(z_{2})|\leq C_{1}|z_{1}-z_{2}|$ for any $z_{1},z_{2}\in\mathbb{R}$ .

(A2)

For a positive sequence $a_{n}=O(n^{-a_{0}})$ as $n\to\infty$ for some $a_{0}>0$ , there exists a sequence of function classes $\{{\cal F}_{n}\}_{n\in\mathbb{N}}$ such that

[TABLE]

for some $f_{n}\in{\cal F}_{n}$ .

(A3)

There exists a sequence $\{F_{n}\}_{n\in\mathbb{N}}$ with $F_{n}\gtrsim 1$ such that $\sup_{f\in{\cal F}_{n}}\|f\|_{\infty}\leq F_{n}$ .

(A4)

There exists a constant $\nu\in(0,1]$ such that for any $f\in{\cal F}_{n}$ and any $n\in\mathbb{N}$ ,

[TABLE]

for a constant $C_{2}>0$ depending only on $\phi$ and $\eta(\cdot)$ .

(A5)

For a positive constant $C_{3}>0$ , there exists a sequence $\{\delta_{n}\}_{n\in\mathbb{N}}$ such that

[TABLE]

for $\{{\cal F}_{n}\}_{n\in\mathbb{N}}$ in (A2), $\{F_{n}\}_{n\in\mathbb{N}}$ in (A3), and $\nu$ in (A4).
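Condition (A1) can be checked numerically for the two surrogate losses used in the paper. The sketch below assumes the standard forms of the hinge loss, $\phi(z)=\max\{1-z,0\}$ , and the logistic loss, $\phi(z)=\log(1+e^{-z})$ , and estimates the largest difference quotient on a finite grid; the grid and the helper name `max_slope` are illustrative choices. Both quotients stay at or below 1, consistent with a Lipschitz constant $C_{1}=1$ .

```python
import math

def hinge(z):
    return max(1.0 - z, 0.0)

def logistic(z):
    return math.log1p(math.exp(-z))

def max_slope(phi, grid):
    """Largest difference quotient |phi(a) - phi(b)| / |a - b| over grid pairs."""
    return max(abs(phi(a) - phi(b)) / abs(a - b)
               for a in grid for b in grid if a != b)

grid = [k * 0.25 for k in range(-40, 41)]   # points in [-10, 10]
print(max_slope(hinge, grid), max_slope(logistic, grid))
```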

For a proof of the general convergence result, we apply the large deviation inequality of Shen and Wong (1994) presented in Lemma 1.

Lemma 1 (Theorem 3 of Shen and Wong (1994)).

Let ${\cal F}$ be the class of functions bounded above by $F$ . Assume that $\mbox{{\rm E}}f(Z)=0$ for any $f\in{\cal F}$ and $v\geq\sup_{f\in{\cal F}}\textup{Var}(f(Z))$ for some $v>0$ . Suppose that there exists $\zeta>0$ such that

(C1)

$H_{B}(v^{1/2},{\cal F},\|\cdot\|_{2})\leq\zeta nM^{2}/(8(4v+MF/3))$ ,

(C2)

$M\leq\zeta v/(4F)$ , $v^{1/2}\leq F$ ,

(C3)

if $\zeta M/8<v^{1/2}$ ,

[TABLE]

Then,

[TABLE]

where ${\Pr}^{*}$ denotes the outer probability measure.

The following theorem is the main result of this section, which gives the convergence rate of the excess $\phi$ -risk.

Theorem 5.

Suppose that the conditions (A1)-(A5) are met. Let $\epsilon_{n}^{2}\asymp\max\{a_{n},\delta_{n}\}$ . Then, the empirical $\phi$ -risk minimizer $\widehat{f}_{\phi,n}$ over ${\cal F}_{n}$ satisfies

[TABLE]

for some universal constant $C>0$ .

Proof.

Let $C_{1}$ , $C_{2}$ , and $C_{3}$ be constants appearing in assumptions (A1), (A4), and (A5), respectively. Let $\epsilon_{n}^{2}=\max\{2a_{n},2^{7}\delta_{n}/C_{1}\}$ . We define the following empirical process

[TABLE]

where $f_{n}\in{\cal F}_{n}$ is a function such that ${\cal E}_{\phi}(f_{n},f^{*}_{\phi})\leq a_{n}$ .

Since $\widehat{f}_{n}$ minimizes ${\cal E}_{\phi,n}(f)=\frac{1}{n}\sum_{i=1}^{n}\phi(y_{i}f({\bf x}_{i}))$ , it follows that

[TABLE]

We define

[TABLE]

Note that for $i\in\mathbb{N}$ such that $2^{i-1}\epsilon_{n}^{2}>2C_{1}F_{n}$ , ${\cal F}_{n,i}$ is an empty set. This is because for any $f\in{\cal F}_{n}$ , $\|f\|_{\infty}\leq F_{n}$ , and thus ${\cal E}_{\phi}(f,f^{*}_{\phi})\leq\mbox{{\rm E}}|\phi(Yf({\bf X}))-\phi(Yf^{*}_{\phi}({\bf X}))|\leq C_{1}\mbox{{\rm E}}|f({\bf X})-f^{*}_{\phi}({\bf X})|\leq 2C_{1}F_{n}$ . Therefore, $\{f\in{\cal F}_{n}:{\cal E}_{\phi}(f,f^{*}_{\phi})\geq\epsilon_{n}^{2}\}\subset\bigcup_{i=1}^{i^{*}_{n}}{\cal F}_{n,i}$ , where $i^{*}_{n}=\inf\{i\in\mathbb{N}:2^{i-1}\epsilon_{n}^{2}>2C_{1}F_{n}\}$ . Thus, we only need to consider ${\cal F}_{n,i}$ with $i\leq i^{*}_{n}$ . Because ${\cal E}_{\phi}(f_{n},f^{*}_{\phi})\leq a_{n}\leq\epsilon_{n}^{2}/2$ , we have

[TABLE]

We introduce the notation $M_{n,i}=2^{i-2}\epsilon_{n}^{2}$ for a concise expression. Through the triangle inequality and (A4), we obtain the following variance bound

[TABLE]

Now, we have

[TABLE]

To bound the right-hand side, we apply Lemma 1 to the class of functions

[TABLE]

with $\zeta=1/2$ , $F=D_{1}F_{n}$ , $M=M_{n,i}$ , and $v=v_{n,i}=D_{2}F_{n}^{2-\nu}M_{n,i}^{\nu}$ where we let

[TABLE]

Note that for any $h\in{\cal H}_{n,i}$ , $\|h\|_{\infty}\leq C_{1}\|f_{n}-f\|_{\infty}\leq 2C_{1}F_{n}$ , and $\sup_{h\in{\cal H}_{n,i}}\textup{Var}(h({\bf X},Y))\leq C_{2}(1+4^{\nu})F_{n}^{2-\nu}M_{n,i}^{\nu}$ by (A.2). Since $D_{1}\geq 2C_{1}$ and $D_{2}\geq C_{2}(1+4^{\nu})$ , $\sup_{h\in{\cal H}_{n,i}}\|h\|_{\infty}\leq D_{1}F_{n}$ , and $\sup_{h\in{\cal H}_{n,i}}\textup{Var}(h({\bf X},Y))\leq v_{n,i}$ . Now we will check (C1)-(C3) of Lemma 1. Because $M_{n,i}\leq 2C_{1}F_{n}$ for any $i\leq i_{n}^{*}$ and $D_{2}\geq 64(2C_{1})^{2-\nu}$ ,

[TABLE]

and

[TABLE]

Therefore, (C2) in Lemma 1 holds. For (C3), we first note that

[TABLE]

where the first inequality follows from (A1), and the second inequality follows from ${\cal F}_{n,i}\subset{\cal F}_{n}$ . Because $\int_{\zeta M_{n,i}/32}^{v_{n,i}^{1/2}}H_{B}(u,{\cal F}_{n},\|\cdot\|_{2})^{1/2}\textup{d}u/M_{n,i}$ is non-increasing in $i$ ,

[TABLE]

where the fourth inequality is due to (A5). By taking $C_{3}^{1/2}=2^{-13/2-5\nu/2}C_{1}^{\nu/2-1}D_{2}^{-1/2}$ , (C3) of Lemma 1 is satisfied. Furthermore, (A.4) implies that

[TABLE]

where the last inequality is due to the fact that $v^{1/2}_{n,i}\geq M_{n,i}/8$ . On the other hand, since $v_{n,i}/(8D_{1}F_{n})\geq M_{n,i}$ ,

[TABLE]

which is larger than $\frac{1}{7^{2}\times 2^{17}}\frac{M_{n,i}^{2}}{v_{n,i}}n.$ Hence (C1) of Lemma 1 is met.

Applying Lemma 1 to each ${\cal H}_{n,i}$ , (A.3) is further bounded as

[TABLE]

for certain positive constants $C_{4},C_{5}$ , and $C_{6}$ , which leads to the desired result. $\square$

A.3 Generic convergence rate for the hinge loss

We derive the convergence rate of the excess risk of the hinge loss under the conditions (A2), (A3), and (A5). Note that (A1) holds with $C_{1}=1$ for the hinge loss. We adopt the following lemma for the variance bound (A4).

Lemma 2 (Lemma 6.1 of Steinwart and Scovel (2007)).

Assume (N) with the noise exponent $q\in[0,\infty]$ . Assume $\|f\|_{\infty}\leq F$ for any $f\in{\cal F}$ . For the hinge loss $\phi$ , we have that, for any $f\in{\cal F}$ ,

[TABLE]

where $C_{\eta,q}=\left(\|(2\eta-1)^{-1}\|_{q,\infty}^{q}+1\right)\mathbf{1}(q>0)+1$ and $\|(2\eta-1)^{-1}\|_{q,\infty}^{q}$ is defined by

[TABLE]

Theorem 6.

Let $\phi$ be the hinge loss. Assume (N) with the noise exponent $q\in[0,\infty]$ , and that (A2), (A3), and (A5) are met. Let $\epsilon_{n}^{2}\asymp\max\{a_{n},\delta_{n}\}$ . Assume that $n^{1-\iota}(\epsilon_{n}^{2}/F_{n})^{(q+2)/(q+1)}\gtrsim 1$ for an arbitrarily small constant $\iota>0$ . Then, the empirical $\phi$ -risk minimizer $\widehat{f}_{\phi,n}$ over ${\cal F}_{n}$ satisfies

[TABLE]

where the expectation is taken over the training data.

Proof.

By Zhang’s inequality (Theorem 2.31 of Steinwart and Christmann (2008)), we have ${\cal E}(\widehat{f}_{\phi,n},C^{*})\leq{\cal E}_{\phi}(\widehat{f}_{\phi,n},f_{\phi}^{*})$ . Since (A4) is satisfied with $\nu=q/(q+1)$ by Lemma 2, Theorem 5 implies that

[TABLE]

for some universal constant $C>0$ . Since ${\cal E}(\widehat{f}_{\phi,n},C^{*})$ is bounded above by 1, the preceding display and the assumption $n^{1-\iota}(\epsilon_{n}^{2}/F_{n})^{(q+2)/(q+1)}\gtrsim 1$ imply the desired result. $\square$

A.4 Entropy of the class of DNNs

The following proposition states the upper bound of the $\delta$ -entropy of a neural network function space.

Proposition 1 (Lemma 3 of Suzuki (2018), Lemma 5 of Schmidt-Hieber (2017)).

For any $\delta>0$ ,

[TABLE]

where $B\vee 1=\max\{B,1\}$ .

A.5 Proof of Theorem 1

The following proposition given by Petersen and Voigtlaender (2018) proves that DNNs are good at approximating piecewise constant functions with smooth boundaries.

Proposition 2 (Corollary 3.7 of Petersen and Voigtlaender (2018)).

Let $d\geq 2$ , $\alpha,r>0,K\in\mathbb{N}$ , and $T\in\mathbb{N}$ . For any $C\in{\cal C}^{\alpha,r,K,T}$ and any sufficiently small $\xi>0$ , there exists a neural network

[TABLE]

where the positive constants $L_{0},N_{0},S_{0},B_{0}$ , and $b_{0}$ depend only on $d,\alpha,r$ , and $K$ , such that

[TABLE]

Proof of Theorem 1.

We will check the conditions (A2), (A3), and (A5) in Section A.2, and apply Theorem 6 to complete the proof. For (A2), let $\{\xi_{n}\}_{n\in\mathbb{N}}$ be a positive sequence such that $\xi_{n}\downarrow 0$ . Through Proposition 2, there exists $f_{n}\in{\cal F}_{n}={\cal F}^{\textup{DNN}}(L_{n},N_{n},S_{n},B_{n},1)$ such that $\mbox{{\rm E}}|f_{n}({\bf X})-C^{*}({\bf X})|\leq\xi_{n}$ with $L_{n}\lesssim\log(1/\xi_{n})$ , $N_{n}\lesssim\xi_{n}^{-(d-1)/\alpha}$ and $S_{n}\lesssim\xi_{n}^{-(d-1)/\alpha}\log(1/\xi_{n})$ . Thus,

[TABLE]

and hence (A2) and (A3) hold with $a_{n}=\xi_{n}$ and $F_{n}=1$ .

For (A5), let $\epsilon_{n}^{2}=C_{1}\xi_{n}$ . Then, by Proposition 1,

[TABLE]

In turn, (A.1) implies that (A5) is satisfied if we choose $\epsilon_{n}$ satisfying

[TABLE]

which leads to the best possible convergence rate

[TABLE]

and completes the proof by Theorem 6. $\square$

A.6 Proof of Theorem 2

We first introduce the smooth function approximation result of DNNs.

Proposition 3.

For any function $f\in{\cal H}^{\alpha,r}([0,1]^{d})$ and any sufficiently small $\xi>0$ , there exists a neural network

[TABLE]

such that

[TABLE]

where the constants $L_{0},N_{0},S_{0}$ , and $F$ depend only on $d,\alpha$ and $r$ .

Proof.

Theorem 5 of Schmidt-Hieber (2017) proves that for any $f\in{\cal H}^{\alpha,r}([0,1]^{d})$ and any integers $m\geq 1$ and $M\geq(\alpha+1)^{d}\vee(r+1)$ , there exists a neural network $f(\cdot|\Theta)\in{\cal F}^{\text{DNN}}(L,N,S,1,\infty)$ such that

[TABLE]

where $L=8+(m+5)(1+\lceil\log_{2}d\rceil)$ , $N=12dM$ , and $S=94d^{2}(\alpha+1)^{2d}M(m+6)(1+\lceil\log_{2}d\rceil)$ . By letting $M=(2^{-(\alpha+1)}r^{-1}\xi^{-d/\alpha}$ and $m=\log_{2}\left((2r+1)6^{d+1}(2r)^{d/\alpha}\xi^{-d/\alpha-1}\right)$ , we have $L\lesssim L_{0}\log\left(1/\xi\right),N\lesssim N_{0}\xi^{-d/\alpha},S\lesssim S_{0}\xi^{-d/\alpha}\log\left(1/\xi\right)$ , and $\|f(\cdot|\Theta)-f\|_{\infty}\leq\xi$ . Finally, because $\|f\|_{\infty}\leq r$ , we have $\|f(\cdot|\Theta)\|_{\infty}\leq r+\xi$ , and hence we complete the proof with $F=r+\xi$ . $\square$
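The size formulas quoted above from Theorem 5 of Schmidt-Hieber (2017) are easy to evaluate directly. The sketch below simply plugs integers $m$ and $M$ into the expressions for $L$ , $N$ , and $S$ as stated in the proof; the function name `network_size` and the example values are hypothetical, and no claim is made about the constants $L_{0},N_{0},S_{0}$ .

```python
import math

def network_size(d, alpha, m, M):
    """Depth L, width N and sparsity S from the formulas quoted in the proof."""
    log_term = 1 + math.ceil(math.log2(d))
    L = 8 + (m + 5) * log_term
    N = 12 * d * M
    S = 94 * d**2 * (alpha + 1) ** (2 * d) * M * (m + 6) * log_term
    return L, N, S

# Illustrative values only: d = 2, alpha = 1, m = 10, M = 100.
L, N, S = network_size(d=2, alpha=1, m=10, M=100)
print(L, N, S)
```

The depth grows linearly in $m$ while the width grows linearly in $M$ , which is why the choice of $m$ of order $\log(1/\xi)$ and $M$ of order $\xi^{-d/\alpha}$ gives the stated orders.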

Proof of Theorem 2.

For a given $\xi_{n}$ , by Proposition 3, there exists $\tilde{\eta}_{n}$ such that $\|\tilde{\eta}_{n}({\bf x})-\eta({\bf x})\|_{\infty}\leq\xi_{n}$ with at most $C_{1}\log(1/\xi_{n})$ layers, $C_{2}\xi_{n}^{-d/\beta}$ nodes in each layer, and $C_{3}\xi_{n}^{-d/\beta}\log(1/\xi_{n})$ nonzero parameters for some positive constants $C_{1},C_{2}$ , and $C_{3}$ . We construct the neural network $f_{n}$ by adding one layer to $\tilde{\eta}_{n}({\bf x})$ to achieve

[TABLE]

where $\sigma$ denotes the ReLU activation function. Note that $f_{n}({\bf x})$ is equal to $1$ if $\tilde{\eta}_{n}({\bf x})\geq 1/2+\xi_{n}$ , $(\tilde{\eta}_{n}({\bf x})-1/2)/\xi_{n}$ if $1/2\leq\tilde{\eta}_{n}({\bf x})<1/2+\xi_{n}$ , and $-1$ otherwise. Let $A(4\xi_{n})=\{{\bf x}:|2\eta({\bf x})-1|>4\xi_{n}\}$ . Then, on $A(4\xi_{n})$ , $|f_{n}({\bf x})-C^{*}({\bf x})|=0$ because $\tilde{\eta}_{n}({\bf x})-1/2=(\eta({\bf x})-1/2)+(\tilde{\eta}_{n}({\bf x})-\eta({\bf x}))\geq\xi_{n}$ when $2\eta({\bf x})-1>4\xi_{n}$ . Similarly, we can show that $\tilde{\eta}_{n}({\bf x})-1/2<-\xi_{n}$ when $2\eta({\bf x})-1<-4\xi_{n}$ . Therefore, by (N) we have

[TABLE]

where the inequality in the last line holds since $\|f_{n}({\bf x})\|_{\infty}\leq 1$ .

Note that $f_{n}$ is also a DNN in which the last layer of $f_{n}$ has a finite number of parameters, and the maximum of the parameters is bounded above by $\xi_{n}^{-1}$ . Hence, we can construct the DNN class ${\cal F}^{\textup{DNN}}(L_{n},N_{n},S_{n},B_{n},1)$ containing $f_{n}$ with $L_{n}\lesssim\log(1/\xi_{n})$ , $N_{n}\lesssim\xi_{n}^{-d/\beta}$ , $S_{n}\lesssim\xi_{n}^{-d/\beta}\log(1/\xi_{n})$ , and $B_{n}\lesssim\xi_{n}^{-1}$ . Now, take $\epsilon_{n}^{2}\asymp\xi_{n}^{q+1}$ and observe that

[TABLE]

through Proposition 1. Since $H_{B}(\delta,{\cal F},\|\cdot\|_{2})\leq\log{\cal N}(\delta/2,{\cal F},\|\cdot\|_{\infty})$ , (A5) is satisfied if we choose $\epsilon_{n}$ satisfying

[TABLE]

which leads to the best possible convergence rate

[TABLE]

and completes the proof based on Theorem 6. $\square$

A.7 Proof of Theorem 3

The main technique of the proof is to approximate a piecewise constant function using a DNN with respect to the supremum norm on a specific subset of the domain, where this subset depends on the function to be approximated.

Let $d\geq 2$ , $\alpha,r>0$ , and $K\in\mathbb{N}$ . Let $A_{1},\dots,A_{T}\in{\cal A}^{\alpha,r,K}$ be disjoint sets of the form

[TABLE]

Let $T\in\mathbb{N}$ , and let

[TABLE]

For a given $\xi>0$ , define $B_{\xi}$ such that

[TABLE]

It turns out that every point in $B_{\xi}$ lies at a supremum-norm distance larger than $\xi$ from the decision boundary of $C({\bf x})$ . The following proposition shows that a DNN approximates $C({\bf x})$ well on $B_{\xi}$ .

Proposition 4.

Let $d\geq 2$ , $\alpha,r>0,K\in\mathbb{N}$ , and $T\in\mathbb{N}$ . For any $C\in{\cal C}^{\alpha,r,K,T}$ and a sufficiently small $\xi>0$ , there exists a neural network

[TABLE]

where the positive constants $L_{0},N_{0},S_{0},B_{0}$ , and $b_{0}$ depend only on $d,\alpha,r,K$ , and $T$ , such that

[TABLE]

where $C({\bf x})$ is the function defined in (A.6), and $B_{\xi}$ is defined in (A.7).

Proof.

The proof is deferred to Section A.9. $\square$

Proof of Theorem 3.

Let $\{\xi_{n}\}_{n\in\mathbb{N}}$ be a positive sequence such that $\xi_{n}\downarrow 0$ . Based on Proposition 4, there exists $f_{n}\in{\cal F}_{n}={\cal F}^{\textup{DNN}}(L_{n},N_{n},S_{n},B_{n},1)$ such that

[TABLE]

with $L_{n}\lesssim\log(1/\xi_{n})$ , $N_{n}\lesssim\xi_{n}^{-(d-1)/\alpha}$ , $S_{n}\lesssim\xi_{n}^{-(d-1)/\alpha}\log(1/\xi_{n})$ , and $B_{n}\lesssim\xi_{n}^{-b_{0}}$ for some $b_{0}>0$ .

We now show that $B_{\xi_{n}}^{c}\subset B_{\xi_{n}}^{*}$ , where $B_{\xi_{n}}^{*}=\{{\bf x}:\text{dist}({\bf x},D^{*})\leq\xi_{n}\}$ . Suppose that ${\bf x}\in B_{\xi_{n}}^{c}$ . Then, there are $t\in[T]$ and $k\in[K]$ such that $|x_{j_{(t,k)}}-g_{(t,k)}({\bf x}_{-j_{(t,k)}})|\leq\xi_{n}$ . Let ${\bf x}^{*}$ be the $d$ -dimensional vector where the $j_{(t,k)}$ -th component is equal to $g_{(t,k)}({\bf x}_{-j_{(t,k)}})$ and the other components are the same as the corresponding components of ${\bf x}$ , i.e., $x^{*}_{j_{(t,k)}}=g_{(t,k)}({\bf x}_{-j_{(t,k)}})$ and ${\bf x}^{*}_{-j_{(t,k)}}={\bf x}_{-j_{(t,k)}}$ . Clearly, ${\bf x}^{*}$ is on the decision boundary $D^{*}$ . Since $\|{\bf x}-{\bf x}^{*}\|_{2}=|x_{j_{(t,k)}}-g_{(t,k)}({\bf x}_{-j_{(t,k)}})|\leq\xi_{n}$ , it follows that $\text{dist}({\bf x},D^{*})\leq\xi_{n}$ , which implies that $f_{n}({\bf x})-C^{*}({\bf x})=0$ for any ${\bf x}\in(B_{\xi_{n}}^{*})^{c}$ since $(B_{\xi_{n}}^{*})^{c}\subset B_{\xi_{n}}$ . Therefore, through the condition (M),

[TABLE]

for some constant $C>0$ , and hence (A2) and (A3) hold with $a_{n}=C\xi_{n}^{\gamma}$ and $F_{n}=1$ .
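The geometric step in the argument above, replacing one coordinate of ${\bf x}$ by the boundary function value to obtain a boundary point ${\bf x}^{*}$ , can be illustrated numerically. The sketch below works in $d=2$ with a made-up smooth function `g`; the helper name `project_to_boundary` and the chosen point are hypothetical, and coordinates are 0-based in the code.

```python
import math

def g(rest):
    """Hypothetical smooth boundary function of the remaining coordinate (d = 2)."""
    return 0.5 + 0.2 * math.sin(2.0 * math.pi * rest[0])

def project_to_boundary(x, j, g):
    """Replace the j-th coordinate of x by g(x_{-j}); other coordinates are kept."""
    rest = x[:j] + x[j + 1:]
    x_star = list(x)
    x_star[j] = g(rest)
    return x_star

x, j = [0.62, 0.30], 0
x_star = project_to_boundary(x, j, g)
gap = abs(x[j] - g(x[:j] + x[j + 1:]))

# The Euclidean distance to x_star equals the one-coordinate gap |x_j - g(x_{-j})|.
print(x_star, math.dist(x, x_star), gap)
```

Since ${\bf x}$ and ${\bf x}^{*}$ differ in a single coordinate, the Euclidean distance reduces to that coordinate gap, which is the quantity bounded by $\xi_{n}$ in the proof.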

For (A5), if we take $\epsilon_{n}^{2}=C\xi_{n}^{\gamma}$ , it follows that

[TABLE]

Since $H_{B}(\delta,{\cal F},\|\cdot\|_{2})\leq\log{\cal N}(\delta/2,{\cal F},\|\cdot\|_{\infty})$ , (A5) is satisfied if we choose $\epsilon_{n}$ satisfying

[TABLE]

which leads to the best possible convergence rate

[TABLE]

and completes the proof by Theorem 6. $\square$

A.8 Proof of Theorem 4

For the logistic loss, the following two lemmas are needed. The first lemma states that the $\phi$ -risks of both the $\phi$ -risk minimizer and the Bayes classifier are bounded. The second lemma provides the variance bound of the logistic loss.

Lemma 3.

Let $\phi$ be the logistic loss. Assume (E) with $\lambda_{n}\asymp e^{-\tilde{F}_{n}}$ . There then exist constants $C_{1}>0$ and $C_{2}>0$ such that

[TABLE]

where $\bar{f}_{n}^{*}=\tilde{F}_{n}C^{*}$ .

Proof.

Recall that $f_{\phi}^{*}({\bf x})=\log(\eta({\bf x})/(1-\eta({\bf x})))$ . We let

[TABLE]

It then follows that

[TABLE]

Let

[TABLE]

We divide $A_{n}$ into two disjoint sets $\{{\bf x}:f_{\phi}^{*}>\tilde{F}_{n}\}$ and $\{{\bf x}:f_{\phi}^{*}<-\tilde{F}_{n}\}$ . On $\{{\bf x}:f_{\phi}^{*}>\tilde{F}_{n}\}$ , we have

[TABLE]

Similarly, we can show that $G_{n}\lesssim\tilde{F}_{n}e^{-\tilde{F}_{n}}$ on $\{{\bf x}:f_{\phi}^{*}<-\tilde{F}_{n}\}$ , which implies ${\cal E}_{\phi}(f_{\phi}^{*})\lesssim\tilde{F}_{n}\lambda_{n}+\tilde{F}_{n}e^{-\tilde{F}_{n}}\asymp\tilde{F}_{n}e^{-\tilde{F}_{n}}$ .

We use a similar argument for $\bar{f}_{n}^{*}=\tilde{F}_{n}C^{*}({\bf x})$ . On $\{{\bf x}:f_{\phi}^{*}>\tilde{F}_{n}\}$ , we have

[TABLE]

and similarly we obtain the same upper bound on $\{{\bf x}:f_{\phi}^{*}<-\tilde{F}_{n}\}$ . $\square$

Lemma 4 (Lemma 6.1 of Park (2009)).

Assume (N) with the noise exponent $q\in[0,\infty]$ . Assume $\|f\|_{\infty}\leq F$ for any $f\in{\cal F}$ . Then, for the logistic loss $\phi$ , we have that, for any $f\in{\cal F}$ ,

[TABLE]

for some constant $C>0$ .

Proof of Theorem 4.

Let $\bar{f}_{n}^{*}=F_{n}C^{*}$ . As in the proof of Theorem 3, for a positive sequence $\{\xi_{n}\}_{n\in\mathbb{N}}$ approaching zero, we can find $f_{n}\in{\cal F}_{n}={\cal F}^{\textup{DNN}}(L_{n},N_{n},S_{n},B_{n},F_{n})$ such that

[TABLE]

and

[TABLE]

where $B_{\xi_{n}}$ is defined in (A.7), with $C$ being the Bayes classifier. Because $B_{\xi_{n}}^{c}\subset B_{\xi_{n}}^{*}=\{{\bf x}:\text{dist}({\bf x},D^{*})\leq\xi_{n}\}$ , the condition (M) implies that

[TABLE]

for some constant $C_{1}>0$ . By Lemma 3 and the Lipschitz property of the logistic loss, we have

[TABLE]

for some positive constants $C_{2}$ and $C_{3}$ . Recall that we have defined

[TABLE]

We now take $F_{n}=\kappa(\log n-3\log(\log n))$ and $\xi_{n}^{\gamma}=n^{-\kappa}\log^{3\kappa}n$ such that ${\cal E}_{\phi}(f_{n},f_{\phi}^{*})\lesssim F_{n}e^{-F_{n}}\asymp n^{-\kappa}\log^{3\kappa+1}n$ , and thus the conditions (A2) and (A3) in Section A.2 hold with $a_{n}=n^{-\kappa}\log^{3\kappa+1}n$ and $F_{n}=\kappa(\log n-3\log(\log n))$ .

For (A5), let $\epsilon_{n}^{2}=n^{-\kappa}\log^{3\kappa+1}n$ . Because $\kappa(d-1)/\alpha\gamma=(d-1)/(\alpha\gamma+d-1)=1-\kappa$ , it follows that

[TABLE]

which implies (A5) and completes the proof through Theorem 5 with Lemma 4, which proves the condition (A4). $\square$

A.9 Proof of Proposition 4

Before we provide the proof of Proposition 4, we introduce some useful definitions and techniques for the construction of DNNs, which are mostly from Petersen and Voigtlaender (2018).

For matrices ${\bf W}_{1},\dots,{\bf W}_{N}$ , we let $\textsf{diag}\left({{\bf W}_{1},\dots,{\bf W}_{N}}\right)$ denote a block diagonal matrix whose diagonal blocks are ${\bf W}_{1},\dots,{\bf W}_{N}$ . When ${\bf W}_{i}$ have the same number of rows, we let $\textsf{hstack}\left({{\bf W}_{1},\dots,{\bf W}_{N}}\right)$ denote a concatenated matrix along the column, and when ${\bf W}_{i}$ have the same number of columns, we let $\textsf{vstack}\left({{\bf W}_{1},\dots,{\bf W}_{N}}\right)$ denote a concatenated matrix along the row.

For an index set $D\subset[d]$ , a masking neural network with $L$ layers, denoted by $f(\cdot|\Theta_{D,L})$ , is defined by the parameter $\Theta_{D,L}=\left({({\bf W}^{(l)},\mathbf{0})}\right)_{l=1,\dots,L+1}$ , in which ${\bf W}^{(1)}=\textsf{vstack}\left({{\bf I}(D),-{\bf I}(D)}\right)\in\mathbb{R}^{2d\times d}$ , ${\bf W}^{(l)}={\bf I}_{2d}\in\mathbb{R}^{2d\times 2d}$ for $l=2,\dots,L$ , and ${\bf W}^{(L+1)}=\textsf{hstack}\left({{\bf I}(D),-{\bf I}(D)}\right)\in\mathbb{R}^{d\times 2d}$ . Here ${\bf I}_{d}$ denotes a $d\times d$ identity matrix and ${\bf I}(D)$ is a diagonal matrix where the $i$ -th diagonal entry is equal to 1 if $i\in D$ , and is zero otherwise. If $L=1$ , we define $\Theta_{D,1}=\left({({\bf I}(D),\mathbf{0})}\right)$ . The output of the masking neural network is equal to the masked input ${\bf I}(D){\bf x}$ , of which the $j$ -th element is equal to $x_{j}$ if $j\in D$ , and is zero otherwise. Note that $\left\|{\Theta_{D,L}}\right\|_{0}=2d(L-2)+4|D|\leq 2dL$ .
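The masking construction can be checked with a few lines of plain Python. The sketch below builds the weight matrices as described above (a stacked first layer, identity hidden layers, and a final layer that recombines the positive and negative parts) and verifies that the ReLU network returns ${\bf I}(D){\bf x}$ , including for negative inputs. The helper names are hypothetical, indices in `D` are 0-based in the code, and the forward pass assumes ReLU after every layer except the last, with zero biases.

```python
def relu(v):
    return [max(t, 0.0) for t in v]

def matvec(W, v):
    return [sum(w * t for w, t in zip(row, v)) for row in W]

def mask_matrix(D, d):
    """I(D): diagonal 0/1 matrix with ones at the (0-based) indices in D."""
    return [[1.0 if (i == j and i in D) else 0.0 for j in range(d)] for i in range(d)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def masking_network(D, d, L):
    """Weight matrices W^(1), ..., W^(L+1) of the masking network (biases are zero)."""
    I_D = mask_matrix(D, d)
    if L == 1:
        return [I_D]                                   # single linear layer I(D)
    neg = [[-a for a in row] for row in I_D]
    first = I_D + neg                                  # vstack(I(D), -I(D)): 2d x d
    last = [r + s for r, s in zip(I_D, neg)]           # hstack(I(D), -I(D)): d x 2d
    return [first] + [identity(2 * d) for _ in range(L - 1)] + [last]

def forward(weights, x):
    h = list(x)
    for W in weights[:-1]:
        h = relu(matvec(W, h))      # hidden layers: ReLU
    return matvec(weights[-1], h)   # output layer: linear

x = [0.3, -1.2, 2.0]
print(forward(masking_network({0, 2}, 3, L=3), x))   # keeps coordinates 0 and 2
print(forward(masking_network({1}, 3, L=3), x))      # keeps the negative coordinate 1
```

The first layer splits each kept coordinate into its positive and negative parts, so the ReLU and identity layers pass them through unchanged and the last layer subtracts them to recover the original sign.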

Let $\Theta_{1}=\left({({\bf W}_{1}^{(l)},{\bf b}_{1}^{(l)})}\right)_{l=1,\dots,L_{1}+1}$ and $\Theta_{2}=\left({({\bf W}_{2}^{(l)},{\bf b}_{2}^{(l)})}\right)_{l=1,\dots,L_{2}+1}$ be two neural networks such that the input layer of $\Theta_{1}$ has the same dimension as the output layer of $\Theta_{2}$ . Then, the stacked neural network of $\Theta_{1}$ and $\Theta_{2}$ , denoted by $f(\cdot|\Theta_{1}\bullet\Theta_{2})$ , has the parameter $\Theta_{1}\bullet\Theta_{2}$ defined by

[TABLE]

The stacked neural network $\Theta_{1}\bullet\Theta_{2}$ has $L_{1}+L_{2}$ layers and satisfies

[TABLE]

for any input ${\bf x}\in\mathbb{R}^{d}$ . In addition, we have that $\left\|{\Theta_{1}\bullet\Theta_{2}}\right\|_{0}\leq\left\|{\Theta_{1}}\right\|_{0}+\left\|{\Theta_{2}}\right\|_{0}+2C$ , where $C$ is the constant equal to the product of the input dimension of $\Theta_{1}$ and the output dimension of $\Theta_{2}$ .

Let $\Theta_{1}=\left({({\bf W}_{1}^{(l)},{\bf b}_{1}^{(l)})}\right)_{l=1,\dots,L+1}$ and $\Theta_{2}=\left({({\bf W}_{2}^{(l)},{\bf b}_{2}^{(l)})}\right)_{l=1,\dots,L+1}$ be two neural networks with the same number of layers and $d$ -dimensional inputs. The concatenated neural network of the two networks $\Theta_{1}$ and $\Theta_{2}$ , denoted by $f(\cdot|\Theta_{1}\oplus\Theta_{2})$ , has the parameter $\Theta_{1}\oplus\Theta_{2}=\left({\left({{\bf W}_{1,2}^{(l)},{\bf b}_{1,2}^{(l)}}\right)}\right)_{l=1,\dots,L+1}$ , where ${\bf b}_{1,2}^{(l)}=\textsf{vstack}\left({{\bf b}_{1}^{(l)},{\bf b}_{2}^{(l)}}\right)$ for $l=1,\dots,L+1$ , ${\bf W}_{1,2}^{(l)}=\textsf{diag}\left({{\bf W}_{1}^{(l)},{\bf W}_{2}^{(l)}}\right)$ for $l=2,\dots,L+1$ , and ${\bf W}_{1,2}^{(1)}=\textsf{vstack}\left({{\bf W}_{1}^{(1)},{\bf W}_{2}^{(1)}}\right)$ . The concatenated neural network satisfies

[TABLE]

for any input ${\bf x}\in\mathbb{R}^{d}$ , as well as $\left\|{\Theta_{1}\oplus\Theta_{2}}\right\|_{0}=\left\|{\Theta_{1}}\right\|_{0}+\left\|{\Theta_{2}}\right\|_{0}$ .
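The concatenation rule (first layers stacked vertically, later layers placed on a block diagonal, biases stacked) can be verified on a small example. The sketch below is illustrative: the two tiny networks are made up, the helper names are hypothetical, and the forward pass assumes ReLU after every layer except the last. It checks that the concatenated network returns the two outputs stacked together and that the numbers of nonzero parameters add up.

```python
def relu(v):
    return [max(t, 0) for t in v]

def affine(W, b, v):
    return [sum(w * t for w, t in zip(row, v)) + c for row, c in zip(W, b)]

def forward(theta, x):
    h = list(x)
    for W, b in theta[:-1]:
        h = relu(affine(W, b, h))       # hidden layers: ReLU
    W, b = theta[-1]
    return affine(W, b, h)              # output layer: affine only

def block_diag(A, B):
    cols_a, cols_b = len(A[0]), len(B[0])
    return [row + [0] * cols_b for row in A] + [[0] * cols_a + row for row in B]

def concatenate(theta1, theta2):
    """Parameters of the concatenated network built from two equally deep networks."""
    assert len(theta1) == len(theta2)
    (W1, b1), (W2, b2) = theta1[0], theta2[0]
    layers = [(W1 + W2, b1 + b2)]                       # vstack of the first layers
    for (W1, b1), (W2, b2) in zip(theta1[1:], theta2[1:]):
        layers.append((block_diag(W1, W2), b1 + b2))    # diag of the later layers
    return layers

def nonzeros(theta):
    return sum(w != 0 for W, b in theta for row in W for w in row) + \
           sum(c != 0 for W, b in theta for c in b)

# Two made-up networks with 2-dimensional input and one hidden layer each.
theta1 = [([[1, -1], [2, 0]], [0, -1]), ([[1, 1]], [1])]
theta2 = [([[0, 1]], [-1]), ([[3]], [0])]
theta12 = concatenate(theta1, theta2)

x = [2, 1]
print(forward(theta1, x), forward(theta2, x), forward(theta12, x))
print(nonzeros(theta1), nonzeros(theta2), nonzeros(theta12))
```

Because the block-diagonal layers contain no cross terms, the two sub-networks never interact, which is why the sparsity is exactly additive.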

We are ready to prove Proposition 4. We divide the proof into two steps. First we give the proof of approximation of the horizon functions, and then using the result, we prove Proposition 4.

Lemma 5 (Approximation of horizon functions).

Let $d\geq 2$ , $\alpha,r>0$ , and $K\in\mathbb{N}$ . For a horizon function $\Psi_{g,j}$ , where $g\in{\cal H}^{\alpha,r}([0,1]^{d-1}),j\in[d]$ , and $\xi>0$ , define

[TABLE]

There then exists a neural network

[TABLE]

where the positive constants $L_{0},N_{0},S_{0},B_{0}$ , and $b_{0}$ depend only on $d,\alpha$ , and $r$ , such that

[TABLE]

Proof.

Without loss of generality, assume $j=1$ . By Proposition 3, we can construct a neural network $\tilde{g}=f(\cdot|\Theta_{g})$ on $[0,1]^{d-1}$ such that $\|f(\cdot|\Theta_{g})-g\|_{\infty}<\xi/4$ with $|\Theta_{g}|\lesssim\log(1/\xi)$ , $N_{\max}(\Theta_{g})\lesssim\xi^{-(d-1)/\alpha}$ , $\|\Theta_{g}\|_{0}\lesssim\xi^{-(d-1)/\alpha}\log(1/\xi)$ , $\|\Theta_{g}\|_{\infty}\leq 1$ , and $\|f(\cdot|\Theta_{g})\|_{\infty}\leq r$ . Define the map $\Phi:\mathbb{R}^{d}\to\mathbb{R}^{d}$ by

[TABLE]

Let $D_{1}=\{1\}$ and $D_{-1}=[d]\setminus\{1\}$ . Let $\Theta_{\pm}=\left({(1,-1)^{\top},(-\xi/4,0)^{\top}}\right)$ . Then, consider the network

[TABLE]

where $\Theta_{D_{1},L}$ and $\Theta_{D_{-1},L}$ are masking neural networks and $L=|\Theta_{g}|$ . Clearly, we have

[TABLE]

with $\|\Theta_{\Phi}\|_{0}\lesssim\xi^{-(d-1)/\alpha}\log(1/\xi)$ .

Let $H({\bf x})=\mathbf{1}(x_{1}\geq 0)$ . We now construct a neural network that approximates $H({\bf x})$ . Let $\Theta_{H}=\left({({\bf W}^{(1)},{\bf b}^{(1)}),({\bf W}^{(2)},{\bf b}^{(2)})}\right)$ , where ${\bf W}^{(1)}=(W_{i,j}^{(1)})_{i=1,2;j=1,\ldots,d}$ with $W_{i,1}^{(1)}=\xi^{-1}$ and $W_{i,j}^{(1)}=0$ for $i=1,2$ and $j=2,\dots,d$ , ${\bf b}^{(1)}=(0,-1)^{\top}$ , ${\bf W}^{(2)}=(1,-1)$ , and ${\bf b}^{(2)}=0$ . It can be shown that $|H({\bf x})-f({\bf x}|\Theta_{H})|\leq\mathbf{1}(0\leq x_{1}\leq\xi/2)$ for every ${\bf x}$ , with $\|\Theta_{H}\|_{\infty}\lesssim\xi^{-1}$ . See Lemma A.2 of Petersen and Voigtlaender (2018) for details.

Finally, define

[TABLE]

so that $f({\bf x}|\Theta)=f(f({\bf x}|\Theta_{\Phi})|\Theta_{H})$ . We then have

[TABLE]

We show that both terms on the right-hand side of the preceding display are zero.

For the first term, we note that

[TABLE]

Note also that, on $B_{\xi,g,j}$ , $|x_{1}-g({\bf x}_{-1})-\xi/4|>\xi/4$ , and by the construction of $\tilde{g}$ , $\left|{g({\bf x}_{-1})-\tilde{g}({\bf x}_{-1})}\right|<\xi/4$ for any ${\bf x}\in[0,1]^{d}$ . Combining these two facts, we obtain

[TABLE]

which implies that the first term is equal to zero.

For the second term, we have that, for every ${\bf x}\in B_{\xi,g,j}$

[TABLE]

where the second inequality holds since $x_{1}-\tilde{g}({\bf x}_{-1})-\xi/4=x_{1}-g({\bf x}_{-1})+g({\bf x}_{-1})-\tilde{g}({\bf x}_{-1})-\xi/4<0$ if $x_{1}-g({\bf x}_{-1})<0$ and $x_{1}-g({\bf x}_{-1})+g({\bf x}_{-1})-\tilde{g}({\bf x}_{-1})-\xi/4>\xi/2$ if $x_{1}-g({\bf x}_{-1})>\xi$ . Thus, $\left\|{H(\Phi)-H(f(\cdot|\Theta_{\Phi}))}\right\|_{\infty,B_{\xi,g,j}}=0$ , which completes the proof. $\square$

Proof of Proposition 4.

We give the proof only for the case of $T=1$ ; the extension to the cases $T\geq 2$ is straightforward. Accordingly, we omit the subscript $t$ in all expressions.

Let $f(\cdot|\Theta_{k})$ be a neural network such that

[TABLE]

for any ${\bf x}\in B_{\xi,k}=\left\{{{\bf x}\in[0,1]^{d}:x_{j_{k}}-g_{k}({\bf x}_{-j_{k}})>\xi}\right\}\cup\left\{{{\bf x}\in[0,1]^{d}:x_{j_{k}}-g_{k}({\bf x}_{-j_{k}})<0}\right\}$ , as in Lemma 5. Define the neural network $f(\mathbf{z}|\Theta_{+})$ with $K$ -dimensional inputs as

[TABLE]

where $\sigma$ denotes the ReLU activation function, and define

[TABLE]

We now show that

[TABLE]

If ${\bf x}\in A_{t}^{c}$ , then there is $k^{*}$ such that $f({\bf x}|\Theta_{k^{*}})=0$ . Hence, $\sum_{k=1}^{K}f({\bf x}|\Theta_{k})\leq K-1$ and thus $f({\bf x}|\Theta)=-1$ . If ${\bf x}\in\left\{{\bf x}\in A_{t}:x_{j_{k}}-g_{k}({\bf x}_{-j_{k}})>\xi,\forall k\in[K]\right\}$ , then $f({\bf x}|\Theta_{k})=1$ for all $k$ , and hence $f({\bf x}|\Theta)=1$ . $\square$

A.10 DNN architectures used for the experiments

For the MNIST dataset, we used a DNN with five hidden layers, whose numbers of nodes were 1200, 600, 300, 150, and 150, respectively. All hidden layers are followed by batch normalization (Ioffe and Szegedy, 2015). In addition, for the SVHN and CIFAR10 datasets, we used the CNN models whose architectures are provided in Table 3.
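For a sense of scale, the number of weights and biases in the fully connected MNIST network described above can be counted directly. The sketch below assumes a flattened $28\times 28$ input (784 features) and a single real-valued output for the two-class problem; both are assumptions, since the paper does not state them here, and batch-normalization parameters are not counted.

```python
def dense_parameter_count(widths):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))

input_dim = 28 * 28                      # flattened MNIST image (assumption)
output_dim = 1                           # single real-valued output (assumption)
hidden = [1200, 600, 300, 150, 150]      # hidden widths stated in the text

total = dense_parameter_count([input_dim] + hidden + [output_dim])
print(total)
```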

Bibliography

1. Audibert and Tsybakov [2007] Jean-Yves Audibert and Alexandre B Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
2. Bartlett et al. [2006] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
3. Collobert et al. [2006] Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7(Aug):1687–1712, 2006.
4. Eldan and Shamir [2016] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
5. Friedman et al. [2000] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.
6. Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
7. Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
8. Hinton and Salakhutdinov [2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
