This paper introduces a generalized framework for high-dimensional learning that relaxes traditional assumptions, demonstrating that poly-logarithmic sample complexity suffices for nonsmooth models and neural networks, even without restricted strong convexity.
Contribution
It extends high-dimensional statistical learning theory by relaxing sparsity and RSC conditions using folded concave penalties, enabling analysis of nonsmooth and neural network models with minimal sample complexity.
Findings
1. Poly-logarithmic sample complexity for high-dimensional models.
2. Regularization ensures generalizability of over-parameterized neural networks.
3. Framework applies to nonsmooth learning problems.
Table 1: Summary of sample complexities. $\varepsilon_A$ is the parameter for A-sparsity as in Assumption 1. $n$ and $p$ are the sample size and the dimensionality, respectively. "ReLU-NN" stands for an NN with ReLU activation.
HDSL under A-sparsity:
    S3ONC initialized with Lasso
    S3ONC with suboptimality gap $\Gamma$
Nonsmooth HDSL under A-sparsity:
    S3ONC initialized with Lasso
Neural network (with $D$-many layers and $p$-many fitting parameters):
    S3ONC to a general NN, with suboptimality gap $\Gamma$ and any $s_A: 1\le s_A\le p$
    S3ONC to an NN for a flexible choice of activation functions, with suboptimality gap $\Gamma$, when the target function is polynomial
    A pseudo-polynomial-time computable solution in training a ReLU-NN in the same settings as Cao and Gu (2020)
Table 2: Classification errors of NN variants with and without the FCP on the MNIST dataset. "⟨Model Name⟩-FCP" refers to an FCP-regularized NN. "Param #" stands for the number of nonzero fitting parameters after training. "R. Gap" stands for the relative gap; that is, the ratio between the difference and the value obtained before introducing the FCP.

Model      | CNN        | CNN-FCP    | R. Gap
Test Error | 0.80%      | 0.70%      | 12.50%
Param #    | 1,199,882  | 265,517    | 77.87%

Model      | LN-S       | LN-S-FCP   | R. Gap
Test Error | 0.66%      | 0.64%      | 3.03%
Param #    | 22,000*    | 14,417     | 34.47%

Model      | VGG-g      | VGG-g-FCP  | R. Gap
Test Error | 0.25%      | 0.23%      | 8.00%
Param #    | 16,853,584 | 15,115,902 | 10.31%
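The "R. Gap" column can be recomputed from the other two columns. The short sketch below assumes the caption's definition literally, that is, (value before FCP minus value after FCP) divided by the value before FCP; the function name `relative_gap` is introduced here for illustration only.

```python
def relative_gap(before: float, after: float) -> float:
    """Relative reduction (before - after) / before, returned as a fraction."""
    if before == 0:
        raise ValueError("the baseline value must be nonzero")
    return (before - after) / before


# Values copied from the CNN block of Table 2.
cnn_error_gap = relative_gap(0.80, 0.70)
cnn_param_gap = relative_gap(1_199_882, 265_517)

print(f"test error gap: {100 * cnn_error_gap:.2f}%")
print(f"parameter gap:  {100 * cnn_param_gap:.2f}%")
```

Under this reading, the two CNN entries of the table are reproduced to two decimal places.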
Table 3: Classification errors of NN variants with and without the FCP on the CIFAR-10 dataset. "⟨Model Name⟩-FCP" refers to an FCP-regularized NN. "Param #" stands for the number of nonzero fitting parameters after training. "R. Gap" stands for the relative gap; that is, the ratio between the difference and the value obtained before introducing the FCP.

Model      | VGG19      | VGG19-FCP  | R. Gap
Test Error | 6.86%      | 6.84%      | 12.50%
Param #    | 20,051,546 | 10,789,567 | 46.19%

Model      | shk-RN     | shk-RN-FCP | R. Gap
Test Error | 2.29%      | 2.16%      | 5.67%
Param #    | 11,932,743 | 7,303,200  | 38.79%

Model      | FMix       | FMix-FCP   | R. Gap
Test Error | 1.36%      | 1.31%      | 3.68%
Param #    | 26,422,068 | 21,485,594 | 18.68%
Table 4: Classification errors of SVM with different regularization schemes when the design has lower correlation. "Mean" stands for the average out-of-sample classification error (%) out of 100 random replications, and "SE" is the corresponding standard error (%).
High-Dimensional Learning under Approximate Sparsity with Applications to Nonsmooth Estimation and Regularized Neural Networks

Hongcheng Liu
Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611

Yinyu Ye
Department of Management Science and Engineering, Stanford University, Stanford, CA 94305

Hung Yi Lee
Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611

Abstract
High-dimensional statistical learning (HDSL) has wide applications in data analysis, operations research, and decision-making. Despite the availability of multiple theoretical frameworks, most existing HDSL schemes stipulate the following two conditions: (a) the sparsity, and (b) the restricted strong convexity (RSC). This paper generalizes both conditions via the use of the folded concave penalty (FCP). More specifically, we consider an M-estimation problem where (i) the (conventional) sparsity is relaxed into the approximate sparsity and (ii) the RSC is completely absent. We show that the FCP-based regularization leads to poly-logarithmic sample complexity; the training data size is only required to be poly-logarithmic in the problem dimensionality. This finding can facilitate the analysis of two important classes of models that are currently less understood: the high-dimensional nonsmooth learning and the (deep) neural networks (NN). For both problems, we show that the poly-logarithmic sample complexity can be maintained. In particular, our results indicate that the generalizability of NNs under over-parameterization can be theoretically ensured with the aid of regularization.
This paper is concerned with high-dimensional statistical learning (HDSL), which refers to the problems of estimating a large number of parameters with few training data. HDSL problems arise in a wide range of applications, including imaging, bioinformatics, and deep learning. A standard setup of HDSL is summarized below. We are given a sequence of $n$-many i.i.d. sample observations, denoted $Z_i$, $i=1,\dots,n$. Those observations are copies of a random vector $Z$, which has unknown support $W\subseteq\Re^q$ (for some positive integer $q$) and an unknown probability distribution. In addition to the sample observations above, we are also given a function $L(\beta,Z_i)$, where $L:\Re^p\times W\to\Re$ measures the statistical loss with respect to the data point $Z_i$ and the vector of fitting parameters $\beta:=(\beta_j)\in\Re^p$. Here, the positive integer $p$ is called the problem dimensionality (which is equal to the number of fitting parameters). Throughout this paper, we assume that $L$ is measurable and deterministic, the expectation $\mathbb{E}[L(\beta,Z)]$ over $Z$ is well-defined for all $\beta\in\Re^p$, and $\inf_{\beta}\mathbb{E}[L(\beta,Z)]>-\infty$. Though no convexity assumption is imposed explicitly, many of our results are mainly useful when $L(\cdot,z)$ is convex. Given the above, it is often essential in applications to estimate the solution to the following population-level problem:
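$$\beta^*\in\arg\min_{\beta\in\Re^p}\ \left\{\mathbb{L}(\beta):=\mathbb{E}[L(\beta,Z)]\right\}. \qquad (1)$$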
Here, $\beta^*$ is intuitively the vector of fitting parameters which yields the smallest population-level statistical loss (a.k.a. population risk). Therefore, $\beta^*$ is considered the target of estimation and referred to as the vector of "true parameters". The HDSL problem of interest is then how to estimate (or approximate) $\beta^*$, given the a-priori knowledge of the samples $Z_1^n:=(Z_1,Z_2,\dots,Z_n)$ and the formulation of $L$, when $p\ge n$. We are especially interested in the more challenging case where the sample size $n$ is much smaller than the dimensionality $p$ (i.e., $p\gg n$). In measuring the approximation quality (a.k.a. recovery quality) of an estimator $\beta\in\Re^p$, we consider a metric of generalization error calculated as $\mathbb{L}(\beta)-\inf_{\beta}\mathbb{L}(\beta)$, where $\mathbb{L}(\beta)=\mathbb{E}[L(\beta,Z)]$ is the population risk. This metric is the same as the excess risk, which is discussed by Bartlett et al. (2006), Koltchinskii (2010), and Clémençon et al. (2008), among others, as an important, if not the primary, measure of generalization performance for their results.
For the HDSL problems above, most traditional schemes are not applicable, because they usually stipulate that n>p. For example, one popularly adopted scheme is to construct a surrogate for the population-level formulation in (1) through the sample average approximation (SAA) below:
$$\min_{\beta\in\Re^p}\ \mathcal{L}_n(\beta,Z_1^n):=\frac{1}{n}\sum_{i=1}^n L(\beta,Z_i), \qquad (2)$$
where the objective function Ln(β,Z1n) is often also called the empirical risk function in the context of statistical and machine learning. The SAA entails desirable computational and statistical properties (many of which are discussed by Shapiro et al. 2014, and references therein) but is not designed for handling high dimensionality. Indeed, the best known upper bound on the approximation error of the SAA solution is of the order O(p/n), where O(⋅) hides some quantities independent of, or poly-logarithmic in, “⋅”. Consequently, the estimator of the true parameters generated by solving the SAA, as well as by most other traditional statistical learning approaches, may incur non-trivial errors when p≫n.
To address high dimensionality, several statistical schemes have already been made available. (See Bühlmann and van de Geer 2011, Fan et al. 2014, for excellent reviews.) Among them, this paper follows and generalizes one of the most successful HDSL techniques introduced by Fan and Li (2001) and Zhang (2010) as in the formulation below:
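$$\min_{\beta\in\Re^p}\ \mathcal{L}_{n,\lambda}(\beta,Z_1^n):=\mathcal{L}_n(\beta,Z_1^n)+\sum_{j=1}^p P_\lambda(|\beta_j|), \qquad (3)$$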
where Pλ:ℜ+→ℜ+ is a term of sparsity-inducing regularization in the form of a folded concave penalty (FCP). One mainstream special case of the existing FCPs, called the minimax concave penalty (MCP) (Zhang 2010), is of our particular consideration. The MCP is formulated as
$$P_\lambda(\theta):=\int_0^{\theta}\frac{[a\lambda-t]_+}{a}\,dt, \qquad (4)$$
with $[\cdot]_+:=\max\{0,\cdot\}$ and tuning parameters $a,\lambda>0$. (Hereafter, we use the term "FCP" to refer to the MCP exclusively.) Eq. (3) is nonconvex, and its local and/or global solutions have been shown to entail desirable statistical performance (Loh and Wainwright 2015, Wang et al. 2013, 2014, Zhang and Zhang 2012, Loh 2017). To understand the roles of the tuning parameters $a$ and $\lambda$ in the FCP, observe that its first derivative, $P_\lambda'(\theta)$, is a non-increasing function with $P_\lambda'(0)=\lambda$ and $P_\lambda'(\theta)=0$ for all $\theta\ge a\lambda$. This means that $\lambda$ determines how strongly the penalty pushes a fitting parameter that is almost zero to be exactly zero. The intensity of this penalty decreases as the magnitude of the corresponding fitting parameter increases. Once the absolute value of that parameter exceeds the threshold $a\lambda$, the penalty becomes a constant and thus (locally) ineffective. Furthermore, we also observe that $P_\lambda''(\theta)=-\frac{1}{a}$ for all $\theta\in(0,a\lambda)$ and $P_\lambda''(\theta)=0$ for all $\theta>a\lambda$. Therefore, $a$ determines the curvature of the FCP near the origin.
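The derivative properties described above can be checked numerically. The sketch below assumes the standard MCP form of Zhang (2010), $P_\lambda(\theta)=\int_0^\theta [a\lambda-t]_+/a\,dt$, whose closed form is $\lambda\theta-\theta^2/(2a)$ for $\theta\le a\lambda$ and $a\lambda^2/2$ otherwise. The function names and the values of `lam` and `a` are illustrative choices, not taken from the paper.

```python
def mcp(theta: float, lam: float, a: float) -> float:
    """MCP value for theta >= 0: the integral of [a*lam - t]_+ / a over [0, theta]."""
    if theta < 0:
        raise ValueError("theta must be nonnegative")
    if theta <= a * lam:
        return lam * theta - theta * theta / (2.0 * a)
    return a * lam * lam / 2.0  # constant beyond the threshold a*lam


def mcp_derivative(theta: float, lam: float, a: float) -> float:
    """First derivative [a*lam - theta]_+ / a: starts at lam, reaches 0 at a*lam."""
    return max(0.0, a * lam - theta) / a


lam, a = 0.5, 3.0          # illustrative tuning parameters
threshold = a * lam        # the penalty is flat beyond this point

print(mcp_derivative(0.0, lam, a))        # equals lam
print(mcp_derivative(threshold, lam, a))  # equals 0
print(mcp(threshold, lam, a), mcp(2 * threshold, lam, a))  # same constant value
```

The last line illustrates why the penalty is (locally) ineffective on large parameters: the penalty value no longer changes once $\theta$ passes $a\lambda$.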
Alternative sparsity-inducing penalties, such as the smoothly clipped absolute deviation (SCAD) introduced by Fan and Li (2001), the least absolute shrinkage and selection operator (Lasso) proposed by Tibshirani (2011), and the bridge penalty (a.k.a. the $\ell_q$ penalty with $0<q<1$) as discussed by Frank and Friedman (1993), have all been shown to be very effective in HDSL by many results due to Fan and Li (2001), Bickel et al. (2009), Fan and Lv (2011), Fan et al. (2014), Loh and Wainwright (2015), Raskutti et al. (2011), Negahban et al. (2012), Wang et al. (2013, 2014), Zhang and Zhang (2012), Zou (2006), Zou and Li (2008), Liu et al. (2017, 2018) and Loh (2017), to name only a few. Many of those results provide oracle inequalities, each of which "relates the performance of a real estimator with that of an ideal estimator" (Candes 2006).
Ndiaye et al. (2017), Ghaoui et al. (2010), Fan and Li (2001), Chen et al. (2010), and Liu et al. (2017) have presented thresholding rules and bounds on the number of nonzero dimensions for a high-dimensional linear regression problem with different penalty functions.
Despite the availability of several analytical frameworks for HDSL in the current literature, most existing HDSL theories require the two assumptions below, which are sometimes overly critical, to guarantee any generalization performance:
(A) The satisfaction of the (conventional) sparsity condition, written as $\|\beta^*\|_0\ll p$, where $\|\cdot\|_0$ denotes the number of nonzero entries of a vector.
(B) The satisfaction of regularity conditions on the eigenvalues of the Hessian matrix of $L(\cdot,Z)$ in the form of the restricted strong convexity (RSC) (Negahban et al. 2012), the restricted isotropic property (RIP) (Candes and Tao 2007), or the restricted eigenvalue (RE) condition (Bickel et al. 2009).
The sparsity assumption essentially means that few dimensions "matter" even though the total number of dimensions is very high. Meanwhile, the RSC, RIP, and RE can all be interpreted as the stipulation that $\mathcal{L}_n(\cdot,Z_1^n)$ is strongly convex everywhere in some subset of $\Re^p$. The RSC is implied by the RE and RIP for some choices of parameters (Negahban et al. 2012, van de Geer et al. 2009). Except for some special cases of the generalized linear models (as discussed by, e.g., Bickel et al. 2009), when both (A) and (B) above are violated, little is known about the generalization performance of (3) or that of most other HDSL schemes. Negahban et al. (2012) have considered HDSL under weak sparsity, but the RSC is still assumed for establishing the generalization error bounds.
In contrast to the literature, this paper is concerned with the effectiveness of (3) in addressing the HDSL problems when the RSC is completely absent and the traditional sparsity is relaxed into the approximate sparsity (A-sparsity) as below.
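Assumption 1 (A-sparsity). $\mathbb{L}(\beta^*_{\varepsilon_A})-\inf_{\beta}\mathbb{L}(\beta)\le\varepsilon_A$ and $s:=\|\beta^*_{\varepsilon_A}\|_0\ll p$ for some $\varepsilon_A\ge 0$, $\beta^*_{\varepsilon_A}:\|\beta^*_{\varepsilon_A}\|_\infty\le R$, and $R\ge 1$.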
Intuitively, Assumption 1 means that, although $\beta^*$ can be dense, replacing most of the nonzero entries of $\beta^*$ by zero does not cause the population risk to increase too much. It is evident that, if $\varepsilon_A=0$, Assumption 1 is reduced to the (traditional) sparsity.
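A small numerical illustration of this intuition follows. It is a sketch under a simplifying assumption that is not made in the paper: squared loss with isotropic features, in which case the excess population risk of any $\beta$ equals $\|\beta-\beta^*\|_2^2$. The dense vector with entries $1/j^2$, the helper names, and the values of `p` and `s` are hypothetical.

```python
def truncate_to_top_s(beta, s):
    """Keep the s largest-magnitude entries of beta and set the rest to zero."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:s])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]


def excess_risk_isotropic(beta, beta_star):
    """Excess population risk under squared loss with identity feature covariance."""
    return sum((b - bs) ** 2 for b, bs in zip(beta, beta_star))


p, s = 1000, 10
beta_star = [1.0 / (j + 1) ** 2 for j in range(p)]  # dense: every entry is nonzero
beta_sparse = truncate_to_top_s(beta_star, s)       # candidate A-sparse surrogate

eps_A = excess_risk_isotropic(beta_sparse, beta_star)
nonzeros = sum(1 for b in beta_sparse if b != 0.0)
R = max(abs(b) for b in beta_sparse)                # sup-norm of the surrogate

print(f"nonzeros kept: {nonzeros} of {p}, eps_A = {eps_A:.2e}, sup-norm = {R}")
```

Here all 1000 entries of the target are nonzero, yet a vector with only 10 nonzero entries is within a small excess risk of it, which is the situation A-sparsity is meant to capture.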
In certain applications of HDSL (e.g., the deep neural networks to be discussed subsequently), it is more convenient to consider a (slight) generalization to Assumption 1 in the following.
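Assumption (generalized A-sparsity). $\mathbb{L}(\beta^*_{\varepsilon_A})-\mathbb{L}_g^*\le\varepsilon_A$ and $s:=\|\beta^*_{\varepsilon_A}\|_0\ll p$ for some $\varepsilon_A\ge 0$, $\beta^*_{\varepsilon_A}:\|\beta^*_{\varepsilon_A}\|_\infty\le R$, $\mathbb{L}_g^*\le\inf_{\beta}\mathbb{L}(\beta)$, and $R\ge 1$.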
Apparently, the generalized form is more general than Assumption 1, and the two are equivalent when $\mathbb{L}_g^*=\inf_{\beta}\mathbb{L}(\beta)$. Hereafter, both forms are referred to as A-sparsity when there is no ambiguity. Without loss of generality, we let $s>1$ throughout this paper.
The assumption of $\|\beta^*_{\varepsilon_A}\|_\infty\le R$ is non-critical. It is comparable to, if not less restrictive than, some common assumptions in the literature. For example, in addressing HDSL under (the conventional) sparsity, Loh (2017) and Loh and Wainwright (2015) both assume the estimator and the vector of true parameters to be contained within a convex and bounded set $\{\beta:|\beta|\le R_{\ell_1}\}$ for some $R_{\ell_1}>0$. Verifiably, under their assumptions, $\|\beta^*_{\varepsilon_A}\|_\infty\le R$ holds with some $R\le R_{\ell_1}$. Furthermore, we later show that our generalization error bounds depend only logarithmically on $R$. Thus, the choice of $R$ is flexible in practice; we only need a coarse estimate of an upper bound on $\|\beta^*_{\varepsilon_A}\|_\infty$. Even if $R$ overestimates $\|\beta^*_{\varepsilon_A}\|_\infty$ too much, the performance of the proposed scheme would probably not be impacted significantly.
We believe that the flexibility of A-sparsity and the relaxation of the RSC can allow the HDSL theories to cover a more comprehensive class of applications. Indeed, as we are to articulate later, our results on HDSL under A-sparsity can facilitate the comprehension of two important classes of problems whose theoretical underpinnings are currently lacking from the literature: (i) A high-dimensional nonsmooth learning problem (nonsmooth HDSL), that is, an HDSL problem with a nonsmooth empirical risk function, and (ii) a (deep and over-parameterized) neural network (NN) model.
More general forms of sparsity, such as the weak sparsity assumption (Negahban et al. 2012), have been discussed previously. However, to our knowledge, the only existing discussions on simultaneously relaxing both the sparsity and the RSC assumptions are due to Liu et al. (2018). Their results imply that the excess risk of an estimator $\beta\in\Re^p$ generated as a certain stationary point to the formulation (3) can be bounded by $O\left(\frac{\ln p}{n^{1/4}}\cdot(1+\varepsilon_A)+\varepsilon_A\right)$. This bound is reduced to $O\left(\frac{\ln p}{n^{1/4}}\right)$ when $\varepsilon_A=0$. Our findings in the current paper strengthen these previous results. More specifically, we relax the subgaussian assumption stipulated by Liu et al. (2018) and impose the weaker, subexponential, condition instead. In addition, the assumption of twice-differentiability made by Liu et al. (2018) is also weakened. In the more general settings, we further show that sharper error bounds can be achieved at a stationary point that (a) satisfies a set of significant subspace second-order necessary conditions (S3ONC) to be formalized subsequently, and (b) has an objective function value no worse than that of the solution to the Lasso problem, formulated below:
$$\min_{\beta\in\Re^p}\ \mathcal{L}_n(\beta,Z_1^n)+\lambda|\beta|. \qquad (5)$$
We are to discuss some S3ONC-guaranteeing algorithms to meet the first requirement soon afterwards. To meet the second requirement, we may always initialize the S3ONC-guaranteeing algorithm with a solution to (5), which is often polynomial-time solvable if Ln(⋅,Z1n) is convex.
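The two-stage idea (solve the Lasso first, then continue on the FCP-regularized objective from that point) can be sketched as follows. This is not the paper's S3ONC-guaranteeing algorithm and it does not certify the S3ONC: it is a generic proximal-gradient loop, swapped in for illustration, on a least-squares loss with the MCP in its standard form and the Lasso taken as empirical risk plus $\lambda$ times the 1-norm. All function names, the synthetic data, and the parameter values are hypothetical. The sketch only demonstrates requirement (b): the final FCP objective value is no worse than that at the Lasso solution.

```python
import random


def soft_threshold(z, t):
    """Proximal map of t * |.| (the Lasso shrinkage step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def mcp(theta, lam, a):
    """MCP value for theta >= 0 (standard form, assumed here)."""
    if theta <= a * lam:
        return lam * theta - theta * theta / (2.0 * a)
    return a * lam * lam / 2.0


def mcp_prox(z, step, lam, a):
    """Proximal map of step * MCP (firm thresholding); requires a > step."""
    size = abs(z)
    if size <= step * lam:
        return 0.0
    if size <= a * lam:
        sign = 1.0 if z > 0 else -1.0
        return sign * (size - step * lam) / (1.0 - step / a)
    return z


def loss_and_grad(X, y, beta):
    """Least-squares empirical risk (1/2n)||X beta - y||^2 and its gradient."""
    n, p = len(X), len(beta)
    resid = [sum(X[i][j] * beta[j] for j in range(p)) - y[i] for i in range(n)]
    loss = sum(r * r for r in resid) / (2.0 * n)
    grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
    return loss, grad


def fcp_objective(X, y, beta, lam, a):
    loss, _ = loss_and_grad(X, y, beta)
    return loss + sum(mcp(abs(b), lam, a) for b in beta)


def prox_gradient(X, y, beta0, step, prox, iters):
    beta = list(beta0)
    for _ in range(iters):
        _, grad = loss_and_grad(X, y, beta)
        beta = [prox(b - step * g) for b, g in zip(beta, grad)]
    return beta


# Synthetic data with p > n (hypothetical, for illustration only).
rng = random.Random(0)
n, p, lam, a = 20, 50, 0.2, 3.0
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_true = [2.0, -1.5, 1.0] + [0.0] * (p - 3)
y = [sum(X[i][j] * beta_true[j] for j in range(p)) + 0.1 * rng.gauss(0.0, 1.0)
     for i in range(n)]

# trace(X^T X)/n upper-bounds the gradient's Lipschitz constant, so this step is safe.
step = n / sum(v * v for row in X for v in row)
assert a > step  # keeps each MCP proximal subproblem strongly convex

beta_lasso = prox_gradient(X, y, [0.0] * p, step,
                           lambda z: soft_threshold(z, step * lam), 300)
beta_fcp = prox_gradient(X, y, beta_lasso, step,
                         lambda z: mcp_prox(z, step, lam, a), 300)

obj_lasso = fcp_objective(X, y, beta_lasso, lam, a)
obj_fcp = fcp_objective(X, y, beta_fcp, lam, a)
print(obj_lasso, obj_fcp)
```

With a step size no larger than the reciprocal of the gradient's Lipschitz constant, each proximal-gradient step cannot increase the FCP objective, so starting from the Lasso solution preserves the "no worse than Lasso" property throughout the second stage.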
Our new bounds on those S3ONC solutions are summarized below. First, in the case where $\varepsilon_A=0$, we can bound the excess risk by $O\left(\frac{\ln p}{n^{2/3}}+\frac{\ln p}{n^{1/3}}\right)$, which is better than the aforementioned result by Liu et al. (2018) in terms of the dependence on $n$. Second, when $\varepsilon_A$ is nonzero, the excess risk is then bounded by
[TABLE]
Third, if we further relax the requirement above and consider an arbitrary S3ONC solution, then the excess risk becomes
[TABLE]
where Γ≥0 is (an underestimation of) the suboptimality gap that this S3ONC solution incurs in minimizing Ln,λ(⋅,Z1n) (as defined in (3)).
Admittedly, our excess risk bounds are less appealing than the generalizability results made available in some important previous works by Loh (2017), Raskutti et al. (2011), and Negahban et al. (2012), etc., under the assumption of the RSC. Nonetheless, we argue that our results are established under a more general set of conditions and can complement the existing results in the HDSL problems beyond the RSC. It is also worth noting that (7) is parameterized by $\Gamma$, which in general can only be explicitly controlled when $\mathcal{L}_n(\cdot,Z_1^n)$ is convex. Even so, in some interesting special cases, one may still control $\Gamma$ despite the absence of convexity. One such example is presented in this paper as we discuss the theoretical applications of HDSL under A-sparsity to the NNs in Sections 6 and 9.
The S3ONC is a necessary condition for local minimality. Compared to the second-order KKT conditions, the S3ONC is weaker and potentially easier to compute.
A solution that satisfies the S3ONC can be generated by pseudo-polynomial-time algorithms, such as the variants of Newton's method proposed by Haeser et al. (2017), Bian et al. (2015), Ye (1992, 1998) and Nesterov and Polyak (2006). All those algorithms provably ensure a $\gamma_{opt}$-approximation (with a user-specified error tolerance $\gamma_{opt}>0$) to the second-order KKT conditions at the best-known iteration complexity of $O(1/\gamma_{opt}^3)$. The second-order KKT conditions then imply the S3ONC. To add to the current solution schemes, we derive a new gradient-based method that provably guarantees the S3ONC. In contrast to the literature, the iteration complexity of this new algorithm is $O(1/\gamma_{opt}^2)$, which improves upon the existing alternatives. Due to its gradient-based nature, the proposed algorithm does not access the Hessian matrix or its inverse. Therefore, we think that this gradient-based algorithm may be of some independent interest.
1.1 Some theoretical applications
As mentioned, our results on HDSL under A-sparsity can be employed in the analysis of two important classes of statistical and machine learning models: (a) nonsmooth HDSL, and (b) deep NNs. Some additional details are provided below.
1.1.1 Nonsmooth HDSL.
Although several special cases of HDSL with nonsmoothness, such as high-dimensional least absolute regression, high-dimensional quantile regression, and high-dimensional support vector machine (SVM) have been discussed by Wang (2013), Belloni and Chernozhukov (2011), Zhang et al. (2016b, c) and Peng et al. (2016), there exist few theories that apply to scenarios without an everywhere differentiable loss function in general, especially when non-differentiability may occur at, or in a near neighborhood of, the vector of true parameters.
In contrast, our theories on HDSL under A-sparsity can be utilized to understand the generalization performance of a flexible set of nonsmooth HDSL problems. Indeed, their nonsmooth statistical loss functions can be approximated by another formulation that preserves the continuous differentiability, and the resulting approximation error can then be handled through the notion of A-sparsity. Analyzing this approximation leads to the following bound on the excess risk at an S3ONC solution when the vector of true parameters is A-sparse in the sense of Definition 1:
[TABLE]
In particular, under the conventional sparsity assumption (that is, when $\varepsilon_A=0$), the rate above becomes $O\left(\frac{\ln p}{n^{3/4}}+\frac{\ln p}{n^{1/4}}\right)$.
To our knowledge, this is perhaps the first generic theory for the high-dimensional M-estimation problems in which the empirical risk function may not be everywhere differentiable.
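The smoothing idea behind this subsection can be illustrated with a simple surrogate. The sketch below does not reproduce the paper's construction, which is not given in this section; it assumes a Huber-type smoothing of the absolute loss as a stand-in. The smoothed loss is continuously differentiable and differs from the original by at most $\mu/2$ everywhere, and a uniform gap of this kind is what can then be absorbed into an A-sparsity parameter. The function names and the value of `mu` are hypothetical.

```python
def abs_loss(r: float) -> float:
    """Nonsmooth absolute loss, as used in least absolute regression."""
    return abs(r)


def huber_smoothed(r: float, mu: float) -> float:
    """Continuously differentiable surrogate: quadratic near 0, linear elsewhere."""
    if abs(r) <= mu:
        return r * r / (2.0 * mu)
    return abs(r) - mu / 2.0


mu = 0.1  # smoothing parameter (illustrative)
grid = [k / 1000.0 for k in range(-3000, 3001)]  # residuals in [-3, 3]
gaps = [abs_loss(r) - huber_smoothed(r, mu) for r in grid]

print(f"largest gap on the grid: {max(gaps):.4f} (bound mu/2 = {mu / 2:.4f})")
```

Shrinking `mu` tightens the approximation at the price of a larger curvature $1/\mu$ in the quadratic region, which is the usual trade-off in smoothing a nonsmooth loss.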
1.1.2 Regularized neural network.
The NNs have been frequently discussed and widely applied in recent literature (Schmidhuber 2015, LeCun et al. 2015, Yarotsky 2017). Despite the frequent and exciting advancements in the NN-related algorithms, models, and applications, the development of their theoretical underpinnings is seemingly lagging behind. DeVore et al. (1989), Yarotsky (2017), Mhaskar and Poggio (2016), and Mhaskar (1996), etc., have explicated the expressive power of the NNs in the approximation of different types of functions. As for the generalizability of NNs, one of the focuses of this paper, effective theoretical frameworks have been discussed by Cao and Gu (2019), Li and Liang (2018), Brutzkus et al. (2017), Allen-Zhu et al. (2019), Wang et al. (2019b), Daniely (2017), Neyshabur et al. (2015), Bartlett et al. (2017), Hardt et al. (2015), Zhang et al. (2016a), Li et al. (2018), Jakubovitz et al. (2019), among others. However, for the vast majority of the existing results on the deep NNs, the generalization error bounds grow polynomially in the dimensionality (which is equal to the number of fitting parameters and is also called the network size) and sometimes even increase exponentially in the depth of the network. Such a high sensitivity to dimensionality and depth is inconsistent with the empirical performance of the NNs in many practical applications, where over-parameterization and deep architectures are common and often preferred by practitioners.
In contrast, we analyze the NNs through the lens of HDSL under A-sparsity and consider an FCP-regularized NN training formulation as a special case of (3) in binary classification. Our results indicate that the NN's generalization errors at local solutions can be both poly-logarithmic in the number of fitting parameters and polynomial in the network depth. Thus, we think that the results herein can facilitate understanding the powerful performance of the NNs in practice, especially for the over-parameterized and deep models. Barron and Klusowski (2018) have shown the existence of fitting parameters for an NN with ramp activation functions to achieve the poly-logarithmic sample complexity. Compared with Barron and Klusowski (2018), our analysis may present better flexibility in the choice of activation functions and provide more insights towards the computability of the desired fitting parameters in training a deep NN to ensure the proven error bounds.
More specifically, we show that the generalization error incurred by an S3ONC solution to the FCP-regularized training formulation of an NN is bounded by
[TABLE]
for any fixed sA:1≤sA≤p,
with overwhelming probability. Here, D is the number of NN layers, Γ≥0 is the suboptimality gap incurred by the S3ONC solution of consideration, and Ω(p′), for any p′:1≤p′≤p, is the architecture-dependent representability gap (a.k.a., the model misspecification error or the expressive power) of an NN with p′-many nonzero fitting parameters.
By (9) above, the generalization error of an NN consists of four terms: (i) a generalization error term of the order $O(n^{-1/3}+n^{-1/2}D\ln p)$; (ii) the suboptimality gap; (iii) a term that measures the NN's representability; and (iv) a term that depends on the suboptimality gap, the sample size, and the representability simultaneously. It is worth noting that (9) is obtained with little restriction on the NN architecture and the data generation process. Combining (9) with the existing results on the representability analysis of NNs, we further derive more explicit generalization error bounds. For example, we show that the error yielded by an NN with smooth activation functions can be bounded by $O\left(\frac{D\cdot\ln p}{n^{1/3}}+\frac{\Gamma}{n^{1/3}}+\Gamma\right)$, when we assume that data from different categories are separable by a polynomial function (as well as a couple of other conditions on the NN architecture).
The error bound in (9) depends on $\Gamma$, the suboptimality gap. Explicitly bounding its value is challenging in general because of the nonconvexity of an NN's training formulation. Nonetheless, we show that some pseudo-polynomial-time computable solutions generated with the aid of an efficient initialization provably ensure the explicit control of $\Gamma$ in the same settings considered by Cao and Gu (2020). In such a case, the generalization error is further explicated into
[TABLE]
which becomes independent of Γ. In achieving this result, our settings seem more general than Wang et al. (2019a), and our rates on both D and p are perhaps more appealing than most of the existing results. In particular, Wang et al. (2019a) focus on ReLU-NNs (that is, the NNs where the activation functions are ReLU, as discussed by Glorot et al. 2011) with one hidden layer, but our approach can handle deep NNs under more general hyper-parameters. For deep and wide NNs, Cao and Gu (2020) have established generalization error bounds, which, however, increase exponentially in the number of layers in the same settings of our discussion. In contrast, our bound is both poly-logarithmic in dimensionality and polynomial in the number of layers. The computational complexity of training an NN with the claimed error bound is in pseudo-polynomial time.
In obtaining our results, we do not artificially impose any condition on sparsity or alike. As we articulate in Section 6.2, our findings are based on the observation that the A-sparsity (as in Assumption 1) is an intrinsic property implied by the NN’s expressive power.
1.2 Summary of results
Table 1 summarizes the sample complexity results proven in this paper. In contrast to the literature, we claim that our results could lead to the following contributions:
1. We provide the first HDSL theory for problems where the three conditions (twice-differentiability, the RSC or alike, and sparsity) are simultaneously relaxed. In these more general settings, we show that HDSL is still possible even if the sample size is only poly-logarithmic in the dimensionality. In Table 1, the results are presented in the rows for "HDSL under A-sparsity".
2. We have derived a pseudo-polynomial-time gradient-based method to compute an S3ONC solution. Even though the S3ONC is a set of second-order necessary conditions, the proposed algorithm does not need to access the Hessian matrix. Furthermore, the iteration complexity of the proposed method is provably $O(1/\gamma_{opt}^2)$ in achieving a $\gamma_{opt}$-approximation to the S3ONC, which is sharper than the more generic algorithms such as the variations of Newton's method.
3. As theoretical applications of our error bounds for HDSL under A-sparsity, we derive generalizability results for nonsmooth HDSL problems and deep NNs. More specifically, for a flexible class of high-dimensional nonsmooth M-estimation problems, we prove perhaps the first poly-logarithmic sample complexity bound without the RSC assumption. The corresponding result is summarized in Table 1 in the rows for "Nonsmooth HDSL under A-sparsity". As for the NNs, our sample requirement is only poly-logarithmic in the network size and polynomial in the number of layers, providing theoretical underpinnings for the generalizability of an NN under over-parameterization. These results are summarized in the rows for "Neural Network" of Table 1.
1.3 Organization of the paper
The rest of the paper is organized as below:
Section 2 summarizes the settings and assumptions.
Section 3 introduces the S3ONC. Section 4 states our main results concerning HDSL under A-sparsity. A pseudo-polynomial-time solution scheme that guarantees the S3ONC is discussed in Section 5. Section 6 discusses the theoretical applications to nonsmooth HDSL and the regularized (deep) NNs. Some numerical experiments are presented in Section 7. Sections 9 and 10 of the electronic companion, respectively, present some additional theoretical results on the NN and supplementary numerical results on both the SVM and the NN. Section 8 concludes the paper.
Our notations are summarized below. We use $p$ and $n$ to represent the number of dimensions (fitting parameters) and the sample size, respectively. We let $\|\cdot\|_p$ ($1\le p\le\infty$) be the $p$-norm, except that the 1- and 2-norms are denoted by $|\cdot|$ and $\|\cdot\|$, respectively. When there is no ambiguity, we also denote by $|\cdot|$ the cardinality of a set, if the argument is a finite set. Let $\|\cdot\|_F$ of a matrix be its Frobenius norm and let $\|\cdot\|_0$ of a vector be the number of its nonzero entries. For a random vector $v=(v_j)\in\Re^p$, we write $\|v\|_\infty\le R$ if $\mathbb{P}[|v_j|\le R,\ \forall j=1,\dots,p]=1$. For a random variable $X$, its subexponential and subgaussian norms are denoted by $\|X\|_{\psi_1}$ and $\|X\|_{\psi_2}$, respectively. We also let $\|A\|_{1,2}:=\max_{x\in\Re^{m_1},\,u\in\Re^{m_2}}\{u^\top Ax:\ \|x\|_1=1,\ \|u\|_2=1\}$ for integers $m_1,m_2$ and a matrix $A\in\Re^{m_2\times m_1}$.
For a function $f$, denote by $\nabla f$ its gradient, whenever it exists. For a vector $\beta=(\beta_j)\in\Re^p$ and a set $S\subset\{1,\dots,p\}$, let $\beta_S=(\beta_j: j\in S)$ be a sub-vector of $\beta$. For any vector $v=(v_j)$, the notation $\mathrm{diag}(v)$ represents the diagonal matrix whose $j$th diagonal entry is $v_j$. We denote by $\mathrm{vec}(M_1,M_2,\dots,M_m)$ the vector that collects all the entries of the matrices $M_1$, $M_2$, …, $M_m$. The vector $e_j$ is the $j$th standard basis vector. $\lceil x\rceil$ (or $\lfloor x\rfloor$) for any $x\ge 0$ is the smallest (or largest) integer that is greater (or smaller, respectively) than or equal to $x$. Finally, we denote by $O(\cdot)$ and $\tilde O(\cdot)$, respectively, the complexity rates that hide (potentially different) universal constants and quantities at most logarithmically dependent on "$\cdot$".
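The mixed norm $\|A\|_{1,2}$ defined in the notation above has a simple closed form: maximizing $u^\top Ax$ over the unit 2-norm sphere in $u$ gives $\|Ax\|_2$, and a convex function of $x$ over the unit 1-norm ball attains its maximum at a signed standard basis vector, so the norm equals the largest Euclidean norm among the columns of $A$. The sketch below computes it that way; the function name and the example matrix are illustrative.

```python
import math


def norm_1_2(A):
    """Largest Euclidean column norm, which equals max u^T A x over |x|_1 = 1, |u|_2 = 1."""
    rows, cols = len(A), len(A[0])
    return max(math.sqrt(sum(A[i][j] ** 2 for i in range(rows))) for j in range(cols))


A = [[3.0, 0.0, 1.0],
     [4.0, 1.0, 1.0]]  # a 2 x 3 example, so m2 = 2 and m1 = 3

value = norm_1_2(A)

# A maximizing pair: x is the first basis vector, u is the normalized first column.
x = [1.0, 0.0, 0.0]
u = [3.0 / 5.0, 4.0 / 5.0]
attained = sum(u[i] * A[i][j] * x[j] for i in range(2) for j in range(3))

print(value, attained)
```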
2 Settings and assumptions
In this section, we summarize our assumptions in addition to the aforementioned settings. We assume that the gradient $\nabla L(\beta,z):=\left(\frac{\partial L(\beta,z)}{\partial\beta_j}: j=1,\dots,p\right)$ of $L(\beta,z)$ w.r.t. $\beta$ is well-defined for all $\beta\in\Re^p$ and almost every $z\in W$. Furthermore, we also suppose that $\frac{\partial L(\beta,z)}{\partial\beta_j}$ is Lipschitz continuous for all $\beta\in\Re^p$; that is, there exists a scalar $U_L>0$ such that
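$$\left|\left[\frac{\partial L(\beta,z)}{\partial\beta_j}\right]_{\beta=\beta+\delta\cdot e_j}-\left[\frac{\partial L(\beta,z)}{\partial\beta_j}\right]_{\beta=\beta}\right|\le U_L\cdot|\delta|, \qquad (11)$$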
for almost every z∈W and for all β∈ℜp, δ∈ℜ, j=1,...,p. These regularities are to be relaxed when we later discuss the nonsmooth HDSL problems and the ReLU-NNs. Apart from the above, two additional assumptions are imposed.
Assumption 2
For all β∈ℜp:∥β∥∞≤R and i=1,...,n, it holds that E[L(β,Zi)] is finite-valued and L(β,Zi)−E[L(β,Zi)] follows a subexponential distribution; that is,
∥L(β,Zi)−E[L(β,Zi)]∥ψ1≤σ,
for some σ≥1.
Remark 2.1
As an implication of Assumption 2, for all β∈ℜp:∥β∥∞≤R, (combined with the assumption that Zi, i=1,...,n, are i.i.d.) a well-known Bernstein-like inequality holds as below:
[TABLE]
for some absolute constant c∈(0,0.5]. Interested readers are referred to Vershynin (2012) for more detailed discussions on the subexponential distributions.
Assumption 3
For some measurable and deterministic function C:W→ℜ+, the random variable C(Zi) satisfies that
∥C(Zi)−E[C(Zi)]∥ψ1≤σL,
for all i=1,...,n, for some σL≥1. Furthermore,
∣L(β1,z)−L(β2,z)∣≤C(z)∥β1−β2∥,
for all β1,β2∈ℜp∩{β:∥β∥∞≤R} and almost every z∈W.
Hereafter, we let E[C(Zi)]≤Cμ for all i=1,...,n for some Cμ≥1.
Remark 2.2
Assumptions 2 and 3 are general enough to cover a wide spectrum of M-estimation problems. More specifically, Assumption 2 requires that the underlying distribution is sub-exponential, and Assumption 3 essentially imposes the Lipschitz(-like) continuity on Ln(⋅,Z1n). Examples of sub-exponential distributions include uniform, Gaussian, exponential, and χ² distributions, as well as any distribution that has a bounded support set.
As for the Lipschitz continuity, it is a condition satisfied by many statistical learning problems, such as linear regression, Huber regression, SVM, and NNs. We will show that the generalization error bounds only grow logarithmically in the Lipschitz constant. The combination of our assumptions is non-trivially weaker than the settings in Liu et al. (2017, 2018).
It is also worth mentioning that the stipulations of σ≥1, Cμ≥1, and σL≥1 can be easily relaxed and are needed only for notational simplicity in presenting our results.
Because the FCP is nonconvex, so is Eq. (3). Thus, computing the global solution to (3) is intractable. Nonetheless, our theories concern only local stationary points. We show that these local solutions are good enough to ensure the promised statistical performance.
In particular, we consider the stationary points that are characterized by the satisfaction of the significant subspace second-order necessary conditions (S3ONC), which are closely similar to the necessary conditions discussed by Chen et al. (2010) for linear regression with bridge regularization and by Liu et al. (2017, 2018) under the assumption that the empirical risk function is everywhere twice differentiable. This paper generalizes the characterizations of the S3ONC to scenarios where the twice-differentiability may not hold everywhere.
Definition 3.1
Given Z1n∈Wn, a vector β∈ℜp is said to satisfy the S3ONC (denoted by S3ONC(Z1n)) of Problem (3) if both of the following sets of conditions are satisfied:
a.
The first-order KKT conditions are met at β:=(βj); that is, there exists ϰj∈∂(∣βj∣), for all j=1,...,p, such that
[TABLE]
where ∇Ln(β,Z1n) is the gradient of Ln(⋅,Z1n) as defined in (2), ∂(∣βj∣) is the subdifferential of ∣⋅∣ at βj, and Pλ′(⋅) is the first derivative of Pλ(⋅).
b.
The following inequality holds at β: for all j=1,...,p, if ∣βj∣∈(0,aλ), then
$$U_L+P_\lambda''(|\beta_j|)\ge0, \qquad (14)$$
where Pλ′′ is the second derivative of Pλ(⋅), the quantity UL is defined as in (11), and a and λ are (hyper-)parameters of the FCP as in (4).
It is worth noting that the S3ONC is verifiably implied by the conventional second-order KKT conditions when they are well-defined. We show in Section 5 that an S3ONC solution (i.e., a solution that satisfies the S3ONC) can be computed by the proposed gradient-based method at pseudo-polynomial-time complexity.
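To make Definition 3.1 concrete, the following is a minimal sketch of an S3ONC check for a candidate $\beta$. It assumes, and this is our assumption because Eq. (4) is not restated in this section, that the FCP takes the minimax-concave form with $P_\lambda'(t)=\max\{a\lambda-t,0\}/a$ and $P_\lambda''(t)=-1/a$ on $(0,a\lambda)$. The function names, the argument `grad` (the gradient of the empirical risk at $\beta$), and the tolerance `tol` are illustrative.

```python
import numpy as np

def fcp_derivative(t, lam, a):
    """P'_lambda(t) for t >= 0 under the assumed MCP form of the FCP."""
    return np.maximum(a * lam - t, 0.0) / a

def satisfies_s3onc(beta, grad, lam, a, U_L, tol=1e-8):
    """Check the two sets of S3ONC conditions at beta (assumed MCP penalty)."""
    beta = np.asarray(beta, dtype=float)
    grad = np.asarray(grad, dtype=float)
    abs_b = np.abs(beta)
    dP = fcp_derivative(abs_b, lam, a)
    nonzero = abs_b > 0
    # (a) First-order KKT: grad_j + P'(|beta_j|) * kappa_j = 0, where
    # kappa_j = sign(beta_j) if beta_j != 0 and kappa_j in [-1, 1] otherwise.
    ok_nonzero = np.abs(grad + dP * np.sign(beta)) <= tol
    ok_zero = np.abs(grad) <= lam + tol  # P'(0) = lam
    first_order = bool(np.all(np.where(nonzero, ok_nonzero, ok_zero)))
    # (b) For |beta_j| in (0, a*lam): U_L + P''(|beta_j|) = U_L - 1/a >= 0.
    in_concave_part = nonzero & (abs_b < a * lam)
    second_order = (not np.any(in_concave_part)) or (U_L - 1.0 / a >= 0)
    return bool(first_order and second_order)
```

Under the stipulation that $a$ is smaller than $1/U_L$, condition (b) can hold only if no entry of $\beta$ has magnitude strictly between $0$ and $a\lambda$, which is why the check rejects any such entry in that regime.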
4 Statistical performance bounds
This section presents the promised sample complexity results for a generic HDSL problem under A-sparsity. More specifically, Proposition 1 shows the most general result of this paper. In that proposition, a hyper-parameter ϱ is left to be determined in different special cases. One of those cases is then presented in Theorem 4.9. For convenience, we adopt a short-hand notation as follows: ζ:=ln(3eR⋅(σL+Cμ)).
Proposition 1
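Suppose that Assumptions 1, 2, and 3 hold. For any $\varrho:\ 0<\varrho<\frac12$ and the same $c$ in (12), let $a<\frac{1}{U_L}$ and $\lambda:=\sqrt{\frac{8\sigma}{c\cdot a\cdot n^{2\varrho}}\left[\ln(n^{\varrho}p)+\zeta\right]}$. Consider any random vector $\beta\in\Re^p$ such that $\|\beta\|_\infty\le R$ and the S3ONC($Z_1^n$) to (3) is satisfied at $\beta$ almost surely. The following statements hold: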
(i)
For any fixed Γ≥0 and some universal constant C1>0, if
[TABLE]
and Ln,λ(β,Z1n)≤Ln,λ(βεA∗,Z1n)+Γ almost surely, then
[TABLE]
with probability at least
$1-2(p+1)\exp(-n/C_1)-6\exp(-2cn^{4\varrho-1})$, where $L$ is defined in Eq. (1) and $L_g^*$ is defined in Assumption 1.
(ii)
For almost every Z1n∈Wn, assume that the minimization problem in (5) admits a finite optimal solution denoted by βℓ1:=βℓ1(Z1n). For some universal constant C2>0, if
[TABLE]
and Ln,λ(β,Z1n)≤Ln,λ(βℓ1,Z1n) almost surely, then
[TABLE]
with probability at least
$1-2(p+1)\exp(-n/C_2)-6\exp(-2cn^{4\varrho-1})$.
Proposition 1 is the most general result in this paper. It does not rely on convexity, RSC, or the like, although ensuring Ln,λ(β,Z1n)≤Ln,λ(βℓ1,Z1n) almost surely in Part (ii) usually requires Ln,λ(⋅,Z1n) to be convex.
Remark 4.3
The assumption that ∥β∥∞≤R is comparable to, or less restrictive than, some similar conditions in the literature. For example, Loh (2017) and Loh and Wainwright (2015) require that the estimator is within the set {β:∣β∣≤Rℓ1}. Under the same requirement, we may have Rℓ1≥R.
Because the error bounds in (15) and (18) are logarithmic in R (with ζ:=O(lnR)), one may let R be a coarse overestimate of ∥β∥∞.
Remark 4.4
Because L(β)−infβ L(β)≤L(β)−Lg∗, the first part of this proposition indicates that, for all the S3ONC solutions, the excess risk can be bounded by a function in the parameterization of the suboptimality gap Γ. (Technically speaking, Γ is an underestimation of the suboptimality gap in this proposition.) This bound on the excess risk explicates the consistency between the statistical performance of a stationary point to an HDSL problem and the optimization quality of that stationary point in minimizing the objective function of Problem (3).
The second part of Proposition 1 concerns an arbitrary S3ONC solution β that has an objective function value smaller than that of βℓ1. The corresponding error bound becomes independent of Γ.
Remark 4.5
To compute β in Part (ii) of this proposition, we can adopt a two-step approach: In the first step, we solve for βℓ1, which is often polynomial-time computable if Ln,λ(⋅,Z1n) is convex given Z1n. Then, in the second step, we invoke an S3ONC-guaranteeing algorithm (such as the gradient-based method to be discussed in Section 5). This algorithm should be initialized with βℓ1.
Remark 4.6
We may as well let $a^{-1}=2U_L$ to satisfy the stipulation on $a$ in Proposition 1. Here, $U_L$ can be considered as the largest diagonal entry of the Hessian matrix of $L(\cdot,z)$, if it exists. In many applications of HDSL, this quantity can satisfy $U_L\le O(1)\ln p$ with high probability under data normalization. For example, in the special case of high-dimensional linear models, $U_L\le1$ is implied by the common assumption of column normalization (Raskutti et al. 2011, Negahban et al. 2012).
Remark 4.7
The proof of Proposition 1 makes use of the fact that, at the S3ONC solutions, the FCP behaves similarly to the ℓ0 penalty (as discussed by, e.g., Shen et al. (2013)). Thus, it is possible that adopting the ℓ0 penalty instead of the FCP in our formulation (3) may lead to similar results on the generalization errors with less technical difficulty. Nonetheless, the ℓ0 penalty introduces discontinuity to the formulation and thus may lead to greater computational difficulty. We leave to future research the study of the trade-offs between computational and sample complexities for the formulations with alternative regularization terms.
Remark 4.8
For any fixed ϱ: 0<ϱ<1/2, each of the two parts of Proposition 1 has already established the poly-logarithmic sample complexity. Based on this proposition, polynomially increasing the sample size can compensate for the exponential growth in the dimensionality. We may further pick a reasonable value for ϱ and obtain more detailed bounds as in Theorem 4.9 below, which confirms the promised complexity rates as previously mentioned in (6) and (7) for a general HDSL problem under A-sparsity.
Theorem 4.9
Let $a<\frac{1}{U_L}$ and $\lambda:=\sqrt{\frac{8\sigma}{c\cdot a\cdot n^{2/3}}\left[\ln(n^{2/3}p)+\zeta\right]}$ for the same $c$ in (12). Suppose that Assumptions 1, 2, and 3 hold. For any random vector $\beta\in\Re^p$ such that $\|\beta\|_\infty\le R$ and S3ONC($Z_1^n$) to (3) is satisfied at $\beta$ almost surely, the following statements hold:
(i)
For any fixed Γ≥0 and some universal constant C3>0, if
[TABLE]
and Ln,λ(β,Z1n)≤Ln,λ(βεA∗,Z1n)+Γ almost surely, then the excess risk is bounded by
[TABLE]
with probability at least
$1-2(p+1)\exp(-C_3n)-6\exp(-C_3n^{1/3})$.
(ii)
For almost every Z1n∈Wn, assume that the minimization problem in (5) admits a finite optimal solution denoted by βℓ1:=βℓ1(Z1n).
For some universal constant C4>0, if
[TABLE]
and Ln,λ(β,Z1n)≤Ln,λ(βℓ1,Z1n) almost surely, then the excess risk is bounded by
[TABLE]
with probability at least
$1-2(p+1)\exp(-C_4n)-6\exp(-C_4n^{1/3})$.
Proof.
Invoking Proposition 1 with ϱ=1/3 and noticing that Assumption 1 implies Assumption 1 with Lg∗:=infβ L(β), we obtain both parts of the desired results.
□
Theorem 4.9 ensures the desired poly-logarithmic sample complexity for HDSL under A-sparsity. Our remarks concerning Proposition 1 above also apply to Theorem 4.9, since the latter is a special case when ϱ=1/3 and Lg∗:=infβ L(β). We would like to point out that, if εA=0, then A-sparsity reduces to the conventional sparsity. In such a case, the excess risk in (22) is simplified into L(β)−infβ L(β)≤O(n2/3lnp+n1/3lnp).
5 An S3ONC-Guaranteeing Algorithm
This section presents a pseudo-polynomial-time S3ONC-guaranteeing algorithm. For convenience, we consider a slightly more abstract optimization problem than (3) as below:
[TABLE]
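where $f:\Re^p\to\Re$ is a continuously differentiable function with $\|\nabla f(\beta_1)-\nabla f(\beta_2)\|\le U_{L,2}\cdot\|\beta_1-\beta_2\|$ for some $U_{L,2}\ge1$ and all $\beta_1,\beta_2\in\Re^p$. Consequently, the partial derivative $\frac{\partial f(\beta)}{\partial\beta_j}$, for all $j=1,\dots,p$, is also globally Lipschitz continuous in the sense that $\left|\left[\frac{\partial f(\beta)}{\partial\beta_j}\right]_{\beta=\beta+\delta\cdot e_j}-\left[\frac{\partial f(\beta)}{\partial\beta_j}\right]_{\beta=\beta}\right|\le U_{L,\infty}\cdot|\delta|$ for every $\beta\in\Re^p$, any $\delta\in\Re$, and some $1\le U_{L,\infty}\le U_{L,2}$. (Note that $U_L$ in (11) becomes $U_{L,\infty}$ here.) The pseudo-code of the proposed algorithm is summarized below.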
Algorithm 1. An S3ONC-guaranteeing gradient-based algorithm
Step 1.
Fix parameters $\gamma_{opt}$, $M$, $\lambda$, and $a$ such that $a<M^{-1}$. Initialize $k=0$ and $\beta^0\in\Re^p$.
Step 2.
Compute $\beta^{k+\frac12}$ by solving the following problem
[TABLE]
Step 3.
Compute βk+1 by solving the following problem
[TABLE]
Step 4.
Algorithm terminates and outputs βk if the stopping criteria are met. Otherwise, let k:=k+1 and go to Step 2.
We design the termination criterion to be that the algorithm stops when the below is satisfied for the first time
[TABLE]
where $M>0$ and $\gamma_{opt}>0$ are specified in Step 1 of Algorithm 1. Intuitively, $M^{-1}$ can be interpreted as the step size of the algorithm, and $\gamma_{opt}$ as the error tolerance in approximating the S3ONC. At termination, the iteration count is denoted by $k^*$.
In our analysis, Algorithm 1 relies on repeatedly solving two per-iteration subproblems, (24) and (25). Subproblem (24) in Step 2 ensures that a non-trivial reduction in the objective function value can be achieved whenever the first-order KKT conditions are not met. This step is essential to the promised $O(1/\gamma_{opt}^2)$-rate of the algorithm. Meanwhile, the presence of Subproblem (25) in Step 3 leads to a solution sequence that approaches a desired S3ONC solution without affecting the convergence rate. We may formalize the above analysis to prove the theorem below on the iteration complexity of Algorithm 1 in computing an S3ONC solution.
Theorem 5.1
Suppose that $f_\lambda^*:=\inf_\beta f_\lambda(\beta)>-\infty$, $M\ge U_{L,2}$, and $a<\frac1M$. For any $\gamma_{opt}:\ 0<\gamma_{opt}<a\lambda\cdot M$, the following statements hold true:
(a)
Algorithm 1 terminates at iteration $k^*\le\left\lfloor2M\cdot\frac{f_\lambda(\beta^0)-f_\lambda^*}{\gamma_{opt}^2}\right\rfloor+1$.
(b)
At termination, βk∗=(βjk∗) is a γopt-S3ONC solution to (23); that is,
there exists ϰj∈∂(∣βjk∗∣), for all j=1,...,p, such that
[TABLE]
and, for all j=1,...,p, if ∣βjk∗∣∈(0,aλ), then
UL,∞+Pλ′′(∣βjk∗∣)≥0,
where a and λ are defined in (4).
(c)
fλ(βk∗)≤fλ(β0).
(d)
$\beta_j^k\notin(0,a\lambda)$ for all $k=1,\dots,k^*$, where $\beta_j^k$ is the $j$th entry of $\beta^k$.
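For a sense of scale, the iteration bound in Part (a) can be evaluated directly once the initial objective value, a lower bound on the objective, $M$, and $\gamma_{opt}$ are given. The helper below is a minimal sketch; the function and argument names are ours, and in practice the infimum of $f_\lambda$ is rarely known, so any valid lower bound would be substituted for it (which can only enlarge the returned value).

```python
import math

def max_iterations(f_init, f_lower, M, gamma_opt):
    """Upper bound floor(2*M*(f_init - f_lower)/gamma_opt**2) + 1 on k*.

    f_init  : objective value at the initial point beta^0
    f_lower : the infimum of the objective, or any lower bound on it
    """
    if gamma_opt <= 0 or M <= 0:
        raise ValueError("M and gamma_opt must be positive")
    return math.floor(2.0 * M * (f_init - f_lower) / gamma_opt ** 2) + 1
```

Halving $\gamma_{opt}$ roughly quadruples the bound, which is the $O(\gamma_{opt}^{-2})$ behavior stated for the algorithm.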
We would like to make a few remarks on Theorem 5.1 in the following.
•
The assumptions of this theorem include the stipulation of $a<\frac1M$, which is consistent with the requirement on $a$ in the generalizability results in the previous section. More specifically, we may let $a<\min\{U_{L,\infty}^{-1},M^{-1}\}$ to satisfy the conditions for both Theorem 5.1 and Proposition 1 simultaneously. This observation can be generalized to almost all of our main sample complexity results. Another important assumption we have made is that $f$ is smooth; that is, $\nabla f$ is (globally) Lipschitz continuous. While many machine learning problems satisfy such a condition, it is violated by a nonsmooth HDSL problem and a ReLU-NN. Nonetheless, as we show in Section 6, the nonsmooth learning problems, including the SVM, can be analyzed through a smooth approximation. As for a ReLU-NN, we demonstrate that Algorithm 1 can still be effective with the aid of a tractable initialization scheme.
•
From Part (b) of the result, the $\gamma_{opt}$-S3ONC solution is a $\gamma_{opt}$-approximation to the S3ONC as in Definition 3.1, if we let $L_n(\cdot,Z_1^n):=f(\cdot)$. One may see that (27) is a $\gamma_{opt}$-approximation to the first-order KKT conditions in (13). Meanwhile, the second set of conditions in (14) is met exactly.
•
It is easy to re-organize the results from Parts (a) and (b) of Theorem 5.1 to see that the algorithm runs for $O(\gamma_{opt}^{-2})$-many iterations to generate a $\gamma_{opt}$-S3ONC solution. This iteration complexity is polynomial in the problem dimensionality and the numeric value of the problem data input.
Since the per-iteration problems admit closed forms, we can then see that Algorithm 1 is among the class of pseudo-polynomial-time algorithms. It is worth noting that many existing alternatives are more generic and can compute stronger necessary conditions than the S3ONC. Nonetheless, the new algorithm can still be of independent interest. Compared to $O(\gamma_{opt}^{-3})$, the best-known rate to ensure a $\gamma_{opt}$-approximation to the second-order necessary conditions in the literature, our proposed gradient-based method yields a significantly better computational complexity.
•
Part (c) indicates that the output of the algorithm is no worse than the initial solution in terms of minimizing the objective function fλ. This property ensures conditions like Ln,λ(β,Z1n)≤Ln,λ(βℓ1,Z1n) in the sample complexity results in, e.g., Part (ii) of Theorem 4.9, if Algorithm 1 is initialized with βℓ1.
•
Part (d) is useful for our subsequent analysis. One may verify that the proof of this part holds even if f(⋅) is not continuously differentiable.
We observe that both the per-iteration problems (24) and (25) admit closed-form solutions. To see this, we note that (24) is essentially a soft thresholding problem, whose closed form is well-known. As for (25), we observe that it can be decomposed into $p$-many one-dimensional problems. Enumerating all the KKT solutions to each of these decomposed problems and noticing that $a<M^{-1}$, one may verify that, for all $j=1,\dots,p$,
[TABLE]
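For readers less familiar with the soft thresholding step mentioned above, the generic operator is sketched below. It solves $\min_x \frac12\|x-v\|^2+\sum_j\tau_j|x_j|$ entrywise. This is the standard operator only: how $v$ and the thresholds $\tau_j$ are formed from the current iterate in subproblem (24) is not reproduced here, and the names `soft_threshold`, `v`, and `tau` are ours.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise minimizer of 0.5*(x - v)^2 + tau*|x|, with tau >= 0.

    `tau` may be a scalar or an array of per-coordinate thresholds.
    """
    v = np.asarray(v, dtype=float)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

Entries with magnitude below the threshold are set to zero and the remaining entries are shrunk toward zero by the threshold.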
6 Theoretical Applications
In this section, we discuss two important theoretical applications of Proposition 1 and Theorem 4.9. Section 6.1 presents our results for a flexible class of nonsmooth HDSL problems. Section 6.2 then considers the generalizability of an FCP-regularized (deep) NN.
6.1 Nonsmooth HDSL under A-sparsity
The nonsmooth HDSL problem of our consideration is formulated as below:
[TABLE]
where A(⋅):W→ℜm×p is deterministic and measurable (and may be nonlinear in “⋅”), U⊆ℜm is a convex and compact set with a diameter D:=max{∥u1−u2∥:u1,u2∈U}, and f1:ℜp×W→ℜ and ϕ:U×W→ℜ are deterministic, measurable functions. Let f1(⋅,z) be continuously differentiable with
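$\left|\left[\frac{\partial f_1(\beta,z)}{\partial\beta_j}\right]_{\beta=\beta+\delta\cdot e_j}-\left[\frac{\partial f_1(\beta,z)}{\partial\beta_j}\right]_{\beta=\beta}\right|\le U_{f_1}\cdot|\delta|$ for almost every $z\in W$ and for all $\beta\in\Re^p$, $\delta\in\Re$, and $j=1,\dots,p$. Let $\phi(\cdot,z)$ be convex and continuous for almost every $z\in W$. As some standard and non-critical regularity conditions, it is assumed that $E[n^{-1}\sum_{i=1}^nL_{ns}(\beta,Z_i)]$ is well-defined for all $\beta\in\Re^p$ with $\inf_\beta E[n^{-1}\sum_{i=1}^nL_{ns}(\beta,Z_i)]>-\infty$ and there exists some vector $\beta^*_{\varepsilon_A'}\in\Re^p$ with $\|\beta^*_{\varepsilon_A'}\|_\infty\le R$, such that $E[n^{-1}\sum_{i=1}^nL_{ns}(\beta^*_{\varepsilon_A'},Z_i)]-\inf_\beta E[n^{-1}\sum_{i=1}^nL_{ns}(\beta,Z_i)]\le\varepsilon_A'$ for some $\varepsilon_A'\ge0$. In the foregoing settings, A-sparsity (in the sense of Assumption 1) holds with $\varepsilon_A:=\varepsilon_A'$ and we are again interested in estimating the vector of true parameters $\beta^*\in\arg\inf_\beta E[n^{-1}\sum_{i=1}^nL_{ns}(\beta,Z_i)]$.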
Such a problem is general enough to cover some important nonsmooth learning problems, such as the least quantile linear regression, the least absolute deviation regression, and the SVM.
Compared to our results in Section 4, a nuance here is that Problem (28) has an empirical risk function that is not everywhere differentiable due to the presence of a maximum operator. The non-differentiable point may reside anywhere, such as at, or in some near neighborhood of, the vector of true parameters.
In view of this subtlety, we propose the following FCP-based formulation.
[TABLE]
for a user-specified $u_0\in U$ and $\delta>0$ (which is chosen to be $\delta=\frac14$ later in our theory).
Note that the proposed formulation in (29) is not an immediate instantiation of (3) for the population-level problem $\inf_\beta E[n^{-1}\sum_{i=1}^nL_{ns}(\beta,Z_i)]$. Indeed, apart from the FCP-based regularization term, an additional quadratic function $-\frac{1}{2n^\delta}\|u-u_0\|^2$ is also included in (29). The purpose of this extra term is to add regularities in order to facilitate our analysis; although $L_n(\beta,Z_1^n):=\frac1n\sum_{i=1}^nL_{ns}(\beta,Z_i)$ is not everywhere differentiable,
[TABLE]
is verifiably a continuously differentiable approximation to Ln(β,Z1n). The error incurred by this approximation can be controlled by properly determining the hyper-parameter δ. Furthermore, invoking Theorem 1 by Nesterov (2005) (restated as Theorem 13.15 for completeness), one may derive the Lipschitz constant of the gradient of Ln,δ(⋅,Z1n). This observation is formalized in Part (a) of Theorem 6.2 below.
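The effect of the quadratic smoothing term can be illustrated on a scalar hinge term, which is the building block of the SVM loss. Writing $\max\{0,t\}=\max_{u\in[0,1]}u\cdot t$ and subtracting $\frac{\mu}{2}(u-u_0)^2$ inside the maximum gives a differentiable function whose derivative is $1/\mu$-Lipschitz and which underestimates the hinge by at most $\mu/2$ when $u_0\in[0,1]$. The sketch below is an illustration under these assumptions only: `smoothed_hinge` and `mu` are our names, $\mu=n^{-\delta}$ is chosen to mirror the quadratic term described above, and the exact per-sample form of (29) and (30) is not reproduced.

```python
import numpy as np

def smoothed_hinge(t, mu, u0=0.0):
    """Nesterov-style smoothing of max(0, t) over the dual variable u in [0, 1].

    Returns max_{u in [0,1]} { u*t - 0.5*mu*(u - u0)**2 }, whose maximizer
    is the projection of u0 + t/mu onto [0, 1].
    """
    t = np.asarray(t, dtype=float)
    u_star = np.clip(u0 + t / mu, 0.0, 1.0)
    return u_star * t - 0.5 * mu * (u_star - u0) ** 2
```

With $n=100$ and $\delta=1/4$, for example, $\mu=100^{-1/4}\approx0.32$, so the smoothed value is within about $0.16$ of the hinge everywhere, and the gap shrinks as $n$ grows.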
With this approximation, the nonsmooth HDSL problem can now be analyzed via the framework of HDSL under A-sparsity; we can consider the approximation error as a composite of εA in the definition of A-sparsity. Via this perspective, we may easily apply results from Proposition 1 or Theorem 4.9 to (30) after some conversions of the settings.
In doing so, we impose the following two assumptions, which are instantiations of Assumptions 2 and 3, respectively:
Assumption 4
For all β∈ℜp:∥β∥∞≤R and i=1,...,n, it holds that ∥Lns(β,Zi)−E[Lns(β,Zi)]∥ψ1≤σ,
for some σ≥1.
Assumption 5
For some measurable and deterministic function C:W→ℜ+, the random variable C(Zi) satisfies that
(i)
∥C(Zi)−E[C(Zi)]∥ψ1≤σL,
for all i=1,...,n for some σL≥1, and
(ii)
E[C(Zi)]≤Cμ for all i=1,...,n for some Cμ≥1.
Furthermore,
∣Lns(β1,z)−Lns(β2,z)∣≤C(z)∥β1−β2∥,
for all β1,β2∈ℜp∩{β:∥β∥∞≤R} and almost every z∈W.
Remark 6.1
Similar to Assumptions 2 and 3, the foregoing two conditions ensure that the underlying distribution is subexponential and that a Lipschitz-like inequality holds for Lns(⋅,z).
We are now ready to present our results on nonsmooth HDSL in the following theorem, which leads to what is claimed in Eq. (8). Similar to Section 4, we adopt the short-hand, ζ:=ln(3eR⋅(σL+Cμ)).
Theorem 6.2
Suppose that $\|A(z)\|_{1,2}^2\le U_A$ for some $U_A\ge0$ and for almost every $z\in W$. Let Assumptions 1, 4, and 5 hold (where $\varepsilon_A$ and $L(\cdot)$ from Assumption 1 become $\varepsilon_A'$ and $E[L_{ns}(\cdot,Z)]$, respectively). The following statements hold:
(a)
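For any $\delta>0$, all $j=1,\dots,p$, every $\beta\in\Re^p$, and almost every $Z_1^n\in W^n$, the partial derivative $\frac{\partial L_{n,\delta}(\beta,Z_1^n)}{\partial\beta_j}$ is well-defined and Lipschitz continuous with $\left|\left[\frac{\partial L_{n,\delta}(\beta,Z_1^n)}{\partial\beta_j}\right]_{\beta=\beta+h\cdot e_j}-\left[\frac{\partial L_{n,\delta}(\beta,Z_1^n)}{\partial\beta_j}\right]_{\beta=\beta}\right|\le(U_{f_1}+n^{\delta}U_A)\cdot|h|$ for any $h\in\Re$.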
(b)
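Let $\delta=\frac14$, $a=\frac{1}{2(U_{f_1}+n^{1/4}U_A)}$, and $\lambda:=\sqrt{\frac{8\sigma}{c\cdot a\cdot n^{3/8}}\left[\ln(n^{3/8}p)+\zeta\right]}$ for the same $c$ in (12). For almost every $Z_1^n\in W^n$, assume that the minimization problem $\min_\beta L_{n,\delta}(\beta,Z_1^n)+\lambda|\beta|$ admits a finite optimal solution denoted by $\beta^{\ell_1,\delta}:=\beta^{\ell_1,\delta}(Z_1^n)$. Consider any random vector $\beta\in\Re^p$ such that $\|\beta\|_\infty\le R$, $L_{n,\delta,\lambda}(\beta,Z_1^n)\le L_{n,\delta,\lambda}(\beta^{\ell_1,\delta},Z_1^n)$ almost surely, and $\beta$ satisfies the S3ONC($Z_1^n$) to (29) w.p.1. For some universal constant $C_5>0$, if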
[TABLE]
where D:=max{∥u1−u2∥:u1,u2∈U}, then
[TABLE]
with probability at least $1-2(p+1)\exp(-n/C_5)-6\exp(-2cn^{1/2})$.
It is possible to generalize Part (b) of the above theorem to obtain an error bound parameterized by any δ>0. Nonetheless, the optimal choice to balance all the error terms would be δ=1/4.
Remark 6.5
Theorem 6.2 is general enough to cover a flexible class of nonsmooth HDSL problems under A-sparsity. Particularly, in the case of the high-dimensional SVM, Problem (28) becomes
[TABLE]
where (xi,yi), for i=1,...,n, are i.i.d. random pairs of the feature values and the categorical labels with support {x∈ℜp:∣x∣≤1}×{−1,+1}, and ρ≥0 is a user-specified constant. (The assumption that ∣xi∣≤1, a.s., can always be ensured by normalization.)
We may enable the SVM to handle high dimensionality via the formulation below:
[TABLE]
where the value of u0∈[0,1] can be specified arbitrarily.
As a special case of (29), Problem (34) satisfies both Assumptions 4 and 5. For example, when ρ=0.01, both of the assumptions are met with σ≤O(1), R≤O(1), σL=0, and Cμ≤O(1)⋅p. (More detailed derivations are provided in Section 12 of the electronic companion.) Also observe that we may let f1, Uf1, D, and A(Zin) from Theorem 6.2 be
[TABLE]
respectively, in the SVM. Thus, $U_{f_1}\le O(1)$, $D=1$, and $U_A\le\max_{y,x}\{\|y\cdot x^\top\|_{1,2}^2:\ y\in\{-1,1\},\ |x|\le1\}\le1$ in this special case. Recall here that the error bound in, e.g., (32) is poly-logarithmic in $C_\mu$. Theorem 6.2 then implies that the poly-logarithmic sample complexity can also be achieved for the FCP-regularized SVM.
In contrast to (34), an alternative formulation as below has been previously discussed in the literature:
[TABLE]
where Pλ(∣⋅∣):ℜ→ℜ is some sparsity-inducing regularization function, such as the SCAD and the Lasso.
Compared with (34), this alternative does not incorporate the smoothing term of $-\frac{1}{2n^\delta}(u_i-u_0)^2$.
Such a formulation has been shown to be successful in multiple realistic classification problems (e.g., Zhang et al. 2006). Furthermore, recovery theories in different high-dimensional settings have been established by Zhang et al. (2016b, c) and Peng et al. (2016), etc. Nonetheless, the existing results commonly stipulate a strictly positive lower bound on the eigenvalues of some principal submatrices of X⊤X or E[X⊤X], where X:=(xi⊤:i=1,...,n). Some of these conditions are the instantiations of the RE condition in the SVM problem. In contrast, our bound on the excess risk is established without these eigenvalue conditions.
6.2 Regularized deep neural networks
This subsection presents a generalization error bound for a flexible set of NN architectures. Additional results are provided in Section 9 of the electronic companion, where we derive more explicit error bounds under additional regularities.
While NNs can be applied to a wide spectrum of data-driven tasks, our analysis herein is focused on a binary classification problem in the following settings.
For some X:={x∈ℜd:∥x∥=1} and Y∈{−1,1} (where d>0 is some integer), let (x,y)∈X×Y be a random pair that follows an unknown probability distribution D on X×Y with support supp(D). Here, x is the vector of random feature values and y is the corresponding class label. We assume that there exists an unknown, deterministic, and measurable separating function g:X→ℜ such that inf(x,y)∈supp(D){y⋅g(x)}≥v for some v∈(0,1); that is, the two categories of data are separable by function g. Also assume that E[∣g(x)∣]<∞.
The learning problem of interest here, as a special case of (1), is to train a classifier using the knowledge of a sequence of i.i.d. random samples, (xi,yi), i=1,...,n, of (x,y).
In applying an NN to solving this learning problem, we narrow down the search of the optimal classifier to the determination of the best fitting parameters for the NN. Some relevant details are given below. Denote by $\Psi:\Re\to\Re$ an activation function, such as the ReLU, $\Psi_{ReLU}(x)=\max\{0,x\}$, the softplus, $\Psi_{softplus}(x)=\ln(1+e^x)$, and the sigmoid, $\Psi_{sigmoid}(x)=\frac{e^x}{1+e^x}$. The NN model is then a network that consists of multiple layers (groups) of neurons (or units). Each neuron is a computing unit that performs the operations of the chosen activation function on the input signals.
Architectures among those layers are formed in the sense that the signals are passed from the layer of input neurons to the layer of output units, traversing a predetermined collection of candidate paths. Each path may comprise multiple neurons and connections. Fitting parameters often exist in the forms of connection weights and biases to amplify (or attenuate) and offset the signals, respectively. A layer that is neither the input layer nor the output layer is called a hidden layer. Throughout our discussions on the NNs, we let D≥2 be the number of layers (excluding the input layer but including the output layer). A neuron in a hidden layer is called a hidden neuron.
We denote this NN by FNN(x,β), where FNN:X×ℜp→ℜ is a deterministic, measurable function that captures the output of an NN given input x and fitting parameters β. We also assume that there exists a deterministic function Ω:{1,...,p}→ℜ+ such that
[TABLE]
Intuitively, Ω(p′) measures the model misspecification error incurred by the NN in representing g, when only p′-many fitting parameters are nonzero (active).
In training the NN, we focus on the following formulation as a special case to (3):
[TABLE]
where we follow Cao and Gu (2020, 2019) in defining F:ℜ→ℜ+ to be F(z):=ln(1+exp(−z)). Note that, if we drop the regularization term ∑j=1pPλ(∣βj∣), then (37) is reduced to the conventional training formulation for an NN. Hereafter, we assume that E[∣FNN(x,β)∣]<∞ for all β:∥β∥∞≤RΩ for some RΩ>0. This quantity should be properly large to ensure the satisfaction of the assumption below.
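The scalar functions named in this subsection, namely the three activation functions and the loss $F(z)=\ln(1+\exp(-z))$, can be written in a numerically stable way as sketched below. The function names are ours, and the use of `np.logaddexp` to avoid overflow for large $|z|$ is an implementation choice, not something prescribed by the paper.

```python
import numpy as np

def relu(x):
    """ReLU activation max{0, x}."""
    return np.maximum(0.0, x)

def softplus(x):
    """Softplus activation ln(1 + e^x), computed as logaddexp(0, x)."""
    return np.logaddexp(0.0, x)

def sigmoid(x):
    """Sigmoid activation e^x / (1 + e^x), computed as exp(x - softplus(x))."""
    return np.exp(x - np.logaddexp(0.0, x))

def logistic_loss(z):
    """F(z) = ln(1 + exp(-z)), the loss applied to y * F_NN(x, beta)."""
    return np.logaddexp(0.0, -z)
```

Since the derivative of $F$ is $-1/(1+e^{z})\in(-1,0)$, $F$ is decreasing and 1-Lipschitz, which is the property used in the remark that follows.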
Assumption 6
For all $1\le s_A\le p$, it holds that $\emptyset\ne[-R_\Omega,R_\Omega]^p\cap\{\beta\in\Re^p:\ E[|g(x)-F_{NN}(x,\beta)|]\le\Omega(s_A),\ \|\beta\|_0\le s_A\}$.
Intuitively, Assumption 6 means that the NN can represent the separating function $g$ with a model misspecification error of no more than $\Omega(s_A)$ when (a) no more than $s_A$-many fitting parameters are nonzero and (b) the absolute values of these fitting parameters are bounded from above by $R_\Omega>0$.
We also impose the following non-critical condition on the architecture of an NN.
Assumption 7
For any constants C∈ℜ, RΩ>0, p′≥1, and fitting parameters β1∈ℜp:∥β1∥∞≤RΩ,∥β1∥0≤p′, it holds that FNN(x,β1)⋅C=FNN(x,β2) for some β2∈ℜp:∥β2∥∞≤C⋅RΩ,∥β2∥0≤p′, for every x∈X.
It can be verified that Assumption 7 holds for many NN architectures, including many convolutional neural networks and residual networks that have linear or ReLU activation functions in the output layer.
Remark 6.6
Given Assumptions 6 and 7, we argue that the generalizability of an NN trained by solving (37) can be analyzed through the framework of HDSL under A-sparsity. Based on the existing results on the representability of NNs, e.g., by DeVore et al. (1989), Yarotsky (2017), Mhaskar and Poggio (2016), and Mhaskar (1996), an NN with a reasonably small network size $s_A$ may well represent $g$ (such that $\Omega(s_A)$ is small) under some plausible conditions. These representability results imply the innate presence of A-sparsity in an NN model. Observe that $F$ is 1-Lipschitz continuous. Thus, $E[F(y\cdot\frac{\ln n}{2v}\cdot F_{NN}(x,\beta_1))]-E[F(y\cdot\frac{\ln n}{2v}\cdot g(x))]\le\frac{\ln n}{2v}\cdot E[|F_{NN}(x,\beta_1)-g(x)|]$ for any $\beta_1:\ \|\beta_1\|_\infty\le R_\Omega$. Invoking Assumption 6 and the fact that $\inf_uF(u)=0$, we obtain that
[TABLE]
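where the last inequality is due to the assumption that, for all $(x,y)\in\mathrm{supp}(D)$, it holds that $y\cdot g(x)\ge v$, which implies $E[F(y\cdot\frac{\ln n}{2v}\cdot g(x))]\le\ln(1+\exp(-0.5\ln n))\le\frac{1}{\sqrt n}$. Further note that, by Assumption 7, $\frac{\ln n}{2v}\cdot F_{NN}(x,\beta)$ can be represented by the same NN architecture; that is, $\frac{\ln n}{2v}\cdot F_{NN}(x,\beta)=F_{NN}(x,\beta')$ for some new fitting parameters $\beta':\ \|\beta'\|_\infty\le\frac{\ln n}{2v}R_\Omega$. Thus, we may have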
[TABLE]
which matches the statement of Assumption 1 with $s:=s_A$, $R:=\frac{\ln n}{2v}\cdot R_\Omega$, $\varepsilon_A:=\frac{\ln n}{2v}\cdot\Omega(s_A)+\frac{1}{\sqrt n}$, and $L_g^*:=\inf_uF(u)=0$.
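The bound on the loss at the rescaled separating function can be checked in two short steps. The chain below spells it out for $n\ge1$, using only that $F(z)=\ln(1+e^{-z})$ is decreasing and that $\ln(1+t)\le t$ for $t\ge0$.

```latex
\begin{align*}
y\cdot g(x)\ \ge\ v
 &\;\Longrightarrow\;
 y\cdot\frac{\ln n}{2v}\cdot g(x)\ \ge\ \frac{\ln n}{2},\\[4pt]
F\!\left(y\cdot\frac{\ln n}{2v}\cdot g(x)\right)
 &\ \le\ F\!\left(\tfrac12\ln n\right)
 \ =\ \ln\!\left(1+e^{-\frac12\ln n}\right)
 \ =\ \ln\!\left(1+n^{-1/2}\right)
 \ \le\ n^{-1/2}.
\end{align*}
```

Taking expectations over $(x,y)$ preserves the inequality, which gives the additive $n^{-1/2}$ contribution to the A-sparsity parameter.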
As mentioned, explicit forms of Ω(⋅) have been provided, e.g., by DeVore et al. (1989), Yarotsky (2017), Mhaskar and Poggio (2016), and Mhaskar (1996). With the above discussion, the generalizability of an NN can then be derived using the same machinery for HDSL under A-sparsity, under one more flexible assumption on the NN’s architecture as below.
Assumption 8
For almost every $x\in X$, it holds that the gradient $\nabla_\beta F_{NN}(x,\beta)$ and Hessian $\nabla_\beta^2F_{NN}(x,\beta)$ of $F_{NN}(x,\cdot)$ are everywhere well-defined and satisfy that
[TABLE]
for all β∈ℜp and some UNN≥1.
Assumption 8 essentially allows the norms of the gradient and Hessian to grow exponentially in the number of layers D. Such an assumption is satisfied by a wide spectrum of NN architectures, especially when the activation functions are smooth. Some NNs with nonsmooth activation functions, such as the ReLU, may still be analyzed. We discuss such a case later in Subsection 9.2.
We are now ready to present our result on the generalizability of a regularized NN. With some abuse of notations, the S3ONC(Z1n), in this special case, is referred to as the S3ONC(X,y) to problem (37), where X:=(xi⊤) and y:=(yi).
Theorem 6.7
Consider any random vector $\beta$ such that $\|\beta\|_\infty\le\frac{\ln n}{2v}\cdot R_\Omega$ and the S3ONC($X,y$) holds at $\beta$ almost surely.
Suppose that Assumptions 6, 7, and 8 hold.
For any fixed Γ≥0, assume that Tn,λ(β)−infβTn,λ(β)≤Γ, w.p.1.,
where Tn,λ is as defined in (37).
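There exists a universal constant $C_6>0$ such that, for any $s_A:\ 1\le s_A\le p$, if $a<\frac12\cdot\exp\{-2U_{NN}\cdot D\cdot\ln[2p\cdot v^{-1}\cdot U_{NN}\cdot R_\Omega\cdot\ln n]\}$, $\lambda:=\sqrt{\frac{8\sigma}{c\cdot a\cdot n^{2/3}}\left[\ln\left(\frac{3e\cdot R_\Omega pn^{4/3}}{2v}\right)+U_{NN}\cdot D\cdot\ln(U_{NN}R_\Omega pnv^{-1})\right]}$,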
and
[TABLE]
then it holds that
[TABLE]
with probability at least $1-C_6p\exp(-C_6n)-C_6\exp(-C_6n^{1/3})$. Here, $\Omega(\cdot)$ is defined as in (36).
We would like to make a few remarks on the results presented in this theorem.
(i)
$E[\mathbb{1}(y\cdot F_{NN}(x,\beta)<0)]=P[y\cdot F_{NN}(x,\beta)<0]$ is also referred to as the expected 0-1 loss and is a commonly adopted measure of generalization performance in binary classification problems, e.g., by Cao and Gu (2020, 2019).
(ii)
This theorem provides the promised poly-logarithmic dependence between the sample size n and the dimensionality p; polynomially increasing n can compensate for the exponential growth in p. With this result, the generalizability of an over-parameterized NN is ensured, and the promised result in (9) is proven. The error bound can be made more explicit under some additional conditions as discussed in Section 9.1.
(iii)
Although Assumption 8 allows the Lipschitz constant to grow exponentially in the number of layers D, the generalization error increases no more than linearly in D.
(iv)
Many sparsity-inducing regularization schemes have been discussed in the literature, including Dropout (Srivastava et al. 2014), sparsity-inducing penalization (Han et al. 2015, Scardapane et al. 2017, Louizos et al. 2017, Wen et al. 2016), DropConnect (Wan et al. 2013), randomDrop (Huang et al. 2016), and pruning (Alford et al. 2018), etc. Many of these studies are focused on the numerical aspects, yet the theoretical guarantees on the effectiveness of regularization are still largely lacking. Although Wan et al. (2013) presented generalization error analyses for DropConnect, the dependence among the dimensionality, the generalization error, and the sample size is not explicated therein.
It is our conjecture that our results could be extended to and combined with the alternative regularization schemes to facilitate the analysis of the regularized NNs.
(v)
Theorem 6.7 informs us that the generalization performance of the NNs is consistent with the optimization quality. If all other quantities are fixed, the generalization error can be bounded by $O(\Gamma+\sqrt{\Gamma})$, where we recall that Γ≥0 is the suboptimality gap.
(vi)
Admittedly, how to control Γ is still an open question. The traditional training formulation of an NN is usually nonconvex. Thus, it is generally prohibitive to compute a global solution. The challenge is further increased by the incorporation of the FCP, which is also nonconvex. Fortunately, in spite of the current theoretical challenge, it has been observed empirically that some local optimization algorithms could well approximate a global optimum in NN training, e.g., in the experiments reported by Wan et al. (2013) and Alford et al. (2018). To explain these observations, several theoretical paradigms have already been provided by, e.g., Du et al. (2018), Liang et al. (2018), Haeffele and Vidal (2017), and Wang et al. (2019a). Based on those results, it is promising that the structures of an NN (even with regularization) can often be exploited to facilitate global optimization. An excellent review of this topic is provided by Sun (2019). To add to the literature, we present an interesting special case where a suboptimality-independent generalization error bound for the FCP-regularized NN can be achieved at a pseudo-polynomial-time computable solution in Subsection 9.2 of the electronic companion.
7 Numerical Experiments
We report in this section several numerical experiments. In Sections 7.1 and 7.2, we consider the high-dimensional Huber regression under A-sparsity and the NNs, respectively. Then, Section 10 of the electronic companion presents our test results on the high-dimensional SVM (as a special nonsmooth learning problem) and some additional numerical examples on the NNs.
Unless otherwise stated explicitly, most of our experiments, including those in the electronic companion, were implemented in Matlab 2014b and run with a single thread on a PC with 40 Intel (R) Xeon (R) E5-2640-v4 CPU cores (2.40 GHz, 64 bits), and 128 GB memory. A different implementation environment was involved in the tests on some larger-scale NN models, as presented in Section 7.2.
7.1 Experiments on HDSL under A-sparsity
This section reports our test results on high-dimensional Huber regression (HR) under A-sparsity (in the sense of Assumption 1). Our experimental settings are as follows. Denote by $N(0,\sigma^{2})$ a centered normal distribution with variance $\sigma^{2}>0$ and by $N_{p}(0,\Sigma)$ a centered p-variate normal distribution with covariance matrix $\Sigma=(\varsigma_{j_1,j_2})$, where $\varsigma_{j_1,j_2}=0.3^{|j_1-j_2|}$. The training data set $\{(x_i,y_i):i=1,...,n\}$ was generated as per a linear system $y_i=x_i^{\top}\beta^{*}+\omega_i$, for i=1,...,n. Here, $(x_i,y_i)$ denotes a pair of (observed) design and response, and $\beta^{*}$ denotes the vector of true parameters to be recovered. Some additional details are summarized below:
•
The training sample size was chosen as n=100.
•
ωi, i=1,...,n, were i.i.d. white noises such that ωi∼N(0,σ2) for all i.
•
xi∼Np(0,Σ), i=1,...,n, were i.i.d. random vectors.
•
The vector of true parameters was prescribed as $\boldsymbol{\beta}^{*}=\boldsymbol{\beta}_{\varepsilon_{A}}^{*}+E\cdot\frac{v}{|v|_{1}}$, where $\boldsymbol{\beta}_{\varepsilon_{A}}^{*}:=(3,\,5,\,0,\,0,\,1.5,\underbrace{0,\,...,0}_{\text{(p-5)-many 0's}})^{\top}$ and $E\cdot\frac{v}{|v|_{1}}$ stands for some dense perturbation. Here, E>0 denotes a user-specified scalar and $v=(v_j)$ denotes a random vector with i.i.d. entries that are uniformly distributed on [−1,1]. Note that the magnitude of the perturbation can be calculated as $\left|E\cdot\frac{v}{|v|_{1}}\right|_{1}=E$.
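The data-generating process listed above can be sketched as follows. This is a minimal illustration, not the authors' Matlab code: it assumes the perturbation is E·v/|v|₁ (so that its ℓ1-norm equals E), and the function name `make_hr_data`, the default noise level `sigma=1.0`, and the seed are hypothetical choices, since the noise variance is not stated in this section.

```python
import numpy as np

def make_hr_data(n, p, E, sigma=1.0, rho=0.3, seed=0):
    """Simulate (X, y) from y_i = x_i' beta* + omega_i with an approximately sparse beta*."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # Covariance matrix with entries rho^{|j1 - j2|}
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    # Sample rows x_i ~ N_p(0, Sigma) through a Cholesky factor
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, p)) @ L.T
    # Sparse part of the true parameters: (3, 5, 0, 0, 1.5, 0, ..., 0)
    beta_sparse = np.zeros(p)
    beta_sparse[[0, 1, 4]] = [3.0, 5.0, 1.5]
    # Dense perturbation E * v / |v|_1 with v uniform on [-1, 1]
    v = rng.uniform(-1.0, 1.0, size=p)
    beta_star = beta_sparse + E * v / np.abs(v).sum()
    # White noise with standard deviation sigma (assumed value)
    y = X @ beta_star + rng.normal(0.0, sigma, size=n)
    return X, y, beta_star, beta_sparse
```

Calling `make_hr_data(n=100, p=1000, E=10.0)` would mimic one training set of the size used in this section, up to the assumed noise level.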
Given the above, this experiment was focused on the following HR problem:
[TABLE]
The corresponding FCP-regularized formulation, referred to as the HR-FCP, is then given as
[TABLE]
This problem was solved via Algorithm 1, for which the initial solution was prescribed as $\beta^{\ell_1}\in\arg\min_{\beta}\ n^{-1}\sum_{i=1}^{n}L_{HR}(\beta,x_i,y_i)+\lambda\cdot\sum_{j=1}^{p}|\beta_j|$ for the same λ as in (41).
The hyper-parameters of Algorithm 1 were set to be M=10 and $\gamma_{opt}=10^{-5}$. For the FCP, we fixed a=0.09 (such that $a<M^{-1}$) and prescribed that $\lambda:=C_{fcp}\cdot\sqrt{\frac{\ln p}{n^{2/3}}}$ for some Cfcp>0. To choose Cfcp, three independent validation datasets, each with 100 observations, were generated following the same approach as for the training data above. The dimensions of those validation sets were p∈{500,750,1000}. The value of Cfcp was chosen to be the best-performing one on the validation data among the candidate values {0.5,0.75,1,1.25,1.5}. More specifically, a linear model was trained on the training data with Cfcp and p fixed at every combination of their candidate values listed above. We let β1,Cfcp, β2,Cfcp, and β3,Cfcp be the resultant estimators for a fixed Cfcp when p=500, 750, and 1000, respectively. The chosen value of Cfcp was the one that minimized the average error over all the validation sets, calculated as follows:
[TABLE]
Here, (xival,k′,yival,k′), for k′∈{1,2,3}, is the ith data from the k′th validation set. As it turned out, Cfcp:=1.
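For readers who want to experiment with the tuning parameters (a, λ), the penalty can be sketched as below. The FCP is defined earlier in the paper and not in this section, so the code assumes the MCP-type form P_λ(t) = ∫₀ᵗ [aλ − s]₊ / a ds; the function names and the example values are illustrative only.

```python
import numpy as np

def fcp_penalty(t, lam, a):
    """MCP-type folded concave penalty (assumed form): integral of max(a*lam - s, 0)/a on [0, |t|]."""
    t = np.abs(np.asarray(t, dtype=float))
    # Quadratic branch for |t| <= a*lam, flat at a*lam^2/2 beyond that point
    return np.where(t <= a * lam, lam * t - t**2 / (2.0 * a), a * lam**2 / 2.0)

def fcp_derivative(t, lam, a):
    """Derivative of the penalty with respect to |t|: max(a*lam - |t|, 0)/a."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.maximum(a * lam - t, 0.0) / a

# Example with a = 0.09 as in the text and an arbitrary lam
a, lam = 0.09, 0.5
values = fcp_penalty([0.0, 0.01, 0.045, 1.0], lam, a)
```

Under this assumed form, the penalty behaves like λ|t| near zero and becomes constant once |t| exceeds aλ, which is why large coefficients are not shrunk further.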
The HR-FCP was compared with two alternative schemes: (i) the HR without any regularization, denoted by HR, and (ii) the HR with the ℓ1-norm regularization, denoted by HR-L1. (The HR-L1 has been discussed by Owen (2007), among others.)
The coefficient for the ℓ1-norm penalty was chosen to be $\lambda_{\ell_1}:=C_{\ell_1}\cdot\sqrt{\frac{\ln p}{n}}$ for some Cℓ1>0. The dependence of λℓ1 on p and n is consistent with the theoretical results for the ℓ1-norm regularization (e.g., by Negahban et al. (2012)). We determined Cℓ1:=0.5 using the same approach as in choosing Cfcp above.
To evaluate the out-of-sample performance, 5000-many independent test data observations were simulated for each problem instance, following the same data generation process for the training data above. If we let (xitest,yitest), i=1,...,5000, be the test data of a problem instance, the out-of-sample error of an estimator β was calculated by
[TABLE]
Each experiment was randomly replicated 100 times. Figure 1 presents the numerical results. We discuss this figure in relative detail below.
•
In all the subplots (a) through (g) of Figure 1, blue solid lines, red dot-dashed lines, and yellow dashed lines represent the out-of-sample errors generated by the HR-FCP, the HR-L1, and the HR, respectively. The green dotted lines stand for the estimated values of εA, a quantity involved in the definition of A-sparsity. The values of εA were estimated by (43) with β:=βεA∗. The error bars in the plots are all centered at the averages over 100 random replications, and the radii of the error bars are 1.96 times the corresponding standard errors.
•
Subplots (a) and (b) show the comparison of the HR-FCP with the HR-L1 and with the HR, respectively, when the logarithm of the dimensionality (lnp) was increased gradually with p∈{200,300,...,5000} and E=0. From both subplots (a) and (b), one can see that the out-of-sample errors generated by the HR-FCP were small for all the values of lnp, especially when the HR-FCP was compared with both the HR and the HR-L1. In particular (as in Subplot (b)), the performance of the HR deteriorated rapidly as lnp grew, while the performance of the HR-FCP remained approximately constant. Because our error bounds for the HR-FCP are polynomial in lnp, it appears that an even sharper dependence on lnp may be pursued in our analysis, at least for certain HDSL special cases.
•
Subplots (c) and (d) present the performance of all three schemes above when the sample size n was increased from 100 to 1000 (with E=10 and p=1000). From both subplots, one can observe that the HR-FCP outperformed both the HR and the HR-L1. Also shown in these two subplots are the values of εA (denoted by “ϵA” in the figure). It can be observed that the out-of-sample errors of the HR-FCP matched the values of εA, especially when the sample size was relatively large. This pattern was consistent with our error bounds.
•
As shown in Subplots (e) and (f), all three schemes above were compared again when E was increased gradually (and, as a result, εA would tend to grow). Consistent with our theoretical results, the out-of-sample errors yielded by the HR-FCP approximately matched the values of εA (denoted by “ϵA” in the plots). Furthermore, regardless of the values of εA, the HR-FCP achieved better generalization errors than the HR and the HR-L1 in almost all of the instances. We can also observe from both subplots that, even if the magnitudes of the perturbation E were comparable to ∣βεA∗∣, the corresponding values of εA remained small. So did the out-of-sample errors generated by the HR-FCP, especially when compared with the HR’s performance. For example, when E=10, the magnitude of perturbation was larger than ∣βεA∗∣=9.5. Yet, the corresponding εA was below 0.1, and the out-of-sample error of the HR-FCP was almost equal to εA. Both values were significantly lower than the corresponding out-of-sample error of the HR.
•
In Subplot (g), the dependence of the HR-FCP and the HR-L1 on the sparsity level s was evaluated when E=10, p=1000, n=100, and $\boldsymbol{\beta}_{\varepsilon_{A}}^{*}:=(3,\,5,\,0,\,0,\,1.5,\underbrace{2,\,...,2}_{\tau\text{-many 2's}},\,\underbrace{0,\,...,0}_{(p-\tau-5)\text{-many 0's}})^{\top}$ for all τ=0,1,...,13. Thus, the corresponding values of s were s=3,4,...,16. As one may see from Subplot (g), the performance of both the HR-FCP and the HR-L1 deteriorated when s increased. Yet, the HR-L1 seemed to be more sensitive to the change in s than the HR-FCP.
•
Finally, Subplot (h) presents the numerical evaluation of the dependence of the HR-FCP’s out-of-sample performance on Γ. Note that, in the case of Huber regression, $\Gamma:=\left[n^{-1}\sum_{i=1}^{n}L_{HR}(\beta,x_i,y_i)+\sum_{j=1}^{p}P_{\lambda}(|\beta_j|)\right]-\left[n^{-1}\sum_{i=1}^{n}L_{HR}(\beta_{\varepsilon_A}^{*},x_i,y_i)+\sum_{j=1}^{p}P_{\lambda}(|\beta_{\varepsilon_A,j}^{*}|)\right]$ is an underestimation of the suboptimality gap in minimizing (41). To generate this plot, we computed S3ONC solutions from 2000 random initializations. A “+” in the plot corresponds to one of those S3ONC solutions, and the dot-dashed line stands for the linear function Y=X. If a “+” is below the line Y=X, then the out-of-sample error of that point was smaller than the corresponding value of Γ. As can be seen from this subplot, almost all the “+”s are below (but in the proximity of) the aforementioned linear function. This pattern was consistent with our error bound in (20), which is indeed of O(Γ) when Γ≥1.
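The error bars described for Figure 1 (centered at the average over the replications, with a radius of 1.96 standard errors) can be computed as in the following minimal sketch. The function name is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not specify the estimator.

```python
import numpy as np

def mean_and_error_radius(errors, z=1.96):
    """Return the average of replicated errors and the radius z * (standard error)."""
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean()
    # Standard error of the mean, based on the sample standard deviation (assumed)
    standard_error = errors.std(ddof=1) / np.sqrt(errors.size)
    return mean, z * standard_error

# Example: out-of-sample errors from a few hypothetical replications
center, radius = mean_and_error_radius([1.0, 2.0, 3.0, 4.0])
```

In the experiments above, `errors` would hold the 100 out-of-sample errors of one scheme at one setting of p, n, E, or s.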
7.2 Experiments on neural networks
We report two sets of experiments on the FCP-regularized NNs. The first set, as presented in this subsection, was focused on image classification using two mainstream testbeds, the MNIST (LeCun et al. 2013) and the CIFAR-10 datasets (Krizhevsky 2009). Leaderboards that report the state-of-the-art results can be found at, e.g., https://paperswithcode.com/. The second set of tests, as presented in Section 10.2 of the electronic companion, involved the comparison between the non-regularized NNs and their FCP-regularized counterparts in a task of binary classification with simulated data.
In this experiment of image classification, we considered a few popular or highly-ranked NN architectures (as well as their regularization and data augmentation schemes, if applicable), as below:
LN-S: A convolutional neural network called LeNet5 (LeCun et al. 1995) trained with a sparse learning strategy by Dettmers and Zettlemoyer (2019).
•
VGG-g: A deep convolutional neural network (a.k.a., VGG8B) that is trained with global loss and cutout (DeVries and Taylor 2017) regularization. This model is presented by Nøkland and Eidnes (2019).
(B) For the CIFAR-10 dataset:
•
VGG19: A deep convolutional neural network with 19 layers. The architecture was first discussed by Simonyan and Zisserman (2014), and the code for this network was made available by Li (2019).
•
shk-RN: A residual network (He et al. 2016) with a regularization scheme that combines shake-shake (Gastaldi 2017), cutout (DeVries and Taylor 2017), and mixup (Zhang et al. 2017). The code for this network was made available by Li (2019).
•
FMix (Harris et al. 2020): An NN architecture that adopts a modified mixed sample data augmentation (MSDA).
We replaced the training algorithms of the above NN implementations with Algorithm 1 with $\gamma_{opt}=10^{-6}$, using the outputs of the original implementations as the initial solutions. Some heuristic modifications were incorporated into Algorithm 1 in this replacement. First, the gradient in Algorithm 1 was replaced by an unbiased estimator of the gradient constructed on a mini-batch of the whole dataset. The mini-batch sizes remained the same as in the original implementations. Second, the values of M could vary over the iterations and were specified to be the multiplicative inverses of the learning rates (a.k.a., step sizes) of the original implementations. Third, a, the parameter in the FCP, was always set to be 0.99 times the current value of $M^{-1}$ at each iteration (a.k.a., epoch) during the NN training. Last, the value of λ, the other parameter of the FCP, was assigned heuristically to be $\lambda:=C_{\lambda}\cdot U^{-1}$, where Cλ≥0 was determined as follows for each NN: We first randomly selected 10% of the training data points to construct a balanced validation set. Then, we found the 1st, 1.25th, 2.5th, 5th, 10th, and 15th percentiles of the absolute values of the nonzero fitting parameters in the initial solution. After rounding these percentile values to their first significant digits, the resulting numbers were considered as the candidates for Cλ. From these candidates, we then selected the one that led to the best classification result on the validation set, when the NN model was trained on the rest of the training set.
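The candidate-generation step for Cλ described above can be sketched as follows. This is an illustrative reading, not the authors' PyTorch code: the function names are hypothetical, `params` stands for a flat array of the fitting parameters of the initial solution, and removing duplicate candidates after rounding is an added convenience.

```python
import numpy as np

def round_to_first_significant_digit(x):
    """Round a positive number to one significant digit, e.g. 0.0234 -> 0.02."""
    exponent = np.floor(np.log10(x))
    return float(np.round(x / 10.0**exponent) * 10.0**exponent)

def c_lambda_candidates(params, percentiles=(1, 1.25, 2.5, 5, 10, 15)):
    """Percentiles of |nonzero parameters|, each rounded to its first significant digit."""
    params = np.asarray(params, dtype=float).ravel()
    nonzero_abs = np.abs(params[params != 0])
    values = np.percentile(nonzero_abs, percentiles)
    # Nearby percentiles may collapse to the same number after rounding
    return sorted({round_to_first_significant_digit(v) for v in values})

# Example with synthetic "initial solution" parameters
rng = np.random.default_rng(0)
candidates = c_lambda_candidates(rng.normal(scale=0.05, size=10_000))
```

Each candidate would then be evaluated by training on the remaining 90% of the training data and scoring on the held-out validation set.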
As it turned out, Cλ was $1\times10^{-2}$, $5\times10^{-6}$, and $2\times10^{-4}$, respectively, for CNN-FCP, LN-S-FCP, and VGG-g-FCP in the experiments on the MNIST dataset, and $1\times10^{-3}$, $3\times10^{-2}$, and $1\times10^{-3}$, respectively, for VGG-19-FCP, shk-RN-FCP, and FMix-FCP in the experiments on the CIFAR-10 dataset.
The tests in this subsection were implemented using Pytorch (Paszke et al. 2017), and most of the tests were conducted on a single thread on a PC with 40 Intel (R) Xeon (R) E5-2640-v4 CPU cores (2.40 GHz, 64 bits), 128 GB memory, and one Quadro M4000 GPU (8GB memory), except that shk-RN and shk-RN-FCP were implemented using one GPU-enabled thread on Floydhub, a cloud computing platform with an Intel Xeon CPU (4 Cores), 61GB RAM, and an NVIDIA Tesla K80 GPU (12 GB Memory) and FMix and FMix-FCP were tested on the same cloud computing platform with different configurations (Intel Xeon CPU with 8 Cores, 61GB RAM, and an NVIDIA Tesla V100 GPU with 16 GB Memory).
The out-of-sample classification errors are reported in Tables 2 and 3 for results on MNIST and CIFAR-10, respectively. One may tell from the tables that the performance of all the NN architectures involved in the test was improved by incorporating the proposed FCP regularization. In particular, the best out-of-sample classification errors achieved by the FCP-regularized schemes for MNIST and CIFAR-10 were 0.23% and 1.31%, respectively, both of which were competitive against some high-performance NNs on the leaderboards (available at https://paperswithcode.com/), especially if we notice that no external data were used.
The numbers of nonzero fitting parameters of the NNs after training with and without the FCP are also reported in Tables 2 and 3. One may observe that the FCP significantly reduced the number of active fitting parameters. For the case of LN-S, the FCP was able to further reduce the dimensionality on top of the sparsity-inducing mechanisms in the original model.
8 Conclusion
In this paper, we provide a theoretical framework for HDSL under A-sparsity; that is, the high-dimensional learning problems where the vector of the true parameters may be dense but can be approximated by a sparse vector. We show that, for a problem of this type, an S3ONC solution for an FCP-based learning formulation yields a poly-logarithmic sample complexity: the required sample size is only poly-logarithmic in the number of dimensions, even if the common assumption of the RSC is absent. To compute a solution with the proven sample complexity, we propose a novel, pseudo-polynomial-time gradient-based algorithm.
Our results on HDSL under A-sparsity can be applied to the analysis of two important learning problems that are currently less understood: (i) the nonsmooth HDSL problems, where the empirical risk functions are not necessarily differentiable; and (ii) an NN with a flexible choice of the network architectures. We show that for both problems, the incorporation of the FCP regularization can ensure the generalization performance, as measured by the excess risk, to be insensitive to the increase of the dimensionality. Particularly, our results indicate that, with regularization, an over-parameterized deep NN can be provably generalizable.
Our numerical results are consistent with our theoretical predictions and point to the interesting potential of combining the proposed FCP with some other recent techniques in further enhancing an NN’s performance. For future research, we will extend the results to other regularization schemes. We will also study how our results can be adapted to the analysis of HDSL under the assumption of weak sparsity (Negahban et al. 2012).
Appendices
9 Additional Results on the Neural Networks
This section of the electronic companion is focused on the generalizability of the neural networks (NN) in binary classification. The problem settings of this classification problem follow Section 6.2. Section 9.1 presents a corollary of Theorem 6.7, where quantities like Ω(sA) are made more explicit. Then Section 9.2 presents a suboptimality-independent generalization error bound for a ReLU-NN.
9.1 Generalizability of NNs under additional regularities.
This subsection presents a corollary of Theorem 6.7 under some additional assumptions on the separating function g, activation functions, and the network architecture. Below we start by introducing those assumptions.
First, we impose additional regularities on the separating function g following Mhaskar (1996). We let $D^{\mathbf{k}}$ represent the partial derivative with order $\mathbf{k}=(k_1,...,k_d)^{\top}\geq 0$ and $|\mathbf{k}|=k_1+...+k_d$; that is, $D^{\mathbf{k}}g:=\frac{\partial^{|\mathbf{k}|}g}{\partial x_1^{k_1}\cdots\partial x_d^{k_d}}$, for a function g. Define that
$F_{d,r}:=\left\{g\in W^{r,\infty}([-1,1]^{d}):\ \|g\|_{W^{r,\infty}([-1,1]^{d})}\leq 1\right\}.$
Here $W^{r,\infty}([-1,1]^{d})$ is the Sobolev space of functions on $[-1,1]^{d}$ with continuous derivatives of order $\mathbf{r}$ for all $\mathbf{r}\in\mathbb{Z}^{d}\cap[0,r]^{d}$, where $\mathbb{Z}$ is the set of integers. Meanwhile,
$\|g\|_{W^{r,\infty}([-1,1]^{d})}:=\sum_{\mathbf{k}\in\mathbb{Z}^{d}:\,\mathbf{k}\in[0,r]^{d}}\ \operatorname{ess\,sup}_{x\in[-1,1]^{d}}|D^{\mathbf{k}}g(x)|$. By this definition, $F_{d,r}$ is a fairly flexible class of functions. The corollary to be presented subsequently is focused on the cases where the separating function g is an element of $F_{d,r}$. An important special case is where g is a polynomial.
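As a small worked illustration of this function class, consider the one-dimensional case d=1 and r=1 with the hypothetical function g(x)=x/4. The calculation below reads the norm as the sum, over derivative orders k∈{0,1}, of the essential suprema on [−1,1]; both the example function and the choice of d and r are ours.

```latex
\[
\|g\|_{W^{1,\infty}([-1,1])}
  = \operatorname*{ess\,sup}_{x\in[-1,1]} \left|\tfrac{x}{4}\right|
  + \operatorname*{ess\,sup}_{x\in[-1,1]} \left|\tfrac{d}{dx}\,\tfrac{x}{4}\right|
  = \tfrac{1}{4} + \tfrac{1}{4}
  = \tfrac{1}{2} \;\le\; 1,
\qquad\text{so } g \in F_{1,1}.
\]
```

By the same calculation, the rescaled function g(x)=x has norm 2 and falls outside the class, which shows that membership constrains the magnitudes of the function and of its derivatives jointly.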
Second, we make the following assumption on the activation functions also following Mhaskar (1996):
Assumption 9.1
Let the activation function Ψ be infinitely many times continuously differentiable in some open interval in ℜ. Furthermore, $\frac{\partial^{k}\Psi(z)}{\partial z^{k}}\neq 0$ for some z in that interval, for any integer k≥0.
According to Mhaskar (1996), commonly adopted activation functions, such as sigmoid, hyperbolic tangent, Gaussian, and multiquadratics, all obey Assumption 9.1.
Third, for convenience of discussion, we focus on an NN architecture as in Figure 2. In this NN, there are “skip connections” from the input layer to the lth hidden layer, for all l=2,...,D−1. Meanwhile, there are also “skip connections” from the lth hidden layer, for all l=1,...,D−2, to the output layer. We let D and K be the network depth and the number of neurons in each hidden layer, respectively. Without loss of generality, we assume that all hidden layers have the same number of neurons, and all hidden neurons adopt the same activation function Ψ. We also assume that the output layer involves no nonlinear transformation. The output of this NN, given input x and fitting parameters β=vec((Wl−1,l),(bl−1,l),(wl,D),(bl,D),(W0,l),(b0,l))∈ℜp, can be captured by the nonlinear system below, where fNN,l:ℜd×ℜp→ℜK is the output from the lth layer.
[TABLE]
With the foregoing settings, below is our result on the NN’s generalization error.
Corollary 9.1
Let g∈Fd,r.
Consider a deep neural network FNN defined as in (44)-(46). Suppose that Assumptions 6.2, 6.2, and 9.1 hold.
Let β∈ℜp be any random vector such that ∥β∥∞≤21v−1⋅RΩ⋅lnn and the S3ONC(X,y) holds at β almost surely. For a fixed Γ≥0, assume that Tn,λ(β)−infβTn,λ(β)≤Γ, w.p.1. Let C7>0 be a universal constant and CNN>0 be some constant that depends only on d and r. If a<21⋅exp{−UNN⋅D⋅ln[p⋅v−1⋅UNN⋅RΩ⋅lnn]}, λ:=c⋅a⋅n2/38σ[ln(2v3e⋅RΩpn4/3)+UNN⋅D⋅ln(UNN⋅(1+npRΩv−1))], and
[TABLE]
then it holds that
[TABLE]
with probability at least 1−C7pexp(−C7n)−C7exp(−C7n1/3).
We attain the poly-logarithmic sample complexity again in this corollary. Similar to Theorem 6.7, the generalization error bound in (48) is strictly monotone in the suboptimality gap Γ.
•
If g is a polynomial function, which is infinitely many times differentiable, and if the network is over-parameterized with n≤(KD)3, then we may as well let d=r and obtain from (48) that
[TABLE]
with overwhelming probability.
•
By a closer examination, Corollary 9.1 is obtained by explicating the misspecification error Ω(⋅) in Theorem 6.7. In doing so, we reduce the NN defined as in (44)-(46) to a one-hidden-layer subnetwork with (K⋅D)-many hidden neurons by assigning 0 to all the connection weights between any pair of hidden layers. We can then use the existing upper bounds on the misspecification error of a one-hidden-layer NN, such as the results by Mhaskar (1996), to provide a (conservative) estimate of Ω(⋅). We conjecture that the same argument can be extended to many other NN architectures, given that they can represent a one-hidden-layer subnetwork with (K⋅D)-many hidden neurons. Here, we say that one NN (denoted by FNN,1) can be represented by another NN (denoted by FNN,2), if it holds that, for any β1 and almost every x∈X, FNN,1(x,β1)=FNN,2(x,β2) for some β2. Because many NN architectures entail strong representability, we think that Corollary 9.1 can be used to understand a broader spectrum of NN-based models.
9.2 A suboptimality-independent generalization bound at tractable local solutions.
This subsection presents a result on the generalizability of a ReLU-NN at a pseudo-polynomial-time computable solution. Different from the above, the error bound herein is independent of the suboptimality gap Γ. This is possible under the following assumption on the data generation process.
Assumption 9.2
There exists a constant v∈(0,1) and
[TABLE]
where P(u) is the density of a standard Gaussian vector, such that y⋅g(x)≥v for all (x,y)∈supp(D).
Assumption 9.2 follows Assumption 4.10 by Cao and Gu (2020) and Assumption A.1 by Cao and Gu (2019) in their analysis on the generalization performance of the ReLU-NNs trained with a stochastic gradient descent (SGD) algorithm. The same assumption is also equivalent to the condition discussed by Rahimi and Recht (2009), for some choices of parameters, in analyzing a one-hidden-layer NN. According to Cao and Gu (2020), Assumption 9.2 holds for all the functions representable by an infinite-width one-hidden-layer ReLU-NN with rapidly decaying second-layer weights (faster than P(u)). Because of the strong representability of an infinite-width ReLU-NN, we think that the set of functions defined in Assumption 9.2 is reasonably flexible.
Though our results can be adapted to facilitate the analysis of a more flexible class of NN architectures, we focus on a ReLU-NN architecture FNN:X×ℜp→ℜ that is in accordance with the following system, given fitting parameters β=vec((Wl−1,l:2≤l≤D−1),(bl−1,l:2≤l≤D−1),wD−1,D,w1,D,bD−1,D,W0,1,b0,1)∈ℜp:
[TABLE]
where we let Ψ(z):=max{0,z} be the ReLU activation function. The system in (50)-(52) captures a fully-connected D-layer NN (with D−1 hidden layers), where the first hidden layer is connected with the output layer directly through “skip connections”. We assume that there are K-many neurons in every hidden layer.
In order to effectively train the above ReLU-NN, we propose the following initialization scheme (Algorithm 2) modified from the Weighted Sums of Random Kitchen Sinks (WSRKS) fitting procedure by Rahimi and Recht (2009) for training shallow networks.
Algorithm 2. A tractable initialization scheme
Step 0.
Specify an integer K∗:1≤K∗≤K. Consider a subnetwork in Figure 4 (where the subnetwork is highlighted in red) of the complete ReLU-NN (50)-(52). Denote this subnetwork by FNNsub:X×ℜp→ℜ, which writes as FNNsub(x,(W0,1,w1,D)):=w1,D⊤Ψ(W0,1x). Here, we let W0,1=(ω0,1,k,ι:k=1,...,K∗,ι=1,...,d)∈ℜK∗×d and w1,D=(ω1,D,k:k=1,...,K∗)∈ℜK∗.
Step 1.
Generate each entry of W0,1initial=((w0,l,kinitial)⊤:k=1,...,K∗), independently, from a standard normal distribution N(0,1).
Step 2.
Compute w1,Dinitial=(w1,D,kinitial:k=1,...,K∗) by solving the following (convex) optimization problem, where all the entries of W0,1initial are fixed to be the values from Step 1:
[TABLE]
Step 3.
Let βinitial∈ℜp be a vector of fitting parameters. Set the components of βinitial that correspond to the subnetwork to be vec(W0,1initial,w1,Dinitial). Let all other components of βinitial be zero.
Step 4.
Output βinitial.
Algorithm 2 essentially trains the subnetwork constructed in Step 0 of Algorithm 2 with the WSRKS fitting procedure. Meanwhile, all the fitting parameters outside the subnetwork are set to be zero. Subsequent to this initialization scheme, we may then invoke Algorithm 1 to generate the desired solution to the FCP-regularized training formulation in (37).
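The structure of this initialization can be sketched as follows. The convex problem in Step 2 is not reproduced in this text, so the sketch substitutes a ridge-regularized least-squares fit purely as a placeholder for it; the function names, the `ridge` parameter, and the layout of the full parameter vector (subnetwork entries first) are all hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_subnetwork(X, y, K_star, ridge=1e-3, seed=0):
    """Random first layer plus a convex fit of the output weights (placeholder objective)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: draw every entry of the first-layer weights from N(0, 1)
    W01 = rng.standard_normal((K_star, d))
    # Step 2 (stand-in): with W01 fixed, fit the output weights by ridge least squares.
    # This is NOT the paper's problem (53); it only illustrates that the step is convex.
    H = relu(X @ W01.T)
    A = H.T @ H / n + ridge * np.eye(K_star)
    w1D = np.linalg.solve(A, H.T @ y / n)
    return W01, w1D

def subnetwork_output(X, W01, w1D):
    """Evaluate the subnetwork w' relu(W x) on each row of X."""
    return relu(X @ W01.T) @ w1D

def embed_in_full_vector(W01, w1D, p):
    """Step 3: place the subnetwork weights in a length-p vector, zeros elsewhere."""
    beta_initial = np.zeros(p)
    sub = np.concatenate([W01.ravel(), w1D])
    beta_initial[: sub.size] = sub  # hypothetical layout of the parameter vector
    return beta_initial
```

The returned vector plays the role of βinitial; only the subnetwork coordinates are nonzero, which matches the description that all parameters outside the subnetwork are set to zero.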
A subtlety arises when applying Algorithm 1 to the ReLU-NN. The ReLU activation function Ψ(z):=max{0,z} is nonsmooth. As a result, the empirical risk function is not everywhere differentiable in general. A common approach in the literature (e.g., Berner et al. (2019)) to avoid this irregularity is to consider a modified first derivative of Ψ defined as $\frac{\partial\Psi(z)}{\partial z}:=\mathbb{1}(z>0)$. By this definition, a chain rule is preserved as per Berner et al. (2019). Correspondingly, the (modified) gradient can be calculated with the detailed formula provided in Section 11. We adopt this modification in Algorithm 1. Despite the use of these modifications, we show that the combination of Algorithms 1 and 2 can lead to a generalizable ReLU-NN within pseudo-polynomial time, and the resulting sample complexity is poly-logarithmic in p. Furthermore, the generalization error is independent of Γ, the suboptimality gap.
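The modified derivative and its use in a chain rule can be illustrated with a single ReLU unit. This is a minimal sketch with hypothetical function names; the full gradient formula of the paper is in Section 11 and is not reproduced here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_modified_grad(z):
    """Modified first derivative 1(z > 0); the kink at z = 0 is assigned the value 0."""
    return (np.asarray(z) > 0).astype(float)

def grad_single_unit(w, x):
    """Chain rule for f(w) = relu(w . x): gradient is 1(w . x > 0) * x."""
    return relu_modified_grad(w @ x) * x

# Example: an active unit passes x through, an inactive one gives zeros
g_active = grad_single_unit(np.array([1.0, -1.0]), np.array([2.0, 1.0]))
g_inactive = grad_single_unit(np.array([0.0, 0.0]), np.array([2.0, 1.0]))
```

Wherever the pre-activation is nonzero, this quantity coincides with the ordinary derivative, so the modification only matters on the measure-zero set of kinks.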
Theorem 9.4 below shows the promised suboptimality-independent generalization error bound. Note that this theorem adopts the following settings and hyper-parameters:
[TABLE]
where K∗ is defined in Algorithm 2 and (a,λ) are tuning parameters of the FCP. For invoking Algorithm 1 in training the ReLU-NN, we let f(⋅):=n−1∑i=1nF(yi⋅FNN(xi,⋅)) and ∇f(⋅):=n−1∑i=1n∇βF(yi⋅FNN(xi,⋅)) with ∇βF(yi⋅FNN(xi,⋅)) defined in Section 11. Finally, it is worth noting that the output of Algorithm 1 can be understood as a deterministic (and implicit) function of its initial solution β0 and training data (X,y). When β0, X, and y are random, the algorithm’s output is also a random vector.
Theorem 9.4
Consider the ReLU-NN in (50)-(52) with K≥max{2,d,10n1/3⋅(lnn)5/3+1}. Suppose that Assumption 9.2 holds and that β∈ℜp with ∥β∥∞≤R for some R≥n is the output of Algorithm 1 when it terminates as per the stopping criterion in (26). Given hyper-parameters as in (54), the following statements hold.
(a)
For any initial solution β0∈ℜn and training data (X,y), Algorithm 1 terminates at the k∗(β0,X,y)-th iteration, for some integer k∗(β0,X,y)<(⌈2M⋅γopt2Tn,λ(βinitial)⌉+1).
(b)
Further assume that the initial solution of Algorithm 1 is the output of Algorithm 2; that is, β0:=βinitial. At the termination of Algorithm 1, there exists a universal constant C8>0 such that, if
[TABLE]
then, with probability at least 1−C8⋅pexp(−C8n1/3)−C8⋅n1/3dexp(−n2/2)−C8⋅(d⋅n)−d/3, the generalization error of the trained ReLU-NN is bounded by
In this theorem, the generalization error bound, as measured in terms of the expected 0-1 loss, is no longer dependent on the suboptimality gap Γ, yet the promised poly-logarithmic sample complexity is maintained; the sample size should grow only poly-logarithmically to compensate for the growth in p. In addition, the dependence on the number of layers D is polynomial. In contrast to the literature, we argue that our result here may provide a significantly better rate in terms of both p and D, especially when considering that the training algorithm to ensure the desired sample complexity is provably in pseudo-polynomial time as per the remark below.
Remark 9.7
The combination of Algorithms 1 and 2 in Theorem 9.4 yields a pseudo-polynomial-time complexity.
•
In the initialization step, Algorithm 2 is a polynomial-time algorithm. The main computational effort is on solving (53), which is convex and thus in polynomial time. (Note that an approximate solution to (53) with a suboptimality gap of O(n1/3d⋅D⋅lnp) would actually suffice for deriving the same sample complexity as in Theorem 9.4.)
•
Subsequent to Algorithm 2, Algorithm 1 computes a solution that entails the desired sample complexity. The iteration complexity of Algorithm 1, as proven in Part (a) of Theorem 9.4, is polynomial in both the dimensionality and the numeric values of the problem data. Thus, Algorithm 1 yields a pseudo-polynomial-time complexity.
With the above, we know that the total computational effort of the combined algorithm is in pseudo-polynomial time.
Remark 9.8
The proof of Theorem 9.4 does not depend on how the gradient is defined or modified. Nonetheless, there is some benefit of using the “modified gradient” as in Section 11, as discussed in Remark 9.9 below.
Remark 9.9
By a closer examination of the proof, one may notice that Algorithm 2 (invoked for initialization) alone is already capable of identifying a solution with provable generalizability. Nonetheless, as per (56), Algorithm 1 sharpens the generalization error; the more iterations Algorithm 1 runs for, the sharper the performance of the trained NN. A natural question would be whether the initial solution identified by Algorithm 2 would render the stopping criterion in (26) satisfied at the first iteration of Algorithm 1. If so, k∗(βinitial,X,y)=0 and Algorithm 1 would not be effective.
We think this is a possible scenario for some problem instances. However, because $\frac{1}{n}\sum_{i=1}^{n}F(y_i\cdot F_{NN}(x_i,\cdot))$ is a piecewise smooth function and Algorithm 2 trains only a small subset of the fitting parameters, it is more likely that the initial solution generated by Algorithm 2 is a non-KKT point within a continuously differentiable neighborhood. In such a case, the “modified gradient” as in Section 11 becomes the exact formulation of the gradient. One may then show that k∗(βinitial,X,y)>0 must hold, if M is properly large and greater than the Lipschitz constant of the gradient of $\frac{1}{n}\sum_{i=1}^{n}F(y_i\cdot F_{NN}(x_i,\cdot))$ for every β in that neighborhood.
Remark 9.10
The results of Theorem 9.4 are obtained via an argument similar to that used in proving Theorem 6.7, except that the misspecification error Ω(⋅) and the suboptimality gap Γ in Theorem 6.7 are now explicated in Theorem 9.4 under the specific assumptions made on the neural network and the data generating process. To make explicit both Ω(⋅) and Γ, our proofs are largely focused on analyzing the subnetwork constructed in Step 0 of Algorithm 2 and illustrated in Figure 4. The misspecification error of this subnetwork serves as a conservative estimate of Ω(⋅), and the suboptimality gap obtained after training this subnetwork becomes an overestimate of the initial suboptimality gap to bound Γ. We conjecture that the above argument can be extended to any NN architecture that contains, or can represent, the above subnetwork. Such NN architectures include the conventional ReLU networks and the residual networks with ReLU activation, among others.
10 Additional Numerical Experiments
This part of the electronic companion presents some additional numerical experiments. Sections 10.1 and 10.2 below are focused on a high-dimensional SVM and a ReLU-NN, respectively.
10.1 Experiments on high-dimensional SVM
This section presents our experiments on high-dimensional SVM, whose training formulation entails a nonsmooth statistical loss function. For each experimental instance, a training set and a test set were randomly generated in the two cases below. (a) The first case involved data with less correlated design. With the same notations as in (33), let $x_1,x_2,\ldots,x_n$ be i.i.d. samples of $\mathcal{N}_p(0,\Sigma)$ with $\Sigma=(\varsigma_{j_1,j_2})$ and $\varsigma_{j_1,j_2}=0.3^{|j_1-j_2|}$. Let the class labels of the samples, $y_i$, $i=1,\ldots,n$, be determined by
$$y_i=+1\ \text{if}\ x_i^{\top}\beta^{*}+\omega_i\ge 0,\ \text{and}\ y_i=-1\ \text{otherwise}.$$
Here, $\omega_1,\omega_2,\ldots,\omega_n$ are i.i.d. standard normal random variables and $\boldsymbol{\beta}^{*}=(3,\,5,\,0,\,0,\,1.5,\underbrace{0,\,\ldots,0}_{(p-5)\text{-many 0's}})^{\top}$. We let $n=100$ for both the training and test sets. (b) In the second case, data with more correlated design were generated. The same approach as in the first case was followed, except that $\Sigma=(\varsigma_{j_1,j_2})$ was simulated differently. We first calculated $\varsigma_{j_1,j_2}=0.3^{|j_1-j_2|}$ and then shrank all the singular values of $\Sigma$ below the 80th percentile to 0.01 times their original values.
Linear classifiers were trained on the training data via the schemes explained below. Their performance was measured by the out-of-sample classification error on the test data, calculated as $\frac{\text{Number of wrongly classified observations}}{\text{Total number of observations}}\times 100\%$.
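The data generation of case (a) and the error metric above can be sketched as follows. This is a minimal illustration rather than the authors' code: it draws each row through an AR(1) recursion, which reproduces the covariance $0.3^{|j_1-j_2|}$ without forming $\Sigma$, and it assumes a classifier of the form $\mathrm{sign}(x^{\top}\beta)$ without an intercept. The function names are hypothetical, and case (b), which requires a singular value decomposition, is not covered.

```python
import math
import random

def sample_design(n, p, corr=0.3, rng=random):
    """Draw n rows of N_p(0, Sigma) with Sigma[j1][j2] = corr**|j1 - j2|."""
    scale = math.sqrt(1.0 - corr * corr)
    rows = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            # AR(1) step: keeps unit variance and yields Cov = corr**lag.
            x.append(corr * x[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(x)
    return rows

def make_labels(X, beta_star, rng=random):
    """y_i = +1 if x_i' beta* + omega_i >= 0, else -1, with omega_i ~ N(0, 1)."""
    labels = []
    for x in X:
        score = sum(b * v for b, v in zip(beta_star, x)) + rng.gauss(0.0, 1.0)
        labels.append(1 if score >= 0 else -1)
    return labels

def classification_error(beta, X, y):
    """Percentage of observations misclassified by sign(x' beta)."""
    wrong = 0
    for x, label in zip(X, y):
        pred = 1 if sum(b * v for b, v in zip(beta, x)) >= 0 else -1
        wrong += int(pred != label)
    return 100.0 * wrong / len(y)

rng = random.Random(0)
p = 100
beta_star = [3.0, 5.0, 0.0, 0.0, 1.5] + [0.0] * (p - 5)
X_train = sample_design(100, p, rng=rng)
y_train = make_labels(X_train, beta_star, rng=rng)
print(classification_error(beta_star, X_train, y_train))
```

Because the labels include the noise term $\omega_i$, even the true $\beta^{*}$ generally has a nonzero error on such a sample.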
Our numerical comparisons involved the following schemes:
(i) SVM: The canonical SVM in (33) with $\rho=0$.
(ii) SVM-$\ell_2$: The SVM with $\ell_2$ regularization, that is, the estimator generated by solving (33) with $\rho>0$.
(iii) SVM-$\ell_1$: The SVM variant with $\ell_1$ regularization, that is, the estimator generated by solving (35) with $\rho=0$ and $P_\lambda(|\cdot|)=\lambda|\cdot|$.
(iv) SVM-FCP: The SVM variant with the proposed FCP-based regularization, that is, the estimator generated by solving for an S$^3$ONC solution via Algorithm 1 to Problem (34) with $\rho=0$.
Note that Algorithm 1 in (iv) was initialized with solutions generated by the SVM-$\ell_1$. The hyper-parameters of Algorithm 1 were specified as $\gamma_{opt}=10^{-5}$ and $M=3.5\ge n^{1/4}$. The SVM, the SVM-$\ell_2$, and the SVM-$\ell_1$ were all solved by calling Mosek (ApS 2015) through CVX (Grant and Boyd 2013, 2008).
In determining the hyper-parameters, namely, $\rho$ in the SVM-$\ell_2$, $\lambda$ in the SVM-$\ell_1$, and $\lambda$ in the SVM-FCP (where we fixed the value of $a$, the other tuning parameter of the FCP, to be 0.3), three training sets with $p\in\{100,500,1000\}$ and $n=100$ were generated as per the above data generation process in the first case (with less correlated design). On these data sets, the SVM-$\ell_2$, the SVM-$\ell_1$, and the SVM-FCP models were then trained for fixed hyper-parameters, $\lambda$ or $\rho$, chosen from $\{0.05,0.1,0.15,0.20,\ldots,0.4\}$. The trained SVM variants were then evaluated in terms of their classification errors on three validation sets, one for each value of $p\in\{100,500,1000\}$. These validation sets were generated with the same sample sizes and probability distributions as the three training data sets above. From the pool of candidate values for $\lambda$ and $\rho$, the best ones were chosen as those minimizing the average classification error on the validation sets over the three cases $p=100,500,1000$. It turned out that $\lambda=0.25$ for both the SVM-FCP and the SVM-$\ell_1$, and $\rho=0.1$ for the SVM-$\ell_2$.
In testing the impact of dimensionality on the out-of-sample performance of all four SVM variants, $p$ was increased gradually with values chosen from $\{100,200,\ldots,1000\}$. For each choice of dimensionality, 100 random replications were conducted. The performance of each SVM variant is reported in Tables 4 and 5, where we compare the averages and standard errors of the out-of-sample classification errors for the cases with lower and higher correlations in the design, respectively. From both tables, one can see that the classification errors generated by the proposed SVM-FCP were noticeably lower than those of all the alternative approaches involved in this test. A representation of the comparisons is provided in the two subplots of Figure 3, where the center and radius of each error bar are the average classification error and 1.96 times the corresponding standard error, respectively, from the 100 replications. This figure shows that the SVM-FCP consistently outperformed the other three SVM variants involved in the test.
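The error bars described above (center equal to the average error, radius equal to 1.96 times the standard error) can be computed as in the following sketch. It assumes that the standard error is the sample standard deviation (with the $n-1$ denominator) divided by the square root of the number of replications; the source does not state which convention was used, and the function name is hypothetical.

```python
import math
import statistics

def error_bar(errors, z=1.96):
    """Return (center, radius) of the error bar for replicated errors."""
    center = statistics.fmean(errors)
    # Assumed convention: sample standard deviation over sqrt(replications).
    standard_error = statistics.stdev(errors) / math.sqrt(len(errors))
    return center, z * standard_error

center, radius = error_bar([10.0, 12.0, 14.0])
print(center, radius)
```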
10.2 Numerical Experiments on ReLU-NN in Binary classification
This subsection presents our numerical tests on the efficacy of the FCP-based regularization on a ReLU-NN. A training set, a validation set, and a test set were generated as below. (A) Training set: 2000 data points were first generated in line with Assumption 9.2, where $d=10$ and $C_g(u):=\sin\left(\sum_{\iota=1}^{d}u_\iota\right)/d$ with $u=(u_\iota)$. For the given $C_g$, (the integration involved in defining) the separating function $g(x)$ was evaluated via numerical integration. For each data point with feature values $x_i$, the corresponding (actual) label $y_i$ was set to $+1$ if $g(x_i)\ge 0$, and $-1$ otherwise. Some mislabels were then introduced. Specifically, out of these 2000 data points, a subset was selected as per a Bernoulli distribution; each data point was selected with probability 0.05. All the data points in this subset were assigned the wrong labels (opposite to their actual labels calculated previously). (B) Validation set: Following the same approach as above, we generated another set of 2000 validation data points. (C) Test set: A set of 5000 independent test data points was generated following Assumption 9.2, with the same $d$, $C_g$, and $g$ as above. However, no test data point was mislabeled.
We followed (50)-(52) in constructing the architecture of a D-layer ReLU-NN model,
where the width K (i.e., the number of hidden neurons per hidden layer) was identical across all the hidden layers.
We employed Algorithm 1, initialized by Algorithm 2, in training the FCP-regularized ReLU-NN formulated in Eq. (37). In choosing the hyper-parameters, we set $a=0.5$ and λ=Cfcp⋅D⋅lnK. Here, $C_{fcp}=0.001$ was determined through a process to be detailed subsequently. For Algorithm 1, we let $\gamma_{opt}=10^{-6}$ and $M=1$ (such that $a<\frac{1}{M}$). For Algorithm 2, K∗=⌈10n1/3⋅(lnn)5/3⌉ as per Theorem 9.4.
To determine $C_{fcp}$, three ReLU-NN architectures with 10, 50, and 100 hidden layers and $K=150$ were trained with the combination of Algorithms 1 and 2, with $C_{fcp}$ fixed at each of the candidate values from the set $\{0.0001,0.0005,0.001,0.005,0.01,0.05,0.1\}$. The performance of these trained ReLU-NNs was evaluated on the validation set in terms of the classification errors. Then, for each candidate value of $C_{fcp}$, the average classification error over the three NN architectures above was calculated. The value of $C_{fcp}$ was chosen to be the one that led to the best average performance. It turned out that $C_{fcp}=0.001$.
Involved as a benchmark in the experiment was the ReLU-NN model generated by solving the conventional training formulation given as
[TABLE]
In computing a solution to this problem, we employed an SGD algorithm based on Cao and Gu (2019), who have shown the generalizability of the ReLU-NNs trained by an SGD in spite of the nonconvexity of the formulation. The SGD in our experiment was integrated with a three-step multi-start strategy: In Step 1, we repeated, five times, the training of the same ReLU-NN using the conventional SGD with the He initialization (He et al. 2015). Because both the He initialization and the SGD are stochastic, five potentially different local solutions could be generated by Step 1. In Step 2, we trained the ReLU-NN using the conventional SGD again, but the initial point was specified as the output of Algorithm 2. Finally, in Step 3, we compared all the solutions from Steps 1 and 2 and chose the solution with the smallest objective value (in terms of (57)) as the output of this multi-start strategy. While there could be different strategies in the literature to boost the performance of the SGD, such as a wise determination of the batch size, the momentum, and the learning rate (i.e., the step size), we did not employ those strategies; our purpose was to compare the non-regularized ReLU-NN formulation in (57) with the proposed FCP-regularized ReLU-NN. Thus, given that the SGD optimized the problem in (57) well globally, the performance of the resulting solutions was considered to represent the efficacy of the non-regularized ReLU-NN well. Indeed, in evaluating the optimization quality of the SGD, we found that the average, maximal, and minimal objective function values out of all the numerical instances were 0.0013, 0.0052, and 0.0000, respectively. (In contrast, the average initial objective value of all the SGD runs in this experiment was 19.3101.) In view of the fact that $\inf_u F(u)\ge 0$, we claim that the global optimal solutions to (57) were well approximated, if not always achieved, by the above SGD scheme.
Our numerical results are presented in Figure 5. Some discussions on this figure are as below.
(i)
Subplot (a) of Figure 5 reports the out-of-sample classification errors of the FCP-regularized ReLU-NN and the non-regularized ReLU-NN (referred to as the NN-FCP and the NN, respectively, in the figure) when the width was fixed at K=150, and the number of hidden layers was chosen from a pool of candidate values {10,20,...,150}. For each combination of width and depth, we replicated the experiment for ten times. The center and the radius of each error bar in the plot are the average classification error and 1.96 times the corresponding standard error out of the ten replications. One can see that the performance of the FCP-regularized ReLU-NN was significantly better than the non-regularized ReLU-NN. Meanwhile, the performance of the former was insensitive to the growth in the depth of the network. This pattern was consistent with Theorem 9.4, but it also may have identified room for further improvement in terms of the dependence on D, at least for some regions of the hyper-parameters.
(ii)
Subplot (b) of Figure 5 shows the out-of-sample classification errors of the FCP-regularized ReLU-NN and the non-regularized ReLU-NN when the number of hidden layers was fixed to be two and the width of the hidden layers was set to be K∈{150,200,250,300,350,400,450,500,750,1000,1500}. Note that the number of fitting parameters p is polynomial in K. In order to show the dependence of the generalization performance on lnp, the X-axis of Subplot (b) is on lnK. We can see from this subplot that the performance of the FCP-regularized ReLU-NN remained almost constant as lnK increased. In contrast, the non-regularized ReLU-NN deteriorated significantly when lnK became larger.
(iii)
To show how well the FCP-regularized ReLU-NN training formulation was optimized in our experiments through the combination of Algorithms 1 and 2, we present in Subplot (c) of Figure 5 a test on the ReLU-NN with 100 hidden layers and 150 neurons per hidden layer, the largest network among all the ReLU-NNs involved in (i) and (ii) above.
For this model, we generated 5000 random solutions to (37) and compared their objective function values (in terms of (37)) with that of the solution $\beta\in\Re^p$ computed by combining Algorithms 1 and 2 as above. The $m$th (for all $m\in\{1,\ldots,5000\}$) random solution was generated as per the following two-step process. Step 1. We generated a random vector $v_m^1:=\beta+\nu_m$, where $\nu_m\in\Re^p$ was a random sample of a centered Gaussian random vector with i.i.d. entries. The covariance matrix of each $\nu_m$ was prescribed to be $\mathrm{mod}(m,25)\cdot R_m\cdot I$, where $\mathrm{mod}(m,25)$ is the remainder of the Euclidean division of $m$ by 25, $R_m$ denotes a uniformly distributed random number on $(0,1)$, and $I$ stands for the identity matrix. Step 2. For all $m=1,\ldots,5000$, we invoked Algorithm 1 to generate a new solution $v_m^2\in\Re^p$ using $v_m^1$ as the initial point. Here, Algorithm 1 was terminated whenever either the stopping criterion in (26) was met ($\gamma_{opt}=10^{-6}$ and $M=1$) or a maximal iteration number of 15 was reached. If any of these random solutions entailed a smaller objective value (w.r.t. the objective function in (37)) than $\beta$, then $\beta$ would not be the global minimizer. A blue point in Subplot (c) of Figure 5 represents one of those random solutions. The corresponding Y-axis value of that point indicates the difference between the objective values of $v_m^2$ and $\beta$. One may observe from the plot that, for all $m=1,\ldots,5000$, the gaps in the objective were always above zero. This indicates that $\beta$ approximated well, if not coincided with, a globally minimal solution to (37).
(iv)
In Subplot (d) of Figure 5, we reorganized data from (iii) above to show the correspondence between the in-sample training errors and the out-of-sample errors. More specifically, we sorted the random solutions vm2 in the ascending order of their objective values (w.r.t. the objective function in (37)) and showed in this subplot the corresponding out-of-sample classification errors of those solutions. In the subplot, each blue “+” represents one of the random solutions vm2. The X- and Y-axis values at the center of each “+” are the corresponding objective function value and the out-of-sample error, respectively. One may observe that these “+”s tend to cluster around an affine function.
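Step 1 of the random-solution test in item (iii) can be sketched as follows. The sketch only covers the perturbation; Step 2 (running Algorithm 1 from each perturbed point) is not reproduced here. The function name is hypothetical, and `random.Random.random` samples from $[0,1)$ rather than the open interval $(0,1)$ stated in the text, which is assumed to be immaterial.

```python
import math
import random

def perturb(beta, m, rng):
    """Return v_m^1 = beta + nu_m with Cov(nu_m) = mod(m, 25) * R_m * I."""
    r_m = rng.random()                 # R_m, uniform on [0, 1)
    variance = (m % 25) * r_m          # common variance of the i.i.d. entries
    std = math.sqrt(variance)
    return [b + rng.gauss(0.0, std) for b in beta]

rng = random.Random(0)
beta = [0.5, -1.0, 0.0]
starts = [perturb(beta, m, rng) for m in range(1, 51)]
print(starts[0])
```

Note that whenever $m$ is a multiple of 25 the prescribed covariance is zero, so the corresponding starting point coincides with $\beta$ itself.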
Finally, it is worth noting that Algorithm 1 (which was initialized by Algorithm 2) always ran for more than one iteration in all the test instances. If we combine this observation with Remark 9.9 about Theorem 9.4, we then know that k∗(βinitial,X,y)>0 (where k∗(βinitial,X,y) is defined as in Theorem 9.4) and, hence, Algorithm 1 was indeed effective in our test.
11 The “modified gradient” of the ReLU-NN
In using Algorithm 1 to train the ReLU-NN under consideration, we follow the commonly adopted definition (e.g., by Berner et al. (2019)) of the (modified) gradient, denoted by $\nabla_\beta F(yF_{NN}(x,\beta))$, of the training formulation. In this definition, we let $H(v):=\mathrm{diag}(\mathbb{1}(v_1>0),\mathbb{1}(v_2>0),\ldots)$ for any vector $v=(v_1,v_2,\ldots)^{\top}$. More specifically, we let $\nabla_\beta F(yF_{NN}(x,\beta)):=\left.\frac{dF(t)}{dt}\right|_{t=y\cdot F_{NN}(x,\beta)}\cdot y\cdot\frac{dF_{NN}(x,\beta)}{d\beta}$, where the formulas for the components of $\frac{dF_{NN}(x,\beta)}{d\beta}=\left(\frac{\partial F_{NN}(x,\beta)}{\partial\beta_j}:j=1,\ldots,p\right)$ are given below:
[TABLE]
Meanwhile,
[TABLE]
[TABLE]
[TABLE]
[TABLE]
The above calculation can be conducted via back-propagation. The function $F(yF_{NN}(x,\cdot))$ is piecewise continuously differentiable. At points where the gradient is well-defined, the above calculation equals the gradient exactly.
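The idea of the modified gradient can be illustrated on a toy network. The sketch below is not the architecture in (50)-(52); it assumes a hypothetical one-hidden-layer network $\sum_k w_{2,k}\max(0,\langle W_{1,k},x\rangle)$ and assumes the logistic loss $F(t)=\ln(1+e^{-t})$. The indicator list `active` plays the role of the diagonal of $H(\cdot)$, so the derivative of the ReLU at a kink is taken to be zero.

```python
import math

def relu_net(x, W1, w2):
    """Hypothetical one-hidden-layer ReLU network; returns output and pre-activations."""
    pre = [sum(w * v for w, v in zip(row, x)) for row in W1]
    out = sum(c * max(0.0, z) for c, z in zip(w2, pre))
    return out, pre

def logistic_loss(t):
    """Assumed loss F(t) = ln(1 + exp(-t))."""
    return math.log1p(math.exp(-t))

def modified_gradient(x, y, W1, w2):
    """Chain rule with H(v) = diag(1(v > 0)) standing in for the ReLU derivative."""
    out, pre = relu_net(x, W1, w2)
    dF_dt = -1.0 / (1.0 + math.exp(y * out))          # F'(t) at t = y * out
    active = [1.0 if z > 0 else 0.0 for z in pre]     # diagonal of H(pre)
    grad_W1 = [[dF_dt * y * c * h * v for v in x] for c, h in zip(w2, active)]
    grad_w2 = [dF_dt * y * h * z for h, z in zip(active, pre)]
    return grad_W1, grad_w2

x, y = [1.0, -2.0], 1
W1 = [[0.5, 0.1], [-0.3, 0.2]]
w2 = [1.0, -1.5]
grad_W1, grad_w2 = modified_gradient(x, y, W1, w2)
print(grad_W1, grad_w2)
```

Away from the kinks of the ReLU, this quantity agrees with the ordinary gradient, which can be checked against a finite difference.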
12 The Applicability of Theorem 6.2 to the high-dimensional SVM
This section discusses how Theorem 6.2 can be used to analyze the generalization performance of SVM. In particular, we determine here the proper values of R, σ, σL, and Cμ in the instantiation of Assumptions 6.1 and 6.1. We start by introducing a few short-hand notations. Let X=(xi⊤:i=1,...,n), y=(yi),
[TABLE]
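We first determine $R$ for the case of SVM. Observe that $\inf_\beta E[L_{ns}(\beta,Z_i)]\le E[L_{ns}(0,Z_i)]=1$ and $\beta^{*}\in\arg\inf_\beta E[L_{ns}(\beta,Z_i)]$. Recall that we have let $\rho=0.01$. Therefore, $\rho\|\beta^{*}\|^2\le 1\Longrightarrow\|\beta^{*}\|\le 10$. If $\beta^{*}$ is dense and entails A-sparsity (in Assumption 1), there must exist a sparse $\beta^{*}_{\varepsilon_A}\in[-10,10]^p$ that approximates $\beta^{*}$ in the sense of Assumption 1 by the continuity of $E[L_{ns}(\cdot,Z_i)]$.
Meanwhile, one may also observe that any solution $\beta$ as defined in Part (b) of Theorem 6.2 (where $L_{n,\delta}(\beta,\mathbf{Z}_1^n)$ and $L_{n,\delta,\lambda}(\beta,\mathbf{Z}_1^n)$ in that theorem become $L^{SVM}_{n,\delta}(\beta,(X,y))$ and $L^{SVM}_{n,\delta,\lambda}(\beta,(X,y))$, respectively, in the case of SVM) must satisfy that $\rho\|\beta\|^2-1\le L^{SVM}_{n,\delta,\lambda}(\beta,(X,y))\le L^{SVM}_{n,\delta,\lambda}(\beta^{\ell_1,\delta},(X,y))\le L^{SVM}_{n,\delta}(\beta^{\ell_1,\delta},(X,y))+\lambda|\beta^{\ell_1,\delta}|$, w.p.1., where the last inequality is due to the observation that $P_\lambda(|t|)\le\lambda|t|$ (which is an immediate result of the FCP's definition). Because $\beta^{\ell_1,\delta}$ is the minimizer of the $\ell_1$-regularized problem, we may continue the above as $\rho\|\beta\|^2-1\le L^{SVM}_{n,\delta}(\beta^{\ell_1,\delta},(X,y))+\lambda|\beta^{\ell_1,\delta}|\le L^{SVM}_{n}(\beta^{\ell_1,\delta},(X,y))+\lambda|\beta^{\ell_1,\delta}|\le\left[L_{n}(\beta,(X,y))+\lambda|\beta|\right]_{\beta=0}\le 1$, with probability one. Therefore, $\|\beta\|^2\le 2/\rho=200$, that is, $\|\beta\|\le\sqrt{200}=10\sqrt{2}$, a.s. Thus, $R=10\sqrt{2}$.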
Second, we verify Assumption 6.1 and determine $\sigma$. With probability one, it holds simultaneously that
[TABLE]
and
[TABLE]
for all β∈[−R,R]p. Thus, the random variable Ln,δSVM(β,(X,y)) has a bounded support. As an immediate result, Ln,δSVM(β,(X,y)) is subexponential with σ≤O(1).
Third, we verify Assumption 6.1 and determine $\sigma_L$ and $C_\mu$. To that end, we observe that $L^{SVM}_{n,\delta}(\beta,(X,y))$ is verifiably Lipschitz continuous in $\beta$. To see this, note that the gradient of the above function w.r.t. $\beta$ is given as $\nabla_\beta L^{SVM}_{n,\delta}(\beta,(X,y))=2\rho\beta-\frac{1}{n}\sum_{i=1}^{n}u_i^{*}y_ix_i$, where $u_i^{*}$, for $i=1,\ldots,n$, is the maximizer of the (inner) maximization problem $\max_{u_i:\,0\le u_i\le 1}\left\{u_i\cdot(1-y_ix_i^{\top}\beta)-\frac{(u_i-u_0)^2}{2n^{\delta}}\right\}$. The norm of the gradient is bounded from above by $2\rho R\sqrt{p}+1$, almost surely, for all $\beta\in[-R,R]^p$. Thus, Assumption 6.1 holds with $\sigma_L=0$ and $C_\mu=2\rho R\sqrt{p}+1=0.2\sqrt{2p}+1\le O(1)\cdot\sqrt{p}$.
In sum, the FCP-based formulation (34) for the high-dimensional SVM satisfies both Assumptions 6.1 and 6.1 with $R\le O(1)$, $\sigma\le O(1)$, $\sigma_L=0$, and $C_\mu\le O(1)\sqrt{p}$. Because the generalization error bound in (32) is logarithmic in $C_\mu$, Theorem 6.2 can then be applied to show the poly-logarithmic sample complexity for the FCP-regularized SVM. Finally, we would like to remark that a more careful analysis may relax the stipulation on data normalization (such that $|x_i|\le 1$ a.s.) and improve the aforementioned quantities.
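The constants derived in this section can be evaluated numerically as a sanity check. The sketch below reads the bound on the solution norm as $\|\beta\|\le\sqrt{2/\rho}$ and the gradient bound as $2\rho R\sqrt{p}+1$, which relies on the data normalization mentioned above; both readings are assumptions about the garbled source formulas, and the function name is hypothetical.

```python
import math

def svm_constants(rho, p):
    """Return (R, C_mu) under the assumed readings R = sqrt(2/rho), C_mu = 2*rho*R*sqrt(p) + 1."""
    R = math.sqrt(2.0 / rho)                       # from rho * ||beta||^2 - 1 <= 1
    C_mu = 2.0 * rho * R * math.sqrt(p) + 1.0      # bound on the gradient norm
    return R, C_mu

R, C_mu = svm_constants(rho=0.01, p=100)
print(R, C_mu)
```

Since $C_\mu$ enters the error bound only through a logarithm, its growth like $\sqrt{p}$ contributes only a $\ln p$ factor.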
13 Technical proofs
13.1 Proof of sample complexities of HDSL under A-sparsity
The proofs for Propositions 1 through 5 are provided below. The demonstration of Proposition 1 is an immediate result of Proposition 5, which further relies on Propositions 2 through 4.
Proof 13.1
Proof of Proposition 1.
Invoking Proposition 2 under the assumption that $a<U_L^{-1}$, we have that $\mathbb{P}\left[\left\{|\widehat{\beta}_j|\notin(0,\,a\lambda)\ \text{for all}\ j\right\}\right]=1$. This, combined with Proposition 5, yields the desired result. □
Proposition 2
Suppose that $a<U_L^{-1}$. Consider any random vector $\widehat{\beta}\in\Re^p$ such that $\|\widehat{\beta}\|_\infty\le R$ and the S$^3$ONC($\mathbf{Z}_1^n$) is satisfied at $\widehat{\beta}$ almost surely. Then,
[TABLE]
Proof 13.2
Proof.
Since $\widehat{\beta}$ satisfies the S$^3$ONC($\mathbf{Z}_1^n$) almost surely, Eq. (14) implies that, for any $j\in\{1,\ldots,p\}:|\widehat{\beta}_j|\in(0,a\lambda)$, it holds that
$$0\le U_L+P_\lambda''(|\widehat{\beta}_j|)=U_L-\frac{1}{a},$$
where the equality follows from the fact that $\frac{\partial^2P_\lambda(t)}{\partial t^2}=-a^{-1}$ for $t\in(0,a\lambda)$. This contradicts the assumption that $U_L<\frac{1}{a}$. The above contradiction implies that $\mathbb{P}[\{\widehat{\beta}\ \text{satisfies the S}^3\text{ONC}(\mathbf{Z}_1^n)\}\cap\{|\widehat{\beta}_j|\in(0,\,a\lambda)\}]=0\Longrightarrow 0\ge 1-\mathbb{P}[\{\widehat{\beta}\ \text{does not satisfy the S}^3\text{ONC}(\mathbf{Z}_1^n)\}]-\mathbb{P}[\{|\widehat{\beta}_j|\notin(0,\,a\lambda)\}]$. Since $\mathbb{P}[\{\widehat{\beta}\ \text{satisfies the S}^3\text{ONC}(\mathbf{Z}_1^n)\}]=1$, it holds that $\mathbb{P}[\{|\widehat{\beta}_j|\notin(0,a\lambda)\}]=1$ for all $j=1,\ldots,p$, which immediately leads to the desired result.
□
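The properties of the penalty used in this argument can be checked numerically. The sketch below assumes that the FCP takes the minimax concave form $P_\lambda(t)=\int_0^{t}\frac{[a\lambda-s]_+}{a}\,ds$; this form is not restated in this part of the paper, but it is consistent with the facts quoted here, namely $P_\lambda''(t)=-a^{-1}$ on $(0,a\lambda)$ and $P_\lambda(|t|)\le\lambda|t|$. The function names are hypothetical.

```python
def fcp(t, lam, a):
    """Assumed FCP (MCP form): lam*|t| - t^2/(2a) up to |t| = a*lam, constant afterwards."""
    t = abs(t)
    if t <= a * lam:
        return lam * t - t * t / (2.0 * a)
    return a * lam * lam / 2.0

def second_derivative(t, lam, a, h=1e-4):
    """Central finite-difference estimate of the second derivative of the penalty."""
    return (fcp(t + h, lam, a) - 2.0 * fcp(t, lam, a) + fcp(t - h, lam, a)) / (h * h)

lam, a = 0.25, 0.3
print(fcp(a * lam, lam, a))              # value at the knot, a * lam^2 / 2
print(second_derivative(0.03, lam, a))   # approximately -1/a inside (0, a*lam)
```

With curvature $-1/a$ inside $(0,a\lambda)$, any loss whose second-order term is bounded by $U_L<1/a$ yields a strictly negative sum $U_L+P_\lambda''$, which is the contradiction exploited in the proof.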
Proposition 3
Suppose that Assumptions 2 and 2 hold.
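Let $\epsilon\in(0,1]$, $p':p'>s$, $\zeta_1(\epsilon):=\ln\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot p\cdot eR}{\epsilon}\right)$, and $B_{p',R}:=\{\beta\in\Re^p:\|\beta\|_\infty\le R,\ \|\beta\|_0\le p'\}$. Then, for the same c∈(0,0.5] as in (12) and for some universal constant c>0,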
[TABLE]
with probability at least $1-2\exp(-p'\zeta_1(\epsilon))-2\exp(-cn)$.
Proof 13.3
Proof.
We follow the “$\epsilon$-net” argument as discussed by Vershynin (2012) and Shapiro et al. (2014) to construct a net of discretization grids $S(\epsilon):=\{\beta^k\}\subseteq B_{p',R}$ such that for any $\beta\in B_{p',R}$, there is $\beta^k\in S(\epsilon)$ that satisfies $\|\beta^k-\beta\|\le\frac{\epsilon}{2\sigma_L+2C_\mu}$ for any fixed $\epsilon\in(0,1]$.
To that end, we first consider a fixed index set $I\subseteq\{1,\ldots,p\}:|I|=p'$ and an arbitrary $\beta\in B_{p',R}\cap\{\beta\in\Re^p:\beta_j=0,\ \forall j\notin I\}$. To ensure that there always exists $\beta^k\in S(\epsilon)$ such that
[TABLE]
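it is sufficient to have a covering number of no more than
$$\left(\left\lceil\frac{2(\sigma_L+C_\mu)p'R}{\epsilon}\right\rceil\right)^{p'}.$$
Now we consider how to cover all $p'$-dimensional subspaces by enumerating all possible $I\subseteq\{1,\ldots,p\}:|I|=p'$. For each $I$, an $\epsilon$-net with $\left(\left\lceil\frac{2(\sigma_L+C_\mu)Rp'}{\epsilon}\right\rceil\right)^{p'}$-many grids can be constructed to ensure (66), and there are $\binom{p}{p'}$-many possible choices of $I$. Therefore, to guarantee the existence of $\beta^k\in S(\epsilon)$ that satisfies $\|\beta^k-\beta\|\le\frac{\epsilon}{2\sigma_L+2C_\mu}$ for any fixed $\epsilon\in(0,1]$ and $\beta\in B_{p',R}$, it is sufficient to let $|S(\epsilon)|:=\binom{p}{p'}\left(\left\lceil\frac{p'\cdot(2\sigma_L+2C_\mu)R}{\epsilon}\right\rceil\right)^{p'}$.
We notice that $\frac{p'(\sigma_L+C_\mu)R}{\epsilon}\ge 1$ and thus
$$\left\lceil\frac{p'\cdot(2\sigma_L+2C_\mu)R}{\epsilon}\right\rceil\le\frac{p'\cdot(2\sigma_L+2C_\mu)R}{\epsilon}+1\le\frac{3p'\cdot(\sigma_L+C_\mu)R}{\epsilon}.$$
Therefore, $|S(\epsilon)|\le\left(\frac{3\cdot(\sigma_L+C_\mu)peR}{\epsilon}\right)^{p'}$ due to $\binom{p}{p'}\le\left(\frac{pe}{p'}\right)^{p'}$ and, further invoking the union bound and De Morgan's law, it holds that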
[TABLE]
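The counting step above can be verified numerically on the logarithmic scale. The sketch below takes the per-support grid count at face value as $\lceil 2p'(\sigma_L+C_\mu)R/\epsilon\rceil$, which is an assumption about the source formula, and compares $\ln|S(\epsilon)|$ with $p'\zeta_1(\epsilon)$. The function names are hypothetical.

```python
import math

def log_net_size(p, p_prime, sigma_L, C_mu, R, eps):
    """ln of C(p, p') * ceil(2 p' (sigma_L + C_mu) R / eps)^p'."""
    per_support = math.ceil(2.0 * p_prime * (sigma_L + C_mu) * R / eps)
    return math.log(math.comb(p, p_prime)) + p_prime * math.log(per_support)

def log_upper_bound(p, p_prime, sigma_L, C_mu, R, eps):
    """p' * zeta_1(eps) with zeta_1(eps) = ln(3 (sigma_L + C_mu) p e R / eps)."""
    return p_prime * math.log(3.0 * (sigma_L + C_mu) * p * math.e * R / eps)

size = log_net_size(p=1000, p_prime=10, sigma_L=1.0, C_mu=1.0, R=1.0, eps=0.5)
bound = log_upper_bound(p=1000, p_prime=10, sigma_L=1.0, C_mu=1.0, R=1.0, eps=0.5)
print(size, bound)
```

The gap between the two quantities comes from the inequality $\binom{p}{p'}\le(pe/p')^{p'}$ and from replacing the ceiling by a factor of 3/2.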
Further invoking the Bernstein-type inequality for a subexponential distribution as mentioned in Remark 2.1, where c is as in (12), it holds that
[TABLE]
Furthermore, in view of Lemma 13.11, it holds that
[TABLE]
with probability at least 1−2exp(−c⋅n) for some universal constant c>0. Therefore, for any β∈Bp′,R and βk∈S(ϵ), it holds with the same probability that
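with probability at least $1-2\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot peR}{\epsilon}\right)^{p'}\cdot\exp(-ct)-2\exp(-c\cdot n)$. Always picking the closest $\beta^k$ to $\beta$, we have, in view of (66), for any $\epsilon:0<\epsilon\le 1$:
$$\mathbb{P}\left[\max_{\beta\in B_{p',R}}\frac{1}{n}\sum_{i=1}^{n}L(\beta,Z_i)-E\left[\frac{1}{n}\sum_{i=1}^{n}L(\beta,Z_i)\right]\le\sigma\sqrt{\frac{t}{n}}+\frac{\sigma t}{n}+\epsilon\right]\ge 1-2\cdot\exp(-ct)\cdot\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot p\cdot eR}{\epsilon}\right)^{p'}-2\exp(-cn).$$
Further letting $t:=\frac{2p'\zeta_1(\epsilon)}{c}$, where we recall that $\zeta_1(\epsilon):=\ln\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot p\cdot eR}{\epsilon}\right)$, we then obtain the desired result.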
□
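The effect of the final choice of $t$ in the proof above can be written out explicitly. The short derivation below assumes the reading $t=2p'\zeta_1(\epsilon)/c$ and uses only the bound on $|S(\epsilon)|$ established earlier in the proof.

```latex
\begin{align*}
|S(\epsilon)| &\le \Big(\tfrac{3(\sigma_L+C_\mu)\,p\,eR}{\epsilon}\Big)^{p'}
  = \exp\big(p'\zeta_1(\epsilon)\big),\\
2\exp(-ct)\,|S(\epsilon)| &\le 2\exp\big(-2p'\zeta_1(\epsilon)+p'\zeta_1(\epsilon)\big)
  = 2\exp\big(-p'\zeta_1(\epsilon)\big)
  \quad\text{for } t=\tfrac{2p'\zeta_1(\epsilon)}{c}.
\end{align*}
```

This recovers the term $2\exp(-p'\zeta_1(\epsilon))$ in the probability stated in Proposition 3.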
Proposition 4
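Let $\Gamma\ge 0$, $\epsilon\in(0,1]$, and $\zeta_1(\epsilon):=\ln\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot p\cdot eR}{\epsilon}\right)$. Suppose that Assumptions 1, 2 and 2 hold. Consider any random vector $\beta=(\beta_j:j=1,\ldots,p)\in\Re^p$ such that $\|\beta\|_\infty\le R$ and $|\beta_j|\notin(0,a\lambda)$, for all $j$, almost surely, and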
[TABLE]
For any fixed positive integer $p'_u:p'_u>s$, if
[TABLE]
for all $p':p'_u\le p'\le p$, then
$\mathbb{P}[\|\beta\|_0\le p'_u-1]\ge 1-2p\exp(-cn)-4\exp(-p'_u\zeta_1(\epsilon))$ for the same c in (12) and some c>0.
Proof 13.4
Proof.
Let $B_R:=\{\beta\in\Re^p:\|\beta\|_\infty\le R\}$.
Consider an arbitrary $p':p'_u\le p'\le p$. Since $p'>s$ by the assumption that $p'_u>s$, we may consider the following sets:
[TABLE]
Note that $\beta\in E^2_{p'}\cap E^4$ means that $\beta$ has $p'$-many nonzero dimensions and that the absolute value of each nonzero dimension is not within the interval $(0,a\lambda)$. Then, for all $(\beta,\mathbf{Z}_1^n)\in\{(\beta,\mathbf{Z}_1^n)\in E^1_\Gamma\}\cap\{\beta\in E^4\cap E^2_{p'}\}$, where $\mathbf{Z}_1^n=(Z_1,\ldots,Z_n)$, it holds that
[TABLE]
Since $\beta^{*}_{\varepsilon_A}\in B_R:\|\beta^{*}_{\varepsilon_A}\|_0=s<p'$, we may obtain that, for all $\beta\in E^2_{p'}$,
[TABLE]
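where the last inequality is due to $L_g^{*}\le L(\beta^{*}_{\varepsilon_A})$ and $L(\beta^{*}_{\varepsilon_A})-L_g^{*}\le\varepsilon_A\Longrightarrow L(\beta^{*}_{\varepsilon_A})-L(\beta)\le\varepsilon_A$ for all $\beta\in B_R$.
For any $p':p'_u\le p'\le p$, if we suppose that $\emptyset\ne\{(\beta,\mathbf{Z}_1^n)\in E^1_\Gamma\}\cap\{\beta\in E^2_{p'}\cap E^4\}\cap\{\mathbf{Z}_1^n\in E^3_{p'}\}$, then (72), (73) and the definition of $E^3_{p'}$ together would imply that $(p'-s)\cdot P_\lambda(a\lambda)\le\frac{2\sigma}{\sqrt{n}}\sqrt{\frac{2p'\zeta_1(\epsilon)}{c}}+\frac{4\sigma}{n}\cdot\frac{p'\zeta_1(\epsilon)}{c}+2\epsilon+\Gamma+\varepsilon_A$, which contradicts the assumed inequality (71). Therefore, under the assumption that (71) holds and $\beta$ satisfies the S$^3$ONC($\mathbf{Z}_1^n$) with probability one,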
[TABLE]
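for all $p':p'_u\le p'\le p$. Now, invoking Proposition 2, we have $\mathbb{P}[(\beta,\mathbf{Z}_1^n)\in E^1_\Gamma,\ \beta\in E^4]=1$, since $\beta$ satisfies both the S$^3$ONC($\mathbf{Z}_1^n$) and (70) with probability one. Therefore, (74) implies that, for all $p':p'_u\le p'\le p$,
$$\mathbb{P}[\mathbf{Z}_1^n\notin E^3_{p'}]\ge\mathbb{P}[\beta\in E^2_{p'}].$$
Consequently,
$$\mathbb{P}[\|\beta\|_0=p']\le 1-\mathbb{P}[\mathbf{Z}_1^n\in E^3_{p'}]$$
for all $p':p'_u\le p'\le p$. This, combined with Proposition 3, yields that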
[TABLE]
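where c>0 is the same constant as in Proposition 4. Observing that $\zeta_1(\epsilon)=\ln\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot p\cdot eR}{\epsilon}\right)>0$ (since $p>s>1$, $R,\sigma_L,C_\mu\ge 1$, and $\epsilon\le 1$) and that $\sum_{p'=p'_u}^{p}2\exp(-p'\cdot\zeta_1(\epsilon))$ is the sum of a geometric sequence, we have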
[TABLE]
The above can be simplified into
$$\mathbb{P}[\|\beta\|_0\le p'_u-1]\ge 1-4\exp(-p'_u\zeta_1(\epsilon))-2p\exp(-cn).$$
□
Proposition 5 below uses the short-hand notation $\zeta:=\ln(3eR\cdot(\sigma_L+C_\mu))$.
Proposition 5
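Let $a<1$. Suppose that Assumptions 1, 2, and 2 hold. For any $\varrho:0<\varrho<\frac{1}{2}$, let $\lambda:=\sqrt{\frac{8\sigma[\ln(n^{\varrho}p)+\zeta]}{c\cdot a\cdot n^{2\varrho}}}$ with the same c in (12). Consider any random vector $\beta=(\beta_j:j=1,\ldots,p)\in\Re^p$ such that $\|\beta\|_\infty\le R$ and $|\beta_j|\notin(0,a\lambda)$ for all $j$ almost surely: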
(i)
For any fixed Γ≥0 and some universal constants c,C1>0, if
[TABLE]
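and $L_{n,\lambda}(\beta,\mathbf{Z}_1^n)\le L_{n,\lambda}(\beta^{*}_{\varepsilon_A},\mathbf{Z}_1^n)+\Gamma$ almost surely, then
$\mathbb{P}[E_a\cap E_b]\ge 1-2(p+1)\exp(-cn)-6\exp(-2cn^{4\varrho-1})$, where the events $E_a$ and $E_b$ are defined as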
[TABLE]
[TABLE]
(ii)
For some universal constants c,C2>0, if
[TABLE]
and $L_{n,\lambda}(\beta,\mathbf{Z}_1^n)\le L_{n,\lambda}(\beta^{\ell_1},\mathbf{Z}_1^n)$ almost surely, then
[TABLE]
with probability at least
$1-2(p+1)\exp(-cn)-6\exp(-2cn^{4\varrho-1})$.
Proof 13.5
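Proof. We denote by $c_0,c_1,c_2,\ldots$ potentially different universal constants throughout this proof.
To show Part (i), let $\epsilon:=\frac{1}{n^{\varrho}}\in(0,1]$, and $\zeta_1(\epsilon):=\ln\left(\frac{3\cdot(\sigma_L+C_\mu)\cdot p\cdot eR}{\epsilon}\right)=\ln(n^{\varrho}p)+\zeta>0$. Then $\lambda=\sqrt{\frac{8\sigma\zeta_1(\epsilon)}{c\cdot a\cdot n^{2\varrho}}}=\sqrt{\frac{8\sigma[\ln(n^{\varrho}p)+\zeta]}{c\cdot a\cdot n^{2\varrho}}}$.
We first invoke Proposition 4 to bound the sparsity level of $\beta$. To that end, we need to derive an explicit form for $p'_u$ as defined in that proposition. Let $T_1:=2P_\lambda(a\lambda)-\frac{8\sigma\zeta_1(\epsilon)}{cn}$. We may explicate $p'_u$ by solving the following inequality (where $P_X$ is the unknown), which is equivalent to (71) of Proposition 4 with $p':=P_X$,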
[TABLE]
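for the same c∈(0,0.5] in (12).
The solution to the above inequality yields that
$$\sqrt{P_X}>\frac{\frac{2\sigma}{\sqrt{n}}\sqrt{\frac{2\zeta_1(\epsilon)}{c}}}{T_1}+\frac{\sqrt{\frac{2(2\sigma)^2\cdot\zeta_1(\epsilon)}{cn}+2T_1[\Gamma+\varepsilon_A+2\epsilon+sP_\lambda(a\lambda)]}}{T_1}.$$
Since we aim only to find a feasible $P_X$, we may as well require that
$$P_X>\frac{32\sigma^2\zeta_1(\epsilon)}{cT_1^2\cdot n}+8T_1^{-1}[\Gamma+\varepsilon_A+2\epsilon+sP_\lambda(a\lambda)].$$
For $\lambda=\sqrt{\frac{8\sigma\zeta_1(\epsilon)}{c\cdot a\cdot n^{2\varrho}}}$, we have $P_\lambda(a\lambda)=\frac{a\lambda^2}{2}=\frac{4\sigma\zeta_1(\epsilon)}{c\cdot n^{2\varrho}}$. Further noticing that $2P_\lambda(a\lambda)=\frac{8\sigma\zeta_1(\epsilon)}{c\cdot n^{2\varrho}}>\frac{4\sigma\zeta_1(\epsilon)}{c\cdot n^{2\varrho}}+\frac{8\sigma\zeta_1(\epsilon)}{nc}$ as per our assumption (i.e., (76) implies that $n^{1-2\varrho}>2$), we therefore know that $T_1=2P_\lambda(a\lambda)-\frac{8\sigma\zeta_1(\epsilon)}{nc}>\frac{4\sigma\zeta_1(\epsilon)}{c\cdot n^{2\varrho}}$. As a result, to satisfy (71) of Proposition 4, it suffices to let $p'_u$ be any integer that satisfies
$$p'_u\ge\frac{2cn^{4\varrho-1}}{\zeta_1(\epsilon)}+\frac{2cn^{2\varrho}}{\sigma\zeta_1(\epsilon)}\cdot[\Gamma+\varepsilon_A+2\epsilon+sP_\lambda(a\lambda)],$$
which is satisfied by specifying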
[TABLE]
hereafter in this proof. (Here the last equality is due to our choice of parameter, $\epsilon=\frac{1}{n^{\varrho}}$.)
In the meantime, the right-hand side of (82) is strictly larger than $s$.
Since (82) is a sufficient condition for (81), we know that, if (82) holds, then (71) in Proposition 4 holds for all $p':p'_u\le p'\le p$.
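Invoking Proposition 4, we have that, with probability at least
$1-4\exp\left(-\left\lceil\frac{2cn^{4\varrho-1}}{\zeta_1(\epsilon)}+\frac{2cn^{2\varrho}}{\sigma\zeta_1(\epsilon)}\cdot\left(\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}\right)+8s\right\rceil\cdot\zeta_1(\epsilon)\right)-2p\exp(-cn)$, it holds that $\|\beta\|_0\le p'_u-1=\left\lceil\frac{2cn^{4\varrho-1}}{\zeta_1(\epsilon)}+\frac{2cn^{2\varrho}}{\sigma\zeta_1(\epsilon)}\cdot\left(\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}\right)+8s\right\rceil-1$.
In view of the assumption that $L_{n,\lambda}(\beta,\mathbf{Z}_1^n)\le L_{n,\lambda}(\beta^{*}_{\varepsilon_A},\mathbf{Z}_1^n)+\Gamma$, w.p.1., together with Assumption 1 and the fact that $P_\lambda(|\cdot|)\ge 0$, we know that
$$\frac{1}{n}\sum_{i=1}^{n}L(\beta,Z_i)\le\frac{1}{n}\sum_{i=1}^{n}L(\beta^{*}_{\varepsilon_A},Z_i)+sP_\lambda(a\lambda)+\Gamma,\ \text{a.s.}$$
$$\Longrightarrow\left\{\frac{1}{n}\sum_{i=1}^{n}L(\beta,Z_i)-E\left[\frac{1}{n}\sum_{i=1}^{n}L(\beta,Z_i)\right]\right\}+E\left[\frac{1}{n}\sum_{i=1}^{n}L(\beta,Z_i)\right]\le\left\{\frac{1}{n}\sum_{i=1}^{n}L(\beta^{*}_{\varepsilon_A},Z_i)-E\left[\frac{1}{n}\sum_{i=1}^{n}L(\beta^{*}_{\varepsilon_A},Z_i)\right]\right\}+E\left[\frac{1}{n}\sum_{i=1}^{n}L(\beta^{*}_{\varepsilon_A},Z_i)\right]+sP_\lambda(a\lambda)+\Gamma,\ \text{a.s.}$$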
Given the event E1∩E2, where
[TABLE]
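with $B_{p'_u,R}:=\{\beta\in\Re^p:\|\beta\|_\infty\le R,\ \|\beta\|_0\le p'_u\}$ and $E_2:=\{\|\beta\|_0\le p'_u\}$ with $p'_u>s$,
we may obtain from the above that
$$L(\beta)-L(\beta^{*}_{\varepsilon_A})\le s\cdot P_\lambda(a\lambda)+\frac{2\sigma}{\sqrt{n}}\sqrt{\frac{2p'_u\zeta_1(\epsilon)}{c}}+\frac{4\sigma}{n}\cdot\frac{p'_u\zeta_1(\epsilon)}{c}+2\epsilon+\Gamma,\ \text{a.s.}$$
In the analysis above, we have derived the probability for $\{\|\beta\|_0\le p'_u-1\}$. Combining this with Proposition 3, we have that the event $E_1\cap E_2$ holds with probability at least $1-6\exp\left(-\left\lceil\frac{2cn^{4\varrho-1}}{\zeta_1(\epsilon)}+\frac{2cn^{2\varrho}}{\sigma\zeta_1(\epsilon)}\cdot\left(\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}\right)+8s\right\rceil\cdot\zeta_1(\epsilon)\right)-2(p+1)\exp(-cn)\ge 1-6\exp(-2cn^{4\varrho-1})-2(p+1)\exp(-cn)$. Further noticing that $L(\beta^{*}_{\varepsilon_A})\le L_g^{*}+\varepsilon_A$ as per Assumption 1, we have both $\|\beta\|_0\le p'_u$ and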
[TABLE]
where $p'_u=\left\lceil\frac{2cn^{4\varrho-1}}{\zeta_1(\epsilon)}+\frac{2cn^{2\varrho}}{\sigma\zeta_1(\epsilon)}\cdot\left(\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}\right)+8s\right\rceil$, hold simultaneously with probability at least $1-6\exp(-2cn^{4\varrho-1})-2(p+1)\exp(-cn)$. Thus, we have already proven (77) in Part (i) of the Proposition.
To obtain (78) in Part (i) of the Proposition, we simplify (84) while preserving the rates in $n$ and $p$. Firstly, we have
[TABLE]
which is obtained by observing the fact that $\sqrt{x+y}\le\sqrt{x}+\sqrt{y}$ for any $x,y\ge 0$ and the relations that $0<a\le 1$, $0<c\le 0.5$, $\sigma\ge 1$, and $\zeta_1(\epsilon)\ge\ln 2$ (as a result of the assumed inequality (76)).
Similar to the above, we also have
[TABLE]
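Invoking (76) and $\zeta_1(\epsilon)=\ln(n^{\varrho}p)+\zeta$, we have
$$\frac{4}{n^{2-4\varrho}}+\frac{4}{\sigma n^{1-2\varrho}}\left(\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}\right)\le c_0\quad\text{and}\quad\frac{2\zeta_1(\epsilon)}{nc}[8s+1]\le c_1.$$
Therefore, it holds that
$$\frac{2p'_u\zeta_1(\epsilon)}{cn}\le c_2\cdot\sqrt{\frac{4}{n^{2-4\varrho}}+\frac{4}{\sigma n^{1-2\varrho}}\left(\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}\right)}+c_2\cdot\sqrt{\frac{2\zeta_1(\epsilon)}{nc}\cdot(8s+1)}.$$
Further invoking (85) and (86), the inequality in (84) can be simplified into
$$L(\beta)-L_g^{*}\le\frac{4s\sigma\zeta_1(\epsilon)}{c\cdot n^{2\varrho}}+c_3\cdot\sigma\cdot\sqrt{\frac{1}{n^{2-4\varrho}}+\frac{\Gamma+\varepsilon_A+\frac{2}{n^{\varrho}}}{\sigma n^{1-2\varrho}}}+c_3\cdot\sigma\sqrt{\frac{s+1}{nc}\zeta_1(\epsilon)}+\frac{2}{n^{\varrho}}+\Gamma+\varepsilon_A.$$
Further invoking a few known inequalities such as $\zeta_1(\epsilon)\ge\ln 2$, $0<\varrho<1/2$, $\sigma\ge 1$, and $0<c\le 0.5$, we may obtain a further simplification that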
[TABLE]
which immediately leads to (84) as claimed in Part (i) since $\zeta_1(\epsilon):=\ln(3n^{\varrho}(\sigma_L+C_\mu)\cdot p\cdot eR)=\ln(n^{\varrho}p)+\zeta$, $a^{-1}>1$, $s>1$, $R\ge 1$, and the satisfaction of (76). This immediately leads to the claimed inequality in (78) of Part (i).
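For Part (ii), due to Lemma 13.13, we know that $L_{n,\lambda}(\beta,\mathbf{Z}_1^n)\le L_{n,\lambda}(\beta^{\ell_1},\mathbf{Z}_1^n)\le L_{n,\lambda}(\beta^{*}_{\varepsilon_A},\mathbf{Z}_1^n)+\lambda|\beta^{*}_{\varepsilon_A}|$ with probability one. Therefore, we may apply the results from Part (i) for $\Gamma=\lambda|\beta^{*}_{\varepsilon_A}|$. Thus $\frac{\Gamma}{\sigma}\le\frac{\lambda|\beta^{*}_{\varepsilon_A}|}{\sigma}\le\frac{\|\beta^{*}_{\varepsilon_A}\|_\infty\cdot s}{\sigma}\cdot\sqrt{\frac{8\sigma[\ln(n^{\varrho}p)+\zeta]}{c\cdot a\cdot n^{2\varrho}}}$. Combining this inequality with the assumption of (17)
(which implies that $n>c_5\cdot a^{-1}\cdot[\ln(n^{\varrho}p)+\zeta]\cdot s^{\max\left\{1,\frac{1}{2-4\varrho},\frac{1}{2\varrho}\right\}}\cdot\left(\max\{1,\|\beta^{*}_{\varepsilon_A}\|_\infty\}\right)^{\max\left\{\frac{1}{2-4\varrho},\frac{1}{2\varrho}\right\}}$)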
as well as the assumption of σ≥1, we then know that σΓ≤∥βεA∗∥∞⋅s⋅cσ⋅a⋅n2ϱ8[ln(nϱp)+ζ]≤c6⋅a1−2ϱ∥βεA∗∥∞⋅s[ln(nϱp)+ζ]1−2ϱ. Therefore,
[TABLE]
Recall that a<1 and observe that [ln(nϱp)+ζ]≥1. We then have that, if n satisfies (17), then
with probability at least
1−2(p+1)exp(−cn)−6exp(−2cn4ϱ−1). The above bound can be further simplified by noticing that a<1, s>1, 0<ϱ<21, σ≥1, p≥1, [ln(nϱp)+ζ]≥1 and nsζ1(ϵ)≤n21−ϱs⋅ζ1(ϵ). As a result,
L(β)−Lg∗≤c10⋅σ⋅[n2ϱs⋅(ln(nϱp)+ζ)+nϱ1+n1−2ϱ1]+c10⋅min{a1/2nϱ,a1/4n21−ϱ}s⋅max{1,∥βεA∗∥∞}⋅σ3/4[ln(nϱp)+ζ]1/2+c10⋅n1−2ϱσεA+εA, which immediately leads to the desired result in Part (ii).
□
13.2 Proof of results for nonsmooth HDSL
Proof 13.6
Proof of Theorem 6.2.
To show Part (a), we invoke Theorem 13.15 and obtain that $f_\mu(\beta,A(Z_i)):=\max_{u\in U}\left\{u^{\top}A(Z_i)\beta-\frac{1}{2n^{\delta}}\|u-u_0\|^2\right\}$ is continuously differentiable with Lipschitz continuous gradient, and the corresponding Lipschitz constant is $\frac{1}{n^{-\delta}}\|A(Z_i)\|_{1,2}^2$, with $\frac{1}{n^{-\delta}}\|A(Z_i)\|_{1,2}^2\le n^{\delta}\cdot U_A$, a.s. Therefore, for all $j=1,\ldots,p$, the partial derivative $\frac{\partial L_{n,\delta}(\beta,\mathbf{Z}_1^n)}{\partial\beta_j}$ is well-defined for all $\beta\in\Re^p$ and Lipschitz continuous for almost every $Z_i\in W$. Further noticing that $\frac{1}{n^{-\delta}}\|A(Z_i)\|_{1,2}^2+U_{f_1}\le U_{f_1}+n^{\delta}U_A$ with probability one, we have the desired result in Part (a).
To show Part (b), we denote by $c_1,c_2,\ldots$ potentially different universal constants throughout this proof. Let $\beta^{*}_{\varepsilon'_A}$ be the sparse vector as in Assumption 1 (where $\varepsilon_A$ is now denoted by $\varepsilon_A:=\varepsilon'_A$ in this theorem) and $L_{n,\delta}$ as in (30) (where $L(\cdot)$ in the statement of the assumption is replaced by $E\left[n^{-1}\sum_{i=1}^{n}L_{ns}(\cdot,Z_i)\right]$). We claim that
[TABLE]
To see this, one may observe that, under Assumption 1,
[TABLE]
where the last inequality is due to the definition of βεA′∗.
Now we consider the hypothetical population-level learning problem of $\inf_\beta E[L_{n,\delta}(\beta,\mathbf{Z}_1^n)]$. The foregoing derivation indicates that this hypothetical problem also satisfies Assumption 1 with $\varepsilon_A:=\left(\frac{D^2}{2n^{\delta}}+\varepsilon'_A\right)$ and a sparsity level $s$. We thus may analyze this hypothetical problem by employing Proposition 1, where we let $L_n$, $\varepsilon_A$, $\delta$, $\varrho$, and $a^{-1}$ in the original definition be $L_n:=L_{n,\delta}$, $\varepsilon_A:=\frac{D^2}{2n^{\delta}}+\varepsilon'_A$, $\delta:=\frac{1}{4}$, $\varrho:=\frac{3}{8}$, and $a^{-1}:=2(U_{f_1}+n^{\delta}U_A)$, respectively,
to bound the excess risk. To that end, we first verify the satisfaction of (17) by the assumption of (31). In other words, we need to ensure that
Similarly, (31) also implies that
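$n>C_5\cdot(U_{f_1}+U_A)^{4/3}\cdot[\ln(n^{\frac{3}{8}}p)+\zeta]^{4/3}\cdot s^{8/3}\left(\max\{1,\|\beta^{*}_{\varepsilon'_A}\|_\infty\}\right)^{8/3}\Longrightarrow 2n^{\frac{4}{3}}>2c_3\cdot(U_{f_1}+U_A)^{4/3}n^{1/3}\cdot[\ln(n^{\frac{3}{8}}p)+\zeta]^{4/3}\cdot s^{8/3}\left(\max\{1,\|\beta^{*}_{\varepsilon'_A}\|_\infty\}\right)^{8/3}\Longrightarrow n>c_4\cdot(U_{f_1}+U_A)n^{1/4}\cdot[\ln(n^{\frac{3}{8}}p)+\zeta]\cdot s^{2}\left(\max\{1,\|\beta^{*}_{\varepsilon'_A}\|_\infty\}\right)^{2}\ge c_5\cdot(U_{f_1}+n^{1/4}U_A)\cdot[\ln(n^{\frac{3}{8}}p)+\zeta]\cdot s^{2}\left(\max\{1,\|\beta^{*}_{\varepsilon'_A}\|_\infty\}\right)^{2}$. Therefore, if (31) holds then $n>c_6\cdot\left(\frac{\varepsilon_A}{\sigma}\right)^{1/(1-2\varrho)}+c_6\cdot a^{-1}\cdot[\ln(n^{\varrho}p)+\zeta]\cdot s^{\max\left\{1,\frac{1}{2-4\varrho},\frac{1}{2\varrho}\right\}}\left(\max\{1,\|\beta^{*}_{\varepsilon'_A}\|_\infty\}\right)^{\max\left\{\frac{1}{2-4\varrho},\frac{1}{2\varrho}\right\}}$, which then means that (17) is verified.
We may now invoke Proposition 1 with $\varepsilon_A:=\frac{D^2}{2n^{\delta}}+\varepsilon'_A$ and $a:=[2(U_{f_1}+n^{\delta}U_A)]^{-1}$, respectively, in bounding $E[L_{n,\delta}(\beta,\mathbf{Z}_1^n)]-\inf_\beta E[L_{n,\delta}(\beta,\mathbf{Z}_1^n)]$. This proposition immediately leads to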
[TABLE]
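with probability at least $1-2(p+1)\exp(-cn)-6\exp(-2cn^{4\varrho-1})$ for a universal constant c>0.
Further notice that $E[L_{n,\delta}(\beta,\mathbf{Z}_1^n)]\le E[L_n(\beta,\mathbf{Z}_1^n)]\le E[L_{n,\delta}(\beta,\mathbf{Z}_1^n)]+\frac{D^2}{2n^{\delta}}$ for any $\beta\in\Re^p$ and $2(U_{f_1}+n^{1/4}U_A)\le 2(U_{f_1}+U_A)\cdot n^{1/4}$. Also recall that $\delta=\frac{1}{4}$ and $\varrho:=\frac{3}{8}$. We then can obtain from the above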
[TABLE]
with probability $1-2(p+1)\exp(-cn)-6\exp(-2cn^{1/2})$ for a universal constant c>0. The desired result is then immediately implied from the above after some simplification under $U_{f_1}\ge 1$ and $\sigma\ge 1$. □
13.3 Proof of generalizability for regularized NNs
13.3.1 Proof of generalizability of regularized NNs in a generic case
Proof 13.7
Proof of Theorem 6.7.
We denote by c1,c2,... potentially different universal constants throughout this proof.
We invoke Part (i) of Proposition 5 to show the desired result. To that end, we are to verify that all the conditions for the proposition are met. We first recall that A-sparsity (as in Assumption 1) holds as per (38)
with Lg∗=0, εA:=21v−1lnn⋅Ω(sA)+n1, s:=sA, and R:=21v−1lnn⋅RΩ.
Secondly, because $\min\{\ln 2,F(y\cdot F_{NN}(x,\beta))\}\in(0,\ln 2]$ with probability 1, we know that
$\|\min\{\ln 2,F(y\cdot F_{NN}(x,\beta))\}\|_{\psi_1}\le 1$, which means that Assumption 2 holds with $\sigma=1$. To see this, note that, as per Vershynin (2018) (Example 2.5.8.(c) therein), it holds that $\|\min\{\ln 2,F(y\cdot F_{NN}(x,\beta))\}\|_{\psi_2}\le\frac{1}{\ln 2}\cdot\operatorname{ess\,sup}_{x,y,\beta}\min\{\ln 2,F(y\cdot F_{NN}(x,\beta))\}=1$. By the property that $\|X\|_{\psi_2}^2=\|X^2\|_{\psi_1}$, we thus know that
[TABLE]
Notice that ∂βj2∂2F(y⋅FNN(x,β))=y2⋅[∂z2∂2F(z)]z=y⋅FNN(x,β)⋅[∂βj∂FNN(x,β)]2+y⋅[∂z∂F(z)]z=y⋅FNN(x,β)⋅∂2βj∂2FNN(x,β). Because ∣∂z2∂2F(z)∣≤1, ∣∂z∂F(z)∣≤1, and ∥β∥≤p⋅21⋅RΩ⋅v−1⋅lnn for all β:∥β∥∞≤RΩv−1lnn1/2, by Assumption 6.2, we have ∂βj2∂2F(y⋅FNN(x,β))≤2exp{2UNN⋅D⋅ln(UNN⋅RΩ⋅p⋅21⋅v−1⋅lnn+UNN)}. Because a<21⋅exp{−2UNN⋅D⋅ln[p⋅v−1⋅UNN⋅RΩ⋅lnn+UNN]}, the S3ONC solution satisfies that βj∈/(0,aλ) for all j with probability 1, as per Proposition 2.
Thirdly, we now show that F(y⋅FNN(x,⋅)) obeys the Lipschitz-like condition as a special case to Assumption 2. By Assumption 6.2, we have ∥∇βF(y⋅FNN(x,β))∥=[y⋅∂z∂F(z)]z=y⋅FNN(x,β)⋅∇βFNN(x,β)≤exp[UNN⋅D⋅ln(UNN⋅∥β∥+UNN)]≤exp[UNN⋅D⋅ln(p⋅v−1⋅RΩ⋅UNN⋅lnn+UNN)], which indicates that ∣F(y⋅FNN(x,β1))−F(y⋅FNN(x,β2))∣≤exp[UNN⋅D⋅ln(pn⋅v−1⋅RΩ⋅UNN+UNN)]⋅∥β1−β2∥≤lnnnexp[UNN⋅D⋅ln(pn⋅v−1⋅RΩ⋅UNN+UNN)]⋅∥β1−β2∥, for all β1,β2∈[−21⋅RΩ⋅v−1⋅lnn,\leavevmode\leavevmode21⋅RΩ⋅v−1⋅lnn]p and almost every x∈X. Consequently, Assumption 2 holds with σL=0, R:=21v−1lnn⋅RΩ, and Cμ=lnnnexp[UNN⋅D⋅ln(pn⋅v−1⋅RΩ⋅UNN+UNN)]. Thus, ζ=ln(3eR⋅(σL+Cμ))=ln(23eRΩ⋅v−1⋅n⋅exp[UNN⋅D⋅ln(pn⋅v−1⋅RΩ⋅UNN+UNN)])=ln(23eRΩnv−1)+UNN⋅D⋅ln[UNN⋅(1+RΩpnv−1)].
So far, we have verified that all the conditions for Proposition 5 hold. Invoking this proposition with ϱ=1/3, we thus have, for any Γ≥0 and some universal constant c2>0, if
[TABLE]
and Ln,λ(β,Z1n)≤Ln,λ(βεA∗,Z1n)+Γ almost surely, then we obtain the below by invoking Proposition 5 (Part (i)) with ϱ=1/3 after some simplification:
[TABLE]
with probability at least
1−2(p+1)exp(−n/c4)−6exp(−2cn1/3).
Further noticing that $\mathbb{1}(t<0)\le 2\cdot\min\{\ln 2,F(t)\}$ for all $t\in\Re$, we then have $\mathbb{E}[\mathbb{1}(y\cdot F_{NN}(x,\beta)<0)]\le 2\cdot\mathbb{E}[\min\{\ln 2,F(y\cdot F_{NN}(x,\beta))\}]$, almost surely. This combined with (95) immediately leads to the desired result.
□
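The last step of the proof rests on the pointwise inequality $\mathbb{1}(t<0)\le 2\min\{\ln 2,F(t)\}$. The sketch below checks it numerically on a grid, assuming that $F$ is the logistic loss $F(z)=\ln(1+\exp(-z))$; this form is inferred from the derivative $F'$ and the truncation level $\ln 2$ used in this appendix, and the function names are hypothetical.

```python
import math

def logistic_loss(z: float) -> float:
    """F(z) = ln(1 + exp(-z)), evaluated in a numerically stable way."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    # For z < 0 use ln(1 + exp(-z)) = -z + ln(1 + exp(z)).
    return -z + math.log1p(math.exp(z))

def truncated_loss(z: float) -> float:
    """min{ln 2, F(z)}: the truncated surrogate used in the proof."""
    return min(math.log(2.0), logistic_loss(z))

def zero_one(z: float) -> float:
    """Indicator of a misclassification, 1(z < 0)."""
    return 1.0 if z < 0 else 0.0

# Check 1(t < 0) <= 2 * min{ln 2, F(t)} on a grid of margins t in [-20, 20].
grid = [i / 100.0 for i in range(-2000, 2001)]
holds = all(zero_one(t) <= 2.0 * truncated_loss(t) for t in grid)
print(holds)
```

For negative margins the truncation is active, so the right-hand side equals $2\ln 2>1$; for nonnegative margins the left-hand side is zero.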
13.3.2 Proof of generalizability of a flexible set of NN architectures
Proof 13.8
Proof of Corollary 9.1.
Let c1,c2,… be universal constants. Because the output layer involves no nonlinear transformation, Assumption 6.2 holds. Observe that, when bl,D=0, for l=1,...,D−1, and Wl−1,l=0 and bl−1,l=0, for all l=2,...,D−1, the NN defined as in (44)-(46) can be reduced to FNN(x,β)=∑l=1D−1[wl,D⊤Ψ(W0,lx+b0,1)], which is essentially an NN with one hidden layer.
We therefore may invoke Theorem 2.1 of Mhaskar (1996), which is restated as Theorem 13.17 in this paper for completeness. It establishes the representation error of a single-hidden-layer NN in approximating g∈Fd,r under Assumption 9.1. As an immediate result of that theorem, if there are N-many (active) hidden neurons in that single-hidden-layer NN, captured by w⊤Ψ(Wx+b) for fitting parameters w∈ℜN, W∈ℜN×d, and b∈ℜN, then the model misspecification error Ω(N) is at most CNN⋅N−r/d, where CNN>0 is a quantity that depends only on d and r; more formally,
[TABLE]
Meanwhile, the total number of fitting parameters of this single-hidden-layer NN is (d+2)⋅N. Observing that this single-hidden-layer NN is a subnetwork of FNN(x,β) if N≤K⋅D, we obtain that
[TABLE]
for any positive integers N:N≤K⋅D.
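We now invoke Theorem 6.7 with $s_A:=(d+2)\cdot N$, where we let $N=\min\{K\cdot D,\,n^{1/3}\}$, and
$$\Omega((d+2)N):=C_{NN}\cdot N^{-r/d}\le C_{NN}\cdot\max\{n^{-\frac{r}{3d}},\,(K\cdot D)^{-r/d}\}\le C_{NN}.$$
To satisfy (39), it suffices to stipulate both $n>c_1\cdot(C_{NN}\cdot v^{-1}\ln n)^3+c_1\cdot(\Gamma+1)^3$ and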
[TABLE]
which are simultaneously satisfied by (47).
Then, the desired result is implied by Theorem 6.7.
□
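The parameter choices in this proof can be illustrated with a small calculation. The sketch below evaluates $N=\min\{K\cdot D,n^{1/3}\}$, the sparsity level $s_A=(d+2)N$, and the misspecification bound $C_{NN}N^{-r/d}$ for given problem sizes. The function name and the numeric value used for $C_{NN}$ are hypothetical; the paper only states that $C_{NN}$ depends on $d$ and $r$.

```python
def subnetwork_budget(n, K, D, d, r, C_nn):
    """Return (N, s_A, omega) for the single-hidden-layer subnetwork.

    N     : number of active hidden neurons, min{K*D, n^(1/3)}
    s_A   : number of fitting parameters, (d + 2) * N
    omega : bound C_nn * N^(-r/d) on the misspecification error
    """
    N = min(K * D, n ** (1.0 / 3.0))
    s_A = (d + 2) * N
    omega = C_nn * N ** (-r / d)
    return N, s_A, omega

# Illustrative values only (C_nn = 1.0 is an assumption).
N, s_A, omega = subnetwork_budget(n=1000, K=4, D=3, d=2, r=1, C_nn=1.0)
print(N, s_A, omega)
```

Because $N$ is the smaller of the two candidates, $N^{-r/d}$ is the larger of $n^{-r/(3d)}$ and $(K\cdot D)^{-r/d}$, which is the first inequality in the display for $\Omega((d+2)N)$.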
13.3.3 Proof of suboptimality-independent generalizability of NN
Proof 13.9
Proof of Theorem 9.4. We first show Part (b). We denote by c0,c1,c2,... potentially different universal constants throughout this proof. The general idea is (i) to first show that Algorithm 1 always generates a sparse solution and that, with the initialization via Algorithm 2, the suboptimality gap is well controlled, and (ii) then, to invoke Proposition 3, which provides generalization error bounds for sparse solutions with a small suboptimality gap. Accordingly, this proof is divided into three steps, with the analysis for (i) provided in Steps 1 and 2, and the details for (ii) provided in Step 3.
Our proof relies on the analysis of the following hypothetical formulation:
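$$\inf_{\beta\in\Re^p}\ \frac{1}{n}\sum_{i=1}^{n}\min\{\ln 2,\,F(y_i\cdot F_{NN}(x_i,\beta))\}+\sum_{j=1}^{p}P_\lambda(|\beta_j|).$$
Meanwhile, because of the termination criterion in (26) (where $f(\cdot):=n^{-1}\sum_{i=1}^{n}F(y_iF_{NN}(x_i,\cdot))$), we have, for all $k=1,\ldots,k^*(\beta^{\mathrm{initial}},X,y)$,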
[TABLE]
Step 1. For the above hypothetical problem, this step verifies that the conditions required by Proposition 5 are satisfied in the case where $k^*(\beta^{\mathrm{initial}},X,y)\ge 1$. We divide this step into five sub-steps as below.
Step 1.1. We first verify Assumption 1. Because of (55), it holds that $K^*:=\lceil 10n^{1/3}\cdot(\ln n)^{5/3}\rceil\ge d\ln(dK^*)=d\ln(\lceil 10n^{1/3}\cdot(\ln n)^{5/3}\rceil d)$.
Thus, as a direct implication of Lemma 13.19,
[TABLE]
where ξ follows the same definition as in Lemma 13.19 and K∗ is defined as in Algorithm 1.
Observe that, as per Assumption 9.2,
$y\cdot g(x)=y\cdot\mathbb{E}_{\xi}[C_g(\xi)\cdot\max\{0,\xi^\top x\}]\ge v\iff\frac{\ln n}{v}\,y\cdot g(x)\ge\ln n$, for all $(x,y)\in\mathrm{supp}(D)$. Also observe that the first and second derivatives of $F$ are $F'(z)=-\frac{\exp(-z)}{1+\exp(-z)}$ and $F''(z)=\frac{\exp(z)}{(1+\exp(z))^2}=\frac{\exp(z)}{1+\exp(2z)+2\exp(z)}$. Thus $F'$ is 0.5-Lipschitz continuous, and hence a well-known inequality yields that
$F(x_1)-F(x_2)\le F'(x_2)\cdot(x_1-x_2)+\frac{0.5}{2}\cdot(x_1-x_2)^2$. In view of $|F'(z)|=\frac{\exp(-z)}{1+\exp(-z)}\le\frac{1/n}{1+1/n}\le\frac{1}{n}$ for all $z\ge\ln n$, we then obtain that $|F'(v^{-1}\cdot y\cdot g(x)\cdot\ln n)|\le\frac{1}{n}$. This, combined with (99), yields that
[TABLE]
with probability
1−2exp(−dln(dK∗))−exp(−d⋅K∗).
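The three facts about $F$ used in Step 1.1 (bounded curvature, the slope bound $|F'(z)|\le 1/n$ for $z\ge\ln n$, and the resulting quadratic upper bound) can be checked numerically. The sketch below assumes the logistic loss $F(z)=\ln(1+\exp(-z))$, which is consistent with the derivative formulas above; the grid ranges and tolerance are arbitrary illustrative choices.

```python
import math

def F(z):
    """Logistic loss ln(1 + exp(-z)); adequate for moderate |z|."""
    return math.log1p(math.exp(-z))

def dF(z):
    """First derivative: -exp(-z) / (1 + exp(-z))."""
    return -math.exp(-z) / (1.0 + math.exp(-z))

def d2F(z):
    """Second derivative: exp(z) / (1 + exp(z))^2."""
    return math.exp(z) / (1.0 + math.exp(z)) ** 2

TOL = 1e-12
grid = [i / 10.0 for i in range(-100, 101)]

# (1) F'' <= 1/4 <= 0.5, so F' is 0.5-Lipschitz.
curvature_ok = all(d2F(z) <= 0.25 + TOL for z in grid)

# (2) |F'(z)| <= 1/n whenever z >= ln n.
slope_ok = all(
    abs(dF(math.log(n) + 0.5 * j)) <= 1.0 / n
    for n in (2, 10, 1000, 10 ** 6)
    for j in range(21)
)

# (3) F(x1) - F(x2) <= F'(x2) (x1 - x2) + (0.5 / 2) (x1 - x2)^2.
descent_ok = all(
    F(x1) - F(x2) <= dF(x2) * (x1 - x2) + 0.25 * (x1 - x2) ** 2 + TOL
    for x1 in grid
    for x2 in grid
)

print(curvature_ok, slope_ok, descent_ok)
```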
Observe that $\frac{\ln n}{v}\cdot\frac{1}{K^*}\sum_{k=1}^{K^*}C_g(\xi_k)\cdot\max\{0,\xi_k^\top x\}$ is representable by $F_{NN}(x,\beta)$ for some $\beta$ with $\|\beta\|_0\le K^*\cdot(d+1)$ and $\|\beta\|_\infty\le n$. To see this, we can assign the fitting parameters in (50)-(52) as follows: (i) Let
$\boldsymbol{w}_{1,\mathcal{D}}:=(\widetilde{\boldsymbol{w}}_{1,\mathcal{D}}^{\top},\,\underbrace{0,\,\ldots,\,0}_{(K-K^*)\text{-many 0's}})^{\top}\in\Re^{K}$ and $W_{0,1}:=[\widetilde{W}_{0,1}^{\top},\,0_{d\times(K-K^*)}]^{\top}\in\Re^{K\times d}$,
where $\widetilde{\boldsymbol{w}}_{1,\mathcal{D}}=\left(\frac{y\cdot\ln n}{K^*v}\cdot C_g(\xi_k):k=1,\ldots,K^*\right)$, $\widetilde{W}_{0,1}=(\xi_k^\top:k=1,\ldots,K^*)$, and $0_{d\times(K-K^*)}$ is a $d$-by-$(K-K^*)$ all-zero matrix. (ii) Set the remaining fitting parameters to zero. With the foregoing assignment of values, no more than $K^*\cdot(d+1)$ of the fitting parameters are nonzero. Furthermore,
$\mathbb{P}[\max_{k\in\{1,\ldots,K^*\}}\{\|\xi_k\|_\infty\}\le n]\ge 1-dK^*\cdot\exp\left(-\frac{n^2}{2}\right)$, by the fact that each entry of $\xi_k$ is an i.i.d. standard Gaussian random variable. Meanwhile, $\|\boldsymbol{w}_{1,\mathcal{D}}\|_\infty\le v^{-1}\ln n\le n$ (because (55) implies that $n\ge\frac{\ln n}{v}$).
Consequently, (100) implies that
minβ:∥β∥0≤(d+1)K∗,∥β∥∞≤nE[F(y⋅FNN(x,β))]−E[F(v−1y⋅g(x)⋅lnn)]≤sup(x,y)∈supp(D){F(K∗vy⋅lnn∑k=1K∗Cg(ξk)⋅max{0,ξk⊤x})−F(vy⋅lnng(x))}≤c2⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c2⋅(lnn)2K∗⋅v2dln(d⋅K∗),
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗)−dK∗⋅exp(−2n2). Furthermore, because Assumption 9.2 and the definition of F (which is a decreasing function) imply that
[TABLE]
for all (x,y)∈supp(D), we may continue from the above to obtain that
minβ:∥β∥0≤(d+1)K∗,∥β∥∞≤nE[min{ln2,F(y⋅FNN(x,β))}]−0≤c2⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c2⋅(lnn)2K∗⋅v2dln(d⋅K∗)+n1,
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗)−dK∗⋅exp(−2n2).
Because F(t)>0 for all t∈ℜ, we thus know that A-sparsity as in Assumption 1 (while we let L(⋅), Lg∗, s, R, and εA from that definition to be L(⋅):=E[min{ln2,F(y⋅FNN(x,⋅))}], Lg∗:=0, s:=(d+1)⋅K∗, R:=n, and εA:=c2⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c2⋅(lnn)2K∗⋅v2dln(d⋅K∗)+n1, respectively) holds with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗)−dK∗⋅exp(−2n2). This completes Step 1.1.
Step 1.2. Because $\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta))\}\in(0,\ln 2]$, we know that $\|\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta))\}\|_{\psi_1}\le 1$ by the same argument as in deriving (93). Therefore, Assumption 2 holds with $\sigma=1$.
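Step 1.3. To verify Assumption 2, we observe that $\|W_{l-1,l}\|\le\|W_{l-1,l}\|_F\le K\cdot R$ for all $\beta=\mathrm{vec}((W_{l-1,l}:2\le l\le D-1),(b_{l-1,l}:2\le l\le D-1),w_{D-1,D},w_{1,D},b_{D-1,D},W_{0,1},b_{0,1})$ with $\|\beta\|_\infty\le R$, all $x$ with $\|x\|=1$, and all $l$ with $2\le l\le D-1$ (because $W_{l-1,l}$ has no more than $K^2$ entries and the absolute value of each entry is bounded above by $R$). Likewise, it also holds that $\|W_{0,1}\|\le\|W_{0,1}\|_F\le\sqrt{d\cdot K}\cdot R\le KR$ (where the last inequality is due to $K\ge d$)
and $\|b_{l-1,l}\|\le K\cdot R$ for all $l=1,\ldots,D$. Therefore, by (50)-(52),
$$\|f_{NN,l}(x,\beta)\|\le\Big[\prod_{l'=1}^{l}\|W_{l'-1,l'}\|\Big]\cdot\|x\|+\sum_{\ell=2}^{l}\Big[\prod_{l'=\ell}^{l}\|W_{l'-1,l'}\|\Big]\cdot\|b_{\ell-2,\ell-1}\|+\|b_{l-1,l}\|\le(K\cdot R)^{l}+\sum_{\ell=2}^{l+1}(K\cdot R)^{l-\ell+1}\cdot K\cdot R\le(K\cdot R)^{l}+\frac{(KR)^{l}}{1-(KR)^{-1}}.$$
Since $K\ge 2$ and $R\ge 1$, we have $\|f_{NN,l}(x,\beta)\|\le 3\cdot(K\cdot R)^{l}$ for all $l$ with $2\le l\le D-1$.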
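The passage from the layer-wise norm bounds to $3\cdot(K\cdot R)^l$ is a geometric-series estimate. The sketch below evaluates the intermediate sum and its closed-form upper bound for a range of integer values of $K\ge 2$, $R\ge 1$, and $l$; the function names are hypothetical, and restricting to integers is an illustrative simplification.

```python
def layer_norm_bound(K, R, l):
    """(KR)^l + sum_{ell=2}^{l+1} (KR)^(l-ell+1) * KR, as in the text."""
    kr = K * R
    weight_term = kr ** l
    bias_terms = sum(kr ** (l - ell + 1) * kr for ell in range(2, l + 2))
    return weight_term + bias_terms

def closed_form(K, R, l):
    """(KR)^l + (KR)^l / (1 - (KR)^(-1)): the geometric-series bound."""
    kr = K * R
    return kr ** l + kr ** l / (1.0 - 1.0 / kr)

# Check sum <= closed form <= 3 (KR)^l whenever KR >= 2.
chain_ok = all(
    layer_norm_bound(K, R, l) <= closed_form(K, R, l) <= 3 * (K * R) ** l
    for K in range(2, 6)
    for R in range(1, 4)
    for l in range(1, 7)
)
print(chain_ok)
```

The factor 3 is tight only when $K\cdot R=2$, where $1/(1-(KR)^{-1})=2$.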
Based on the above, one may further verify that
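$$\Big|n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_1))\}-n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_2))\}\Big|\le 3\sqrt{p}\cdot(K\cdot R)^{D}\cdot\|\beta_1-\beta_2\|,$$
for any $\beta_1,\beta_2\in\{\beta:\|\beta\|_\infty\le R\}$.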
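To see this, consider the case where $\beta_2=\beta_1+e_j\cdot\delta$ for some $\delta\in\Re$ such that $\beta_1,\beta_2\in\{\beta:\|\beta\|_\infty\le R\}$. It then holds that
$$\Big|n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_1))\}-n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_2))\}\Big|\le n^{-1}\sum_{i=1}^{n}\big|\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_1))\}-\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_2))\}\big|\le n^{-1}\sum_{i=1}^{n}\big|F(y_i\cdot F_{NN}(x_i,\beta_1))-F(y_i\cdot F_{NN}(x_i,\beta_2))\big|.$$
Recall that $|F'(z)|\le 1$ for all $z\in\Re$ (from which we obtain that $F$ is 1-Lipschitz continuous). Together with the fact that $y_i\in\{-1,1\}$ for all $i$, the above implies that
$$\Big|n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_1))\}-n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_2))\}\Big|\le n^{-1}\sum_{i=1}^{n}\big|y_i\cdot F_{NN}(x_i,\beta_1)-y_i\cdot F_{NN}(x_i,\beta_2)\big|\le n^{-1}\sum_{i=1}^{n}\big|F_{NN}(x_i,\beta_1)-F_{NN}(x_i,\beta_2)\big|.$$
Recall that $\beta_2=\beta_1+e_j\cdot\delta$.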
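Let the $j$th fitting parameter be the weight for the connection between the $\iota_1$th neuron in Layer $(l-1)$ and the $\iota_2$th neuron in Layer $l$ for any $l$ with $2\le l\le D-1$. Then, (50)-(52) and $\|f_{NN,l}(x,\beta)\|\le 3\cdot(K\cdot R)^{l}$ lead to
$$|F_{NN}(x_i,\beta_1)-F_{NN}(x_i,\beta_2)|\le\|w_{D-1,D}\|\cdot\Big(\prod_{\ell=l+1}^{D-1}\|W_{\ell-1,\ell}\|\Big)\cdot|\delta|\cdot\|f_{NN,l-1}(x_i,\beta_1)\|\le 3(KR)^{D}\cdot|\delta|.$$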
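We may generalize the above argument to all the dimensions of $\beta$. Consequently, if $\beta_2=\beta_1+\sum_{j=1}^{p}e_j\cdot\delta_j$ for any $\{\delta_j\}\subset\Re$ such that $\beta_1,\beta_2\in\{\beta:\|\beta\|_\infty\le R\}$, then
$$\Big|n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_1))\}-n^{-1}\sum_{i=1}^{n}\min\{\ln 2,F(y_i\cdot F_{NN}(x_i,\beta_2))\}\Big|\le 3(K\cdot R)^{D}\sum_{j=1}^{p}|\delta_j|\le 3\sqrt{p}\cdot(K\cdot R)^{D}\cdot\sqrt{\sum_{j=1}^{p}|\delta_j|^2}\le 3\sqrt{p}\cdot(K\cdot R)^{D}\cdot\|\beta_1-\beta_2\|.$$
Thus, Assumption 2 holds with $\sigma_L=0$ and $C_\mu=3\sqrt{p}\cdot(K\cdot R)^{D}$.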
Step 1.4. It is evident from the same argument as in proving Part (d) of Theorem 5.1 that $\beta=(\beta_j)$, where we let $\beta:=\beta^{k^*(\beta^{\mathrm{initial}},X,y)}$, satisfies $|\beta_j|\notin(0,a\lambda)$ for all $j=1,\ldots,p$, if $k^*(\beta^{\mathrm{initial}},X,y)\ge 1$.
Step 1.5. This sub-step is to derive an estimate on the suboptimality gap Γ for the initial solution generated through Algorithm 2. As per Lemma 13.19, because K∗≥d⋅ln(d⋅K∗) and W0,1initial=((w0,1,kinitial)⊤:k=1,...,K∗) has i.i.d. standard normal entries (and thus w0,1,kinitial follows the same distribution as both ξ and ξk) it holds that
[TABLE]
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗).
Following the same argument as in deriving (100), we obtain
n−1∑i=1nF(K∗vyilnn∑k=1K∗Cg(w0,1,kinitial)⋅max{0,(w0,1,kinitial)⊤xi})−n−1∑i=1nF(vyi⋅lnng(xi))≤c4⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c4⋅(lnn)2K∗⋅v2dln(d⋅K∗)
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗). As an immediate result,
[TABLE]
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗).
Further recall that FNNsub(⋅,(W0,1initial,w1,Linitial))=FNN(⋅,βinitial). We thus have (combined with (101))
n−1∑i=1nF(yi⋅FNN(xi,βinitial))=n1∑i=1nF(yi⋅FNNsub(xi,(W0,1initial,w1,Linitial)))≤n1+c4⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c4⋅(lnn)2K∗⋅v2dln(d⋅K∗),
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗).
Because Pλ(⋅)≤2aλ2 and ∥βinitial∥0≤(d+1)⋅K∗, we further obtain
[TABLE]
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗). Because of (98), we have n−1∑i=1nF(yi⋅FNN(xi,β))+∑j=1pPλ(βj)≤n−1∑i=1nF(yi⋅FNN(xi,βinitial))+∑j=1pPλ(βjinitial). It thus holds that
n−1∑i=1nmin{ln2,F(yi⋅FNN(xi,β))}+∑j=1pPλ(∣βj∣)≤n1+c4⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c4⋅(lnn)2K∗⋅v2dln(d⋅K∗)+K∗⋅(d+1)⋅2aλ2,
with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗).
Further observing that inftF(t)=0 and inftPλ(∣t∣)=0, we then have that n−1∑i=1nmin{ln2,F(yi⋅FNN(xi,β))}+∑j=1pPλ(∣βj∣)≤infβ[n−1∑i=1nmin{ln2,F(yi⋅FNN(xi,β))}+∑j=1pPλ(∣βj∣)]+Γ with Γ:=n1+c3⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c3⋅(lnn)2K∗⋅v2dln(d⋅K∗)+K∗⋅(d+1)⋅2aλ2 with probability at least 1−2exp(−dln(dK∗))−exp(−d⋅K∗).
Step 2.
In this step, we derive an upper bound on $\|\beta\|_0$. To that end, we distinguish the cases of $k^*(\beta^{\mathrm{initial}},X,y)=0$ and $k^*(\beta^{\mathrm{initial}},X,y)\ge 1$.
Case 2.1. We first consider the case of k∗(βinitial,X,y)=0; that is, β=βinitial. In such a case, recall that K∗:=⌈10n1/3⋅(lnn)5/3⌉. By Algorithm 2, it is evident that
∥β∥0≤K∗⋅(d+1)=⌈10n1/3⋅(lnn)5/3⌉⋅(d+1).
Case 2.2. Next, we consider the case where $k^*(\beta^{\mathrm{initial}},X,y)\ge 1$.
To that end, we may invoke Proposition 5 to bound ∥β∥0. According to Step 1, with probability at least 1−4exp(−dln(dK∗))−2exp(−d⋅K∗)−dK∗⋅exp(−2n2), all the assumptions required by Proposition 5 are satisfied with the following configurations: Ln,λ(β,Z1n):=n1∑i=1nmin{ln2,F(yi⋅FNN(xi,β))}+∑j=1pPλ(∣βj∣) and
[TABLE]
To satisfy (15) as required by Proposition 5, it suffices to stipulate (55). To see this, observe that σΓ+εA≤n2+(c2+c4)⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+(c2+c4)⋅(lnn)2K∗⋅v2dln(d⋅K∗)+K∗⋅(d+1)⋅2aλ2≤n2+(c2+c4)⋅n1⋅lnn⋅⌈10n1/3⋅(lnn)5/3⌉⋅v2dln(d⋅⌈10n1/3⋅(lnn)5/3⌉)+(c2+c4)⋅(lnn)2⌈10n1/3⋅(lnn)5/3⌉⋅v2dln(d⋅⌈10n1/3⋅(lnn)5/3⌉)+⌈10n1/3⋅(lnn)5/3⌉⋅(d+1)⋅cn2/38σ⋅[ln(n1/3p)+ξ]<(2n)1/3 under (55). Meanwhile, it holds that s(ln(np)+ξ)≤c5⋅d⋅n1/3ln(9eRDp)⋅(ln(n))7/3+c5⋅dn1/3Dln(KR)⋅(ln(n))5/3<2n under (55). In view of the above (and also D≤p and K≤p), (15) is satisfied. We may now invoke Part (i) in Proposition 5, which implies that
∥β∥0≤⌈ln(nϱp)+ζ2cn1/3+σ(ln(nϱp)+ζ)2cn2/3⋅(Γ+εA+nϱ2)+8s⌉≤ln(n1/3p)+ln(9eRDp)+Dln(KR)2cn1/3+2cn2/3⋅(n1/3c6+nc6+nc6⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c6⋅(lnn)2K∗⋅v2dln(d⋅K∗))+c6⋅s=:pNN,
with probability at least 1−c6⋅p⋅exp(−n1/3/c6)−c6⋅(d⋅n)−d/3−c6⋅n1/3(lnn)5/3dexp(−n2/c6)≥1−c7⋅p⋅exp(−n1/3/c7)−c7⋅(d⋅n)−d/3−c7⋅n1/3dexp(−n2/c7).
Combining the above two cases, we thus know that, for all k∗(βinitial,X,y)≥0,
[TABLE]
Step 3. This step employs results from Step 2 and Proposition 3 to show the desired generalizability of β.
By (98) (where we let k=k∗(βinitial,X,y) and β=βk∗(βinitial,X,y)) and (103) (as well as Pλ(t)≥0 for any t≥0) together, we obtain that
n−1∑i=1nmin{ln2,F(yiFNN(xi,β))}≤n1+c3⋅n1⋅lnn⋅K∗⋅v2dln(d⋅K∗)+c3⋅(lnn)2K∗⋅v2dln(d⋅K∗)+K∗⋅(d+1)⋅2aλ2−2Mγopt2⋅k∗(βinitial,X,y),
with probability 1−2exp(−dln(dK∗))−exp(−d⋅K∗). Recall that K∗:=⌈10n1/3⋅(lnn)5/3⌉. Invoking (105) in Step 2 and Proposition 3 with the same σ, σL, ζ, and Cμ as in (104),
we have
E[min{ln2,F(yiFNN(xi,β))}]−n−1∑i=1nmin{ln2,F(yiFNN(xi,β))}≤n12⋅c−1⋅max{pNN,⌈10n1/3⋅(lnn)5/3⌉(d+1)}⋅[ln(n1/3p)+ξ]+n2σ⋅c−1⋅max{pNN,⌈10n1/3⋅(lnn)5/3⌉(d+1)}[ln(n1/3p)+ξ]+n1/31,
with probability at least $1-c_7\cdot p\cdot\exp(-n^{1/3}/c_7)-c_7\cdot(d\cdot n)^{-d/3}-c_7\cdot n^{1/3}d\exp(-n^{2}/c_7)-2\exp\big(-\max\{p_{NN},\lceil 10n^{1/3}\cdot(\ln n)^{5/3}\rceil\}[\ln(n^{1/3}p)+\xi]\big)-2\exp(-cn)$. Combining the above, we then have
[TABLE]
with probability at least 1−c7⋅p⋅exp(−n1/3/c7)−c7⋅(d⋅n)−d/3−c7⋅n1/3dexp(−n2/c7)−2exp(−max{pNN,⌈10n1/3⋅(lnn)5/3⌉}[ln(n1/3p)+ξ]))−2exp(−cn)−2exp(−dln(dK∗))−exp(−d⋅K∗).
Observing that d≤p, D≤p, and K≤p, we obtain after some reorganization that
E[min{ln2,F(y⋅FNN(x,β))}]≤c7n1/3vd⋅D⋅[(lnn)4/3⋅ln(pR)]−k∗(βinitial,X,y)⋅2Mγopt2,
with probability 1−c7⋅p⋅exp(−n1/3/c7)−c7⋅(d⋅n)−d/3−c7⋅n1/3dexp(−n2/c7). Finally, because 2min{ln2,F(z)}≥1{z<0}, d≤p and D≤p, we thus have
E[1(y⋅FNN(x,β)<0)]≤c8⋅n1/3v2d⋅D⋅[(lnn)4/3⋅ln(pR)]−γopt2⋅2Mk∗(βinitial,X,y),
with probability 1−c8⋅p⋅exp(−n1/3/c8)−c8⋅n1/3dexp(−n2/c8)−c8⋅(d⋅n)−d/3. This then leads to Part (b) of the theorem.
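To show Part (a), suppose that $k^*(\beta^0,X,y)\ge\left\lceil 2M\cdot\frac{T_{n,\lambda}(\beta^0)}{\gamma_{opt}^2}\right\rceil+1$ for the sake of contradiction. Then (98) would imply that
$$T_{n,\lambda}(\beta^{k^*(\beta^0,X,y)})=n^{-1}\sum_{i=1}^{n}F(y_iF_{NN}(x_i,\beta^{k^*(\beta^0,X,y)}))+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k^*(\beta^0,X,y)}|)\le n^{-1}\sum_{i=1}^{n}F(y_iF_{NN}(x_i,\beta^0))+\sum_{j=1}^{p}P_\lambda(|\beta_j^0|)-\frac{\gamma_{opt}^2}{2M}\cdot k^*(\beta^0,X,y)\le T_{n,\lambda}(\beta^0)-\left(\left\lceil 2M\cdot\frac{T_{n,\lambda}(\beta^0)}{\gamma_{opt}^2}\right\rceil+1\right)\cdot\frac{\gamma_{opt}^2}{2M}<0.$$
This contradicts $T_{n,\lambda}(\beta^{k^*(\beta^0,X,y)})\ge 0$ (since $\inf_u F(u)\ge 0$ and $\inf_u P_\lambda(|u|)\ge 0$).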
□
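The contradiction argument for Part (a) amounts to a simple counting bound: each iteration before termination lowers a nonnegative objective by at least $\gamma_{opt}^2/(2M)$. The sketch below computes the resulting cap on the iteration count and confirms that an objective decreased that many times would become negative. The function names and numeric inputs are hypothetical.

```python
import math

def iteration_cap(T0: float, M: float, gamma_opt: float) -> int:
    """ceil(2 M T0 / gamma_opt^2) + 1, the count ruled out in Part (a)."""
    return math.ceil(2.0 * M * T0 / gamma_opt ** 2) + 1

def objective_upper_bound(T0: float, M: float, gamma_opt: float, k: int) -> float:
    """Upper bound on the objective after k iterations, each of which
    decreases the objective by at least gamma_opt^2 / (2 M)."""
    return T0 - k * gamma_opt ** 2 / (2.0 * M)

# Illustrative values: initial objective 3.7, M = 2, gamma_opt = 0.5.
T0, M, gamma_opt = 3.7, 2.0, 0.5
cap = iteration_cap(T0, M, gamma_opt)
value_at_cap = objective_upper_bound(T0, M, gamma_opt, cap)
print(cap, value_at_cap)
```

Since the objective is bounded below by zero, a negative upper bound at `cap` iterations shows that the algorithm must have stopped earlier.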
13.4 Proof of Computational Complexity of Algorithm 1
Proof 13.10
Proof of Theorem 5.1. Note that M≥UL,2. The following is a useful inequality well-known for a function with Lipschitz gradient:
[TABLE]
The KKT conditions for (24) in Step 2 of Algorithm 1 yield that
[TABLE]
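where $\varkappa(\beta_j^{k+\frac{1}{2}})\in\partial|\beta_j^{k+\frac{1}{2}}|$ and $\partial|\beta_j^{k+\frac{1}{2}}|$ is the subdifferential of $|\cdot|$ at $\beta_j^{k+\frac{1}{2}}$. Combining (108) with the objective function of Eq. (24) yields that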
[TABLE]
By the convexity of Pλ′(∣βjk∣)⋅∣t∣ in t for all t∈ℜ and all j, we may continue the above to have
[TABLE]
Invoking (107) with $\beta_1:=\beta^{k}$ and $\beta_2:=\beta^{k+\frac{1}{2}}$, we obtain from the above that
[TABLE]
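Since $P_\lambda(t)$ is concave in $t$ for all $t\ge 0$, we know that $P_\lambda'(|\beta_j^{k}|)\cdot(|\beta_j^{k+\frac{1}{2}}|-|\beta_j^{k}|)\ge P_\lambda(|\beta_j^{k+\frac{1}{2}}|)-P_\lambda(|\beta_j^{k}|)$. Therefore,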
[TABLE]
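Consider the second subproblem (25) in Step 3 of Algorithm 1. Again, because of the inequality in (107), it holds that
$$\begin{aligned}
f(\beta^{k+1})-f(\beta^{k+\frac{1}{2}})+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k+1}|)
&\le\langle\nabla f(\beta^{k+\frac{1}{2}}),\beta^{k+1}-\beta^{k+\frac{1}{2}}\rangle+\frac{M}{2}\|\beta^{k+1}-\beta^{k+\frac{1}{2}}\|^2+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k+1}|)\\
&\le\langle\nabla f(\beta^{k+\frac{1}{2}}),\beta^{k+\frac{1}{2}}-\beta^{k+\frac{1}{2}}\rangle+\frac{M}{2}\|\beta^{k+\frac{1}{2}}-\beta^{k+\frac{1}{2}}\|^2+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k+\frac{1}{2}}|)\\
&=\sum_{j=1}^{p}P_\lambda(|\beta_j^{k+\frac{1}{2}}|),
\end{aligned}$$
where the last inequality is due to the fact that $\beta^{k+1}$ is the minimizer of the subproblem in (25). By some reorganization, we obtain
$f(\beta^{k+1})+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k+1}|)\le f(\beta^{k+\frac{1}{2}})+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k+\frac{1}{2}}|)$. Combining this with (109), we have that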
[TABLE]
Before the termination criterion in (26) is met, it must hold that
[TABLE]
Invoking the above recursively, we have
[TABLE]
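Therefore, there must exist some $k^*$ with $k^*\le\left\lfloor 2M\cdot\frac{\left(f(\beta^0)+\sum_{j=1}^{p}P_\lambda(|\beta_j^0|)\right)-f_\lambda^*}{\gamma_{opt}^2}\right\rfloor+1$ such that $f(\beta^{k^*+1})+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k^*+1}|)>f(\beta^{k^*})+\sum_{j=1}^{p}P_\lambda(|\beta_j^{k^*}|)-\frac{\gamma_{opt}^2}{2M}$. This is because, otherwise, Algorithm 1 would keep reducing the objective value as per (111). Consequently,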
[TABLE]
which contradicts with the definition of fλ∗. This completes the proof for Part (a).
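Suppose, for the sake of contradiction, that $|\beta_j^{k}|\in(0,a\lambda)$ for some $j=1,\ldots,p$ and $k\ge 1$. A global minimal solution to (25) must obey the second-order necessary conditions, which imply that
$$\left[\frac{\partial^2}{\partial\beta_j^2}\left(\frac{1}{2}\langle\nabla f(\beta^{k+\frac{1}{2}}),\beta-\beta^{k+\frac{1}{2}}\rangle+\frac{M}{2}\|\beta-\beta^{k+\frac{1}{2}}\|^2\right)+\frac{\partial^2}{\partial\beta_j^2}P_\lambda(|\beta_j|)\right]_{\beta_j:=\beta_j^{k}}\ge 0.$$
This inequality simplifies equivalently to $M-\frac{1}{a}\ge 0$, which, however, contradicts our assumption that $a<\frac{1}{M}$. As a result, it must hold that $|\beta_j^{k}|\notin(0,a\lambda)$ for all $j=1,\ldots,p$ and all $k\ge 1$. This proves Part (d).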
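The mechanism behind Part (d) can be illustrated on a one-dimensional version of the subproblem. The sketch below assumes the penalty $P_\lambda(t)=\int_0^t\frac{[a\lambda-\theta]_+}{a}d\theta$ from the proof of Lemma 13.13 and a separable quadratic model of the form linear term plus $\frac{M}{2}(\beta-c)^2$; the grid search and all numeric values are illustrative choices rather than the paper's Algorithm 1.

```python
def mcp(t, lam, a):
    """P_lambda(|t|) = integral_0^|t| max(a*lam - theta, 0) / a dtheta."""
    t = abs(t)
    if t <= a * lam:
        return lam * t - t * t / (2.0 * a)
    return a * lam * lam / 2.0

def subproblem(beta, center, grad, M, lam, a):
    """One coordinate of the quadratic-plus-penalty model around `center`."""
    diff = beta - center
    return grad * diff + 0.5 * M * diff * diff + mcp(beta, lam, a)

def grid_minimizer(center, grad, M, lam, a, half_width=3, steps_per_unit=100):
    """Minimize the 1-D subproblem over a uniform grid (includes 0 and a*lam)."""
    n = half_width * steps_per_unit
    grid = [i / steps_per_unit for i in range(-n, n + 1)]
    return min(grid, key=lambda b: subproblem(b, center, grad, M, lam, a))

# a < 1/M, so the model is strictly concave on (0, a*lam): M - 1/a < 0.
M, lam, a = 1.0, 1.0, 0.5
minimizers = [
    grid_minimizer(center=c / 4.0, grad=g / 4.0, M=M, lam=lam, a=a)
    for c in range(-8, 9)
    for g in range(-6, 7)
]
never_inside = all(b == 0.0 or abs(b) >= a * lam for b in minimizers)
print(never_inside)
```

Because the one-dimensional model has second derivative $M-1/a<0$ on $(0,a\lambda)$, its minimum over that interval is attained at an endpoint, which is what the grid search reflects.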
Let k∗ be the iteration count when the algorithm terminates with f(βk∗+1)+∑j=1pPλ(∣βjk∗+1∣)>f(βk∗)+∑j=1pPλ(∣βjk∗∣)−2Mγopt2 being satisfied for the first time. This, combined with (110) and the assumption that γopt≤aλM, implies that
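Part (d) indicates that $\beta_j^{k}\neq 0\implies|\beta_j^{k}|\ge a\lambda$ for all $k\ge 1$. In view of (112), we then know that $|\beta_j^{k^*+\frac{1}{2}}-\beta_j^{k^*}|<a\lambda$ for all $j$. Hence, $\beta_j^{k^*+\frac{1}{2}}>0$ if $\beta_j^{k^*}>0$, and $\partial(|\beta_j^{k^*+\frac{1}{2}}|)=\partial(|\beta_j^{k^*}|)=\{1\}$ for all $j$ with $\beta_j^{k^*}>0$. Likewise, it also holds that $\partial(|\beta_j^{k^*+\frac{1}{2}}|)=\partial(|\beta_j^{k^*}|)=\{-1\}$ for all $j$ with $\beta_j^{k^*}<0$. Furthermore, we also observe that $\varkappa(|\beta_j^{k^*+\frac{1}{2}}|)\in[-1,1]=\partial(|\beta_j^{k^*}|)$ for all $j$ with $\beta_j^{k^*}=0$. In view of (113), we have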
γopt>∇f(βk∗)+(Pλ′(∣βjk∗∣)⋅ϰj,j=1,...,p), for some ϰ:=(ϰj) such that ϰj∈∂(∣βjk∗∣) for all j. We have now proven the satisfaction of the approximate first-order conditions in (27). Further, Part (d) implies that {(k,p):∣βjk∣∈(0,aλ),k≥1,p=1,...,p}=∅. Therefore, as part of the S3ONC, the necessary condition of optimality that UL,∞+Pλ′′(∣βjk∣)≥0 for any (k,p):∣βjk∣∈(0,aλ),k≥1,p=1,...,p is satisfied. We have thus proven Part (b).
Finally, invoking (111), we have the desired inequality of fλ(βk∗)≤fλ(β0), as claimed in Part (c).
□
13.5 Useful Lemmata
Lemma 13.11
Suppose that Assumption 2 holds and that ϵ>0 is an arbitrary scalar.
(a)
For some universal constant c>0,
[TABLE]
(b)
∣E[Ln(β1,Z1n)]−E[Ln(β2,Z1n)]∣≤Cμ⋅ϵ, for all (β1,β2)∈ℜp:∥β1∥∞≤R,∥β2∥∞≤R,∥β1−β2∥≤ϵ.
Proof 13.12
Proof.
This lemma and its proof are straightforward modifications from Shapiro et al. (2014).
To show Part (a), we invoke a Bernstein-like inequality under Assumption 2. Consequently, for all β∈ℜp:∥β∥∞≤R and some universal constant c>0, it holds that P[∑i=1nn1{C(Zi)−E[C(Zi)]}>σL(nt+nt)]≤2exp(−ct),\leavevmode∀t≥0.
With t:=n and E[C(Zi)]≤Cμ (due to Assumption 2), we immediately have
[TABLE]
If we invoke Assumption 2 given the event {∑i=1nnC(Zi)≤2σL+Cμ}, we have that for any (β1,β2)∈ℜp:∥β1∥∞≤R,∥β2∥∞≤R,∥β1−β2∥∞≤ϵ,
[TABLE]
This, combined with (115), yields the desired result in Part (a).
To show Part (b), by Assumption 2, it holds that $\mathbb{E}\big[|L_n(\beta_1,Z_1^n)-L_n(\beta_2,Z_1^n)|\big]\le\mathbb{E}\Big[\sum_{i=1}^{n}\frac{C(Z_i)}{n}\|\beta_1-\beta_2\|\Big]$.
Due to the convexity of the function ∣⋅∣, it therefore holds that
[TABLE]
Invoking Assumption 2 again, it holds that $\mathbb{E}\Big[\sum_{i=1}^{n}\frac{C(Z_i)}{n}\Big]=\frac{\sum_{i=1}^{n}\mathbb{E}[C(Z_i)]}{n}\le C_\mu$.
This combined with (116) immediately leads to the desired result in Part (b).
□
Lemma 13.13
For any fixed Z1n∈Wn, if βℓ1 is a finite optimal solution to the minimization problem minβLn(β,Z1n)+λ∣β∣, then Ln,λ(βℓ1,Z1n)≤Ln,λ(βεA∗,Z1n)+λ∣βεA∗∣.
Proof 13.14
Proof.
Let βεA,j∗ be the j-th dimension of βεA∗. By the definition of βℓ1, it holds that
[TABLE]
Now consider that, for βj (an arbitrarily chosen entry of β), it holds that
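$$P_\lambda(|\beta_j|)=\int_0^{|\beta_j|}\frac{[a\lambda-\theta]_+}{a}\,d\theta\le\int_0^{|\beta_j|}\frac{a\lambda}{a}\,d\theta=\lambda|\beta_j|.$$
This combined with (117) implies that
$L_n(\beta^{\ell_1},Z_1^n)+\sum_{j=1}^{p}P_\lambda(|\beta_j^{\ell_1}|)\le L_n(\beta^*_{\varepsilon_A},Z_1^n)+\lambda|\beta^*_{\varepsilon_A}|\le L_n(\beta^*_{\varepsilon_A},Z_1^n)+\sum_{j=1}^{p}P_\lambda(|\beta^*_{\varepsilon_A,j}|)+\lambda|\beta^*_{\varepsilon_A}|$, which is as claimed.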
□
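The properties of the penalty used throughout this appendix follow directly from its integral definition. The sketch below, using hypothetical function names and illustrative parameter values, compares a closed form of $P_\lambda$ against a midpoint-rule evaluation of the integral and checks three facts used in the proofs: $P_\lambda(t)\le\lambda t$, $P_\lambda(t)\le\frac{a\lambda^2}{2}$, and the tangent inequality implied by concavity.

```python
def mcp(t, lam, a):
    """Closed form of integral_0^t max(a*lam - theta, 0) / a dtheta, t >= 0."""
    if t <= a * lam:
        return lam * t - t * t / (2.0 * a)
    return a * lam * lam / 2.0

def mcp_derivative(t, lam, a):
    """P'_lambda(t) = max(a*lam - t, 0) / a for t >= 0."""
    return max(a * lam - t, 0.0) / a

def mcp_by_quadrature(t, lam, a, cells=2000):
    """Midpoint-rule evaluation of the defining integral."""
    h = t / cells
    return sum(max(a * lam - (i + 0.5) * h, 0.0) / a for i in range(cells)) * h

lam, a = 0.8, 1.5
TOL = 1e-12
grid = [i / 20.0 for i in range(0, 81)]  # t in [0, 4]

closed_form_ok = all(
    abs(mcp(t, lam, a) - mcp_by_quadrature(t, lam, a)) < 1e-4 for t in grid
)
l1_bound_ok = all(mcp(t, lam, a) <= lam * t + TOL for t in grid)
cap_ok = all(mcp(t, lam, a) <= a * lam ** 2 / 2.0 + TOL for t in grid)
# Concavity: P'(s) (t - s) >= P(t) - P(s) for all s, t >= 0.
tangent_ok = all(
    mcp_derivative(s, lam, a) * (t - s) >= mcp(t, lam, a) - mcp(s, lam, a) - TOL
    for s in grid
    for t in grid
)
print(closed_form_ok, l1_bound_ok, cap_ok, tangent_ok)
```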
Theorem 13.15
(Nesterov, 2005)
Let $Q\subset\Re^m$ be a convex and compact set for an integer $m>0$.
Consider a function $f_\mu(\beta,A):=\max_u\{\langle A\beta,u\rangle-\phi(u)-\frac{1}{2}\mu\|u-u_0\|^2:u\in Q\}$ for any $A\in\Re^{m\times p}$, convex and continuous function $\phi:Q\to\Re$, and scalar $\mu>0$. This function is well-defined, continuously differentiable, and convex. Its gradient, given as $\nabla f_\mu(\beta,A)=A^\top u_\mu^*(\beta)$, is Lipschitz continuous with constant $L_\mu(A)=\frac{1}{\mu}\|A\|_{1,2}^2$, where $u_\mu^*(\beta)=\arg\max_u\{\langle A\beta,u\rangle-\phi(u)-\frac{1}{2}\mu\|u-u_0\|^2:u\in Q\}$.
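A one-dimensional instance makes the smoothing construction concrete. The sketch below takes $Q=[-1,1]$, $A=1$, $\phi\equiv 0$, and $u_0=0$ (all illustrative assumptions), in which case the unsmoothed function is $|\beta|$ and $f_\mu$ reduces to the Huber function. It checks that the smoothing gap lies in $[0,\mu/2]$ and that the gradient is $\frac{1}{\mu}$-Lipschitz on a grid.

```python
def smoothed_abs(beta, mu):
    """f_mu(beta) = max over |u| <= 1 of (beta*u - 0.5*mu*u^2)."""
    u_star = max(-1.0, min(1.0, beta / mu))  # maximizer, clipped to Q
    return beta * u_star - 0.5 * mu * u_star ** 2

def smoothed_abs_grad(beta, mu):
    """Gradient of f_mu, equal to the maximizer u*_mu(beta)."""
    return max(-1.0, min(1.0, beta / mu))

mu = 0.1
TOL = 1e-12
grid = [i / 50.0 for i in range(-100, 101)]  # beta in [-2, 2]

# Smoothing gap: 0 <= |beta| - f_mu(beta) <= mu / 2.
gap_ok = all(
    -TOL <= abs(b) - smoothed_abs(b, mu) <= 0.5 * mu + TOL for b in grid
)

# Gradient is Lipschitz continuous with constant 1 / mu.
lipschitz_ok = all(
    abs(smoothed_abs_grad(b1, mu) - smoothed_abs_grad(b2, mu))
    <= abs(b1 - b2) / mu + TOL
    for b1 in grid
    for b2 in grid
)
print(gap_ok, lipschitz_ok)
```

In this instance the maximal gap $\mu/2$ equals $\frac{\mu}{2}\max_{u\in Q}\|u-u_0\|^2$, so a smaller $\mu$ tightens the approximation at the cost of a larger Lipschitz constant.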
Theorem 13.17
(Mhaskar, 1996)
Consider an arbitrary function g∈Fd,r. Let Ψ be an activation function that satisfies Assumption 9.1. There exist W∈ℜN×d, w∈ℜN, and b∈ℜN such that
[TABLE]
Proof 13.18
Proof.
The desired result is an immediate implication of Theorem 2.1 by Mhaskar (1996), where we set the quantities “p”, “d”, “s”, “r”, and “Wr,sp” in Mhaskar (1996) to be ∞, 1, d, r, and Fd,r, respectively, in this paper. □
Lemma 13.19
Suppose that Assumption 9.2 holds. Let K∗ be any integer such that K∗≥d⋅ln(d⋅K∗), let ξ follow the d-variate standard normal distribution, and let ξk, k=1,...,K∗, be a sequence of i.i.d. random samples of ξ. Then,
[TABLE]
Proof 13.20
Proof. Our proof below is divided into two steps, where we let c0,c1,... be some universal constants.
Step 1.
For a fixed $x\in X$, consider a random variable defined as $G_x(\xi):=C_g(\xi)\cdot\max\{0,x^\top\xi\}$, where $\xi$ is a $d$-variate standard normal random vector (and thus its entries are i.i.d.). By Assumption 9.2, $g(x)=\mathbb{E}_\xi[G_x(\xi)]$, where $\mathbb{E}_\xi$ denotes the expectation over $\xi$. We show in Step 1 that $G_x(\xi)-g(x)$ is a subexponential random variable.
Because $\|x\|=1$ and $\xi$ has i.i.d. standard normal entries, $\xi^\top x$ is a standard normal random variable (and thus it is subgaussian). By the properties of a subgaussian random variable, $\|\xi^\top x\|_{\psi_2}\le c_0$ and $\mathbb{P}[|\xi^\top x|\ge t]\le 2\exp(-c_1\cdot t^2/c_0)$, for any $t\ge 0$. Therefore, $\mathbb{P}[\max\{0,\xi^\top x\}\ge t]\le 2\exp(-c_1\cdot t^2/c_0)$, for any $t\ge 0$. By the definition of the subgaussian norm, we know that $\|\max\{0,\xi^\top x\}\|_{\psi_2}\le c_2$. Because $\sup_{\xi'}|C_g(\xi')|\le 1$ according to Assumption 9.2, invoking Lemma 2.7.7 of Vershynin (2018), we have $\|C_g(\xi)\cdot\max\{0,\xi^\top x\}\|_{\psi_1}\le\|C_g(\xi)\|_{\psi_2}\cdot\|\max\{0,\xi^\top x\}\|_{\psi_2}\le c_3$, which further leads to $\|C_g(\xi)\cdot\max\{0,\xi^\top x\}-\mathbb{E}_\xi[C_g(\xi)\cdot\max\{0,\xi^\top x\}]\|_{\psi_1}=\|G_x(\xi)-g(x)\|_{\psi_1}\le c_4$. Thus, $G_x(\xi)-g(x)$ is subexponential for a fixed $x\in X$, as desired in this step.
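Step 2. This step combines the result from Step 1 and the $\epsilon$-net argument to prove (118) as desired. In doing so, for any $\epsilon\in(0,1]$, we construct a net of grids $B_\epsilon$ such that, for any $x\in X$, there exists $z\in B_\epsilon$ with $\|x-z\|\le\frac{\epsilon}{(\sqrt{5}+1)\cdot\sqrt{d}}$. To that end, it suffices to involve as many as $|B_\epsilon|:=\left\lceil\frac{(\sqrt{5}+1)d}{\epsilon}\right\rceil^{d}\le\left[\frac{2(\sqrt{5}+1)d}{\epsilon}\right]^{d}$ grids.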
Consider the following two sets
[TABLE]
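Because $\Big|\frac{1}{K^*}\sum_{k=1}^{K^*}\max\{0,\xi_k^\top x_1\}-\frac{1}{K^*}\sum_{k=1}^{K^*}\max\{0,\xi_k^\top x_2\}\Big|\le\frac{1}{K^*}\sum_{k=1}^{K^*}\|\xi_k\|\cdot\|x_1-x_2\|\le\sqrt{\frac{1}{K^*}\sum_{k=1}^{K^*}\|\xi_k\|^2}\cdot\|x_1-x_2\|$ for any $x_1,x_2\in X$, we have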
[TABLE]
Further noticing that supξ∣Cg(ξ)∣≤1 as per Assumption 9.2, we then have
[TABLE]
We may then continue with the ϵ-net argument to obtain that, given the event E1∩E2, for any x∈X, there exists z∈Bϵ:∥z−x∥≤(5+1)⋅dϵ such that
[TABLE]
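Here (120) is due to (119) and the observation that $(\mathbb{E}_\xi[\|\xi\|])^2\le\mathbb{E}_\xi[\|\xi\|^2]=d$, where the latter is based on the fact that $\|\xi\|^2$ follows the $\chi^2$ distribution with $d$ degrees of freedom. We may then continue to obtain that, given $E_1\cap E_2$, it holds that
$$\Big|\frac{1}{K^*}\sum_{k=1}^{K^*}G_x(\xi_k)-\mathbb{E}_\xi[G_x(\xi)]\Big|\le c_5\cdot\left(\sqrt{\frac{t}{K^*}}+\frac{t}{K^*}\right)+(\sqrt{5}+1)\sqrt{d}\,\|z-x\|\le c_5\cdot\left(\sqrt{\frac{t}{K^*}}+\frac{t}{K^*}\right)+\epsilon.$$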
We now establish the probability for E1∩E2. As an immediate implication of Step 1, a Bernstein-like inequality holds, for any fixed x∈Bϵ, as below:
[TABLE]
Together with ∣Bϵ∣:=[ϵ2⋅(5+1)d]d, the above inequality implies that
[TABLE]
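In establishing the probability of $E_2$, we observe that $\xi_k$ follows the $d$-variate standard Gaussian distribution. Thus, $\sum_{k=1}^{K^*}\|\xi_k\|^2$ follows a $\chi^2$ distribution with $d\cdot K^*$ degrees of freedom. A well-known tail bound for the $\chi^2$ distribution yields that $\mathbb{P}\big[\sum_{k=1}^{K^*}\|\xi_k\|^2\le dK^*\cdot(1+2\sqrt{t}+2t)\big]\ge 1-\exp(-dtK^*)$. This further implies that $\mathbb{P}[E_2]=\mathbb{P}\big[\frac{1}{K^*}\sum_{k=1}^{K^*}\|\xi_k\|^2\le 5d\big]\ge 1-\exp(-d\cdot K^*)$. Thus, combining the above by invoking the union bound and De Morgan's law, for any $\epsilon>0$, we have that $\mathbb{P}[E_1\cap E_2]\ge 1-\left[\frac{2(\sqrt{5}+1)d}{\epsilon}\right]^{d}\cdot\exp(-t)-\exp(-d\cdot K^*)$.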
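The $\chi^2$ tail bound behind the probability of $E_2$ can be illustrated by simulation. The sketch below draws sums of $d\cdot K^*$ squared standard normals and compares the empirical frequency of exceeding $dK^*(1+2\sqrt{t}+2t)$ with $\exp(-dtK^*)$ at $t=1$. The sample sizes, seed, and function names are illustrative assumptions, and a Monte Carlo check of this kind is only a sanity test, not a proof.

```python
import math
import random

def chi_square_threshold(d, k_star, t):
    """d*K* (1 + 2 sqrt(t) + 2 t): the level in the chi-square tail bound."""
    dof = d * k_star
    return dof * (1.0 + 2.0 * math.sqrt(t) + 2.0 * t)

def empirical_exceedance(d, k_star, t, trials, seed=0):
    """Fraction of trials in which sum_k ||xi_k||^2 exceeds the threshold."""
    rng = random.Random(seed)
    threshold = chi_square_threshold(d, k_star, t)
    exceed = 0
    for _ in range(trials):
        total = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d * k_star))
        if total > threshold:
            exceed += 1
    return exceed / trials

d, k_star, t = 2, 4, 1.0
rate = empirical_exceedance(d, k_star, t, trials=20000)
bound = math.exp(-d * k_star * t)
print(rate, bound)
```

With $t=1$ the threshold is $5dK^*$, which corresponds to the event $\frac{1}{K^*}\sum_k\|\xi_k\|^2\le 5d$ used to define $E_2$.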
Therefore, for any ϵ>0,
[TABLE]
We may as well let ϵ=1/K∗ and t=2dln[ϵ2⋅(5+1)d]=2dln(2(5+1)d⋅K∗). Consequently (and in view of the assumption that K∗≥dln(dK∗)), (123) is reduced to
[TABLE]
which (combined with y∈{−1,1}) further leads to
[TABLE]
which is the desired result. □
Bibliography
1. Alford et al. [2018] S. Alford, R. Robinett, L. Milechin, and J. Kepner. Pruned and Structurally Sparse Neural Networks. arXiv:1810.00299, 2018.
2. Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pages 6155–6166, 2019.
3. ApS [2015] M. ApS. The mosek optimization toolbox for matlab manual. version 7.1 (revision 28) online, 2015. URL http://docs.mosek.com/7.1/toolbox/index.html.
4. Barron and Klusowski [2018] A. R. Barron and J. M. Klusowski. Approximation and estimation for high-dimensional deep learning networks. arXiv preprint arXiv:1809.03090, 2018.
5. Bartlett et al. [2006] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
6. Bartlett et al. [2017] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
7. Belloni and Chernozhukov [2011] A. Belloni and V. Chernozhukov. ℓ1-penalized quantile regression in high-dimensional sparse models. Annals of Statistics, 39(1):82–130, 2011.
8. Berner et al. [2019] J. Berner, D. Elbrächter, P. Grohs, and A. Jentzen. Towards a regularity theory for relu networks–chain rule and global error estimates. In 2019 13th International Conference on Sampling Theory and Applications (SampTA), pages 1–5. IEEE, 2019.