Sharp Oracle Inequalities for Low-complexity Priors
Tung Duy Luu, Jalal Fadili, Christophe Chesneau

TL;DR
This paper establishes sharp oracle inequalities for high-dimensional estimators like Lasso and nuclear norm penalties, demonstrating their theoretical performance guarantees under various data loss functions and priors.
Contribution
It provides a unified analysis of exponential weighted aggregation and penalized estimators with general priors, highlighting their performance and differences in high-dimensional settings.
Findings
Sharp oracle inequalities for Lasso, group Lasso, and nuclear norm penalties.
Theoretical guarantees for estimators under various data loss functions.
Efficient implementation via proximal splitting algorithms.
Tung Duy Luu Normandie Univ, ENSICAEN, CNRS, GREYC, France, Email: {duy-tung.luu, Jalal.Fadili}@ensicaen.fr.
Jalal Fadili Normandie Univ, ENSICAEN, CNRS, GREYC, France.
Christophe Chesneau Normandie Univ, UNICAEN, CNRS, LMNO, France, Email: [email protected].
Abstract
In this paper, we consider a high-dimensional statistical estimation problem in which the number of parameters is comparable or larger than the sample size. We present a unified analysis of the performance guarantees of exponential weighted aggregation and penalized estimators with a general class of data losses and priors which encourage objects which conform to some notion of simplicity/complexity. More precisely, we show that these two estimators satisfy sharp oracle inequalities for prediction ensuring their good theoretical performances. We also highlight the differences between them. When the noise is random, we provide oracle inequalities in probability using concentration inequalities. These results are then applied to several instances including the Lasso, the group Lasso, their analysis-type counterparts, the $\ell_\infty$ and the nuclear norm penalties. All our estimators can be efficiently implemented using proximal splitting algorithms.
Key words. High-dimensional estimation, exponential weighted aggregation, penalized estimation, oracle inequality, low-complexity models.
AMS subject classifications. 62G07 62G20
1 Introduction
1.1 Problem statement
Our statistical context is the following. Let $\boldsymbol{y}=(\boldsymbol{y}_1,\ldots,\boldsymbol{y}_n)$ be $n$ identically distributed observations with common marginal distribution, and $\boldsymbol{X}\in\mathbb{R}^{n\times p}$ a deterministic design matrix. The goal is to estimate a parameter vector $\boldsymbol{\theta}_0\in\mathbb{R}^p$ of the observations' marginal distribution based on the data $\boldsymbol{y}$ and $\boldsymbol{X}$.
Let $F$ be a loss function supposed to be smooth and convex that assigns to each $\boldsymbol{\theta}\in\mathbb{R}^p$ a cost $F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})$. Let $\boldsymbol{\theta}_0\in\operatorname*{Argmin}_{\boldsymbol{\theta}\in\mathbb{R}^p}\mathbb{E}\left[F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})\right]$ be any minimizer of the population risk. We regard $\boldsymbol{\theta}_0$ as the true parameter. A usual instance of this statistical setting is the standard linear regression model based on $n$ pairs of response-covariate that are linked linearly, $\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\theta}_0+\boldsymbol{\xi}$ with $\boldsymbol{\xi}$ the noise, and $F(\boldsymbol{u},\boldsymbol{y})=\tfrac{1}{2}\|\boldsymbol{y}-\boldsymbol{u}\|_2^2$.
Our goal is to provide general oracle inequalities in prediction for two estimators of $\boldsymbol{\theta}_0$: the penalized estimator and exponential weighted aggregation. In the setting where $p$ is larger than $n$ (possibly much larger), the estimation problem is ill-posed since the rectangular matrix $\boldsymbol{X}$ has a kernel of dimension at least $p-n$. To circumvent this difficulty, we will exploit the prior that $\boldsymbol{\theta}_0$ has some low-complexity structure (among which sparsity and low-rank are the most popular). That is, even if the ambient dimension $p$ of $\boldsymbol{\theta}_0$ is very large, its intrinsic dimension is much smaller than the sample size $n$. This makes it possible to build estimates with good provable performance guarantees under appropriate conditions. There has been a flurry of research on the use of low-complexity regularization in ill-posed recovery problems in various areas including statistics and machine learning.
1.2 Variational/Penalized Estimators
Regularization is now a central theme in many fields including statistics, machine learning and inverse problems. It allows one to impose on the set of candidate solutions some prior structure on the object to be estimated. This regularization ranges from squared Euclidean or Hilbertian norms to non-Hilbertian norms (e.g. the $\ell_1$ norm for sparse objects, or the nuclear norm for low-rank matrices) that have sparked considerable interest in recent years. In this paper, we consider the class of estimators obtained by solving the convex optimization problem (to avoid trivialities, the set of minimizers is assumed non-empty, which holds for instance if $J$ is also coercive)
$$\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}\in\operatorname*{Argmin}_{\boldsymbol{\theta}\in\mathbb{R}^p}\Big\{V_n(\boldsymbol{\theta}):=\tfrac{1}{n}F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})+\lambda_n J(\boldsymbol{\theta})\Big\},\qquad(1.1)$$
where the regularizing penalty $J$ is a proper closed convex function that promotes some specific notion of simplicity/low-complexity, and $\lambda_n>0$ is the regularization parameter. A prominent member covered by (1.1) is the Lasso [13, 57, 42, 23, 8, 5, 7, 33] and its variants such as the analysis/fused Lasso [52, 58], SLOPE [6, 54] or group Lasso [2, 76, 1, 73]. Another example is the nuclear norm minimization for low rank matrix recovery motivated by various applications including robust PCA, phase retrieval, control and computer vision [45, 10, 28, 11]. See [40, 7, 67, 64] for generalizations and comprehensive reviews.
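As an illustration of how a member of (1.1) can be computed, here is a minimal sketch of forward-backward (proximal gradient) iterations for the Lasso with quadratic loss. The function names, the synthetic data and the value of the regularization parameter are assumptions made for the example; they are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_objective(X, y, theta, lam):
    """(1/n) * 0.5 * ||y - X theta||_2^2 + lam * ||theta||_1."""
    n = X.shape[0]
    return 0.5 * np.sum((y - X @ theta) ** 2) / n + lam * np.sum(np.abs(theta))

def lasso_forward_backward(X, y, lam, n_iter=500):
    """Forward-backward iterations for the Lasso instance of the penalized problem."""
    n, p = X.shape
    # Step size 1/L, with L = ||X||^2 / n the Lipschitz constant of the gradient
    # of the smooth data-fidelity term.
    step = n / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n                         # forward (gradient) step
        theta = soft_threshold(theta - step * grad, step * lam)  # backward (proximal) step
    return theta

# Hypothetical synthetic data: n = 50 observations, p = 200 parameters, 5 nonzeros.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
theta0 = np.zeros(200)
theta0[:5] = 3.0
y = X @ theta0 + 0.1 * rng.standard_normal(50)
theta_hat = lasso_forward_backward(X, y, lam=0.1)
```

With the step size 1/L the objective is non-increasing along the iterations, so the returned point is never worse than the zero initialization.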
1.3 Exponential Weighted Aggregation (EWA)
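An alternative to the variational estimator (1.1) is the aggregation by exponential weighting, which consists in substituting averaging for minimization. The aggregators are defined via the probability density function
$$\mu_n(\boldsymbol{\theta})=\frac{\exp\left(-V_n(\boldsymbol{\theta})/\beta\right)}{\int_{\Theta}\exp\left(-V_n(\boldsymbol{\omega})/\beta\right)d\boldsymbol{\omega}},\qquad(1.2)$$
where $\beta>0$ is called temperature parameter. If all $\boldsymbol{\theta}$ are candidates to estimate the true vector $\boldsymbol{\theta}_0$, then $\Theta=\mathbb{R}^p$. The aggregate is thus defined by
$$\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}=\int_{\mathbb{R}^p}\boldsymbol{\theta}\,\mu_n(\boldsymbol{\theta})\,d\boldsymbol{\theta}.\qquad(1.3)$$
Aggregation by exponential weighting has been widely considered in the statistical and machine learning literatures, see e.g. [20, 17, 16, 21, 41, 74, 46, 35, 29, 26] to name a few. $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ can also be interpreted as the posterior conditional mean in the Bayesian sense if $F/(n\beta)$ is the negative log-likelihood associated to the noise, with the prior density proportional to $\exp\left(-\lambda_n J(\boldsymbol{\theta})/\beta\right)$.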
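To make the "averaging instead of minimizing" idea concrete, the following sketch computes the exponentially weighted aggregate in a one-dimensional toy problem by numerical quadrature. The setting (a single observation, identity design, quadratic loss, absolute-value penalty), the grid parameters and the function names are assumptions chosen for illustration only.

```python
import numpy as np

def ewa_1d(y, lam, beta, half_width=20.0, n_grid=200001):
    """Mean of the density proportional to exp(-V(theta)/beta) on a uniform grid,
    with V(theta) = 0.5 * (y - theta)^2 + lam * |theta|."""
    theta = np.linspace(-half_width, half_width, n_grid)
    V = 0.5 * (y - theta) ** 2 + lam * np.abs(theta)
    w = np.exp(-(V - V.min()) / beta)      # shift by min(V) for numerical stability
    return np.sum(theta * w) / np.sum(w)   # uniform grid: the spacing cancels out

def penalized_1d(y, lam):
    """Minimizer of the same V: soft-thresholding of y at level lam."""
    return np.sign(y) * max(abs(y) - lam, 0.0)

y_obs, lam = 2.0, 1.0
pen = penalized_1d(y_obs, lam)
ewa_cold = ewa_1d(y_obs, lam, beta=1e-3)   # low temperature: mass concentrates near the minimizer
ewa_warm = ewa_1d(y_obs, lam, beta=1.0)    # higher temperature: a genuine average
```

In this toy case the low-temperature aggregate is numerically close to the penalized solution, while the warmer one differs from it.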
1.4 Oracle inequalities
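Oracle inequalities, which are at the heart of our work, quantify the quality of an estimator compared to the best possible one among a family of estimators. These inequalities are well adapted in the scenario where the prior penalty promotes some notion of low-complexity (e.g. sparsity, low rank, etc.). Given two vectors $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, let $R_n(\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)$ be a nonnegative error measure between their predictions, respectively $\boldsymbol{X}\boldsymbol{\theta}_1$ and $\boldsymbol{X}\boldsymbol{\theta}_2$. A popular example is the averaged prediction squared error $\tfrac{1}{n}\|\boldsymbol{X}\boldsymbol{\theta}_1-\boldsymbol{X}\boldsymbol{\theta}_2\|_2^2$, where $\|\cdot\|_2$ is the $\ell_2$ norm. $R_n$ will serve as a measure of the performance of the estimators $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ and $\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}$. More precisely, we aim to prove that $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ and $\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}$ mimic as much as possible the best possible model. This idea is materialized in the following type of inequalities (stated here for EWA)
$$R_n\big(\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}},\boldsymbol{\theta}_0\big)\leq C\inf_{\boldsymbol{\theta}\in\mathbb{R}^p}\Big(R_n(\boldsymbol{\theta},\boldsymbol{\theta}_0)+\Delta_{n,p,\lambda_n,\beta}(\boldsymbol{\theta})\Big),\qquad(1.4)$$
where $C\geq 1$ is the leading constant of the oracle inequality and the remainder term $\Delta_{n,p,\lambda_n,\beta}(\boldsymbol{\theta})$ depends on the performance of the estimator, the complexity of $\boldsymbol{\theta}$, the sample size $n$, the dimension $p$, and the regularization and temperature parameters $(\lambda_n,\beta)$. An estimator with good oracle properties would correspond to $C$ close to $1$ (ideally, $C=1$, in which case the inequality is said to be "sharp") and a remainder $\Delta_{n,p,\lambda_n,\beta}(\boldsymbol{\theta})$ that is small and decreases rapidly to $0$ as $n\to+\infty$.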
1.5 Contributions
We provide a unified analysis where we capture the essential ingredients behind the low-complexity priors promoted by , relying on sophisticated arguments from convex analysis and our previous work [27, 63, 65, 62, 64]. Our main contributions are summarized as follows:
- We show that the EWA estimator in (1.2) and the variational/penalized estimator in (1.1) satisfy (deterministic) sharp oracle inequalities for prediction with optimal remainder term, for general data losses beyond the usual quadratic one, when $J$ is a proper finite-valued sublinear function (i.e. $J$ is finite-valued, convex and positively homogeneous). We also highlight the differences between the two estimators in terms of the corresponding bounds.
- When the observations are random, we prove oracle inequalities in probability. The theory is non-asymptotic in nature, as it yields explicit bounds that hold with high probability for finite sample sizes, and reveals the dependence on dimension and other structural parameters of the model.
- For the standard linear model with Gaussian or sub-Gaussian noise, and a quadratic loss, we deliver refined versions of these oracle inequalities in probability. We underscore the role of the Gaussian width, a concept that captures important geometric characteristics of sets in $\mathbb{R}^n$.
- These results naturally yield a large number of corollaries when specialized to penalties routinely used in the literature, among which the Lasso, the group Lasso, their analysis-type counterparts (fused (group) Lasso), the $\ell_\infty$ and the nuclear norms. Some of these corollaries are known and others are novel.
The estimators $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ and $\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}$ can be easily implemented thanks to the framework of proximal splitting methods, and more precisely forward-backward type splitting. While the latter is well-known to solve (1.1) [64], its application within a proximal Langevin Monte-Carlo algorithm to compute $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ with provable guarantees has been recently developed by the authors in [26] to sample from log-semiconcave densities (in a forthcoming paper, this framework was extended to cover the even more general class of prox-regular functions), see also [25] for log-concave densities.
1.6 Relation to previous work
Our oracle inequality for $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ extends the work of [18] with an unprecedented level of generality, far beyond the Lasso and the nuclear norm. Our prediction sharp oracle inequality for $\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}$ specializes to that of [55] in the case of the Lasso (see also the discussion in [19] and references therein) and that of [34] for the case of the nuclear norm. Our work also goes much beyond that in [67] on weakly decomposable priors, where we show in particular that there is no need to impose decomposability on the regularizer, since it is rather an intrinsic property of it.
1.7 Paper organization
Section 2 states our main assumptions on the data loss and the prior penalty. All the concepts and notions are exemplified on some penalties some of which are popular in the literature. In Section 3, we prove our main oracle inequalities, and their versions in probability. We then tackle the case of linear regression with quadratic data loss in Section 4. Concepts from convex analysis that are essential to this work are gathered in Section A. A key intermediate result in the proof of our main results is established in Section B with an elegant argument relying on Moreau-Yosida regularization.
1.8 Notations
Vectors and matrices
For a -dimensional Euclidean space , we endow it with its usual inner product and associated norm . is the identity matrix on . For , will denote the norm of a vector with the usual adaptation for .
In the following, if is a vector space, denotes the orthogonal projector on , and
[TABLE]
For a finite set we denote \big{|}{\mathcal{C}}\big{|} its cardinality. For , we denote by its complement. is the subvector whose entries are those of restricted to the indices in , and the submatrix whose columns are those of indexed by . For any matrix , denotes its transpose and its Moore-Penrose pseudo-inverse. For a linear operator , is its adjoint.
Sets
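For a nonempty set $\mathcal{C}\subset\mathbb{R}^p$, we denote $\overline{\mathrm{conv}}(\mathcal{C})$ the closure of its convex hull, and $\iota_{\mathcal{C}}$ its indicator function, i.e. $\iota_{\mathcal{C}}(\boldsymbol{\theta})=0$ if $\boldsymbol{\theta}\in\mathcal{C}$ and $+\infty$ otherwise. For a nonempty convex set $\mathcal{C}$, its affine hull $\operatorname{aff}(\mathcal{C})$ is the smallest affine manifold containing it. It is a translate of its parallel subspace $\operatorname{par}(\mathcal{C})$, i.e. $\operatorname{par}(\mathcal{C})=\operatorname{aff}(\mathcal{C})-\boldsymbol{\theta}$ for any $\boldsymbol{\theta}\in\mathcal{C}$. The relative interior $\operatorname{ri}(\mathcal{C})$ of a convex set $\mathcal{C}$ is the interior of $\mathcal{C}$ for the topology relative to its affine hull.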
Functions
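A function $f:\mathbb{R}^p\to\mathbb{R}\cup\{+\infty\}$ is closed (or lower semicontinuous) if so is its epigraph. It is coercive if $\lim_{\|\boldsymbol{\theta}\|_2\to+\infty}f(\boldsymbol{\theta})=+\infty$, and strongly coercive if $\lim_{\|\boldsymbol{\theta}\|_2\to+\infty}f(\boldsymbol{\theta})/\|\boldsymbol{\theta}\|_2=+\infty$. The effective domain of $f$ is $\operatorname*{dom}(f)=\{\boldsymbol{\theta}\in\mathbb{R}^{p}\;:\;f(\boldsymbol{\theta})<+\infty\}$ and $f$ is proper if $\operatorname*{dom}(f)\neq\emptyset$, as is the case when it is finite-valued. A function is said to be sublinear if it is convex and positively homogeneous. The Legendre-Fenchel conjugate of $f$ is $f^*(\boldsymbol{\omega})=\sup_{\boldsymbol{\theta}\in\mathbb{R}^p}\langle\boldsymbol{\omega},\boldsymbol{\theta}\rangle-f(\boldsymbol{\theta})$. For $f$ proper, the functions $(f,f^*)$ obey the Fenchel-Young inequality
$$f(\boldsymbol{\theta})+f^*(\boldsymbol{\omega})\geq\langle\boldsymbol{\omega},\boldsymbol{\theta}\rangle,\quad\forall(\boldsymbol{\omega},\boldsymbol{\theta})\in\mathbb{R}^p\times\mathbb{R}^p.\qquad(1.5)$$
When $f$ is a proper lower semicontinuous and convex function, $(f,f^*)$ is actually the best pair for which this inequality cannot be tightened. For a function $g$ on $\mathbb{R}_+$, the function $g^+(a)=\sup_{t\geq 0}\,at-g(t)$ is called the monotone conjugate of $g$. The pair $(g,g^+)$ obviously obeys (1.5) on $\mathbb{R}_+\times\mathbb{R}_+$.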
For a -smooth function , is its (Euclidean) gradient. For a bivariate function that is with respect to the first variable , for any , we will denote the gradient of at with respect to the first variable.
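The subdifferential of a convex function $f$ at $\boldsymbol{\theta}$ is the set
$$\partial f(\boldsymbol{\theta})=\big\{\boldsymbol{\eta}\in\mathbb{R}^p\;:\;f(\boldsymbol{\theta}')\geq f(\boldsymbol{\theta})+\langle\boldsymbol{\eta},\boldsymbol{\theta}'-\boldsymbol{\theta}\rangle,\quad\forall\boldsymbol{\theta}'\in\operatorname*{dom}(f)\big\}.$$
An element of $\partial f(\boldsymbol{\theta})$ is a subgradient. If the convex function $f$ is differentiable at $\boldsymbol{\theta}$, then its only subgradient is its gradient, i.e. $\partial f(\boldsymbol{\theta})=\{\nabla f(\boldsymbol{\theta})\}$.
The Bregman divergence associated to a convex function $f$ at $\boldsymbol{\theta}$ with respect to $\boldsymbol{\eta}\in\partial f(\boldsymbol{\theta})\neq\emptyset$ is
$$D_f^{\boldsymbol{\eta}}\big(\overline{\boldsymbol{\theta}},\boldsymbol{\theta}\big)=f(\overline{\boldsymbol{\theta}})-f(\boldsymbol{\theta})-\langle\boldsymbol{\eta},\overline{\boldsymbol{\theta}}-\boldsymbol{\theta}\rangle.$$
The Bregman divergence is in general nonsymmetric. It is also nonnegative by convexity. When $f$ is differentiable at $\boldsymbol{\theta}$, we simply write $D_f\big(\overline{\boldsymbol{\theta}},\boldsymbol{\theta}\big)$ (which is, in this case, also known as the Taylor distance).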
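The two properties just stated (nonnegativity and lack of symmetry) can be checked numerically. The sketch below assumes the standard form of the Bregman divergence, f(θ̄) − f(θ) − ⟨η, θ̄ − θ⟩ with η a subgradient of f at θ; the helper name `bregman` and the test vectors are hypothetical.

```python
import numpy as np

def bregman(f, eta, theta_bar, theta):
    """f(theta_bar) - f(theta) - <eta, theta_bar - theta>,
    where eta is assumed to be a subgradient of f at theta."""
    return f(theta_bar) - f(theta) - eta @ (theta_bar - theta)

l1 = lambda v: np.sum(np.abs(v))   # nonsmooth convex function
sq = lambda v: 0.5 * v @ v         # smooth convex function, gradient at v is v

theta = np.array([1.0, -2.0, 0.0])
theta_bar = np.array([-1.0, 1.0, 3.0])

# Smooth case: the divergence reduces to half the squared Euclidean distance.
d_sq = bregman(sq, theta, theta_bar, theta)
# Nonsmooth case: sign(.) is one valid subgradient of the l1 norm.
d_l1 = bregman(l1, np.sign(theta), theta_bar, theta)
d_l1_rev = bregman(l1, np.sign(theta_bar), theta, theta_bar)
```

For these vectors the two ℓ1 divergences are both nonnegative but take different values, which illustrates the asymmetry.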
2 Estimation with low-complexity penalties
The estimators $\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}$ and $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ in (1.1) and (1.3) require two essential ingredients: the data loss term $F$ and the prior penalty $J$. We here specify the class of such functions covered in our work, and provide illustrating examples.
2.1 Data loss
The class of loss functions that we consider obey the following assumptions:
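- (H.1) $F(\cdot,\boldsymbol{y})$ is $C^1$ and uniformly convex for all $\boldsymbol{y}$ of modulus $\varphi$, i.e.
$$F(\boldsymbol{v},\boldsymbol{y})\geq F(\boldsymbol{u},\boldsymbol{y})+\langle\nabla F(\boldsymbol{u},\boldsymbol{y}),\boldsymbol{v}-\boldsymbol{u}\rangle+\varphi\big(\|\boldsymbol{v}-\boldsymbol{u}\|_2\big),$$
where $\varphi:\mathbb{R}_+\to\mathbb{R}_+$ is a convex non-decreasing function that vanishes only at $0$.
- (H.2) For any $\overline{\boldsymbol{\theta}}\in\mathbb{R}^p$ and $\boldsymbol{y}\in\mathbb{R}^n$, $\int_{\mathbb{R}^{p}}\exp\left(-F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})/(n\beta)\right)\big|\langle\nabla F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y}),\boldsymbol{X}(\overline{\boldsymbol{\theta}}-\boldsymbol{\theta})\rangle\big|\,d\boldsymbol{\theta}<+\infty$.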
Recall that by Lemma A.1, the monotone conjugate $\varphi^+$ of $\varphi$ is a proper, closed, convex, strongly coercive and non-decreasing function on $\mathbb{R}_+$ that vanishes at $0$. Moreover, $\varphi^{++}=\varphi$. $\varphi^+$ is finite-valued on $\mathbb{R}_+$ if $\varphi$ is strongly coercive, and it vanishes only at $0$ under e.g. Lemma A.1(iii).
The class of data loss functions in (H.1) is fairly general. It is reminiscent of the negative log-likelihood in the regular exponential family. For the moment assumption (H.2) to be satisfied, it is sufficient that
[TABLE]
where be a minimizer of , which is unique by uniform convexity. We here provide an example.
Example 2.1.
Consider the case where $\varphi(t)=t^q/q$, $q\in]1,+\infty[$, or equivalently $\varphi^+(t)=t^{q_*}/q_*$ where $1/q+1/q_*=1$ (we consider a scaled version of $\varphi$ for simplicity, but the same conclusions remain valid if we take $\varphi(t)=Ct^q/q$, with $C>0$). For $q=2$, (H.1) amounts to saying that $F(\cdot,\boldsymbol{y})$ is strongly convex for all $\boldsymbol{y}$. In particular, [3, Proposition 10.13] shows that $F(\boldsymbol{u},\boldsymbol{y})=\|\boldsymbol{u}-\boldsymbol{y}\|_2^q/q$ is uniformly convex for $q\in[2,+\infty[$ with modulus $\varphi(t)=C_q t^q/q$, where $C_q>0$ is a constant that depends solely on $q$.
For (H.2) to be verified, it is sufficient that
[TABLE]
In particular, taking F({\boldsymbol{u}},\boldsymbol{y})=\big{\|}{\boldsymbol{u}}-\boldsymbol{y}\big{\|}_{2}^{q}/q, , we have \big{\|}\nabla F({\boldsymbol{u}},\boldsymbol{y})\big{\|}_{2}=\big{\|}{\boldsymbol{u}}-\boldsymbol{y}\big{\|}_{2}^{q-1}, and thus (H.2) holds since
[TABLE]
2.2 Prior penalty
Recall the main definitions and results from convex analysis that are collected in Section A. Our main assumption on is the following.
- (H.3) $J$ is the gauge of a non-empty convex compact set containing the origin as an interior point.
By Lemma A.3, this assumption is equivalent to saying that $J$ is proper, convex, positively homogeneous, finite-valued and coercive. In turn, $J$ is locally Lipschitz continuous on $\mathbb{R}^p$. Observe also that by virtue of Lemma A.4 and Lemma A.2, the polar gauge $J^\circ$ enjoys the same properties as $J$ in (H.3).
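As a small numerical illustration of (H.3), the sketch below evaluates the gauge of a polytope given by inequalities and checks positive homogeneity together with the duality inequality between a gauge and its polar. It uses the standard facts that the gauge of the unit ℓ∞ ball is the ℓ∞ norm and that its polar gauge is the ℓ1 norm; the half-space representation and the function name are assumptions made for the example.

```python
import numpy as np

def gauge_from_halfspaces(A, theta):
    """Gauge of C = {x : A x <= 1}, assumed compact with 0 in its interior:
    inf{t > 0 : theta in t * C} = max(0, max_i <a_i, theta>)."""
    return max(0.0, float(np.max(A @ theta)))

p = 4
A = np.vstack([np.eye(p), -np.eye(p)])   # C is the unit l_inf ball
rng = np.random.default_rng(1)
theta = rng.standard_normal(p)
omega = rng.standard_normal(p)

J_theta = gauge_from_halfspaces(A, theta)   # coincides with ||theta||_inf
J_polar_omega = np.sum(np.abs(omega))       # polar gauge of l_inf: the l_1 norm
```

The product `J_theta * J_polar_omega` upper-bounds the inner product of `theta` and `omega`, which is the duality inequality used repeatedly in the proofs.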
2.3 Decomposability of the prior penalty
We are now in position to provide an important characterization of the subdifferential mapping of a function satisfying (H.3). This characterization will play a pivotal role in our proof of the oracle inequality.
We start by defining some essential geometrical objects that were introduced in [63].
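Definition 2.1 (Model Subspace).
Let $\boldsymbol{\theta}\in\mathbb{R}^p$. We denote by $\boldsymbol{e}_{\boldsymbol{\theta}}$ the vector
$$\boldsymbol{e}_{\boldsymbol{\theta}}=\mathrm{P}_{\operatorname{aff}(\partial J(\boldsymbol{\theta}))}(0).$$
We denote
$$S_{\boldsymbol{\theta}}=\operatorname{par}(\partial J(\boldsymbol{\theta}))\quad\text{and}\quad T_{\boldsymbol{\theta}}=S_{\boldsymbol{\theta}}^{\perp}.$$
$T_{\boldsymbol{\theta}}$ is coined the model subspace of $\boldsymbol{\theta}$ associated to $J$.
It can be shown, see [63, Proposition 5], that $\boldsymbol{\theta}\in T_{\boldsymbol{\theta}}$, hence the name model subspace. When $J$ is differentiable at $\boldsymbol{\theta}$, we have $\boldsymbol{e}_{\boldsymbol{\theta}}=\nabla J(\boldsymbol{\theta})$ and $T_{\boldsymbol{\theta}}=\mathbb{R}^p$. When $J$ is the $\ell_1$-norm (Lasso), the vector $\boldsymbol{e}_{\boldsymbol{\theta}}$ is nothing but the sign of $\boldsymbol{\theta}$. Thus, $\boldsymbol{e}_{\boldsymbol{\theta}}$ can be viewed as a generalization of the sign vector. Observe also that $\boldsymbol{e}_{\boldsymbol{\theta}}=\mathrm{P}_{T_{\boldsymbol{\theta}}}(\partial J(\boldsymbol{\theta}))$, and thus $\boldsymbol{e}_{\boldsymbol{\theta}}\in T_{\boldsymbol{\theta}}\cap\operatorname{aff}(\partial J(\boldsymbol{\theta}))$. However, in general, $\boldsymbol{e}_{\boldsymbol{\theta}}\notin\partial J(\boldsymbol{\theta})$.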
We now provide a fundamental equivalent description of the subdifferential of at in terms of , , and the polar gauge .
Theorem 2.1.
Let $J$ satisfy (H.3). Let and .
- (i) The subdifferential of $J$ at $\boldsymbol{\theta}$ reads
[TABLE]
- (ii) For any , such that
[TABLE]
Proof.
- (i) This follows by piecing together [63, Theorem 1, Proposition 4 and Proposition 5(iii)].
- (ii) From [63, Proposition 5(iv)], we have
[TABLE]
Thus there exists a supporting point with normal vector [3, Corollary 7.6(iii)], i.e.
[TABLE]
Taking concludes the proof.
∎
Remark 2.1.
The coercivity assumption in (H.3) is not needed for Theorem 2.1 to hold.
The decomposability of described in Theorem 2.1(i) depends on the particular choice of the mapping . An interesting situation is encountered when , so that one can choose . Strong gauges, see [63, Definition 6], are precisely a class of gauges for which this situation occurs, and in this case, Theorem 2.1(i) has the simpler form
[TABLE]
The Lasso, group Lasso and nuclear norms are typical examples of (symmetric) strong gauges. However, analysis sparsity penalties (e.g. the fused Lasso) or the $\ell_\infty$-penalty are not strong gauges, though they obviously satisfy (H.3). See the next section for a detailed discussion.
2.4 Calculus with the prior family
The family of penalties complying with (H.3) form a robust class enjoying important calculus rules. In particular it is closed under the sum and composition with an injective linear operator as we now prove.
Lemma 2.1.
The set of functions satisfying (H.3) is closed under addition (the same obviously holds for any positive linear combination) and pre-composition by an injective linear operator. More precisely, the following holds:
- (i) Let $J_1$ and $J_2$ be two gauges satisfying (H.3). Then $J_1+J_2$ also obeys (H.3). Moreover,
  - (a) and , where and (resp. and ) are the model subspace and vector at associated to $J_1$ (resp. $J_2$);
  - (b) .
- (ii) Let $J$ be a gauge satisfying (H.3), and $\boldsymbol{D}$ be surjective. Then $J\circ\boldsymbol{D}^{\top}$ also fulfills (H.3). Moreover,
  - (a) and , where and are the model subspace and vector at associated to $J$;
  - (b) , where $\boldsymbol{D}^{+}=\boldsymbol{D}^{\top}\big(\boldsymbol{D}\boldsymbol{D}^{\top}\big)^{-1}$.
The outcome of Lemma 2.1 is naturally expected. For instance, assertion (i) states that combining several penalties/priors will promote objects living on the intersection of the respective low-complexity models. Similarly, for (ii), one promotes low-complexity in the image of the analysis operator $\boldsymbol{D}^{\top}$. It then follows that one does not have to deploy an ad hoc analysis when linearly pre-composing or combining (or both) several penalties (e.g. $\ell_1$+nuclear norms for recovering sparse and low-rank matrices) since our unified analysis in Section 3 will apply to them just as well.
Proof.
- (i) Convexity, positive homogeneity, coercivity and finite-valuedness are straightforward.
  - (a) This is [63, Proposition 8(i)-(ii)].
  - (b) We have from Lemma A.4 and calculus rules on support functions,
[TABLE]
- (ii) Again, convexity, positive homogeneity and finite-valuedness are immediate. Coercivity holds by injectivity of $\boldsymbol{D}^{\top}$.
  - (a) This is [63, Proposition 10(i)-(ii)].
  - (b) Denote . We have
[TABLE]
where in the last equality, we used the fact that {\boldsymbol{D}}^{+}{\boldsymbol{\omega}}\in\operatorname*{Span}{\big{(}{\boldsymbol{D}}^{\top}\big{)}}=\operatorname*{Ker}({\boldsymbol{D}})^{\perp}, and thus unless , and is continuous and convex by (H.3) and Lemma A.4.
∎
2.5 Examples
2.5.1 Lasso
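The Lasso regularization is used to promote the sparsity of the minimizers, see [7] for a comprehensive review. It corresponds to choosing $J$ as the $\ell_1$-norm
$$J(\boldsymbol{\theta})=\|\boldsymbol{\theta}\|_1=\sum_{i=1}^{p}|\boldsymbol{\theta}_i|.\qquad(2.2)$$
It is also referred to as $\ell_1$-synthesis in the signal processing community, in contrast to the more general $\ell_1$-analysis sparsity penalty detailed below.
We denote $(\boldsymbol{a}_i)_{1\leq i\leq p}$ the canonical basis of $\mathbb{R}^p$ and $\mathrm{supp}(\boldsymbol{\theta}):=\{i\in\{1,\dots,p\}\;:\;\boldsymbol{\theta}_{i}\neq 0\}$. Then,
$$T_{\boldsymbol{\theta}}=\operatorname*{Span}\big\{(\boldsymbol{a}_i)_{i\in\mathrm{supp}(\boldsymbol{\theta})}\big\},\quad(\boldsymbol{e}_{\boldsymbol{\theta}})_i=\begin{cases}\operatorname{sign}(\boldsymbol{\theta}_i)&\text{if }i\in\mathrm{supp}(\boldsymbol{\theta})\\0&\text{otherwise}\end{cases},\quad J^\circ=\|\cdot\|_\infty.\qquad(2.3)$$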
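For the Lasso, the model objects are easy to compute explicitly. The sketch below assumes the usual description for the ℓ1 norm (model subspace spanned by the canonical basis vectors on the support, model vector equal to the sign vector); the function name and the test vector are hypothetical.

```python
import numpy as np

def lasso_model(theta):
    """Support, sign vector and orthogonal projector onto the span of the
    canonical basis vectors indexed by the support of theta."""
    support = np.flatnonzero(theta)
    e = np.sign(theta)
    P_T = np.diag((theta != 0).astype(float))
    return support, e, P_T

theta = np.array([2.0, 0.0, -1.0, 0.0])
support, e, P_T = lasso_model(theta)

# Any eta that agrees with e on the support and has entries in [-1, 1] elsewhere
# is a subgradient of the l1 norm at theta.
eta = e + np.array([0.0, 0.7, 0.0, -0.3])
```

The sign vector lies in the model subspace, pairs with `theta` to give its ℓ1 norm, and `eta` satisfies the subgradient inequality at `theta`.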
2.5.2 Group Lasso
The group Lasso has been advocated to promote sparsity by groups, i.e. it drives all the coefficients in one group to zero together hence leading to group selection, see [2, 76, 1, 73] to cite a few. The group Lasso penalty with groups reads
[TABLE]
where , and whenever . Define the group support as \mathrm{supp}_{{\mathcal{B}}}({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big{\{}i\in\{1,\ldots,L\}\;:\;{\boldsymbol{\theta}}_{b_{i}}\neq 0\big{\}}. Thus, one has
[TABLE]
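A minimal sketch of the group Lasso penalty and of a block-wise analogue of the sign vector follows. It assumes non-overlapping groups given as lists of indices and the usual normalization of each active block by its Euclidean norm; names and data are hypothetical.

```python
import numpy as np

def group_lasso(theta, groups):
    """Sum over groups of the Euclidean norm of the corresponding block."""
    return sum(np.linalg.norm(theta[b]) for b in groups)

def group_support(theta, groups):
    """Indices of the groups whose block is nonzero."""
    return [i for i, b in enumerate(groups) if np.any(theta[b] != 0)]

def group_sign(theta, groups):
    """Block-wise normalized vector: theta_b / ||theta_b||_2 on active groups, 0 elsewhere."""
    e = np.zeros_like(theta, dtype=float)
    for b in groups:
        nb = np.linalg.norm(theta[b])
        if nb > 0:
            e[b] = theta[b] / nb
    return e

theta = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]
J_val = group_lasso(theta, groups)
e = group_sign(theta, groups)
active = group_support(theta, groups)
```

As with the sign vector for the Lasso, the inner product of `e` with `theta` returns the penalty value.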
2.5.3 Analysis (group) Lasso
One can push the structured sparsity idea one step further by promoting group/block sparsity through a linear operator, i.e. analysis-type sparsity. Given a linear operator (seen as a matrix), the analysis group sparsity penalty is
[TABLE]
This encompasses the 2-D isotropic total variation [52]. When all groups have cardinality one, we have the analysis-$\ell_1$ penalty (a.k.a. general Lasso), which encapsulates several important penalties including that of the 1-D total variation [52], and the fused Lasso [58]. The overlapping group Lasso [31] is also a special case of (2.4) by taking to be an operator that extracts the blocks [43, 14] (in which case has even orthogonal rows).
Let and its complement. From Lemma 2.1(ii) and (2.5), we get
[TABLE]
If, in addition, is surjective, then by virtue of Lemma 2.1(ii) we also have
[TABLE]
2.5.4 Anti-sparsity
If the vector to be estimated is expected to be flat (anti-sparse), this can be captured using the $\ell_\infty$ norm (a.k.a. Tchebychev norm) as prior
[TABLE]
The $\ell_\infty$ regularization has found applications in several fields [32, 38, 53]. Suppose that $\boldsymbol{\theta}\neq 0$, and define the saturation support of $\boldsymbol{\theta}$ as $I^{\mathrm{sat}}_{\boldsymbol{\theta}}:=\big\{i\in\{1,\dots,p\}\;:\;|\boldsymbol{\theta}_{i}|=\|\boldsymbol{\theta}\|_{\infty}\big\}\neq\emptyset$. From [63, Proposition 14], we have
[TABLE]
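The saturation support is straightforward to compute. The sketch below also builds one particular subgradient of the ℓ∞ norm supported on it (signs on the saturated entries, averaged over their number); this specific choice, the tolerance and the names are assumptions for illustration and are not claimed to reproduce the formula of [63, Proposition 14].

```python
import numpy as np

def saturation_support(theta, tol=1e-12):
    """Indices where |theta_i| attains the l_inf norm (up to a tolerance)."""
    a = np.abs(theta)
    return np.flatnonzero(a >= a.max() - tol)

def linf_subgradient(theta):
    """One subgradient of the l_inf norm at theta != 0: signs on the
    saturation support, divided by its cardinality."""
    I = saturation_support(theta)
    eta = np.zeros_like(theta, dtype=float)
    eta[I] = np.sign(theta[I]) / len(I)
    return eta

theta = np.array([2.0, -2.0, 0.5, 2.0])
I_sat = saturation_support(theta)
eta = linf_subgradient(theta)
```

This vector has unit ℓ1 norm and pairs with `theta` to give its ℓ∞ norm, which is exactly what the subgradient inequality requires for a positively homogeneous function.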
2.5.5 Nuclear norm
The natural extension of low-complexity priors to matrices is to penalize the singular values of the matrix. Let , and be a reduced rank- SVD decomposition, where and have orthonormal columns, and is the vector of singular values in non-increasing order. The nuclear norm of is
[TABLE]
This penalty is the best convex surrogate to enforce a low-rank prior. It has been widely used for various applications [45, 10, 9, 28, 11].
Following e.g. [62, Example 21], we have
[TABLE]
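The nuclear norm and its associated model objects can be sketched from a reduced SVD. The code below assumes the classical description for the nuclear norm (model vector equal to the product of the left and right singular factors, model subspace made of matrices sharing the column or row space); function names, the rank tolerance and the test matrix are hypothetical.

```python
import numpy as np

def nuclear_model(theta, tol=1e-10):
    """Nuclear norm, matrix U V^T from the reduced SVD, and the projector
    Z -> P_U Z + Z P_V - P_U Z P_V onto the associated subspace."""
    U, s, Vt = np.linalg.svd(theta, full_matrices=False)
    r = int(np.sum(s > tol))            # numerical rank
    U, Vt = U[:, :r], Vt[:r, :]
    e = U @ Vt
    P_U, P_V = U @ U.T, Vt.T @ Vt

    def proj_T(Z):
        return P_U @ Z + Z @ P_V - P_U @ Z @ P_V

    return float(s[:r].sum()), e, proj_T

# Hypothetical rank-2 matrix of size 6 x 5.
rng = np.random.default_rng(0)
theta = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
nuc, e, proj_T = nuclear_model(theta)
```

Here `e` has unit spectral norm, is left unchanged by `proj_T`, and its trace inner product with `theta` recovers the nuclear norm.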
3 Oracle inequalities for a general loss
Before delving into the details, we need to introduce some notation.
We recall and the model subspace and vector associated to (see Definition 2.1). Denote . Given two coercive finite-valued gauges and , and a linear operator , we define the operator bound as
[TABLE]
Note that is bounded (this follows from Lemma A.3(v)). Furthermore, we have from Lemma A.4 that
[TABLE]
In the following, whenever it is clear from the context, to lighten notation when is a norm, we write the subscript of the norm instead of (e.g. for the norm, for the nuclear norm, etc.).
Our main result will involve a measure of well-conditionedness of the design matrix when restricted to some subspace . More precisely, for , we introduce the coefficient
[TABLE]
This generalizes the compatibility factor introduced in [68] for the Lasso (and used in [18]). The experienced reader may have recognized that this factor is reminiscent of the null space property and restricted injectivity that play a central role in the analysis of the performance guarantees of variational/penalized estimators (1.1); see [27, 63, 65, 62, 64]. One can see in particular that is larger than the smallest singular value of .
The oracle inequalities will be provided in terms of the loss
[TABLE]
3.1 Oracle inequality for
We are now ready to establish our first main result: an oracle inequality for the EWA estimator (1.3).
Theorem 3.1**.**
Consider the EWA estimator in (1.3) with the density (1.2), where and satisfy Assumptions (H.1)-(H.2) and (H.3). Then, for any such that , the following holds,
[TABLE]
Remark 3.1.
1. It should be emphasized that Theorem 3.1 is actually a deterministic statement for a fixed choice of . Probabilistic analysis will be required when the result is applied to particular statistical models as we will see later. For this, we will use concentration inequalities in order to provide bounds that hold with high probability over the data.
2. The oracle inequality is sharp. The remainder in it has two terms. The first one encodes the complexity of the model promoted by $J$. The second one, , captures the influence of the temperature parameter. In particular, taking sufficiently small of the order , this term becomes .
3. When , i.e. is -strongly convex, then , and the remainder term becomes
[TABLE]
If, moreover, is also -Lipschitz continuous, then it can be shown that $R_{n}(\boldsymbol{\theta},\boldsymbol{\theta}_{0})$ is equivalent to a quadratic loss. This means that the oracle inequality in Theorem 3.1 can be stated in terms of the quadratic prediction error. However, the inequality is no longer sharp in this case, as a constant factor equal to the condition number naturally multiplies the right-hand side.
4. If is such that (typically for a strong gauge by (2.1)), then (in fact an equality if ). Thus the term can be omitted in (3.2).
5. A close inspection of the proof of Theorem 3.1 reveals that the term can be improved to the smaller bound
[TABLE]
where the upper-bound is a consequence of Jensen's inequality.
Proof.
By convexity of and assumption (H.1), we have for any and any ,
[TABLE]
Since is non-decreasing and convex, is a convex function. Thus, taking the expectation w.r.t. to on both sides and using Jensen inequality, we get
[TABLE]
This holds for any , and in particular at the minimal selection $\big(\partial V_{n}(\boldsymbol{\theta})\big)^{0}$ (see Section B for details). It then follows from the pillar result in Proposition B.1 (in the appendix, we provide a self-contained proof based on a novel Moreau-Yosida regularization argument; in [18, Corollary 1 and 2], an alternative proof is given using an absolute continuity argument since $J$ is locally Lipschitz, hence a Sobolev function) that
[TABLE]
We thus deduce the inequality
[TABLE]
By definition of the Bregman divergence, we have
[TABLE]
By virtue of the duality inequality (A.1), we have
[TABLE]
Denote . By virtue of (H.3), Theorem 2.1 and (A.1), we obtain
[TABLE]
This inequality together with (3.4) (applied with ) and (3.1) yield
[TABLE]
where we applied Fenchel-Young inequality (1.5) to get the last bound. Taking the infimum over yields the desired result. ∎
Stratifiable functions
Theorem 3.1 has a nice instantiation when can be partitioned into a collection of subsets that form a stratification of . That is, is a finite disjoint union such that the partitioning sets (called strata) must fit nicely together and the stratification is endowed with a partial ordering for the closure operation. For example, it is known that a polyhedral function has a polyhedral stratification, and more generally, semialgebraic functions induce stratifications into finite disjoint unions of manifolds; see, e.g., [15]. Another example is that of partly smooth convex functions thoroughly studied in [63, 65, 62, 64] for various statistical and inverse problems. These functions induce a stratification into strata that are -smooth submanifolds of . It turns out that all popular penalty functions discussed in this paper are partly smooth (see [62, 64]). Let us denote the set of strata associated to . With this notation at hand, the oracle inequality (3.2) now reads
[TABLE]
3.2 Oracle inequality for
The next result establishes that satisfies a sharp prediction oracle inequality that we will compare to (3.2).
Theorem 3.2**.**
Consider the penalized estimator in (1.1), where and satisfy Assumptions (H.1) and (H.3). Then, for any such that , the following holds,
[TABLE]
Proof.
The proof follows the same lines as that of Theorem 3.1 except that we use the fact that is a global minimizer of , i.e. . Indeed, we have for any
[TABLE]
Continuing exactly as just after (3.4), replacing with and invoking (3.7) instead of (3.4), we arrive at the claimed result. ∎
Remark 3.2.
1. Observe that the penalized estimator does not require the moment assumption (H.2) for (3.6) to hold. The convexity assumption on in (H.1), which was important to apply Jensen's inequality in the proof of (3.2), is not needed either to get (3.6).
2. As we remarked for Theorem 3.1, Theorem 3.2 is also a deterministic statement for a fixed choice of that holds for any minimizer , which is not unique in general. The condition on is similar to the one in [40] where the authors established different guarantees for .
One clearly sees that the difference between the prediction performance of $\widehat{\boldsymbol{\theta}}_n^{\mathrm{EWA}}$ and $\widehat{\boldsymbol{\theta}}_n^{\mathrm{PEN}}$ lies in the term (or rather its lower-bound in Remark 3.1-5). Thus letting $\beta\to 0$ in (3.2), one recovers the oracle inequality (3.6) of penalized estimators. In particular, for , this is on the order .
3.3 Oracle inequalities in probability
It remains to check when the event holds with high probability when is random. We will use concentration inequalities in order to provide bounds that hold with high probability over the data. Toward this goal, we will need the following assumption.
- (H.4) $\boldsymbol{y}_1,\ldots,\boldsymbol{y}_n$ are independent and identically distributed observations, and $F(\boldsymbol{u},\boldsymbol{y})=\sum_{i=1}^{n}f_i(\boldsymbol{u}_i,\boldsymbol{y}_i)$. Moreover,
  - (i) $\mathbb{E}\left[\big|f_{i}((\boldsymbol{X}\boldsymbol{\theta}_{0})_{i},\boldsymbol{y}_{i})\big|\right]<+\infty$, for all $1\leq i\leq n$;
  - (ii) $\big|f_{i}^{\prime}((\boldsymbol{X}\boldsymbol{\theta}_{0})_{i},t)\big|\leq g(t)$, where , for all $1\leq i\leq n$;
  - (iii) Bernstein moment condition: for all $1\leq i\leq n$ and all integers $m\geq 2$, $\mathbb{E}\left[\big|f_{i}^{\prime}((\boldsymbol{X}\boldsymbol{\theta}_{0})_{i},\boldsymbol{y}_{i})\big|^{m}\right]\leq m!\,\kappa^{m-2}\sigma_{i}^{2}/2$ for some constants $\kappa>0$, $\sigma_i>0$ independent of $n$.
Observe that under (H.4), and by virtue of Lemma A.4(iv) and [30, Proposition V.3.3.4], we have
[TABLE]
Thus, checking the event amounts to establishing a deviation inequality for the supremum of an empirical process above its mean (as $\mathcal{C}$ is compact, it has a dense countable subset) under the weak Bernstein moment condition (H.4)(iii), which essentially requires that the have sub-exponential tails. We will first tackle the case where $\mathcal{C}$ is the convex hull of a finite set (i.e. $\mathcal{C}$ is a polytope).
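The reason the polytope case is the easiest one is that a linear functional attains its supremum over a polytope at one of its finitely many vertices, so the supremum becomes a finite maximum amenable to a union bound. The sketch below checks this on the cross-polytope (unit ℓ1 ball), whose support function is the ℓ∞ norm; the function name, the dimension and the random test points are assumptions made for the example.

```python
import numpy as np

def support_function_polytope(vertices, omega):
    """Support function of conv(vertices): max over vertices of <v, omega>."""
    return float(np.max(vertices @ omega))

p = 5
# Vertices of the cross-polytope: plus and minus the canonical basis vectors.
vertices = np.vstack([np.eye(p), -np.eye(p)])

rng = np.random.default_rng(2)
omega = rng.standard_normal(p)
sigma = support_function_polytope(vertices, omega)   # equals ||omega||_inf

# Random points of the unit l1 ball never exceed the value attained at a vertex.
points = rng.standard_normal((1000, p))
points /= np.sum(np.abs(points), axis=1, keepdims=True)   # now on the l1 unit sphere
points *= rng.uniform(0.0, 1.0, size=(1000, 1))           # shrink into the ball
max_over_points = float(np.max(points @ omega))
```

Only `2 * p` inner products are needed to evaluate the supremum exactly, which is what keeps the logarithmic dependence on the number of vertices in the resulting bounds.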
3.3.1 Polyhedral penalty
We here suppose that is a finite-valued gauge of , where is finite, i.e. is a polytope with vertices [49, Corollary 19.1.1]. Our first oracle inequality in probability is the following.
Proposition 3.1**.**
Consider the estimators and , where and satisfy Assumptions (H.1), (H.2), (H.3) and (H.4), and is a polytope with vertices . Suppose that and \max_{{\boldsymbol{v}}\in{\mathcal{V}}}\big{\|}\boldsymbol{X}{\boldsymbol{v}}\big{\|}_{\infty}\leq 1, and take
[TABLE]
for some and . Then (3.2) and (3.6) hold with probability at least .
Proof.
In view of Assumptions (H.1) and (H.4), one can differentiate under the expectation sign (Leibniz rule) to conclude that is at and . As minimizes the population risk, one has . Using the rank assumption on , we deduce that
[TABLE]
Moreover, (3.8) specializes to
[TABLE]
Let . By the union bound and (3.8), we have
[TABLE]
The random variables {\big{(}f_{i}^{\prime}((\boldsymbol{X}{\boldsymbol{\theta}}_{0})_{i},\boldsymbol{y}_{i})\boldsymbol{z}_{i}\big{)}}_{i} are zero-mean independent, and and
[TABLE]
We are then in position to apply the Bernstein inequality to get
[TABLE]
where . Every such that
[TABLE]
satisfies . Applying the trivial inequality to the bound on , we conclude. ∎
Remark 3.3**.**
In the monograph [7, Lemma 14.12], the authors derived an exponential deviation inequality for the supremum of an empirical process with finite and possibly unbounded empirical processes under a Bernstein moment condition similar to ours (in fact ours implies theirs). The very last part of our proof can be obtained by applying their result. We detailed it here for the sake of completeness.
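The Bernstein step of the proof can be checked numerically. The following is a minimal sketch, not the authors' argument: it assumes Laplace-distributed summands (our choice, since they satisfy the moment condition of (H.4)(iii) with equality for `kappa = b` and `sigma2 = 2 b^2`) and the classical form of the Bernstein inequality, $\Pr(\sum_i Z_i \geq \sqrt{2vt} + \kappa t) \leq e^{-t}$ with $v = \sum_i \sigma_i^2$. All numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n, b = 50, 1.0            # sample size and Laplace scale (illustrative values)
kappa = b                 # Laplace(b): E|Z|^m = m! b^m = m! kappa^(m-2) sigma2 / 2
sigma2 = 2.0 * b**2       # per-variable constant sigma_i^2 in the moment condition
t = 3.0                   # target failure probability is exp(-t)

# Classical Bernstein inequality:
# P( sum_i Z_i >= sqrt(2 v t) + kappa t ) <= exp(-t), with v = sum_i sigma_i^2
v = n * sigma2
threshold = np.sqrt(2.0 * v * t) + kappa * t

trials = 50_000
sums = rng.laplace(loc=0.0, scale=b, size=(trials, n)).sum(axis=1)
empirical_tail = float(np.mean(sums >= threshold))
bernstein_bound = float(np.exp(-t))
print(f"empirical tail {empirical_tail:.4f} vs Bernstein bound {bernstein_bound:.4f}")
```

The empirical tail frequency should sit below the bound; the union bound over the vertex set then multiplies this failure probability by the number of vertices.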
Lasso
To lighten the notation, let . From (2.3), it is easy to see that
[TABLE]
where the last bound holds as an equality whenever . Further the norm is the gauge of the cross-polytope (i.e. the unit ball). Its vertex set is the set of unit-norm one-sparse vectors , where we recall the canonical basis. Thus
[TABLE]
Inserting this into Proposition 3.1, we obtain the following corollary.
Corollary 3.1**.**
Consider the estimators and , where is the Lasso penalty and satisfies Assumptions (H.1), (H.2) and (H.4). Suppose that and , and take
[TABLE]
for some and . Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
For , we recover a similar scaling for and the oracle inequality as in [66], though in the latter the oracle inequality is not sharp unlike ours. Note that the above oracle inequality extends readily to the case of analysis/fused Lasso \big{\|}\boldsymbol{D}^{\top}\cdot\big{\|}_{1} where is surjective. We leave the details to the interested reader (see also the analysis group Lasso example in Section 4).
Anti-sparsity
From Section 2.5.4, recall the saturation support of . From (2.10), we get
[TABLE]
with equality whenever . In addition, the norm is the gauge of the hypercube whose vertex set is . Thus
[TABLE]
We have the following oracle inequalities.
Corollary 3.2**.**
Consider the estimators and , where is the anti-sparsity penalty (2.9), and satisfies Assumptions (H.1), (H.2) and (H.4). Suppose that and , and take
[TABLE]
for some and . Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
We are not aware of any result of this kind in the literature. The bound imposed on is similar to what is generally assumed in the vector quantization literature [38, 53].
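To make the role of the vertex set concrete, here is a small sketch (our illustration, with an arbitrary dimension `p = 8`) that enumerates the sign vectors forming the vertices of the hypercube, checks by brute force that the support function of the hypercube is the $\ell_1$ norm, and compares the logarithm of the number of vertices with that of the cross-polytope used for the Lasso.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
p = 8                                   # small dimension so enumeration is feasible
u = rng.standard_normal(p)

# Vertex set of the hypercube [-1, 1]^p: all sign vectors, so |V| = 2^p.
vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))
num_vertices = vertices.shape[0]

# Support function of the hypercube, evaluated by brute force over the vertices.
support_value = float(np.max(vertices @ u))
l1_norm = float(np.sum(np.abs(u)))

# log|V| grows linearly in p here, versus log(2p) for the cross-polytope.
log_card_hypercube = float(np.log(num_vertices))
log_card_cross_polytope = float(np.log(2 * p))
print(support_value, l1_norm, log_card_hypercube, log_card_cross_polytope)
```

Since the union bound pays a price of order $\log|\mathcal{V}|$, the linear growth of this quantity for the hypercube is what separates the anti-sparsity case from the Lasso case.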
3.3.2 General penalty
Extending the above reasoning to a general penalty requires a deviation inequality for the supremum of an empirical process in (3.8) under the Bernstein moment condition (H.4)(iii), but without the need of uniform boundedness. This can be achieved via generic chaining along a tree using entropy with bracketing; see [69, Theorem 8]. The resulting deviation bound will thus depend on the entropies with bracketing. These quantities capture the complexity of the set but are intricate to compute in general. This subject deserves further investigation that we leave to future work.
Remark 3.4** (Group Lasso).**
Using the union bound, we have
[TABLE]
This requires a concentration inequality for quadratic forms of independent random variables satisfying the Bernstein moment assumption above. We are not aware of any such result. But if our moment condition is strengthened to
[TABLE]
then one can use [4, Theorem 3]. Indeed, assume the normalization , which entails
[TABLE]
It then follows that taking
[TABLE]
the oracle inequalities (4.5) and (4.6) hold for the group Lasso with probability at least . A similar result can be proved for the analysis group Lasso just as well with a proper normalization assumption on (see Section 4.3.3).
4 Oracle inequalities for low-complexity linear regression
In this section, we consider the classical linear regression problem where the response-covariate pairs are linked as
[TABLE]
where is a noise vector. The data loss will be set to F({\boldsymbol{u}},\boldsymbol{y})=\tfrac{1}{2}\big{\|}\boldsymbol{y}-{\boldsymbol{u}}\big{\|}_{2}^{2}. This in turn entails that on and R_{n}{\big{(}{\boldsymbol{\theta}},{\boldsymbol{\theta}}_{0}\big{)}}=\tfrac{1}{2n}\big{\|}\boldsymbol{X}{\boldsymbol{\theta}}-\boldsymbol{X}{\boldsymbol{\theta}}_{0}\big{\|}_{2}^{2}.
In this section, we assume that the noise is a zero-mean sub-Gaussian vector in with parameter . That is, its one-dimensional marginals are sub-Gaussian random variables , i.e. they satisfy
[TABLE]
In this case, the bounds of Section 3.3 can be improved.
4.1 General penalty
As we will shortly show, the event will depend on the Gaussian width, a summary geometric quantity which, informally speaking, measures the size of the bulk of a set in .
Definition 4.1**.**
The Gaussian width of a subset is defined as
[TABLE]
The concept of Gaussian width has appeared in the literature in different contexts. In particular, it has been used to establish sample complexity bounds to ensure exact recovery (noiseless case) and mean-square estimation stability (noisy case) for low-complexity penalized estimators from Gaussian measurements; see e.g. [51, 12, 59, 70, 64].
The Gaussian width has deep connections to convex geometry and it enjoys many useful properties. It is well-known that it is positively homogeneous, monotonic w.r.t. inclusion, and invariant under orthogonal transformations. Moreover, . From Lemma A.2(ii)-(iii), is a non-negative finite quantity whenever the set is bounded and contains the origin.
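The properties listed above can be illustrated by Monte Carlo. This is a sketch under the standard definition $w(\mathcal{S}) = \mathbb{E}\big[\sup_{\boldsymbol{z}\in\mathcal{S}}\langle \boldsymbol{g},\boldsymbol{z}\rangle\big]$ with $\boldsymbol{g}$ a standard Gaussian vector (we assume this matches Definition 4.1); the two sets, the dimension and the number of draws are our choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p, trials = 200, 20_000
g = rng.standard_normal((trials, p))     # shared Gaussian draws for all estimates

def width_l1_ball(radius):
    # sup of <g, z> over the l1 ball of the given radius equals radius * ||g||_inf
    return float(np.mean(radius * np.max(np.abs(g), axis=1)))

def width_l2_ball(radius):
    # sup of <g, z> over the l2 ball of the given radius equals radius * ||g||_2
    return float(np.mean(radius * np.linalg.norm(g, axis=1)))

w1 = width_l1_ball(1.0)
w2 = width_l2_ball(1.0)
classical_bound = float(np.sqrt(2.0 * np.log(2 * p)))   # E||g||_inf <= sqrt(2 log(2p))
print(w1, classical_bound, w2)
```

The estimate for the $\ell_1$ ball stays below $\sqrt{2\log(2p)}$ and below the width of the $\ell_2$ ball (monotonicity under inclusion), and scaling the radius scales the width (positive homogeneity).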
We are now ready to state our oracle inequality in probability with sub-Gaussian noise.
Proposition 4.1**.**
Let the data be generated by (4.1) where is a zero-mean sub-Gaussian random vector with parameter . Consider the estimators and , where and satisfy Assumptions (H.1)-(H.2) and (H.3). Suppose that , for some and , where and are positive absolute constants. Then with probability at least , (3.2) and (3.6) hold with the remainder term given by (3.3) with .
The proof requires sophisticated ideas from the theory of generic chaining [56], but we only apply these results. The constants and can be traced back to the proof of these results as detailed in [56].
Proof.
First, from (4.2), we have the bound
[TABLE]
i.e. the increment condition [56, (0.4)] is verified. Thus combining (3.8) with the probability bound in [56, page 11], the generic chaining theorem [56, Theorem 1.2.6] and the majorizing measure theorem [56, Theorem 2.1.1], we have
[TABLE]
∎
If the noise is Gaussian, an enhanced version can be proved by invoking Gaussian concentration of Lipschitz functions [36].
Proposition 4.2**.**
Let the data be generated by (4.1) with noise . Consider the estimators and , where and satisfy Assumptions (H.1)-(H.2) and (H.3). Suppose that , for some and . Then with probability at least , (3.2) and (3.6) hold with the remainder term given by (3.3) with .
Proof.
Thanks to sublinearity (see Lemma A.3(i) and Lemma A.4), the function is Lipschitz continuous with Lipschitz constant {\left|\!\left|\!\left|\boldsymbol{X}^{\top}\right|\!\right|\!\right|}_{2\to J^{\circ}}={\left|\!\left|\!\left|\boldsymbol{X}\right|\!\right|\!\right|}_{J\to 2}. From (3.8), we also have
[TABLE]
Observe that is a convex compact set containing the origin. Setting , it follows from (3.8) and the Gaussian concentration of Lipschitz functions [36] that
[TABLE]
∎
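The concentration step can be visualized on a concrete case. The sketch below assumes $J$ is the $\ell_1$ norm, so that the polar gauge is the $\ell_\infty$ norm and the operator norm $J \to 2$ of $\boldsymbol{X}$ is its largest column norm; it then compares the empirical upper tail of $\boldsymbol{g} \mapsto \|\boldsymbol{X}^\top \boldsymbol{g}\|_\infty$ with the Gaussian concentration bound $\exp(-t^2/(2L^2))$. The design and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, trials = 40, 100, 20_000
X = rng.standard_normal((n, p))          # arbitrary fixed design for the illustration

# For J = l1 norm, the J -> 2 operator norm of X is its largest column norm.
lipschitz = float(np.max(np.linalg.norm(X, axis=0)))

g = rng.standard_normal((trials, n))
f_values = np.max(np.abs(g @ X), axis=1)   # f(g) = ||X^T g||_inf, one value per draw
mean_f = float(np.mean(f_values))          # Monte Carlo proxy for E f(g)

t = 2.0 * lipschitz
empirical_tail = float(np.mean(f_values >= mean_f + t))
concentration_bound = float(np.exp(-t**2 / (2.0 * lipschitz**2)))   # equals exp(-2)
print(empirical_tail, concentration_bound)
```

Note that the expectation is replaced by its Monte Carlo estimate here, so the comparison is indicative rather than exact.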
Estimating theoretically the Gaussian width of a set is a non-trivial problem that has been extensively studied in the areas of probability in Banach spaces and stochastic processes. (Footnote 7: Not to mention its image with a linear operator as for .) There are classical bounds on the Gaussian width (Sudakov’s and Dudley’s inequalities), but they are difficult to estimate in most cases and neither of these bounds is tight for all sets. When the set is a convex cone (intersected with a sphere), tractable estimates based on polarity arguments were proposed in, e.g., [12].
4.2 Polyhedral penalty
When and is a polytope, enhanced oracle inequalities can be obtained by invoking a simple union bound argument.
Proposition 4.3**.**
Let the data be generated by (4.1) where is a zero-mean sub-Gaussian random vector with parameter . Consider the estimators and , where and satisfy Assumptions (H.1)-(H.2) and (H.3), and moreover is a polytope with vertices . Suppose that \lambda_{n}\geq\frac{\tau\sigma{\big{(}\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\big{)}}\sqrt{2\delta\log(|{\mathcal{V}}|)}}{n}, for some and . Then with probability at least , (3.2) and (3.6) hold with the remainder term given by (3.3) with .
In particular, if , then one can take .
Proof.
From (3.8) we have
[TABLE]
where in the last inequality, we used the fact that a convex function attains its maximum on at an extreme point . Let \epsilon=\sigma{\big{(}\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\big{)}}\sqrt{2\delta\log(|{\mathcal{V}}|)}. By the union bound, (4.2) and (3.8), we have
[TABLE]
∎
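The union bound argument of this proof can be simulated. The sketch below is our illustration for the cross-polytope (Lasso) with Gaussian noise, taken as one instance of sub-Gaussian noise; `tau = 1.5` and `delta = 2` are assumed values, and the comparison level $|\mathcal{V}|^{1-\delta}$ is the one-sided Gaussian union bound, which may differ by a constant factor from the probability stated in the proposition.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 20
sigma, tau, delta = 1.0, 1.5, 2.0       # illustrative values (assumed tau > 1, delta > 1)

X = rng.standard_normal((n, p))
# Cross-polytope vertices are +/- e_j, so X v ranges over +/- the columns of X.
max_vertex_norm = float(np.max(np.linalg.norm(X, axis=0)))
card_V = 2 * p

# Threshold epsilon from the proof and the smallest admissible lambda_n.
eps = sigma * max_vertex_norm * np.sqrt(2.0 * delta * np.log(card_V))
lambda_n = tau * eps / n

trials = 20_000
xi = sigma * rng.standard_normal((trials, n))       # Gaussian noise draws
sup_values = np.max(np.abs(xi @ X), axis=1)         # max over vertices of <X v, xi>
failure_rate = float(np.mean(sup_values > eps))
union_bound = float(card_V ** (1.0 - delta))
print(lambda_n, failure_rate, union_bound)
```

The supremum over the polytope is evaluated only at its vertices, which is exactly the reduction used in the first display of the proof.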
4.3 Applications
In this section, we exemplify our oracle inequalities for the penalties described in Section 2.5.
4.3.1 Lasso
Recall the derivations for the Lasso in Section 3.3.1. We obtain the following corollary of Proposition 4.3.
Corollary 4.1**.**
Let the data be generated by (4.1) where is a zero-mean sub-Gaussian random vector with parameter . Assume that is such that . Consider the estimators and , where is the Lasso penalty (2.2) and satisfies Assumptions (H.1)-(H.2). Suppose that , for some and . Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
The remainder term grows as . The oracle inequality (4.3) recovers [18, Theorem 1] in the exactly sparse case, and (4.4) the one in [55, Theorem 4] (see also [34, Theorem 11] and [19, Theorem 2]). It is worth mentioning, however, that [18, Theorem 1] handles the inexactly sparse case while we do not.
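For readers who want to experiment with the penalized Lasso estimator in this setting, the following is a minimal sketch using the iterative soft-thresholding algorithm (ISTA), one standard proximal splitting scheme. It is not the authors' implementation: we assume the objective $\tfrac{1}{2n}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\|_2^2 + \lambda_n\|\boldsymbol{\theta}\|_1$, columns normalized to norm $\sqrt{n}$, and the values `tau = 1.5`, `delta = 2`; the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, thresh):
    # Proximal operator of thresh * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def lasso_objective(X, y, theta, lam):
    n = X.shape[0]
    return 0.5 / n * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

def ista(X, y, lam, num_iters=500):
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the smooth part
    theta = np.zeros(p)
    history = [lasso_objective(X, y, theta, lam)]
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
        history.append(lasso_objective(X, y, theta, lam))
    return theta, np.array(history)

rng = np.random.default_rng(5)
n, p, s, sigma = 50, 200, 5, 0.5
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # normalize columns to norm sqrt(n)
theta0 = np.zeros(p)
theta0[:s] = 1.0                              # s-sparse ground truth (illustrative)
y = X @ theta0 + sigma * rng.standard_normal(n)

tau, delta = 1.5, 2.0                         # assumed constants
lam = tau * sigma * np.sqrt(2.0 * delta * np.log(2 * p) / n)
theta_hat, history = ista(X, y, lam)
prediction_risk = 0.5 / n * np.sum((X @ (theta_hat - theta0)) ** 2)
print(history[0], history[-1], prediction_risk)
```

With step size equal to the inverse Lipschitz constant, ISTA decreases the objective monotonically, which gives a simple way to check an implementation.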
4.3.2 Group Lasso
Recall the notations in Section 2.5.2, and denote the set indexing active blocks in . From (2.5), we have
[TABLE]
where the last bound holds as an equality whenever .
We have the following oracle inequalities as corollaries of Proposition 4.1 and Proposition 4.2.
Corollary 4.2**.**
Let the data be generated by (4.1). Consider the estimators and , where satisfies Assumptions (H.1)-(H.2), and is the group Lasso (2.4) with non-overlapping blocks of equal size . Assume that is such that .
- (i)
The noise is a zero-mean sub-Gaussian random vector with parameter : suppose that , for some and , where and are the positive absolute constants in Proposition 4.1. Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
- (ii)
The noise is Gaussian, : suppose that , for some and . Then, with probability at least , (4.5) and (4.6) hold.
The first remainder term is on the order . This is similar to the scaling that has been provided in the literature for EWA with other group sparsity priors and noises [48, 26]. Similar rates were given for with the group Lasso in [40, 37, 67].
Proof.
- (i)
This is a consequence of Proposition 4.1, for which we need to bound
[TABLE]
We first have, for any block
[TABLE]
Furthermore, \big{\|}\boldsymbol{X}_{b_{i}}^{\top}\cdot\big{\|}_{2} is Lipschitz continuous with Lipschitz constant . Thus the union bound and Gaussian concentration of Lipschitz functions [36] yield, for any ,
[TABLE]
Let . can be expressed as
[TABLE]
- (ii)
The proof follows the lines of Proposition 4.2 where we additionally use the union bound. Indeed,
[TABLE]
where we used the Gaussian concentration of Lipschitz functions [36] in the last inequality.
∎
We observe in passing that another way to prove the oracle inequalities in the sub-Gaussian case is to use Dudley’s inequality on the sphere in after applying a union bound on the blocks. In addition, in the Gaussian case, the (similar) bound can be obtained by combining Proposition 4.2 and the estimate in the proof of (i). The corresponding probability of success would be at least .
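The scaling obtained from the union bound and Gaussian concentration can be checked by simulation. This sketch assumes, for simplicity, that each block of columns is orthonormal and that blocks are mutually orthogonal, so that the block correlations with a standard Gaussian vector are independent standard Gaussian vectors of size `K`; it then compares the expected largest block norm with $\sqrt{K} + \sqrt{2\log L}$. The values of `K` and `L` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
K, L, trials = 4, 100, 5_000           # block size, number of blocks, Monte Carlo draws

# Under the orthonormality assumption, X_b^T g is a standard Gaussian vector in R^K.
block_vectors = rng.standard_normal((trials, L, K))
block_norms = np.linalg.norm(block_vectors, axis=2)
expected_max = float(np.mean(np.max(block_norms, axis=1)))

# sqrt(K) bounds each expected block norm; sqrt(2 log L) pays for the union over blocks.
bound = float(np.sqrt(K) + np.sqrt(2.0 * np.log(L)))
print(expected_max, bound)
```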
4.3.3 Analysis group Lasso
We now turn to the prior penalty (2.6). Recall the notations in Section 2.5.3, and recall that . We assume that is a frame of , hence surjective, meaning that there exist such that for any
[TABLE]
This together with (2.7)-(2.8) and Cauchy-Schwarz inequality entail
[TABLE]
Note, however, that from (2.7), we do not have in general .
With exactly the same arguments as those used to prove Corollary 4.2, replacing by , we arrive at the following oracle inequalities.
Corollary 4.3**.**
Let the data be generated by (4.1). Consider the estimators and , where satisfies Assumptions (H.1)-(H.2), and is the analysis group Lasso (2.6) with blocks of equal size . Assume that is a frame, and is such that .
- (i)
The noise is a zero-mean sub-Gaussian random vector with parameter : suppose that , for some and , where and are the positive absolute constants in Proposition 4.1. Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
- (ii)
The noise is Gaussian, : suppose that , for some and . Then, with probability at least , the two oracle inequalities in (i) hold.
To the best of our knowledge, this result is new to the literature. The scaling of the remainder term is the same as in **[26]** and **[48]** with analysis sparsity priors different from ours (the authors in the latter also assume that is invertible).
4.3.4 Anti-sparsity
Recall the derivations for the norm example in Section 3.3.1. We have the following oracle inequalities from Proposition 4.3.
Corollary 4.4**.**
Let the data be generated by (4.1) where is a zero-mean sub-Gaussian random vector with parameter . Assume that is such that . Consider the estimators and , where satisfies Assumptions (H.1)-(H.2), and is the anti-sparsity penalty (2.9). Suppose that , for some and . Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
The first remainder term scales as which reflects that anti-sparsity regularization requires an overdetermined regime to ensure good stability performance. This is in agreement with **[63, Theorem 7]**. This phenomenon was also observed by **[24]** who studied sample complexity thresholds for noiseless recovery from random projections of the hypercube.
4.3.5 Nuclear norm
We now turn to the nuclear norm case. Recall the notations of Section 2.5.5. For matrices , a measurement map takes the form of a linear operator whose i-th component is given by the Frobenius scalar product
[TABLE]
where is a matrix in . We denote the associated norm. From (2.12), it is immediate to see that whenever ,
[TABLE]
Moreover, from (2.12), we have
[TABLE]
To apply Proposition 4.1 and Proposition 4.2, we need to bound ( is the nuclear ball), or equivalently, to bound
[TABLE]
which is the expectation of the operator norm of a random series with matrix coefficients. Thus using **[60, Theorem 4.1.1(4.1.5)]** to get this bound, and inserting it into Proposition 4.1 and Proposition 4.2, we get the following oracle inequalities for the nuclear norm. Define
[TABLE]
Corollary 4.5**.**
Let the data be generated by (4.1) with a linear operator . Assume that . Consider the estimators and , where satisfies Assumptions (H.1)-(H.2), and is the nuclear norm (2.11).
- (i)
The noise is a zero-mean sub-Gaussian random vector with parameter : suppose that , for some and , where and are the positive absolute constants in Proposition 4.1. Then, with probability at least , the following holds
[TABLE]
and
[TABLE]
- (ii)
The noise is Gaussian, : suppose that , for some and . Then, with probability at least , (4.11) and (4.12) hold.
The set over which the infimum is taken just reminds us that the nuclear norm is partly smooth (see above) relative to the constant rank manifold (which is a Riemannian submanifold of ) **[22, Theorem 3.19]**. The first remainder term now scales as . In the iid Gaussian case, we recover the same rate as in **[18, Theorem 3]** for and in **[34, Theorem 2]** for .
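The reduction to an operator norm used before Corollary 4.5 rests on the duality between the nuclear and operator norms: the supremum of a Frobenius inner product over the nuclear ball equals the largest singular value. A short numerical check, with arbitrary matrix sizes of our choosing, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(7)
p1, p2 = 6, 4
G = rng.standard_normal((p1, p2))        # stands in for the random matrix series

U, svals, Vt = np.linalg.svd(G, full_matrices=False)
operator_norm = float(svals[0])

# The rank-one matrix u_1 v_1^T has unit nuclear norm and attains the supremum.
maximizer = np.outer(U[:, 0], Vt[0, :])
attained = float(np.sum(G * maximizer))                  # Frobenius inner product
maximizer_nuclear = float(np.linalg.norm(maximizer, 'nuc'))

# Random matrices rescaled to unit nuclear norm never exceed the operator norm.
samples = rng.standard_normal((1000, p1, p2))
nuc = np.array([np.linalg.norm(S, 'nuc') for S in samples])
values = np.einsum('kij,ij->k', samples, G) / nuc
print(operator_norm, attained, float(np.max(values)))
```

The maximizer is a rank-one matrix, in line with the fact that the extreme points of the nuclear ball are rank-one matrices of unit norm.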
4.4 Discussion of minimax optimality
In this section, we discuss the optimality of the estimators and (we remind the reader that the design is fixed). Recall the discussion on stratification at the end of Section 3.1. Let be the stratum active at . In this setting, with , (3.5) and Proposition 4.2 ensure that
[TABLE]
with high probability. In particular, for a polyhedral gauge penalty, in which case (see **[63]**), and under the normalization , Proposition 4.3 entails
[TABLE]
with high probability. Thus the risk bounds only depend on . A natural question that arises is whether the above bounds are optimal, i.e. whether an estimator can achieve a significantly better prediction risk than and uniformly on . A classical way to answer this question is the minimax point of view. This amounts to finding a lower bound on the minimax probabilities of the form
[TABLE]
where is the rate, which ideally, should be comparable to the risk bounds above. A standard path to derive such a lower bound is to exhibit a subset of of well-separated points while controlling its diameter, see **[61, Chapter 2]** or **[39, Section 4.3]**. This however must be worked out on a case-by-case basis.
Example 4.1**.**
For the Lasso case, is the subspace of vectors whose support is contained in that of . Let and . Define the set
[TABLE]
We have and for all . Define {\mathcal{F}}_{0}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big{\{}r\boldsymbol{X}{\boldsymbol{\theta}}\;:\;{\boldsymbol{\theta}}\in{\mathcal{B}}_{0}\big{\}}, for to be specified later. Due to the Varshamov-Gilbert lemma [39, Lemma 4.7], given , there exists a subset with cardinality such that for two distinct elements and in
[TABLE]
where
[TABLE]
Standard results from random matrix theory ensure that for a Gaussian design with high probability as long as [59] for some positive absolute constant .
Then choosing , where and , we get the bounds
[TABLE]
We are now in position to apply [61, Theorem 2.5] to conclude that there exists (that depends on ) such that
[TABLE]
This lower bound together with Corollary 4.1 show that (with ) and are nearly minimax (up to a logarithmic factor) over .
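The packing set used in this argument can be built explicitly in small dimension. The sketch below is only an illustration of the classical Varshamov-Gilbert statement (for length $s \geq 8$, there is a subset of $\{0,1\}^s$ with at least $2^{s/8}$ elements and pairwise Hamming distance at least $s/8$); the exact constants of [39, Lemma 4.7] may differ, and the greedy construction is our choice, not the one used in the proof of the lemma.

```python
import itertools
import numpy as np

def greedy_packing(s, min_dist):
    """Greedily collect binary vectors of length s with pairwise Hamming distance >= min_dist."""
    chosen = np.zeros((1, s), dtype=int)            # start from the zero vector
    for bits in itertools.product([0, 1], repeat=s):
        w = np.array(bits)
        # Keep w only if it is far enough from every vector already chosen.
        if np.all(np.sum(chosen != w, axis=1) >= min_dist):
            chosen = np.vstack([chosen, w])
    return chosen

s = 10
min_dist = int(np.ceil(s / 8))          # separation s/8 from the classical statement
packing = greedy_packing(s, min_dist)
required = 2 ** (s / 8)                 # cardinality guaranteed by the lemma

pairwise = (packing[:, None, :] != packing[None, :, :]).sum(axis=2)
np.fill_diagonal(pairwise, s)           # ignore the zero distances on the diagonal
min_separation = int(pairwise.min())
print(len(packing), required, min_separation)
```

In such a small dimension the greedy set is far larger than the guaranteed cardinality; the lemma matters because its guarantee holds uniformly in $s$.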
One can generalize this reasoning to get a minimax lower bound over the larger class of -sparse vectors, i.e. \bigcup\big{\{}V=\operatorname*{Span}\{(\boldsymbol{a}_{j})_{1\leq j\leq p}\}\;:\;\dim(V)=s\big{\}}, which is a finite union of subspaces that contains . Let such that and (footnote 8: e.g. take ), . Then combining [61, Theorem 2.5] and [39, Lemma 4.6 and Lemma 4.10], we have for
[TABLE]
where , and and are now the restricted isometry constants of of degree , i.e.
[TABLE]
For this lower bound to be meaningful, should be positive. From the compressed sensing literature, many random designs are known to verify this condition for large enough compared to , e.g. sub-Gaussian designs with .
One can see that the difference between this lower bound and the one on lies in the factor, which basically derives from the control over the union of subspaces. The minimax prediction risk (in expectation) over the -ball was studied in [47, 44, 71, 75, 72], where similar lower bounds were obtained.
Example 4.2**.**
For the group Lasso with groups of equal size , is the subspace of group-sparse vectors whose group support is included in that of . Let be the number of non-zero (active) groups in . Following exactly the same reasoning as for the Lasso, one can show that the risk lower bound in probability scales as , which together with Corollary 4.2, shows that and are nearly minimax (up again to a logarithmic factor) over . One can also derive the lower bound over the set of -block sparse vectors. Such a minimax lower bound is comparable to the one in [37].
Example 4.3**.**
Let’s consider the -penalty. Denote the saturation support of as and recall the subspace form (2.10). Thus, is the subspace of vectors which are collinear to on and free on its complement. Observe that , where . Define the set
[TABLE]
By construction, , and for all . Thus following the same arguments as for the Lasso example (using again the Varshamov-Gilbert lemma and [61, Theorem 2.5]), we conclude that there exists (that depends on ) such that
[TABLE]
where the restricted isometry constants are defined similarly to the Lasso but with respect to the model subspace of the norm. Again, for a Gaussian design, with high probability as long as [59].
The obtained minimax lower bound is consistent with the sample complexity thresholds derived in [24] for noiseless recovery from random projections of the hypercube. For a saturation support size small compared to , the bound of Corollary 4.4 comes close to the minimax lower bound.
Example 4.4**.**
Let , where , and . For the nuclear norm, is the manifold of rank- matrices. Thus arguing as in [34, Theorem 5] (who use the Varshamov-Gilbert lemma [39] to find the covering set), one can show that the minimax risk lower bound over is . In view of Corollary 4.5, we deduce that and are nearly minimax over the constant rank manifolds.
Appendix A Pre-requisites from convex analysis
We here collect some ingredients from convex analysis that are essential to our exposition.
Monotone conjugate
Lemma A.1**.**
Let be a non-decreasing function on that vanishes at 0. Then the following hold:
- (i)
is a proper closed convex and non-decreasing function on that vanishes at 0.
- (ii)
If is also closed and convex, then .
- (iii)
Let such that is differentiable on , where is finite-valued, strictly convex and strongly coercive. Then is likewise finite-valued, strictly convex, strongly coercive, and is differentiable on . In particular, both and are strictly increasing on .
Proof.
- (i)
By [3, Proposition 13.11], is a closed convex function. We have . Since is non-decreasing and , then . In addition, by (1.5), we have , . This shows that is non-negative and , and in turn, it is also proper.
Let in such that . Then
[TABLE]
That is, is non-decreasing on .
- (ii)
This follows from [49, Theorem 12.4].
- (iii)
By definition of , is a finite-valued function on , strictly convex, differentiable and strongly coercive. It then follows from [30, Corollary X.4.1.4] that enjoys the same properties. In turn, using the fact that both and are even, we have is strongly coercive, and strict convexity of (resp. ) is equivalent to that of (resp. ). Altogether, this shows the first claim. We now prove that vanishes only at 0 (and similarly for ). As is non-decreasing and strictly convex, we have, for any and in such that ,
[TABLE]
∎
Support function
The support function of is
[TABLE]
We recall the following properties whose proofs can be found in e.g. **[49, 30]**.
Lemma A.2**.**
Let be a non-empty set.
- (i)
is proper lsc and sublinear.
- (ii)
is finite-valued if and only if is bounded.
- (iii)
If , then is non-negative.
- (iv)
If is convex and compact with , then is finite-valued and coercive.
Gauges and polars
Definition A.1** (Polar set).**
Let be a nonempty convex set. The set given by
[TABLE]
is called the polar of .
The set is closed convex and contains the origin. When is also closed and contains the origin, then it coincides with its bipolar, i.e. .
Let be a non-empty closed convex set containing the origin. The gauge of is the function defined on by
[TABLE]
As usual, if the infimum is not attained.
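The definition of the gauge lends itself to a direct numerical evaluation. The sketch below assumes the standard definition $\gamma_{\mathcal{C}}(\boldsymbol{x}) = \inf\{\lambda > 0 : \boldsymbol{x} \in \lambda\mathcal{C}\}$ and access to a membership oracle for $\mathcal{C}$ (a hypothetical helper of ours); since $\mathcal{C}$ is convex and contains the origin, membership of $\boldsymbol{x}/\lambda$ is monotone in $\lambda$, so bisection applies.

```python
import numpy as np

def gauge(x, in_set, hi=1e6, tol=1e-10):
    """Gauge inf{lam > 0 : x in lam * C} by bisection, given a membership oracle for C."""
    if not in_set(x / hi):
        return np.inf                   # x belongs to no dilation of C up to hi
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if in_set(x / mid):
            hi = mid                    # x is in mid * C: the gauge is at most mid
        else:
            lo = mid                    # x is outside mid * C: the gauge exceeds mid
    return hi

def in_l1_ball(z):
    return np.sum(np.abs(z)) <= 1.0

def in_cube(z):
    return np.max(np.abs(z)) <= 1.0

x = np.array([0.5, -2.0, 1.5])
gauge_l1 = gauge(x, in_l1_ball)         # should match the l1 norm of x
gauge_cube = gauge(x, in_cube)          # should match the l_inf norm of x
print(gauge_l1, gauge_cube)
```

The two sets recover the $\ell_1$ and $\ell_\infty$ norms, i.e. the Lasso and anti-sparsity penalties viewed as gauges of the cross-polytope and the hypercube.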
Lemma A.3 hereafter recaps the main properties of a gauge that we need. In particular, (ii) is a fundamental result of convex analysis that states that there is a one-to-one correspondence between gauge functions and closed convex sets containing the origin. This allows one to identify sets from their gauges, and vice versa.
Lemma A.3**.**
- (i)
is a non-negative, lsc and sublinear function.
- (ii)
is the unique closed convex set containing the origin such that
[TABLE]
- (iii)
is finite-valued if, and only if, , in which case is Lipschitz continuous.
- (iv)
is finite-valued and coercive if, and only if, is compact and .
See **[63]** for the proof.
Observe that thanks to sublinearity, local Lipschitz continuity valid for any finite-valued convex function is strengthened to global Lipschitz continuity. Moreover, is a norm, having as its unit ball, if and only if is bounded with nonempty interior and symmetric.
We now define the polar gauge.
Definition A.2** (Polar Gauge).**
The polar of a gauge is the function defined by
[TABLE]
An immediate consequence is that gauges polar to each other have the property
[TABLE]
just as dual norms satisfy a duality inequality. In fact, polar pairs of gauges correspond to the best inequalities of this type.
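As a worked instance of this inequality, take $\mathcal{C}$ to be the unit $\ell_1$ ball (the symbols $\gamma_{\mathcal{C}}$, $\mathcal{C}^{\circ}$ and $\sigma_{\mathcal{C}}$ below are our notation for the gauge, the polar set and the support function). The polar pair is then the pair of norms $(\ell_1, \ell_\infty)$, and the inequality reduces to Hölder's inequality:

```latex
\begin{align*}
\gamma_{\mathcal{C}}(\boldsymbol{x}) &= \|\boldsymbol{x}\|_1, \qquad
\mathcal{C}^{\circ} = \{\boldsymbol{u} : \|\boldsymbol{u}\|_\infty \le 1\}, \qquad
\gamma_{\mathcal{C}^{\circ}}(\boldsymbol{u}) = \sigma_{\mathcal{C}}(\boldsymbol{u}) = \|\boldsymbol{u}\|_\infty, \\
\langle \boldsymbol{x}, \boldsymbol{u} \rangle
  &= \sum_{j} x_j u_j
  \le \Big(\sum_{j} |x_j|\Big) \max_{k} |u_k|
  = \gamma_{\mathcal{C}}(\boldsymbol{x}) \, \gamma_{\mathcal{C}^{\circ}}(\boldsymbol{u}).
\end{align*}
```

Equality holds for $\boldsymbol{u} = \operatorname{sign}(\boldsymbol{x})$, which illustrates why the polar gauge gives the best inequality of this type.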
Lemma A.4**.**
Let be a closed convex set containing the origin. Then,
- (ii)
is a gauge function and .
- (iii)
, or equivalently
[TABLE]
- (iv)
The gauge of and the support function of are mutually polar, i.e.
[TABLE]
See **[49, 30, 63]** for the proof.
Appendix B Expectation of the inner product
We start with some definitions and notations that will be used in the proof. For a non-empty closed convex set , we denote {\big{(}{\mathcal{C}}\big{)}}^{0} its minimal selection, i.e. the element of minimal norm in . This element is of course unique. For a proper lsc and convex function and , its Moreau envelope (or Moreau-Yosida regularization) is defined by
[TABLE]
The Moreau envelope enjoys several important properties that we collect in the following lemma.
Lemma B.1**.**
Let be a finite-valued and convex function. Then
- (i)
is a decreasing net, and , as .
- (ii)
with -Lipschitz continuous gradient.
- (iii)
, \nabla\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf({\boldsymbol{\theta}})\to{\big{(}\partial f({\boldsymbol{\theta}})\big{)}}^{0} and \big{\|}\nabla\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf({\boldsymbol{\theta}})\big{\|}_{2}\nearrow\big{\|}{\big{(}\partial f({\boldsymbol{\theta}})\big{)}}^{0}\big{\|}_{2} as .
Proof.
(i) [3, Proposition 12.32]. (ii) [3, Proposition 12.29]. (iii) By assumption, is subdifferentiable everywhere and its subdifferential is a maximal monotone operator with domain , and the result follows from [3, Corollary 23.46(i)]. ∎
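The properties of Lemma B.1 can be observed on the simplest nonsmooth example. The sketch below takes $f = |\cdot|$ on the real line (our choice) and the standard definition of the Moreau envelope with parameter $\gamma$, $\min_{w} f(w) + \tfrac{1}{2\gamma}(w - \theta)^2$; the envelope is then the Huber function, it increases pointwise to $f$ as $\gamma$ decreases, and its gradient is $\theta \mapsto \operatorname{clip}(\theta/\gamma, -1, 1)$.

```python
import numpy as np

def moreau_env_abs(x, gamma):
    """Moreau envelope of f = |.| with parameter gamma, evaluated through the prox."""
    prox = np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)     # soft-thresholding
    return np.abs(prox) + (x - prox) ** 2 / (2.0 * gamma)

def huber(x, gamma):
    # Closed form of the envelope: quadratic near zero, linear in the tails.
    return np.where(np.abs(x) <= gamma, x**2 / (2.0 * gamma), np.abs(x) - gamma / 2.0)

def grad_env_abs(x, gamma):
    # Gradient of the envelope, (x - prox) / gamma, which is (1/gamma)-Lipschitz.
    return np.clip(x / gamma, -1.0, 1.0)

x = np.linspace(-3.0, 3.0, 601)
gammas = [1.0, 0.5, 0.1, 0.01]           # decreasing sequence of parameters
envelopes = [moreau_env_abs(x, g) for g in gammas]
max_gap = float(np.max(np.abs(x) - envelopes[-1]))   # distance to f for the smallest gamma
print(max_gap)
```

At $\theta = 0$ the gradient stays equal to 0, the minimal-norm element of the subdifferential $[-1, 1]$, while for $\theta \neq 0$ its absolute value increases to 1, in agreement with item (iii).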
We are now equipped to prove the following important result. (Footnote 9: The result will be proved using Moreau-Yosida regularization. Yet another alternative proof could be based on mollifiers for approximating subdifferentials.)
Proposition B.1**.**
Let the density in (1.2), where
- (a)
satisfies Assumptions (H.1)-(H.2);
- (b)
is a finite-valued lower-bounded convex function, and and , such that , \big{\|}{\big{(}\partial J({\boldsymbol{\theta}})\big{)}}^{0}\big{\|}_{2}\leq R\left\|{\boldsymbol{\theta}}\right\|_{2}^{\rho};
- (c)
and is coercive.
Then, ,
[TABLE]
This result covers of course the situation where fulfills (H.3). In this case, since by Theorem 2.1(i), we have and , the diameter of the convex compact set containing the origin. It can be shown that, when is strongly coercive, the coercivity assumption (c) can be equivalently stated as , , where is the recession/asymptotic function of ; see e.g. **[50]**.
Proof.
Let and define , where is the normalizing constant of the density . Assumption (H.1) and Lemma B.1(ii)-(iii) tell us that and \nabla V^{\gamma}_{n}({\boldsymbol{\theta}})\to{\big{(}\partial V_{n}({\boldsymbol{\theta}})\big{)}}^{0} as . Thus
[TABLE]
We now check that is dominated by an integrable function. From the definition of the Moreau envelope, we have
[TABLE]
From coercivity of , the objective in the is also coercive in by [50, Exercise 3.29(b)]. It then follows from [50, Theorem 3.31] that is also coercive. In turn, [50, Theorem 11.8(c) and 3.26(a)] allow us to assert that for some , such that for all and
[TABLE]
Lemma B.1(iii) and assumption (b) on entail that for any ,
[TABLE]
Altogether, we have
[TABLE]
where the constant reflects the lower-boundedness of . It is easy to see that the function in this upper-bound is integrable, where we also use (H.2). Hence, we can apply the dominated convergence theorem to get
[TABLE]
Now, by simple differential calculus (chain and product rules), we have
[TABLE]
Integrating the first term, we get by Fubini theorem and the Newton-Leibniz formula
[TABLE]
where we used coercivity of (see (B.1)) to conclude that . For the second term, we have from Lemma B.1(i) that as . Thus, arguing again as in (B.1), we can apply the dominated convergence theorem to conclude that
[TABLE]
This concludes the proof. ∎
Acknowledgement.
This work was supported by Conseil Régional de Basse-Normandie and partly by Institut Universitaire de France.
- [1] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
- [2] S. Bakin. Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University, 1999.
- [3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2011.
- [4] P. Bellec. Concentration of quadratic forms under a Bernstein moment assumption. Technical report, Ecole Polytechnique, 2014.
- [5] P. J. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
- [6] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. Slope – adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140, 2014.
- [7] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer-Verlag Berlin Heidelberg, 2011.
- [8] E. Candès and Y. Plan. Near-ideal model selection by $\ell_{1}$ minimization. Annals of Statistics, 37(5A):2145–2177, 2009.
