Localized Gaussian width of $M$-convex hulls with applications to Lasso and convex aggregation
Pierre C Bellec

TL;DR
This paper derives bounds on the Gaussian mean width of convex hulls intersected with Euclidean balls and applies these results to analyze the performance of Lasso, ERM, and convex aggregation methods in statistical estimation.
Contribution
It introduces new bounds on Gaussian widths of convex hulls under restricted isometry conditions and applies them to key statistical estimators.
Findings
Bounds match up to a constant under RIP conditions
Provides theoretical insights into Lasso and aggregation performance
Enhances understanding of geometric properties in high-dimensional statistics
Abstract
Upper and lower bounds are derived for the Gaussian mean width of the intersection of a convex hull of points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property. This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.
Rutgers University, Department of Statistics and Biostatistics
1 Introduction
Let be a subset of . The Gaussian width of is defined as
[TABLE]
where and are i.i.d. standard normal random variables. For any vector , denote by its Euclidean norm and define the Euclidean balls
[TABLE]
We will also use the notation . The localized Gaussian width of with radius is the quantity . For any , define the norm by for any , and let be the number of nonzero coefficients of .
This paper studies the localized Gaussian width
[TABLE]
where is the convex hull of points in .
If , then matching upper and lower bounds are available for the localized Gaussian width:
[TABLE]
cf. [14] and [21, Section 4.1]. In the above display, means that and for some large enough numerical constant .
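As a numerical illustration of the localized width of the $\ell_1$ ball, the following minimal sketch estimates it by Monte Carlo. It is not taken from the paper. It assumes the standard duality fact that the support function of an intersection of two convex bodies is the infimal convolution of their support functions, which for the unit $\ell_1$ ball and the Euclidean ball of radius `s` reduces to a one-dimensional convex minimization over a soft-threshold level `t`. The names `support_l1_cap_l2` and `localized_width_l1` are hypothetical.

```python
import numpy as np

def support_l1_cap_l2(g, s, iters=100):
    """sup of <g, u> over {|u|_1 <= 1} intersected with {|u|_2 <= s}.

    Assumed identity (infimal convolution of support functions):
        sup = min_{t >= 0}  t + s * |soft_threshold(g, t)|_2 .
    """
    a = np.abs(np.asarray(g, dtype=float))

    def objective(t):
        return t + s * np.linalg.norm(np.maximum(a - t, 0.0))

    # Ternary search: the objective is convex in t and increasing beyond max|g_i|.
    lo, hi = 0.0, float(a.max())
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    return objective(0.5 * (lo + hi))

def localized_width_l1(M, s, n_draws=2000, seed=0):
    """Monte Carlo estimate of the Gaussian width of the l1 ball in R^M cut by s*B_2."""
    rng = np.random.default_rng(seed)
    draws = [support_l1_cap_l2(rng.standard_normal(M), s) for _ in range(n_draws)]
    return float(np.mean(draws))
```

Two sanity checks follow from the geometry: for `s >= 1` the Euclidean ball contains the $\ell_1$ ball, so the supremum is the largest absolute coordinate of `g`; for `s <= 1/sqrt(M)` the $\ell_1$ ball contains the Euclidean ball, so the supremum is `s` times the Euclidean norm of `g`.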
The first goal of this paper is to generalize this bound to any that is the convex hull of points in .
Contributions.
Section 2 is devoted to the generalization of (4) and provides sharp bounds on the localized Gaussian width of the convex hull of points in , see Propositions 1 and 2 below. Sections 3, 4 and 5 provide statistical applications of the results of Section 2. Section 3 studies the Lasso estimator and the convex aggregation problem in fixed-design regression. In Section 4, we show that Empirical Risk Minimization achieves the minimax rate for the persistence problem in the anisotropic setting. Finally, Section 5 provides results for bounded empirical processes and for the convex aggregation problem in density estimation.
2 Localized Gaussian width of an $M$-convex hull
The first contribution of the present paper is the following upper bound on localized Gaussian width of the convex hull of points in .
Proposition 1.
Let and . Let be the convex hull of points in and assume that . Let be a centered Gaussian random variable with covariance matrix . Then for all ,
[TABLE]
where .
Proposition 1 is proved in the next two subsections. Inequality
[TABLE]
is a direct consequence of the Cauchy-Schwarz inequality and where is the orthogonal projection onto the linear span of and is the rank of . The novelty of (5) is inequality
[TABLE]
Inequality (7) was known for the -ball [14], but to our knowledge (7) is new for general $M$-convex hulls. If is the -ball, then the bound (5) is sharp up to numerical constants [14], [21, Section 4.1].
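The Cauchy-Schwarz step behind the dimension-dependent bound can be written out as follows. This is a sketch under assumed notation, since the symbols are not legible in this copy: $T$ denotes the convex hull, $B_2$ the Euclidean unit ball, $s$ the radius, $P$ the orthogonal projection onto the linear span of $T$, $d$ the rank of $P$, and $g$ a standard normal vector (for covariance $\sigma^2 I$, multiply the bound by $\sigma$).

$$
\mathbb{E}\sup_{u\in T\cap sB_2} g^\top u
= \mathbb{E}\sup_{u\in T\cap sB_2} (Pg)^\top u
\le s\,\mathbb{E}|Pg|_2
\le s\sqrt{\mathbb{E}|Pg|_2^2}
= s\sqrt{d}.
$$

The first equality uses $Pu = u$ for every $u$ in the span of $T$, the first inequality is Cauchy-Schwarz together with $|u|_2 \le s$, and the second is Jensen's inequality.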
The above result does not assume any type of Restricted Isometry Property (RIP). The following proposition shows that (7) is essentially sharp provided that the vertices of satisfy a one-sided RIP of order .
Proposition 2.
Let and . Let be a centered Gaussian random variable with covariance matrix . Let and assume for simplicity that is a positive integer such that . Let be the convex hull of the points where . Assume that for some real number we have
[TABLE]
where . Then
[TABLE]
The proof of Proposition 2 is given in Appendix A.
2.1 A refinement of Maurey’s argument
This subsection provides the main tool to derive the upper bound (7). Define the simplex in by
[TABLE]
Let be an integer, and let
[TABLE]
where is a positive semi-definite matrix of size . Let be a deterministic vector such that is small. Maurey’s argument [27] has been used extensively to prove the existence of a sparse vector such that is of the same order as that of . Maurey’s argument uses the probabilistic method to prove the existence of such . A sketch of this argument is as follows.
Define the discrete set as
[TABLE]
where is the canonical basis in . The discrete set is a subset of the simplex that contains only -sparse vectors.
Let be the canonical basis in . Let be i.i.d. random variables valued in with distribution
[TABLE]
Next, consider the random variable
[TABLE]
The random variable is valued in and is such that , where denotes the expectation with respect to . Then a bias-variance decomposition yields
[TABLE]
where is a constant such that . As , this yields the existence of such that
[TABLE]
If is chosen large enough, the two terms and are of the same order and we have established the existence of an -sparse vector so that is not substantially larger than .
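The sampling step of Maurey's argument can be simulated directly. The sketch below is illustrative and uses assumed notation: `theta` is a point of the simplex, `m` is the number of draws, and `Q` is a positive semi-definite matrix with diagonal entries at most 1. The helper names are hypothetical. The function `exact_variance` encodes the bias-variance identity for the average of `m` i.i.d. basis vectors, namely $(\sum_j \theta_j Q_{jj} - \theta^\top Q\theta)/m$, which is at most $1/m$ under the diagonal assumption.

```python
import numpy as np

def maurey_sample(theta, m, rng):
    """Average of m canonical basis vectors drawn i.i.d. with P(e_j) = theta_j.

    The result lies in the simplex and has at most m nonzero coordinates.
    """
    M = len(theta)
    idx = rng.choice(M, size=m, p=theta)
    return np.bincount(idx, minlength=M) / m

def exact_variance(theta, Q, m):
    """E (theta_hat - theta)^T Q (theta_hat - theta) for theta_hat = maurey_sample(...)."""
    return float(theta @ np.diag(Q) - theta @ Q @ theta) / m

def monte_carlo_variance(theta, Q, m, n_draws=4000, seed=0):
    """Empirical counterpart of exact_variance, for comparison."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        d = maurey_sample(theta, m, rng) - theta
        total += d @ Q @ d
    return total / n_draws
```

Comparing `monte_carlo_variance` with `exact_variance` for a Gram matrix of unit-norm vectors shows the $1/m$ decay that makes an $m$-sparse approximation possible.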
For our purpose, we need to refine this argument by controlling the deviation of the random variable . This is done in Lemma 3 below.
Lemma 3.
Let and define by (12). Let be a convex function. For all , let
[TABLE]
where is a positive semi-definite matrix of size . Assume that the diagonal elements of satisfy for all . Then for all ,
[TABLE]
In the next sections, it will be useful to bound from above the quantity maximized over subject to the constraint . An interpretation of (18) is as follows. Consider the two optimization problems
[TABLE]
for some . Inequality (18) says that the optimal value of the first optimization problem is smaller than the optimal value of the second optimization problem averaged over the distribution of given by the density on . The second optimization problem above is over the discrete set with the relaxed constraint , hence we have relaxed the constraint in exchange for discreteness. The discreteness of the set will be used in the next subsection for the proof of Proposition 1.
Proof of Lemma 3.
The set is compact. The function is convex with domain and thus continuous. Hence the supremum in the left hand side of (18) is achieved at some such that . Let be the random variable defined in (13) and (14) above. Denote by the expectation with respect to . By definition, and . Let . A bias-variance decomposition and the independence of yield
[TABLE]
Another bias-variance decomposition yields
[TABLE]
where we used that and that almost surely. Thus
[TABLE]
Define the random variable , which is nonnegative and satisfies . By Markov’s inequality, it holds that . Define the random variable by the density function on . Then we have for any , so by stochastic dominance, there exists a rich enough probability space and random variables and defined on such that and have the same distribution, and have the same distribution, and almost surely on (see for instance Theorem 7.1 in [12]). Denote by the expectation sign on the probability space .
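The domination step can be checked by a short computation. The sketch below rests on labeled assumptions, because the formulas are not legible in this copy: the nonnegative variable is written $W$ with $\mathbb{E}W \le 1$, and the dominating variable is written $Z$ with density $t \mapsto t^{-2}$ on $[1,\infty)$, which is the natural choice matching Markov's bound. Both symbols are illustrative.

$$
\mathbb{P}(Z > t) = \int_{t}^{\infty} u^{-2}\,du = \frac{1}{t} \quad (t \ge 1),
\qquad
\mathbb{P}(Z > t) = 1 \quad (0 < t < 1),
$$

$$
\mathbb{P}(W > t) \le \min\Big(1, \frac{\mathbb{E}W}{t}\Big) \le \min\Big(1, \frac{1}{t}\Big) = \mathbb{P}(Z > t) \quad \text{for all } t > 0 .
$$

Under these assumptions the tail of $W$ lies below the tail of $Z$ everywhere, which is exactly the stochastic dominance needed for the coupling.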
By definition of and , using Jensen’s inequality, Fubini’s Theorem and the fact that we have
[TABLE]
where is the nondecreasing function . The right hand side of the previous display is equal to . Next, we use the random variables and as follows:
[TABLE]
Combining the previous display and (24) completes the proof. ∎
2.2 Proof of (7)
We are now ready to prove Proposition 1. The main ingredients are Lemma 3 and the following upper bound on the cardinality of
[TABLE]
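The size of such a discrete set can be verified by brute force for small parameters. The sketch below assumes the set consists of all averages of `m` canonical basis vectors of dimension `M`, so that its points correspond to multisets of size `m` drawn from `M` indices. It compares the exact count with the binomial coefficient and with the standard estimate $\binom{n}{k} \le (en/k)^k$. The exact form of the paper's bound (27) is not reproduced here, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb, e

def discrete_simplex_points(M, m):
    """Distinct points (as integer count vectors) of {(1/m) * sum of m basis vectors}."""
    points = set()
    for combo in combinations_with_replacement(range(M), m):
        counts = [0] * M
        for j in combo:
            counts[j] += 1
        points.add(tuple(counts))  # the actual point is counts / m
    return points

def multiset_count(M, m):
    """Number of multisets of size m from M items."""
    return comb(M + m - 1, m)

def entropy_style_bound(M, m):
    """Standard upper bound (e * n / k)^k on binom(n, k) with n = M + m - 1, k = m."""
    return (e * (M + m - 1) / m) ** m
```

The logarithm of this count grows like `m` times a logarithmic factor, which is what makes a union bound over the discrete set affordable.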
Proof of (7).
If then by (6) we have , hence (7) holds. Thus it is enough to focus on the case .
Let and set , which satisfies . As is the convex hull of points, let be such that
[TABLE]
where for .
Let for all . This is a polynomial of order , of the form , where is the Gram matrix with for all . As we assume that , the diagonal elements of satisfy . For all , let . Applying Lemma 3 with the above notation, , and , we obtain
[TABLE]
By definition of , so that . Using Fubini’s Theorem and a bound on the expectation of the maximum of centered Gaussian random variables with variances bounded from above by , we obtain that the right hand side of the previous display is bounded from above by
[TABLE]
where we used the bound (27). To complete the proof of (7), notice that we have and .
∎
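The Gaussian maximal inequality used in this proof is the classical bound $\mathbb{E}\max_{j\le N} g_j \le \sigma\sqrt{2\log N}$ for centered Gaussian variables with variances at most $\sigma^2$. It holds without any independence assumption. The following minimal sketch checks it by simulation in the independent case, which is an assumption made only to keep the simulation simple; the function names are hypothetical.

```python
import numpy as np

def mean_max_gaussian(N, sigma=1.0, n_draws=2000, seed=0):
    """Monte Carlo estimate of E max of N i.i.d. N(0, sigma^2) variables."""
    rng = np.random.default_rng(seed)
    g = sigma * rng.standard_normal((n_draws, N))
    return float(g.max(axis=1).mean())

def max_bound(N, sigma=1.0):
    """Classical upper bound sigma * sqrt(2 log N) on the expected maximum."""
    return float(sigma * np.sqrt(2.0 * np.log(N)))
```

In the proof, `N` plays the role of the cardinality of the discrete set, so the logarithm of that cardinality is what enters the final bound.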
3 Statistical applications in fixed-design regression
Numerous works have established a close relationship between localized Gaussian widths and the performance of statistical and compressed sensing procedures. Some of these works are reviewed below.
- In a regression problem with random design where the design and the target are subgaussian, Lecué and Mendelson [21] established that two quantities govern the performance of the empirical risk minimizer over a convex class . These two quantities are defined using the Gaussian width of the class intersected with an ball [21, Definition 1.3].
- If are such that and , Gordon et al. [14] provide precise estimates of where is the unit ball and is the ball of radius . These estimates are then used to solve the approximate reconstruction problem where one wants to recover an unknown high dimensional vector from a few random measurements [14, Section 7].
- Plan et al. [28] show that in the semiparametric single index model, if the signal is known to belong to some star-shaped set , then the Gaussian width of and its localized version characterize the gain obtained by using the additional information that the signal belongs to , cf. Theorem 1.3 in [28].
- Finally, Chatterjee [9] exhibits a connection between localized Gaussian widths and shape-constrained estimation.
These results are reminiscent of the isomorphic method [17, 3, 2], where localized expected suprema of empirical processes are used to obtain upper bounds on the performance of Empirical Risk Minimization (ERM) procedures. Together, they show that Gaussian width estimates are important to understand the statistical properties of estimators in many statistical contexts.
In Proposition 1, we established an upper bound on the Gaussian width of $M$-convex hulls. We now provide some statistical applications of this result in regression with fixed design. We will use the following theorem from [7].
Theorem 4 ([7]).
Let be a closed convex subset of and . Let be an unknown vector and let . Denote by the projection of onto . Assume that for some ,
[TABLE]
Then for any , with probability greater than , the Least Squares estimator satisfies
[TABLE]
Hence, to prove an oracle inequality of the form (32), it is enough to prove the existence of a quantity such that (31) holds. If the convex set in the above theorem is the convex hull of points, then a quantity is given by the following proposition.
Proposition 5.
Let and . Let such that for all . For all , let . Let be a centered Gaussian random variable with covariance matrix . If then the quantity
[TABLE]
provided that .
Proof.
Inequality
[TABLE]
is a reformulation of Proposition 1 using the notation of Proposition 5. Thus, in order to prove (33), it is enough to establish that for we have
[TABLE]
As and for all , the left hand side of the previous display satisfies
[TABLE]
Thus (35) holds if , which is the case if the absolute constant is . ∎
Inequality (33) establishes the existence of a quantity such that
[TABLE]
where is the convex hull of . Consequences of (38) and Theorem 4 are given in the next subsections.
We now introduce two statistical frameworks where the localized Gaussian width of an $M$-convex hull has applications: the Lasso estimator in high-dimensional statistics and the convex aggregation problem.
3.1 Convex aggregation
Let be an unknown regression vector and let be an observed random vector, where satisfies . Let and let be deterministic vectors in . The set will be referred to as the dictionary. For any , let . If a set is given, the goal of the aggregation problem induced by is to find an estimator constructed with and the dictionary such that
[TABLE]
either in expectation or with high probability, where is a small quantity. Inequality (39) is called a sharp oracle inequality, where "sharp" means that in the right hand side of (39), the multiplicative constant of the term is . Similar notations will be defined for regression with random design and density estimation. Define the simplex in by (10). The following aggregation problems were introduced in [26, 34].
- Model Selection type aggregation with , i.e., is the canonical basis of . The goal is to construct an estimator whose risk is as close as possible to the best function in the dictionary. Such results can be found in [34, 22, 1] for random design regression, in [23, 10, 5, 11] for fixed design regression, and in [16, 6] for density estimation.
- Convex aggregation with , i.e., is the simplex in . The goal is to construct an estimator whose risk is as close as possible to the best convex combination of the dictionary functions. See [34, 20, 19, 33] for results of this type in the regression framework and [29] for such results in density estimation.
- Linear aggregation with . The goal is to construct an estimator whose risk is as close as possible to the best linear combination of the dictionary functions, cf. [34, 33] for such results in regression and [29] for such results in density estimation.
One may also define the Sparse or Sparse Convex aggregation problems: construct an estimator whose risk is as close as possible to the best sparse combination of the dictionary functions. Such results can be found in [31, 30, 33] for fixed design regression and in [24] for regression with random design. These problems are out of the scope of the present paper.
A goal of the present paper is to provide a unified argument that shows that empirical risk minimization is optimal for the convex aggregation problem in density estimation, regression with fixed design and regression with random design.
Theorem 6.
Let , let and define . Let and let for all . Let
[TABLE]
Then for all , with probability greater than ,
[TABLE]
where and .
Proof of Theorem 6.
Let be the linear span of and let be the orthogonal projector onto . If , then
[TABLE]
Let be the convex hull of . Let be the convex projection of onto . We apply Proposition 5 to which is a convex hull of points, and for all , . By (47) and (33), the quantity satisfies (31). Applying Theorem 4 completes the proof. ∎
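For readers who want to experiment with the estimator analyzed in Theorem 6, the following minimal sketch computes a least squares fit over the simplex with the Frank-Wolfe method and exact line search. Frank-Wolfe is a choice made here for illustration and is not discussed in the paper. The sketch assumes the dictionary vectors are stored as the columns of a matrix `F` and the observation is `y`; the function name `convex_aggregate` is hypothetical.

```python
import numpy as np

def convex_aggregate(F, y, n_iter=500):
    """Frank-Wolfe for min over the simplex of |y - F @ theta|_2^2.

    F has shape (n, M); its columns are the dictionary vectors.
    Every iterate is a convex combination of the columns of F.
    """
    n, M = F.shape
    theta = np.zeros(M)
    theta[0] = 1.0                      # start at a vertex of the simplex
    for _ in range(n_iter):
        residual = y - F @ theta
        grad = -2.0 * F.T @ residual
        j = int(np.argmin(grad))        # vertex minimizing the linearized objective
        Fd = F[:, j] - F @ theta        # image of the direction e_j - theta
        denom = float(Fd @ Fd)
        if denom == 0.0:
            break
        # Exact line search for a quadratic objective, clipped to [0, 1].
        gamma = float(np.clip(residual @ Fd / denom, 0.0, 1.0))
        theta = (1.0 - gamma) * theta
        theta[j] += gamma
    return theta
```

After `k` iterations the iterate has at most `k + 1` nonzero coordinates, which echoes the sparse approximation idea behind Maurey's argument.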
3.2 Lasso
We consider the following regression model. Let and assume that for all . We will refer to as the covariates. Let be the matrix of dimension with columns . We observe
[TABLE]
where is an unknown mean. The goal is to estimate using the design matrix .
Let be a tuning parameter and define the constrained Lasso estimator [32] by
[TABLE]
Our goal will be to study the performance of the estimator (44) with respect to the prediction loss
[TABLE]
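A constrained Lasso of this kind can be computed by projected gradient descent onto an $\ell_1$ ball. The sketch below is illustrative and assumes the estimator minimizes the residual sum of squares subject to an $\ell_1$-norm constraint of radius `R`, with `R` playing the role of the tuning parameter. The projection uses the well-known sort-based algorithm, `R` must be positive, and `X` must be nonzero. The function names are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {b : |b|_1 <= radius}, radius > 0 (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * ks > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)      # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def constrained_lasso(X, y, R, n_iter=500):
    """Projected gradient for min 0.5 * |y - X b|_2^2 subject to |b|_1 <= R."""
    L = np.linalg.norm(X, 2) ** 2                # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = project_l1_ball(beta - grad / L, R)
    return beta

def prediction_loss(X, beta, mu):
    """Squared Euclidean distance between the fitted mean X @ beta and a mean vector mu."""
    return float(np.sum((X @ beta - mu) ** 2))
```

The factor 0.5 in the objective does not change the minimizer. With step size `1/L` each iteration does not increase the objective, so the fit is never worse than the zero vector.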
Let and assume that for all . Let be the matrix of dimension with columns .
Theorem 7.
Let be a tuning parameter and consider the regression model (43). Define the Lasso estimator by (44). Then for all , with probability greater than ,
[TABLE]
where .
Proof of Theorem 7.
Let be the linear span of and let be the orthogonal projector onto . If , then
[TABLE]
Let be the convex hull of , so that . Let be the convex projection of onto . We apply Proposition 5 to which is a convex hull of points of empirical norm less or equal to . By (47) and (33), the quantity satisfies (31). Applying Theorem 4 completes the proof. ∎
The lower bound [30, Theorem 5.4 and (5.25)] states that there exists an absolute constant such that the following holds. If , then there exists a design matrix such that for all estimators ,
[TABLE]
where for all , denotes the expectation with respect to the distribution of . Thus, Theorem 7 shows that the Least Squares estimator over the set is minimax optimal. In particular, the right hand side of inequality (46) cannot be improved.
4 The anisotropic persistence problem in regression with random design
Consider iid observations where are real valued and the are design random variables in with for some covariance matrix . We consider the learning problem over the function class
[TABLE]
for a given constant . We consider the Empirical Risk Minimizer defined by
[TABLE]
This problem is sometimes referred to as the persistence problem or the persistence framework [15, 4]. The prediction risk of is given by
[TABLE]
where is a new observation distributed as and independent from the data . Define also the oracle by
[TABLE]
and define by
[TABLE]
where the subgaussian norm is defined by for any random variable (see Section 5.2.3 in [35] for equivalent definitions of the norm).
To analyse the above learning problem, we use the machinery developed by Lecué and Mendelson [21] to study learning problems over subgaussian classes. Consider the two quantities
[TABLE]
where . In the present setting, Theorem A from Lecué and Mendelson [21] reads as follows.
Theorem 8 (Theorem A in Lecué and Mendelson [21]).
There exist absolute constants such that the following holds. Let . Consider iid observations with . Assume that the design random vectors are subgaussian with respect to the covariance matrix in the sense that for any . Define by (52) and by (53). Assume that the diagonal elements of are no larger than 1. Then, there exist absolute constants such that the estimator defined in (50) satisfies
[TABLE]
with probability at least .
In the isotropic case (), [25] proves that
[TABLE]
for some constants that only depend on , while
[TABLE]
for some constants that only depend on .
Using Proposition 1 and Equation 14 above lets us extend these bounds to the anisotropic case where is not proportional to the identity matrix.
Proposition 9.
Let , let and assume that the diagonal elements of are no larger than 1. For any , define and by (55) and (54). Then for any , there exist constants that depend only on such that (58) and (57) hold.
The proof of Proposition 9 will be given at the end of this subsection. The primary improvement of Proposition 1 over previous results is that this result is agnostic to the underlying covariance structure. This lets us handle the anisotropic case with in the above proposition.
Proposition 9 combined with Theorem 8 lets us obtain the minimax rate of estimation for the persistence problem in the anisotropic case. Although the minimax rate was previously obtained in the isotropic case, we are not aware of a previous result that yields this rate for general covariance matrices .
Proof of Proposition 9.
In this proof, is an absolute constant whose value may change from line to line. Let . We first bound from above. Let and define
[TABLE]
The random variable has the same distribution as where . Thus, the expectation inside the infimum in (54) is equal to
[TABLE]
To bound from above, it is enough to find some such that (60) is bounded from above by .
By the Cauchy-Schwarz inequality, the right hand side is bounded from above by , which is smaller than for all small enough provided that for some constant that only depends on .
We now bound from above in the regime . Let be the columns of and let be the convex hull of the points . Using the fact that , the right hand side of the previous display is bounded from above by
[TABLE]
where we used Proposition 1 for the last inequality. By simple algebra, one can show that if for some large enough constant that only depends on , then the right hand side of (61) is bounded from above by .
We now bound from above. Let . By definition of , to prove that , it is enough to show that
[TABLE]
is smaller than . We use Proposition 1 to show that the right hand side of the previous display is bounded from above by
[TABLE]
By simple algebra very similar to that of the proof of Proposition 5, we obtain that if equals the right hand side of (58) for large enough and , then the right hand side of the previous display is bounded from above by . This completes the proof of (58). ∎
5 Bounded empirical processes and density estimation
We now prove a result similar to Proposition 1 for bounded empirical processes indexed by the convex hull of points. This will be useful to study the convex aggregation problem for density estimation. Throughout the paper, are i.i.d. Rademacher random variables that are independent of all other random variables.
Proposition 10.
There exists an absolute constant such that the following holds. Let be integers and let be real numbers. Let for some positive semi-definite matrix . Let be i.i.d. random variables valued in some measurable set . Let be measurable functions. Let for all . Assume that almost surely
[TABLE]
for all and all . Then for all such that we have
[TABLE]
where for all .
Proof of Proposition 10.
Let . The function is convex since it can be written as the maximum of two linear functions. Applying Lemma 3 with the above notation and yields
[TABLE]
where the second inequality is a consequence of Fubini’s Theorem and for all ,
[TABLE]
Using (64) and the Rademacher complexity bound for finite classes given in [18, Theorem 3.5], we obtain that for all ,
[TABLE]
where is a numerical constant and is the cardinality of the set . By definition of we have . The cardinality of the set is bounded from above by the right hand side of (27). Combining inequality (66), inequality (67), and the fact that the integrals and are finite, we obtain
[TABLE]
for some absolute constant . By definition of , we have . A monotonicity argument completes the proof. ∎
Next, we show that Proposition 10 can be used to derive a condition similar to (33) for bounded empirical processes. To bound from above the performance of ERM procedures in density estimation, Theorem 13 in the appendix requires the existence of a quantity such that
[TABLE]
where is the function defined in Proposition 10 above.
To obtain such a quantity under the assumptions of Proposition 10, we proceed as follows. Let and assume that
[TABLE]
Define where is a numerical constant that will be chosen later. We now bound from above the right hand side of (65). We have
[TABLE]
where for the last inequality we used that for all and that , since and . Thus, the right hand side of (65) is bounded from above by
[TABLE]
It is clear that the above quantity is bounded from above by if the numerical constant is large enough. Thus we have proved that as long as , inequality (69) holds for
[TABLE]
where is a numerical constant.
ERM and convex aggregation in density estimation
The minimax optimal rate for the convex aggregation problem is known to be of order
[TABLE]
for regression with fixed design [30] and regression with random design [34] if the integers and satisfy or equivalently . The arguments for the convex aggregation lower bound from [34] can be readily applied to density estimation, showing that the rate is a lower bound on the optimal rate of convex aggregation for density estimation.
We now use the results of the previous sections to show that ERM is optimal for the convex aggregation problem in regression with fixed design, regression with random design and density estimation.
Theorem 11.
There exists an absolute constant such that the following holds. Let be a measurable space with measure . Let be an unknown density with respect to the measure . Let be i.i.d. random variables valued in with density . Let and let for all . Let
[TABLE]
Then for all , with probability greater than ,
[TABLE]
where and .
Proof.
It is a direct application of Theorem 13 in the appendix. If , a fixed point is given by Lemma 14. If , we use Proposition 10 with , and . The bound (69) yields the existence of a fixed point in this regime. ∎
Appendix A Proof of the lower bound (9)
Proof of Proposition 2.
By the Varshamov-Gilbert extraction lemma [13, Lemma 2.5], there exists a subset of such that
[TABLE]
for any distinct .
For each , we define , a signed version of , as follows. Let be iid Rademacher random variables. Then we have
[TABLE]
Hence, there exists some with for all such that .
Define . Since , each element of is of the form where are distinct elements of , hence by convexity of we have . By definition of , it holds that , and thus . For any two distinct ,
[TABLE]
where the supremum is taken over any two distinct elements of . By Sudakov’s inequality (see for instance [8, Theorem 13.4]) we have
[TABLE]
Since , the right hand side of the previous display is equal to the right hand side of (9) and the proof is complete. ∎
Appendix B Local Rademacher complexities and density estimation
A vast literature on local Rademacher complexities has emerged in the last decade to study the performance of empirical risk minimizers (ERM) for general learning problems, cf. [3, 2, 17] and the references therein. The following result is given in [3, Theorem 2.1]. Let be independent Rademacher random variables that are independent of all other random variables considered in the paper.
Theorem 12 (Bartlett et al. [3]).
Let be i.i.d. random variables valued in some measurable space . Let be a class of measurable functions. Assume that there is some such that for all . Then for all , with probability greater than ,
[TABLE]
Theorem 12 is a straightforward consequence of Talagrand’s inequality. We now explain how Theorem 12 can be used to derive sharp oracle inequalities in density estimation.
Theorem 13.
Let be a measurable space with measure . Let be an unknown density with respect to the measure . Let be i.i.d. random variables valued in with density . Let be a convex subset of . Assume that there exists such that . Assume that for some ,
[TABLE]
Assume that there exists an estimator such that almost surely,
[TABLE]
Then for all , with probability greater than ,
[TABLE]
where .
Proof of Theorem 13.
By optimality of we have
[TABLE]
where for all , is the random variable
[TABLE]
Let and define
[TABLE]
The class is convex, and so that implies . For any linear form ,
[TABLE]
so that by taking expectations, (83) holds if is replaced by .
For any , and is valued in -almost surely. We apply Theorem 12 to the class . This yields that with probability greater than , if is such that , then
[TABLE]
On the same event of probability greater than , if is such that , consider which belongs to . We have , which can be rewritten
[TABLE]
so that . In summary, we have proved that on an event of probability greater than , . In particular, this holds for which completes the proof. ∎
Appendix C A fixed point for finite dimensional classes
Lemma 14.
Consider the notations of Theorem 13 and assume that the linear span of is finite dimensional of dimension . Then (83) is satisfied for .
Proof.
Let be an orthonormal basis of the linear span of , for the scalar product . Then
[TABLE]
where we have used the Cauchy-Schwarz inequality, Jensen’s inequality, and that for all . ∎
- [1] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. Ann. Statist., 35(2):608–633, 2007. doi:10.1214/009053606000001217. URL http://dx.doi.org/10.1214/009053606000001217.
- [2] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
- [3] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Ann. Statist., 33(4):1497–1537, 2005. doi:10.1214/009053605000000282. URL http://dx.doi.org/10.1214/009053605000000282.
- [4] Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. $\ell_1$-regularized linear regression: persistence and oracle inequalities. Probability Theory and Related Fields, 154(1-2):193–224, 2012.
- [5] Pierre C. Bellec. Optimal bounds for aggregation of affine estimators. Annals of Statistics, to appear, 2017a. URL https://arxiv.org/pdf/1410.0346v4.pdf.
- [6] Pierre C. Bellec. Optimal exponential bounds for aggregation of density estimators. Bernoulli, 23(1):219–248, 2017b. doi:10.3150/15-BEJ742. URL http://dx.doi.org/10.3150/15-BEJ742.
- [7] Pierre C. Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. Annals of Statistics, to appear, 2017c. URL https://arxiv.org/pdf/1510.08029.pdf.
- [8] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
