Localized Gaussian width of $M$-convex hulls with applications to Lasso and convex aggregation

Pierre C Bellec

arXiv:1705.10696·math.ST·September 28, 2017

TL;DR

This paper derives bounds on the Gaussian mean width of convex hulls intersected with Euclidean balls and applies these results to analyze the performance of Lasso, ERM, and convex aggregation methods in statistical estimation.

Contribution

It introduces new bounds on Gaussian widths of convex hulls under restricted isometry conditions and applies them to key statistical estimators.

Findings

1. The upper and lower bounds match up to a multiplicative constant under a one-sided RIP condition

2. Provides theoretical insights into Lasso and aggregation performance

3. Enhances understanding of geometric properties in high-dimensional statistics

Abstract

Upper and lower bounds are derived for the Gaussian mean width of the intersection of a convex hull of $M$ points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property. This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.

Equations

$$\ell(T):=\mathbb{E}\sup_{{\boldsymbol{u}}\in T}{\boldsymbol{u}}^{T}{\boldsymbol{g}},$$

$$B_{2}=\{{\boldsymbol{u}}\in\mathbf{R}^{n}:|{\boldsymbol{u}}|_{2}\leq 1\},\qquad sB_{2}=\{s{\boldsymbol{u}}\in\mathbf{R}^{n},\ {\boldsymbol{u}}\in B_{2}\}\quad\text{for all }s\geq 0.$$

$$\ell(sB_{2}\cap T),$$

$$\ell(sB_{2}\cap B_{1})\asymp\sqrt{\log(en(s^{2}\wedge 1))}\wedge(s\sqrt{n}),$$

$$\ell(T\cap sB_{2})\leq\left(4\sqrt{\log_{+}(4eM(s^{2}\wedge 1))}\right)\wedge\left(s\sqrt{n\wedge M}\right)$$

$$\ell(T\cap sB_{2})\leq s\sqrt{n\wedge M}$$

$$\ell(T\cap sB_{2})\leq 4\sqrt{\log_{+}(4eM(s^{2}\wedge 1))}.$$

$$\kappa|{\boldsymbol{\theta}}|_{2}\leq|{\boldsymbol{\mu}}_{\boldsymbol{\theta}}|_{2}\quad\text{for all }{\boldsymbol{\theta}}\in\mathbf{R}^{M}\text{ such that }|{\boldsymbol{\theta}}|_{0}\leq 2m,$$

$$\ell(T\cap sB_{2})\geq\frac{\sqrt{2}}{4}\,\kappa\sqrt{\log\Big(\frac{Ms^{2}}{5}\Big)}.$$

$$\Lambda^{M}=\Big\{{\boldsymbol{\theta}}\in\mathbf{R}^{M},\quad\sum_{j=1}^{M}\theta_{j}=1,\quad\forall j=1,\dots,M,\;\;\theta_{j}\geq 0\Big\}.$$

$$Q({\boldsymbol{\theta}})={\boldsymbol{\theta}}^{T}\Sigma{\boldsymbol{\theta}},$$

$$\Lambda^{M}_{m}:=\Big\{\frac{1}{m}\sum_{k=1}^{m}{\boldsymbol{u}}_{k},\quad{\boldsymbol{u}}_{1},...,{\boldsymbol{u}}_{m}\in\{{\boldsymbol{e}}_{1},...,{\boldsymbol{e}}_{M}\}\Big\},$$

$$\mathbb{P}(\Theta_{k}={\boldsymbol{e}}_{j})=\bar{\theta}_{j}\quad\text{for all }k=1,...,m.$$

$$\hat{\boldsymbol{\theta}}=\frac{1}{m}\sum_{k=1}^{m}\Theta_{k}.$$

$$\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]\leq Q(\bar{\boldsymbol{\theta}})+R^{2}/m,$$

$$Q(\tilde{\boldsymbol{\theta}})\leq Q(\bar{\boldsymbol{\theta}})+R^{2}/m.$$

$$\sup_{{\boldsymbol{\theta}}\in\Lambda^{M}:\,Q({\boldsymbol{\theta}})\leq t^{2}}F({\boldsymbol{\theta}})\leq\int_{1}^{+\infty}\Big[\max_{{\boldsymbol{\theta}}\in\Lambda^{M}_{m}:\,Q({\boldsymbol{\theta}})\leq x(t^{2}+R^{2}/m)}F({\boldsymbol{\theta}})\Big]\frac{dx}{x^{2}}.$$

$$\text{maximize }F({\boldsymbol{\theta}})$$

$$E:=\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]=Q(\bar{\boldsymbol{\theta}})+\frac{1}{m}\mathbb{E}_{\Theta}\big[(\Theta_{1}-\bar{\boldsymbol{\theta}})^{T}\Sigma(\Theta_{1}-\bar{\boldsymbol{\theta}})\big].$$

$$\mathbb{E}_{\Theta}(\Theta_{1}-\bar{\boldsymbol{\theta}})^{T}\Sigma(\Theta_{1}-\bar{\boldsymbol{\theta}})=\mathbb{E}_{\Theta}[Q(\Theta_{1})]-Q(\bar{\boldsymbol{\theta}})\leq\mathbb{E}_{\Theta}Q(\Theta_{1})\leq R^{2},$$

$$E=\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]\leq Q(\bar{\boldsymbol{\theta}})+R^{2}/m\leq t^{2}+R^{2}/m.$$

$$\sup_{{\boldsymbol{\theta}}\in\Lambda^{M}:\,Q({\boldsymbol{\theta}})\leq t^{2}}F({\boldsymbol{\theta}})=F(\bar{\boldsymbol{\theta}})=F(\mathbb{E}_{\Theta}[\hat{\boldsymbol{\theta}}])\leq\mathbb{E}_{\Theta}[F(\hat{\boldsymbol{\theta}})]\leq\mathbb{E}_{\Theta}[g(Q(\hat{\boldsymbol{\theta}})/E)]$$

$$\mathbb{E}_{\Theta}[g(X)]=\mathbb{E}_{\Omega}[g(\tilde{X})]\leq\mathbb{E}_{\Omega}[g(\tilde{Y})]=\int_{1}^{+\infty}\frac{g(x)}{x^{2}}dx.$$

$$\log|\Lambda^{M}_{m}|=\log\binom{M+m-1}{m}\leq\log\binom{2M}{m}\leq m\log\Big(\frac{2eM}{m}\Big).$$

$$T=\text{convex hull of }\{{\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}\}=\{{\boldsymbol{\mu}}_{\boldsymbol{\theta}},\ {\boldsymbol{\theta}}\in\Lambda^{M}\},$$

$$\mathbb{E}\sup_{{\boldsymbol{\theta}}\in\Lambda^{M}:\,Q({\boldsymbol{\theta}})\leq r^{2}}{\boldsymbol{g}}^{T}{\boldsymbol{\mu}}_{\boldsymbol{\theta}}\leq\mathbb{E}\int_{1}^{+\infty}\Big[\max_{{\boldsymbol{\theta}}\in\Lambda^{M}_{m}:\,Q({\boldsymbol{\theta}})\leq x(r^{2}+1/m)}F({\boldsymbol{\theta}})\Big]\frac{dx}{x^{2}}.$$

$$\int_{1}^{+\infty}\frac{1}{x^{2}}\sqrt{\frac{4x\log|\Lambda^{M}_{m}|}{m}}\,dx\leq\sqrt{\log(2eM/m)}\int_{1}^{+\infty}\frac{2}{x^{3/2}}dx.$$

$$\frac{1}{n}\mathbb{E}\Big[\sup_{{\boldsymbol{u}}\in K:\,\frac{1}{n}|{\boldsymbol{f}}_{0}^{*}-{\boldsymbol{u}}|_{2}^{2}\leq t_{*}^{2}}{\boldsymbol{\xi}}^{T}({\boldsymbol{u}}-{\boldsymbol{f}}_{0}^{*})\Big]\leq\frac{t_{*}^{2}}{2}.$$

Full text

Localized Gaussian width of $M$ -convex hulls with applications to Lasso and convex aggregation

Pierre C. Bellec

Rutgers University, Department of Statistics and Biostatistics

Abstract

Upper and lower bounds are derived for the Gaussian mean width of the intersection of a convex hull of $M$ points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property.

This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.

1 Introduction

Let $T$ be a subset of $\mathbf{R}^{n}$ . The Gaussian width of $T$ is defined as

$$\ell(T):=\mathbb{E}\sup_{{\boldsymbol{u}}\in T}{\boldsymbol{u}}^{T}{\boldsymbol{g}},\tag{1}$$

where ${\boldsymbol{g}}=(g_{1},...,g_{n})^{T}$ and $g_{1},...,g_{n}$ are i.i.d. standard normal random variables. For any vector ${\boldsymbol{u}}\in\mathbf{R}^{n}$ , denote by $|{\boldsymbol{u}}|_{2}$ its Euclidean norm and define the Euclidean balls

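$$B_{2}=\{{\boldsymbol{u}}\in\mathbf{R}^{n}:|{\boldsymbol{u}}|_{2}\leq 1\},\qquad sB_{2}=\{s{\boldsymbol{u}}\in\mathbf{R}^{n},\ {\boldsymbol{u}}\in B_{2}\}\quad\text{for all }s\geq 0.\tag{2}$$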

We will also use the notation $S^{n-1}=\{{\boldsymbol{u}}\in\mathbf{R}^{n}:|{\boldsymbol{u}}|_{2}=1\}$. The localized Gaussian width of $T$ with radius $s>0$ is the quantity $\ell(T\cap sB_{2})$. For any ${\boldsymbol{u}}\in\mathbf{R}^{n}$, define the $\ell_{p}$ norm by $|{\boldsymbol{u}}|_{p}=(\sum_{i=1}^{n}|u_{i}|^{p})^{1/p}$ for any $p\geq 1$, and let $|{\boldsymbol{u}}|_{0}$ be the number of nonzero coefficients of ${\boldsymbol{u}}$.

This paper studies the localized Gaussian width

$$\ell(sB_{2}\cap T),\tag{3}$$

where $T$ is the convex hull of $M$ points in $\mathbf{R}^{n}$ .
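As an illustration of the definition, the following minimal sketch estimates the Gaussian width of a convex hull by Monte Carlo. It covers only the non-localized case (a radius $s\geq 1$ with $T\subset B_{2}$, so that $T\cap sB_{2}=T$), where the supremum of a linear functional over the hull is attained at a vertex. The function names, the choice of the $\ell_{1}$ ball as test set, and the sample sizes are assumptions made for this example and do not come from the paper.

```python
import math
import random


def gaussian_width_of_hull(vertices, n_draws=1000, seed=0):
    """Monte Carlo estimate of E sup_{u in conv(vertices)} <u, g>.

    A linear functional attains its supremum over a convex hull at a vertex,
    so the supremum reduces to a maximum over the finite vertex list.
    """
    rng = random.Random(seed)
    n = len(vertices[0])
    total = 0.0
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += max(sum(v_i * g_i for v_i, g_i in zip(v, g)) for v in vertices)
    return total / n_draws


def l1_ball_vertices(n):
    """Vertices +e_i and -e_i of the unit l1 ball in R^n."""
    vertices = []
    for i in range(n):
        for sign in (1.0, -1.0):
            v = [0.0] * n
            v[i] = sign
            vertices.append(v)
    return vertices


n = 20
width_estimate = gaussian_width_of_hull(l1_ball_vertices(n))
# Classical bound on the expected maximum of 2n standard Gaussians.
classical_bound = math.sqrt(2.0 * math.log(2 * n))
print(f"estimate={width_estimate:.3f}, bound={classical_bound:.3f}")
```

The localized case $s<1$ requires maximizing over $T\cap sB_{2}$, which is a convex program rather than a maximum over vertices, and is not covered by this sketch.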

If $T=B_{1}=\{{\boldsymbol{u}}\in\mathbf{R}^{n}:|{\boldsymbol{u}}|_{1}\leq 1\}$ , then matching upper and lower bounds are available for the localized Gaussian width:

$$\ell(sB_{2}\cap B_{1})\asymp\sqrt{\log(en(s^{2}\wedge 1))}\wedge(s\sqrt{n}),\tag{4}$$

cf. [14] and [21, Section 4.1]. In the above display, $a\asymp b$ means that $a\leq Cb$ and $b\leq Ca$ for some large enough numerical constant $C\geq 1$ .

The first goal of this paper is to generalize this bound to any $T$ that is the convex hull of $M\geq 1$ points in $\mathbf{R}^{n}$ .

Contributions.

Section 2 is devoted to the generalization of (4) and provides sharp bounds on the localized Gaussian width of the convex hull of $M$ points in $\mathbf{R}^{n}$ , see Propositions 1 and 2 below. Sections 3, 4 and 5 provide statistical applications of the results of Section 2. Section 3 studies the Lasso estimator and the convex aggregation problem in fixed-design regression. In Section 4, we show that Empirical Risk Minimization achieves the minimax rate for the persistence problem in the anisotropic setting. Finally, Section 5 provides results for bounded empirical processes and for the convex aggregation problem in density estimation.

2 Localized Gaussian width of an $M$-convex hull

The first contribution of the present paper is the following upper bound on the localized Gaussian width of the convex hull of $M$ points in $\mathbf{R}^{n}$.

Proposition 1.

Let $n\geq 1$ and $M\geq 2$ . Let $T$ be the convex hull of $M$ points in $\mathbf{R}^{n}$ and assume that $T\subset B_{2}$ . Let ${\boldsymbol{g}}$ be a centered Gaussian random variable with covariance matrix $I_{n\times n}$ . Then for all $s>0$ ,

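$$\ell(T\cap sB_{2})\leq\left(4\sqrt{\log_{+}(4eM(s^{2}\wedge 1))}\right)\wedge\left(s\sqrt{n\wedge M}\right),\tag{5}$$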

where $\log_{+}(a)=\max(1,\log a)$ .

Proposition 1 is proved in the next two subsections. Inequality

$$\ell(T\cap sB_{2})\leq s\sqrt{n\wedge M}\tag{6}$$

is a direct consequence of the Cauchy-Schwarz inequality and $\mathbb{E}|P{\boldsymbol{g}}|_{2}\leq\sqrt{d}$ where $P\in\mathbf{R}^{n\times n}$ is the orthogonal projection onto the linear span of $T$ and $d\leq(n\wedge M)$ is the rank of $P$ . The novelty of (5) is inequality

$$\ell(T\cap sB_{2})\leq 4\sqrt{\log_{+}(4eM(s^{2}\wedge 1))}.\tag{7}$$

Inequality (7) was known for the $\ell_{1}$ -ball $T=\{{\boldsymbol{u}}\in\mathbf{R}^{n}:|{\boldsymbol{u}}|_{1}\leq 1\}$ [14], but to our knowledge (7) is new for general $M$ -convex hulls. If $T$ is the $\ell_{1}$ -ball, then the bound (5) is sharp up to numerical constants [14], [21, Section 4.1].
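The Cauchy-Schwarz step behind inequality (6) can be written out as follows. This is a sketch that uses only the objects introduced above: $P$ is the orthogonal projection onto the linear span of $T$, $d$ is its rank, and the first equality uses that $P{\boldsymbol{u}}={\boldsymbol{u}}$ for every ${\boldsymbol{u}}\in T$.

```latex
\begin{align*}
\ell(T\cap sB_{2})
  &= \mathbb{E}\sup_{u\in T\cap sB_{2}} u^{T}Pg
   \le \mathbb{E}\sup_{u\in sB_{2}} |u|_{2}\,|Pg|_{2}
   \le s\,\mathbb{E}|Pg|_{2} \\
  &\le s\sqrt{\mathbb{E}|Pg|_{2}^{2}}
   = s\sqrt{\operatorname{trace}(P)}
   = s\sqrt{d}
   \le s\sqrt{n\wedge M}.
\end{align*}
```

The last inequality holds because the linear span of $M$ points in $\mathbf{R}^{n}$ has dimension at most $n\wedge M$.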

The above result does not assume any type of Restricted Isometry Property (RIP). The following proposition shows that (7) is essentially sharp provided that the vertices of $T$ satisfy a one-sided RIP of order $2/s^{2}$.

Proposition 2.

Let $n\geq 1$ and $M\geq 2$ . Let ${\boldsymbol{g}}$ be a centered Gaussian random variable with covariance matrix $I_{n\times n}$ . Let $s\in(0,1]$ and assume for simplicity that $m=1/s^{2}$ is a positive integer such that $m\leq M/5$ . Let $T$ be the convex hull of the $2M$ points $\{\pm{\boldsymbol{\mu}}_{1},...,\pm{\boldsymbol{\mu}}_{M}\}$ where ${\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}\in S^{n-1}$ . Assume that for some real number $\kappa\in(0,1)$ we have

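$$\kappa|{\boldsymbol{\theta}}|_{2}\leq|{\boldsymbol{\mu}}_{\boldsymbol{\theta}}|_{2}\quad\text{for all }{\boldsymbol{\theta}}\in\mathbf{R}^{M}\text{ such that }|{\boldsymbol{\theta}}|_{0}\leq 2m,\tag{8}$$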

where ${\boldsymbol{\mu}}_{\boldsymbol{\theta}}=\sum_{j=1}^{M}\theta_{j}{\boldsymbol{\mu}}_{j}$ . Then

$$\ell(T\cap sB_{2})\geq\frac{\sqrt{2}}{4}\,\kappa\sqrt{\log\Big(\frac{Ms^{2}}{5}\Big)}.\tag{9}$$

The proof of Proposition 2 is given in Appendix A.

2.1 A refinement of Maurey’s argument

This subsection provides the main tool to derive the upper bound (7). Define the simplex in $\mathbf{R}^{M}$ by

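$$\Lambda^{M}=\Big\{{\boldsymbol{\theta}}\in\mathbf{R}^{M},\quad\sum_{j=1}^{M}\theta_{j}=1,\quad\forall j=1,\dots,M,\;\;\theta_{j}\geq 0\Big\}.\tag{10}$$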

Let $m\geq 1$ be an integer, and let

$$Q({\boldsymbol{\theta}})={\boldsymbol{\theta}}^{T}\Sigma{\boldsymbol{\theta}},\tag{11}$$

where $\Sigma=(\Sigma_{jj^{\prime}})_{j,j^{\prime}=1,...,M}$ is a positive semi-definite matrix of size $M$ . Let $\bar{\boldsymbol{\theta}}\in\Lambda^{M}$ be a deterministic vector such that $Q(\bar{\boldsymbol{\theta}})$ is small. Maurey’s argument [27] has been used extensively to prove the existence of a sparse vector $\tilde{\boldsymbol{\theta}}\in\Lambda^{M}$ such that $Q(\tilde{\boldsymbol{\theta}})$ is of the same order as that of $Q(\bar{\boldsymbol{\theta}})$ . Maurey’s argument uses the probabilistic method to prove the existence of such $\tilde{\boldsymbol{\theta}}$ . A sketch of this argument is as follows.

Define the discrete set $\Lambda^{M}_{m}$ as

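$$\Lambda^{M}_{m}:=\Big\{\frac{1}{m}\sum_{k=1}^{m}{\boldsymbol{u}}_{k},\quad{\boldsymbol{u}}_{1},...,{\boldsymbol{u}}_{m}\in\{{\boldsymbol{e}}_{1},...,{\boldsymbol{e}}_{M}\}\Big\},\tag{12}$$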

where $({\boldsymbol{e}}_{1},...,{\boldsymbol{e}}_{M})$ is the canonical basis in $\mathbf{R}^{M}$ . The discrete set $\Lambda^{M}_{m}$ is a subset of the simplex $\Lambda^{M}$ that contains only $m$ -sparse vectors.

Let $\Theta_{1},...,\Theta_{m}$ be i.i.d. random variables valued in $\{{\boldsymbol{e}}_{1},...,{\boldsymbol{e}}_{M}\}$ with distribution

$$\mathbb{P}(\Theta_{k}={\boldsymbol{e}}_{j})=\bar{\theta}_{j}\quad\text{for all }k=1,...,m.\tag{13}$$

Next, consider the random variable

$$\hat{\boldsymbol{\theta}}=\frac{1}{m}\sum_{k=1}^{m}\Theta_{k}.\tag{14}$$

The random variable $\hat{\boldsymbol{\theta}}$ is valued in $\Lambda^{M}_{m}$ and is such that $\mathbb{E}_{\Theta}[\hat{\boldsymbol{\theta}}]=\bar{\boldsymbol{\theta}}$ , where $\mathbb{E}_{\Theta}$ denotes the expectation with respect to $\hat{\boldsymbol{\theta}}$ . Then a bias-variance decomposition yields

$$\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]\leq Q(\bar{\boldsymbol{\theta}})+R^{2}/m,\tag{15}$$

where $R>0$ is a constant such that $\max_{j=1,...,M}\Sigma_{jj}\leq R^{2}$. As $\min_{{\boldsymbol{\theta}}\in\Lambda^{M}_{m}}Q({\boldsymbol{\theta}})\leq\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]$, this yields the existence of $\tilde{\boldsymbol{\theta}}\in\Lambda^{M}_{m}$ such that

$$Q(\tilde{\boldsymbol{\theta}})\leq Q(\bar{\boldsymbol{\theta}})+R^{2}/m.\tag{16}$$

If $m$ is chosen large enough, the two terms $Q(\bar{\boldsymbol{\theta}})$ and $R^{2}/m$ are of the same order, and we have established the existence of an $m$-sparse vector $\tilde{\boldsymbol{\theta}}$ such that $Q(\tilde{\boldsymbol{\theta}})$ is not substantially larger than $Q(\bar{\boldsymbol{\theta}})$.
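The bias-variance step in this sketch of Maurey's argument can be checked exactly on a small instance. The code below is a minimal illustration: it enumerates all $M^{m}$ outcomes of the draws $\Theta_{1},...,\Theta_{m}$ with exact rational arithmetic and compares $\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]$ with $Q(\bar{\boldsymbol{\theta}})+R^{2}/m$. The matrix `sigma`, the vector `theta_bar` and the function names are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import product


def quad_form(theta, sigma):
    """Q(theta) = theta^T Sigma theta."""
    size = len(theta)
    return sum(theta[j] * sigma[j][k] * theta[k]
               for j in range(size) for k in range(size))


def expected_q_of_theta_hat(theta_bar, sigma, m):
    """Exact E_Theta[Q(theta_hat)], enumerating all M^m outcomes of the draws."""
    size = len(theta_bar)
    expectation = Fraction(0)
    for draw in product(range(size), repeat=m):
        prob = Fraction(1)
        theta_hat = [Fraction(0)] * size
        for j in draw:
            prob *= theta_bar[j]             # P(Theta_k = e_j) = theta_bar_j
            theta_hat[j] += Fraction(1, m)   # theta_hat = (1/m) sum_k Theta_k
        expectation += prob * quad_form(theta_hat, sigma)
    return expectation


# Hypothetical example: Gram matrix of mu_1=(1,0), mu_2=(0,1), mu_3=(-1,0).
sigma = [[1, 0, -1], [0, 1, 0], [-1, 0, 1]]
theta_bar = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
m, r_squared = 3, 1  # r_squared bounds the diagonal entries of sigma

q_bar = quad_form(theta_bar, sigma)
expectation = expected_q_of_theta_hat(theta_bar, sigma, m)
# Exact identity: E Q(theta_hat) = Q(theta_bar) + (E Q(Theta_1) - Q(theta_bar)) / m
variance_term = sum(theta_bar[j] * sigma[j][j] for j in range(3)) - q_bar
print(q_bar, expectation, q_bar + Fraction(r_squared, m))
```

Exhaustive enumeration is only feasible for very small $M$ and $m$; it is used here because it makes the identity exact rather than approximate.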

For our purpose, we need to refine this argument by controlling the deviation of the random variable $Q({\boldsymbol{\hat{\theta}}})$ . This is done in Lemma 3 below.

Lemma 3.

Let $m\geq 1$ and define $\Lambda^{M}_{m}$ by (12). Let $F:\mathbf{R}^{M}\rightarrow[0,+\infty)$ be a convex function. For all ${\boldsymbol{\theta}}\in\mathbf{R}^{M}$ , let

$$Q({\boldsymbol{\theta}})={\boldsymbol{\theta}}^{T}\Sigma{\boldsymbol{\theta}},\tag{17}$$

where $\Sigma=(\Sigma_{jj^{\prime}})_{j,j^{\prime}=1,...,M}$ is a positive semi-definite matrix of size $M$ . Assume that the diagonal elements of $\Sigma$ satisfy $\Sigma_{jj}\leq R^{2}$ for all $j=1,...,M$ . Then for all $t>0$ ,

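$$\sup_{{\boldsymbol{\theta}}\in\Lambda^{M}:\,Q({\boldsymbol{\theta}})\leq t^{2}}F({\boldsymbol{\theta}})\leq\int_{1}^{+\infty}\Big[\max_{{\boldsymbol{\theta}}\in\Lambda^{M}_{m}:\,Q({\boldsymbol{\theta}})\leq x(t^{2}+R^{2}/m)}F({\boldsymbol{\theta}})\Big]\frac{dx}{x^{2}}.\tag{18}$$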

In the next sections, it will be useful to bound from above the quantity $F({\boldsymbol{\theta}})$ maximized over $\Lambda^{M}$ subject to the constraint $Q({\boldsymbol{\theta}})\leq t^{2}$ . An interpretation of (18) is as follows. Consider the two optimization problems

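$$\text{maximize }F({\boldsymbol{\theta}})\ \text{ subject to }\ {\boldsymbol{\theta}}\in\Lambda^{M},\ Q({\boldsymbol{\theta}})\leq t^{2},\qquad\text{and}\qquad\text{maximize }F({\boldsymbol{\theta}})\ \text{ subject to }\ {\boldsymbol{\theta}}\in\Lambda^{M}_{m},\ Q({\boldsymbol{\theta}})\leq Y(t^{2}+R^{2}/m),$$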

for some $Y\geq 1$. Inequality (18) says that the optimal value of the first optimization problem is smaller than the optimal value of the second optimization problem averaged over the distribution of $Y$ given by the density $y\mapsto 1/y^{2}$ on $[1,+\infty)$. The second optimization problem above is over the discrete set $\Lambda^{M}_{m}$ with the relaxed constraint $Q({\boldsymbol{\theta}})\leq Y(t^{2}+R^{2}/m)$, hence we have relaxed the constraint in exchange for discreteness. The discreteness of the set $\Lambda^{M}_{m}$ will be used in the next subsection for the proof of Proposition 1.

Proof of Lemma 3.

The set $\{{\boldsymbol{\theta}}\in\Lambda^{M}:Q({\boldsymbol{\theta}})\leq t^{2}\}$ is compact. The function $F$ is convex with domain $\mathbf{R}^{M}$ and thus continuous. Hence the supremum in the left hand side of (18) is achieved at some $\bar{\boldsymbol{\theta}}\in\Lambda^{M}$ such that $Q(\bar{\boldsymbol{\theta}})\leq t^{2}$. Let $\Theta_{1},...,\Theta_{m},{\boldsymbol{\hat{\theta}}}$ be the random variables defined in (13) and (14) above. Denote by $\mathbb{E}_{\Theta}$ the expectation with respect to $\Theta_{1},...,\Theta_{m}$. By definition, ${\boldsymbol{\hat{\theta}}}\in\Lambda^{M}_{m}$ and $\mathbb{E}_{\Theta}{\boldsymbol{\hat{\theta}}}=\bar{\boldsymbol{\theta}}$. Let $E=\mathbb{E}_{\Theta}[Q({\boldsymbol{\hat{\theta}}})]$. A bias-variance decomposition and the independence of $\Theta_{1},...,\Theta_{m}$ yield

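$$E:=\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]=Q(\bar{\boldsymbol{\theta}})+\frac{1}{m}\mathbb{E}_{\Theta}\big[(\Theta_{1}-\bar{\boldsymbol{\theta}})^{T}\Sigma(\Theta_{1}-\bar{\boldsymbol{\theta}})\big].$$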

Another bias-variance decomposition yields

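$$\mathbb{E}_{\Theta}(\Theta_{1}-\bar{\boldsymbol{\theta}})^{T}\Sigma(\Theta_{1}-\bar{\boldsymbol{\theta}})=\mathbb{E}_{\Theta}[Q(\Theta_{1})]-Q(\bar{\boldsymbol{\theta}})\leq\mathbb{E}_{\Theta}Q(\Theta_{1})\leq R^{2},$$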

where we used that $Q(\cdot)\geq 0$ and that $\Theta_{1}^{T}\Sigma\Theta_{1}\leq R^{2}$ almost surely. Thus

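$$E=\mathbb{E}_{\Theta}[Q(\hat{\boldsymbol{\theta}})]\leq Q(\bar{\boldsymbol{\theta}})+R^{2}/m\leq t^{2}+R^{2}/m.$$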

Define the random variable $X=Q({\boldsymbol{\hat{\theta}}})/E$, which is nonnegative and satisfies $\mathbb{E}_{\Theta}[X]=1$. By Markov's inequality, it holds that $\mathbb{P}_{\Theta}(X>t)\leq 1/t=\int_{t}^{+\infty}(1/x^{2})dx$ for all $t\geq 1$. Define the random variable $Y$ by the density function $x\mapsto 1/x^{2}$ on $[1,+\infty)$. Then we have $\mathbb{P}_{\Theta}(X>t)\leq\mathbb{P}(Y>t)$ for any $t>0$, so by stochastic dominance, there exists a rich enough probability space $\Omega$ and random variables $\tilde{X}$ and $\tilde{Y}$ defined on $\Omega$ such that $\tilde{X}$ and $X$ have the same distribution, $\tilde{Y}$ and $Y$ have the same distribution, and $\tilde{X}\leq\tilde{Y}$ almost surely on $\Omega$ (see for instance Theorem 7.1 in [12]). Denote by $\mathbb{E}_{\Omega}$ the expectation sign on the probability space $\Omega$.

By definition of $\bar{\boldsymbol{\theta}}$ and ${\boldsymbol{\hat{\theta}}}$ , using Jensen’s inequality, Fubini’s Theorem and the fact that ${\boldsymbol{\hat{\theta}}}\in\Lambda^{M}_{m}$ we have

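$$\sup_{{\boldsymbol{\theta}}\in\Lambda^{M}:\,Q({\boldsymbol{\theta}})\leq t^{2}}F({\boldsymbol{\theta}})=F(\bar{\boldsymbol{\theta}})=F(\mathbb{E}_{\Theta}[\hat{\boldsymbol{\theta}}])\leq\mathbb{E}_{\Theta}[F(\hat{\boldsymbol{\theta}})]\leq\mathbb{E}_{\Theta}[g(Q(\hat{\boldsymbol{\theta}})/E)]\tag{24}$$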

where $g(\cdot)$ is the nondecreasing function $g(x)=\max_{{\boldsymbol{\theta}}\in\Lambda^{M}_{m}:Q({\boldsymbol{\theta}})\leq xE}F({\boldsymbol{\theta}})$. The right hand side of the previous display is equal to $\mathbb{E}_{\Theta}[g(X)]$. Next, we use the random variables $\tilde{X}$ and $\tilde{Y}$ as follows:

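$$\mathbb{E}_{\Theta}[g(X)]=\mathbb{E}_{\Omega}[g(\tilde{X})]\leq\mathbb{E}_{\Omega}[g(\tilde{Y})]=\int_{1}^{+\infty}\frac{g(x)}{x^{2}}dx.$$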

Combining the previous display and (24) completes the proof. ∎
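The final combination can be spelled out as follows. This is a sketch that uses only the quantities defined in the proof: since $E\leq t^{2}+R^{2}/m$, enlarging the constraint set can only increase the maximum, so $g$ is dominated pointwise on $[1,+\infty)$ by the integrand of (18); the second line records the tail and the mean under the density $x\mapsto 1/x^{2}$.

```latex
\begin{align*}
g(x) = \max_{\theta\in\Lambda^{M}_{m}:\,Q(\theta)\le xE} F(\theta)
  &\le \max_{\theta\in\Lambda^{M}_{m}:\,Q(\theta)\le x(t^{2}+R^{2}/m)} F(\theta),
  \qquad x\ge 1, \\
\mathbb{P}(Y>u) = \int_{u}^{+\infty}\frac{dx}{x^{2}} = \frac{1}{u}\quad(u\ge 1),
  &\qquad
  \mathbb{E}[g(Y)] = \int_{1}^{+\infty}\frac{g(x)}{x^{2}}\,dx .
\end{align*}
```

Integrating the first line against $dx/x^{2}$ over $[1,+\infty)$ and chaining with (24) gives the right hand side of (18).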

2.2 Proof of (7)

We are now ready to prove Proposition 1. The main ingredients are Lemma 3 and the following upper bound on the cardinality of $\Lambda^{M}_{m}$:

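$$\log|\Lambda^{M}_{m}|=\log\binom{M+m-1}{m}\leq\log\binom{2M}{m}\leq m\log\Big(\frac{2eM}{m}\Big).\tag{27}$$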
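The cardinality bound (27) can be sanity-checked numerically. The sketch below assumes that each element of $\Lambda^{M}_{m}$ is identified with a multiset of $m$ basis vectors, which gives the count $\binom{M+m-1}{m}$, and verifies the bound $m\log(2eM/m)$ over a small grid with $1\leq m\leq M$. The function names and the grid range are choices made for this example.

```python
import math
from itertools import combinations_with_replacement


def cardinality(M, m):
    """|Lambda_m^M|: each element is determined by a multiset of m basis vectors."""
    return math.comb(M + m - 1, m)


def cardinality_by_enumeration(M, m):
    """Brute-force count of multisets of size m drawn from {e_1, ..., e_M}."""
    return sum(1 for _ in combinations_with_replacement(range(M), m))


def log_cardinality_bound(M, m):
    """Right-hand side m * log(2 e M / m), used for 1 <= m <= M."""
    return m * math.log(2.0 * math.e * M / m)


# Check the bound on a small grid of (M, m) with 1 <= m <= M.
bound_holds = all(
    math.log(cardinality(M, m)) <= log_cardinality_bound(M, m)
    for M in range(2, 41)
    for m in range(1, M + 1)
)
print(cardinality(4, 3), cardinality_by_enumeration(4, 3), bound_holds)
```

A finite grid check is of course not a proof; it only guards against transcription errors in the constants.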

Proof of (7).

If $s^{2}<1/M$ then by (6) we have $\ell(T\cap sB_{2})\leq 1$ , hence (7) holds. Thus it is enough to focus on the case $s^{2}\geq 1/M$ .

Let $r=\min(s,1)$ and set $m=\lfloor 1/r^{2}\rfloor$ , which satisfies $1\leq m\leq M$ . As $T$ is the convex hull of $M$ points, let ${\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}\in\mathbf{R}^{n}$ be such that

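$$T=\text{convex hull of }\{{\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}\}=\{{\boldsymbol{\mu}}_{\boldsymbol{\theta}},\ {\boldsymbol{\theta}}\in\Lambda^{M}\},\tag{28}$$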

where ${\boldsymbol{\mu}}_{\boldsymbol{\theta}}=\sum_{j=1}^{M}\theta_{j}{\boldsymbol{\mu}}_{j}$ for ${\boldsymbol{\theta}}\in\Lambda^{M}$ .

Let $Q({\boldsymbol{\theta}})=|{\boldsymbol{\mu}}_{\boldsymbol{\theta}}|_{2}^{2}$ for all ${\boldsymbol{\theta}}\in\mathbf{R}^{M}$ . This is a polynomial of order $2$ , of the form $Q({\boldsymbol{\theta}})={\boldsymbol{\theta}}^{T}\Sigma{\boldsymbol{\theta}}$ , where $\Sigma$ is the Gram matrix with $\Sigma_{jk}={\boldsymbol{\mu}}_{k}^{T}{\boldsymbol{\mu}}_{j}$ for all $j,k=1,...,M$ . As we assume that $T\subset B_{2}$ , the diagonal elements of $\Sigma$ satisfy $\Sigma_{jj}\leq 1$ . For all ${\boldsymbol{\theta}}\in\mathbf{R}^{M}$ , let $F({\boldsymbol{\theta}})={\boldsymbol{g}}^{T}{\boldsymbol{\mu}}_{\boldsymbol{\theta}}$ . Applying Lemma 3 with the above notation, $R=1$ , $m=\lfloor 1/r^{2}\rfloor$ and $t=r$ , we obtain

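$$\mathbb{E}\sup_{{\boldsymbol{\theta}}\in\Lambda^{M}:\,Q({\boldsymbol{\theta}})\leq r^{2}}{\boldsymbol{g}}^{T}{\boldsymbol{\mu}}_{\boldsymbol{\theta}}\leq\mathbb{E}\int_{1}^{+\infty}\Big[\max_{{\boldsymbol{\theta}}\in\Lambda^{M}_{m}:\,Q({\boldsymbol{\theta}})\leq x(r^{2}+1/m)}F({\boldsymbol{\theta}})\Big]\frac{dx}{x^{2}}.\tag{29}$$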

By definition of $m$, $r^{2}\leq 1/m$ so that $x(r^{2}+1/m)\leq 2x/m$. Using Fubini's theorem and a bound on the expectation of the maximum of $|\Lambda^{M}_{m}|$ centered Gaussian random variables with variances bounded from above by $2x/m$, we obtain that the right hand side of the previous display is bounded from above by

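$$\int_{1}^{+\infty}\frac{1}{x^{2}}\sqrt{\frac{4x\log|\Lambda^{M}_{m}|}{m}}\,dx\leq\sqrt{\log(2eM/m)}\int_{1}^{+\infty}\frac{2}{x^{3/2}}dx,\tag{30}$$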

where we used the bound (27). To complete the proof of (7), notice that we have $1/m\leq 2r^{2}$ and $\int_{1}^{+\infty}\frac{2}{x^{3/2}}dx=4$ .

∎
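For completeness, the final arithmetic of this proof can be written out as a sketch. It uses $m=\lfloor 1/r^{2}\rfloor\geq 1/(2r^{2})$, which holds because $1/r^{2}\geq 1$, together with $r^{2}=s^{2}\wedge 1$ and $\log\leq\log_{+}$.

```latex
\begin{align*}
\ell(T\cap sB_{2})
  &\le \sqrt{\log(2eM/m)}\int_{1}^{+\infty}\frac{2}{x^{3/2}}\,dx
   = 4\sqrt{\log(2eM/m)} \\
  &\le 4\sqrt{\log(4eMr^{2})}
   = 4\sqrt{\log(4eM(s^{2}\wedge 1))}
   \le 4\sqrt{\log_{+}(4eM(s^{2}\wedge 1))}.
\end{align*}
```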

3 Statistical applications in fixed-design regression

Numerous works have established a close relationship between localized Gaussian widths and the performance of statistical and compressed sensing procedures. Some of these works are reviewed below.

• In a regression problem with random design where the design and the target are subgaussian, Lecué and Mendelson [21] established that two quantities govern the performance of the empirical risk minimizer over a convex class $\mathcal{F}$. These two quantities are defined using the Gaussian width of the class $\mathcal{F}$ intersected with an $L_{2}$ ball [21, Definition 1.3].

• If $p,p^{\prime}>1$ are such that $p^{\prime}\leq p\leq+\infty$ and $\log(2n)/\log(2en)\leq p^{\prime}$, Gordon et al. [14] provide precise estimates of $\ell(B_{p}\cap sB_{p^{\prime}})$ where $B_{p}\subset\mathbf{R}^{n}$ is the unit $L_{p}$ ball and $sB_{p^{\prime}}$ is the $L_{p^{\prime}}$ ball of radius $s>0$. These estimates are then used to solve the approximate reconstruction problem where one wants to recover an unknown high dimensional vector from a few random measurements [14, Section 7].

• Plan et al. [28] show that in the semiparametric single index model, if the signal is known to belong to some star-shaped set $T\subset\mathbf{R}^{n}$, then the Gaussian width of $T$ and its localized version characterize the gain obtained by using the additional information that the signal belongs to $T$, cf. Theorem 1.3 in [28].

• Finally, Chatterjee [9] exhibits a connection between localized Gaussian widths and shape-constrained estimation.

These results are reminiscent of the isomorphic method [17, 3, 2], where localized expected suprema of empirical processes are used to obtain upper bounds on the performance of Empirical Risk Minimization (ERM) procedures. These results show that Gaussian width estimates are important to understand the statistical properties of estimators in many statistical contexts.

In Proposition 1, we established an upper bound on the Gaussian width of $M$ -convex hulls. We now provide some statistical applications of this result in regression with fixed-design. We will use the following Theorem from [7].

Theorem 4 ([7]).

Let $K$ be a closed convex subset of $\mathbf{R}^{n}$ and ${\boldsymbol{\xi}}\sim\mathcal{N}(0,\sigma^{2}I_{n\times n})$ . Let ${\boldsymbol{f}}_{0}\in\mathbf{R}^{n}$ be an unknown vector and let $\mathbf{y}={\boldsymbol{f}}_{0}+{\boldsymbol{\xi}}$ . Denote by ${\boldsymbol{f}}_{0}^{*}$ the projection of ${\boldsymbol{f}}_{0}$ onto $K$ . Assume that for some $t_{*}>0$ ,

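$$\frac{1}{n}\mathbb{E}\Big[\sup_{{\boldsymbol{u}}\in K:\,\frac{1}{n}|{\boldsymbol{f}}_{0}^{*}-{\boldsymbol{u}}|_{2}^{2}\leq t_{*}^{2}}{\boldsymbol{\xi}}^{T}({\boldsymbol{u}}-{\boldsymbol{f}}_{0}^{*})\Big]\leq\frac{t_{*}^{2}}{2}.\tag{31}$$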

Then for any $x>0$ , with probability greater than $1-e^{-x}$ , the Least Squares estimator ${\boldsymbol{\hat{f}}}=\operatorname*{argmin}_{{\boldsymbol{f}}\in K}|\mathbf{y}-{\boldsymbol{f}}|_{2}^{2}$ satisfies

[TABLE]

Hence, to prove an oracle inequality of the form (32), it is enough to prove the existence of a quantity $t_{*}$ such that (31) holds. If the convex set $K$ in the above theorem is the convex hull of $M$ points, then a quantity $t_{*}$ is given by the following proposition.

Proposition 5.

Let $\sigma^{2}>0,R>0,n\geq 1$ and $M\geq 2$ . Let ${\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}\in\mathbf{R}^{n}$ such that $\frac{1}{n}|{\boldsymbol{\mu}}_{j}|_{2}^{2}\leq R^{2}$ for all $j=1,...,M$ . For all ${\boldsymbol{\theta}}\in\Lambda^{M}$ , let ${\boldsymbol{\mu}}_{\boldsymbol{\theta}}=\sum_{j=1,...,M}\theta_{j}{\boldsymbol{\mu}}_{j}$ . Let ${\boldsymbol{g}}$ be a centered Gaussian random variable with covariance matrix $\sigma^{2}I_{n\times n}$ . If $R\sqrt{n}\leq M\sigma$ then the quantity

[TABLE]

provided that $t_{*}\leq R$ .

Proof.

Inequality

[TABLE]

is a reformulation of Proposition 1 using the notation of Proposition 5. Thus, in order to prove (33), it is enough to establish that for $\gamma=31$ we have

[TABLE]

As $1\leq\log(eM\sigma/(R\sqrt{n}))$ and $\log t\leq t$ for all $t>0$ , the left hand side of the previous display satisfies

[TABLE]

Thus (35) holds if $64(3/2+\log(4\gamma))\leq\gamma^{2}/4$ , which is the case if the absolute constant is $\gamma=31$ . ∎

Inequality (33) establishes the existence of a quantity $t_{*}$ such that

[TABLE]

where $T$ is the convex hull of ${\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}$ . Consequences of (38) and Theorem 4 are given in the next subsections.

We now introduce two statistical frameworks where the localized Gaussian width of an $M$ -convex hull has applications: the Lasso estimator in high-dimensional statistics and the convex aggregation problem.

3.1 Convex aggregation

Let ${\boldsymbol{f}}_{0}\in\mathbf{R}^{n}$ be an unknown regression vector and let $\mathbf{y}={\boldsymbol{f}}_{0}+{\boldsymbol{\xi}}$ be an observed random vector, where ${\boldsymbol{\xi}}$ satisfies $\mathbb{E}[{\boldsymbol{\xi}}]={\boldsymbol{0}}$ . Let $M\geq 2$ and let ${\boldsymbol{f}}_{1},...,{\boldsymbol{f}}_{M}$ be deterministic vectors in $\mathbf{R}^{n}$ . The set $\{{\boldsymbol{f}}_{1},...,{\boldsymbol{f}}_{M}\}$ will be referred to as the dictionary. For any ${\boldsymbol{\theta}}=(\theta_{1},...,\theta_{M})^{T}\in\mathbf{R}^{M}$ , let ${\boldsymbol{f}}_{\boldsymbol{\theta}}=\sum_{j=1}^{M}\theta_{j}{\boldsymbol{f}}_{j}$ . If a set $\Theta\subset\mathbf{R}^{M}$ is given, the goal of the aggregation problem induced by $\Theta$ is to find an estimator ${\boldsymbol{\hat{f}}}$ constructed with $\mathbf{y}$ and the dictionary such that

[TABLE]

either in expectation or with high probability, where $\delta_{n,M,\Theta}$ is a small quantity. Inequality (39) is called a sharp oracle inequality, where "sharp" means that in the right hand side of (39), the multiplicative constant of the term $\inf_{{\boldsymbol{\theta}}\in\Theta}\frac{1}{n}|{\boldsymbol{f}}_{\boldsymbol{\theta}}-{\boldsymbol{f}}_{0}|_{2}^{2}$ is $1$ . Similar notations will be defined for regression with random design and density estimation. Define the simplex in $\mathbf{R}^{M}$ by (10). The following aggregation problems were introduced in [26, 34].

• Model Selection type aggregation with $\Theta=\{{\boldsymbol{e}}_{1},...,{\boldsymbol{e}}_{M}\}$, i.e., $\Theta$ is the canonical basis of $\mathbf{R}^{M}$. The goal is to construct an estimator whose risk is as close as possible to the best function in the dictionary. Such results can be found in [34, 22, 1] for random design regression, in [23, 10, 5, 11] for fixed design regression, and in [16, 6] for density estimation.

• Convex aggregation with $\Theta=\Lambda^{M}$, i.e., $\Theta$ is the simplex in $\mathbf{R}^{M}$. The goal is to construct an estimator whose risk is as close as possible to the best convex combination of the dictionary functions. See [34, 20, 19, 33] for results of this type in the regression framework and [29] for such results in density estimation.

• Linear aggregation with $\Theta=\mathbf{R}^{M}$. The goal is to construct an estimator whose risk is as close as possible to the best linear combination of the dictionary functions, cf. [34, 33] for such results in regression and [29] for such results in density estimation.

One may also define the Sparse or Sparse Convex aggregation problems: construct an estimator whose risk is as close as possible to the best sparse combination of the dictionary functions. Such results can be found in [31, 30, 33] for fixed design regression and in [24] for regression with random design. These problems are out of the scope of the present paper.

A goal of the present paper is to provide a unified argument that shows that empirical risk minimization is optimal for the convex aggregation problem in density estimation, regression with fixed design and regression with random design.

Theorem 6.

Let ${\boldsymbol{f}}_{0}\in\mathbf{R}^{n}$ , let ${\boldsymbol{\xi}}\sim\mathcal{N}(0,\sigma^{2}I_{n\times n})$ and define $\mathbf{y}={\boldsymbol{f}}_{0}+{\boldsymbol{\xi}}$ . Let ${\boldsymbol{f}}_{1},...,{\boldsymbol{f}}_{M}\in\mathbf{R}^{n}$ and let ${\boldsymbol{f}}_{\boldsymbol{\theta}}=\sum_{j=1}^{M}\theta_{j}{\boldsymbol{f}}_{j}$ for all ${\boldsymbol{\theta}}=(\theta_{1},...,\theta_{M})^{T}\in\mathbf{R}^{M}$ . Let

[TABLE]

Then for all $x>0$ , with probability greater than $1-\exp(-x)$ ,

[TABLE]

where $t_{*}^{2}=\min\left(\frac{4\sigma^{2}M}{n},\frac{31\sigma R\sqrt{\log(eM\sigma/(R\sqrt{n}))}}{\sqrt{n}}\right)$ and $R^{2}=\frac{1}{4}\max_{j=1,...,M}\frac{1}{n}|{\boldsymbol{f}}_{j}|_{2}^{2}$ .

Proof of Theorem 6.

Let $V$ be the linear span of ${\boldsymbol{f}}_{1},...,{\boldsymbol{f}}_{M}$ and let $P\in\mathbf{R}^{n\times n}$ be the orthogonal projector onto $V$ . If $t_{*}^{2}=4\sigma^{2}M/n$ , then

[TABLE]

Let $K$ be the convex hull of ${\boldsymbol{f}}_{1},...,{\boldsymbol{f}}_{M}$ . Let ${\boldsymbol{f}}_{0}^{*}$ be the convex projection of ${\boldsymbol{f}}_{0}$ onto $K$ . We apply Proposition 5 to $K-{\boldsymbol{f}}_{0}^{*}$ which is a convex hull of $M$ points, and for all ${\boldsymbol{v}}\in K$ , $\frac{1}{n}|{\boldsymbol{v}}|_{2}^{2}\leq R^{2}$ . By (47) and (33), the quantity $t_{*}$ satisfies (31). Applying Theorem 4 completes the proof. ∎

3.2 Lasso

We consider the following regression model. Let ${\boldsymbol{x}}_{1},...,{\boldsymbol{x}}_{M}\in\mathbf{R}^{n}$ and assume that $\frac{1}{n}|{\boldsymbol{x}}_{j}|_{2}^{2}\leq 1$ for all $j=1,...,M$. We will refer to ${\boldsymbol{x}}_{1},...,{\boldsymbol{x}}_{M}$ as the covariates. Let $\mathbf{X}$ be the matrix of dimension $n\times M$ with columns ${\boldsymbol{x}}_{1},...,{\boldsymbol{x}}_{M}$. We observe

[TABLE]

where ${\boldsymbol{f}}_{0}\in\mathbf{R}^{n}$ is an unknown mean. The goal is to estimate ${\boldsymbol{f}}_{0}$ using the design matrix $\mathbf{X}$ .

Let $R>0$ be a tuning parameter and define the constrained Lasso estimator [32] by

[TABLE]

Our goal will be to study the performance of the estimator (44) with respect to the prediction loss

[TABLE]

Theorem 7.

Let $R>0$ be a tuning parameter and consider the regression model (43). Define the Lasso estimator ${\boldsymbol{\hat{\beta}}}$ by (44). Then for all $x>0$ , with probability greater than $1-\exp(-x)$ ,

[TABLE]

where $t_{*}^{2}=\min\left(\frac{4\sigma^{2}\operatorname*{rank}(\mathbf{X})}{n},\frac{62\sigma R\sqrt{\log(2eM\sigma/(R\sqrt{n}))}}{\sqrt{n}}\right)$ .

Proof of Theorem 7.

Let $V$ be the linear span of ${\boldsymbol{x}}_{1},...,{\boldsymbol{x}}_{M}$ and let $P\in\mathbf{R}^{n\times n}$ be the orthogonal projector onto $V$ . If $t_{*}^{2}=4\sigma^{2}\operatorname*{rank}(\mathbf{X})/n$ , then

[TABLE]

Let $K$ be the convex hull of $\{\pm R{\boldsymbol{x}}_{1},...,\pm R{\boldsymbol{x}}_{M}\}$, so that $K=\{\mathbf{X}{\boldsymbol{\beta}}:{\boldsymbol{\beta}}\in\mathbf{R}^{M}:|{\boldsymbol{\beta}}|_{1}\leq R\}$. Let ${\boldsymbol{f}}_{0}^{*}$ be the convex projection of ${\boldsymbol{f}}_{0}$ onto $K$. We apply Proposition 5 to $K-{\boldsymbol{f}}_{0}^{*}$ which is a convex hull of $2M$ points of empirical norm less than or equal to $R^{2}$. By (47) and (33), the quantity $t_{*}$ satisfies (31). Applying Theorem 4 completes the proof. ∎
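The identification of the constraint set $\{\mathbf{X}{\boldsymbol{\beta}}:|{\boldsymbol{\beta}}|_{1}\leq R\}$ with the convex hull of $\{\pm R{\boldsymbol{x}}_{1},...,\pm R{\boldsymbol{x}}_{M}\}$ can be made concrete. The sketch below builds explicit convex weights for a given ${\boldsymbol{\beta}}$: mass $|\beta_{j}|/R$ goes to the vertex with the sign of $\beta_{j}$, and the leftover mass is split evenly between $+R{\boldsymbol{x}}_{1}$ and $-R{\boldsymbol{x}}_{1}$, which cancel. The function names, the small design and the value of `radius` are hypothetical choices for illustration.

```python
from fractions import Fraction


def hull_weights(beta, radius):
    """Weights on (+R x_j) and (-R x_j) representing X beta, for |beta|_1 <= R."""
    radius = Fraction(radius)
    beta = [Fraction(b) for b in beta]
    l1_norm = sum(abs(b) for b in beta)
    if l1_norm > radius:
        raise ValueError("beta lies outside the l1 ball of the given radius")
    plus = [max(b, Fraction(0)) / radius for b in beta]
    minus = [max(-b, Fraction(0)) / radius for b in beta]
    # Leftover mass is split between +R x_1 and -R x_1, which cancel each other.
    slack = 1 - l1_norm / radius
    plus[0] += slack / 2
    minus[0] += slack / 2
    return plus, minus


def combine(columns, plus, minus, radius):
    """Evaluate sum_j (plus_j - minus_j) * R * x_j coordinate by coordinate."""
    n = len(columns[0])
    return [sum((p - q) * radius * col[i] for col, p, q in zip(columns, plus, minus))
            for i in range(n)]


# Hypothetical design with n = 2 rows and M = 3 columns x_1, x_2, x_3.
columns = [[1, 0], [0, 1], [1, 1]]
beta = [Fraction(1, 2), Fraction(-1, 4), Fraction(0)]
radius = 2
plus, minus = hull_weights(beta, radius)
x_beta = [sum(b * col[i] for col, b in zip(columns, beta)) for i in range(2)]
print(plus, minus, combine(columns, plus, minus, radius), x_beta)
```

Rational arithmetic is used so that the equality between the convex combination and $\mathbf{X}{\boldsymbol{\beta}}$ can be checked exactly.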

The lower bound [30, Theorem 5.4 and (5.25)] states that there exists an absolute constant $C_{0}>0$ such that the following holds. If $\log(1+eM/\sqrt{n})\leq C_{0}\sqrt{n}$, then there exists a design matrix $\mathbf{X}$ such that for every estimator ${\boldsymbol{\hat{f}}}$,

[TABLE]

where for all ${\boldsymbol{f}}_{0}\in\mathbf{R}^{n}$ , $\mathbb{E}_{{\boldsymbol{f}}_{0}}$ denotes the expectation with respect to the distribution of $\mathbf{y}\sim\mathcal{N}({\boldsymbol{f}}_{0},\sigma^{2}I_{n\times n})$ . Thus, Theorem 7 shows that the Least Squares estimator over the set $\{\mathbf{X}{\boldsymbol{\beta}},{\boldsymbol{\beta}}\in\mathbf{R}^{M}:|{\boldsymbol{\beta}}|_{1}\leq R\}$ is minimax optimal. In particular, the right hand side of inequality (46) cannot be improved.

4 The anisotropic persistence problem in regression with random design

Consider $n$ iid observations $(Y_{i},X_{i})_{i=1,...,n}$ where $(Y_{i})_{i=1,...,n}$ are real valued and the $(X_{i})_{i=1,...,n}$ are design random variables in $\mathbf{R}^{M}$ with $\mathbb{E}[X_{i}X_{i}^{T}]=\Sigma$ for some covariance matrix $\Sigma\in\mathbf{R}^{M\times M}$. We consider the learning problem over the function class

[TABLE]

for a given constant $R>0$ . We consider the Empirical Risk Minimizer defined by

[TABLE]

This problem is sometimes referred to as the persistence problem or the persistence framework [15, 4]. The prediction risk of $f_{\boldsymbol{\hat{\beta}}}$ is given by

[TABLE]

where $(X,Y)$ is a new observation distributed as $(X_{1},Y_{1})$ and independent from the data $(X_{i},Y_{i})_{i=1,...,n}$ . Define also the oracle ${\boldsymbol{\beta}}^{*}$ by

[TABLE]

and define $\sigma>0$ by

[TABLE]

where the subgaussian norm $\|\cdot\|_{\psi_{2}}$ is defined by $\|Z\|_{\psi_{2}}=\sup_{p\geq 1}\mathbb{E}[|Z|^{p}]^{1/p}/\sqrt{p}$ for any random variable $Z$ (see Section 5.2.3 in [35] for equivalent definitions of the $\psi_{2}$ norm).
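To make the definition of $\|\cdot\|_{\psi_{2}}$ concrete, the following sketch evaluates $\sup_{p\geq 1}\mathbb{E}[|Z|^{p}]^{1/p}/\sqrt{p}$ for two laws whose absolute moments are known in closed form. It is an illustration only: the helper names are hypothetical, and the supremum is taken over a finite grid of exponents, which in general only gives a lower bound on the true supremum.

```python
import math

def abs_moment_std_normal(p):
    """E|Z|^p for Z ~ N(0,1), equal to 2^(p/2) Gamma((p+1)/2) / sqrt(pi)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)

def psi2_norm_on_grid(abs_moment, p_grid):
    """Maximum over the grid of E[|Z|^p]^(1/p) / sqrt(p)."""
    return max(abs_moment(p) ** (1.0 / p) / math.sqrt(p) for p in p_grid)

# Assumption of this sketch: exponents p = 1, 1.25, ..., 40 stand in for p >= 1.
p_grid = [1.0 + 0.25 * k for k in range(157)]

psi2_gaussian = psi2_norm_on_grid(abs_moment_std_normal, p_grid)
# A Rademacher variable satisfies E|Z|^p = 1 for every p.
psi2_rademacher = psi2_norm_on_grid(lambda p: 1.0, p_grid)
```

On this grid both maxima are attained at $p=1$, giving $\sqrt{2/\pi}$ for the standard normal law and $1$ for the Rademacher law.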

To analyse the above learning problem, we use the machinery developed by Lecué and Mendelson [21] to study learning problems over subgaussian classes. Consider the two quantities

[TABLE]

where $G\sim N({\boldsymbol{0}},\Sigma)$ . In the present setting, Theorem A from Lecué and Mendelson [21] reads as follows.

Theorem 8 (Theorem A in Lecué and Mendelson [21]).

There exist absolute constants $c_{1},c_{2},c_{4}>0$ such that the following holds. Let $R>0$ . Consider iid observations $(X_{i},Y_{i})$ with $\mathbb{E}[X_{i}X_{i}^{T}]=\Sigma$ . Assume that the design random vectors $X_{i}$ are subgaussian with respect to the covariance matrix $\Sigma$ in the sense that $\|X_{i}^{T}\tau\|_{\psi_{2}}\leq 10|\Sigma^{1/2}\tau|_{2}$ for any $\tau\in\mathbf{R}^{M}$ . Define ${\boldsymbol{\beta}}^{*}$ by (52) and $\sigma$ by (53). Assume that the diagonal elements of $\Sigma$ are no larger than 1. Then, there exist absolute constants $c_{0},c_{1},c_{2},c_{3}>0$ such that the estimator ${\boldsymbol{\hat{\beta}}}$ defined in (50) satisfies

[TABLE]

with probability at least $1-6\exp(-c_{4}n\min(c_{2},s_{n}(c_{1})))$ .

In the isotropic case ( $\Sigma=I_{M}$ ), [25] proves that

[TABLE]

for some constants $c_{3},c_{4}>0$ that only depend on $\gamma$ , while

[TABLE]

for some constants $c_{5},c_{6}>0$ that only depend on $\gamma$ .

Using Proposition 1 and Equation 14 above lets us extend these bounds to the anisotropic case where $\Sigma$ is not proportional to the identity matrix.

Proposition 9.

Let $R>0$ , let $G\sim N({\boldsymbol{0}},\Sigma)$ and assume that the diagonal elements of $\Sigma$ are no larger than 1. For any $\gamma>0$ , define $r_{n}(\gamma)$ and $s_{n}(\gamma)$ by (55) and (54). Then for any $\gamma>0$ , there exist constants $c_{3},c_{4},c_{5},c_{6}>0$ that depend only on $\gamma$ such that (58) and (57) hold.

The proof of Proposition 9 will be given at the end of this subsection. The primary improvement of Proposition 1 over previous results is that this result is agnostic to the underlying covariance structure. This lets us handle the anisotropic case with $\Sigma\neq I_{M}$ in the above proposition.

Proposition 9 combined with Theorem 8 lets us obtain the minimax rate of estimation for the persistence problem in the anisotropic case. Although the minimax rate was previously obtained in the isotropic case, we are not aware of a previous result that yields this rate for general covariance matrices $\Sigma\neq I_{M}$ .

Proof of Proposition 9.

In this proof, $c>0$ is an absolute constant whose value may change from line to line. Let $\gamma>0$ . We first bound $r_{n}(\gamma)$ from above. Let $r>0$ and define

[TABLE]

The random variable $X\sim N({\boldsymbol{0}},\Sigma)$ has the same distribution as $\Sigma^{1/2}{\boldsymbol{g}}$ where ${\boldsymbol{g}}\sim N({\boldsymbol{0}},I_{M})$ . Thus, the expectation inside the infimum in (54) is equal to
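The change of variables above can be checked by simulation. The sketch below is only an illustration under stated assumptions: it uses a Cholesky factor $A$ with $AA^{T}=\Sigma$ in place of the symmetric square root $\Sigma^{1/2}$ (both give $A{\boldsymbol{g}}\sim N({\boldsymbol{0}},\Sigma)$), it assumes $\Sigma$ is positive definite, and the $2\times 2$ matrix and all function names are hypothetical.

```python
import math
import random

def cholesky(Sigma):
    """Lower-triangular A with A A^T = Sigma (Sigma symmetric positive definite)."""
    M = len(Sigma)
    A = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(Sigma[i][i] - s)
            else:
                A[i][j] = (Sigma[i][j] - s) / A[j][j]
    return A

def gram(A):
    """Return the matrix A A^T."""
    M = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(M)) for j in range(M)]
            for i in range(M)]

def sample_gaussian(A, rng):
    """Draw A g with g ~ N(0, I_M), which is distributed as N(0, A A^T)."""
    M = len(A)
    g = [rng.gauss(0.0, 1.0) for _ in range(M)]
    return [sum(A[i][k] * g[k] for k in range(M)) for i in range(M)]

# Hypothetical covariance with diagonal entries equal to 1.
Sigma = [[1.0, 0.5], [0.5, 1.0]]
A = cholesky(Sigma)
rng = random.Random(0)
draws = [sample_gaussian(A, rng) for _ in range(20000)]
empirical_cov_01 = sum(x[0] * x[1] for x in draws) / len(draws)
```

The product `gram(A)` reproduces `Sigma` up to rounding, and the empirical cross moment of the draws is close to the off-diagonal entry $0.5$.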

[TABLE]

To bound $r_{n}(\gamma)$ from above, it is enough to find some $r>0$ such that (60) is bounded from above by $\gamma r\sqrt{n}$ .

By the Cauchy-Schwarz inequality, the right hand side is bounded from above by $r\sqrt{M}$ , which is smaller than $\gamma r\sqrt{n}$ for all small enough $s>0$ provided that $n>c_{4}M$ for some constant $c_{4}$ that only depends on $\gamma$ .

We now bound $r_{n}(\gamma)$ from above in the regime $n\leq c_{4}M$ . Let ${\boldsymbol{u}}_{1},...,{\boldsymbol{u}}_{M}$ be the columns of $\Sigma$ and let $\tilde{T}$ be the convex hull of the $2M$ points $\{\pm{\boldsymbol{u}}_{1},...,\pm{\boldsymbol{u}}_{M}\}$ . Using the fact that $T_{r}(R)=2RT_{r/(2R)}(1)\subset 2R(\tilde{T}\cap(r/(2R))B_{2})$ , the right hand side of the previous display is bounded from above by

[TABLE]

where we used Proposition 1 for the last inequality. By simple algebra, one can show that if $r=c_{3}(\gamma)\frac{R}{\sqrt{n}}\sqrt{\log(c_{3}(\gamma)M/n)}$ for some large enough constant $c_{3}(\gamma)$ that only depends on $\gamma$ , then the right hand side of (61) is bounded from above by $\gamma r\sqrt{n}$ .

We now bound $s_{n}(\gamma)$ from above. Let $s>0$ . By definition of $s_{n}(\gamma)$ , to prove that $s_{n}(\gamma)\leq s$ , it is enough to show that

[TABLE]

is smaller than $\gamma s^{2}\sqrt{n}$ . We use Proposition 1 to show that the right hand side of the previous display is bounded from above by

[TABLE]

By simple algebra very similar to that of the proof of Proposition 5, we obtain that if $s^{2}$ equals the right hand side of (58) for large enough $c_{5}=c_{5}(\gamma)$ and $c_{6}=c_{6}(\gamma)$ , then the right hand side of the previous display is bounded from above by $\gamma s^{2}\sqrt{n}$ . This completes the proof of (58). ∎

5 Bounded empirical processes and density estimation

We now prove a result similar to Proposition 1 for bounded empirical processes indexed by the convex hull of $M$ points. This will be useful to study the convex aggregation problem for density estimation. Throughout the paper, $\varepsilon_{1},...,\varepsilon_{n}$ are i.i.d. Rademacher random variables that are independent of all other random variables.

Proposition 10.

There exists an absolute constant $c>0$ such that the following holds. Let $M\geq 2,n\geq 1$ be integers and let $b,R,L>0$ be real numbers. Let $Q({\boldsymbol{\theta}})={\boldsymbol{\theta}}^{T}\Sigma{\boldsymbol{\theta}}$ for some positive semidefinite matrix $\Sigma$ . Let $Z_{1},...,Z_{n}$ be i.i.d. random variables valued in some measurable set $\mathcal{Z}$ . Let $h_{1},...,h_{M}:\mathcal{Z}\rightarrow\mathbf{R}$ be measurable functions. Let $h_{\boldsymbol{\theta}}=\sum_{j=1}^{M}\theta_{j}h_{j}$ for all ${\boldsymbol{\theta}}=(\theta_{1},...,\theta_{M})^{T}\in\mathbf{R}^{M}$ . Assume that almost surely

[TABLE]

for all $j=1,...,M$ and all ${\boldsymbol{\theta}}\in\Lambda^{M}$ . Then for all $r>0$ such that $R/\sqrt{M}\leq r\leq R$ we have

[TABLE]

where $F({\boldsymbol{\theta}})=\frac{1}{n}|\sum_{i=1}^{n}\varepsilon_{i}h_{\boldsymbol{\theta}}(Z_{i})|$ for all ${\boldsymbol{\theta}}\in\mathbf{R}^{M}$ .
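The functional $F$ is the absolute value of a linear function of ${\boldsymbol{\theta}}$, that is, the maximum of two linear functions, which is the source of the convexity used in the proof. The sketch below evaluates $F$ for fixed signs and a fixed matrix of values $h_{j}(Z_{i})$ and checks midpoint convexity on one pair of points. The data and the function name are hypothetical and serve only as an illustration.

```python
def rademacher_functional(theta, H, eps):
    """F(theta) = (1/n) |sum_i eps_i h_theta(Z_i)|, where H[i][j] = h_j(Z_i)."""
    n = len(H)
    linear = sum(
        eps[i] * sum(t * h for t, h in zip(theta, H[i])) for i in range(n)
    ) / n
    # |linear| written as the maximum of two linear functions of theta.
    return max(linear, -linear)

# Hypothetical values h_j(Z_i) for n = 4 observations and M = 3 functions.
H = [[1.0, 0.0, 2.0],
     [0.5, -1.0, 0.0],
     [0.0, 3.0, 1.0],
     [2.0, 1.0, -1.0]]
eps = [1, -1, 1, -1]  # one fixed realization of the Rademacher signs

a = [1.0, 0.0, 0.0]
b = [0.0, 0.5, 0.5]
midpoint = [(x + y) / 2.0 for x, y in zip(a, b)]

F_a = rademacher_functional(a, H, eps)
F_b = rademacher_functional(b, H, eps)
F_mid = rademacher_functional(midpoint, H, eps)
```

Here `F_mid` is strictly smaller than the average of `F_a` and `F_b`, because the underlying linear function changes sign between the two points.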

Proof of Proposition 10.

Let $m=\lfloor R^{2}/r^{2}\rfloor\geq 1$ . The function $F$ is convex since it can be written as the maximum of two linear functions. Applying Lemma 3 with the above notation and $t=r$ yields

[TABLE]

where the second inequality is a consequence of Fubini’s Theorem and for all $x\geq 1$ ,

[TABLE]

Using (64) and the Rademacher complexity bound for finite classes given in [18, Theorem 3.5], we obtain that for all $x\geq 1$ ,

[TABLE]

where $c^{\prime}>0$ is a numerical constant and $|\Lambda^{M}_{m}|$ is the cardinal of the set $\Lambda^{M}_{m}$ . By definition of $m$ we have $r^{2}\leq R^{2}/m$ . The cardinal $|\Lambda^{M}_{m}|$ of the set $\Lambda^{M}_{m}$ is bounded from above by the right hand side of (27). Combining inequality (66), inequality (67), the fact that the integrals $\int_{1}^{+\infty}\frac{dx}{x^{2}}$ and $\int_{1}^{+\infty}\frac{dx}{x^{3/2}}$ are finite, we obtain

[TABLE]

for some absolute constant $c^{\prime\prime}>0$ . By definition of $m$ , we have $R^{2}/(2r^{2})\leq m\leq R^{2}/r^{2}$ . A monotonicity argument completes the proof. ∎
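Two elementary facts used in the proof above can be verified directly. The display below is a short worked check, writing $u=R^{2}/r^{2}\geq 1$ (a notation introduced here only for convenience): the two-sided bound on $m=\lfloor u\rfloor$, and the values of the two integrals that are stated to be finite.

```latex
% Bound on m = \lfloor u \rfloor with u = R^2/r^2 \ge 1:
m \le u, \qquad
m \ge 1 \ \text{and}\ m > u - 1
\ \Longrightarrow\ 2m \ge m + 1 > u,
\qquad\text{hence}\qquad
\frac{R^{2}}{2r^{2}} \le m \le \frac{R^{2}}{r^{2}}.

% The two integrals appearing in the proof:
\int_{1}^{+\infty}\frac{dx}{x^{2}}
  = \Big[-\frac{1}{x}\Big]_{1}^{+\infty} = 1,
\qquad
\int_{1}^{+\infty}\frac{dx}{x^{3/2}}
  = \Big[-\frac{2}{\sqrt{x}}\Big]_{1}^{+\infty} = 2.
```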

Next, we show that Proposition 10 can be used to derive a condition similar to (33) for bounded empirical processes. To bound from above the performance of ERM procedures in density estimation, Theorem 13 in the appendix requires the existence of a quantity $r_{*}>0$ such that

[TABLE]

where $F$ is the function defined in Proposition 10 above.

To obtain such a quantity $r_{*}>0$ under the assumptions of Proposition 10, we proceed as follows. Let $K=\max(b,\sqrt{L})$ and assume that

[TABLE]

Define $r^{2}=CKR\sqrt{\log(eMK/(R\sqrt{n}))}$ where $C\geq 1$ is a numerical constant that will be chosen later. We now bound from above the right hand side of (65). We have

[TABLE]

where for the last inequality we used that $\log\log(u)\leq\log u$ for all $u>1$ and that $\log(C)\leq\log(C)\log(eMK/(R\sqrt{n}))$ , since $C\geq 1$ and $MK/(R\sqrt{n})\geq 1$ . Thus, the right hand side of (65) is bounded from above by

[TABLE]

It is clear that the above quantity is bounded from above by $r^{2}/16$ if the numerical constant $C$ is large enough. Thus we have proved that as long as $MK>R\sqrt{n}$ , inequality (69) holds for

[TABLE]

where $C\geq 1$ is a numerical constant.

ERM and convex aggregation in density estimation

The minimax optimal rate for the convex aggregation problem is known to be of order

[TABLE]

for regression with fixed design [30] and regression with random design [34] if the integers $M$ and $n$ satisfy $eM\sigma\leq R\sqrt{n}\exp(\sqrt{n})$ or equivalently $\phi_{M}^{C}(n)\leq 1$ . The arguments for the convex aggregation lower bound from [34] can be readily applied to density estimation, showing that the rate $\phi_{M}^{C}(n)$ is a lower bound on the optimal rate of convex aggregation for density estimation.

We now use the results of the previous sections to show that ERM is optimal for the convex aggregation problem in regression with fixed design, regression with random design and density estimation.

Theorem 11.

There exists an absolute constant $c>0$ such that the following holds. Let $(\mathcal{Z},\mu)$ be a measurable space with measure $\mu$ . Let $p_{0}$ be an unknown density with respect to the measure $\mu$ . Let $Z_{1},...,Z_{n}$ be i.i.d. random variables valued in $\mathcal{Z}$ with density $p_{0}$ . Let $p_{1},...,p_{M}\in L_{2}(\mu)$ and let $p_{\boldsymbol{\theta}}=\sum_{j=1}^{M}\theta_{j}p_{j}$ for all ${\boldsymbol{\theta}}=(\theta_{1},...,\theta_{M})^{T}\in\mathbf{R}^{M}$ . Let

[TABLE]

Then for all $x>0$ , with probability greater than $1-\exp(-x)$ ,

[TABLE]

where $R^{2}=\frac{1}{4}\max_{j=1,...,M}\int p_{j}^{2}d\mu$ and $b_{\infty}=\max_{j=0,1,...,M}\|p_{j}\|_{L_{\infty}(\mu)}$ .

Proof.

It is a direct application of Theorem 13 in the appendix. If $M\sqrt{b_{\infty}}\leq R\sqrt{n}$ , a fixed point $t_{*}$ is given by Lemma 14. If $M\sqrt{b_{\infty}}>R\sqrt{n}$ , we use Proposition 10 with $Q({\boldsymbol{\theta}})=\int(p_{0}^{*}-p_{\boldsymbol{\theta}})^{2}$ , $L=b_{\infty}$ and $b=b_{\infty}$ . The bound (69) yields the existence of a fixed point $t_{*}$ in this regime. ∎
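The constants $R^{2}$ and $b_{\infty}$ of Theorem 11, and the case distinction made in the proof, can be illustrated on a toy example. The sketch below assumes a finite sample space with the counting measure as $\mu$, so that densities are probability vectors and integrals are sums. The densities, the function names and the returned labels are all hypothetical.

```python
import math

def l2_sq(p):
    """Integral of p^2 with respect to the counting measure."""
    return sum(v * v for v in p)

def aggregation_constants(p0, dictionary):
    """R^2 = (1/4) max_j int p_j^2 dmu over the dictionary, and
    b_inf = max of the sup norms of p_0, p_1, ..., p_M."""
    R_sq = 0.25 * max(l2_sq(p) for p in dictionary)
    b_inf = max(max(abs(v) for v in p) for p in [p0] + dictionary)
    return R_sq, b_inf

def fixed_point_source(p0, dictionary, n):
    """Which result the proof uses to obtain a fixed point t_*."""
    R_sq, b_inf = aggregation_constants(p0, dictionary)
    M = len(dictionary)
    if M * math.sqrt(b_inf) <= math.sqrt(R_sq) * math.sqrt(n):
        return "Lemma 14"
    return "Proposition 10"

# Hypothetical densities on a four-point space.
p0 = [0.4, 0.3, 0.2, 0.1]
dictionary = [[0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]

R_sq, b_inf = aggregation_constants(p0, dictionary)
large_n = fixed_point_source(p0, dictionary, n=100)
small_n = fixed_point_source(p0, dictionary, n=4)
```

In this example $R^{2}=0.125$ and $b_{\infty}=0.5$. With $n=100$ the condition $M\sqrt{b_{\infty}}\leq R\sqrt{n}$ holds and Lemma 14 applies, while with $n=4$ it fails and the proof relies on Proposition 10.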

Appendix A Proof of the lower bound (9)

Proof of Proposition 2.

By the Varshamov-Gilbert extraction lemma [13, Lemma 2.5], there exists a subset $\Omega$ of $\{0,1\}^{M}$ such that

[TABLE]

for any distinct ${\boldsymbol{\omega}},{\boldsymbol{\omega}}^{\prime}\in\Omega$ .

For each ${\boldsymbol{\omega}}\in\Omega$ , we define $s({\boldsymbol{\omega}})\in\{-1,0,1\}^{M}$ , a signed version of ${\boldsymbol{\omega}}$ , as follows. Let $\varepsilon_{1},...,\varepsilon_{M}$ be $M$ iid Rademacher random variables. Then we have

[TABLE]

Hence, there exists some $s({\boldsymbol{\omega}})\in\{-1,0,1\}^{M}$ with $|s({\boldsymbol{\omega}})_{j}|=\omega_{j}$ for all $j=1,...,M$ such that $|{\boldsymbol{\mu}}_{s({\boldsymbol{\omega}})}|_{2}^{2}\leq m$ .
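The existence of a good sign pattern rests on an averaging argument: if the mean of the squared norm over random signs is at most $m$, then at least one sign pattern achieves squared norm at most $m$. The sketch below checks this by brute force on a small example. It assumes, for illustration only, that ${\boldsymbol{\mu}}_{s}=\sum_{j}s_{j}{\boldsymbol{\mu}}_{j}$ and that the $m$ selected points have Euclidean norm at most 1; the vectors and helper names are hypothetical.

```python
from itertools import product

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)

def signed_sum(signs, vectors):
    """sum_j signs[j] * vectors[j], computed coordinate by coordinate."""
    dim = len(vectors[0])
    return [sum(s * v[k] for s, v in zip(signs, vectors)) for k in range(dim)]

# Hypothetical selection of m = 4 unit vectors in the plane.
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, -1.0], [-0.8, 0.6]]
m = len(vectors)

# Squared norms of the signed sums over all 2^m sign patterns.
all_sq_norms = [
    sq_norm(signed_sum(signs, vectors))
    for signs in product([-1, 1], repeat=m)
]

mean_sq_norm = sum(all_sq_norms) / len(all_sq_norms)
sum_of_sq_norms = sum(sq_norm(v) for v in vectors)
best_sq_norm = min(all_sq_norms)
```

Averaging over all sign patterns cancels the cross terms, so `mean_sq_norm` equals `sum_of_sq_norms`, which is at most $m$ here. The minimum over sign patterns is therefore also at most $m$.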

Define $T_{\Omega}=\{s^{2}{\boldsymbol{\mu}}_{s({\boldsymbol{\omega}})},{\boldsymbol{\omega}}\in\Omega\}$ . Since $s^{2}=1/m$ , each element of $T_{\Omega}$ is of the form $(1/m)(\pm{\boldsymbol{\mu}}_{j_{1}}\pm...\pm{\boldsymbol{\mu}}_{j_{m}})$ where ${\boldsymbol{\mu}}_{j_{1}},...,{\boldsymbol{\mu}}_{j_{m}}$ are $m$ distinct elements of $\{{\boldsymbol{\mu}}_{1},...,{\boldsymbol{\mu}}_{M}\}$ , hence by convexity of $T$ we have $T_{\Omega}\subset T$ . By definition of $s({\boldsymbol{\omega}})$ , it holds that $T_{\Omega}\subset sB_{2}$ , and thus $T_{\Omega}\subset T\cap sB_{2}$ . For any two distinct ${\boldsymbol{u}},{\boldsymbol{v}}\in T_{\Omega}$ ,

[TABLE]

where the supremum is taken over any two distinct elements of $\Omega$ . By Sudakov’s inequality (see for instance [8, Theorem 13.4]) we have

[TABLE]

Since $1/m=s^{2}$ , the right hand side of the previous display is equal to the right hand side of (9) and the proof is complete. ∎

Appendix B Local Rademacher complexities and density estimation

Over the last decade, a vast literature has emerged on local Rademacher complexities as a tool to study the performance of empirical risk minimizers (ERM) for general learning problems, cf. [3, 2, 17] and the references therein. The following result is given in [3, Theorem 2.1]. Let $\varepsilon_{1},...,\varepsilon_{n}$ be independent Rademacher random variables that are independent of all other random variables considered in the paper.

Theorem 12 (Bartlett et al. [3]).

Let $Z_{1},...,Z_{n}$ be i.i.d. random variables valued in some measurable space $\mathcal{Z}$ . Let $\mathcal{H}:\mathcal{Z}\rightarrow[-b_{\infty},b_{\infty}]$ be a class of measurable functions. Assume that there is some $v>0$ such that $\mathbb{E}[h(Z_{1})^{2}]\leq v$ for all $h\in\mathcal{H}$ . Then for all $x>0$ , with probability greater than $1-\exp(-x)$ ,

[TABLE]

Theorem 12 is a straightforward consequence of Talagrand's inequality. We now explain how Theorem 12 can be used to derive sharp oracle inequalities in density estimation.

Theorem 13.

Let $(\mathcal{Z},\mu)$ be a measurable space with measure $\mu$ . Let $p_{0}$ be an unknown density with respect to the measure $\mu$ . Let $Z_{1},...,Z_{n}$ be i.i.d. random variables valued in $\mathcal{Z}$ with density $p_{0}$ . Let $\mathcal{P}$ be a convex subset of $L^{2}(\mu)$ . Assume that there exists $p_{0}^{*}\in\mathcal{P}$ such that $\int(p_{0}-p_{0}^{*})^{2}d\mu=\inf_{p\in\mathcal{P}}\int(p_{0}-p)^{2}d\mu$ . Assume that for some $t_{*}>0$ ,

[TABLE]

Assume that there exists an estimator $\hat{p}$ such that almost surely,

[TABLE]

Then for all $x>0$ , with probability greater than $1-\exp(-x)$ ,

[TABLE]

where $b_{\infty}=\sup_{p\in\mathcal{P}}\|p\|_{L_{\infty}(\mu)}$ .

Proof of Theorem 13.

By optimality of $\hat{p}$ we have

[TABLE]

where for all $p\in\mathcal{P}$ , $\Xi_{p}$ is the random variable

[TABLE]

Let $\rho=\max\left(t_{*}^{2},4(\|p_{0}\|_{L_{\infty}(\mu)}+8b_{\infty}/3)x/n\right)$ and define

[TABLE]

The class $\mathcal{H}$ is convex, $0\in\mathcal{H}$ and $t_{*}^{2}\leq\rho$ so that $h\in\mathcal{H}$ implies $\frac{t_{*}^{2}}{\rho}h\in\mathcal{H}$ . For any linear form $L$ ,

[TABLE]

so that by taking expectations, (83) holds if $t_{*}^{2}$ is replaced by $\rho$ .

For any $h\in\mathcal{H}$ , $\mathbb{E}[h(Z_{1})^{2}]\leq\|p_{0}\|_{L_{\infty}(\mu)}\rho$ and $h$ is valued in $[-2b_{\infty},2b_{\infty}]$ $\mu$ -almost surely. We apply Theorem 12 to the class $\mathcal{H}$ . This yields that with probability greater than $1-e^{-x}$ , if $p\in\mathcal{P}$ is such that $p_{0}^{*}-p\in\mathcal{H}$ , then

[TABLE]

On the same event of probability greater than $1-e^{-x}$ , if $p\in\mathcal{P}$ is such that $\int(p_{0}^{*}-p)^{2}d\mu>\rho$ , consider $h=\sqrt{\rho}(p_{0}^{*}-p)/\sqrt{\int(p_{0}^{*}-p)^{2}d\mu}$ which belongs to $\mathcal{H}$ . We have $(P-P_{n})h\leq\rho$ , which can be rewritten

[TABLE]

so that $\Xi_{p}\leq\rho/2\leq\rho$ . In summary, we have proved that on an event of probability greater than $1-e^{-x}$ , $\sup_{p\in\mathcal{P}}\Xi_{p}\leq\rho$ . In particular, this holds for $p=\hat{p}$ which completes the proof. ∎

Appendix C A fixed point $t_{*}$ for finite dimensional classes

Lemma 14.

Consider the notations of Theorem 13 and assume that the linear span of $\mathcal{P}$ is finite dimensional of dimension $d$ . Then (83) is satisfied for $t_{*}^{2}=256\|p_{0}\|_{L_{\infty}(\mu)}d/n$ .

Proof.

Let $e_{1},...,e_{d}$ be an orthonormal basis of the linear span of $\mathcal{P}$ , for the scalar product $\langle p_{1},p_{2}\rangle=\int p_{1}p_{2}d\mu$ . Then

[TABLE]

where we have used the Cauchy-Schwarz inequality, Jensen's inequality, and that $\mathbb{E}e_{j}(X)^{2}\leq\|p_{0}\|_{L_{\infty}(\mu)}$ for all $j=1,...,d$ . ∎

Bibliography

[1] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. Ann. Statist., 35(2):608–633, 2007. doi:10.1214/009053606000001217. URL http://dx.doi.org/10.1214/009053606000001217.
[2] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006.
[3] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Ann. Statist., 33(4):1497–1537, 2005. doi:10.1214/009053605000000282. URL http://dx.doi.org/10.1214/009053605000000282.
[4] Peter L. Bartlett, Shahar Mendelson, and Joseph Neeman. L1-regularized linear regression: persistence and oracle inequalities. Probability Theory and Related Fields, 154(1-2):193–224, 2012.
[5] Pierre C. Bellec. Optimal bounds for aggregation of affine estimators. Annals of Statistics, to appear, 2017a. URL https://arxiv.org/pdf/1410.0346v4.pdf.
[6] Pierre C. Bellec. Optimal exponential bounds for aggregation of density estimators. Bernoulli, 23(1):219–248, 2017b. doi:10.3150/15-BEJ742. URL http://dx.doi.org/10.3150/15-BEJ742.
[7] Pierre C. Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. Annals of Statistics, to appear, 2017c. URL https://arxiv.org/pdf/1510.08029.pdf.
[8] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
