Sharp Oracle Inequalities for Low-complexity Priors

Tung Duy Luu; Jalal Fadili; Christophe Chesneau

arXiv:1702.03166·math.ST·October 4, 2017


TL;DR

This paper establishes sharp oracle inequalities for high-dimensional estimators like Lasso and nuclear norm penalties, demonstrating their theoretical performance guarantees under various data loss functions and priors.

Contribution

It provides a unified analysis of exponential weighted aggregation and penalized estimators with general priors, highlighting their performance and differences in high-dimensional settings.

Findings

1. Sharp oracle inequalities for Lasso, group Lasso, and nuclear norm penalties.
2. Theoretical guarantees for estimators under various data loss functions.
3. Efficient implementation via proximal splitting algorithms.


Equations

$$\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}\in\operatorname*{Argmin}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\Big\{V_{n}(\boldsymbol{\theta})\stackrel{\text{def}}{=}\tfrac{1}{n}F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})+\lambda_{n}J(\boldsymbol{\theta})\Big\},$$

$$\mu_{n}(\boldsymbol{\theta})=\frac{\exp\left(-V_{n}(\boldsymbol{\theta})/\beta\right)}{\int_{\Theta}\exp\left(-V_{n}(\boldsymbol{\omega})/\beta\right)d\boldsymbol{\omega}},$$

$$\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}=\int_{\mathbb{R}^{p}}\boldsymbol{\theta}\,\mu_{n}(\boldsymbol{\theta})\,d\boldsymbol{\theta}.$$

$$R_{n}\big(\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}},\boldsymbol{\theta}_{0}\big)\leq C\inf_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\big(R_{n}(\boldsymbol{\theta},\boldsymbol{\theta}_{0})+\Delta_{n,p,\lambda_{n},\beta}(\boldsymbol{\theta})\big),$$

$$\boldsymbol{\theta}_{T}=\operatorname{P}_{T}\boldsymbol{\theta}\quad\text{and}\quad\boldsymbol{X}_{T}=\boldsymbol{X}\operatorname{P}_{T}.$$

$$f(\boldsymbol{\theta})+f^{*}(\boldsymbol{z})\geq\langle\boldsymbol{z},\boldsymbol{\theta}\rangle,\quad\forall(\boldsymbol{\theta},\boldsymbol{z})\in\mathbb{R}^{p}\times\mathbb{R}^{p}.$$

$$\partial f(\boldsymbol{\theta})=\big\{\boldsymbol{\eta}\in\mathbb{R}^{p}\;:\;f(\boldsymbol{\theta}')\geq f(\boldsymbol{\theta})+\langle\boldsymbol{\eta},\boldsymbol{\theta}'-\boldsymbol{\theta}\rangle,\quad\forall\boldsymbol{\theta}'\in\operatorname{dom}(f)\big\}.$$

$$D_{f}^{\boldsymbol{\eta}}(\overline{\boldsymbol{\theta}},\boldsymbol{\theta})=f(\overline{\boldsymbol{\theta}})-f(\boldsymbol{\theta})-\langle\boldsymbol{\eta},\overline{\boldsymbol{\theta}}-\boldsymbol{\theta}\rangle.$$

$$F(\boldsymbol{v},\boldsymbol{y})\geq F(\boldsymbol{u},\boldsymbol{y})+\langle\nabla F(\boldsymbol{u},\boldsymbol{y}),\boldsymbol{v}-\boldsymbol{u}\rangle+\varphi\left(\|\boldsymbol{v}-\boldsymbol{u}\|_{2}\right),$$

$$\int_{\mathbb{R}^{p}}\exp\left(-\varphi\left(\|\boldsymbol{X}\boldsymbol{\theta}\|_{2}\right)/(n\beta)\right)\|\nabla F(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star},\boldsymbol{y})\|_{2}\,\|\boldsymbol{X}\boldsymbol{\theta}+(\boldsymbol{u}^{\star}-\boldsymbol{X}\overline{\boldsymbol{\theta}})\|_{2}\,d\boldsymbol{\theta}<+\infty,$$

$$\int_{\mathbb{R}^{p}}\exp\left(-\|\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{q}/(qn\beta)\right)\|\nabla F(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star},\boldsymbol{y})\|_{2}\,\|(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star})-\boldsymbol{X}\overline{\boldsymbol{\theta}}\|_{2}\,d\boldsymbol{\theta}<+\infty.$$

$$\int_{\mathbb{R}^{p}}\exp\left(-\|\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{q}/(qn\beta)\right)\|\boldsymbol{y}-(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star})\|_{2}^{q-1}\,\|\boldsymbol{X}\overline{\boldsymbol{\theta}}-(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star})\|_{2}\,d\boldsymbol{\theta}<+\infty.$$

$$e_{\boldsymbol{\theta}}=\operatorname{P}_{\operatorname{aff}(\partial J(\boldsymbol{\theta}))}(0).$$

$$S_{\boldsymbol{\theta}}=\operatorname{par}(\partial J(\boldsymbol{\theta}))\quad\text{and}\quad T_{\boldsymbol{\theta}}=S_{\boldsymbol{\theta}}^{\perp}.$$

$$\partial J(\boldsymbol{\theta})=\big\{\boldsymbol{\eta}\in\mathbb{R}^{p}\;:\;\boldsymbol{\eta}_{T_{\boldsymbol{\theta}}}=e_{\boldsymbol{\theta}}\quad\text{and}\quad\inf_{\tau\geq 0}\max\left(J^{\circ}\left(\tau e_{\boldsymbol{\theta}}+\boldsymbol{\eta}_{S_{\boldsymbol{\theta}}}+(\tau-1)\operatorname{P}_{S_{\boldsymbol{\theta}}}f_{\boldsymbol{\theta}}\right),\tau\right)\leq 1\big\}.$$

$$J(\boldsymbol{\omega}_{S_{\boldsymbol{\theta}}})=\langle\boldsymbol{\eta}_{S_{\boldsymbol{\theta}}},\boldsymbol{\omega}_{S_{\boldsymbol{\theta}}}\rangle.$$

$$\sigma_{\partial J(\boldsymbol{\theta})-f_{\boldsymbol{\theta}}}(\boldsymbol{\omega})=J(\boldsymbol{\omega}_{S_{\boldsymbol{\theta}}})-\langle\operatorname{P}_{S_{\boldsymbol{\theta}}}f_{\boldsymbol{\theta}},\boldsymbol{\omega}_{S_{\boldsymbol{\theta}}}\rangle.$$

$$\sigma_{\partial J(\boldsymbol{\theta})-f_{\boldsymbol{\theta}}}(\boldsymbol{\omega})=\langle\boldsymbol{v},\boldsymbol{\omega}_{S_{\boldsymbol{\theta}}}\rangle.$$

$$\partial J(\boldsymbol{\theta})=\operatorname{aff}(\partial J(\boldsymbol{\theta}))\cap\mathcal{C}^{\circ}=\big\{\boldsymbol{\eta}\in\mathbb{R}^{p}\;:\;\boldsymbol{\eta}_{T_{\boldsymbol{\theta}}}=e_{\boldsymbol{\theta}}\quad\text{and}\quad J^{\circ}(\boldsymbol{\eta}_{S_{\boldsymbol{\theta}}})\leq 1\big\}.$$

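$$\begin{aligned}H^{\circ}(\boldsymbol{\omega})&=\max_{\rho\in[0,1]}\overline{\mathrm{conv}}\big(\inf\big(\sigma_{J(\boldsymbol{\theta})\leq\rho}(\boldsymbol{\omega}),\sigma_{G(\boldsymbol{\theta})\leq 1-\rho}(\boldsymbol{\omega})\big)\big)\\&=\max_{\rho\in[0,1]}\overline{\mathrm{conv}}\big(\inf\big(\rho\,\sigma_{J(\boldsymbol{\theta})\leq 1}(\boldsymbol{\omega}),(1-\rho)\,\sigma_{G(\boldsymbol{\theta})\leq 1}(\boldsymbol{\omega})\big)\big)\\&=\max_{\rho\in[0,1]}\overline{\mathrm{conv}}\big(\inf\big(\rho J^{\circ}(\boldsymbol{\omega}),(1-\rho)G^{\circ}(\boldsymbol{\omega})\big)\big).\end{aligned}$$

$$\begin{aligned}H^{\circ}(\boldsymbol{\omega})&=\sup_{\boldsymbol{u}\in\mathcal{C}\cap\operatorname{Span}(\boldsymbol{D}^{\top})}\langle\boldsymbol{D}^{+}\boldsymbol{\omega},\boldsymbol{u}\rangle\\&=\overline{\mathrm{conv}}\big(\inf\big(J^{\circ}(\boldsymbol{D}^{+}\boldsymbol{\omega}),\iota_{\operatorname{Ker}(\boldsymbol{D})}(\boldsymbol{D}^{+}\boldsymbol{\omega})\big)\big)\\&=J^{\circ}(\boldsymbol{D}^{+}\boldsymbol{\omega}).\end{aligned}$$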

$$J(\boldsymbol{\theta})=\|\boldsymbol{\theta}\|_{1}=\sum_{i=1}^{p}|\boldsymbol{\theta}_{i}|.$$

$$T_{\boldsymbol{\theta}}=\operatorname{Span}\{(\boldsymbol{a}_{i})_{i\in\mathrm{supp}(\boldsymbol{\theta})}\},\quad(e_{\boldsymbol{\theta}})_{i}=\begin{cases}\operatorname{sign}(\boldsymbol{\theta}_{i})&\text{if }i\in\mathrm{supp}(\boldsymbol{\theta})\\0&\text{otherwise}\end{cases},\quad\text{and}\quad J^{\circ}=\|\cdot\|_{\infty}.$$

$$J(\boldsymbol{\theta})=\|\boldsymbol{\theta}\|_{1,2}\stackrel{\text{def}}{=}\sum_{i=1}^{L}\|\boldsymbol{\theta}_{b_{i}}\|_{2}.$$

$$T_{\boldsymbol{\theta}}=\operatorname{Span}\{(\boldsymbol{a}_{j})_{\{j\;:\;\exists i\in\mathrm{supp}_{\mathcal{B}}(\boldsymbol{\theta}),\,j\in b_{i}\}}\},\quad(e_{\boldsymbol{\theta}})_{b_{i}}=\begin{cases}\tfrac{\boldsymbol{\theta}_{b_{i}}}{\|\boldsymbol{\theta}_{b_{i}}\|_{2}}&\text{if }i\in\mathrm{supp}_{\mathcal{B}}(\boldsymbol{\theta})\\0&\text{otherwise}\end{cases},\quad\text{and}\quad J^{\circ}(\boldsymbol{\omega})=\max_{i\in\{1,\ldots,L\}}\|\boldsymbol{\omega}_{b_{i}}\|_{2}.$$

$$J(\boldsymbol{\theta})=\|\boldsymbol{D}^{\top}\boldsymbol{\theta}\|_{1,2}.$$


Taxonomy

Topics: Statistical Methods and Inference · Stochastic Gradient Optimization Techniques · Distributed Sensor Networks and Detection Algorithms

Full text

Sharp Oracle Inequalities for Low-complexity Priors

Tung Duy Luu and Jalal Fadili: Normandie Univ, ENSICAEN, CNRS, GREYC, France. Email: {duy-tung.luu, Jalal.Fadili}@ensicaen.fr.

Christophe Chesneau: Normandie Univ, UNICAEN, CNRS, LMNO, France. Email: [email protected].

Abstract

In this paper, we consider a high-dimensional statistical estimation problem in which the number of parameters is comparable to or larger than the sample size. We present a unified analysis of the performance guarantees of exponential weighted aggregation and penalized estimators with a general class of data losses and priors which encourage objects which conform to some notion of simplicity/complexity. More precisely, we show that these two estimators satisfy sharp oracle inequalities for prediction ensuring their good theoretical performances. We also highlight the differences between them. When the noise is random, we provide oracle inequalities in probability using concentration inequalities. These results are then applied to several instances including the Lasso, the group Lasso, their analysis-type counterparts, the $\ell_{\infty}$ and the nuclear norm penalties. All our estimators can be efficiently implemented using proximal splitting algorithms.

Key words. High-dimensional estimation, exponential weighted aggregation, penalized estimation, oracle inequality, low-complexity models.

AMS subject classifications. 62G07 62G20

1 Introduction

1.1 Problem statement

Our statistical context is the following. Let $\boldsymbol{y}=(\boldsymbol{y}_{1},\boldsymbol{y}_{2},\cdots,\boldsymbol{y}_{n})$ be $n$ identically distributed observations with common marginal distribution, and $\boldsymbol{X}\in\mathbb{R}^{n\times p}$ a deterministic design matrix. The goal is to estimate a parameter vector $\boldsymbol{\theta}\in\mathbb{R}^{p}$ of the observations' marginal distribution based on the data $\boldsymbol{y}$ and $\boldsymbol{X}$.

Let $F:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}$ be a loss function, supposed to be smooth and convex, that assigns to each $\boldsymbol{\theta}\in\mathbb{R}^{p}$ a cost $F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})$. Let $\boldsymbol{\theta}_{0}\in\operatorname*{Argmin}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\mathbb{E}\left[F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})\right]$ be any minimizer of the population risk. We regard $\boldsymbol{\theta}_{0}$ as the true parameter. A usual instance of this statistical setting is the standard linear regression model based on $n$ response-covariate pairs $(\boldsymbol{y}_{i},\boldsymbol{X}_{i})$ that are linked linearly through $\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\theta}_{0}+\boldsymbol{\xi}$, with $F(\boldsymbol{u},\boldsymbol{y})=\tfrac{1}{2}\|\boldsymbol{y}-\boldsymbol{u}\|_{2}^{2}$.

Our goal is to provide general oracle inequalities in prediction for two estimators of $\boldsymbol{\theta}_{0}$: the penalized estimator and exponential weighted aggregation. In the setting where $p$ is larger than $n$ (possibly much larger), the estimation problem is ill-posed since the rectangular matrix $\boldsymbol{X}$ has a kernel of dimension at least $p-n$. To circumvent this difficulty, we will exploit the prior that $\boldsymbol{\theta}_{0}$ has some low-complexity structure (among which sparsity and low-rank are the most popular). That is, even if the ambient dimension $p$ of $\boldsymbol{\theta}_{0}$ is very large, its intrinsic dimension is much smaller than the sample size $n$. This makes it possible to build estimates $\boldsymbol{X}\widehat{\boldsymbol{\theta}}$ with good provable performance guarantees under appropriate conditions. There has been a flurry of research on the use of low-complexity regularization in ill-posed recovery problems in various areas including statistics and machine learning.

1.2 Variational/Penalized Estimators

Regularization is now a central theme in many fields including statistics, machine learning and inverse problems. It allows one to impose on the set of candidate solutions some prior structure on the object to be estimated. This regularization ranges from squared Euclidean or Hilbertian norms to non-Hilbertian norms (e.g. $\ell_{1}$ norm for sparse objects, or nuclear norm for low-rank matrices) that have sparked considerable interest in recent years. In this paper, we consider the class of estimators obtained by solving the convex optimization problem [footnote 1]

$$\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}\in\operatorname*{Argmin}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\Big\{V_{n}(\boldsymbol{\theta})\stackrel{\text{def}}{=}\tfrac{1}{n}F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})+\lambda_{n}J(\boldsymbol{\theta})\Big\},\tag{1.1}$$

where the regularizing penalty $J$ is a proper closed convex function that promotes some specific notion of simplicity/low-complexity, and $\lambda_{n}>0$ is the regularization parameter. A prominent member covered by (1.1) is the Lasso [13, 57, 42, 23, 8, 5, 7, 33] and its variants such as the analysis/fused Lasso [52, 58], SLOPE [6, 54] or group Lasso [2, 76, 1, 73]. Another example is the nuclear norm minimization for low rank matrix recovery motivated by various applications including robust PCA, phase retrieval, control and computer vision [45, 10, 28, 11]. See [40, 7, 67, 64] for generalizations and comprehensive reviews.

[Footnote 1] To avoid trivialities, the set of minimizers is assumed non-empty, which holds for instance if $J$ is also coercive.

1.3 Exponential Weighted Aggregation (EWA)

An alternative to the variational estimator (1.1) is aggregation by exponential weighting, which consists in substituting averaging for minimization. The aggregators are defined via the probability density function

$$\mu_{n}(\boldsymbol{\theta})=\frac{\exp\left(-V_{n}(\boldsymbol{\theta})/\beta\right)}{\int_{\Theta}\exp\left(-V_{n}(\boldsymbol{\omega})/\beta\right)d\boldsymbol{\omega}},\tag{1.2}$$

where $\beta>0$ is called the temperature parameter. If all $\boldsymbol{\theta}$ are candidates to estimate the true vector $\boldsymbol{\theta}_{0}$, then $\Theta=\mathbb{R}^{p}$. The aggregate is thus defined by

$$\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}=\int_{\mathbb{R}^{p}}\boldsymbol{\theta}\,\mu_{n}(\boldsymbol{\theta})\,d\boldsymbol{\theta}.\tag{1.3}$$

Aggregation by exponential weighting has been widely considered in the statistical and machine learning literatures, see e.g. [20, 17, 16, 21, 41, 74, 46, 35, 29, 26] to name a few. $\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}$ can also be interpreted as the posterior conditional mean in the Bayesian sense if $F/(n\beta)$ is the negative log-likelihood associated to the noise $\boldsymbol{\xi}$ with the prior density $\pi(\boldsymbol{\theta})\propto\exp\left(-\lambda_{n}J(\boldsymbol{\theta})/\beta\right)$.
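To make the contrast between minimization and averaging concrete, here is a minimal sketch in the simplest possible setting, which is an assumption of this illustration and not a case treated in the text: $n=p=1$, $\boldsymbol{X}=1$, quadratic loss and $J=|\cdot|$, so that $V(\theta)=\tfrac{1}{2}(y-\theta)^{2}+\lambda|\theta|$. The penalized estimator is then the soft-thresholding of $y$, while the aggregate is approximated by a Riemann sum on a grid. The function names, the grid width and the grid size are all hypothetical choices.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward 0 by t (the proximal operator of t*|.|)."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def V(theta, y, lam):
    """Toy objective V(theta) = 0.5*(y - theta)^2 + lam*|theta| (n = p = 1, X = 1)."""
    return 0.5 * (y - theta) ** 2 + lam * abs(theta)

def pen_estimate(y, lam):
    """Minimizer of V: soft-thresholding of y at level lam."""
    return soft_threshold(y, lam)

def ewa_estimate(y, lam, beta, half_width=20.0, num=40001):
    """Mean of the density proportional to exp(-V/beta), via a Riemann sum.

    The grid [-half_width, half_width] is an assumption of this sketch:
    it must be wide enough to carry essentially all of the mass.
    """
    step = 2.0 * half_width / (num - 1)
    grid = [-half_width + i * step for i in range(num)]
    values = [V(t, y, lam) for t in grid]
    v_min = min(values)  # subtract the minimum for numerical stability
    weights = [math.exp(-(v - v_min) / beta) for v in values]
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

if __name__ == "__main__":
    for beta in (1.0, 0.1, 0.01):
        print(beta, pen_estimate(3.0, 1.0), ewa_estimate(3.0, 1.0, beta))
```

In this toy case the aggregate approaches the penalized estimate as the temperature $\beta$ decreases, but unlike soft-thresholding it does not return exactly zero for small $|y|$.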

1.4 Oracle inequalities

Oracle inequalities, which are at the heart of our work, quantify the quality of an estimator compared to the best possible one among a family of estimators. These inequalities are well adapted in the scenario where the prior penalty promotes some notion of low-complexity (e.g. sparsity, low rank, etc.). Given two vectors ${\boldsymbol{\theta}}_{1}$ and ${\boldsymbol{\theta}}_{2}$ , let $R_{n}{\left({\boldsymbol{\theta}}_{1},{\boldsymbol{\theta}}_{2}\right)}$ be a nonnegative error measure between their predictions, respectively $\boldsymbol{X}{\boldsymbol{\theta}}_{1}$ and $\boldsymbol{X}{\boldsymbol{\theta}}_{2}$ . A popular example is the averaged prediction squared error $\tfrac{1}{n}\big{\|}\boldsymbol{X}{\boldsymbol{\theta}}_{1}-\boldsymbol{X}{\boldsymbol{\theta}}_{2}\big{\|}_{2}^{2}$ , where $\big{\|}\cdot\big{\|}_{2}$ is the $\ell_{2}$ norm. $R_{n}$ will serve as a measure of the performance of the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ . More precisely, we aim to prove that ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ mimic as much as possible the best possible model. This idea is materialized in the following type of inequalities (stated here for EWA)

$$R_{n}\big(\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}},\boldsymbol{\theta}_{0}\big)\leq C\inf_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\big(R_{n}(\boldsymbol{\theta},\boldsymbol{\theta}_{0})+\Delta_{n,p,\lambda_{n},\beta}(\boldsymbol{\theta})\big),$$

where $C\geq 1$ is the leading constant of the oracle inequality and the remainder term $\Delta_{n,p,\lambda_{n},\beta}(\boldsymbol{\theta})$ depends on the performance of the estimator, the complexity of $\boldsymbol{\theta}$, the sample size $n$, the dimension $p$, and the regularization and temperature parameters $(\lambda_{n},\beta)$. An estimator with good oracle properties would correspond to $C$ close to $1$ (ideally, $C=1$, in which case the inequality is said to be “sharp”), and to a remainder $\Delta_{n,p,\lambda_{n},\beta}(\boldsymbol{\theta})$ that is small and decreases rapidly to $0$ as $n\to+\infty$.

1.5 Contributions

We provide a unified analysis where we capture the essential ingredients behind the low-complexity priors promoted by $J$ , relying on sophisticated arguments from convex analysis and our previous work [27, 63, 65, 62, 64]. Our main contributions are summarized as follows:

• We show that the EWA estimator $\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}$ in (1.3) and the variational/penalized estimator $\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}$ in (1.1) satisfy (deterministic) sharp oracle inequalities for prediction with optimal remainder term, for general data losses $F$ beyond the usual quadratic one, and for $J$ a proper finite-valued sublinear function (i.e. $J$ is finite-valued, convex and positively homogeneous). We also highlight the differences between the two estimators in terms of the corresponding bounds.

• When the observations are random, we prove oracle inequalities in probability. The theory is non-asymptotic in nature, as it yields explicit bounds that hold with high probability for finite sample sizes, and reveals the dependence on dimension and other structural parameters of the model.

• For the standard linear model with Gaussian or sub-Gaussian noise, and a quadratic loss, we deliver refined versions of these oracle inequalities in probability. We underscore the role of the Gaussian width, a concept that captures important geometric characteristics of sets in $\mathbb{R}^{n}$.

• These results naturally yield a large number of corollaries when specialized to penalties routinely used in the literature, among which the Lasso, the group Lasso, their analysis-type counterparts (fused (group) Lasso), the $\ell_{\infty}$ and the nuclear norms. Some of these corollaries are known and others are novel.

The estimators $\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}$ and $\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}$ can be easily implemented thanks to the framework of proximal splitting methods, and more precisely forward-backward type splitting. While the latter is well-known to solve (1.1) [64], its application within a proximal Langevin Monte-Carlo algorithm to compute $\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}$ with provable guarantees has been recently developed by the authors in [26] to sample from log-semiconcave densities [footnote 2], see also [25] for log-concave densities.

[Footnote 2] In a forthcoming paper, this framework was extended to cover the even more general class of prox-regular functions.
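As an illustration of forward-backward splitting for the penalized estimator only, the following is a minimal sketch for the special case of a quadratic loss with $J=\|\cdot\|_{1}$ (the Lasso), i.e. minimizing $\tfrac{1}{2n}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{2}+\lambda\|\boldsymbol{\theta}\|_{1}$. It is not the authors' implementation: the function names, the fixed iteration count and the conservative step size $n/\|\boldsymbol{X}\|_{F}^{2}$ (which is at most the inverse Lipschitz constant of the gradient of the smooth term) are assumptions of this sketch. The proximal Langevin sampler of [26] used for EWA is not shown.

```python
import math

def soft_threshold(z, t):
    """Proximal operator of t*|.| applied to the scalar z."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso_forward_backward(X, y, lam, n_iter=500):
    """Forward-backward iterations for (1/(2n))*||y - X theta||^2 + lam*||theta||_1.

    X is a list of n rows of length p, y a list of length n.
    """
    n, p = len(X), len(X[0])
    # Squared Frobenius norm bounds the squared spectral norm from above,
    # so this step never exceeds 1/L with L = ||X||_op^2 / n.
    frob2 = sum(x * x for row in X for x in row)
    step = n / frob2
    theta = [0.0] * p
    for _ in range(n_iter):
        # Forward (gradient) step on the smooth data-fidelity term.
        residual = [sum(X[i][j] * theta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(p)]
        # Backward (proximal) step on the l1 penalty.
        theta = [soft_threshold(theta[j] - step * grad[j], step * lam) for j in range(p)]
    return theta

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(lasso_forward_backward(identity, [3.0, -0.5, 1.5], 0.25))
```

With an identity design the minimizer is available in closed form (soft-thresholding of $\boldsymbol{y}$ at level $n\lambda$), which gives a simple sanity check for the iterations.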

1.6 Relation to previous work

Our oracle inequality for ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ extends the work of [18] with an unprecedented level of generality, far beyond the Lasso and the nuclear norm. Our prediction sharp oracle inequality for ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ specializes to that of [55] in the case of the Lasso (see also the discussion in [19] and references therein) and that of [34] for the case of the nuclear norm. Our work also goes much beyond that in [67] on weakly decomposable priors, where we show in particular that there is no need to impose decomposability on the regularizer, since it is rather an intrinsic property of it.

1.7 Paper organization

Section 2 states our main assumptions on the data loss and the prior penalty. All the concepts and notions are illustrated on several penalties, some of which are popular in the literature. In Section 3, we prove our main oracle inequalities, and their versions in probability. We then tackle the case of linear regression with quadratic data loss in Section 4. Concepts from convex analysis that are essential to this work are gathered in Section A. A key intermediate result in the proof of our main results is established in Section B with an elegant argument relying on Moreau-Yosida regularization.

1.8 Notations

Vectors and matrices

For a $d$ -dimensional Euclidean space $\mathbb{R}^{d}$ , we endow it with its usual inner product $\langle\cdot,\cdot\rangle$ and associated norm $\left\|\cdot\right\|_{2}$ . $\mathrm{\bf Id}_{d}$ is the identity matrix on $\mathbb{R}^{d}$ . For $p\geq 1$ , $\left\|\cdot\right\|_{p}$ will denote the $\ell_{p}$ norm of a vector with the usual adaptation for $p=+\infty$ .

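In the following, if $T$ is a vector space, $\operatorname{P}_{T}$ denotes the orthogonal projector on $T$, and

$$\boldsymbol{\theta}_{T}=\operatorname{P}_{T}\boldsymbol{\theta}\quad\text{and}\quad\boldsymbol{X}_{T}=\boldsymbol{X}\operatorname{P}_{T}.$$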

For a finite set ${\mathcal{C}}$ we denote $\big{|}{\mathcal{C}}\big{|}$ its cardinality. For $I\subset\{1,\dots,p\}$ , we denote by $I^{c}$ its complement. ${\boldsymbol{\theta}}_{I}$ is the subvector whose entries are those of ${\boldsymbol{\theta}}$ restricted to the indices in $I$ , and $\boldsymbol{X}_{I}$ the submatrix whose columns are those of $\boldsymbol{X}$ indexed by $I$ . For any matrix $\boldsymbol{X}$ , $\boldsymbol{X}^{\top}$ denotes its transpose and $\boldsymbol{X}^{+}$ its Moore-Penrose pseudo-inverse. For a linear operator $\boldsymbol{A}$ , $\boldsymbol{A}^{*}$ is its adjoint.

Sets

For a nonempty set $\mathcal{C}\subset\mathbb{R}^{p}$, we denote by $\overline{\mathrm{conv}}\left(\mathcal{C}\right)$ the closure of its convex hull, and by $\iota_{\mathcal{C}}$ its indicator function, i.e. $\iota_{\mathcal{C}}(\boldsymbol{\theta})=0$ if $\boldsymbol{\theta}\in\mathcal{C}$ and $+\infty$ otherwise. For a nonempty convex set $\mathcal{C}$, its affine hull $\operatorname{aff}(\mathcal{C})$ is the smallest affine manifold containing it. It is a translate of its parallel subspace $\operatorname{par}(\mathcal{C})$, i.e. $\operatorname{par}(\mathcal{C})=\operatorname{aff}(\mathcal{C})-\boldsymbol{\theta}=\mathbb{R}(\mathcal{C}-\mathcal{C})$, for any $\boldsymbol{\theta}\in\mathcal{C}$. The relative interior $\operatorname{ri}(\mathcal{C})$ of a convex set $\mathcal{C}$ is the interior of $\mathcal{C}$ for the topology relative to its affine hull.

Functions

A function $f:\mathbb{R}^{p}\to\mathbb{R}\cup\{+\infty\}$ is closed (or lower semicontinuous) if its epigraph is closed. It is coercive if $\lim_{\|\boldsymbol{\theta}\|_{2}\to+\infty}f(\boldsymbol{\theta})=+\infty$, and strongly coercive if $\lim_{\|\boldsymbol{\theta}\|_{2}\to+\infty}f(\boldsymbol{\theta})/\|\boldsymbol{\theta}\|_{2}=+\infty$. The effective domain of $f$ is $\operatorname{dom}(f)=\big\{\boldsymbol{\theta}\in\mathbb{R}^{p}\;:\;f(\boldsymbol{\theta})<+\infty\big\}$ and $f$ is proper if $\operatorname{dom}(f)\neq\emptyset$, as is the case when it is finite-valued. A function is said to be sublinear if it is convex and positively homogeneous. The Legendre-Fenchel conjugate of $f$ is $f^{*}(\boldsymbol{z})=\sup_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\langle\boldsymbol{z},\boldsymbol{\theta}\rangle-f(\boldsymbol{\theta})$. For $f$ proper, the functions $(f,f^{*})$ obey the Fenchel-Young inequality

$$f(\boldsymbol{\theta})+f^{*}(\boldsymbol{z})\geq\langle\boldsymbol{z},\boldsymbol{\theta}\rangle,\quad\forall(\boldsymbol{\theta},\boldsymbol{z})\in\mathbb{R}^{p}\times\mathbb{R}^{p}.\tag{1.5}$$

When $f$ is a proper lower semicontinuous and convex function, $(f,f^{*})$ is actually the best pair, in the sense that this inequality cannot be tightened. For a function $g$ on $\mathbb{R}_{+}$, the function $g^{+}:a\in\mathbb{R}_{+}\mapsto g^{+}(a)=\sup_{t\geq 0}at-g(t)$ is called the monotone conjugate of $g$. The pair $(g,g^{+})$ obviously obeys (1.5) on $\mathbb{R}_{+}\times\mathbb{R}_{+}$.
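As a quick worked instance of the Fenchel-Young inequality (a standard example chosen here for illustration, not taken from the text), let $f=\tfrac{1}{2}\|\cdot\|_{2}^{2}$, which is its own conjugate, $f^{*}=\tfrac{1}{2}\|\cdot\|_{2}^{2}$. The inequality then reduces to the nonnegativity of a squared norm:

$$f(\boldsymbol{\theta})+f^{*}(\boldsymbol{z})-\langle\boldsymbol{z},\boldsymbol{\theta}\rangle=\tfrac{1}{2}\|\boldsymbol{\theta}\|_{2}^{2}+\tfrac{1}{2}\|\boldsymbol{z}\|_{2}^{2}-\langle\boldsymbol{z},\boldsymbol{\theta}\rangle=\tfrac{1}{2}\|\boldsymbol{\theta}-\boldsymbol{z}\|_{2}^{2}\geq 0,$$

with equality if and only if $\boldsymbol{z}=\boldsymbol{\theta}=\nabla f(\boldsymbol{\theta})$.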

For a $C^{1}$ -smooth function $f$ , $\nabla f({\boldsymbol{\theta}})$ is its (Euclidean) gradient. For a bivariate function $g:({\boldsymbol{\eta}},\boldsymbol{y})\in\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}$ that is $C^{2}$ with respect to the first variable ${\boldsymbol{\eta}}$ , for any $\boldsymbol{y}$ , we will denote $\nabla g({\boldsymbol{\eta}},\boldsymbol{y})$ the gradient of $g$ at ${\boldsymbol{\eta}}$ with respect to the first variable.

The subdifferential $\partial f({\boldsymbol{\theta}})$ of a convex function $f$ at ${\boldsymbol{\theta}}$ is the set

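$$\partial f(\boldsymbol{\theta})=\big\{\boldsymbol{\eta}\in\mathbb{R}^{p}\;:\;f(\boldsymbol{\theta}')\geq f(\boldsymbol{\theta})+\langle\boldsymbol{\eta},\boldsymbol{\theta}'-\boldsymbol{\theta}\rangle,\quad\forall\boldsymbol{\theta}'\in\operatorname{dom}(f)\big\}.$$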

An element of $\partial f({\boldsymbol{\theta}})$ is a subgradient. If the convex function $f$ is differentiable at ${\boldsymbol{\theta}}$ , then its only subgradient is its gradient, i.e. $\partial f({\boldsymbol{\theta}})=\{\nabla f({\boldsymbol{\theta}})\}$ .

The Bregman divergence associated to a convex function $f$ at ${\boldsymbol{\theta}}$ with respect to ${\boldsymbol{\eta}}\in\partial f({\boldsymbol{\theta}})\neq\emptyset$ is

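$$D_{f}^{\boldsymbol{\eta}}(\overline{\boldsymbol{\theta}},\boldsymbol{\theta})=f(\overline{\boldsymbol{\theta}})-f(\boldsymbol{\theta})-\langle\boldsymbol{\eta},\overline{\boldsymbol{\theta}}-\boldsymbol{\theta}\rangle.$$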

The Bregman divergence is in general nonsymmetric. It is also nonnegative by convexity. When $f$ is differentiable at $\overline{{\boldsymbol{\theta}}}$ , we simply write $D_{f}{\left({\boldsymbol{\theta}},\overline{{\boldsymbol{\theta}}}\right)}$ (which is, in this case, also known as the Taylor distance).
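The two properties just stated (nonnegativity and lack of symmetry) can be checked numerically. The sketch below, whose function names are hypothetical, evaluates $f(\overline{\boldsymbol{\theta}})-f(\boldsymbol{\theta})-\langle\nabla f(\boldsymbol{\theta}),\overline{\boldsymbol{\theta}}-\boldsymbol{\theta}\rangle$ for two differentiable convex functions chosen purely for illustration: $f=\tfrac{1}{2}\|\cdot\|_{2}^{2}$, for which the divergence is the symmetric quantity $\tfrac{1}{2}\|\overline{\boldsymbol{\theta}}-\boldsymbol{\theta}\|_{2}^{2}$, and $f(\boldsymbol{\theta})=\sum_{i}\exp(\boldsymbol{\theta}_{i})$, for which it is not symmetric.

```python
import math

def bregman(f, grad_f, theta_bar, theta):
    """Bregman divergence f(theta_bar) - f(theta) - <grad f(theta), theta_bar - theta>."""
    diff = [a - b for a, b in zip(theta_bar, theta)]
    inner = sum(g * d for g, d in zip(grad_f(theta), diff))
    return f(theta_bar) - f(theta) - inner

def half_sq_norm(v):
    return 0.5 * sum(x * x for x in v)

def grad_half_sq_norm(v):
    return list(v)

def sum_exp(v):
    return sum(math.exp(x) for x in v)

def grad_sum_exp(v):
    return [math.exp(x) for x in v]

if __name__ == "__main__":
    # Quadratic case: equals half the squared Euclidean distance.
    print(bregman(half_sq_norm, grad_half_sq_norm, [1.0, 2.0], [0.0, 0.0]))
    # Exponential case: swapping the arguments changes the value.
    print(bregman(sum_exp, grad_sum_exp, [1.0], [0.0]))
    print(bregman(sum_exp, grad_sum_exp, [0.0], [1.0]))
```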

2 Estimation with low-complexity penalties

The estimators $\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}$ and $\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}$ in (1.1) and (1.3) require two essential ingredients: the data loss term $F$ and the prior penalty $J$. We here specify the class of such functions covered in our work, and provide illustrative examples.

2.1 Data loss

The class of loss functions $F$ that we consider obeys the following assumptions:

(H.1) $F(\cdot,\boldsymbol{y}):\mathbb{R}^{n}\to\mathbb{R}$ is $C^{1}(\mathbb{R}^{n})$ and, for all $\boldsymbol{y}$, uniformly convex with modulus $\varphi$, i.e.

$$F(\boldsymbol{v},\boldsymbol{y})\geq F(\boldsymbol{u},\boldsymbol{y})+\langle\nabla F(\boldsymbol{u},\boldsymbol{y}),\boldsymbol{v}-\boldsymbol{u}\rangle+\varphi\left(\|\boldsymbol{v}-\boldsymbol{u}\|_{2}\right),$$

where $\varphi:\mathbb{R}_{+}\to\mathbb{R}_{+}$ is a convex non-decreasing function that vanishes only at $0$.

(H.2) For any $\overline{\boldsymbol{\theta}}\in\mathbb{R}^{p}$ and $\boldsymbol{y}\in\mathbb{R}^{n}$, $\int_{\mathbb{R}^{p}}\exp\left(-F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y})/(n\beta)\right)\big|\langle\nabla F(\boldsymbol{X}\boldsymbol{\theta},\boldsymbol{y}),\boldsymbol{X}(\overline{\boldsymbol{\theta}}-\boldsymbol{\theta})\rangle\big|d\boldsymbol{\theta}<+\infty$.

Recall that by Lemma A.1, the monotone conjugate $\varphi^{+}$ of $\varphi$ is a proper, closed, convex, strongly coercive and non-decreasing function on $\mathbb{R}_{+}$ that vanishes at $0$. Moreover, $\varphi^{++}=\varphi$. $\varphi^{+}$ is finite-valued on $\mathbb{R}_{+}$ if $\varphi$ is strongly coercive, and it vanishes only at $0$ under e.g. Lemma A.1(iii).

The class of data loss functions in (H.1) is fairly general. It is reminiscent of the negative log-likelihood in the regular exponential family. For the moment assumption (H.2) to be satisfied, it is sufficient that

$$\int_{\mathbb{R}^{p}}\exp\left(-\varphi\left(\|\boldsymbol{X}\boldsymbol{\theta}\|_{2}\right)/(n\beta)\right)\|\nabla F(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star},\boldsymbol{y})\|_{2}\,\|\boldsymbol{X}\boldsymbol{\theta}+(\boldsymbol{u}^{\star}-\boldsymbol{X}\overline{\boldsymbol{\theta}})\|_{2}\,d\boldsymbol{\theta}<+\infty,$$

where $\boldsymbol{u}^{\star}$ is the minimizer of $F(\cdot,\boldsymbol{y})$, which is unique by uniform convexity. We here provide an example.

Example 2.1.

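Consider the case where $\varphi(t)=t^{q}/q$ [footnote 3], $q\in]1,+\infty[$, or equivalently $\varphi^{+}(t)=t^{q_{*}}/q_{*}$ where $1/q+1/q_{*}=1$. For $q=q_{*}=2$, (H.1) amounts to saying that $F(\cdot,\boldsymbol{y})$ is strongly convex for all $\boldsymbol{y}$. In particular, [3, Proposition 10.13] shows that $F(\boldsymbol{u},\boldsymbol{y})=\|\boldsymbol{u}-\boldsymbol{y}\|_{2}^{q}/q$ is uniformly convex for $q\in[2,+\infty[$ with modulus $\varphi(t)=C_{q}t^{q}/q$, where $C_{q}>0$ is a constant that depends solely on $q$.

[Footnote 3] We consider a scaled version of $\varphi$ for simplicity, but the same conclusions remain valid if we take $\varphi(t)=Ct^{q}/q$, with $C>0$.

For (H.2) to be verified, it is sufficient that

$$\int_{\mathbb{R}^{p}}\exp\left(-\|\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{q}/(qn\beta)\right)\|\nabla F(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star},\boldsymbol{y})\|_{2}\,\|(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star})-\boldsymbol{X}\overline{\boldsymbol{\theta}}\|_{2}\,d\boldsymbol{\theta}<+\infty.$$

In particular, taking $F(\boldsymbol{u},\boldsymbol{y})=\|\boldsymbol{u}-\boldsymbol{y}\|_{2}^{q}/q$, $q\in[2,+\infty[$, we have $\|\nabla F(\boldsymbol{u},\boldsymbol{y})\|_{2}=\|\boldsymbol{u}-\boldsymbol{y}\|_{2}^{q-1}$, and thus (H.2) holds since

$$\int_{\mathbb{R}^{p}}\exp\left(-\|\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{q}/(qn\beta)\right)\|\boldsymbol{y}-(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star})\|_{2}^{q-1}\,\|\boldsymbol{X}\overline{\boldsymbol{\theta}}-(\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{u}^{\star})\|_{2}\,d\boldsymbol{\theta}<+\infty.$$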
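For completeness, the monotone conjugate claimed in Example 2.1 can be recovered by a short computation, sketched here directly from the definition $g^{+}(a)=\sup_{t\geq 0}at-g(t)$. For $\varphi(t)=t^{q}/q$ with $q>1$ and $a\geq 0$, the map $t\mapsto at-t^{q}/q$ is concave on $\mathbb{R}_{+}$, so the supremum is attained where its derivative $a-t^{q-1}$ vanishes, i.e. at $t_{a}=a^{1/(q-1)}$. Substituting,

$$\varphi^{+}(a)=a\,t_{a}-\frac{t_{a}^{q}}{q}=a^{\frac{q}{q-1}}-\frac{1}{q}a^{\frac{q}{q-1}}=\Big(1-\frac{1}{q}\Big)a^{\frac{q}{q-1}}=\frac{a^{q_{*}}}{q_{*}},\qquad q_{*}=\frac{q}{q-1},$$

which is consistent with $1/q+1/q_{*}=1$. For $q=2$ this gives $\varphi^{+}(a)=a^{2}/2$.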

2.2 Prior penalty

Recall the main definitions and results from convex analysis that are collected in Section A. Our main assumption on $J$ is the following.

(H.3) $J:\mathbb{R}^{p}\to\mathbb{R}$ is the gauge of a non-empty convex compact set containing the origin as an interior point.

By Lemma A.3, this assumption is equivalent to saying that $J\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\gamma_{{\mathcal{C}}}$ is proper, convex, positively homogeneous, finite-valued and coercive. In turn, $J$ is locally Lipschitz continuous on $\mathbb{R}^{p}$ . Observe also that by virtue of Lemma A.4 and Lemma A.2, the polar gauge $J^{\circ}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\gamma_{{\mathcal{C}}^{\circ}}$ enjoys the same properties as $J$ in (H.3).

2.3 Decomposability of the prior penalty

We are now in position to provide an important characterization of the subdifferential mapping of a function $J$ satisfying (H.3). This characterization will play a pivotal role in our proof of the oracle inequality.

We start by defining some essential geometrical objects that were introduced in [63].

Definition 2.1 (Model Subspace).

Let ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$ . We define $e_{{\boldsymbol{\theta}}}$ as

[TABLE]

We denote

[TABLE]

$T_{{\boldsymbol{\theta}}}$ is coined the model subspace of ${\boldsymbol{\theta}}$ associated to $J$ .

It can be shown, see [63, Proposition 5], that ${\boldsymbol{\theta}}\in T_{{\boldsymbol{\theta}}}$ , hence the name model subspace. When $J$ is differentiable at ${\boldsymbol{\theta}}$ , we have $e_{{\boldsymbol{\theta}}}=\nabla J({\boldsymbol{\theta}})$ and $T_{{\boldsymbol{\theta}}}=\mathbb{R}^{p}$ . When $J$ is the $\ell_{1}$ -norm (Lasso), the vector $e_{{\boldsymbol{\theta}}}$ is nothing but the sign of ${\boldsymbol{\theta}}$ . Thus, $e_{{\boldsymbol{\theta}}}$ can be viewed as a generalization of the sign vector. Observe also that $e_{{\boldsymbol{\theta}}}=\operatorname{P}_{T_{{\boldsymbol{\theta}}}}(\partial J({\boldsymbol{\theta}}))$ , and thus $e_{{\boldsymbol{\theta}}}\in T_{{\boldsymbol{\theta}}}\cap\operatorname{aff}(\partial J({\boldsymbol{\theta}}))$ . However, in general, $e_{{\boldsymbol{\theta}}}\not\in\partial J({\boldsymbol{\theta}})$ .

We now provide a fundamental equivalent description of the subdifferential of $J$ at ${\boldsymbol{\theta}}$ in terms of $e_{{\boldsymbol{\theta}}}$ , $T_{{\boldsymbol{\theta}}}$ , $S_{{\boldsymbol{\theta}}}$ and the polar gauge $J^{\circ}$ .

Theorem 2.1.

Let $J$ satisfy (H.3). Let ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$ and $f_{{\boldsymbol{\theta}}}\in\operatorname{ri}(\partial J({\boldsymbol{\theta}}))$ .

(i)

The subdifferential of $J$ at ${\boldsymbol{\theta}}$ reads

[TABLE]

(ii)

For any ${\boldsymbol{\omega}}\in\mathbb{R}^{p}$ , $\exists{\boldsymbol{\eta}}\in\partial J({\boldsymbol{\theta}})$ such that

[TABLE]

Proof.

(i)

This follows by piecing together [63, Theorem 1, Proposition 4 and Proposition 5(iii)].

(ii)

From [63, Proposition 5(iv)], we have

[TABLE]

Thus there exists a supporting point ${\boldsymbol{v}}\in\partial J({\boldsymbol{\theta}})-f_{{\boldsymbol{\theta}}}\subset S_{{\boldsymbol{\theta}}}$ with normal vector ${\boldsymbol{\omega}}$ [3, Corollary 7.6(iii)], i.e.

[TABLE]

Taking ${\boldsymbol{\eta}}={\boldsymbol{v}}+f_{{\boldsymbol{\theta}}}$ concludes the proof.

∎

Remark 2.1.

The coercivity assumption in (H.3) is not needed for Theorem 2.1 to hold.

The decomposability of $\partial J({\boldsymbol{\theta}})$ described in Theorem 2.1(i) depends on the particular choice of the mapping ${\boldsymbol{\theta}}\mapsto f_{{\boldsymbol{\theta}}}\in\operatorname{ri}(\partial J({\boldsymbol{\theta}}))$ . An interesting situation is encountered when $e_{{\boldsymbol{\theta}}}\in\operatorname{ri}(\partial J({\boldsymbol{\theta}}))$ , so that one can choose $f_{{\boldsymbol{\theta}}}=e_{{\boldsymbol{\theta}}}$ . Strong gauges, see [63, Definition 6], are precisely a class of gauges for which this situation occurs, and in this case, Theorem 2.1(i) has the simpler form

[TABLE]

The Lasso, group Lasso and nuclear norms are typical examples of (symmetric) strong gauges. However, analysis sparsity penalties (e.g. the fused Lasso) or the $\ell_{\infty}$ -penalty are not strong gauges, though they obviously satisfy (H.3). See the next section for a detailed discussion.

2.4 Calculus with the prior family

The family of penalties complying with (H.3) forms a robust class enjoying important calculus rules. In particular, it is closed under the sum and composition with an injective linear operator, as we now prove.

Lemma 2.1.

The set of functions satisfying (H.3) is closed under addition (it is obvious that the same holds with any positive linear combination) and pre-composition by an injective linear operator. More precisely, the following holds:

(i)

Let $J$ and $G$ be two gauges satisfying (H.3). Then $H\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}J+G$ also obeys (H.3). Moreover,

(a)

$T^{H}_{{\boldsymbol{\theta}}}=T^{J}_{{\boldsymbol{\theta}}}\cap T^{G}_{{\boldsymbol{\theta}}}$ and $e_{{\boldsymbol{\theta}}}^{H}=\operatorname{P}_{T^{H}_{{\boldsymbol{\theta}}}}(e_{{\boldsymbol{\theta}}}^{J}+e_{{\boldsymbol{\theta}}}^{G})$ , where $T^{J}_{{\boldsymbol{\theta}}}$ and $e_{{\boldsymbol{\theta}}}^{J}$ (resp. $T^{G}_{{\boldsymbol{\theta}}}$ and $e_{{\boldsymbol{\theta}}}^{G}$ ) are the model subspace and vector at ${\boldsymbol{\theta}}$ associated to $J$ (resp. $G$ );

(b)

$H^{\circ}({\boldsymbol{\omega}})=\max_{\rho\in[0,1]}{\overline{\mathrm{conv}}\left(\inf{\left(\rho J^{\circ}({\boldsymbol{\omega}}),(1-\rho)G^{\circ}({\boldsymbol{\omega}})\right)}\right)}$ .

(ii)

Let $J$ be a gauge satisfying (H.3), and ${\boldsymbol{D}}:\mathbb{R}^{q}\to\mathbb{R}^{p}$ be surjective. Then $H\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}J\circ{\boldsymbol{D}}^{\top}$ also fulfills (H.3). Moreover,

(a)

$T^{H}_{{\boldsymbol{\theta}}}=\operatorname*{Ker}({\boldsymbol{D}}_{S^{J}_{{\boldsymbol{u}}}}^{\top})$ and $e_{{\boldsymbol{\theta}}}^{H}=\operatorname{P}_{T^{H}_{{\boldsymbol{\theta}}}}{\boldsymbol{D}}e_{{\boldsymbol{u}}}^{J}$ , where $T^{J}_{{\boldsymbol{u}}}$ and $e_{{\boldsymbol{u}}}^{J}$ are the model subspace and vector at ${\boldsymbol{u}}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}{\boldsymbol{D}}^{\top}{\boldsymbol{\theta}}$ associated to $J$ ;

(b)

$H^{\circ}({\boldsymbol{\omega}})=J^{\circ}({\boldsymbol{D}}^{+}{\boldsymbol{\omega}})$ , where ${\boldsymbol{D}}^{+}={\boldsymbol{D}}^{\top}{\big{(}{\boldsymbol{D}}{\boldsymbol{D}}^{\top}\big{)}}^{-1}$ .

The outcome of Lemma 2.1 is naturally expected. For instance, assertion (i) states that combining several penalties/priors will promote objects living on the intersection of the respective low-complexity models. Similarly, for (ii), one promotes low-complexity in the image of the analysis operator ${\boldsymbol{D}}^{\top}$ . It then follows that one does not need to deploy an ad hoc analysis when linearly pre-composing or combining (or both) several penalties (e.g. $\ell_{1}$ +nuclear norms for recovering sparse and low-rank matrices) since our unified analysis in Section 3 will apply to them just as well.
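To make the pseudo-inverse in Lemma 2.1(ii)(b) concrete, the following sketch builds a random surjective ${\boldsymbol{D}}$ (a $p\times q$ matrix with $p\leq q$ , full row rank almost surely) and takes $J$ to be the $\ell_{1}$ norm, so that $H=\|{\boldsymbol{D}}^{\top}\cdot\|_{1}$ . It only checks that ${\boldsymbol{D}}{\boldsymbol{D}}^{+}$ is the identity and that the resulting duality inequality $\langle{\boldsymbol{\omega}},{\boldsymbol{\theta}}\rangle\leq J^{\circ}({\boldsymbol{D}}^{+}{\boldsymbol{\omega}})H({\boldsymbol{\theta}})$ holds. The dimensions and the random matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 5
D = rng.standard_normal((p, q))            # p x q, surjective almost surely
D_plus = D.T @ np.linalg.inv(D @ D.T)      # D^+ = D^T (D D^T)^{-1}

theta = rng.standard_normal(p)
omega = rng.standard_normal(p)

H_theta = np.abs(D.T @ theta).sum()        # H(theta) = ||D^T theta||_1
polar_bound = np.abs(D_plus @ omega).max() # J°(D^+ omega) with J° = l_infinity norm

# <omega, theta> = <D^+ omega, D^T theta> <= J°(D^+ omega) * J(D^T theta)
inner = omega @ theta
```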

Proof.

(i)

Convexity, positive homogeneity, coercivity and finite-valuedness are straightforward.

(a)

This is [63, Proposition 8(i)-(ii)].

(b)

We have from Lemma A.4 and calculus rules on support functions,

[TABLE]

(ii)

Again, convexity, positive homogeneity and finite-valuedness are immediate. Coercivity holds by injectivity of ${\boldsymbol{D}}^{\top}$ .

(a)

This is [63, Proposition 10(i)-(ii)].

(b)

Denote $J=\gamma_{{\mathcal{C}}}$ . We have

[TABLE]

where in the last equality, we used the fact that ${\boldsymbol{D}}^{+}{\boldsymbol{\omega}}\in\operatorname*{Span}{\big{(}{\boldsymbol{D}}^{\top}\big{)}}=\operatorname*{Ker}({\boldsymbol{D}})^{\perp}$ , and thus $\iota_{\operatorname*{Ker}({\boldsymbol{D}})}({\boldsymbol{D}}^{+}{\boldsymbol{\omega}})=+\infty$ unless ${\boldsymbol{\omega}}=0$ , and $J^{\circ}$ is continuous and convex by (H.3) and Lemma A.4.

∎

2.5 Examples

2.5.1 Lasso

The Lasso regularization is used to promote the sparsity of the minimizers, see [7] for a comprehensive review. It corresponds to choosing $J$ as the $\ell_{1}$ -norm

[TABLE]

It is also referred to as $\ell_{1}$ -synthesis in the signal processing community, in contrast to the more general $\ell_{1}$ -analysis sparsity penalty detailed below.

We denote $(\boldsymbol{a}_{i})_{1\leq i\leq p}$ the canonical basis of $\mathbb{R}^{p}$ and $\mathrm{supp}({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big{\{}i\in\{1,\dots,p\}\;:\;{\boldsymbol{\theta}}_{i}\neq 0\big{\}}$ . Then,

[TABLE]
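For the Lasso, the objects of Definition 2.1 are easy to compute. The sketch below assumes the usual description: $e_{{\boldsymbol{\theta}}}$ is the sign vector (as stated after Definition 2.1) and $T_{{\boldsymbol{\theta}}}$ is the set of vectors supported on $\mathrm{supp}({\boldsymbol{\theta}})$ . The function name `lasso_model` and the test vector are illustrative.

```python
import numpy as np

def lasso_model(theta):
    """Return (support, e_theta, proj_T) for J = l1 norm.

    Assumes T_theta = span{a_i : i in supp(theta)} and e_theta = sign(theta).
    """
    support = np.flatnonzero(theta != 0)
    e = np.sign(theta)

    def proj_T(omega):
        # Orthogonal projection onto T_theta: keep the coordinates in the support.
        out = np.zeros_like(omega)
        out[support] = omega[support]
        return out

    return support, e, proj_T

theta = np.array([1.5, 0.0, -2.0, 0.0])
support, e, proj_T = lasso_model(theta)

l1_norm = np.abs(theta).sum()
polar_of_e = np.abs(e).max()     # J°(e_theta) with J° = l_infinity norm
```

Here $\langle e_{{\boldsymbol{\theta}}},{\boldsymbol{\theta}}\rangle=\|{\boldsymbol{\theta}}\|_{1}$ and $J^{\circ}(e_{{\boldsymbol{\theta}}})=1$ , which is the behaviour expected of a sign vector.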

2.5.2 Group Lasso

The group Lasso has been advocated to promote sparsity by groups, i.e. it drives all the coefficients in one group to zero together hence leading to group selection, see [2, 76, 1, 73] to cite a few. The group Lasso penalty with $L$ groups reads

[TABLE]

where $\bigcup_{i=1}^{L}b_{i}=\{1,\ldots,p\}$ , $b_{i},b_{j}\subset\{1,\ldots,p\},$ and $b_{i}\cap b_{j}=\emptyset$ whenever $i\neq j$ . Define the group support as $\mathrm{supp}_{{\mathcal{B}}}({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big{\{}i\in\{1,\ldots,L\}\;:\;{\boldsymbol{\theta}}_{b_{i}}\neq 0\big{\}}$ . Thus, one has

[TABLE]
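A small numerical illustration of the group Lasso objects may help. The sketch assumes the standard expressions: the penalty is the sum of the $\ell_{2}$ norms of the blocks, $e_{{\boldsymbol{\theta}}}$ is the blockwise normalized vector on active groups (zero elsewhere), and the polar gauge is the largest block $\ell_{2}$ norm. The partition `groups` and the vector `theta` are made up for the example.

```python
import numpy as np

groups = [[0, 1], [2, 3], [4]]               # a partition of {0, ..., 4}
theta = np.array([3.0, 4.0, 0.0, 0.0, -2.0])

def group_lasso(theta, groups):
    """J(theta) = sum_i ||theta_{b_i}||_2."""
    return sum(np.linalg.norm(theta[b]) for b in groups)

def group_polar(omega, groups):
    """Polar gauge: max_i ||omega_{b_i}||_2."""
    return max(np.linalg.norm(omega[b]) for b in groups)

def group_support(theta, groups):
    """Indices of the groups on which theta is non-zero."""
    return [i for i, b in enumerate(groups) if np.any(theta[b] != 0)]

def group_sign(theta, groups):
    """Blockwise normalization theta_{b_i} / ||theta_{b_i}||_2 on active groups."""
    e = np.zeros_like(theta)
    for i in group_support(theta, groups):
        b = groups[i]
        e[b] = theta[b] / np.linalg.norm(theta[b])
    return e

J_theta = group_lasso(theta, groups)
active = group_support(theta, groups)
e = group_sign(theta, groups)
```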

2.5.3 Analysis (group) Lasso

One can push the structured sparsity idea one step further by promoting group/block sparsity through a linear operator, i.e. analysis-type sparsity. Given a linear operator ${\boldsymbol{D}}:\mathbb{R}^{q}\to\mathbb{R}^{p}$ (seen as a matrix), the analysis group sparsity penalty is

[TABLE]

This encompasses the 2-D isotropic total variation [52]. When all groups have cardinality one, we have the analysis- $\ell_{1}$ penalty (a.k.a. general Lasso), which encapsulates several important penalties including that of the 1-D total variation [52], and the fused Lasso [58]. The overlapping group Lasso [31] is also a special case of (2.4) by taking ${\boldsymbol{D}}^{\top}$ to be an operator that extracts the blocks [43, 14] (in which case ${\boldsymbol{D}}$ even has orthogonal rows).

Let $\Lambda_{{\boldsymbol{\theta}}}=\bigcup_{i\in\mathrm{supp}_{{\mathcal{B}}}({\boldsymbol{D}}^{\top}{\boldsymbol{\theta}})}b_{i}$ and $\Lambda_{{\boldsymbol{\theta}}}^{c}$ its complement. From Lemma 2.1(ii) and (2.5), we get

[TABLE]

If, in addition, ${\boldsymbol{D}}$ is surjective, then by virtue of Lemma 2.1(ii) we also have

[TABLE]

2.5.4 Anti-sparsity

If the vector to be estimated is expected to be flat (anti-sparse), this can be captured using the $\ell_{\infty}$ norm (a.k.a. Tchebychev norm) as prior

[TABLE]

The $\ell_{\infty}$ regularization has found applications in several fields [32, 38, 53]. Suppose that ${\boldsymbol{\theta}}\neq 0$ , and define the saturation support of ${\boldsymbol{\theta}}$ as $I^{\mathrm{sat}}_{{\boldsymbol{\theta}}}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big{\{}i\in\{1,\dots,p\}\;:\;\big{|}{\boldsymbol{\theta}}_{i}\big{|}=\left\|{\boldsymbol{\theta}}\right\|_{\infty}\big{\}}\neq\emptyset$ . From [63, Proposition 14], we have

[TABLE]
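The saturation support is straightforward to compute. In the sketch below, the expression used for $e_{{\boldsymbol{\theta}}}$ , namely $\mathrm{sign}({\boldsymbol{\theta}}_{I})/|I|$ on $I=I^{\mathrm{sat}}_{{\boldsymbol{\theta}}}$ and zero elsewhere, is an assumption: it is the minimal-norm element of the affine hull of $\partial\|\cdot\|_{\infty}({\boldsymbol{\theta}})$ and is not copied from the display above. The polar gauge of the $\ell_{\infty}$ norm is the $\ell_{1}$ norm.

```python
import numpy as np

def saturation_support(theta):
    """Indices i with |theta_i| = ||theta||_inf (theta is assumed non-zero)."""
    return np.flatnonzero(np.abs(theta) == np.abs(theta).max())

def antisparse_sign(theta):
    """Assumed form of e_theta: sign(theta_I) / |I| on the saturation support I."""
    I = saturation_support(theta)
    e = np.zeros_like(theta)
    e[I] = np.sign(theta[I]) / I.size
    return e

theta = np.array([2.0, -2.0, 1.0, 0.5])
I_sat = saturation_support(theta)
e = antisparse_sign(theta)

linf_norm = np.abs(theta).max()
polar_of_e = np.abs(e).sum()      # J°(e_theta) with J° = l1 norm
```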

2.5.5 Nuclear norm

The natural extension of low-complexity priors to matrices ${\boldsymbol{\theta}}\in\mathbb{R}^{p_{1}\times p_{2}}$ is to penalize the singular values of the matrix. Let $\operatorname*{rank}({\boldsymbol{\theta}})=r$ , and ${\boldsymbol{\theta}}=\boldsymbol{U}\operatorname*{diag}(\uplambda({\boldsymbol{\theta}}))\boldsymbol{V}^{\top}$ be a reduced rank- $r$ SVD decomposition, where $\boldsymbol{U}\in\mathbb{R}^{p_{1}\times r}$ and $\boldsymbol{V}\in\mathbb{R}^{p_{2}\times r}$ have orthonormal columns, and $\uplambda({\boldsymbol{\theta}})\in(\mathbb{R}_{+}\setminus\{0\})^{r}$ is the vector of singular values $(\uplambda_{1}({\boldsymbol{\theta}}),\cdots,\uplambda_{r}({\boldsymbol{\theta}}))$ in non-increasing order. The nuclear norm of ${\boldsymbol{\theta}}$ is

[TABLE]

This penalty is the best convex surrogate to enforce a low-rank prior. It has been widely used for various applications [45, 10, 9, 28, 11].

Following e.g. [62, Example 21], we have

[TABLE]
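The nuclear norm objects can be computed from a reduced SVD. The sketch below assumes the standard expressions $e_{{\boldsymbol{\theta}}}=\boldsymbol{U}\boldsymbol{V}^{\top}$ , $\operatorname{P}_{T_{{\boldsymbol{\theta}}}}({\boldsymbol{\omega}})=\boldsymbol{U}\boldsymbol{U}^{\top}{\boldsymbol{\omega}}+{\boldsymbol{\omega}}\boldsymbol{V}\boldsymbol{V}^{\top}-\boldsymbol{U}\boldsymbol{U}^{\top}{\boldsymbol{\omega}}\boldsymbol{V}\boldsymbol{V}^{\top}$ , and the spectral norm as polar gauge. The rank $r$ is known by construction here; in practice it would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r = 6, 5, 2
theta = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))  # rank r

# Reduced rank-r SVD: theta = U diag(s) V^T.
U, s, Vt = np.linalg.svd(theta, full_matrices=False)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

nuclear = s.sum()            # nuclear norm = sum of singular values
e = U @ Vt                   # generalized sign "vector" U V^T

def proj_T(omega):
    """Projection onto T_theta (assumed standard form for the nuclear norm)."""
    PU, PV = U @ U.T, Vt.T @ Vt
    return PU @ omega + omega @ PV - PU @ omega @ PV

spectral_of_e = np.linalg.norm(e, 2)   # polar gauge (spectral norm) of e_theta
```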

3 Oracle inequalities for a general loss

Before delving into the details, we need some notation.

We recall $T_{{\boldsymbol{\theta}}}$ and $e_{{\boldsymbol{\theta}}}$ the model subspace and vector associated to ${\boldsymbol{\theta}}$ (see Definition 2.1). Denote $S_{{\boldsymbol{\theta}}}=T_{{\boldsymbol{\theta}}}^{\perp}$ . Given two coercive finite-valued gauges $J_{1}=\gamma_{{\mathcal{C}}_{1}}$ and $J_{2}=\gamma_{{\mathcal{C}}_{2}}$ , and a linear operator $\boldsymbol{A}$ , we define ${\left|\kern-1.50696pt\left|\kern-1.50696pt\left|\boldsymbol{A}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{J_{1}\to J_{2}}$ the operator bound as

[TABLE]

Note that ${\left|\kern-1.50696pt\left|\kern-1.50696pt\left|\boldsymbol{A}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{J_{1}\to J_{2}}$ is bounded (this follows from Lemma A.3(v)). Furthermore, we have from Lemma A.4 that

[TABLE]

In the following, whenever it is clear from the context, to lighten notation when $J_{i}$ is a norm, we write the subscript of the norm instead of $J_{i}$ (e.g. $p$ for the $\ell_{p}$ norm, $*$ for the nuclear norm, etc.).

Our main result will involve a measure of well-conditionedness of the design matrix $\boldsymbol{X}$ when restricted to some subspace $T$ . More precisely, for $c>0$ , we introduce the coefficient

[TABLE]

This generalizes the compatibility factor introduced in [68] for the Lasso (and used in [18]). The experienced reader may have recognized that this factor is reminiscent of the null space property and restricted injectivity that play a central role in the analysis of the performance guarantees of variational/penalized estimators (1.1); see [27, 63, 65, 62, 64]. One can see in particular that $\Upsilon{\left(T,c\right)}$ is larger than the smallest singular value of $\boldsymbol{X}_{T}$ .

The oracle inequalities will be provided in terms of the loss

[TABLE]

3.1 Oracle inequality for ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$

We are now ready to establish our first main result: an oracle inequality for the EWA estimator (1.3).

Theorem 3.1.

Consider the EWA estimator ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ in (1.3) with the density (1.2), where $F$ and $J$ satisfy Assumptions (H.1)-(H.2) and (H.3). Then, for any $\tau>1$ such that $\lambda_{n}\geq\tau J^{\circ}{\left(-\boldsymbol{X}^{\top}\nabla F(\boldsymbol{X}{\boldsymbol{\theta}}_{0},\boldsymbol{y})\right)}/n$ , the following holds,

[TABLE]

Remark 3.1.

1.

It should be emphasized that Theorem 3.1 is actually a deterministic statement for a fixed choice of $\lambda_{n}$ . Probabilistic analysis will be required when the result is applied to particular statistical models as we will see later. For this, we will use concentration inequalities in order to provide bounds that hold with high probability over the data.

2.

The oracle inequality is sharp. The remainder in it has two terms. The first one encodes the complexity of the model promoted by $J$ . The second one, $p\beta$ , captures the influence of the temperature parameter. In particular, taking $\beta$ sufficiently small of the order $O{\left((pn)^{-1}\right)}$ , this term becomes $O(n^{-1})$ .

3.

When $\varphi(t)=\nu t^{2}/2$ , i.e. $F(\cdot,\boldsymbol{y})$ is $\nu$ -strongly convex, then $\varphi^{+}(t)=t^{2}/(2\nu)$ , and the remainder term becomes

[TABLE]

If, moreover, $\nabla F$ is also $\kappa$ -Lipschitz continuous, then it can be shown that $R_{n}{\big{(}{\boldsymbol{\theta}},{\boldsymbol{\theta}}_{0}\big{)}}$ is equivalent to a quadratic loss. This means that the oracle inequality in Theorem 3.1 can be stated in terms of the quadratic prediction error. However, the inequality is no longer sharp in this case as a constant factor equal to the condition number $\kappa/\nu\geq 1$ naturally multiplies the right-hand side.

4.

If $J$ is such that $e_{{\boldsymbol{\theta}}}\in\partial J({\boldsymbol{\theta}})\subset{\mathcal{C}}^{\circ}$ (typically for a strong gauge by (2.1)), then $J^{\circ}(e_{{\boldsymbol{\theta}}})\leq 1$ (in fact an equality if ${\boldsymbol{\theta}}\neq 0$ ). Thus the term $J^{\circ}(e_{{\boldsymbol{\theta}}})$ can be omitted in (3.2).

5.

A close inspection of the proof of Theorem 3.1 reveals that the term $p\beta$ can be improved to the smaller bound

[TABLE]

where the upper-bound is a consequence of Jensen's inequality.
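The conjugate pair used in the strongly convex case of Remark 3.1 can be verified numerically. The sketch below assumes that $\varphi^{+}$ denotes the monotone conjugate $\varphi^{+}(t)=\sup_{s\geq 0}\{ts-\varphi(s)\}$ and approximates the supremum on a fine grid. The value of $\nu$ , the grid, and the helper name `monotone_conjugate` are illustrative.

```python
import numpy as np

def monotone_conjugate(phi, t, s_grid):
    """Grid approximation of sup_{s >= 0} (t * s - phi(s))."""
    return np.max(t * s_grid - phi(s_grid))

nu = 2.5
phi = lambda s: nu * s**2 / 2                 # modulus phi(t) = nu t^2 / 2

s_grid = np.linspace(0.0, 50.0, 500001)       # grid on R_+ with step 1e-4
t_values = np.array([0.0, 0.7, 3.0, 10.0])

numeric = np.array([monotone_conjugate(phi, t, s_grid) for t in t_values])
closed_form = t_values**2 / (2 * nu)          # claimed phi^+(t) = t^2 / (2 nu)
```

The grid values match $t^{2}/(2\nu)$ up to discretization error, since the supremum is attained at $s=t/\nu$ .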

Proof.

By convexity of $J$ and assumption (H.1), we have for any ${\boldsymbol{\eta}}\in\partial V_{n}({\boldsymbol{\theta}})$ and any $\overline{{\boldsymbol{\theta}}}\in\mathbb{R}^{p}$ ,

[TABLE]

Since $\varphi$ is non-decreasing and convex, $\varphi\circ\left\|\cdot\right\|_{2}$ is a convex function. Thus, taking the expectation w.r.t. $\mu_{n}$ on both sides and using Jensen's inequality, we get

[TABLE]

This holds for any ${\boldsymbol{\eta}}\in\partial V_{n}({\boldsymbol{\theta}})$ , and in particular at the minimal selection ${\big{(}\partial V_{n}({\boldsymbol{\theta}})\big{)}}^{0}$ (see Section B for details). It then follows from the pillar result in Proposition B.1 (in the appendix, we provide a self-contained proof based on a novel Moreau-Yosida regularization argument; in [18, Corollary 1 and 2], an alternative proof is given using an absolute continuity argument since $\mu_{n}$ is locally Lipschitz, hence a Sobolev function) that

[TABLE]

We thus deduce the inequality

[TABLE]

By definition of the Bregman divergence, we have

[TABLE]

By virtue of the duality inequality (A.1), we have

[TABLE]

Denote ${\boldsymbol{\omega}}={\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}-{\boldsymbol{\theta}}$ . By virtue of (H.3), Theorem 2.1 and (A.1), we obtain

[TABLE]

This inequality together with (3.4) (applied with $\overline{{\boldsymbol{\theta}}}={\boldsymbol{\theta}}$ ) and (3.1) yield

[TABLE]

where we applied Fenchel-Young inequality (1.5) to get the last bound. Taking the infimum over ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$ yields the desired result. ∎

Stratifiable functions

Theorem 3.1 has a nice instantiation when $\mathbb{R}^{p}$ can be partitioned into a collection of subsets $\{{\mathcal{M}}_{i}\}_{i}$ that form a stratification of $\mathbb{R}^{p}$ . That is, $\mathbb{R}^{p}$ is a finite disjoint union $\cup_{i}{\mathcal{M}}_{i}$ such that the partitioning sets ${\mathcal{M}}_{i}$ (called strata) fit nicely together and the stratification is endowed with a partial ordering for the closure operation. For example, it is known that a polyhedral function has a polyhedral stratification, and more generally, semialgebraic functions induce stratifications into finite disjoint unions of manifolds; see, e.g., [15]. Another example is that of partly smooth convex functions thoroughly studied in [63, 65, 62, 64] for various statistical and inverse problems. These functions induce a stratification into strata that are $C^{2}$ -smooth submanifolds of $\mathbb{R}^{p}$ . It turns out that all popular penalty functions discussed in this paper are partly smooth (see [62, 64]). Let us denote by $\mathscr{M}$ the set of strata associated to $J$ . With this notation at hand, the oracle inequality (3.2) now reads

[TABLE]

3.2 Oracle inequality for ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$

The next result establishes that ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ satisfies a sharp prediction oracle inequality that we will compare to (3.2).

Theorem 3.2.

Consider the penalized estimator ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ in (1.1), where $F$ and $J$ satisfy Assumptions (H.1) and (H.3). Then, for any $\tau>1$ such that $\lambda_{n}\geq\tau J^{\circ}{\left(-\boldsymbol{X}^{\top}\nabla F(\boldsymbol{X}{\boldsymbol{\theta}}_{0},\boldsymbol{y})\right)}/n$ , the following holds,

[TABLE]

Proof.

The proof follows the same lines as that of Theorem 3.1 except that we use the fact that ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ is a global minimizer of $V_{n}$ , i.e. $0\in\partial V_{n}({\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}})$ . Indeed, we have for any ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$

[TABLE]

Continuing exactly as just after (3.4), replacing ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ with ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ and invoking (3.7) instead of (3.4), we arrive at the claimed result. ∎

Remark 3.2.

1.

Observe that the penalized estimator ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ does not require the moment assumption (H.2) for (3.6) to hold. The convexity assumption on $\varphi$ in (H.1), which was important to apply Jensen’s inequality in the proof of (3.2), is not needed either to get (3.6).

2.

As we remarked for Theorem 3.1, Theorem 3.2 is also a deterministic statement for a fixed choice of $\lambda_{n}$ that holds for any minimizer ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , which is not unique in general. The condition on $\lambda_{n}$ is similar to the one in [40], where the authors established different guarantees for ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ .

One clearly sees that the difference between the prediction performance of ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ lies in the term $p\beta$ (or rather its lower-bound in Remark 3.1-5). Thus letting $\beta\to 0$ in (3.2), one recovers the oracle inequality (3.6) of penalized estimators. In particular, for $\beta=O{\left((pn)^{-1}\right)}$ , this is on the order $O(n^{-1})$ .

3.3 Oracle inequalities in probability

It remains to check when the event $\mathscr{E}=\{\lambda_{n}\geq\tau J^{\circ}{\left(-\boldsymbol{X}^{\top}\nabla F(\boldsymbol{X}{\boldsymbol{\theta}}_{0},\boldsymbol{y})\right)}/n\}$ holds with high probability when $\boldsymbol{y}$ is random. We will use concentration inequalities in order to provide bounds that hold with high probability over the data. Toward this goal, we will need the following assumption.

(H.4)

$\boldsymbol{y}=(\boldsymbol{y}_{1},\boldsymbol{y}_{2},\cdots,\boldsymbol{y}_{n})$ are independent and identically distributed observations, and $F({\boldsymbol{u}},\boldsymbol{y})=\sum_{i=1}^{n}f_{i}({\boldsymbol{u}}_{i},\boldsymbol{y}_{i})$ , $f_{i}:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ . Moreover,

(i)

$\mathbb{E}\left[\big{|}f_{i}((\boldsymbol{X}{\boldsymbol{\theta}}_{0})_{i},\boldsymbol{y}_{i})\big{|}\right]<+\infty$ , $\forall 1\leq i\leq n$ ;

(ii)

$\big{|}f_{i}^{\prime}((\boldsymbol{X}{\boldsymbol{\theta}}_{0})_{i},t)\big{|}\leq g(t)$ , where $\mathbb{E}\left[g(\boldsymbol{y}_{i})\right]<+\infty$ , $\forall 1\leq i\leq n$ ;

(iii)

Bernstein moment condition: $\forall 1\leq i\leq n$ and all integers $m\geq 2$ , $\mathbb{E}\left[\big{|}f_{i}^{\prime}((\boldsymbol{X}{\boldsymbol{\theta}}_{0})_{i},\boldsymbol{y}_{i})\big{|}^{m}\right]\leq m!\kappa^{m-2}\sigma_{i}^{2}/2$ for some constants $\kappa>0$ , $\sigma_{i}>0$ independent of $n$ .

Observe that under (H.4), and by virtue of Lemma A.4(iv) and [30, Proposition V.3.3.4], we have

[TABLE]

Thus, checking the event $\mathscr{E}$ amounts to establishing a deviation inequality for the supremum of an empirical process (as $\boldsymbol{X}({\mathcal{C}})$ is compact, it has a dense countable subset) above its mean under the weak Bernstein moment condition (H.4)(iii), which essentially requires that the $f_{i}^{\prime}((\boldsymbol{X}{\boldsymbol{\theta}}_{0})_{i},\boldsymbol{y}_{i})$ have sub-exponential tails. We will first tackle the case where ${\mathcal{C}}$ is the convex hull of a finite set (i.e. ${\mathcal{C}}$ is a polytope).

3.3.1 Polyhedral penalty

We here suppose that $J$ is a finite-valued gauge of ${\mathcal{C}}={\overline{\mathrm{conv}}\left({\mathcal{V}}\right)}$ , where ${\mathcal{V}}$ is finite, i.e. ${\mathcal{C}}$ is a polytope with vertices [49, Corollary 19.1.1]. Our first oracle inequality in probability is the following.

Proposition 3.1.

Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ and $J\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\gamma_{{\mathcal{C}}}$ satisfy Assumptions (H.1), (H.2), (H.3) and (H.4), and ${\mathcal{C}}$ is a polytope with vertices ${\mathcal{V}}$ . Suppose that $\operatorname*{rank}(\boldsymbol{X})=n$ and $\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\big{\|}\boldsymbol{X}{\boldsymbol{v}}\big{\|}_{\infty}\leq 1$ , and take

[TABLE]

for some $\tau>1$ and $\delta>1$ . Then (3.2) and (3.6) hold with probability at least $1-2|{\mathcal{V}}|^{1-\delta}$ .

Proof.

In view of Assumptions (H.1) and (H.4), one can differentiate under the expectation sign (Leibniz rule) to conclude that $\mathbb{E}\left[F(\boldsymbol{X}\cdot,\boldsymbol{y})\right]$ is $C^{1}$ at ${\boldsymbol{\theta}}_{0}$ and $\nabla\mathbb{E}\left[F(\boldsymbol{X}{\boldsymbol{\theta}}_{0},\boldsymbol{y})\right]=\boldsymbol{X}^{\top}\mathbb{E}\left[\nabla F(\boldsymbol{X}{\boldsymbol{\theta}}_{0},\boldsymbol{y})\right]$ . As ${\boldsymbol{\theta}}_{0}$ minimizes the population risk, one has $\nabla\mathbb{E}\left[F(\boldsymbol{X}{\boldsymbol{\theta}}_{0},\boldsymbol{y})\right]=0$ . Using the rank assumption on $\boldsymbol{X}$ , we deduce that

[TABLE]

Moreover, (3.8) specializes to

[TABLE]

Let $t=\lambda_{n}n/\tau$ . By the union bound and (3.8), we have

[TABLE]

The random variables ${\big{(}f_{i}^{\prime}((\boldsymbol{X}{\boldsymbol{\theta}}_{0})_{i},\boldsymbol{y}_{i})\boldsymbol{z}_{i}\big{)}}_{i}$ are zero-mean independent, and $\forall i$ and $m\geq 2$

[TABLE]

We are then in position to apply the Bernstein inequality to get

[TABLE]

where $\sigma^{2}=\max_{1\leq i\leq n}\sigma_{i}^{2}$ . Every $t$ such that

[TABLE]

satisfies $t^{2}\geq 2\delta\log(|{\mathcal{V}}|)(\kappa t+n\sigma^{2})$ . Applying the trivial inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ to the bound on $t$ , we conclude. ∎
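The last step of the proof reduces to a quadratic inequality in $t$ . The sketch below is not copied from the display above: it solves $t^{2}\geq 2\delta\log(|{\mathcal{V}}|)(\kappa t+n\sigma^{2})$ directly for its positive root, and then forms the simpler sufficient threshold obtained from $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ . The function name and the numerical values are illustrative assumptions.

```python
import math

def bernstein_thresholds(log_card, n, kappa, sigma, delta):
    """Thresholds on t for t^2 >= 2*delta*log|V|*(kappa*t + n*sigma^2).

    Returns (root, simple): the exact positive root of the quadratic, and the
    looser bound obtained via sqrt(a + b) <= sqrt(a) + sqrt(b).
    """
    L = delta * log_card
    root = L * kappa + math.sqrt((L * kappa) ** 2 + 2 * L * n * sigma**2)
    simple = 2 * L * kappa + sigma * math.sqrt(2 * L * n)
    return root, simple

log_card, n, kappa, sigma, delta = math.log(200), 100, 1.0, 0.5, 2.0
root, simple = bernstein_thresholds(log_card, n, kappa, sigma, delta)

def rhs(t):
    """Right-hand side 2*delta*log|V|*(kappa*t + n*sigma^2)."""
    return 2 * delta * log_card * (kappa * t + n * sigma**2)
```

Any $t$ at or above `simple` also lies above `root`, hence satisfies the quadratic inequality.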

Remark 3.3.

In the monograph [7, Lemma 14.12], the authors derived an exponential deviation inequality for the supremum of an empirical process with finite ${\mathcal{V}}$ and possibly unbounded empirical processes under a Bernstein moment condition similar to ours (in fact ours implies theirs). The very last part of our proof can be obtained by applying their result. We detailed it here for the sake of completeness.

Lasso

To lighten the notation, let $I_{{\boldsymbol{\theta}}}=\mathrm{supp}({\boldsymbol{\theta}})$ . From (2.3), it is easy to see that

[TABLE]

where the last bound holds as an equality whenever ${\boldsymbol{\theta}}\neq 0$ . Further, the $\ell_{1}$ norm is the gauge of the cross-polytope (i.e. the unit $\ell_{1}$ ball). Its vertex set ${\mathcal{V}}$ is the set of unit-norm one-sparse vectors $(\pm\boldsymbol{a}_{i})_{1\leq i\leq p}$ , where we recall that $(\boldsymbol{a}_{i})_{1\leq i\leq p}$ is the canonical basis. Thus

[TABLE]

Inserting this into Proposition 3.1, we obtain the following corollary.

Corollary 3.1.

Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $J$ is the Lasso penalty and $F$ satisfies Assumptions (H.1), (H.2) and (H.4). Suppose that $\operatorname*{rank}(\boldsymbol{X})=n$ and $\max_{i}\left\|\boldsymbol{X}_{i}\right\|_{\infty}\leq 1$ , and take

[TABLE]

for some $\tau>1$ and $\delta>1$ . Then, with probability at least $1-2(2p)^{1-\delta}$ , the following holds

[TABLE]

and

[TABLE]

For ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , we recover a similar scaling for $\lambda_{n}$ and the oracle inequality as in [66], though in the latter the oracle inequality is not sharp unlike ours. Note that the above oracle inequality extends readily to the case of analysis/fused Lasso $\big{\|}\boldsymbol{D}^{\top}\cdot\big{\|}_{1}$ where $\boldsymbol{D}$ is surjective. We leave the details to the interested reader (see also the analysis group Lasso example in Section 4).
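The two quantities that enter Proposition 3.1 for the Lasso can be checked on a small example: the cross-polytope has $2p$ vertices, and the maximum of $\|\boldsymbol{X}{\boldsymbol{v}}\|_{\infty}$ over these vertices is the largest entry of $\boldsymbol{X}$ in absolute value. The sketch below uses a random design and an arbitrary $\delta$ , both of which are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 4, 6, 2.0
X = rng.uniform(-1.0, 1.0, size=(n, p))     # entries bounded by 1 in absolute value

# Vertex set of the unit l1 ball: +/- canonical basis vectors (one per row).
V = np.vstack([np.eye(p), -np.eye(p)])

# max over vertices of ||X v||_inf, computed by enumeration.
max_over_vertices = max(np.abs(X @ v).max() for v in V)

# Failure probability 2 |V|^(1 - delta) = 2 (2p)^(1 - delta).
fail_prob = 2 * len(V) ** (1 - delta)
```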

Anti-sparsity

From Section 2.5.4, recall the saturation support $I^{\mathrm{sat}}_{{\boldsymbol{\theta}}}$ of ${\boldsymbol{\theta}}$ . From (2.10), we get

[TABLE]

with equality whenever ${\boldsymbol{\theta}}\neq 0$ . In addition, the $\ell_{\infty}$ norm is the gauge of the hypercube whose vertex set is ${\mathcal{V}}=\{\pm 1\}^{p}$ . Thus

[TABLE]

We have the following oracle inequalities.

Corollary 3.2.

Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $J$ is the anti-sparsity penalty (2.9), and $F$ satisfies Assumptions (H.1), (H.2) and (H.4). Suppose that $\operatorname*{rank}(\boldsymbol{X})=n$ and $\max_{i,j}|\boldsymbol{X}_{i,j}|\leq 1/p$ , and take

[TABLE]

for some $\tau>1$ and $\delta>1$ . Then, with probability at least $1-2^{-p(\delta-1)+1}$ , the following holds

[TABLE]

and

[TABLE]

We are not aware of any result of this kind in the literature. The bound imposed on $\boldsymbol{X}$ is similar to what is generally assumed in the vector quantization literature [38, 53].
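For the hypercube, the normalization $\max_{i,j}|\boldsymbol{X}_{i,j}|\leq 1/p$ guarantees $\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\|\boldsymbol{X}{\boldsymbol{v}}\|_{\infty}\leq 1$ , because the maximum over vertices equals the largest row $\ell_{1}$ norm of $\boldsymbol{X}$ . The sketch below verifies this by brute-force enumeration for a small $p$ and checks that $2|{\mathcal{V}}|^{1-\delta}$ matches the probability stated in Corollary 3.2. The design matrix and $\delta$ are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, delta = 3, 4, 2.0
X = rng.uniform(-1.0, 1.0, size=(n, p)) / p    # enforces max |X_ij| <= 1/p

# Vertex set of the hypercube: all sign patterns in {-1, +1}^p (one per row).
vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))

# Entry (k, i) of vertices @ X.T is (X v_k)_i, so this is max_v ||X v||_inf.
max_over_vertices = np.abs(vertices @ X.T).max()
max_row_l1 = np.abs(X).sum(axis=1).max()

# Failure probability 2 |V|^(1 - delta) with |V| = 2^p.
fail_prob = 2 * len(vertices) ** (1 - delta)
```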

3.3.2 General penalty

Extending the above reasoning to a general penalty requires a deviation inequality for the supremum of an empirical process in (3.8) under the Bernstein moment condition (H.4)(iii), but without the need of uniform boundedness. This can be achieved via generic chaining along a tree using entropy with bracketing; see [69, Theorem 8]. The resulting deviation bound will thus depend on the entropies with bracketing. These quantities capture the complexity of the set $\boldsymbol{X}({\mathcal{C}})$ but are intricate to compute in general. This subject deserves further investigation that we leave to a future work.

Remark 3.4 (Group Lasso).

Using the union bound, we have

[TABLE]

This requires a concentration inequality for quadratic forms of independent random variables satisfying the Bernstein moment assumption above. We are not aware of any such result. But if our moment condition is strengthened to

[TABLE]

then one can use [4, Theorem 3]. Indeed, assume the normalization $\max_{i}{\left|\kern-1.50696pt\left|\kern-1.50696pt\left|\boldsymbol{X}_{b_{i}}^{\top}\boldsymbol{X}_{b_{i}}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{2\to 2}\leq n$ , which entails

[TABLE]

It then follows that taking

[TABLE]

the oracle inequalities (4.5) and (4.6) hold for the group Lasso with probability at least $1-L^{1-\delta}$ . A similar result can be proved for the analysis group Lasso just as well with a proper normalization assumption on $\boldsymbol{X}$ (see Section 4.3.3).

4 Oracle inequalities for low-complexity linear regression

In this section, we consider the classical linear regression problem where the $n$ response-covariate pairs $(\boldsymbol{y}_{i},\boldsymbol{X}_{i})$ are linked as

[TABLE]

where $\boldsymbol{\xi}$ is a noise vector. The data loss will be set to $F({\boldsymbol{u}},\boldsymbol{y})=\tfrac{1}{2}\big{\|}\boldsymbol{y}-{\boldsymbol{u}}\big{\|}_{2}^{2}$ . This in turn entails that $\varphi=\varphi^{+}=\tfrac{1}{2}{\left(\cdot\right)}^{2}$ on $\mathbb{R}_{+}$ and $R_{n}{\big{(}{\boldsymbol{\theta}},{\boldsymbol{\theta}}_{0}\big{)}}=\tfrac{1}{2n}\big{\|}\boldsymbol{X}{\boldsymbol{\theta}}-\boldsymbol{X}{\boldsymbol{\theta}}_{0}\big{\|}_{2}^{2}$ .
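The expression of $R_{n}$ for the quadratic loss can be checked directly, assuming (as the proof of Theorem 3.1 suggests) that $R_{n}({\boldsymbol{\theta}},{\boldsymbol{\theta}}_{0})$ is the Bregman divergence of $F(\cdot,\boldsymbol{y})$ between $\boldsymbol{X}{\boldsymbol{\theta}}$ and $\boldsymbol{X}{\boldsymbol{\theta}}_{0}$ , divided by $n$ . The helper names and the random data are illustrative.

```python
import numpy as np

def F(u, y):
    """Quadratic data loss F(u, y) = 0.5 * ||y - u||_2^2."""
    return 0.5 * np.sum((y - u) ** 2)

def grad_F(u, y):
    """Gradient of F with respect to u."""
    return u - y

def bregman(u, u0, y):
    """Bregman divergence of F(., y) between u and u0."""
    return F(u, y) - F(u0, y) - grad_F(u0, y) @ (u - u0)

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.standard_normal((n, p))
theta, theta0 = rng.standard_normal(p), rng.standard_normal(p)
y = X @ theta0 + 0.1 * rng.standard_normal(n)

R_n = bregman(X @ theta, X @ theta0, y) / n                  # assumed definition
quadratic = np.sum((X @ theta - X @ theta0) ** 2) / (2 * n)  # stated closed form
```

The two values coincide regardless of the noise in $\boldsymbol{y}$ , since the linear terms cancel in the Bregman divergence.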

In this section, we assume that the noise $\boldsymbol{\xi}$ is a zero-mean sub-Gaussian vector in $\mathbb{R}^{n}$ with parameter $\sigma$ . That is, its one-dimensional marginals $\langle\boldsymbol{\xi},\boldsymbol{z}\rangle$ are sub-Gaussian random variables $\forall\boldsymbol{z}\in\mathbb{R}^{n}$ , i.e. they satisfy

[TABLE]

In this case, the bounds of Section 3.3 can be improved.

4.1 General penalty

As we will shortly show, the event $\mathscr{E}$ will depend on the Gaussian width, a summary geometric quantity which, informally speaking, measures the size of the bulk of a set in $\mathbb{R}^{n}$ .

Definition 4.1.

The Gaussian width of a subset ${\mathcal{S}}\subset\mathbb{R}^{n}$ is defined as

[TABLE]

The concept of Gaussian width has appeared in the literature in different contexts. In particular, it has been used to establish sample complexity bounds to ensure exact recovery (noiseless case) and mean-square estimation stability (noisy case) for low-complexity penalized estimators from Gaussian measurements; see e.g. [51, 12, 59, 70, 64].

The Gaussian width has deep connections to convex geometry and it enjoys many useful properties. It is well-known that it is positively homogeneous, monotonic w.r.t. inclusion, and invariant under orthogonal transformations. Moreover, $w({\overline{\mathrm{conv}}\left({\mathcal{S}}\right)})=w({\mathcal{S}})$ . From Lemma A.2(ii)-(iii), $w({\mathcal{S}})$ is a non-negative finite quantity whenever the set ${\mathcal{S}}$ is bounded and contains the origin.
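Some of these properties are easy to observe by simulation. The sketch below assumes the standard definition $w({\mathcal{S}})=\mathbb{E}[\sup_{\boldsymbol{z}\in{\mathcal{S}}}\langle\boldsymbol{g},\boldsymbol{z}\rangle]$ with $\boldsymbol{g}\sim{\mathcal{N}}(0,\mathrm{\bf Id}_{n})$ and estimates it by Monte Carlo for a finite set (the vertices of the unit $\ell_{1}$ ball). The sample size, seed, and function name are illustrative.

```python
import numpy as np

def gaussian_width(points, n_samples=20000, seed=0):
    """Monte Carlo estimate of E sup_{z in S} <g, z>, S = rows of `points`."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_samples, points.shape[1]))
    return (G @ points.T).max(axis=1).mean()

n = 10
S = np.vstack([np.eye(n), -np.eye(n)])     # vertices of the unit l1 ball

w_S = gaussian_width(S)                    # estimates E ||g||_inf
w_scaled = gaussian_width(3.0 * S)         # positive homogeneity: 3 * w(S)

# Adding a convex combination of two points of S does not change the supremum.
midpoint = 0.5 * (S[0] + S[1])
w_hull = gaussian_width(np.vstack([S, midpoint]))

upper = np.sqrt(2 * np.log(2 * n))         # classical bound for 2n Gaussian maxima
```

Using the same seed for each call makes the homogeneity and convex-hull comparisons hold sample by sample, not just in expectation.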

We are now ready to state our oracle inequality in probability with sub-Gaussian noise.

Proposition 4.1.

Let the data be generated by (4.1) where $\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ . Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ and $J\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\gamma_{{\mathcal{C}}}$ satisfy Assumptions (H.1)-(H.2) and (H.3). Suppose that $\lambda_{n}\geq\frac{\tau\sigma c_{1}\sqrt{2\log(c_{2}/\delta)}w{\left(\boldsymbol{X}({\mathcal{C}})\right)}}{n}$ , for some $\tau>1$ and $0<\delta<\min(c_{2},1)$ , where $c_{1}$ and $c_{2}$ are positive absolute constants. Then with probability at least $1-\delta$ , (3.2) and (3.6) hold with the remainder term given by (3.3) with $\nu=1$ .

The proof requires sophisticated ideas from the theory of generic chaining [56], but we only apply these results. The constants $c_{1}$ and $c_{2}$ can be traced back to the proof of these results as detailed in [56].

Proof.

First, from (4.2), we have the bound

[TABLE]

i.e. the increment condition [56, (0.4)] is verified. Thus combining (3.8) with the probability bound in [56, page 11], the generic chaining theorem [56, Theorem 1.2.6] and the majorizing measure theorem [56, Theorem 2.1.1], we have

[TABLE]

∎

If the noise is Gaussian, an enhanced version can be proved by invoking Gaussian concentration of Lipschitz functions [36].

Proposition 4.2.

Let the data be generated by (4.1) with noise $\boldsymbol{\xi}\sim{\mathcal{N}}(0,\sigma^{2}\mathrm{\bf Id}_{n})$ . Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ and $J\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\gamma_{{\mathcal{C}}}$ satisfy Assumptions (H.1)-(H.2) and (H.3). Suppose that $\lambda_{n}\geq\frac{(1+\delta)\tau\sigma w{\left(\boldsymbol{X}({\mathcal{C}})\right)}}{n}$ , for some $\tau>1$ and $\delta>0$ . Then with probability at least $1-\exp{\left(-\frac{\delta^{2}w{\left(\boldsymbol{X}({\mathcal{C}})\right)}^{2}}{2{\left|\kern-1.05487pt\left|\kern-1.05487pt\left|\boldsymbol{X}\right|\kern-1.05487pt\right|\kern-1.05487pt\right|}_{J\to 2}^{2}}\right)}$ , (3.2) and (3.6) hold with the remainder term given by (3.3) with $\nu=1$ .

Proof.

Thanks to sublinearity (see Lemma A.3(i) and Lemma A.4), the function $\boldsymbol{\xi}\mapsto J^{\circ}(\boldsymbol{X}^{\top}\boldsymbol{\xi})$ is Lipschitz continuous with Lipschitz constant ${\big{|}\kern-1.50696pt\big{|}\kern-1.50696pt\big{|}\boldsymbol{X}^{\top}\big{|}\kern-1.50696pt\big{|}\kern-1.50696pt\big{|}}_{2\to J^{\circ}}={\left|\kern-1.50696pt\left|\kern-1.50696pt\left|\boldsymbol{X}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{J\to 2}$ . From (3.8), we also have

[TABLE]

Observe that $\boldsymbol{X}({\mathcal{C}})$ is a convex compact set containing the origin. Setting $\epsilon=\lambda_{n}n/\tau-\sigma w{\left(\boldsymbol{X}({\mathcal{C}})\right)}\geq\delta\sigma w{\left(\boldsymbol{X}({\mathcal{C}})\right)}$ , it follows from (3.8) and the Gaussian concentration of Lipschitz functions [36] that

[TABLE]

∎

Estimating theoretically the Gaussian width of a set (not to mention its image under a linear operator, as for $\boldsymbol{X}({\mathcal{C}})$ ) is a non-trivial problem that has been extensively studied in the areas of probability in Banach spaces and stochastic processes. There are classical bounds on the Gaussian width (Sudakov’s and Dudley’s inequalities), but they are difficult to estimate in most cases and neither of these bounds is tight for all sets. When the set is a convex cone (intersected with a sphere), tractable estimates based on polarity arguments were proposed in, e.g., [12].

4.2 Polyhedral penalty

When ${\mathcal{C}}$ is a polytope, enhanced oracle inequalities can be obtained by invoking a simple union bound argument.

Proposition 4.3.

Let the data be generated by (4.1) where $\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ . Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ and $J\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\gamma_{{\mathcal{C}}}$ satisfy Assumptions (H.1)-(H.2) and (H.3), and moreover ${\mathcal{C}}$ is a polytope with vertices ${\mathcal{V}}$ . Suppose that $\lambda_{n}\geq\frac{\tau\sigma{\big{(}\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\big{)}}\sqrt{2\delta\log(|{\mathcal{V}}|)}}{n}$ , for some $\tau>1$ and $\delta>1$ . Then with probability at least $1-2|{\mathcal{V}}|^{1-\delta}$ , (3.2) and (3.6) hold with the remainder term given by (3.3) with $\nu=1$ .

In particular, if $\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\leq\sqrt{n}$ , then one can take $\lambda_{n}\geq\tau\sigma\sqrt{\frac{2\delta\log(|{\mathcal{V}}|)}{n}}$ .

Proof.

From (3.8) we have

[TABLE]

where in the last inequality, we used the fact that a convex function attains its maximum on ${\mathcal{C}}$ at an extreme point, i.e. at a vertex in ${\mathcal{V}}$ . Let $\epsilon=\sigma{\big{(}\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\big{)}}\sqrt{2\delta\log(|{\mathcal{V}}|)}$ . By the union bound, (4.2) and (3.8), we have

[TABLE]

∎
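As an illustration of how the threshold of Proposition 4.3 can be evaluated for a given design and vertex set, here is a minimal sketch. The helper name `polytope_lambda`, the default values of $\tau$ and $\delta$, and the random design are assumptions made for the example only. It is instantiated on the unit $\ell_{1}$ ball, whose vertices are $\pm e_{j}$.

```python
import numpy as np

def polytope_lambda(X, vertices, sigma, tau=1.5, delta=2.0):
    """Smallest lambda_n allowed by Proposition 4.3, with the stated probability.

    vertices: array of shape (|V|, p), one vertex of the polytope C per row.
    (Hypothetical helper written for illustration.)
    """
    n = X.shape[0]
    m = len(vertices)
    max_norm = np.linalg.norm(vertices @ X.T, axis=1).max()   # max_v ||X v||_2
    lam = tau * sigma * max_norm * np.sqrt(2 * delta * np.log(m)) / n
    prob = 1 - 2 * m ** (1 - delta)                           # 1 - 2 |V|^(1 - delta)
    return lam, prob

rng = np.random.default_rng(0)
n, p, sigma = 100, 20, 1.0
X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)    # columns of norm sqrt(n)
V = np.vstack([np.eye(p), -np.eye(p)])         # the 2p vertices of the l1 ball
lam, prob = polytope_lambda(X, V, sigma)
```

Under the normalization $\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\leq\sqrt{n}$, the returned value coincides with $\tau\sigma\sqrt{2\delta\log(|{\mathcal{V}}|)/n}$.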

4.3 Applications

In this section, we exemplify our oracle inequalities for the penalties described in Section 2.5.

4.3.1 Lasso

Recall the derivations for the Lasso in Section 3.3.1. We obtain the following corollary of Proposition 4.3.

Corollary 4.1.

Let the data be generated by (4.1) where $\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ . Assume that $\boldsymbol{X}$ is such that $\max_{i}\left\|\boldsymbol{X}_{i}\right\|_{2}\leq\sqrt{n}$ . Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $J$ is the Lasso penalty (2.2) and $F$ satisfies Assumptions (H.1)-(H.2). Suppose that $\lambda_{n}\geq\tau\sigma\sqrt{\frac{2\delta\log(2p)}{n}}$ , for some $\tau>1$ and $\delta>1$ . Then, with probability at least $1-2(2p)^{1-\delta}$ , the following holds

[TABLE]

and

[TABLE]

The remainder term grows as $\tfrac{|I|\log(p)}{n}$ . The oracle inequality (4.3) recovers [18, Theorem 1] in the exactly sparse case, and (4.4) the one in [55, Theorem 4] (see also [34, Theorem 11] and [19, Theorem 2]). It is worth mentioning, however, that [18, Theorem 1] handles the inexactly sparse case while we do not.
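The probability statement of Corollary 4.1 can be checked empirically. The sketch below assumes Gaussian noise (a special case of sub-Gaussian noise) and reads the controlled event as $\left\|\boldsymbol{X}^{\top}\boldsymbol{\xi}\right\|_{\infty}\leq\lambda_{n}n/\tau$, which is how the proofs above use (3.8); the dimensions, seed and number of trials are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, tau, delta = 200, 50, 1.0, 1.5, 2.0

X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)            # max_i ||X_i||_2 = sqrt(n)

lam = tau * sigma * np.sqrt(2 * delta * np.log(2 * p) / n)   # threshold of Corollary 4.1
n_trials = 2000
xi = sigma * rng.standard_normal((n_trials, n))        # one Gaussian noise draw per row
dual_norm = np.abs(xi @ X).max(axis=1)                 # ||X^T xi||_inf for each draw
freq = np.mean(dual_norm <= lam * n / tau)             # empirical frequency of the event
bound = 1 - 2 * (2 * p) ** (1 - delta)                 # probability guaranteed by the corollary
```

In this setting the empirical frequency `freq` should be no smaller than `bound`; the union bound is typically conservative.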

4.3.2 Group Lasso

Recall the notations in Section 2.5.2, and denote $I_{{\boldsymbol{\theta}}}=\mathrm{supp}_{{\mathcal{B}}}({\boldsymbol{\theta}})$ the set indexing active blocks in ${\boldsymbol{\theta}}$ . From (2.5), we have

[TABLE]

where the last bound holds as an equality whenever ${\boldsymbol{\theta}}\neq 0$ .

We have the following oracle inequalities as corollaries of Proposition 4.1 and Proposition 4.2.

Corollary 4.2.

Let the data be generated by (4.1). Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ satisfies Assumptions (H.1)-(H.2), and $J$ is the group Lasso (2.4) with $L$ non-overlapping blocks of equal size $K$ . Assume that $\boldsymbol{X}$ is such that $\max_{i}{\left|\kern-1.50696pt\left|\kern-1.50696pt\left|\boldsymbol{X}_{b_{i}}^{\top}\boldsymbol{X}_{b_{i}}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{2\to 2}\leq n$ .

(i)

$\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ : suppose that $\lambda_{n}\geq 3\tau\sigma c_{1}\frac{\sqrt{2\log(c_{2}/\delta)}{\left(\sqrt{K}+\sqrt{2\log(L)}\right)}}{\sqrt{n}}$ , for some $\tau>1$ and $0<\delta<\min(c_{2},1)$ , where $c_{1}$ and $c_{2}$ are the positive absolute constants in Proposition 4.1. Then, with probability at least $1-\delta$ , the following holds

[TABLE]

and

[TABLE]

(ii)

$\boldsymbol{\xi}\sim{\mathcal{N}}(0,\sigma^{2}\mathrm{\bf Id}_{n})$ : suppose that $\lambda_{n}\geq\tau\sigma\frac{\sqrt{K}+\sqrt{2\delta\log(L)}}{\sqrt{n}}$ , for some $\tau>1$ and $\delta>1$ . Then, with probability at least $1-L^{1-\delta}$ , (4.5) and (4.6) hold.

The first remainder term is on the order $\frac{|I|{\left(\sqrt{K}+\sqrt{2\log(L)}\right)}^{2}}{n}$ . This is similar to the scaling that has been provided in the literature for EWA with other group sparsity priors and noises [48, 26]. Similar rates were given for ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ with the group Lasso in [40, 37, 67].

Proof.

(i)

This is a consequence of Proposition 4.1, for which we need to bound

[TABLE]

We first have, for any block $b_{i}$

[TABLE]

Furthermore, $\big{\|}\boldsymbol{X}_{b_{i}}^{\top}\cdot\big{\|}_{2}$ is Lipschitz continuous with Lipschitz constant ${\left|\kern-1.50696pt\left|\kern-1.50696pt\left|\boldsymbol{X}_{b_{i}}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{2\to 2}\leq\sqrt{n}$ . Thus the union bound and Gaussian concentration of Lipschitz functions [36] yield, for any $t>0$ ,

[TABLE]

Let $\kappa=\sqrt{Kn}+\sqrt{2n\log(L)}$ . $w(\boldsymbol{X}({\mathcal{C}}))$ can be expressed as

[TABLE]

(ii)

The proof follows the lines of Proposition 4.2 where we additionally use the union bound. Indeed,

[TABLE]

where we used the Gaussian concentration of Lipschitz functions [36] in the last inequality.

∎

We observe in passing that another way to prove the oracle inequalities in the sub-Gaussian case is to use Dudley’s inequality on the sphere in $\mathbb{R}^{K}$ after applying a union bound on the $L$ blocks. In addition, in the Gaussian case, the (similar) bound $\lambda_{n}\geq 3\delta\tau\sigma\frac{\sqrt{K}+\sqrt{2\log(L)}}{\sqrt{n}}$ can be obtained by combining Proposition 4.2 and the estimate $w(\boldsymbol{X}({\mathcal{C}}))\leq 3(\sqrt{Kn}+\sqrt{2n\log(L)})$ in the proof of (i). The corresponding probability of success would be at least $1-L^{-9(\delta-1)^{2}}$ .
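The quantity $\kappa=\sqrt{Kn}+\sqrt{2n\log(L)}$ appearing in the proof can be compared numerically with the Gaussian width it controls. The sketch below is only an illustration under assumptions that are not in the paper: contiguous blocks, and a design $\boldsymbol{X}=\sqrt{n}\,\boldsymbol{Q}$ with $\boldsymbol{Q}$ having orthonormal columns, so that every block meets the normalization of Corollary 4.2 with equality. For the group Lasso ball, the width is $\mathbb{E}\big[\max_{i}\|\boldsymbol{X}_{b_{i}}^{\top}\boldsymbol{g}\|_{2}\big]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, K = 400, 40, 5                       # L contiguous blocks of size K, p = L*K <= n
p = L * K
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))    # reduced QR: orthonormal columns
X = np.sqrt(n) * Q                         # each block has |||X_b^T X_b|||_{2->2} = n

g = rng.standard_normal((2000, n))         # draws g ~ N(0, Id_n), one per row
corr = (g @ X).reshape(-1, L, K)           # X^T g split into its L blocks
block_norms = np.linalg.norm(corr, axis=2) # ||X_{b_i}^T g||_2 for each block and draw
w_hat = block_norms.max(axis=1).mean()     # Monte Carlo estimate of w(X(C))
kappa = np.sqrt(K * n) + np.sqrt(2 * n * np.log(L))
```

In this Gaussian simulation `w_hat` falls below `kappa`, hence also below the estimate $3\kappa$ used in the proof of (i).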

4.3.3 Analysis group Lasso

We now turn to the prior penalty (2.6). Recall the notations in Section 2.5.3, and recall that $\Lambda_{{\boldsymbol{\theta}}}=\bigcup_{i\in\mathrm{supp}_{{\mathcal{B}}}({\boldsymbol{D}}^{\top}{\boldsymbol{\theta}})}b_{i}$ . We assume that ${\boldsymbol{D}}$ is a frame of $\mathbb{R}^{p}$ , hence surjective, meaning that there exist $c,d>0$ such that for any ${\boldsymbol{\omega}}\in\mathbb{R}^{p}$

[TABLE]

This together with (2.7)-(2.8) and Cauchy-Schwarz inequality entail

[TABLE]

Note, however, that from (2.7), we do not have in general $\left\|{\boldsymbol{D}}^{+}\operatorname{P}_{\operatorname*{Ker}({\boldsymbol{D}}^{\top}_{\Lambda^{c}_{{\boldsymbol{\theta}}}})}{\boldsymbol{D}}e_{{\boldsymbol{D}}^{\top}{\boldsymbol{\theta}}}^{\left\|\right\|_{1,2}}\right\|_{\infty,2}\leq 1$ .

With exactly the same arguments as those used to prove Corollary 4.2, replacing $\boldsymbol{X}$ by $\boldsymbol{X}{\boldsymbol{D}}$ , we arrive at the following oracle inequalities.

Corollary 4.3.

Let the data be generated by (4.1). Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ satisfies Assumptions (H.1)-(H.2), and $J$ is the analysis group Lasso (2.6) with $L$ blocks of equal size $K$ . Assume that ${\boldsymbol{D}}$ is a frame, and $\boldsymbol{X}$ is such that $\max_{i}{\left|\kern-1.50696pt\left|\kern-1.50696pt\left|{\boldsymbol{D}}_{b_{i}}^{\top}\boldsymbol{X}^{\top}\boldsymbol{X}{\boldsymbol{D}}_{b_{i}}\right|\kern-1.50696pt\right|\kern-1.50696pt\right|}_{2\to 2}\leq n$ .

(i)

$\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ : suppose that $\lambda_{n}\geq 3\tau\sigma c_{1}\frac{\sqrt{2\log(c_{2}/\delta)}{\left(\sqrt{K}+\sqrt{2\log(L)}\right)}}{\sqrt{n}}$ , for some $\tau>1$ and $0<\delta<\min(c_{2},1)$ , where $c_{1}$ and $c_{2}$ are the positive absolute constants in Proposition 4.1. Then, with probability at least $1-\delta$ , the following holds

[TABLE]

and

[TABLE]

(ii)

$\boldsymbol{\xi}\sim{\mathcal{N}}(0,\sigma^{2}\mathrm{\bf Id}_{n})$ : suppose that $\lambda_{n}\geq\tau\sigma\frac{\sqrt{K}+\sqrt{2\delta\log(L)}}{\sqrt{n}}$ , for some $\tau>1$ and $\delta>1$ . Then, with probability at least $1-L^{1-\delta}$ , the two oracle inequalities of (i) hold.

To the best of our knowledge, this result is new to the literature. The scaling of the remainder term is the same as in [26] and [48] with analysis sparsity priors different from ours (the authors in the latter also assume that ${\boldsymbol{D}}$ is invertible).

4.3.4 Anti-sparsity

Recall the derivations for the $\ell_{\infty}$ norm example in Section 3.3.1. We have the following oracle inequalities from Proposition 4.3.

Corollary 4.4.

Let the data be generated by (4.1) where $\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ . Assume that $\boldsymbol{X}$ is such that $\max_{i,j}|\boldsymbol{X}_{i,j}|\leq 1/p$ . Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ satisfies Assumptions (H.1)-(H.2), and $J$ is the anti-sparsity penalty (2.9). Suppose that $\lambda_{n}\geq\tau\sigma\sqrt{2\delta\log(2)}\sqrt{\frac{p}{n}}$ , for some $\tau>1$ and $\delta>1$ . Then, with probability at least $1-2^{-p(\delta-1)+1}$ , the following holds

[TABLE]

and

[TABLE]

The first remainder term scales as $\tfrac{p}{n}$ which reflects that anti-sparsity regularization requires an overdetermined regime to ensure good stability performance. This is in agreement with [63, Theorem 7]. This phenomenon was also observed by [24] who studied sample complexity thresholds for noiseless recovery from random projections of the hypercube.
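The constants in Corollary 4.4 follow from Proposition 4.3 applied to the hypercube, which has $|{\mathcal{V}}|=2^{p}$ vertices. The following sketch verifies this bookkeeping by brute-force enumeration for a small $p$; the design, dimensions and parameter values are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, tau, delta = 30, 8, 1.0, 1.5, 2.0
X = rng.uniform(-1, 1, size=(n, p)) / p              # entries bounded by 1/p in magnitude

# The 2^p vertices of the unit l_inf ball (only feasible for small p)
V = np.array(list(itertools.product([-1.0, 1.0], repeat=p)))
max_norm = np.linalg.norm(V @ X.T, axis=1).max()     # max_v ||X v||_2

lam = tau * sigma * np.sqrt(2 * delta * np.log(2)) * np.sqrt(p / n)   # Corollary 4.4
prob = 1 - 2.0 ** (-p * (delta - 1) + 1)

# Same quantities obtained from the generic polytope formulas with |V| = 2^p
lam_generic = tau * sigma * np.sqrt(2 * delta * np.log(len(V)) / n)
prob_generic = 1 - 2 * len(V) ** (1 - delta)
```

The entrywise bound $1/p$ guarantees $\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\leq\sqrt{n}$, and $\log(2^{p})=p\log(2)$ explains the $\sqrt{p/n}$ scaling of $\lambda_{n}$.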

4.3.5 Nuclear norm

We now turn to the nuclear norm case. Recall the notations of Section 2.5.5. For matrices ${\boldsymbol{\theta}}\in\mathbb{R}^{p_{1}\times p_{2}}$ , a measurement map $\boldsymbol{X}$ takes the form of a linear operator whose $i$ th component is given by the Frobenius scalar product

[TABLE]

where $\boldsymbol{X}^{i}$ is a matrix in $\mathbb{R}^{p_{1}\times p_{2}}$ . We denote $\left\|\cdot\right\|_{\mathrm{F}}$ the associated norm. From (2.12), it is immediate to see that whenever ${\boldsymbol{\theta}}\neq 0$ ,

[TABLE]

Moreover, from (2.12), we have

[TABLE]

To apply Proposition 4.1 and Proposition 4.2, we need to bound $w(\boldsymbol{X}({\mathcal{C}}))$ ( ${\mathcal{C}}$ is the nuclear ball), or equivalently, to bound

[TABLE]

which is the expectation of the operator norm of a random series with matrix coefficients. Thus using [60, Theorem 4.1.1(4.1.5)] to get this bound, and inserting it into Proposition 4.1 and Proposition 4.2, we get the following oracle inequalities for the nuclear norm. Define

[TABLE]

Corollary 4.5.

Let the data be generated by (4.1) with a linear operator $\boldsymbol{X}:\mathbb{R}^{p_{1}\times p_{2}}\to\mathbb{R}^{n}$ . Assume that $v(\boldsymbol{X})\leq n$ . Consider the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ , where $F$ satisfies Assumptions (H.1)-(H.2), and $J$ is the nuclear norm (2.11).

(i)

$\boldsymbol{\xi}$ is a zero-mean sub-Gaussian random vector with parameter $\sigma$ : suppose that $\lambda_{n}\geq 2\tau\sigma c_{1}\sqrt{\frac{\log(c_{2}/\delta)\log(p_{1}+p_{2})}{n}}$ , for some $\tau>1$ and $0<\delta<\min(c_{2},1)$ , where $c_{1}$ and $c_{2}$ are the positive absolute constants in Proposition 4.1. Then, with probability at least $1-\delta$ , the following holds

[TABLE]

and

[TABLE]

(ii)

$\boldsymbol{\xi}\sim{\mathcal{N}}(0,\sigma^{2}\mathrm{\bf Id}_{n})$ : suppose that $\lambda_{n}\geq(1+\delta)\tau\sigma\sqrt{\frac{2\log(p_{1}+p_{2})}{n}}$ , for some $\tau>1$ and $\delta>0$ . Then, with probability at least $1-(p_{1}+p_{2})^{-\delta^{2}}$ , (4.11) and (4.12) hold.

The set over which the infimum is taken just reminds us that the nuclear norm is partly smooth (see above) relative to the constant rank manifold (which is a Riemannian submanifold of $\mathbb{R}^{p_{1}\times p_{2}}$ ) [22, Theorem 3.19]. The first remainder term now scales as $\frac{r\log(p_{1}+p_{2})}{n}$ . In the iid Gaussian case, we recover the same rate as in [18, Theorem 3] for ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and in [34, Theorem 2] for ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ .
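The bound on the expected operator norm of the Gaussian matrix series can be illustrated numerically. The sketch below assumes that $v(\boldsymbol{X})$ is the matrix variance statistic of [60, Theorem 4.1.1], namely the larger of the spectral norms of $\sum_{i}\boldsymbol{X}^{i}(\boldsymbol{X}^{i})^{\top}$ and $\sum_{i}(\boldsymbol{X}^{i})^{\top}\boldsymbol{X}^{i}$; this reading, the random measurement matrices and the dimensions are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 60, 8, 6
Xs = rng.standard_normal((n, p1, p2))                  # hypothetical measurement matrices X^i

# Assumed definition of v(X): matrix variance statistic of the Gaussian series
S1 = np.einsum('iab,icb->ac', Xs, Xs)                  # sum_i X^i (X^i)^T, shape (p1, p1)
S2 = np.einsum('iab,iac->bc', Xs, Xs)                  # sum_i (X^i)^T X^i, shape (p2, p2)
v = max(np.linalg.norm(S1, 2), np.linalg.norm(S2, 2))  # spectral norms

g = rng.standard_normal((1000, n))                     # draws g ~ N(0, Id_n), one per row
series = np.einsum('ti,iab->tab', g, Xs)               # sum_i g_i X^i for each draw
op_norms = np.linalg.norm(series, ord=2, axis=(1, 2))  # operator norm of each matrix
w_hat = op_norms.mean()                                # Monte Carlo estimate of the width
bound = np.sqrt(2 * v * np.log(p1 + p2))               # matrix Gaussian series bound
```

Rescaling the matrices by $\sqrt{n/v}$ would enforce the normalization $v(\boldsymbol{X})\leq n$ of Corollary 4.5, in which case the bound reads $\sqrt{2n\log(p_{1}+p_{2})}$.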

4.4 Discussion of minimax optimality

In this section, we discuss the optimality of the estimators ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ (we remind the reader that the design $\boldsymbol{X}$ is fixed). Recall the discussion on stratification at the end of Section 3.1. Let ${\mathcal{M}}_{0}\in\mathscr{M}$ be the stratum active at ${\boldsymbol{\theta}}_{0}\in{\mathcal{M}}_{0}$ . In this setting, with $\beta=O(1/(pn))$ , (3.5) and Proposition 4.2 ensure that

[TABLE]

with high probability. In particular, for a polyhedral gauge penalty, in which case ${\mathcal{M}}_{0}=T_{{\boldsymbol{\theta}}_{0}}$ (see [63]), and under the normalization $\max_{{\boldsymbol{v}}\in{\mathcal{V}}}\left\|\boldsymbol{X}{\boldsymbol{v}}\right\|_{2}\leq\sqrt{n}$ , Proposition 4.3 entails

[TABLE]

with high probability. Thus the risk bounds only depend on ${\mathcal{M}}_{0}$ . A natural question that arises is whether the above bounds are optimal, i.e. whether an estimator can achieve a significantly better prediction risk than ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ uniformly on ${\mathcal{M}}_{0}$ . A classical way to answer this question is the minimax point of view. This amounts to finding a lower bound on the minimax probabilities of the form

[TABLE]

where $\psi_{n}$ is the rate, which ideally, should be comparable to the risk bounds above. A standard path to derive such a lower bound is to exhibit a subset of ${\mathcal{M}}_{0}$ of well-separated points while controlling its diameter, see [61, Chapter 2] or [39, Section 4.3]. This however must be worked out on a case-by-case basis.

Example 4.1.

For the Lasso case, ${\mathcal{M}}_{0}=T_{{\boldsymbol{\theta}}_{0}}$ is the subspace of vectors whose support is contained in that of ${\boldsymbol{\theta}}_{0}$ . Let $I=\mathrm{supp}({\boldsymbol{\theta}}_{0})$ and $s=\left\|{\boldsymbol{\theta}}_{0}\right\|_{0}$ . Define the set

[TABLE]

We have ${\mathcal{B}}_{0}\subset{\mathcal{M}}_{0}$ and $\left\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}^{\prime}\right\|_{0}\leq 2s$ for all $({\boldsymbol{\theta}},{\boldsymbol{\theta}}^{\prime})\in{\mathcal{B}}_{0}$ . Define ${\mathcal{F}}_{0}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big{\{}r\boldsymbol{X}{\boldsymbol{\theta}}\;:\;{\boldsymbol{\theta}}\in{\mathcal{B}}_{0}\big{\}}$ , for $r>0$ to be specified later. Due to the Varshamov-Gilbert lemma [39, Lemma 4.7], given $a\in]0,1[$ , there exists a subset ${\mathcal{B}}\subset{\mathcal{B}}_{0}$ with cardinality $|{\mathcal{B}}|\geq 2^{\rho s/2}$ such that for two distinct elements $\boldsymbol{X}{\boldsymbol{\theta}}$ and $\boldsymbol{X}{\boldsymbol{\theta}}^{\prime}$ in ${\mathcal{F}}_{0}$

[TABLE]

where

[TABLE]

Standard results from random matrix theory ensure that $\underline{\kappa}>0$ for a Gaussian design with high probability as long as $n\geq s+C\sqrt{s}$ [59] for some positive absolute constant $C$ .

Then choosing $r^{2}=\frac{c\rho\sigma^{2}}{4\overline{\kappa}}$ , where $c\in]0,1/8[$ and $\rho=(1+a)\log(1+a)+(1-a)\log(1-a)$ , we get the bounds

[TABLE]

We are now in position to apply [61, Theorem 2.5] to conclude that there exists $\eta\in]0,1[$ (that depends on $a$ ) such that

[TABLE]

This lower bound together with Corollary 4.1 show that ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ (with $\beta=O(1/(pn))$ ) and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ are nearly minimax (up to a logarithmic factor) over ${\mathcal{M}}_{0}$ .

One can generalize this reasoning to get a minimax lower bound over the larger class of $s$ -sparse vectors, i.e. $\bigcup\big{\{}V=\operatorname*{Span}\{(\boldsymbol{a}_{j})_{1\leq j\leq p}\}\;:\;\dim(V)=s\big{\}}$ , which is a finite union of subspaces that contains ${\mathcal{M}}_{0}$ . Let $(a,b)\in]0,1[^{2}$ such that $1\leq s\leq abp$ and $a(-1+b-\log(b))\geq\log(2)$ (e.g. take $b=1/(1+e\sqrt[a]{2})$ ), and let $c\in]0,1/8[$ . Then combining [61, Theorem 2.5] and [39, Lemma 4.6 and Lemma 4.10], we have for $\eta\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\frac{1}{1+(ab)^{\rho s/2}}{\left(1-2c-\sqrt{\frac{2c}{-\rho\log(ab)}}\right)}\in]0,1[$

[TABLE]

where $\rho=-a(-1+b-\log(b))/\log(ab)$ , and $\underline{\kappa}$ and $\overline{\kappa}$ are now the restricted isometry constants of $\boldsymbol{X}$ of degree $2s$ , i.e.

[TABLE]

For this lower bound to be meaningful, $\underline{\kappa}$ should be positive. From the compressed sensing literature, many random designs are known to verify this condition for $n$ large enough compared to $s$ , e.g. sub-Gaussian designs with $n\gtrsim s\log(p)$ .

One can see that the difference between this lower bound and the one on ${\mathcal{M}}_{0}$ lies in the $\log(p/s)$ factor, which basically derives from the control over the union of subspaces. The minimax prediction risk (in expectation) over the $\ell_{0}$ -ball was studied in [47, 44, 71, 75, 72], where similar lower bounds were obtained.

Example 4.2.

For the group Lasso with $L$ groups of equal size $K$ , ${\mathcal{M}}_{0}$ is the subspace of group-sparse vectors whose group support is included in that of ${\boldsymbol{\theta}}_{0}$ . Let $s$ be the number of non-zero (active) groups in ${\boldsymbol{\theta}}_{0}$ . Following exactly the same reasoning as for the Lasso, one can show that the risk lower bound in probability scales as $C\sigma^{2}sK/n$ , which together with Corollary 4.2, shows that ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ are nearly minimax (up again to a logarithmic factor) over ${\mathcal{M}}_{0}$ . One can also derive the lower bound $C\sigma^{2}s(K+\log(L/s))/n$ over the set of $s$ -block sparse vectors. Such a minimax lower bound is comparable to the one in [37].

Example 4.3.

Let’s consider the $\ell_{\infty}$ -penalty. Denote the saturation support of ${\boldsymbol{\theta}}_{0}$ as $I^{\mathrm{sat}}$ and recall the subspace $T_{{\boldsymbol{\theta}}_{0}}$ from (2.10). Thus, ${\mathcal{M}}_{0}=T_{{\boldsymbol{\theta}}_{0}}$ is the subspace of vectors which are collinear to $\operatorname*{sign}({\boldsymbol{\theta}}_{0})$ on $I^{\mathrm{sat}}$ and free on its complement. Observe that $\dim({\mathcal{M}}_{0})=p-s+1$ , where $s=|I^{\mathrm{sat}}|$ . Define the set

[TABLE]

By construction, ${\mathcal{B}}_{0}\subset{\mathcal{M}}_{0}$ , and $\left\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}^{\prime}\right\|_{0}\leq 2(p-s)$ for all $({\boldsymbol{\theta}},{\boldsymbol{\theta}}^{\prime})\in{\mathcal{B}}_{0}$ . Thus following the same arguments as for the Lasso example (using again Varshamov-Gilbert lemma and [61, Theorem 2.5]), we conclude that there exists $\eta\in]0,1[$ (that depends on $a$ ) such that

[TABLE]

where the restricted isometry constants are defined similarly to the Lasso but with respect to the model subspace ${\mathcal{M}}_{0}$ of the $\ell_{\infty}$ norm. Again, for a Gaussian design, $\underline{\kappa}>0$ with high probability as long as $n\geq(p-s+1)+C\sqrt{p-s+1}$ [59].

The obtained minimax lower bound is consistent with the sample complexity thresholds derived in [24] for noiseless recovery from random projections of the hypercube. For a saturation support size small compared to $p$ , the bound of Corollary 4.4 comes close to the minimax lower bound.

Example 4.4.

Let $r=\operatorname*{rank}({\boldsymbol{\theta}}_{0})$ , where ${\boldsymbol{\theta}}_{0}\in\mathbb{R}^{p_{1}\times p_{2}}$ , and $p=\max(p_{1},p_{2})$ . For the nuclear norm, ${\mathcal{M}}_{0}$ is the manifold of rank- $r$ matrices. Thus arguing as in [34, Theorem 5] (who use the Varshamov-Gilbert lemma [39] to find the covering set), one can show that the minimax risk lower bound over ${\mathcal{M}}_{0}$ is $C\sigma^{2}r/n$ . In view of Corollary 4.5, we deduce that ${\widehat{\boldsymbol{\theta}}_{n}^{\mathrm{EWA}}}$ and ${\widehat{\boldsymbol{\theta}}^{\mathrm{PEN}}_{n}}$ are nearly minimax over the constant rank manifolds.

Appendix A Pre-requisites from convex analysis

We here collect some ingredients from convex analysis that are essential to our exposition.

Monotone conjugate

Lemma A.1.

Let $g$ be a non-decreasing function on $\mathbb{R}_{+}$ that vanishes at $0$ . Then the following hold:

(i)

$g^{+}$ is a proper closed convex and non-decreasing function on $\mathbb{R}_{+}$ that vanishes at $0$ .

(ii)

If $g$ is also closed and convex, then $g^{++}=g$ .

(iii)

Let $f:t\in\mathbb{R}\mapsto g(|t|)$ such that $f$ is differentiable on $\mathbb{R}$ , where $g$ is finite-valued, strictly convex and strongly coercive. Then $g^{+}$ is likewise finite-valued, strictly convex, strongly coercive, and $f^{*}=g^{+}\circ|\cdot|$ is differentiable on $\mathbb{R}$ . In particular, both $g$ and $g^{+}$ are strictly increasing on $\mathbb{R}_{+}$ .

Proof.

(i)

By [3, Proposition 13.11], $g^{+}$ is a closed convex function. We have $\inf_{t\geq 0}g(t)=-\sup_{t\geq 0}t\cdot 0-g(t)=-g^{+}(0)$ . Since $g$ is non-decreasing and $g(0)=0$ , then $g^{+}(0)=-\inf_{t\geq 0}g(t)=-g(0)=0$ . In addition, by (1.5), we have $g^{+}(a)\geq a\cdot 0-g(0)=0$ , $\forall a\in\mathbb{R}_{+}$ . This shows that $g^{+}$ is non-negative and $\operatorname*{dom}(g^{+})\neq\emptyset$ , and in turn, it is also proper.

Let $a,b$ in $\mathbb{R}_{+}$ such that $a<b$ . Then

[TABLE]

That is, $g^{+}$ is non-decreasing on $\mathbb{R}_{+}$ .

(ii)

This follows from [49, Theorem 12.4].

(iii)

By definition of $f$ , $f$ is a finite-valued function on $\mathbb{R}$ , strictly convex, differentiable and strongly coercive. It then follows from [30, Corollary X.4.1.4] that $f^{*}$ enjoys the same properties. In turn, using the fact that both $f$ and $f^{*}$ are even, we have $g^{+}$ is strongly coercive, and strict convexity of $f$ (resp. $f^{*}$ ) is equivalent to that of $g$ (resp. $g^{+}$ ). Altogether, this shows the first claim. We now prove that $g$ vanishes only at $0$ (and similarly for $g^{+}$ ). As $g$ is non-decreasing and strictly convex, we have, for any $\rho\in]0,1[$ and $a,b$ in $\mathbb{R}_{+}$ such that $a<b$ ,

[TABLE]

∎

Support function

The support function of ${\mathcal{C}}\subset\mathbb{R}^{p}$ is

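$$\sigma_{\mathcal{C}}(\boldsymbol{\omega})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\sup_{{\boldsymbol{\theta}}\in{\mathcal{C}}}\langle\boldsymbol{\omega},{\boldsymbol{\theta}}\rangle .$$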

We recall the following properties whose proofs can be found in e.g. [49, 30].

Lemma A.2.

Let ${\mathcal{C}}$ be a non-empty set.

(i)

$\sigma_{\mathcal{C}}$ is proper lsc and sublinear.

(ii)

$\sigma_{\mathcal{C}}$ is finite-valued if and only if ${\mathcal{C}}$ is bounded.

(iii)

If $0\in{\mathcal{C}}$ , then $\sigma_{\mathcal{C}}$ is non-negative.

(iv)

If ${\mathcal{C}}$ is convex and compact with $0\in\operatorname{int}({\mathcal{C}})$ , then $\sigma_{\mathcal{C}}$ is finite-valued and coercive.

Gauges and polars

Definition A.1 (Polar set).

Let ${\mathcal{C}}$ be a nonempty convex set. The set ${\mathcal{C}}^{\circ}$ given by

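$${\mathcal{C}}^{\circ}\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\big\{\boldsymbol{\eta}\in\mathbb{R}^{p}\;:\;\langle\boldsymbol{\eta},{\boldsymbol{\theta}}\rangle\leq 1,\;\forall{\boldsymbol{\theta}}\in{\mathcal{C}}\big\}$$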

is called the polar of ${\mathcal{C}}$ .

The set ${\mathcal{C}}^{\circ}$ is closed convex and contains the origin. When ${\mathcal{C}}$ is also closed and contains the origin, then it coincides with its bipolar, i.e. ${\mathcal{C}}^{\circ\circ}={\mathcal{C}}$ .

Let ${\mathcal{C}}\subseteq\mathbb{R}^{p}$ be a non-empty closed convex set containing the origin. The gauge of ${\mathcal{C}}$ is the function $\gamma_{\mathcal{C}}$ defined on $\mathbb{R}^{p}$ by

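$$\gamma_{\mathcal{C}}({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\inf\big\{\lambda>0\;:\;{\boldsymbol{\theta}}\in\lambda{\mathcal{C}}\big\}.$$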

As usual, $\gamma_{\mathcal{C}}({\boldsymbol{\theta}})=+\infty$ if the infimum is not attained.

Lemma A.3 hereafter recaps the main properties of a gauge that we need. In particular, (ii) is a fundamental result of convex analysis that states that there is a one-to-one correspondence between gauge functions and closed convex sets containing the origin. This allows one to identify sets from their gauges, and vice versa.

Lemma A.3.

(i)

$\gamma_{\mathcal{C}}$ is a non-negative, lsc and sublinear function.

(ii)

${\mathcal{C}}$ is the unique closed convex set containing the origin such that

[TABLE]

(iii)

$\gamma_{\mathcal{C}}$ is finite-valued if, and only if, $0\in\operatorname{int}({\mathcal{C}})$ , in which case $\gamma_{\mathcal{C}}$ is Lipschitz continuous.

(iv)

$\gamma_{\mathcal{C}}$ is finite-valued and coercive if, and only if, ${\mathcal{C}}$ is compact and $0\in\operatorname{int}({\mathcal{C}})$ .

See [63] for the proof.

Observe that thanks to sublinearity, local Lipschitz continuity valid for any finite-valued convex function is strengthened to global Lipschitz continuity. Moreover, $\gamma_{\mathcal{C}}$ is a norm, having ${\mathcal{C}}$ as its unit ball, if and only if ${\mathcal{C}}$ is bounded with nonempty interior and symmetric.

We now define the polar gauge.

Definition A.2 (Polar Gauge).

The polar of a gauge $\gamma_{\mathcal{C}}$ is the function $\gamma_{\mathcal{C}}^{\circ}$ defined by

[TABLE]

An immediate consequence is that gauges polar to each other have the property

[TABLE]

just as dual norms satisfy a duality inequality. In fact, polar pairs of gauges correspond to the best inequalities of this type.

Lemma A.4.

Let ${\mathcal{C}}\subseteq\mathbb{R}^{p}$ be a closed convex set containing the origin. Then,

(i)

$\gamma_{\mathcal{C}}^{\circ}$ is a gauge function and $\gamma_{\mathcal{C}}^{\circ\circ}=\gamma_{\mathcal{C}}$ .

(ii)

$\gamma_{\mathcal{C}}^{\circ}=\gamma_{{\mathcal{C}}^{\circ}}$ , or equivalently

[TABLE]

(iii)

The gauge of ${\mathcal{C}}$ and the support function of ${\mathcal{C}}$ are mutually polar, i.e.

[TABLE]

See [49, 30, 63] for the proof.
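A small numerical illustration of these notions may help. The sketch below takes ${\mathcal{C}}$ to be the unit $\ell_{1}$ ball (an assumption chosen for simplicity), whose gauge is the $\ell_{1}$ norm and whose support function, hence polar gauge, is the $\ell_{\infty}$ norm. It checks on random vectors that the support function is attained at a vertex and that the polarity inequality between a gauge and its polar holds. The helper names are hypothetical.

```python
import numpy as np

def gauge_l1(theta):
    """Gauge of the unit l1 ball C, i.e. the l1 norm."""
    return np.abs(theta).sum()

def support_l1(omega):
    """Support function of C (equal to its polar gauge), i.e. the l_inf norm."""
    return np.abs(omega).max()

rng = np.random.default_rng(0)
p = 6
vertices = np.vstack([np.eye(p), -np.eye(p)])     # extreme points +/- e_j of C

for _ in range(100):
    theta, omega = rng.standard_normal(p), rng.standard_normal(p)
    # The support function is a maximum of <omega, v> over the vertices
    assert np.isclose(support_l1(omega), (vertices @ omega).max())
    # Polarity inequality: <theta, omega> <= gamma_C(theta) * polar gauge at omega
    assert theta @ omega <= gauge_l1(theta) * support_l1(omega) + 1e-12
    # theta / gamma_C(theta) lies on the boundary of C
    assert np.isclose(gauge_l1(theta / gauge_l1(theta)), 1.0)
```

The same checks apply to any polytope containing the origin in its interior once its vertex list is known, with the gauge evaluated through a linear program instead of a closed form.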

Appendix B Expectation of the inner product

We start with some definitions and notations that will be used in the proof. For a non-empty closed convex set ${\mathcal{C}}\subset\mathbb{R}^{p}$ , we denote ${\big{(}{\mathcal{C}}\big{)}}^{0}$ its minimal selection, i.e. the element of minimal norm in ${\mathcal{C}}$ . This element is of course unique. For a proper lsc and convex function $f$ and $\gamma>0$ , its Moreau envelope (or Moreau-Yosida regularization) is defined by

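$${}^{\gamma}f({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\min_{\boldsymbol{\omega}\in\mathbb{R}^{p}}\frac{1}{2\gamma}\left\|\boldsymbol{\omega}-{\boldsymbol{\theta}}\right\|_{2}^{2}+f(\boldsymbol{\omega}).$$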

The Moreau envelope enjoys several important properties that we collect in the following lemma.

Lemma B.1.

Let $f$ be a finite-valued and convex function. Then

(i)

${\left(\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf({\boldsymbol{\theta}})\right)}_{\gamma>0}$ is a decreasing net, and $\forall{\boldsymbol{\theta}}\in\mathbb{R}^{p}$ , $\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf({\boldsymbol{\theta}})\nearrow f({\boldsymbol{\theta}})$ as $\gamma\searrow 0$ .

(ii)

$\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf\in C^{1}(\mathbb{R}^{p})$ with $\gamma^{-1}$ -Lipschitz continuous gradient.

(iii)

$\forall{\boldsymbol{\theta}}\in\mathbb{R}^{p}$ , $\nabla\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf({\boldsymbol{\theta}})\to{\big{(}\partial f({\boldsymbol{\theta}})\big{)}}^{0}$ and $\big{\|}\nabla\mathop{}\mathopen{\vphantom{f}}^{\gamma}\kern-0.5ptf({\boldsymbol{\theta}})\big{\|}_{2}\nearrow\big{\|}{\big{(}\partial f({\boldsymbol{\theta}})\big{)}}^{0}\big{\|}_{2}$ as $\gamma\searrow 0$ .

Proof.

(i) [3, Proposition 12.32]. (ii) [3, Proposition 12.29]. (iii) By assumption, $f$ is subdifferentiable everywhere and its subdifferential is a maximal monotone operator with domain $\mathbb{R}^{p}$ , and the result follows from [3, Corollary 23.46(i)]. ∎
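The three properties of Lemma B.1 can be observed on a one-dimensional example. The sketch below assumes the standard definition of the Moreau envelope, ${}^{\gamma}f(t)=\min_{u}\tfrac{1}{2\gamma}(u-t)^{2}+f(u)$, and takes $f=|\cdot|$, whose envelope is the Huber function. The function names and the grid are illustrative choices.

```python
import numpy as np

def moreau_env_abs(t, gamma):
    """Moreau envelope of f = |.| with parameter gamma (the Huber function)."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= gamma, t**2 / (2 * gamma), np.abs(t) - gamma / 2)

def grad_moreau_env_abs(t, gamma):
    """Gradient of the envelope: (t - prox(t)) / gamma = clip(t / gamma, -1, 1)."""
    return np.clip(np.asarray(t, dtype=float) / gamma, -1.0, 1.0)

t = np.linspace(-2, 2, 401)
gammas = [1.0, 0.5, 0.1, 0.01]                       # decreasing values of gamma
envs = [moreau_env_abs(t, g) for g in gammas]

# (i) the envelopes increase towards f = |.| as gamma decreases
monotone = all(np.all(envs[k] <= envs[k + 1] + 1e-15) for k in range(len(envs) - 1))
below_f = np.all(envs[-1] <= np.abs(t) + 1e-15)
gap = np.max(np.abs(t) - envs[-1])                   # at most gamma / 2

# (iii) the gradient tends to the minimal-norm subgradient: sign(t), and 0 at t = 0
grad_at_zero = grad_moreau_env_abs(0.0, 0.01)
grad_at_one = grad_moreau_env_abs(1.0, 0.01)
```

Property (ii) is visible in the closed form of the gradient, which is the map $t\mapsto t/\gamma$ clipped to $[-1,1]$ and is therefore $\gamma^{-1}$ -Lipschitz.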

We are now equipped to prove the following important result. (Footnote: The result will be proved using Moreau-Yosida regularization. An alternative proof could be based on mollifiers for approximating subdifferentials.)

Proposition B.1.

Consider the density $\mu_{n}$ in (1.2), where

(a) $F$ satisfies Assumptions (H.1)-(H.2);

(b) $J$ is a finite-valued lower-bounded convex function, and $\exists R>0$ and $\rho\geq 0$ such that $\forall{\boldsymbol{\theta}}\in\mathbb{R}^{p}$ , $\big{\|}{\big{(}\partial J({\boldsymbol{\theta}})\big{)}}^{0}\big{\|}_{2}\leq R\left\|{\boldsymbol{\theta}}\right\|_{2}^{\rho}$ ;

(c) and $V_{n}$ is coercive.

Then, $\forall\overline{{\boldsymbol{\theta}}}\in\mathbb{R}^{p}$ ,

[TABLE]

This result covers of course the situation where $J$ fulfills (H.3). In this case, since $\partial J({\boldsymbol{\theta}})\subset{\mathcal{C}}^{\circ}$ by Theorem 2.1(i), we have $\rho=0$ and $R=\operatorname{diam}({\mathcal{C}}^{\circ})$ , the diameter of the convex compact set ${\mathcal{C}}^{\circ}$ containing the origin. It can be shown that, when $F(\cdot,\boldsymbol{y})$ is strongly coercive, the coercivity assumption (c) can be equivalently stated as $J_{\infty}({\boldsymbol{\theta}})>0$ , $\forall{\boldsymbol{\theta}}\in\ker(\boldsymbol{X})\setminus\{0\}$ , where $J_{\infty}$ is the recession/asymptotic function of $J$ ; see e.g. **[50]**.
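As an illustration of assumption (b) with $\rho=0$, take the Lasso penalty $J=\left\|\cdot\right\|_{1}$ (a choice made here for the example). Then ${\mathcal{C}}$ is the $\ell_{1}$ unit ball, ${\mathcal{C}}^{\circ}$ is the $\ell_{\infty}$ unit ball, whose $\ell_{2}$-diameter is $2\sqrt{p}$, and the minimal-norm subgradient is the sign vector with zeros on the zero entries. The following sketch, with hypothetical helper names, checks the bound on random vectors.

```python
import math
import random

def min_norm_subgradient_l1(theta):
    """Minimal-norm element of the subdifferential of the l1 norm at theta:
    sign(theta_i) on nonzero entries, 0 on zero entries."""
    return [0.0 if t == 0 else math.copysign(1.0, t) for t in theta]

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

p = 6
# l2-diameter of the l_infinity unit ball (the polar of the l1 unit ball)
R = 2.0 * math.sqrt(p)

random.seed(0)
for _ in range(1000):
    # random vectors in which some entries are exactly zero
    theta = [random.choice([0.0, random.uniform(-5, 5)]) for _ in range(p)]
    g = min_norm_subgradient_l1(theta)
    assert l2_norm(g) <= R              # assumption (b) with rho = 0
    assert l2_norm(g) <= math.sqrt(p)   # the sharper bound sqrt(p) also holds
```

The constant $R=\operatorname{diam}({\mathcal{C}}^{\circ})$ is therefore not tight in this example, but any finite $R$ is enough for the domination argument in the proof.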

Proof.

Let $V^{\gamma}_{n}({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\tfrac{1}{n}F(\boldsymbol{X}{\boldsymbol{\theta}},\boldsymbol{y})+\lambda_{n}\mathop{}\mathopen{\vphantom{J}}^{\gamma}\kern-0.5ptJ({\boldsymbol{\theta}})$ and define $\mu^{\gamma}_{n}({\boldsymbol{\theta}})\stackrel{{\scriptstyle\text{\tiny def}}}{{=}}\exp{\left(-V^{\gamma}_{n}({\boldsymbol{\theta}})/\beta\right)}/Z$ , where $0<Z<+\infty$ is the normalizing constant of the density $\mu_{n}$ . Assumption (H.1) and Lemma B.1(ii)-(iii) tell us that $V^{\gamma}_{n}\in C^{1}(\mathbb{R}^{p})$ and $\nabla V^{\gamma}_{n}({\boldsymbol{\theta}})\to{\big{(}\partial V_{n}({\boldsymbol{\theta}})\big{)}}^{0}$ as $\gamma\to 0$ . Thus

[TABLE]

We now check that $\langle\mu^{\gamma}_{n}({\boldsymbol{\theta}})\nabla V^{\gamma}_{n}({\boldsymbol{\theta}}),\overline{{\boldsymbol{\theta}}}-{\boldsymbol{\theta}}\rangle$ is dominated by an integrable function. From the definition of the Moreau envelope, we have

[TABLE]

From coercivity of $V_{n}$ , the objective in the $\min$ is also coercive in $({\boldsymbol{\theta}},\overline{{\boldsymbol{\theta}}})$ by [50, Exercise 3.29(b)]. It then follows from [50, Theorem 3.31] that $V^{\gamma}_{n}$ is also coercive. In turn, [50, Theorem 11.8(c) and 3.26(a)] allow us to assert that for some $a\in]0,+\infty[$ , $\exists b\in]-\infty,+\infty[$ such that for all $\gamma>0$ and ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$

[TABLE]

Lemma B.1(iii) and assumption (b) on $J$ entail that for any ${\boldsymbol{\theta}}\in\mathbb{R}^{p}$ ,

[TABLE]

Altogether, we have

[TABLE]

where the constant $C>0$ reflects the lower-boundedness of $J$ . It is easy to see that the function in this upper bound is integrable, where we also use (H.2). Hence, we can apply the dominated convergence theorem to get

[TABLE]

Now, by simple differential calculus (chain and product rules), we have

[TABLE]

Integrating the first term, we get by Fubini's theorem and the Newton-Leibniz formula

[TABLE]

where we used coercivity of $V^{\gamma}_{n}$ (see (B.1)) to conclude that $\lim_{|{\boldsymbol{\theta}}_{i}|\to+\infty}\mu^{\gamma}_{n}({\boldsymbol{\theta}})(\overline{{\boldsymbol{\theta}}}_{i}-{\boldsymbol{\theta}}_{i})=0$ . For the second term, we have from Lemma B.1(i) that $\mu^{\gamma}_{n}\to\mu_{n}$ as $\gamma\to 0$ . Thus, arguing again as in (B.1), we can apply the dominated convergence theorem to conclude that

[TABLE]

This concludes the proof. ∎
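The integration-by-parts mechanism of the proof can be sanity-checked numerically. For a potential $V$ that is coercive and absolutely continuous on $\mathbb{R}$, writing $\mu=\exp(-V/\beta)/Z$, one has $\mu V^{\prime}=-\beta\mu^{\prime}$ almost everywhere, so integrating by parts with vanishing boundary terms gives $\int V^{\prime}(\theta)(\overline{\theta}-\theta)\mu(\theta)\,d\theta=-\beta$ (and $-p\beta$ by the same computation coordinate-wise in dimension $p$). The displays of Proposition B.1 are not legible in this copy, so treating this as the identity at stake is an assumption. The sketch uses a hypothetical one-dimensional instance with quadratic loss and $\ell_{1}$ penalty; all numerical values are arbitrary.

```python
import math

def integrate(g, lo, hi, cells):
    """Midpoint rule on [lo, hi]. With an even number of cells on a symmetric
    interval, 0 is a cell boundary, so the kink of |.| at 0 is never sampled."""
    h = (hi - lo) / cells
    return h * sum(g(lo + (k + 0.5) * h) for k in range(cells))

# Hypothetical instance with p = 1: quadratic loss plus l1 penalty.
y, lam, beta = 1.0, 0.7, 0.5
theta_bar = -0.4   # arbitrary reference point

def V(t):
    """Nonsmooth coercive potential V(t) = (t - y)^2 / 2 + lam * |t|."""
    return 0.5 * (t - y) ** 2 + lam * abs(t)

def dV(t):
    """Derivative of V away from 0 (the minimal-norm subgradient there)."""
    return (t - y) + lam * math.copysign(1.0, t)

def weight(t):
    """Unnormalized density exp(-V/beta)."""
    return math.exp(-V(t) / beta)

L, cells = 20.0, 200_000   # truncation radius and grid size
Z = integrate(weight, -L, L, cells)
lhs = integrate(lambda t: dV(t) * (theta_bar - t) * weight(t), -L, L, cells) / Z

print(lhs, -beta)
assert abs(lhs - (-beta)) < 1e-5
```

The value does not depend on the reference point `theta_bar`, since $\int V^{\prime}\mu=-\beta\int\mu^{\prime}=0$ under the same decay condition.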

Acknowledgement.

This work was supported by Conseil Régional de Basse-Normandie and partly by Institut Universitaire de France.

Bibliography

[1] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[2] S. Bakin. Adaptive regression and model selection in data mining problems. Ph.D. thesis, Australian National University, 1999.
[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2011.
[4] P. Bellec. Concentration of quadratic forms under a Bernstein moment assumption. Technical report, Ecole Polytechnique, 2014.
[5] P. J. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.
[6] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. Slope – adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140, 2014.
[7] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer-Verlag Berlin Heidelberg, 2011.
[8] E. Candès and Y. Plan. Near-ideal model selection by $\ell_{1}$ minimization. Annals of Statistics, 37(5A):2145–2177, 2009.
