Oracle inequalities for square root analysis estimators with application to total variation penalties

Francesco Ortelli; Sara van de Geer

arXiv:1902.11192·math.ST·February 12, 2021


TL;DR

This paper develops oracle inequalities for square root analysis estimators, including total variation penalties on graphs, providing theoretical guarantees and extending previous entropy-based results.

Contribution

It introduces new oracle inequalities for square root analysis estimators and extends the theory to total variation regularization on graphs.

Findings

1. Oracle inequalities with fast and slow rates derived for analysis estimators.

2. Extension of theory to square root analysis estimators.

3. Constant-friendly rates for total variation regularized estimators on graphs.

Abstract

Through the direct study of the analysis estimator we derive oracle inequalities with fast and slow rates by adapting the arguments involving projections by Dalalyan, Hebiri and Lederer (2017). We then extend the theory to the square root analysis estimator. Finally, we focus on (square root) total variation regularized estimators on graphs and obtain constant-friendly rates, which, up to log-terms, match previous results obtained by entropy calculations. We also obtain an oracle inequality for the (square root) total variation regularized estimator over the cycle graph.

Equations

$$\hat{f}=X\hat{\beta},\quad\hat{\beta}=\arg\min_{\beta\in\mathbb{R}^{p}}\left\{\lVert Y-X\beta\rVert^{2}_{n}+2\lambda\lVert\beta\rVert_{1}\right\},\ \lambda>0,$$

$$\hat{f}=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert^{2}_{n}+2\lambda\lVert Df\rVert_{1}\right\},\ \lambda>0.$$

$$(d_{i}^{\prime})_{j}=\begin{cases}-1,&j=e_{i}^{-},\\ 1,&j=e_{i}^{+},\\ 0,&\text{else},\end{cases}$$

$$\mathcal{G}(C):=\left\{f\in\text{rowspan}(D):\lVert D_{\vec{G}}f\rVert_{1}\leq C\right\}.$$

$$\hat{\beta}:=\arg\min_{\beta\in\mathbb{R}^{p}}\left\{\lVert Y-X\beta\rVert_{n}+\lambda_{0}\lVert\beta\rVert_{1}\right\},\ \lambda_{0}>0,$$

$$(\hat{\beta},\hat{\sigma}):=\arg\min_{\beta\in\mathbb{R}^{p},\ \sigma>0}\left\{\frac{\lVert Y-X\beta\rVert^{2}_{n}}{\sigma}+\sigma+2\lambda_{0}\lVert\beta\rVert_{1}\right\},\ \lambda_{0}>0.$$

$$\hat{\sigma}^{2}=\lVert Y-X\hat{\beta}\rVert^{2}_{n}\quad\text{and}\quad\frac{X^{\prime}(Y-X\hat{\beta})}{n}=\lambda_{0}\hat{\sigma}\,\partial\lVert\hat{\beta}\rVert_{1},$$

$$\hat{f}_{\surd}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}\right\},\ \lambda_{0}>0.$$

$$\mathcal{S}=\mathcal{S}(D):=\left\{S\subseteq[m]\mid\exists f\in\mathbb{R}^{n}:S=S(f)\right\}\subseteq\mathcal{P}([m]),$$

$$\Pi_{\mathcal{N}^{\perp}(D_{-S})}=\Pi_{\text{rowspan}(D_{-S})}=D_{-S}^{+}D_{-S},$$

$$Y=f^{0}+\epsilon,\ \epsilon\sim\mathcal{N}_{n}(0,\sigma^{2}\text{I}_{n}),\ \sigma\in(0,\infty).$$

$$\hat{f}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert^{2}_{n}+2\lambda\lVert Df\rVert_{1}\right\},\ \lambda>0,$$

$$\hat{f}_{\surd}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}\right\},\ \lambda_{0}>0.$$

$$\kappa^{2}(S,\tilde{W}):=r_{S}\min\left\{\lVert f\rVert^{2}_{n}:\lVert D_{S}f\rVert_{1}-\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}=1\right\}.$$

$$\frac{r_{S}\lVert f\rVert^{2}_{n}}{\lVert D_{S}f\rVert^{2}_{1}},$$

$$\frac{r_{S}\lVert f\rVert^{2}_{n}}{\left(\lVert D_{S}f\rVert_{1}-\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}\right)^{2}},$$

$$\omega_{i}:=\begin{cases}\lVert d^{+}_{i^{*}(i)}\rVert_{n},&i\in-S,\\ 0,&i\in S.\end{cases}$$

$$\omega_{i}^{2}=\left((D_{-S}D_{-S}^{\prime})^{-1}/n\right)_{i^{*}(i),i^{*}(i)},\ i\in-S.$$

$$\gamma:=\max_{i\in-S}\omega_{i}.$$

$$w_{i}:=1-\omega_{i}/\gamma,\ i\in[m].$$

$$\hat{f}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert^{2}_{n}+2\lambda\lVert Df\rVert_{1}\right\},\ \lambda>0.$$

$$\lVert\hat{f}-f^{0}\rVert^{2}_{n}\leq\lVert f-f^{0}\rVert^{2}_{n}+4\lambda\lVert D_{-S}f\rVert_{1}+\left(\sigma\sqrt{\frac{2x}{n}}+\sigma\sqrt{\frac{r_{S}}{n}}+\frac{\lambda\sqrt{r_{S}}}{\kappa(S,W)}\right)^{2}.$$

$$\lVert\hat{f}-f^{0}\rVert^{2}_{n}+2\lambda\lVert D_{S}\hat{f}\rVert_{1}\leq\lVert f-f^{0}\rVert^{2}_{n}+\frac{\sigma^{2}}{n}\left(\sqrt{2x}+\sqrt{r_{S}}\right)^{2}+4\lambda\lVert Df\rVert_{1}.$$

$$\hat{f}_{\surd}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}\right\},\ \lambda_{0}>0.$$

$$\lambda_{0}\geq\frac{1}{1-\eta}R\quad\text{and}\quad\lVert Df^{0}\rVert_{1}\leq c\,\sigma\sqrt{1-\sqrt{8a/n}}\,/\lambda_{0},$$

$$c=\sqrt{\left(\frac{\eta}{2}-\frac{\sqrt{r_{S}}+\sqrt{2a}}{\sqrt{n-\sqrt{8an}}}\right)^{2}+4}-2.$$

$$\eta>2\,\frac{\sqrt{r_{S}}+\sqrt{2a}}{\sqrt{n-\sqrt{8an}}}.$$

$$\frac{\lVert\hat{\epsilon}\rVert_{n}}{\lVert\epsilon\rVert_{n}}-1\leq\eta.$$

$$\lVert\hat{f}_{\surd}-f^{0}\rVert^{2}_{n}\leq\lVert f-f^{0}\rVert^{2}_{n}+16\sigma\lambda_{0}\lVert D_{-S}f\rVert_{1}+\sigma^{2}\left(\sqrt{\frac{2a}{n}}+\sqrt{\frac{r_{S}}{n}}+\frac{4\lambda_{0}\sqrt{r_{S}}}{\kappa(S,W)}\right)^{2}.$$

$$\lVert\hat{f}_{\surd}-f^{0}\rVert^{2}_{n}+2(1-\eta)\sqrt{1-\sqrt{\frac{8a}{n}}}\,\sigma\lambda_{0}\lVert D_{S}\hat{f}_{\surd}\rVert_{1}$$


Full text

Francesco Ortelli and Sara van de Geer

Seminar for Statistics, ETH Zürich, Rämistrasse 101, 8092 Zürich


Keywords: Analysis, Total variation regularization, Lasso, Edge Lasso, Cycle graph, Sparsity, Trend filtering, Oracle inequality, Nullspace, Square root Lasso

1 Introduction
1.1 Review of the literature
1.1.1 Synthesis and analysis
1.1.2 Total variation regularized estimators
1.1.3 Square root regularization
1.2 Contributions
1.3 Notation
1.4 Model assumptions and preliminary definitions
1.4.1 Model assumptions
1.4.2 Definitions
2 Oracle inequalities for the analysis estimator
2.1 Fast rates with compatibility conditions
2.2 Slow rates without compatibility conditions
3 Oracle inequalities for the square root analysis estimator
3.1 Fast rates with compatibility conditions
3.2 Slow rates without compatibility conditions
4 Total variation
4.1 Incidence matrices
4.1.1 Trees and cycles
4.1.2 Two dimensional grid graph
4.2 Fast rates
4.2.1 Path graph
4.2.2 Cycle graph
4.3 Slow rates
4.3.1 Trees and cycles
4.3.2 Two dimensional grid graph
4.3.3 Comparison with other results
5 Conclusion
A Probability inequalities
B Proofs of Section 2
B.1 Basic inequality
B.2 Bound on the increments of the empirical process
B.3 Proof of the oracle inequalities
C Proofs of Section 3
C.1 Proving that the square root analysis estimator does not overfit
C.2 Basic inequality
C.3 Bound on the increments of the empirical process
C.4 Proof of the oracle inequalities
D Proofs of Section 4

1 Introduction

1.1 Review of the literature

1.1.1 Synthesis and analysis

In the literature we find two approaches to regularized empirical risk minimization: the synthesis and the analysis approach, see Elad, Milanfar and Rubinstein (2007). Given a dictionary $X\in\mathbb{R}^{n\times p}$ , the synthesis approach to the estimation of $f^{0}\in\mathbb{R}^{n}$ is expressed by the synthesis estimator

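$$\hat{f}=X\hat{\beta},\quad\hat{\beta}=\arg\min_{\beta\in\mathbb{R}^{p}}\left\{\lVert Y-X\beta\rVert^{2}_{n}+2\lambda\lVert\beta\rVert_{1}\right\},\ \lambda>0,$$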

where $Y=f^{0}+\epsilon,\ \epsilon\sim\mathcal{N}_{n}(0,\sigma^{2}\text{I}_{n}),\ \sigma\in(0,\infty)$, and for a vector $f\in\mathbb{R}^{n}$ we write $\lVert f\rVert^{2}_{n}=\sum_{i=1}^{n}f_{i}^{2}/n$. An instance of a synthesis estimator is the classical lasso (Tibshirani (1996); see Bühlmann and van de Geer (2011) and van de Geer (2016) for a thorough exposition of the theory of the lasso).

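On the other hand, for an analysis operator $D\in\mathbb{R}^{m\times n}$, the analysis estimator is given by

$$\hat{f}=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert^{2}_{n}+2\lambda\lVert Df\rVert_{1}\right\},\ \lambda>0.$$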

The analysis approach to the estimation of $f^{0}$ has previously been studied in e.g. Vaiter et al. (2013) and Nam et al. (2013). Instances of analysis estimators are total variation regularized estimators over graphs, in particular the fused lasso (Tibshirani et al. (2005)), which corresponds to the case of the path graph. For such estimators, $D$ is taken to be the incidence matrix of some directed graph $\vec{G}=(V,E)$ .

Algorithms for solving both the analysis and the synthesis problem are presented in Tibshirani and Taylor (2011).
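As a rough numerical illustration of the analysis problem $\min_{f}\{\lVert Y-f\rVert^{2}_{n}+2\lambda\lVert Df\rVert_{1}\}$, the sketch below uses ADMM, a generic splitting scheme chosen here for brevity. It is not the path algorithm of Tibshirani and Taylor (2011). The function name `analysis_estimator_admm`, the step size `rho`, the iteration count and the toy data are all assumptions made for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def analysis_estimator_admm(y, D, lam, rho=1.0, n_iter=2000):
    """Approximately minimize ||y - f||_n^2 + 2 * lam * ||D f||_1 via ADMM.

    Splitting: z = D f, with scaled dual variable u (illustrative sketch).
    """
    n = y.shape[0]
    m = D.shape[0]
    # The f-update solves a fixed linear system; ||.||_n^2 carries a 1/n factor.
    A = (2.0 / n) * np.eye(n) + rho * D.T @ D
    z = np.zeros(m)
    u = np.zeros(m)
    f = y.copy()
    for _ in range(n_iter):
        f = np.linalg.solve(A, (2.0 / n) * y + rho * D.T @ (z - u))
        Df = D @ f
        z = soft_threshold(Df + u, 2.0 * lam / rho)
        u = u + Df - z
    return f

# Toy data on the path graph with n = 20 vertices (hypothetical example).
rng = np.random.default_rng(0)
n = 20
D = np.diff(np.eye(n), axis=0)            # rows e_{j+1} - e_j
y = np.repeat([0.0, 2.0], n // 2) + 0.3 * rng.normal(size=n)

f_no_penalty = analysis_estimator_admm(y, D, lam=0.0)       # reproduces y
f_heavy_penalty = analysis_estimator_admm(y, D, lam=100.0)  # constant fit
```

With $\lambda=0$ the fit reproduces the observations, while a very large $\lambda$ forces $Df=0$, which on a connected graph leaves only the constant vector equal to the mean of the observations.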

1.1.2 Total variation regularized estimators

Let $\vec{G}=(V,E)$ be a general directed graph, where the set $V=[n]$ is the set of vertices and the set $E=\{e_{1},\ldots,e_{m}\}$ is the set of edges. Every edge $e_{i}=(e_{i}^{-},e_{i}^{+})$ is directed from a vertex $e_{i}^{-}\in V$ to a vertex $e_{i}^{+}\in V$ , $e_{i}^{-}\not=e_{i}^{+}$ .

We define $D_{\vec{G}}\in\{-1,0,1\}^{m\times n}$, the incidence matrix of $\vec{G}$, as

$$(d_{i}^{\prime})_{j}=\begin{cases}-1,&j=e_{i}^{-},\\ 1,&j=e_{i}^{+},\\ 0,&\text{else},\end{cases}$$

where $d_{i}^{\prime},i\in[m]$ denotes the $i^{\text{th}}$ row of $D_{\vec{G}}$. Total variation regularized estimators are analysis estimators, where the analysis operator $D$ is taken to be $D=D_{\vec{G}}$ for some graph $\vec{G}$. Thus, the differences of the candidate estimator $f$ across the edges of the graph $\vec{G}$ are penalized.
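The construction of the incidence matrix can be sketched in a few lines. The helper name `incidence_matrix`, the 0-based vertex labels and the example signal are assumptions made for illustration; the sign convention ($-1$ at the tail $e_{i}^{-}$, $+1$ at the head $e_{i}^{+}$) follows the definition above.

```python
import numpy as np

def incidence_matrix(n_vertices, edges):
    """Incidence matrix of a directed graph.

    Row i has -1 at the tail e_i^- and +1 at the head e_i^+ (0-based vertices).
    """
    D = np.zeros((len(edges), n_vertices))
    for i, (tail, head) in enumerate(edges):
        if tail == head:
            raise ValueError("self-loops are excluded: e_i^- must differ from e_i^+")
        D[i, tail] = -1.0
        D[i, head] = 1.0
    return D

# Path graph on 5 vertices: edges 0->1, 1->2, 2->3, 3->4.
path_edges = [(j, j + 1) for j in range(4)]
D_path = incidence_matrix(5, path_edges)

# D f collects the differences of f across the edges; the penalty is their l1 norm.
f = np.array([1.0, 1.0, 3.0, 3.0, 0.0])
edge_differences = D_path @ f
total_variation = np.abs(edge_differences).sum()
```

For the path graph the rows of $D$ are first-order differences, so the penalty $\lVert Df\rVert_{1}$ is the usual total variation of the sequence $f$.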

Some previous studies of total variation regularized estimators (Dalalyan, Hebiri and Lederer (2017); Ortelli and van de Geer (2018)) used a step through a synthesis formulation (cf. Ortelli and van de Geer (2019a)) to prove oracle inequalities. However, these studies were confined to restrictive graph structures: the path in Dalalyan, Hebiri and Lederer (2017) and a class of tree graphs in Ortelli and van de Geer (2018). Other studies focusing on the fused lasso and not directly involving its synthesis form also implicitly relied on some kind of dictionary to handle the error term by projections onto some columns of this dictionary, see for instance the lower interpolant by Lin et al. (2017).

The approach by Hütter and Rigollet (2016), in spite of handling directly the analysis estimator, is not able to guarantee the convergence of the mean squared error for the fused lasso.

For $C>0$ , define

$$\mathcal{G}(C):=\left\{f\in\text{rowspan}(D):\lVert D_{\vec{G}}f\rVert_{1}\leq C\right\}.$$

The minimax rate of estimation over the path graph for functions $f\in\mathcal{G}(C),\ C\asymp 1$ is $n^{-2/3}$ (Donoho and Johnstone (1998)). Moreover, the fused lasso tuned with $\lambda\asymp n^{-2/3}C^{-1/3}$ has $\lVert\hat{f}-f^{0}\rVert^{2}_{n}=\mathcal{O}_{\mathbb{P}}(n^{-2/3}C^{2/3})$ if $f^{0}\in\mathcal{G}(C)$ and thus achieves the minimax rate (Mammen and van de Geer (1997)). This result is based on entropy bounds (see ed. Babenko (1979); Birman and Solomjak (1967)) on the class $\mathcal{G}(C)$, which are not constant-friendly. On the other hand, Sadhanala, Wang and Tibshirani (2016) showed that estimators given by linear transformations of the observations are suboptimal on $\mathcal{G}(C)$.

In a recent paper, Padilla et al. (2018) prove that when $\vec{G}$ is a tree graph with bounded maximal degree and $f^{0}\in\mathcal{G}(C)$ , then the minimax rate is $n^{-2/3}C^{2/3}$ .

Moreover, Padilla et al. (2018) prove that the total variation regularized estimator over any connected graph has a mean squared error of order at most $n^{-2/3}C^{2/3}$ if $f^{0}\in\mathcal{G}(C)$ . Thus the total variation regularized estimator over tree graphs of bounded maximal degree is proved to be minimax-optimal. This result is based on entropy bounds by Wang et al. (2016) and is not constant-friendly.

In Sadhanala, Wang and Tibshirani (2016), the authors prove that, for $\vec{G}$ being the two dimensional grid graph, the minimax rate of estimation for $f^{0}\in\mathcal{G}(C)$ with the canonical scaling $C\asymp n^{1/2}$ is $\textstyle\sqrt{{\log n}/{n}\,}$ . The paper by Hütter and Rigollet (2016) shows that this rate is retrieved by the total variation regularized estimator up to log terms.

In a recent work, Chatterjee and Goswami (2019) obtain convergence rates for the total variation regularized estimator over the two dimensional grid by proof techniques involving bounds on the Gaussian width of tangent cones.

These previous results will serve as a benchmark for the evaluation of the rates of the oracle inequalities presented in this paper.

1.1.3 Square root regularization

The square root lasso estimator, defined as

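$$\hat{\beta}:=\arg\min_{\beta\in\mathbb{R}^{p}}\left\{\lVert Y-X\beta\rVert_{n}+\lambda_{0}\lVert\beta\rVert_{1}\right\},\ \lambda_{0}>0,$$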

was first introduced by Belloni, Chernozhukov and Wang (2011) and allows one to simultaneously estimate the regression coefficients and the noise level. Thus, when tuning the estimator to obtain oracle properties, one can choose $\lambda_{0}$ not depending on the unknown noise level $\sigma$. The square root lasso estimator is studied in Belloni, Chernozhukov and Wang (2011), Sun and Zhang (2012) (where it is called scaled lasso), van de Geer (2016) and Stucky and van de Geer (2017), among others.

One can rewrite the minimization problem in the following form

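$$(\hat{\beta},\hat{\sigma}):=\arg\min_{\beta\in\mathbb{R}^{p},\ \sigma>0}\left\{\frac{\lVert Y-X\beta\rVert^{2}_{n}}{\sigma}+\sigma+2\lambda_{0}\lVert\beta\rVert_{1}\right\},\ \lambda_{0}>0.$$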

The objective function of this second expression of the estimator is not differentiable at $\sigma=0$ and thus if $\hat{\sigma}=0$ the KKT conditions do not hold. By differentiating the penalized loss and assuming that $\hat{\sigma}\not=0$ we get the KKT conditions

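$$\hat{\sigma}^{2}=\lVert Y-X\hat{\beta}\rVert^{2}_{n}\quad\text{and}\quad\frac{X^{\prime}(Y-X\hat{\beta})}{n}=\lambda_{0}\hat{\sigma}\,\partial\lVert\hat{\beta}\rVert_{1},$$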

where $\partial\lVert\hat{\beta}\rVert_{1}$ is the subdifferential of $\lVert{\beta}\rVert_{1}$ at $\beta=\hat{\beta}$ .
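The link between the two formulations of the square root lasso can be checked by minimizing over $\sigma$ for a fixed $\beta$. The short derivation below is a sketch, in which the shorthand $a:=\lVert Y-X\beta\rVert_{n}$ is introduced only for illustration and $a>0$ is assumed:

$$\frac{\partial}{\partial\sigma}\left(\frac{a^{2}}{\sigma}+\sigma\right)=1-\frac{a^{2}}{\sigma^{2}}=0\iff\sigma=a,\qquad\min_{\sigma>0}\left\{\frac{a^{2}}{\sigma}+\sigma+2\lambda_{0}\lVert\beta\rVert_{1}\right\}=2\left(a+\lambda_{0}\lVert\beta\rVert_{1}\right).$$

Up to the factor 2, profiling out $\sigma$ therefore gives back the square root lasso objective, and the minimizing value $\sigma=a$ agrees with the relation $\hat{\sigma}=\lVert Y-X\hat{\beta}\rVert_{n}$. When $a=0$ the infimum over $\sigma>0$ is not attained, which is the degenerate case $\hat{\sigma}=0$ excluded above.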

The papers Belloni, Chernozhukov and Wang (2011); Sun and Zhang (2012) propose algorithms to compute the square root lasso estimator, which are extended by Bunea, Lederer and She (2014) and Derumigny (2018) to the cases of the group square root lasso and of the square root slope respectively.

In this paper we focus on analysis estimators. Our interest is motivated by the possibility to apply the results to the case of total variation regularization. As it will turn out in Theorems 2.1 and 2.2, the choice of the tuning parameter $\lambda$ needed to ensure oracle properties for plain analysis estimators depends on the noise variance $\sigma^{2}$ , which might be unknown. Therefore, we are interested in the square root version of the analysis estimator: the square root analysis estimator

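$$\hat{f}_{\surd}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}\right\},\ \lambda_{0}>0.$$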

Indeed, square root estimators are known to be able to estimate the signal and the noise variance simultaneously and therefore allow for a choice of the tuning parameter $\lambda_{0}$ that does not depend on $\sigma$ to guarantee oracle inequalities. This will turn out to be the case in Theorems 3.1 and 3.2. Square root analysis estimators could be computed either by transforming them into square root synthesis estimators by using the insights provided by Ortelli and van de Geer (2019a) (which are largely based on Elad, Milanfar and Rubinstein (2007)) or by adapting to the square root case the algorithm provided by Tibshirani and Taylor (2011) to solve plain analysis problems.

We want to combine the arguments presented by van de Geer (2016) and Dalalyan, Hebiri and Lederer (2017) and extend them to the square root analysis estimator.

1.2 Contributions

The main points profiling our results are:

• we study directly the analysis estimator without passing through its synthesis formulation;

• we apply the projection arguments by Dalalyan, Hebiri and Lederer (2017) to the case of square root regularization;

• to do so we use projection theory for analysis operators.

We make the following contributions:

1. We present a framework for proving oracle inequalities with fast and slow rates for a general analysis estimator without transforming the analysis estimation problem into a synthesis estimation problem. This constitutes an analysis counterpart of the results obtained by Dalalyan, Hebiri and Lederer (2017) for the synthesis estimator.

2. Inspired by some remarks by Padilla et al. (2018), we introduce $r_{S_{0}}:=\text{dim}(\mathcal{N}(D_{-S_{0}}))$ as a measure of the sparsity of the signal (see Subsection 1.3 for the notation). In Hütter and Rigollet (2016), the sparsity of the true signal was measured by $\lVert Df^{0}\rVert_{0}$, while we argue that $r_{S_{0}}$ is more appropriate.

3. For the total variation regularized estimator on the path graph, we show that an analogue of the bound on the increments of the empirical process by projections presented by Dalalyan, Hebiri and Lederer (2017) is only off by log-terms from the one which can be obtained by entropy calculations, if we allow the tuning parameter $\lambda$ to depend on some aspects of $f^{0}\in\mathcal{G}(C)$. We thus match, up to log-terms, the result obtained by means of entropy calculations by Padilla et al. (2018) for general graphs and by Mammen and van de Geer (1997) for the path graph. Note that entropy calculations are not constant-friendly, while the bounds we present are and might be advantageous for a small enough value of $n$.

4. For the total variation regularized estimator over the cycle graph, we prove an oracle inequality with fast rates, which to our knowledge is a new contribution.

5. We adapt a lemma by van de Geer (2016), showing that the square root lasso does not overfit, to the case where the increments of the empirical process for the square root analysis estimator are bounded by means of the projection arguments by Dalalyan, Hebiri and Lederer (2017). This is a starting point for the development of oracle inequalities for the square root analysis estimator, which produce results analogous to the ones obtained for the plain analysis estimator (which match the ones found in Dalalyan, Hebiri and Lederer (2017)). We then narrow down these results to square root total variation regularized estimators on graphs.

1.3 Notation

Analysis operator $D$ . Let $D\in\mathbb{R}^{m\times n}$ be a given matrix. Let $\{d^{\prime}_{i}\}_{i\in[m]}$ denote the row vectors of $D$ . By $\mathcal{N}(D)$ , we denote the nullspace of $D$ , i.e. $\mathcal{N}(D):=\left\{x\in\mathbb{R}^{n}:Dx=0\right\}$ . Let $\mathcal{N}^{\perp}(D):=\left\{x\in\mathbb{R}^{n}:x^{\prime}z=0,\forall z\in\mathcal{N}(D)\right\}$ denote the orthogonal complement of $\mathcal{N}(D)$ . Note that $\mathcal{N}^{\perp}(D)=\text{rowspan}(D)$ . By penalizing $\lVert Df\rVert_{1}$ , we favor an estimator lying almost in $\mathcal{N}(D)$ , while we penalize estimators having high correlation with the rows of $D$ .

Active set $S\subset[m]$ . Let $S\subseteq[m]$ denote a subset of the row indices of $D$ . We denote the cardinality of the set $S$ by $s:=\lvert S\rvert$ . We write $-S:=[m]\setminus S$ . Moreover, we write $D_{S}=\{d_{i}^{\prime}\}_{i\in S}\in\mathbb{R}^{s\times n}$ and $D_{-S}=\{d_{i}^{\prime}\}_{i\in-S}\in\mathbb{R}^{(m-s)\times n}$ . For instance, let us suppose that, for $S_{0}\subseteq[m]$ , the true signal is s.t. $D_{S_{0}}f^{0}\not=0$ and $D_{-S_{0}}f^{0}=0$ . Then $S_{0}$ is the true active set for $Df^{0}$ , i.e. the set of indices of rows of $D$ , to which the true signal is not orthogonal.

Set of admissible active sets $\mathcal{S}$. Define $S(f):=\text{support}(Df)=\{j\in[m]:d_{j}^{\prime}f\not=0\}$ and

$$\mathcal{S}=\mathcal{S}(D):=\left\{S\subseteq[m]\mid\exists f\in\mathbb{R}^{n}:S=S(f)\right\}\subseteq\mathcal{P}([m]),$$

where $\mathcal{P}([m])$ denotes the power set of $[m]$. If $D$ is not of full row rank, then there might be some subsets of $[m]$ that cannot be the active sets of $Df$ for any $f\in\mathbb{R}^{n}$. Thus, from now on, we restrict our attention to active sets $S\in\mathcal{S}(D)$. More on this in a remark in Section 4.

The nullspace $\mathcal{N}(D_{-S})$. Note that, since $D_{-S(f)}f=0$, $f\in\mathcal{N}(D_{-S(f)})$. Thus $\mathcal{N}(D_{-S})$ encompasses all the signals $f$ s.t. $S\supseteq S(f)$. In a vector $f\in\mathbb{R}^{n}$ we have $n$ “pieces” of information. Note that $\mathcal{N}(D)$ can be nontrivial and thus the part of $f$ lying in $\mathcal{N}(D)$ will always be active, because it is not penalized. Moreover, since we can have $m>n$ and $\lVert Df\rVert_{0}>n$, we see that $\lVert Df\rVert_{0}=\lvert S(f)\rvert$ is not a good measure of the sparsity of the signal. We thus use as a measure of sparsity $r_{S}=\text{dim}(\mathcal{N}(D_{-S}))\leq n$, the number of pieces of information that the estimator would effectively have to estimate if the active set were $S$.
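The contrast between $\lVert Df\rVert_{0}$ and $r_{S}$ can be seen on small graphs. The sketch below computes $r_{S}=n-\text{rank}(D_{-S})$ numerically; the helper name `r_S`, the 0-based indexing and the two example graphs (a complete graph and a path graph) are assumptions chosen for illustration.

```python
import numpy as np
from itertools import combinations

def r_S(D, S):
    """dim N(D_{-S}) = n - rank(D_{-S}); S holds 0-based row indices of D."""
    m, n = D.shape
    active = set(int(i) for i in S)
    minus_S = [i for i in range(m) if i not in active]
    if not minus_S:                       # no rows left, the nullspace is R^n
        return n
    return n - int(np.linalg.matrix_rank(D[minus_S, :]))

# Complete graph on n = 4 vertices: m = 6 edges, so m > n.
n = 4
edges = list(combinations(range(n), 2))
D_complete = np.zeros((len(edges), n))
for i, (tail, head) in enumerate(edges):
    D_complete[i, tail], D_complete[i, head] = -1.0, 1.0

f = np.array([0.0, 1.0, 2.0, 3.0])        # all vertex values distinct
support = np.flatnonzero(D_complete @ f)  # S(f)
l0_norm = support.size                    # ||D f||_0 = 6, larger than n
r_complete = r_S(D_complete, support)     # bounded by n

# Path graph on 6 vertices with active edges {1, 3}: here r_S = s + 1.
D_path = np.diff(np.eye(6), axis=0)
r_path = r_S(D_path, [1, 3])
```

On the complete graph the count $\lVert Df\rVert_{0}$ exceeds the number of observations, while $r_{S}$ stays at most $n$. On the path graph $r_{S}$ equals the number of constant pieces of the signal.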

We use the shorthand notations $\mathcal{N}_{S}:=\mathcal{N}(D_{S})$ and $\mathcal{N}_{-S}:=\mathcal{N}(D_{-S})$ . Similarly, we write $\mathcal{N}_{S}^{\perp}:=\mathcal{N}^{\perp}(D_{S})$ and $\mathcal{N}_{-S}^{\perp}:=\mathcal{N}^{\perp}(D_{-S})$ . Note that $\mathcal{N}(D)=\mathcal{N}(D_{S})\cap\mathcal{N}(D_{-S}).$ Moreover, if $S,S^{\prime}\subseteq[m]$ are s.t. $S\subset S^{\prime}$ , then we have that $\mathcal{N}(D_{S})\supseteq\mathcal{N}(D_{S^{\prime}})$ . In addition, if the rows of $D_{S^{\prime}\setminus S}$ can be written as linear combinations of the rows of $D_{S}$ , then $\mathcal{N}(D_{S^{\prime}})=\mathcal{N}(D_{S})$ .

Diagonal matrices of weights. Let $\tilde{w}\in\mathbb{R}^{m}$ be a vector, for instance a vector of weights. For the diagonal matrix $\tilde{W}=\text{diag}(\{\tilde{w}_{i}\}_{i\in[m]})\in\mathbb{R}^{m\times m}$ we write $\tilde{W}_{S}:=\text{diag}(\{\tilde{w}_{i}\}_{i\in S})\in\mathbb{R}^{s\times s}$ and $\tilde{W}_{-S}:=\text{diag}(\{\tilde{w}_{i}\}_{i\in-S})\in\mathbb{R}^{(m-s)\times(m-s)}$ . We will need these notations for bounding the weighted weak compatibility constant, defined in Definition 1.1 below.

Linear projections. Let $\text{I}_{n}\in\mathbb{R}^{n\times n}$ denote the identity matrix and let $\mathbb{I}_{n}=\{1\}^{n\times n}$ .

Let $V\subset\mathbb{R}^{n}$ be a linear space. By $\Pi_{V}\in\mathbb{R}^{n\times n}$ we denote the orthogonal projection matrix onto $V$ and by $A_{V}:=\text{I}_{n}-\Pi_{V}$ the orthogonal antiprojection matrix onto $V$ .

Let $f\in\mathbb{R}^{n}$ . We write $f=(\Pi_{\mathcal{N}_{-S}}+\Pi_{\mathcal{N}_{-S}^{\perp}})f=:f_{\mathcal{N}_{-S}}+f_{\mathcal{N}_{-S}^{\perp}}$ , i.e. for a set $S\in\mathcal{S}(D)$ we decompose a signal $f$ into a low rank part (since usually $r_{S}$ will be small) orthogonal to $D_{-S}$ and a part collinear to $D_{-S}$ . We will use this decomposition when bounding the increments of the empirical processes in the proofs of the oracle inequalities.

Note that $\Pi_{\mathcal{N}_{-S}^{\perp}}=\text{I}_{n}-\Pi_{\mathcal{N}_{-S}}=:A_{\mathcal{N}_{-S}}$ and $\Pi_{\mathcal{N}_{-S}}=\text{I}_{n}-\Pi_{\mathcal{N}_{-S}^{\perp}}=:A_{\mathcal{N}_{-S}^{\perp}}$ .

Computing $\Pi_{\mathcal{N}^{\perp}_{-S}}$ . Let $S\in\mathcal{S}$ be a set of row indices of $D$ . We have that

$$\Pi_{\mathcal{N}^{\perp}(D_{-S})}=\Pi_{\text{rowspan}(D_{-S})}=D_{-S}^{+}D_{-S},$$

where $D_{-S}^{+}\in\mathbb{R}^{n\times(m-s)}$ denotes the Moore-Penrose pseudoinverse of $D_{-S}$ . If $D_{-S}\in\mathbb{R}^{(m-s)\times n}$ is of full row rank we have that $D_{-S}^{+}=D_{-S}^{\prime}(D_{-S}D_{-S}^{\prime})^{-1}$ .
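A small numerical sketch of this projection, using the path graph as a hypothetical example (the variable names and the choice of $S$ are assumptions made for illustration): the projection onto $\text{rowspan}(D_{-S})$ is formed with the Moore-Penrose pseudoinverse and compared with the full-row-rank formula.

```python
import numpy as np

n = 6
D = np.diff(np.eye(n), axis=0)             # path-graph incidence matrix, m = 5
S = [2]                                    # hypothetical active set (0-based rows)
minus_S = [i for i in range(D.shape[0]) if i not in S]
D_mS = D[minus_S, :]                       # D_{-S}

P_perp = np.linalg.pinv(D_mS) @ D_mS       # projection onto rowspan(D_{-S})
P_null = np.eye(n) - P_perp                # projection onto N(D_{-S})

# D_{-S} has full row rank here, so the explicit formula applies as well.
D_pinv_explicit = D_mS.T @ np.linalg.inv(D_mS @ D_mS.T)

# A signal whose only jump sits on the active edge lies in N(D_{-S}).
f = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])
f_null = P_null @ f                        # low-rank part of f
f_perp = P_perp @ f                        # part in rowspan(D_{-S})
```

The trace of `P_null` equals $r_{S}$, which is 2 in this example (one jump, two constant pieces).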

1.4 Model assumptions and preliminary definitions

1.4.1 Model assumptions

Throughout the paper we will use the following model, which assumes that we observe a signal contaminated with Gaussian noise. Let $f^{0}\in\mathbb{R}^{n}$ be a signal. We observe

$$Y=f^{0}+\epsilon,\ \epsilon\sim\mathcal{N}_{n}(0,\sigma^{2}\text{I}_{n}),\ \sigma\in(0,\infty).$$

Moreover, for an analysis operator $D\in\mathbb{R}^{m\times n}$ we will study the two following estimators.

• The analysis estimator $\hat{f}$ of $f^{0}$, defined as

$$\hat{f}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert^{2}_{n}+2\lambda\lVert Df\rVert_{1}\right\},\ \lambda>0,$$

• The square root analysis estimator $\hat{f}_{\surd}$ of $f^{0}$, defined as

$$\hat{f}_{\surd}:=\arg\min_{f\in\mathbb{R}^{n}}\left\{\lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}\right\},\ \lambda_{0}>0.$$

In particular, Section 2 will focus on the study of the analysis estimator $\hat{f}$ for a general analysis operator $D$, while Section 3 will deal with its square root counterpart $\hat{f}_{\surd}$. In Section 4 we will then apply the results of the two previous sections to total variation regularization.

1.4.2 Definitions

Let $D_{-S}^{+}$ denote the Moore-Penrose pseudoinverse of $D_{-S}$ and $d^{+}_{i}\in\mathbb{R}^{n},i\in[m-s]$ the column vectors of $D^{+}_{-S}$ .

We define the map $i^{*}:-S\mapsto[m-s]$ , s.t. $i^{*}(i)=\sum_{j=1}^{i}1_{\{j\in-S\}}$ . The index $i^{*}(i)$ denotes the row index of the $i^{\text{th}}$ row of $D$ , $i\in-S$ , in the matrix $D_{-S}$ .
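The index map $i^{*}$ simply counts how many rows of $-S$ occur up to and including row $i$. A minimal sketch with 1-based indices as in the text (the function name `i_star` and the example values of $m$ and $S$ are assumptions made for illustration):

```python
def i_star(i, S, m):
    """Position (1-based) of row i of D inside D_{-S}; requires i in -S."""
    if i in S or not 1 <= i <= m:
        raise ValueError("i must belong to -S, the complement of S in [m]")
    # Count the indices j <= i that are not in S.
    return sum(1 for j in range(1, i + 1) if j not in S)

m, S = 6, {2, 5}
mapping = {i: i_star(i, S, m) for i in range(1, m + 1) if i not in S}
```

With $m=6$ and $S=\{2,5\}$, rows 1, 3, 4 and 6 of $D$ become rows 1, 2, 3 and 4 of $D_{-S}$.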

We use a proof technique inspired by Dalalyan, Hebiri and Lederer (2017). The key aspect of this proof technique is to decompose the noise into two parts by using orthogonal projections:

• a part projected onto a low-rank linear subspace, which will be bounded by using the Cauchy-Schwarz inequality,

• a remainder (i.e. the antiprojection), involving the weights defined below, which will be bounded with more refined techniques. These techniques involve, in the case of oracle inequalities with fast rates, the weighted weak compatibility constant.

Definition 1.1 (Weighted weak compatibility constant)

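Let $\tilde{W}\in\mathbb{R}^{m\times m}$ be a diagonal matrix of weights with $\tilde{W}_{S}=\text{I}_{s}$ and $\lVert\tilde{W}\rVert_{\infty}\leq 1$ (e.g. as in Definition 1.4). The weighted weak compatibility constant $\kappa^{2}(S,\tilde{W})$ is defined as

$$\kappa^{2}(S,\tilde{W}):=r_{S}\min\left\{\lVert f\rVert^{2}_{n}:\lVert D_{S}f\rVert_{1}-\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}=1\right\}.$$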

Remark

This weighted weak compatibility constant extends the definition given by Dalalyan, Hebiri and Lederer (2017) to the case of analysis estimators. A distinguishing feature is the factor $r_{S}$. When $S=S(f^{0})=:S_{0}$, $r_{S_{0}}$ expresses the number of parameters to estimate.

Note that the weighted weak compatibility constant relaxes the definition of the compatibility constant given by Hütter and Rigollet (2016). There, one has to lower bound

$$\frac{r_{S}\lVert f\rVert^{2}_{n}}{\lVert D_{S}f\rVert^{2}_{1}},$$

while the weighted weak compatibility constant is applied to lower bound

$$\frac{r_{S}\lVert f\rVert^{2}_{n}}{\left(\lVert D_{S}f\rVert_{1}-\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}\right)^{2}},$$

which is easier, since the denominator is smaller. The additional term $\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}$ comes from the remainder term mentioned in the above sketch of the proof technique used in this paper.

Note that bounds on the compatibility constant by Hütter and Rigollet (2016) imply bounds on the weighted weak compatibility constant but the converse is not true. This is relevant, for instance, for the case of the total variation regularization over the path graph. In that case the bound by Hütter and Rigollet (2016) is too rough. One can obtain more refined bounds by studying the weighted weak compatibility constant as done in Dalalyan, Hebiri and Lederer (2017); Ortelli and van de Geer (2018). For instance, Ortelli and van de Geer (2018) showed that a bound on the weighted weak compatibility constant also holds for a certain class of tree graphs. We will show in Section 4 a new bound on the weighted weak compatibility constant for the total variation regularized estimator over the cycle graph.
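The one-directional implication can be made explicit in a single line. The following is a sketch under the setting of Definition 1.1, restricted to those $f$ for which the difference in the denominator is positive:

$$0<\lVert D_{S}f\rVert_{1}-\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}\leq\lVert D_{S}f\rVert_{1}\ \Longrightarrow\ \frac{r_{S}\lVert f\rVert^{2}_{n}}{\left(\lVert D_{S}f\rVert_{1}-\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}\right)^{2}}\geq\frac{r_{S}\lVert f\rVert^{2}_{n}}{\lVert D_{S}f\rVert^{2}_{1}}.$$

Hence any positive lower bound on the right-hand ratio carries over to the left-hand one. A lower bound on the left-hand ratio, by contrast, gives no control of the right-hand one when the subtracted term $\lVert\tilde{W}_{-S}D_{-S}f\rVert_{1}$ is close to $\lVert D_{S}f\rVert_{1}$.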

Definition 1.2 (Length of antiprojections)

In analogy to Dalalyan, Hebiri and Lederer (2017), the vector $\omega\in\mathbb{R}^{m}$ is defined as

[TABLE]

Moreover, we write $\Omega:=\text{diag}(\{\omega_{i}\}_{i\in[m]})\in\mathbb{R}^{m\times m}$ .

Note

One can see that, if $D_{-S}$ is of full row rank,

[TABLE]

We want to find a vector of weights with values in $[0,1]$, based on $\Omega$ defined above. We thus define hereafter the normalized inverse scaling factor $\gamma$ as the maximum entry of $\Omega$.

Definition 1.3 (Normalized inverse scaling factor)

In analogy to the quantity $\rho_{T}$ used by Dalalyan, Hebiri and Lederer (2017) and to the scaling factor defined by Hütter and Rigollet (2016), the normalized inverse scaling factor $\gamma=\gamma(D,S(S))$ is defined as

[TABLE]

We now normalize $\Omega$ by dividing its entries by $\gamma$ to obtain a vector of weights $w\in[0,1]^{m}$ .

Definition 1.4 (Weights)

In analogy to Dalalyan, Hebiri and Lederer (2017), the vector of weights $w\in\mathbb{R}^{m}$ is defined as

[TABLE]

Moreover, we write $W:=\text{diag}(\{w_{i}\}_{i\in[m]})\in\mathbb{R}^{m\times m}$ . Note that $W_{S}=\text{I}_{s}$ .
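The quantities $\omega$ and $\gamma$ can be computed numerically for small examples. The following is a minimal sketch for the path graph, assuming that $\omega_{i}$ is the $\lVert\cdot\rVert_{n}$-norm of the $i$-th column of the pseudoinverse $D^{+}_{-S}$ for $i\in-S$ and $\omega_{S}=0$, that $\lVert v\rVert_{n}^{2}=\lVert v\rVert_{2}^{2}/n$, and that $\gamma=\max_{i\in-S}\omega_{i}$. The helper names `path_incidence` and `antiprojection_lengths` are hypothetical and introduced only for illustration.

```python
import numpy as np

def path_incidence(n):
    """Incidence matrix of the path graph on n vertices, shape (n-1, n)."""
    D = np.zeros((n - 1, n))
    rows = np.arange(n - 1)
    D[rows, rows] = -1.0
    D[rows, rows + 1] = 1.0
    return D

def antiprojection_lengths(D, S):
    """Assumed convention: omega_i = ||i-th column of pinv(D_{-S})||_n on -S, 0 on S."""
    m, n = D.shape
    active = set(S)
    keep = [i for i in range(m) if i not in active]
    P = np.linalg.pinv(D[keep, :])          # shape (n, m - |S|)
    omega = np.zeros(m)
    # ||v||_n = ||v||_2 / sqrt(n) (assumed normalization of the empirical norm)
    omega[keep] = np.linalg.norm(P, axis=0) / np.sqrt(n)
    return omega

n = 12
S = [3, 7]   # 0-based row indices of the active edges: three components of 4 vertices
omega = antiprojection_lengths(path_incidence(n), S)
gamma = omega.max()
print(gamma, np.sqrt(4 / (4 * n)))   # gamma and sqrt(n_max / (4 n)) with n_max = 4
```

In this example every connected component of the path graph without the edges in $S$ has four vertices, and the computed $\gamma$ coincides with $\sqrt{n_{\max}/(4n)}$.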

2 Oracle inequalities for the analysis estimator

In this section we study the analysis estimator, which is defined as

[TABLE]

This section produces analogous results to Dalalyan, Hebiri and Lederer (2017). We however use an approach that does not take a detour via synthesis, but instead directly handles the analysis estimator. In Section 3 we are going to explore how this approach translates to the case of the square root analysis estimator.

2.1 Fast rates with compatibility conditions

Theorem 2.1 (Oracle inequality with fast rates for the analysis estimator)

Let $S\in\mathcal{S}$ be arbitrary and $x,\ t>0$. Choose $\lambda\geq\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}$. For the analysis estimator it holds that, $\forall f\in\mathbb{R}^{n}$, with probability at least $1-e^{-x}-e^{-t}$,

[TABLE]

Proof of Theorem 2.1.

See Appendix B. ∎
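The lower bound on the tuning parameter in Theorem 2.1 is straightforward to evaluate. The sketch below computes $\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}$; the function name `lambda_lower_bound` and the numerical values are hypothetical and serve only as an illustration.

```python
import math

def lambda_lower_bound(gamma, sigma, n, r_S, t):
    """Smallest lambda allowed by the condition of Theorem 2.1."""
    return gamma * sigma * math.sqrt(2.0 * math.log(2 * (n - r_S)) / n + 2.0 * t / n)

# Illustrative values only (not taken from the paper).
n, r_S = 1000, 5
lam = lambda_lower_bound(gamma=0.1, sigma=1.0, n=n, r_S=r_S, t=math.log(n))
print(lam)
```

A larger $t$ increases the probability $1-e^{-x}-e^{-t}$ with which the inequality holds, at the price of a larger admissible $\lambda$.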

2.2 Slow rates without compatibility conditions

Theorem 2.2 (Oracle inequality with slow rates for the analysis estimator)

Let $S\in\mathcal{S}$ be arbitrary and $x,\ t>0$. Choose $\lambda\geq\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}$. For the analysis estimator it holds that, $\forall f\in\mathbb{R}^{n}$, with probability at least $1-e^{-x}-e^{-t}$,

[TABLE]

Proof of Theorem 2.2.

See Appendix B. ∎

Remark

Theorem 2.2 does not need the assumption that the (weighted) compatibility constant is bounded away from zero.

3 Oracle inequalities for the square root analysis estimator

In this section we study the square root analysis estimator, defined as

[TABLE]

Throughout this section we will make use of the following assumption.

Assumption 3.1

Assume for some $a>0$ that $n>8a$ and that for some $R>0,\eta\in(0,1)$

[TABLE]

where,

[TABLE]

We assume that $S\in\mathcal{S}$ is s.t.

[TABLE]

Note

Assumption 3.1 is also an assumption on $S$ and will thus be a criterion to determine for which $S\in\mathcal{S}$ our (oracle) results hold.

For the square root analysis estimator, to get the KKT conditions we have to make sure that $\hat{\epsilon}:=Y-\hat{f}\not=0$ , i.e. that the estimator does not overfit.

The following lemma, showing that $\lVert\hat{\epsilon}\rVert_{n}>0$ , is an adaptation of Lemma 3.1 in van de Geer (2016) to the case of the square root analysis estimator where the increments of the empirical process are bounded by the projection arguments found in Dalalyan, Hebiri and Lederer (2017).

Lemma 3.1

Let $S\in\mathcal{S}$ be an arbitrary active set satisfying Assumption 3.1 and let $a>0$. Choose $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1 we have that, with probability at least $1-3e^{-a}-e^{-t}$,

[TABLE]

Proof of Lemma 3.1.

See Appendix C. ∎

Remark

While Lemma 3.1 by van de Geer (2016) only requires a lower bound on $\lVert\epsilon\rVert_{n}$ , Lemma 3.1 presented here requires that $\lVert\Pi_{\mathcal{N}(D_{-S})}\epsilon\rVert_{n}$ is upper and lower bounded and that $\lVert A_{\mathcal{N}(D_{-S})}\epsilon\rVert_{n}$ is lower bounded. It is the price to pay for a more refined technique to handle the increments of the empirical process.

Corollary 3.1

Let $S\in\mathcal{S}$ be an arbitrary active set satisfying Assumption 3.1 and let $a>0$. Choose $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1, we have that, with probability at least $1-3e^{-a}-e^{-t}$, $\lambda_{0}\lVert\hat{\epsilon}\rVert_{n}\geq R\lVert\epsilon\rVert_{n}$.

Proof of Corollary 3.1.

Under Assumption 3.1 on $\mathcal{A}\cap\mathcal{R}$ Lemma 3.1 holds and thus $\lVert\hat{\epsilon}\rVert_{n}\geq(1-\eta)\lVert\epsilon\rVert_{n}$ . It follows that $\frac{1}{1-\eta}\geq\frac{\lVert\epsilon\rVert_{n}}{\lVert\hat{\epsilon}\rVert_{n}}$ . By inserting this inequality into the assumption $\lambda_{0}\geq\frac{1}{1-\eta}R$ we get the claim. ∎
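Written out as a chain of inequalities, the argument of the proof reads as follows. This is only a restatement of the two ingredients used above, namely the assumption $\lambda_{0}\geq\frac{1}{1-\eta}R$ and the bound $\lVert\hat{\epsilon}\rVert_{n}\geq(1-\eta)\lVert\epsilon\rVert_{n}$ from Lemma 3.1.

```latex
\lambda_{0}\lVert\hat{\epsilon}\rVert_{n}
\;\geq\; \frac{R}{1-\eta}\,\lVert\hat{\epsilon}\rVert_{n}
\;\geq\; \frac{R}{1-\eta}\,(1-\eta)\,\lVert\epsilon\rVert_{n}
\;=\; R\,\lVert\epsilon\rVert_{n}.
```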

We now present oracle inequalities for the square root analysis estimator with fast and slow rates. The results are similar to Theorems 2.1 and 2.2 up to the constants and the assumptions one has to make.

3.1 Fast rates with compatibility conditions

Theorem 3.1 (Oracle inequality with fast rates for the square root analysis estimator)

Let $S\in\mathcal{S}$ be an arbitrary active set satisfying Assumption 3.1 and let $a>0$. For $\eta\in(0,1)$, choose $\lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1, $\forall f\in\mathbb{R}^{n}$, it holds that, with probability at least $1-4e^{-a}-e^{-t}$,

[TABLE]

Proof of Theorem 3.1.

See Appendix C. ∎
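The condition on $\lambda_{0}$ in Theorem 3.1 and the admissible range of $t$ can be evaluated as in the following sketch. The function names `t_upper_limit` and `lambda0_lower_bound` and the numerical values are hypothetical. Note that the expression involves no noise level $\sigma$, and that the restriction on $t$ is exactly what keeps the square root below one.

```python
import math

def t_upper_limit(n, r_S):
    """Right end point of the admissible interval for t."""
    return (n - 1) / 2.0 - math.log(2 * (n - r_S))

def lambda0_lower_bound(gamma, eta, n, r_S, t):
    """Smallest lambda_0 allowed by the condition of Theorem 3.1."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    if not 0.0 < t < t_upper_limit(n, r_S):
        raise ValueError("t is outside the admissible interval")
    root = math.sqrt((2.0 * math.log(2 * (n - r_S)) + 2.0 * t) / (n - 1))
    return gamma / (1.0 - eta) * root

# Illustrative values only (not taken from the paper).
n, r_S, eta, gamma = 1000, 5, 0.5, 0.1
t = math.log(n)
lam0 = lambda0_lower_bound(gamma, eta, n, r_S, t)
root = lam0 * (1.0 - eta) / gamma   # the square-root factor, always < 1
print(lam0, root)
```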

3.2 Slow rates without compatibility conditions

Theorem 3.2 (Oracle inequality with slow rates for the square root analysis estimator)

Let $S\in\mathcal{S}$ be an arbitrary active set satisfying Assumption 3.1 and let $a>0$. For $\eta\in(0,1)$, choose $\lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1, $\forall f\in\mathbb{R}^{n}$, it holds that, with probability at least $1-4e^{-a}-e^{-t}$,

[TABLE]

Proof of Theorem 3.2.

See Appendix C. ∎

Remark

The claim of Theorem 3.2 implies also the simpler inequality

[TABLE]

Remark

For ease of exposition, we can simplify Assumption 3.1 on $\lVert Df^{0}\rVert_{1}$ to $\lVert Df^{0}\rVert_{1}=\mathcal{O}(1/\lambda_{0})$. Note that if we take $\lambda_{0}\asymp\gamma\sqrt{\log n/n}$, then the assumption becomes

[TABLE]

If $\lVert Df^{0}\rVert_{1}$ is growing with $n$ , then the rates obtained with the slow rate oracle inequality by setting $f=f^{0}$ will be slower as well. In particular, if $\lVert Df^{0}\rVert_{1}\asymp 1/\lambda_{0}$ , then Theorem 3.2 does not guarantee the convergence in $\lVert\cdot\rVert_{n}$ .

Remark

The choice of the tuning parameter $\lambda_{0}$ depends on $S$ through $\gamma$ . Therefore, in practice, the oracle inequalities will only hold for certain active sets $S$ . To find out for which $S$ the oracle inequality holds with high probability we proceed as follows.

We choose $a>0$ , $t\in(0,{(n-1)}/{2}-\log(2n))$ , $\eta\in(0,1)$ and $\lambda_{0}>0$ . Then, an active set $S$ for which the oracle inequality holds has to satisfy the following requirements:

[TABLE]

and

[TABLE]

4 Total variation

4.1 Incidence matrices

Let $\vec{G}=(V,E)$ be a general directed graph, where the set $V=[n]$ is the set of vertices and the set $E=\{e_{1},\ldots,e_{m}\}$ is the set of edges. Let $D_{\vec{G}}\in\{-1,0,1\}^{m\times n}$ be the incidence matrix of $\vec{G}$ (for more details see Subsubsection 1.1.2). In this section we will set $D=D_{\vec{G}}$ . It is known that the rank of $D$ is given by the number of vertices of $\vec{G}$ minus its number of connected components.
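The rank statement can be checked numerically on a small graph. The sketch below is illustrative only: it assumes the orientation convention $-1$ at the tail and $+1$ at the head of each edge, uses 0-based vertex labels, and the helper names `incidence_matrix` and `count_components` are hypothetical.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Incidence matrix with -1 at the tail and +1 at the head of each edge."""
    D = np.zeros((len(edges), n))
    for row, (tail, head) in enumerate(edges):
        D[row, tail] = -1.0
        D[row, head] = 1.0
    return D

def count_components(n, edges):
    """Number of connected components via union-find (edge direction ignored)."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in range(n)})

# A triangle, a single edge and an isolated vertex: three connected components.
n = 6
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
D = incidence_matrix(n, edges)
rank_D = np.linalg.matrix_rank(D)
components = count_components(n, edges)
print(rank_D, n - components)
```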

We now consider a set $S\in\mathcal{S}$ . Let us define the set of edges $E_{S}:=\{e_{i}\in E,i\in S\}$ . The number of connected components of $\vec{G}_{-S}:=(V,E\setminus E_{S})$ is $r_{S}$ . These connected components can be any sort of graph: tree or non-tree graphs.

Let $n_{1},\ldots,n_{r_{S}}$ be the number of vertices of each connected component $\vec{C}_{i}:=([n_{i}],E_{i}),i\in[r_{S}]$ of $\vec{G}_{-S}$ . Let us define $n_{\min}:=\min\{n_{1},\ldots,n_{r_{S}}\}$ and $n_{\max}:=\max\{n_{1},\ldots,n_{r_{S}}\}$ . The matrix $D_{-S}$ can be rewritten as block matrix by rearranging rows and columns. From now on, when writing $D_{-S}$ we intend the matrix in its block form.

By Lemma 1 in Ijiri (1965) we have that

[TABLE]

Remark

The restriction to the class $\mathcal{S}$ can be seen as a requirement to have an active set $S$ which makes sense. The incidence matrix of any connected graph has rank $n-1$. However, the incidence matrices of graphs containing cycles, such as the cycle graph or the two dimensional grid graph, have more than $n-1$ rows. The dimension $r_{S}$ of $\mathcal{N}_{D_{-S}}$ is the number of connected patches of the graph on which the signal is constant, if the active set is $S$. A non-empty active set means that the signal should have at least two constant pieces, otherwise no edge would be active.

If the active set is $S=\emptyset$ , then the dimension of $\mathcal{N}_{D_{-S}}$ is one. Now consider for instance the cycle graph. If $S=\{i\}$ for an $i\in[n]$ , then the dimension of $\mathcal{N}_{D_{-S}}$ is still one. Thus this active set does not make sense at all since it would imply that we have a constant signal on the cycle graph but yet also a non-empty active set. Indeed, it is impossible to find a constant signal on a graph which results in some active edges.

For tree graphs, we have that $\mathcal{S}=\mathcal{P}([m])$ , while for graph structures containing cycles we have that $\mathcal{S}\subset\mathcal{P}([m])$ . In particular, for the cycle graph it holds that $\mathcal{S}=\mathcal{P}([m])\setminus\{\{1\},\ldots,\{n\}\}$ .
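One way to see why singletons are excluded for the cycle graph: the edge differences of any signal sum to zero along the cycle, so $Df$ cannot have exactly one non-zero entry. The following sketch checks this numerically; the orientation convention ($-1$ at the tail, $+1$ at the head), the 0-based labels and the helper name `cycle_incidence` are assumptions made for illustration.

```python
import numpy as np

def cycle_incidence(n):
    """Incidence matrix of the cycle 0 -> 1 -> ... -> n-1 -> 0, shape (n, n)."""
    D = np.zeros((n, n))
    idx = np.arange(n)
    D[idx, idx] = -1.0
    D[idx, (idx + 1) % n] = 1.0
    return D

n = 8
D = cycle_incidence(n)
rng = np.random.default_rng(0)
f = rng.normal(size=n)

jumps = D @ f               # differences of f along the edges of the cycle
total = jumps.sum()         # telescopes to zero for every f
column_sums = np.ones(n) @ D
print(total, column_sums)
```

Since the rows of $D$ sum to the zero vector, a vector $Df$ with a single non-zero entry would have a non-zero sum, which is impossible.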

4.1.1 Trees and cycles

If $\vec{G}$ is a tree or a cycle graph, then the connected components $\vec{C}_{i}=([n_{i}],E_{i}),i\in[r_{S}]$ of $\vec{G}_{-S},S\not=\emptyset$ are tree graphs, i.e. connected graphs with $\lvert E_{i}\rvert=n_{i}-1,i\in[r_{S}]$ . Let $D_{\vec{C}_{i}}\in\mathbb{R}^{(n_{i}-1)\times n_{i}},\ i\in[r_{S}]$ be the incidence matrices of the tree graphs $\vec{C}_{i},\ i\in[r_{S}]$ .

Lemma 4.1 (Upper bound for the normalized inverse scaling factor)

Let $\vec{G}$ be a tree graph. Then, $\forall S\in\mathcal{S}(D_{\vec{G}})$ , the normalized inverse scaling factor $\gamma=\max_{i\in-S}\omega_{i}$ is bounded by

[TABLE]

Let $S\not=\emptyset$ and let $\vec{G}$ be a cycle graph. Then

[TABLE]

Proof of Lemma 4.1.

See Appendix D. ∎

4.1.2 Two dimensional grid graph

We report and slightly adapt the bound on the normalized inverse scaling factor for the two dimensional grid graph by Hütter and Rigollet (2016).

Lemma 4.2 (Proposition 4 in Hütter and Rigollet (2016))

Let $\vec{G}$ be a two dimensional $\sqrt{n}\times\sqrt{n}$ grid graph. Let $S\in\mathcal{S}$ be s.t. the connected components of $\vec{G}_{-S}$ are square two dimensional grid graphs. Then, for some sufficiently large constant $C>0$, the normalized inverse scaling factor $\gamma=\max_{i\in-S}\omega_{i}$ is bounded by

[TABLE]

4.2 Fast rates

To prove oracle inequalities with fast rates we need to find an explicit lower bound for the weighted compatibility constant.

Results for the analysis estimator on the path graph have already been obtained by Ortelli and van de Geer (2018). We extend them to the square root analysis estimator. Moreover, we also show that the tools developed in Ortelli and van de Geer (2018), together with the new framework presented here, allow us to handle the case of the cycle graph. We are aware of results treating the ${k}^{\text{th}}$ power graphs of cycles (Hütter and Rigollet (2016)) but not of any oracle inequality implying the convergence of the mean squared error for the case of the cycle graph.

4.2.1 Path graph

We now consider the path graph $\vec{G}=([n],\{(1,2),\ldots,(n-1,n)\})$ , for which $\mathcal{S}=\mathcal{P}([m])$ and $S=S$ .

We see that $D_{-S}$ is a block matrix, where the blocks are incidence matrices of some smaller path graphs. By recycling the proof of Lemma 4.1 we obtain that

[TABLE]

The following lemma by van de Geer (2018), later also used in Ortelli and van de Geer (2018), allows us to lower bound $\kappa(S,W)$ , for a diagonal matrix $W=\text{diag}(\{w_{j}\}_{j\in[n-1]})\in\mathbb{R}^{(n-1)\times(n-1)}$ with $\lVert W\rVert_{\infty}\leq 1$ and where by convention we choose $w_{n}=1$ .

Lemma 4.3 (Theorem 15 and Lemma 21 in van de Geer (2018))

Assume that $S\subseteq[m]$ is s.t. $n_{1},n_{r_{S}}\geq 2$ and $n_{i}\geq 4,\forall i\in\{n_{2},\ldots,n_{r_{S}-1}\}$ . Then

[TABLE]

and the inequality is tight. Moreover

[TABLE]

Proof of Lemma 4.3.

The first statement follows from Theorem 15 and the second from Lemma 21 in van de Geer (2018). The proofs are also presented in Ortelli and van de Geer (2018), in Lemmas 5.3-5. ∎

In Ortelli and van de Geer (2018) it is explained that to bound the weak weighted compatibility constant for the path graph one needs to cut it into $s$ smaller modules. These modules lie around an edge in $S$ and consist of at least one additional edge on each side of the edge in $S$, see Figure 1. Therefore we see that the assumption $n_{i}\geq 4,\forall i\in[s+1]$ guarantees that we are in a situation where the bounds on the weak weighted compatibility constant apply. Indeed, if $n_{i}\geq 4,\forall i\in[s+1]$ we have at least four vertices on the left and on the right of each edge in $S$ and thus we can decompose the path graph into modules being at least as large as the one shown in Figure 1. Since the oracle inequalities with fast rates presented here are based on the bound on the weighted weak compatibility constant by Ortelli and van de Geer (2018), for fast rates we will require that $n_{i}\geq 4,\forall i\in[s+1]$.

The edges not in $S$ between modules can be ignored. Each module needs at least $4$ vertices, s.t. we need $\lvert S\rvert\leq n/4$ to hope to be able to upper bound $\kappa^{2}(S,W)$ by using the method proposed by van de Geer (2018). Moreover, a vertex not involved in an edge in $S$ can only be involved in one module to obtain the bounds exposed in Ortelli and van de Geer (2018).
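Whether a given active set on the path graph satisfies the minimal length condition $n_{i}\geq 4$ can be checked directly from the positions of the active edges. The sketch below assumes that edge $i$ joins the vertices $i$ and $i+1$ (1-based), so that removing the edges in $S$ cuts the path into $\lvert S\rvert+1$ consecutive pieces; the function name `path_component_sizes` is hypothetical.

```python
def path_component_sizes(n, S):
    """Sizes n_1, ..., n_{|S|+1} of the pieces of the path graph on n vertices
    after removing the edges in S (edge i joins vertices i and i+1, 1-based)."""
    cuts = sorted(S)
    bounds = [0] + cuts + [n]
    return [right - left for left, right in zip(bounds[:-1], bounds[1:])]

n = 20
sizes_ok = path_component_sizes(n, {5, 9, 15})   # pieces of 5, 4, 6 and 5 vertices
sizes_bad = path_component_sizes(n, {5, 7})      # the middle piece has only 2 vertices

print(sizes_ok, min(sizes_ok) >= 4)
print(sizes_bad, min(sizes_bad) >= 4)
```

The sizes always add up to $n$, so requiring $n_{i}\geq 4$ for all $\lvert S\rvert+1$ pieces forces $4(\lvert S\rvert+1)\leq n$.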

Note also that the weights in $w$ have a direct correspondence to the edges of the graph, where the edges in $S$ are s.t. $\omega_{S}=0$ . Moreover, the weights for the edges between modules can be chosen arbitrarily when it comes to bounding $\lVert D_{S}f\rVert_{1}-\lVert W_{-S}D_{-S}f\rVert_{1}$ from above, even if a value for them can be obtained by computation of the $\lVert\cdot\rVert_{n}$ -norm of the corresponding columns of $D^{+}_{-S}$ .

We take the arbitrary decision to use the convention $w_{n}:=1$ , as in Lemma 4.3.

Lemma 4.4

Assume that $S\subseteq[m]$ is s.t. $n_{i}\geq 4,\ \forall i\in[r_{S}]$ . We have that

[TABLE]

Proof of Lemma 4.4.

See the proof of Corollary 5.6 in Ortelli and van de Geer (2018). ∎

Let $D\in\mathbb{R}^{(n-1)\times n}$ be the incidence matrix of the path graph with $n$ vertices. With the tools developed we can prove the following corollaries.

Analysis estimator on the path graph

Corollary 4.1 below is a result already found in Ortelli and van de Geer (2018). It is reported here for comparison with the analogous result obtained for the square root analysis estimator on the path graph (see Corollary 4.3). Corollary 4.2 follows directly from Corollary 4.1.

Corollary 4.1

Let $S\subseteq[m]$ be an arbitrary active set s.t. $n_{\min}\geq 4$ and let $x,\ t>0$. Choose $\lambda\geq\sigma\sqrt{n_{\max}(\log(2n)+t)}/n$. Then, $\forall f\in\mathbb{R}^{n}$, it holds that, with probability at least $1-e^{-t}-e^{-x}$,

[TABLE]

Proof of Corollary 4.1.

See Appendix D. ∎

Due to the use of the bound given by Lemma 4.4, Corollary 4.1 assumes a minimal length condition. This condition does not depend on $n$ and is therefore weaker than the one found in Guntuboyina et al. (2017). Note that the choice of the tuning parameter depends both on $\sigma$ and $n_{\max}=n_{\max}(S)$ .

The next corollary makes a stronger assumption on $S$ .

Corollary 4.2

Let $S\subseteq[m]$ be an arbitrary active set s.t. $n_{\min}=n_{\max}\geq 4$ and $n_{\max}$ even. Let $x,\ t>0$. Choose $\lambda\geq\sigma\sqrt{\frac{\log(2n)+t}{r_{S}n}}$. Then, $\forall f\in\mathbb{R}^{n}$, it holds that, with probability at least $1-e^{-t}-e^{-x}$,

[TABLE]

Proof of Corollary 4.2.

If $n_{\min}=n_{\max}$ , then $n_{i}=n/r_{S},\ \forall i\in[r_{S}]$ . Moreover, $K\leq 4r_{S}^{2}/n$ and the statement of Corollary 4.2 follows by plugging in these insights into Corollary 4.1. ∎

Corollary 4.2 says that, if $n_{\min}=n_{\max}$, then we can choose $\lambda$ smaller than the universal choice $\lambda\asymp\sigma\sqrt{\log n/n}$. The choice of the constant-friendly tuning parameter in the two corollaries above assumes however the knowledge of some aspects of the oracle signal $f$ minimizing the right hand side and can be seen as a motivation to choose the tuning parameter smaller than the universal choice if we know or suspect a certain specific structure for it. These insights were already developed by Dalalyan, Hebiri and Lederer (2017) and applied to total variation on the path graph in the case of slow rates.
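The relation between the tuning parameters of Corollaries 4.1 and 4.2 is a matter of arithmetic: with $n_{\max}=n/r_{S}$ the two lower bounds coincide, and they are smaller by a factor $1/\sqrt{r_{S}}$ than the same expression with $n_{\max}$ replaced by $n$. The sketch below checks this numerically; the function names and the numerical values are hypothetical.

```python
import math

def lambda_cor_4_1(sigma, n, n_max, t):
    """Lower bound sigma * sqrt(n_max (log(2n) + t)) / n from Corollary 4.1."""
    return sigma * math.sqrt(n_max * (math.log(2 * n) + t)) / n

def lambda_cor_4_2(sigma, n, r_S, t):
    """Lower bound sigma * sqrt((log(2n) + t) / (r_S n)) from Corollary 4.2."""
    return sigma * math.sqrt((math.log(2 * n) + t) / (r_S * n))

# Illustrative values: r_S = 10 pieces of equal (even) length n_max = 100.
sigma, n, r_S = 1.0, 1000, 10
n_max = n // r_S
t = math.log(n)

lam_41 = lambda_cor_4_1(sigma, n, n_max, t)
lam_42 = lambda_cor_4_2(sigma, n, r_S, t)
lam_no_structure = lambda_cor_4_1(sigma, n, n, t)   # same formula with n_max = n
print(lam_41, lam_42, lam_42 / lam_no_structure)    # ratio equals 1 / sqrt(r_S)
```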

Square root analysis estimator on the path graph

We now extend the results obtained for the analysis estimator to the case of the square root analysis estimator.

Corollary 4.3

Let $S\subseteq[m]$ be an arbitrary active set having $n_{\min}\geq 4$ and satisfying Assumption 3.1. Let $a>0$ and $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Choose $\lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}$. Then, under Assumption 3.1, $\forall f\in\mathbb{R}^{n}$, for the square root version of the total variation regularized estimator over the path graph it holds that, with probability at least $1-e^{-t}-4e^{-a}$,

[TABLE]

Proof of Corollary 4.3.

The proof of Corollary 4.3 is analogous to the proof of Corollary 4.1. ∎

Corollary 4.4

Let $S\subseteq[m]$ be an arbitrary active set having $n_{\min}=n_{\max}\geq 4$ with $n_{\max}$ even and satisfying Assumption 3.1. Let $a>0$ and $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Choose $\lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}$. Then, under Assumption 3.1, $\forall f\in\mathbb{R}^{n}$, for the square root version of the total variation regularized estimator over the path graph it holds that, with probability at least $1-e^{-t}-4e^{-a}$,

[TABLE]

Proof of Corollary 4.4.

If $n_{\min}=n_{\max}$ , then $n_{i}=n/r_{S},\ \forall i\in[r_{S}]$ . Moreover, $K\leq 4r_{S}^{2}/n$ and the statement of Corollary 4.4 follows by plugging in these insights into Corollary 4.3. ∎

Remark

We notice that there is a tradeoff in the choice of $\eta$ . A small $\eta$ will result in a narrower bound for $\lVert\hat{\epsilon}\rVert_{n}$ in terms of $\lVert\epsilon\rVert_{n}$ and in smaller constants in the tuning parameter and in the oracle bound. However, it might result in a more restrictive condition on $S$ in Assumption 3.1.

4.2.2 Cycle graph

We consider the cycle graph $\vec{G}=([n],\{(1,2),\ldots,(n-1,n),(n,1)\})$ and its incidence matrix $D\in\mathbb{R}^{n\times n}$. We have $\mathcal{S}(D)=\mathcal{P}([m])\setminus\{\{1\},\ldots,\{n\}\}$.

We bound the weighted compatibility constant by cutting the graph into smaller modules as we explained in Subsubsection 4.2.1 for the path graph.

By concatenating such modules, one can obtain a path graph. Whether or not the two ends of the path graph are joined by an edge is not relevant for the possibility to bound the compatibility constant and obtain an oracle inequality with fast rates, since the edges connecting such modules are neglected in the bound.

Remark

Note that for the path graph we have that $r_{S}=\lvert S\rvert+1$ , while for the cycle graph it holds that $r_{S}=\lvert S\rvert$ .
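The two counts can be verified numerically by computing $r_{S}$ as the dimension of the null space of $D_{-S}$, i.e. $n$ minus the rank of $D_{-S}$. The sketch below is illustrative: it uses 0-based labels for vertices and edges, the orientation convention $-1$ at the tail and $+1$ at the head, and hypothetical helper names.

```python
import numpy as np

def incidence_matrix(n, edges):
    """Incidence matrix with -1 at the tail and +1 at the head of each edge."""
    D = np.zeros((len(edges), n))
    for row, (tail, head) in enumerate(edges):
        D[row, tail] = -1.0
        D[row, head] = 1.0
    return D

def nullity_without(D, S):
    """r_S = dim of the null space of D_{-S} = n - rank(D_{-S})."""
    D_minus = np.delete(D, sorted(S), axis=0)
    return D.shape[1] - np.linalg.matrix_rank(D_minus)

n = 10
path_edges = [(i, i + 1) for i in range(n - 1)]
cycle_edges = path_edges + [(n - 1, 0)]
D_path = incidence_matrix(n, path_edges)
D_cycle = incidence_matrix(n, cycle_edges)

S = {2, 5}
r_path = nullity_without(D_path, S)         # |S| + 1 pieces on the path
r_cycle = nullity_without(D_cycle, S)       # |S| pieces on the cycle
r_single = nullity_without(D_cycle, {2})    # one removed edge leaves the cycle connected
print(r_path, r_cycle, r_single)
```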

Corollary 4.5

Assume that $S\in\mathcal{S}$ is s.t. $n_{i}\geq 4,\ \forall i\in[r_{S}]$ . Then

[TABLE]

and the inequality is tight. Moreover

[TABLE]

Proof of Corollary 4.5.

Corollary 4.5 follows from Lemma 4.3 and from the considerations above. ∎

Remark

From Lemma 4.3 we get that, if $S\in\mathcal{S}$ is s.t. $n_{i}\geq 4,\ \forall i\in[r_{S}]$ , then

[TABLE]

We now have all the tools to derive an oracle inequality for the total variation regularized estimator over the cycle graph and its square root version.

Analysis estimator on the cycle graph

Corollary 4.6

Let $S\in\mathcal{S}\setminus\emptyset$ be an arbitrary active set with $n_{\min}\geq 4$ and let $x,t>0$. Choose $\lambda\geq\sigma\sqrt{n_{\max}(\log(2n)+t)}/n$. Then, $\forall f\in\mathbb{R}^{n}$, for the total variation regularized estimator over the cycle graph it holds that, with probability at least $1-e^{-t}-e^{-x}$,

[TABLE]

An analogous version of Corollary 4.2 can be derived from Corollary 4.6.

Square root analysis estimator on the cycle graph

Corollary 4.7

Let $S\in\mathcal{S}\setminus\emptyset$ be an arbitrary active set having $n_{\min}\geq 4$ and satisfying Assumption 3.1. Let $a>0$ and $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Choose $\lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}$. Then, $\forall f\in\mathbb{R}^{n}$, for the square root version of the total variation regularized estimator over the cycle graph it holds that, with probability at least $1-e^{-t}-4e^{-a}$,

[TABLE]

An analogous version of Corollary 4.4 can be derived from Corollary 4.7.

4.3 Slow rates

Note that in the case of the so-called slow rates we do not need to lower bound the compatibility constant.

4.3.1 Trees and cycles

In this subsection we identify the analysis operator $D$ with the incidence matrix of a general tree or cycle graph $\vec{G}$ .

Analysis estimator on trees and cycles

Corollary 4.8

Let $\vec{G}$ be a tree or a cycle graph. Let $S\in\mathcal{S}(D_{\vec{G}})$ (and under the condition $S\not=\emptyset$ for cycle graphs) be arbitrary and let $x,t>0$. Choose $\lambda\geq\sigma\sqrt{n_{\max}(\log(2n)+t)}/n$. Then, $\forall f\in\mathbb{R}^{n}$, we have that, with probability at least $1-e^{-x}-e^{-t}$,

[TABLE]

Proof of Corollary 4.8.

Corollary 4.8 follows by combining Theorem 2.2 and Lemma 4.1. ∎

Corollary 4.9

Let $\vec{G}$ be a tree or a cycle graph. Let $S\in\mathcal{S}(D_{\vec{G}})$ (with the condition $S\not=\emptyset$ for cycle graphs) having $n_{\max}=n_{\min}$ be arbitrary and let $x,t>0$. Choose $\lambda\geq\sigma\sqrt{\frac{\log(2n)+t}{r_{S}n}}$. Then, $\forall f\in\mathbb{R}^{n}$, we have that, with probability at least $1-e^{-x}-e^{-t}$,

[TABLE]

Square root analysis estimator on trees and cycles

Corollary 4.10

Let $\vec{G}$ be a tree or a cycle graph. Let $S\in\mathcal{S}(D_{\vec{G}})$ (and under the condition $S\not=\emptyset$ for cycle graphs) be an arbitrary active set satisfying Assumption 3.1. Let $a>0$ and $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Choose $\lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}$. Then, $\forall f\in\mathbb{R}^{n}$, it holds that under Assumption 3.1, with probability at least $1-e^{-t}-4e^{-a}$,

[TABLE]

Proof of Corollary 4.10.

Corollary 4.10 follows by combining Theorem 3.2 and Lemma 4.1. ∎

Corollary 4.11

Let $\vec{G}$ be a tree or a cycle graph. Let $S\in\mathcal{S}(D_{\vec{G}})$ (with the condition $S\not=\emptyset$ for cycle graphs) be an arbitrary active set having $n_{\max}=n_{\min}$ and satisfying Assumption 3.1. Let $a>0$ and $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Choose $\lambda_{0}\geq\frac{1}{1-\eta}\sqrt{\frac{\log(2n)+t}{r_{S}(n-1)}}$. Then, $\forall f\in\mathbb{R}^{n}$, it holds that under Assumption 3.1, with probability at least $1-e^{-t}-4e^{-a}$,

[TABLE]

4.3.2 Two dimensional grid graph

In this subsection we identify the analysis operator $D$ with the incidence matrix of a square two dimensional grid graph $\vec{G}$ .

Analysis estimator on the two dimensional grid

Corollary 4.12

Let $\vec{G}$ be a square two dimensional grid graph. Let $S\in\mathcal{S}(D_{\vec{G}})$ be an arbitrary active set s.t. the connected components of $\vec{G}_{-S}$ are square two dimensional grid graphs and let $x,t>0$. For a constant $C>0$ large enough, choose $\lambda\geq C\sigma\sqrt{\log n(\log(2n)+t)}/n$. Then, $\forall f\in\mathbb{R}^{n}$, we have that, with probability at least $1-e^{-x}-e^{-t}$,

[TABLE]

Proof of Corollary 4.12.

Corollary 4.12 follows by combining Theorem 2.2 and Lemma 4.2. ∎

Square root analysis estimator on the two dimensional grid

Corollary 4.13

Let $\vec{G}$ be a square two dimensional grid graph. Let $S\in\mathcal{S}(D_{\vec{G}})$ be an arbitrary active set s.t. the connected components of $\vec{G}_{-S}$ are square two dimensional grid graphs and satisfying Assumption 3.1. Let $a>0$ and $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. For a constant $C>0$ large enough, choose $\lambda_{0}\geq\frac{C}{1-\eta}\sqrt{\frac{\log n(\log(2n)+t)}{n(n-1)}}$. Then, $\forall f\in\mathbb{R}^{n}$, it holds that under Assumption 3.1, with probability at least $1-e^{-t}-4e^{-a}$,

[TABLE]

Proof of Corollary 4.13.

Corollary 4.13 follows by combining Theorem 3.2 and Lemma 4.2. ∎

4.3.3 Comparison with other results

Consider Corollary 4.9 with the choice $f=f^{0}$ and assume that $\sigma$ does not depend on $n$ . Then the following holds with probability at least $1-e^{-x}-e^{-t}$ .

• With $r_{S}\asymp n^{1/3}(\log(2n)+t)^{1/3}\lVert Df^{0}\rVert_{1}^{2/3}$, we have $\lVert\hat{f}-f^{0}\rVert^{2}_{n}=\mathcal{O}(n^{-2/3}(\log(2n)+t)^{1/3}\lVert Df^{0}\rVert_{1}^{2/3})$ and $\lambda$ explicitly depends on $f^{0}$.

• With $r_{S}\asymp n^{1/3}(\log(2n)+t)^{1/3}$, we have $\lVert\hat{f}-f^{0}\rVert^{2}_{n}=\mathcal{O}(n^{-2/3}(\log(2n)+t)^{1/3}\lVert Df^{0}\rVert_{1})$ and $\lambda$ does not explicitly depend on $f^{0}$.
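The two choices of $r_{S}$ can be checked numerically. The sketch below is only an illustration under stated assumptions: it drops all constants and assumes that the slow-rate bound is driven by a penalty term $\lambda\lVert Df^{0}\rVert_{1}$ and a variance term $\sigma^{2}r_{S}/n$, with $\lambda$ taken at the lower bound of Corollary 4.9. The function name `slow_rate_terms`, the variable `tv` (standing for $\lVert Df^{0}\rVert_{1}$) and the numerical values are hypothetical, and $r_{S}$ is treated as a real number rather than an integer.

```python
import math

def slow_rate_terms(n, tv, r_s, t=1.0, sigma=1.0):
    """Return (lam, penalty_term, variance_term) with all constants dropped.

    Assumed shape of the bound (an illustration, not the exact statement):
    penalty_term = lam * tv and variance_term = sigma^2 * r_s / n, where
    lam = sigma * sqrt((log(2n) + t) / (r_s * n)) as in Corollary 4.9.
    """
    log_term = math.log(2 * n) + t
    lam = sigma * math.sqrt(log_term / (r_s * n))
    return lam, lam * tv, sigma ** 2 * r_s / n

n, tv, t = 10_000, 5.0, 1.0          # hypothetical sample size and ||D f^0||_1
log_term = math.log(2 * n) + t

# First choice: r_S depends on ||D f^0||_1, hence so does lambda.
r_oracle = n ** (1 / 3) * log_term ** (1 / 3) * tv ** (2 / 3)
lam_a, pen_a, var_a = slow_rate_terms(n, tv, r_oracle, t)

# Second choice: r_S is free of f^0, hence so is lambda.
r_free = n ** (1 / 3) * log_term ** (1 / 3)
lam_b, pen_b, var_b = slow_rate_terms(n, tv, r_free, t)

# Rates stated in the two bullet points (constants dropped).
rate_a = n ** (-2 / 3) * log_term ** (1 / 3) * tv ** (2 / 3)
rate_b = n ** (-2 / 3) * log_term ** (1 / 3) * tv

print("choice 1:", pen_a, var_a, rate_a)
print("choice 2:", pen_b, var_b, rate_b)
```

Under the first choice the two terms balance exactly, which is why the exponent of $\lVert Df^{0}\rVert_{1}$ improves from $1$ to $2/3$; under the second choice the penalty term dominates whenever $\lVert Df^{0}\rVert_{1}\geq 1$.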

One can reason analogously starting from Corollary 4.11 for the square root analysis estimator.

In both cases, if $\lVert Df^{0}\rVert_{1}=\mathcal{O}(1)$ we obtain that $\lVert\hat{f}-f^{0}\rVert^{2}_{n}=\mathcal{O}(n^{-2/3}\log^{1/3}(n))$. However, it is known that the minimax rate for that case (when the graph considered is the path graph) is $\lVert\hat{f}-f^{0}\rVert^{2}_{n}=\mathcal{O}(n^{-2/3})$ and thus our results contain a superfluous log-term. The results about the minimax rate over the class of functions with bounded total variation obtained by entropy calculations (Mammen and van de Geer (1997) and references therein) are not constant-friendly, so that it may well be that, for $n$ small enough, the log-term is smaller than the constants of the entropy arguments.

The same remark applies to the case of tree graphs of bounded maximal degrees. For such graphs, Padilla et al. (2018) proved that the minimax rate of estimation of $f^{0}:\lVert D_{\vec{G}}f^{0}\rVert_{1}\leq C$ is $n^{-2/3}C^{2/3}$ . Moreover, they proved by entropy arguments that the total variation regularized estimator achieves the minimax rate. We prove that this minimax rate is achieved by the (square root) total variation regularized estimator up to a log term by using constant-friendly arguments (cf. Corollary 4.9 and 4.11).

We thus saw that for the path graph, the constant-friendly projection argument introduced by Dalalyan, Hebiri and Lederer (2017) to handle the increments of the empirical process might produce optimal rates up to a log-term for both the total variation regularized estimator and the square root total variation regularized estimator.

Another question is whether we can retrieve almost minimax rates by Corollary 4.12 for $D_{\vec{G}}$ being the incidence matrix of a two dimensional grid graph. For that case, the minimax rate is $\sqrt{\log n/n}$ (Sadhanala, Wang and Tibshirani (2016)) and an oracle inequality proved by Hütter and Rigollet (2016) almost retrieves it. Moreover, a natural scaling for that case is $\lVert D_{\vec{G}}f^{0}\rVert_{1}\asymp n^{1/2}$ (Sadhanala, Wang and Tibshirani (2016)). Note that the part of Assumption 3.1 concerning $\lVert D_{\vec{G}}f^{0}\rVert_{1}$, which translates to $\lVert Df^{0}\rVert_{1}=\mathcal{O}\left(n/\log n\right)$, is thus satisfied.

Thus, for $t,\ x>0$ fixed, from Corollaries 4.12 and 4.13 we get that, if $S_{0}$ is s.t. the connected components of $\vec{G}_{-S_{0}}$ are square two dimensional grid graphs and

[TABLE]

under the canonical scaling $\lVert Df^{0}\rVert_{1}\asymp n^{1/2}$ we have the rate

[TABLE]

which corresponds to the minimax rate up to a log term. Note however that, due to the utilization of Lemma 4.2, Corollaries 4.12 and 4.13, from which this insight is derived, are not constant-friendly.

5 Conclusion

We introduced a class of active sets dependent on the analysis operator $D$ , to which it is natural to restrict the attention. Indeed, as some examples from total variation regularization on graphs show, there can be some elements of $\mathcal{P}([m])$ which can not be seen as true active sets of any signal, depending on the graph structure.

We then derived oracle inequalities with fast rates under some compatibility conditions and oracle inequalities with slow rates. The results with fast rates show that, if one can find a suitable bound on the weighted weak compatibility constant, the analysis estimator and its square root version are adaptive, i.e. they can adapt to the unknown sparsity of $Df^{0}$. For both the analysis and the square root analysis estimators, the results with slow rates were used as a tool to retrieve, in a simple and constant-friendly way, minimax rates obtained by entropy calculations, at the price of an extra log factor. The choice of the tuning parameters $\lambda$ and $\lambda_{0}$, which includes some information about the structure of the analysis operator $D$ and of the active set $S$ via the inverse scaling factor $\gamma$, seems to be advantageous in theoretical terms and allows us to show that the “slow” rates can almost match the minimax lower bound for the total variation regularized estimator on graph structures such as the path graph and tree graphs with bounded maximal degree.

We obtained parallel and very similar results for both the analysis and the square root analysis estimators. The differences in these results come from the fact that for the square root analysis estimator we first have to prove that the estimator does not overfit and that the KKT conditions hold. In spite of being mathematically more involved, the results for the square root analysis estimator tell us that we can get with high probability theoretical guarantees being very similar to the ones obtained for the analysis estimator by choosing a tuning parameter not depending on the unknown noise level. This fact might be helpful in practice and might speak in favor of the utilization of the square root analysis estimator.

We then narrowed down our results to (square root) total variation regularized estimators over graphs. For fast rates we considered the cases of the path graph and of the cycle graph. In these cases we were able to show that the compatibility conditions are satisfied.

For the case of slow rates, we obtained oracle inequalities matching up to a log term the optimal rate over the path graph, the two dimensional grid graph and tree graphs of maximal bounded degree. These results do not require any compatibility condition.

These oracle inequalities can be read in two ways. Either we choose a smaller tuning parameter depending on $S$ and obtain better rates, or we choose a larger tuning parameter not depending on $S$ and get worse rates. This might be a justification for incorporating any available prior knowledge of $S$ into the tuning parameter.

The main tool used to derive the oracle inequalities presented in this paper is a bound on the increments of the empirical process inspired by the projection arguments by Dalalyan, Hebiri and Lederer (2017). This bound is very simple and constant-friendly, while entropy bounds are more involved and can have large constants. There are two routes one can take after having bounded the increments of the empirical process by projection arguments. Either one uses a more refined version of the bound on the increments of the empirical process and then bounds the compatibility constant to derive fast rates. Or one bounds the increments of the empirical process in a rougher way and obtains oracle inequalities with slow rates. In this last case one only needs to bound the inverse scaling factor. Bounds on the inverse scaling factor can be very simple and constant-friendly, while bounds on the compatibility constant can sometimes lead to large constants (cf. Ortelli and van de Geer (2019b)). Moreover, results with slow rates have been shown to almost retrieve the minimax rate in a constant-friendly way also in other settings, for instance in higher order total variation regularization (Ortelli and van de Geer (2019b)). If we compare the results obtained by entropy calculations with our results with slow rates, we see that, at the expense of a log term, we are able to retrieve almost the same rate by two simple steps: the constant-friendly bound on the increments of the empirical process and the bound on the inverse scaling factor. The bound on the inverse scaling factor is constant-friendly for graph structures such as tree graphs and cycle graphs, while the bound on the inverse scaling factor for the two dimensional grid graph, which we borrow from Hütter and Rigollet (2016), is more involved.
For total variation regularized estimators on the path graph and on tree graphs of bounded maximal degree, we thus obtain nonasymptotic counterparts, in form of oracle inequalities with slow rates, to results found in the previous literature (Mammen and van de Geer (1997); Padilla et al. (2018)).

A question for further investigation is whether the framework presented here can be used to obtain oracle inequalities with fast rates for other graph structures. The answer depends on the ability to lower bound the compatibility constant for graphs other than tree graphs and cycles. We leave this question to future research.

Appendix A Probability inequalities

We state three lemmas that help us deal with the random part of the oracle inequalities.

Lemma A.1 (The maximum of $p$ random variables, Lemma 17.5 in van de Geer (2016))

Let $V_{1},\ldots,V_{p}$ be real valued random variables. Assume $\forall j\in\{1,\ldots,p\}$ and $\forall r>0$ that $\mathbb{E}[e^{r\lvert V_{j}\rvert}]\leq 2e^{\frac{r^{2}}{2}}$ . Then, $\forall t>0$

[TABLE]
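As a sanity check, the tail bound can be simulated. The display of Lemma A.1 is not reproduced above; the sketch below assumes it has the form $\mathbb{P}(\max_{j}\lvert V_{j}\rvert\geq\sqrt{2\log(2p)+2t})\leq e^{-t}$, which is consistent with the choice of $\lambda$ in Lemma B.2. It uses independent standard normal $V_{j}$ as one special case satisfying the moment condition (the lemma itself does not need independence). The function name `max_abs_exceedance` and the values of `p`, `t`, `trials` and `seed` are hypothetical.

```python
import math
import random

def max_abs_exceedance(p, t, trials=2000, seed=0):
    """Monte Carlo estimate of P(max_j |V_j| >= sqrt(2 log(2p) + 2t))
    for independent standard normal V_1, ..., V_p."""
    rng = random.Random(seed)
    threshold = math.sqrt(2 * math.log(2 * p) + 2 * t)
    hits = 0
    for _ in range(trials):
        largest = max(abs(rng.gauss(0.0, 1.0)) for _ in range(p))
        if largest >= threshold:
            hits += 1
    return hits / trials

p, t = 50, 1.0
estimate = max_abs_exceedance(p, t)
bound = math.exp(-t)  # assumed right-hand side of the lemma
print(f"empirical exceedance: {estimate:.4f}, assumed bound: {bound:.4f}")
```

In this Gaussian case the empirical frequency sits well below $e^{-t}$, which illustrates that the bound is simple rather than tight.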

Lemma A.2 (The special case of $\chi^{2}$ random variables, Lemma 1 in Laurent and Massart (2000), Lemma 8.6 in van de Geer (2016))

Let $X\sim\chi^{2}_{d}$ . Then, $\forall x>0$

[TABLE]
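A similar simulation illustrates the $\chi^{2}$ bounds. The display is not reproduced above; the sketch assumes the form of Lemma 1 in Laurent and Massart (2000), namely $\mathbb{P}(X\geq d+2\sqrt{dx}+2x)\leq e^{-x}$ and $\mathbb{P}(X\leq d-2\sqrt{dx})\leq e^{-x}$. The function name `chi2_tail_estimates` and the values of `d`, `x`, `trials` and `seed` are hypothetical.

```python
import math
import random

def chi2_tail_estimates(d, x, trials=5000, seed=0):
    """Monte Carlo estimates of the upper and lower tail probabilities
    P(X >= d + 2 sqrt(d x) + 2 x) and P(X <= d - 2 sqrt(d x)) for X ~ chi^2_d."""
    rng = random.Random(seed)
    upper = d + 2 * math.sqrt(d * x) + 2 * x
    lower = d - 2 * math.sqrt(d * x)
    above = 0
    below = 0
    for _ in range(trials):
        # A chi^2_d variable is a sum of d squared standard normals.
        value = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if value >= upper:
            above += 1
        if value <= lower:
            below += 1
    return above / trials, below / trials

d, x = 10, 1.0
p_above, p_below = chi2_tail_estimates(d, x)
bound = math.exp(-x)  # assumed right-hand side of both inequalities
print(f"upper tail: {p_above:.4f}, lower tail: {p_below:.4f}, bound: {bound:.4f}")
```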

Remark

Note that from Lemma A.2 it follows that

[TABLE]

Lemma A.3 (Lemma 8.1 in van de Geer (2016))

For $n\geq 2$ , let $\epsilon\sim\mathcal{N}_{n}(0,\sigma^{2}\text{I}_{n})$ . Then, $\forall u\in\mathbb{R}^{n}:\lVert u\rVert_{n}=1$ we have that, for $t\in(0,(n-1)/2)$ ,

[TABLE]

Remark

Let $u_{1},\ldots,u_{p}\in\mathbb{R}^{n}$ be vectors. Then by the union bound and by Lemma A.3 we have that for $t^{\prime}\in(0,(n-1)/2)$

[TABLE]

Now select $t=t^{\prime}-\log(2p)$ . Then we have that for $t\in(0,(n-1)/2-\log(2p))$ ,

[TABLE]

Appendix B Proofs of Section 2

B.1 Basic inequality

The case of the analysis estimator is simpler than that of the square root analysis estimator, because the basic inequality holds without any extra conditions.

Lemma B.1 (Basic inequality)

For the analysis estimator we have the so called basic inequality, i.e. $\forall f\in\mathbb{R}^{n}$

[TABLE]

Proof of Lemma B.1.

The KKT conditions for the analysis estimator write as

[TABLE]

Thanks to the chain rule of the subdifferential, $D^{\prime}\partial\lVert D\hat{f}\rVert_{1}$ is the subdifferential of $\lVert Df\rVert_{1}$ with respect to $f$ at $\hat{f}$ . We have that, for $\hat{f}\in\mathbb{R}^{n}$ , ${\hat{f}^{\prime}(Y-\hat{f})}/{n}=\lambda\lVert D\hat{f}\rVert_{1}$ and that, for a generic $f\in\mathbb{R}^{n}$ , ${f^{\prime}(Y-\hat{f})}/{n}=\lambda(Df)^{\prime}\partial\lVert D\hat{f}\rVert_{1}\leq\lambda\lVert Df\rVert_{1}$ , where the last inequality follows by the dual norm inequality and by the fact that $\lVert\partial\lVert D\hat{f}\rVert_{1}\rVert_{\infty}\leq 1$ .

By subtracting the first of the two above expressions from the second, we find that

[TABLE]

By polarization we obtain the basic inequality

[TABLE]

∎
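For readers who want the steps of this proof written out, the following is a sketch under two assumptions that are standard in this setting but not restated in this appendix: the model $Y=f^{0}+\epsilon$ and the normalization $\lVert v\rVert_{n}^{2}=\lVert v\rVert_{2}^{2}/n$. It starts from the difference of the two KKT identities stated in the proof and may differ in presentation from the displays of the paper.

```latex
\begin{align*}
\frac{(f-\hat f)'(Y-\hat f)}{n}
  &\le \lambda\bigl(\lVert Df\rVert_1-\lVert D\hat f\rVert_1\bigr)
  && \text{(difference of the two KKT identities)}\\
\frac{(\hat f-f)'(\hat f-f^0)}{n}
  &\le \frac{\epsilon'(\hat f-f)}{n}
     +\lambda\bigl(\lVert Df\rVert_1-\lVert D\hat f\rVert_1\bigr)
  && \text{(insert } Y=f^0+\epsilon\text{)}\\
2(\hat f-f)'(\hat f-f^0)
  &= \lVert\hat f-f\rVert_2^2+\lVert\hat f-f^0\rVert_2^2-\lVert f-f^0\rVert_2^2
  && \text{(polarization)}\\
\lVert\hat f-f^0\rVert_n^2+\lVert\hat f-f\rVert_n^2
  &\le \lVert f-f^0\rVert_n^2+\frac{2\epsilon'(\hat f-f)}{n}
     +2\lambda\bigl(\lVert Df\rVert_1-\lVert D\hat f\rVert_1\bigr)
\end{align*}
```

The last line follows by multiplying the second line by $2$, substituting the polarization identity and dividing the squared norms by $n$.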

B.2 Bound on the increments of the empirical process

Lemma B.2

Let $S\in\mathcal{S}$ be arbitrary and $x,\ t>0$. Choose $\lambda\geq\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}$. Then, $\forall f\in\mathbb{R}^{n}$, it holds that, with probability at least $1-e^{-x}-e^{-t}$,

[TABLE]

Proof of Lemma B.2.

We have that

[TABLE]

We have that, since $D_{-S}$ is of full rank,

[TABLE]

For $\lambda>0$ define the set

[TABLE]

where $V_{i}={\epsilon^{\prime}d^{+}_{i}}/{(\sigma\lVert d^{+}_{i}\rVert_{2})}\sim\mathcal{N}(0,1),i\in[m-s]$ , since $\epsilon^{\prime}d^{+}_{i}\sim\mathcal{N}(0,\sigma^{2}\lVert d^{+}_{i}\rVert^{2}_{2})$ .

Since $\gamma=\lVert\Omega\rVert_{\infty}$ , on $\mathcal{T}$ we have that

[TABLE]

To find a lower bound on $\mathbb{P}(\mathcal{T})$ we apply Lemma A.1 to $\mathcal{T}$ .

The moment generating function of $\lvert V_{i}\rvert$ is $\mathbb{E}\left[e^{r\lvert V_{i}\rvert}\right]=2(1-\Phi(-r))e^{\frac{r^{2}}{2}}\leq 2e^{\frac{r^{2}}{2}},\ \forall r>0$ .
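The closed form of this moment generating function, and the inequality that lets Lemma A.1 apply, can be verified numerically. The sketch below compares $2(1-\Phi(-r))e^{r^{2}/2}$ with a midpoint-rule approximation of $2\int_{0}^{\infty}e^{rv}\varphi(v)\,dv$ for a standard normal density $\varphi$. The function names, the truncation point `upper` and the grid size `steps` are hypothetical choices made for illustration.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mgf_abs_closed_form(r):
    """Closed form 2 (1 - Phi(-r)) exp(r^2 / 2) of E[exp(r |V|)], V ~ N(0, 1)."""
    return 2.0 * (1.0 - std_normal_cdf(-r)) * math.exp(r * r / 2.0)

def mgf_abs_numeric(r, upper=20.0, steps=100_000):
    """Midpoint rule for 2 * integral_0^upper exp(r v) phi(v) dv."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * h
        total += math.exp(r * v - v * v / 2.0)
    return 2.0 * total * h / math.sqrt(2.0 * math.pi)

for r in (0.5, 1.0, 2.0, 3.0):
    closed = mgf_abs_closed_form(r)
    numeric = mgf_abs_numeric(r)
    cap = 2.0 * math.exp(r * r / 2.0)  # the bound required by Lemma A.1
    print(f"r={r}: closed={closed:.6f}, numeric={numeric:.6f}, cap={cap:.6f}")
```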

Choosing, for some $t>0$, $\lambda\geq\gamma\sigma\sqrt{2(\log(2(n-r_{S}))+t)/n}$, e.g. $\lambda=\gamma\sigma\sqrt{2(\log(2n)+t)/n}$, and applying Lemma A.1 with $p=m-s=n-r_{S}$ and $t>0$, we obtain that $\mathbb{P}(\mathcal{T})\geq 1-e^{-t}$.

We have that

[TABLE]

For $x>0$ , define the set

[TABLE]

On $\mathcal{X}$ we have that

[TABLE]

Since $\mathcal{N}(D_{-S})$ is a linear space of dimension $r_{S}$ , we have that

[TABLE]

Moreover note that

[TABLE]

By applying Lemma A.2 for some $x>0$ we thus get that $\mathbb{P}(\mathcal{X})\geq 1-e^{-x}$ .

∎

Remark

To obtain fast rates by using compatibility conditions one makes use of the more refined bound given by Lemma B.2 involving $\lVert\Omega_{-S}D_{-S}f\rVert_{1}/\gamma$ . This term will flow into the weighted compatibility constant.

To obtain slow rates without needing compatibility conditions one utilizes the less refined version of the bound given by Lemma B.2 involving $\lVert D_{-S}f\rVert_{1}$ .

B.3 Proof of the oracle inequalities

Proof of Theorem 2.1.

By Lemma B.1 we have the basic inequality. By the triangle inequality, we have

[TABLE]

We now handle the random part, which is constituted by an increment of the empirical process, by using Lemma B.2. By Lemma B.2 we have that with probability at least $1-e^{-x}-e^{-t}$ ,

[TABLE]

Putting the pieces together, we get that,

[TABLE]

If $\kappa(S,W)>0$ we have that

[TABLE]

and thus

[TABLE]

where the last inequality follows by $2ab\leq a^{2}+b^{2},a,b\in\mathbb{R}$ .

The term $\lVert\hat{f}-f\rVert^{2}_{n}$ cancels out and we get the statement of the theorem.

∎

Proof of Theorem 2.2.

By Lemma B.1 we have the basic inequality. By Lemma B.2, we have that with probability at least $1-e^{-x}-e^{-t}$ ,

[TABLE]

We thus get that

[TABLE]

∎

Appendix C Proofs of Section 3

Define for $S\in\mathcal{S}$

[TABLE]

For $a>0,R>0$ , define the sets $\mathcal{R}:=\left\{\gamma\hat{R}\leq R\right\}$ ,

[TABLE]

and

[TABLE]

Note that on $\mathcal{A}^{\prime}$ we have that, by the Cauchy-Schwarz inequality,

[TABLE]

Remark

By Lemma A.2 (Lemma 1 in Laurent and Massart (2000)) we have that for $a>0$ both $\mathbb{P}(\mathcal{A})\geq 1-3e^{-a}$ and $\mathbb{P}(\mathcal{A}^{\prime})\geq 1-4e^{-a}$ hold true.

Moreover by Lemma A.3 (Lemma 8.1 in van de Geer (2016)) and using the union bound, we see that if we choose

[TABLE]

we have that $\mathbb{P}(\mathcal{R})\geq 1-e^{-t}$ . Thus, by such a choice of $R$ we get that

[TABLE]

Remark

For simplicity of exposition, we chose the same parameter $a$ for the upper and lower bounds for both $\lVert\Pi_{\mathcal{N}(D_{-S})}\epsilon\rVert_{n}$ and $\lVert A_{\mathcal{N}(D_{-S})}\epsilon\rVert_{n}$. However, one could of course choose four different parameters, say $a_{i},\ i\in[4]$, for the four different bounds and obtain results holding with probability $1-e^{-t}-\sum_{i=1}^{3}e^{-a_{i}}$ resp. $1-e^{-t}-\sum_{i=1}^{4}e^{-a_{i}}$.

C.1 Proving that the square root analysis estimator does not overfit

Proof of Lemma 3.1.

Assumption 3.1 expresses a particular choice of the constant $c$ in Proposition C.1 below. For $\eta\in(0,1)$ we have that $\eta/2\leq\eta/(1+\eta)$ and thus the choice of $c$ in Assumption 3.1 satisfies the upper bound given by Proposition C.1 (see below), which then holds, since all of its assumptions are satisfied and we consider the sets $\mathcal{A}\cap\mathcal{R}$.

The choice of $c$ implies that $q=\eta/2$ and that $c\leq\eta/2$ . Thus the claim follows.

By the first remark of Appendix C, if we choose $a>0$ and $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,{(n-1)}/{2}-\log(2(n-r_{S})))$, then $\mathbb{P}(\mathcal{A}\cap\mathcal{R})\geq 1-3e^{-a}-e^{-t}$.

∎

Proposition C.1 (The square root analysis estimator does not overfit)

Assume for some $a>0$ that $n>8a$ and that for some $R>0,\eta\in(0,1)$

[TABLE]

where

[TABLE]

We assume that $S\in\mathcal{S}$ is s.t.

[TABLE]

Let

[TABLE]

Let $a>0$. Choose $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,{(n-1)}/{2}-\log(2(n-r_{S})))$. Then with probability at least $1-3e^{-a}-e^{-t}$ it holds that

[TABLE]

Proof of Proposition C.1, based on the proof of Lemma 3.1 by van de Geer (2016).

On the set $\mathcal{A}$ we have that, by the Cauchy-Schwarz inequality,

[TABLE]

Thus,

[TABLE]

We now show an upper and a lower bound for $\lVert\hat{\epsilon}\rVert_{n}$ .

Upper bound:

Since the estimator $\hat{f}_{\sqrt{\,}}$ minimizes the objective function we have that

[TABLE]

It follows that

[TABLE]

Lower bound:

Note that, by the triangle inequality, we have that

[TABLE]

Thus the lemma follows if we can prove a bound of the type $\lVert\hat{f}_{\sqrt{\,}}-f^{0}\rVert_{n}\leq\text{const.}\lVert\epsilon\rVert_{n}$, with leading constant in $(0,1)$. We are not allowed to use the KKT conditions. Instead we use the convexity of the loss function and of the penalty.

Define for $t\in(0,1)$ the convex combination $\hat{f}_{t}:=t\hat{f}_{\sqrt{\,}}+(1-t)f^{0}$ and its residuals

[TABLE]

Choose

[TABLE]

Then

[TABLE]

We thus get that

[TABLE]

By the convexity of the loss and of the penalty and by the fact that $\hat{f}_{\sqrt{\,}}$ is a minimizer of the objective function it follows that

[TABLE]

By squaring the inequality we get that

[TABLE]

We have that

[TABLE]

By combining the squared inequality with the lower bound for $\lVert\hat{\epsilon}_{t}\rVert_{n}$ and the expression for $\lVert\hat{\epsilon}_{t}\rVert_{n}^{2}$ we get that

[TABLE]

On $\mathcal{R}$ , for an $S$ satisfying the assumptions of the lemma, we have that

[TABLE]

Thus

[TABLE]

Moreover we have that

[TABLE]

Thus we obtain that

[TABLE]

and

[TABLE]

Note that

[TABLE]

By using the spectral decomposition, $\Pi_{\mathcal{N}(D_{-S})}\in\mathbb{R}^{n\times n}$ can be written as $PP^{\prime}$, where $P\in\mathbb{R}^{n\times r_{S}}$ is s.t. $P^{\prime}P=\text{I}_{r_{S}}$. Moreover $A_{\mathcal{N}(D_{-S})}\in\mathbb{R}^{n\times n}$ can be written as $QQ^{\prime}$, where $Q\in\mathbb{R}^{n\times(n-r_{S})}$ is s.t. $Q^{\prime}Q=\text{I}_{n-r_{S}}$ and $Q^{\prime}P=0$.

Let $u:=P^{\prime}\epsilon\in\mathbb{R}^{r_{S}}$ and $v:=Q^{\prime}\epsilon\in\mathbb{R}^{n-r_{S}}$ . We have that $u\sim\mathcal{N}_{r_{S}}(0,\sigma^{2}\text{I}_{r_{S}})$ , $v\sim\mathcal{N}_{n-r_{S}}(0,\sigma^{2}\text{I}_{n-r_{S}})$ and $u$ and $v$ are independent. We have that $\lVert\Pi_{\mathcal{N}(D_{-S})}\epsilon\rVert^{2}_{2}=\lVert u\rVert^{2}_{2}$ and that $\lVert A_{\mathcal{N}(D_{-S})}\epsilon\rVert^{2}_{2}=\lVert v\rVert^{2}_{2}$ . It follows that

[TABLE]

and thus the two terms are independent and can be handled separately.

On $\mathcal{A}$ we have that

[TABLE]

Therefore

[TABLE]

It follows that

[TABLE]

By expressing $\lVert\hat{f}_{t}-f^{0}\rVert_{n}$ more explicitly we get that

[TABLE]

We conclude that

[TABLE]

The last step is to find out how to choose $c$ s.t. $q\eta/(\eta-q)<1$ . We get that $q<\eta/(1+\eta)$ , hence

[TABLE]

Note that we also get the assumption $p<\eta/(1+\eta)$ , which results in the assumption

[TABLE]

Note that the result holds on $\mathcal{A}\cap\mathcal{R}$, which by the first remark of Appendix C has probability at least $1-3e^{-a}-e^{-t}$ for $a>0$ and $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,{(n-1)}/{2}-\log(2(n-r_{S})))$. ∎

C.2 Basic inequality

Lemma C.1

Let $S\in\mathcal{S}$ be an arbitrary active set satisfying Assumption 3.1 and let $a>0$. For $\eta\in(0,1)$, choose $\lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,{(n-1)}/{2}-\log(2(n-r_{S})))$. Under Assumption 3.1, it holds that $\forall f\in\mathbb{R}^{n}$, with probability at least $1-3e^{-a}-e^{-t}$,

[TABLE]

Proof of Lemma C.1.

Under Assumption 3.1, on $\mathcal{A}\cap\mathcal{R}$ the KKT conditions hold

[TABLE]

We then obtain the basic inequality as in Lemma B.1 (cf. also Lemma 2 in Stucky and van de Geer (2017)). Note that by the first remark of Appendix C, the choice of $\lambda_{0}$ implies that $\mathbb{P}(\mathcal{A}\cap\mathcal{R})\geq 1-3e^{-a}-e^{-t}$. ∎

C.3 Bound on the increments of the empirical process

Lemma C.2

Let $S\in\mathcal{S}$ be an arbitrary active set satisfying Assumption 3.1 and let $a>0$. For $\eta\in(0,1)$, choose $\lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,{(n-1)}/{2}-\log(2(n-r_{S})))$. Under Assumption 3.1 we have that $\forall f\in\mathbb{R}^{n}$, with probability at least $1-3e^{-a}-e^{-t}$,

[TABLE]

Proof of Lemma C.2.

On $\mathcal{R}$ , by using the decomposition in antiprojection and projection onto the nullspace of $D_{-S}$ and by applying the dual norm inequality to the second term we have that

[TABLE]

Moreover, on $\mathcal{A}\cap\mathcal{R}$, under Assumption 3.1, by Corollary 3.1 we have that $R\lVert\epsilon\rVert_{n}\leq\lambda_{0}\lVert\hat{\epsilon}\rVert_{n}$ and thus the claim follows. Note that the choice of $\lambda_{0}$ implies, by the first remark of Appendix C, that $\mathbb{P}(\mathcal{A}\cap\mathcal{R})\geq 1-3e^{-a}-e^{-t}$. ∎

C.4 Proof of the oracle inequalities

Proof of Theorem 3.1.

We work under Assumption 3.1 on $\mathcal{A}^{\prime}\cap\mathcal{R}$ . By combining Lemma C.1 and Lemma C.2, we get that, in complete analogy to the proof of Theorem 2.1,

[TABLE]

Moreover, by Corollary 3.1, we have that on $\mathcal{A}^{\prime}$

[TABLE]

Thus we get that

[TABLE]

Since Assumption 3.1 implies that $\eta<1$ and $n>8a$ we get that $(1+\eta)(1+\sqrt{4a/n})\leq 4$ and

[TABLE]

By the first remark of Appendix C and the choice of $\lambda_{0}$ in the statement of the theorem, we have that $\mathbb{P}(\mathcal{A}^{\prime}\cap\mathcal{R})\geq 1-4e^{-a}-e^{-t}$. ∎

Proof of Theorem 3.2.

We work under Assumption 3.1 on $\mathcal{A}^{\prime}\cap\mathcal{R}$ . By Lemma C.1 and Lemma C.2 we get that, in analogy with the proof of Theorem 2.2,

[TABLE]

By Corollary 3.1 we have that

[TABLE]

Moreover on $\mathcal{A}^{\prime}$

[TABLE]

We thus get that

[TABLE]

By the first remark of Appendix C and the choice of $\lambda_{0}$ in the statement of the theorem, we have that $\mathbb{P}(\mathcal{A}^{\prime}\cap\mathcal{R})\geq 1-4e^{-a}-e^{-t}$. ∎

Appendix D Proofs of Section 4

Proof of Lemma 4.1.

Notice that for a cycle graph, all elements of $\mathcal{S}\setminus\{\emptyset\}$ have cardinality at least $s=2$ (cf. the earlier remark on cycle graphs). Thus under the assumption $S\not=\emptyset$, bounding $\gamma$ for the cycle graph reduces to bounding $\gamma$ for a tree graph.

Let $D\in\mathbb{R}^{(n-1)\times n}$ be the incidence matrix of a directed tree graph rooted at vertex 1. Let $D^{+}\in\mathbb{R}^{n\times(n-1)}$ be its Moore-Penrose pseudoinverse. By Lemma 2.2 in Ortelli and van de Geer (2019a) we have that $D^{+}$ can be obtained as $D^{+}=(\text{I}_{n}-\mathbb{I}_{n}/n)X_{-1}$ , where $X=\begin{pmatrix}(1,0,\ldots,0)\\ D\end{pmatrix}^{-1}$ . As pointed out in Ortelli and van de Geer (2018), $X$ has the meaning of the rooted path matrix of the tree graph considered. Thus, the columns of $X_{-1}$ contain a minimum of $1$ and a maximum of $(n-1)$ entries having value 1, while the remaining entries are zeroes.

Let $i$ be the number of entries having value 1 of a column of $X_{-1}$ . Let $v(i)\in\mathbb{R}^{n},\ i\in[n]$ denote any vector with $i$ entries having value 1 and $(n-i)$ entries having value 0. Define $g(i,n):=\lVert(\text{I}_{n}-\mathbb{I}_{n}/n)v(i)\rVert^{2}_{2}$ . We have that $g(i,n)=i(1-i/n)^{2}+(n-i)(i/n)^{2}={i(n-i)}/{n},\ i\in[n-1]$ . The maximum of $g(i,n)$ for a given $n$ is reached at $i=n/2$ if $n$ is even and at $i\in\{\lfloor n/2\rfloor,\lceil n/2\rceil\}$ if $n$ is odd.
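As a sanity check on the closed form, the following sketch evaluates $g(i,n)$ directly from its definition and compares it with $i(n-i)/n$. It also confirms the location of the maximizers, that the maximum over $i$ grows with $n$, and that the maximum stays below $(n+1)/4$, the bound used in the next step of the proof. The function names `g` and `check` and the range of $n$ tested are arbitrary choices made for illustration.

```python
import numpy as np

def g(i, n):
    """Squared l2-norm of (I_n - 11'/n) v(i), with v(i) having i ones and n - i zeros."""
    v = np.zeros(n)
    v[:i] = 1.0
    centered = v - v.mean()   # identical to applying I_n - 11'/n to v
    return float(centered @ centered)

def check(n_max=60):
    previous_best = 0.0
    for n in range(2, n_max + 1):
        values = [g(i, n) for i in range(1, n)]
        # Closed form: g(i, n) = i (n - i) / n for i in [n - 1].
        assert np.allclose(values, [i * (n - i) / n for i in range(1, n)])
        best = max(values)
        # The maximum is attained at floor(n/2) and ceil(n/2).
        assert np.isclose(values[n // 2 - 1], best)
        assert np.isclose(values[(n + 1) // 2 - 1], best)
        # The maximum over i is increasing in n and bounded by (n + 1) / 4.
        assert best > previous_best
        assert best <= (n + 1) / 4
        previous_best = best
    return True

print(check())
```

Placing the ones in the first $i$ coordinates loses no generality, since the norm of the centered vector depends only on how many entries equal 1.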

Moreover, $\max_{i\in[n-1]}g(i,n)$ is increasing in $n$ and $g(i,n)\leq\frac{n+1}{4},\forall i\in[n-1],\ \forall n$ . Therefore, the $\ell^{2}$ -norm of a column of $D^{+}_{-S}$ will never be greater than the greatest possible $\ell^{2}$ -norm of a column of $D^{+}_{\vec{C}_{i}}$ . We thus have that

[TABLE]

∎

Proof of Corollary 4.1.

By Lemma 4.1 we have that

[TABLE]

By combining the above with Lemma 4.3, Lemma 4.4 and Theorem 2.1 we get Corollary 4.1. ∎

Bibliography


1. Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806.
2. Birman, M. S. and Solomjak, M. Z. (1967). Piecewise-polynomial approximations of functions of the classes $W^{\alpha}_{p}$. Math. USSR Sb. 2.
3. Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data. doi:10.1007/978-3-642-20192-9
4. Bunea, F., Lederer, J. and She, Y. (2014). The Group Square-Root Lasso: Theoretical Properties and Fast Algorithms. IEEE Transactions on Information Theory 60 1313–1325.
5. Chatterjee, S. and Goswami, S. (2019). New Risk Bounds for 2d Total Variation Denoising. arXiv:1902.01215v2 1–59.
6. Dalalyan, A., Hebiri, M. and Lederer, J. (2017). On the prediction performance of the Lasso. Bernoulli 23 552–581.
7. Derumigny, A. (2018). Improved bounds for Square-Root Lasso and Square-Root Slope. Electronic Journal of Statistics 12 741–766.
8. Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics 26 879–921.
