Oracle inequalities for square root analysis estimators with application to total variation penalties
Francesco Ortelli, Sara van de Geer

TL;DR
This paper develops oracle inequalities for square root analysis estimators, including total variation penalties on graphs, providing theoretical guarantees and extending previous entropy-based results.
Contribution
It introduces new oracle inequalities for square root analysis estimators and extends the theory to total variation regularization on graphs.
Findings
Oracle inequalities with fast and slow rates derived for analysis estimators.
Extension of theory to square root analysis estimators.
Constant-friendly rates for total variation regularized estimators on graphs.
Abstract
Through the direct study of the analysis estimator we derive oracle inequalities with fast and slow rates by adapting the arguments involving projections by Dalalyan, Hebiri and Lederer (2017). We then extend the theory to the square root analysis estimator. Finally, we focus on (square root) total variation regularized estimators on graphs and obtain constant-friendly rates, which, up to log-terms, match previous results obtained by entropy calculations. We also obtain an oracle inequality for the (square root) total variation regularized estimator over the cycle graph.
Authors' affiliation: Seminar for Statistics, ETH Zürich, Rämistrasse 101, 8092 Zürich
Keywords: Analysis, Total variation regularization, Lasso, Edge Lasso, Cycle graph, Sparsity, Trend filtering, Oracle inequality, Nullspace, Square root Lasso.
1 Introduction
1.1 Review of the literature
1.1.1 Synthesis and analysis
In the literature we find two approaches to regularized empirical risk minimization: the synthesis and the analysis approach, see Elad, Milanfar and Rubinstein (2007). Given a dictionary , the synthesis approach to the estimation of is expressed by the synthesis estimator
[TABLE]
where , and for a vector we write . An instance of a synthesis estimator is the classical lasso (Tibshirani (1996); see Bühlmann and van de Geer (2011) and van de Geer (2016) for a thorough exposition of the theory of the lasso).
On the other hand, for an analysis operator , the analysis estimator is given by
[TABLE]
The analysis approach to the estimation of has previously been studied in e.g. Vaiter et al. (2013) and Nam et al. (2013). Instances of analysis estimators are total variation regularized estimators over graphs, in particular the fused lasso (Tibshirani et al. (2005)), which corresponds to the case of the path graph. For such estimators, is taken to be the incidence matrix of some directed graph .
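For orientation, the two approaches can be sketched side by side in a commonly used notation. This is only an illustration under assumptions: the symbols $X$ (dictionary), $\beta$ (coefficients), $D$ (analysis operator), $Y$ (observations), the normalization $\|v\|_n^2 = \|v\|_2^2/n$ and the factor $2\lambda$ are choices made here and may differ from the constants of the displays above.

```latex
% Synthesis: penalize the coefficients of the signal in a dictionary X (assumed notation).
\hat{\beta} := \arg\min_{\beta \in \mathbb{R}^{p}}
  \left\{ \|Y - X\beta\|_n^2 + 2\lambda \|\beta\|_1 \right\},
\qquad \|v\|_n^2 := \|v\|_2^2 / n .

% Analysis: penalize a linear transformation D f of the candidate signal f itself.
\hat{f} := \arg\min_{f \in \mathbb{R}^{n}}
  \left\{ \|Y - f\|_n^2 + 2\lambda \|D f\|_1 \right\} .
```

In the synthesis form sparsity is imposed on the coefficients, whereas in the analysis form it is imposed on $Df$, for instance on the differences of $f$ across the edges of a graph.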
Algorithms to solve both the analysis and the synthesis problem are presented in Tibshirani and Taylor (2011).
1.1.2 Total variation regularized estimators
Let be a general directed graph, where the set is the set of vertices and the set is the set of edges. Every edge is directed from a vertex to a vertex , .
We define , the incidence matrix of , as
[TABLE]
where denotes the row of . Total variation regularized estimators are analysis estimators, where the analysis operator is taken to be for some graph . Thus, the differences of the candidate estimator across the edges of the graph are penalized.
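As a small illustration of how the penalty acts, the following sketch builds the incidence matrix of a directed graph from an edge list and evaluates the total variation of a signal, i.e. the sum of absolute differences across edges. The function names and the sign convention (-1 at the tail, +1 at the head of each edge) are assumptions made for this example; the value of the penalty does not depend on the orientation of the edges.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]  # (tail, head); vertices are numbered 0..n-1


def incidence_matrix(n_vertices: int, edges: Sequence[Edge]) -> List[List[int]]:
    """Return the (number of edges) x (number of vertices) incidence matrix.

    Assumed sign convention: the row of edge (tail, head) has -1 at the
    tail, +1 at the head and 0 elsewhere.
    """
    rows = []
    for tail, head in edges:
        row = [0] * n_vertices
        row[tail] = -1
        row[head] = 1
        rows.append(row)
    return rows


def total_variation(D: Sequence[Sequence[int]], f: Sequence[float]) -> float:
    """l1-norm of D f: the sum of absolute differences of f across the edges."""
    return sum(abs(sum(d * x for d, x in zip(row, f))) for row in D)


# Path graph on 5 vertices (the fused lasso setting).
path_edges = [(i, i + 1) for i in range(4)]
D_path = incidence_matrix(5, path_edges)

# A piecewise constant signal with two jumps, of sizes 2 and 1.
signal = [1.0, 1.0, 3.0, 3.0, 2.0]
tv = total_variation(D_path, signal)
```

Only the two edges where the signal jumps contribute to `tv`, so a piecewise constant signal with few jumps has a small penalty.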
Some previous studies of total variation regularized estimators (Dalalyan, Hebiri and Lederer (2017); Ortelli and van de Geer (2018)) used a step through a synthesis formulation (cf. Ortelli and van de Geer (2019a)) to prove oracle inequalities. However, these studies were confined to restrictive graph structures: the path in Dalalyan, Hebiri and Lederer (2017) and a class of tree graphs in Ortelli and van de Geer (2018). Other studies focusing on the fused lasso and not directly involving its synthesis form also implicitly relied on some kind of dictionary to handle the error term by projections onto some columns of this dictionary, see for instance the lower interpolant by Lin et al. (2017).
The approach by Hütter and Rigollet (2016), in spite of handling directly the analysis estimator, is not able to guarantee the convergence of the mean squared error for the fused lasso.
For , define
[TABLE]
The minimax rate of estimation over the path graph for functions is (Donoho and Johnstone (1998)). Moreover the fused lasso tuned with has if and thus achieves the minimax rate (Mammen and van de Geer (1997)). This result is based on entropy bounds (see ed. Babenko (1979); Birman and Solomjak (1967)) on the class , which are not constant-friendly. On the other hand, Sadhanala, Wang and Tibshirani (2016) showed that estimators given by linear transformations of the observations are suboptimal on .
In a recent paper, Padilla et al. (2018) prove that when is a tree graph with bounded maximal degree and , then the minimax rate is .
Moreover, Padilla et al. (2018) prove that the total variation regularized estimator over any connected graph has a mean squared error of order at most if . Thus the total variation regularized estimator over tree graphs of bounded maximal degree is proved to be minimax-optimal. This result is based on entropy bounds by Wang et al. (2016) and is not constant-friendly.
In Sadhanala, Wang and Tibshirani (2016), the authors prove that, for being the two dimensional grid graph, the minimax rate of estimation for with the canonical scaling is . The paper by Hütter and Rigollet (2016) shows that this rate is retrieved by the total variation regularized estimator up to log terms.
In a recent work, Chatterjee and Goswami (2019) obtain convergence rates for the total variation regularized estimator over the two dimensional grid by proof techniques involving bounds on the Gaussian width of tangent cones.
These previous results will serve as a benchmark for the evaluation of the rates of the oracle inequalities presented in this paper.
1.1.3 Square root regularization
The square root lasso estimator, defined as
[TABLE]
was first introduced by Belloni, Chernozhukov and Wang (2011) and allows one to simultaneously estimate the regression coefficients and the noise level. Thus, when tuning the estimator to obtain oracle properties, one can choose not depending on the unknown noise level . The square root lasso estimator is studied in Belloni, Chernozhukov and Wang (2011), Sun and Zhang (2012) (where it is called scaled lasso), van de Geer (2016) and Stucky and van de Geer (2017), among others.
One can rewrite the minimization problem in the following form
[TABLE]
The objective function of this second expression of the estimator is not differentiable at and thus if the KKT conditions do not hold. By differentiating the penalized loss and assuming that we get the KKT conditions
[TABLE]
where is the subdifferential of at .
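To make the role of the no-overfitting condition concrete, here is a sketch of the computation in the usual formulation of the square root lasso. The notation ($Y$, $X$, $\beta$, $\lambda_0$, $\|v\|_n = \|v\|_2/\sqrt{n}$) is assumed for this illustration and need not coincide with the exact form of the displays above.

```latex
% Square root lasso in a common formulation (assumed notation).
\hat{\beta} := \arg\min_{\beta} \left\{ \|Y - X\beta\|_n + \lambda_0 \|\beta\|_1 \right\},
\qquad \|v\|_n := \|v\|_2 / \sqrt{n} .

% For a nonzero residual the loss is differentiable, with gradient
\nabla_{\beta} \|Y - X\beta\|_n
  = - \frac{X^{\top}(Y - X\beta)}{n \, \|Y - X\beta\|_n},
\qquad Y - X\beta \neq 0 .

% Hence, provided the fitted residual does not vanish, the KKT conditions read
\frac{X^{\top}(Y - X\hat{\beta})}{n \, \|Y - X\hat{\beta}\|_n} = \lambda_0 \hat{z},
\qquad \hat{z} \in \partial \|\hat{\beta}\|_1 .
```

The residual norm appears in the denominator, which is why these conditions are only available when the estimator does not interpolate the data.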
The papers Belloni, Chernozhukov and Wang (2011); Sun and Zhang (2012) propose algorithms to compute the square root lasso estimator, which are extended by Bunea, Lederer and She (2014) and Derumigny (2018) to the cases of the group square root lasso and of the square root slope respectively.
In this paper we focus on analysis estimators. Our interest is motivated by the possibility to apply the results to the case of total variation regularization. As it will turn out in Theorems 2.1 and 2.2, the choice of the tuning parameter needed to ensure oracle properties for plain analysis estimators depends on the noise variance , which might be unknown. Therefore, we are interested in the square root version of the analysis estimator: the square root analysis estimator
[TABLE]
Indeed, square root estimators are known to be able to estimate the signal and the noise variance simultaneously and therefore allow for a choice of the tuning parameter that does not depend on to guarantee oracle inequalities. This will turn out to be the case in Theorems 3.1 and 3.2. Square root analysis estimators could be computed either by transforming them into square root synthesis estimators by using the insights provided by Ortelli and van de Geer (2019a) (which are largely based on Elad, Milanfar and Rubinstein (2007)) or by adapting to the square root case the algorithm provided by Tibshirani and Taylor (2011) to solve plain analysis problems.
We want to combine the arguments presented by van de Geer (2016) and Dalalyan, Hebiri and Lederer (2017) and extend them to the square root analysis estimator.
1.2 Contributions
The main points profiling our results are:
- we study directly the analysis estimator without passing through its synthesis formulation;
- we apply the projection arguments by Dalalyan, Hebiri and Lederer (2017) to the case of square root regularization;
- to do so we use projection theory for analysis operators.
We make the following contributions:
1. We present a framework for proving oracle inequalities with fast and slow rates for a general analysis estimator without transforming the analysis estimation problem into a synthesis estimation problem. This constitutes an analysis counterpart of the results obtained by Dalalyan, Hebiri and Lederer (2017) for the synthesis estimator.
2. We introduce, inspired by some remarks by Padilla et al. (2018), as a measure for the sparsity of the signal (see Subsection 1.3 for the notation). In Hütter and Rigollet (2016), the sparsity of the true signal was measured as , while we argue that is more appropriate.
3. For the total variation regularized estimator on the path graph, we show that an analogue of the bound on the increments of the empirical process by projections presented by Dalalyan, Hebiri and Lederer (2017) is only off by log-terms from the one which can be obtained by entropy calculations, if we allow the tuning parameter to depend on some aspects of . We thus match, up to log-terms, the result obtained by means of entropy calculations by Padilla et al. (2018) for general graphs and by Mammen and van de Geer (1997) for the path graph. Note that entropy calculations are not constant-friendly, while the bounds we present are, and they might be advantageous for a small enough value of .
4. For the total variation regularized estimator over the cycle graph, we prove an oracle inequality with fast rates, which to our knowledge is a new contribution.
5. We adapt a lemma by van de Geer (2016), showing that the square root lasso does not overfit, to the case where the increments of the empirical process for the square root analysis estimator are bounded by means of the projection arguments by Dalalyan, Hebiri and Lederer (2017). This is a starting point for the development of oracle inequalities for the square root analysis estimator, which produce results analogous to the ones obtained for the plain analysis estimator (which match the ones found in Dalalyan, Hebiri and Lederer (2017)). We then narrow down these results to square root total variation regularized estimators on graphs.
1.3 Notation
Analysis operator . Let be a given matrix. Let denote the row vectors of . By , we denote the nullspace of , i.e. . Let denote the orthogonal complement of . Note that . By penalizing , we favor an estimator lying almost in , while we penalize estimators having high correlation with the rows of .
Active set . Let denote a subset of the row indices of . We denote the cardinality of the set by . We write . Moreover, we write and . For instance, let us suppose that, for , the true signal is s.t. and . Then is the true active set for , i.e. the set of indices of rows of , to which the true signal is not orthogonal.
Set of admissible active sets . Define . and
[TABLE]
where denotes the power set of . If is not of full row rank, then there might be some subsets of that cannot be the active sets of for any . Thus, from now on, we restrict our attention to active sets . More on this in the Remark in Section 4.
The nullspace . Note that, since , . Thus encompasses all the signals , s.t. . In a vector we have “pieces” of information. Note that can be nonempty and thus the part of lying in will always be active, because it is not penalized. Moreover, since we can have and , we see that is not a good measure for the sparsity of the signal. We thus use as a measure of sparsity to denote the pieces of information that the estimator effectively had to estimate if the active set were .
We use the shorthand notations and . Similarly, we write and . Note that Moreover, if are s.t. , then we have that . In addition, if the rows of can be written as linear combinations of the rows of , then .
Diagonal matrices of weights. Let be a vector, for instance a vector of weights. For the diagonal matrix we write and . We will need these notations for bounding the weighted weak compatibility constant, defined in Definition 1.1 below.
Linear projections. Let denote the identity matrix and let .
Let be a linear space. By we denote the orthogonal projection matrix onto and by the orthogonal antiprojection matrix onto .
Let . We write , i.e. for a set we decompose a signal into a low rank part (since usually will be small) orthogonal to and a part collinear to . We will use this decomposition when bounding the increments of the empirical processes in the proofs of the oracle inequalities.
Note that and .
Computing . Let be a set of row indices of . We have that
[TABLE]
where denotes the Moore-Penrose pseudoinverse of . If is of full row rank we have that .
1.4 Model assumptions and preliminary definitions
1.4.1 Model assumptions
Throughout the paper we will use the following model, which assumes that we observe a signal contaminated with Gaussian noise. Let be a signal. We observe
[TABLE]
Moreover, for an analysis operator we will study the two following estimators.
- •
The analysis estimator of , defined as
[TABLE]
- •
The square root analysis estimator $\hat{f}_{\sqrt{\,}}$ of , defined as
[TABLE]
In particular, Section 2 will focus on the study of the analysis estimator for a general analysis operator , while Section 3 will deal with its square root counterpart $\hat{f}_{\sqrt{\,}}$. In Section 4 we will then apply the results of the two previous sections to total variation regularization.
1.4.2 Definitions
Let denote the Moore-Penrose pseudoinverse of and the column vectors of .
We define the map , s.t. . The index denotes the row index of the row of , , in the matrix .
We use a proof technique inspired by Dalalyan, Hebiri and Lederer (2017). The key aspect of this proof technique is to decompose the noise into two parts by using orthogonal projections:
- a part projected onto a low-rank linear subspace, which will be bounded by using the Cauchy-Schwarz inequality,
- a remainder (i.e. the antiprojection), involving the weights defined below, which will be bounded with more refined techniques. These techniques involve, in the case of oracle inequalities with fast rates, the weighted weak compatibility constant.
Definition 1.1 (Weighted weak compatibility constant)
Let be a diagonal matrix of weights with and (e.g. as in Definition 1.4). The weighted weak compatibility constant is defined as
[TABLE]
Remark
This weighted weak compatibility constant extends the definition given by Dalalyan, Hebiri and Lederer (2017) to the case of analysis estimators. A distinguishing feature is the factor . When , expresses the number of parameters to estimate .
Note that the weighted weak compatibility constant relaxes the definition of the compatibility constant given by Hütter and Rigollet (2016). There, one has to lower bound
[TABLE]
while the weighted weak compatibility constant is applied to lower bound
[TABLE]
which is easier, since the denominator is smaller. The additional term comes from the remainder term mentioned in the above sketch of the proof technique used in this paper.
Note that bounds on the compatibility constant by Hütter and Rigollet (2016) imply bounds on the weighted weak compatibility constant but the converse is not true. This is relevant, for instance, for the case of the total variation regularization over the path graph. In that case the bound by Hütter and Rigollet (2016) is too rough. One can obtain more refined bounds by studying the weighted weak compatibility constant as done in Dalalyan, Hebiri and Lederer (2017); Ortelli and van de Geer (2018). For instance, Ortelli and van de Geer (2018) showed that the weighted weak compatibility constant can also be bounded on a certain class of tree graphs. We will show in Section 4 a new bound on the weighted weak compatibility constant for the total variation regularized estimator over the cycle graph.
Definition 1.2 (Length of antiprojections)
In analogy to Dalalyan, Hebiri and Lederer (2017), the vector is defined as
[TABLE]
Moreover, we write .
Note
One can see that, if is of full row rank,
[TABLE]
We want to find a vector of weights with values in , based on defined above. We thus define hereafter the normalized inverse scaling factor as the maximum entry of .
Definition 1.3 (Normalized inverse scaling factor)
In analogy to the quantity used by Dalalyan, Hebiri and Lederer (2017) and to the scaling factor defined by Hütter and Rigollet (2016), the normalized inverse scaling factor is defined as
[TABLE]
We now normalize by dividing its entries by to obtain a vector of weights .
Definition 1.4 (Weights)
In analogy to Dalalyan, Hebiri and Lederer (2017), the vector of weights is defined as
[TABLE]
Moreover, we write . Note that .
2 Oracle inequalities for the analysis estimator
In this section we study the analysis estimator, which is defined as
[TABLE]
This section produces results analogous to those of Dalalyan, Hebiri and Lederer (2017). However, we use an approach that does not take a detour via synthesis, but instead handles the analysis estimator directly. In Section 3 we explore how this approach translates to the case of the square root analysis estimator.
2.1 Fast rates with compatibility conditions
Theorem 2.1 (Oracle inequality with fast rates for the analysis estimator)
Let be arbitrary and . Choose $\lambda\geq\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}$. For the analysis estimator it holds that, , with probability at least ,
[TABLE]
Proof of Theorem 2.1.
See Appendix B. ∎
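The lower bound on the tuning parameter in Theorem 2.1 is easy to evaluate numerically. The sketch below does so; the function name and the input checks are illustrative additions, and the requirement t > 0 is an assumption made here since the admissible range of t is not restated in the text. The arguments `gamma`, `sigma`, `n`, `r_S` and `t` stand for the quantities $\gamma$, $\sigma$, $n$, $r_S$ and $t$ of the theorem.

```python
import math


def lambda_lower_bound(gamma: float, sigma: float, n: int, r_S: int, t: float) -> float:
    """Evaluate gamma * sigma * sqrt(2 log(2 (n - r_S)) / n + 2 t / n).

    This is the smallest tuning parameter allowed by the condition of
    Theorem 2.1. The input checks are assumptions of this sketch.
    """
    if not 0 <= r_S < n:
        raise ValueError("need 0 <= r_S < n so that log(2 (n - r_S)) is non-negative")
    if t <= 0:
        raise ValueError("t is assumed to be positive in this sketch")
    return gamma * sigma * math.sqrt(2.0 * math.log(2.0 * (n - r_S)) / n + 2.0 * t / n)


# Illustrative values only: n = 1000 observations, r_S = 10, t = log(n).
lam = lambda_lower_bound(gamma=1.0, sigma=1.0, n=1000, r_S=10, t=math.log(1000))
```

The bound scales linearly in the noise level `sigma`, which is the dependence that the square root version of the estimator is designed to remove.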
2.2 Slow rates without compatibility conditions
Theorem 2.2 (Oracle inequality with slow rates for the analysis estimator)
Let be arbitrary and . Choose $\lambda\geq\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}$. For the analysis estimator it holds that, , with probability at least ,
[TABLE]
Proof of Theorem 2.2.
See Appendix B. ∎
Remark
Theorem 2.2 does not need the assumption that the (weighted) compatibility constant is bounded away from zero.
3 Oracle inequalities for the square root analysis estimator
In this section we study the square root analysis estimator, defined as
[TABLE]
Throughout this section we will make use of the following assumption.
Assumption 3.1
Assume for some that and that for some
[TABLE]
where,
[TABLE]
We assume that is s.t.
[TABLE]
Note
Assumption 3.1 is also an assumption on and will thus be a criterion to determine for which our (oracle) results hold.
For the square root analysis estimator, to get the KKT conditions we have to make sure that , i.e. that the estimator does not overfit.
The following lemma, showing that , is an adaptation of Lemma 3.1 in van de Geer (2016) to the case of the square root analysis estimator where the increments of the empirical process are bounded by the projection arguments found in Dalalyan, Hebiri and Lederer (2017).
Lemma 3.1
Let be an arbitrary active set satisfying Assumption 3.1 and let . Choose $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}}$, $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1 we have that, with probability at least ,
[TABLE]
Proof of Lemma 3.1.
See Appendix C. ∎
Remark
While Lemma 3.1 by van de Geer (2016) only requires a lower bound on , Lemma 3.1 presented here requires that is upper and lower bounded and that is lower bounded. It is the price to pay for a more refined technique to handle the increments of the empirical process.
Corollary 3.1
Let be an arbitrary active set satisfying Assumption 3.1 and let . Choose $R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}}$, $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1, we have that, with probability at least , .
Proof of Corollary 3.1.
Under Assumption 3.1 on Lemma 3.1 holds and thus . It follows that . By inserting this inequality into the assumption we get the claim. ∎
We now present oracle inequalities for the square root analysis estimator with fast and slow rates. The results are similar to Theorems 2.1 and 2.2 up to the constants and the assumptions one has to make.
3.1 Fast rates with compatibility conditions
Theorem 3.1 (Oracle inequality with fast rates for the square root analysis estimator)
Let be an arbitrary active set satisfying Assumption 3.1 and let . For , choose $\lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}}$, $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1, , it holds that, with probability at least ,
[TABLE]
Proof of Theorem 3.1.
See Appendix C. ∎
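The condition on $\lambda_0$ in Theorem 3.1 can be evaluated in the same way as for the plain analysis estimator, with the difference that the noise level no longer enters. The following sketch checks the admissible range of $t$ stated in the theorem and returns the lower bound; the function names are hypothetical, and the restriction of `eta` to the open interval (0, 1) is an assumption of this sketch.

```python
import math


def admissible_t_upper(n: int, r_S: int) -> float:
    """Right end point (n - 1) / 2 - log(2 (n - r_S)) of the admissible range of t."""
    return (n - 1) / 2.0 - math.log(2.0 * (n - r_S))


def lambda0_lower_bound(gamma: float, n: int, r_S: int, t: float, eta: float) -> float:
    """Evaluate gamma / (1 - eta) * sqrt((2 log(2 (n - r_S)) + 2 t) / (n - 1)).

    No noise level appears among the arguments. The checks on r_S and eta
    are assumptions of this sketch; the check on t follows the stated range.
    """
    if not 0 <= r_S < n:
        raise ValueError("need 0 <= r_S < n")
    if not 0.0 < eta < 1.0:
        raise ValueError("eta is assumed to lie in (0, 1)")
    if not 0.0 < t < admissible_t_upper(n, r_S):
        raise ValueError("t must lie in (0, (n - 1) / 2 - log(2 (n - r_S)))")
    # Inside the admissible range the ratio under the square root is below one.
    ratio = (2.0 * math.log(2.0 * (n - r_S)) + 2.0 * t) / (n - 1)
    return gamma * math.sqrt(ratio) / (1.0 - eta)


# Illustrative values only.
lambda0 = lambda0_lower_bound(gamma=1.0, n=1000, r_S=10, t=math.log(1000), eta=0.5)
```

The upper limit on `t` is exactly what keeps the quantity under the square root below one, so for small `n` the admissible range can be empty and the function raises an error.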
3.2 Slow rates without compatibility conditions
Theorem 3.2 (Oracle inequality with slow rates for the square root analysis estimator)
Let be an arbitrary active set satisfying Assumption 3.1 and let . For , choose $\lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}}$, $t\in(0,(n-1)/2-\log(2(n-r_{S})))$. Under Assumption 3.1, , it holds that, with probability at least ,
[TABLE]
Proof of Theorem 3.2.
See Appendix C. ∎
Remark
The claim of Theorem 3.2 implies also the simpler inequality
[TABLE]
Remark
We can simplify for the ease of exposition Assumption 3.1 on to . Note that if we take $\lambda_{0}\asymp\gamma\sqrt{\log n/n}$, then the assumption becomes
[TABLE]
If is growing with , then the rates obtained with the slow rate oracle inequality by setting will be slower as well. In particular, if , then Theorem 3.2 does not guarantee the convergence in .
Remark
The choice of the tuning parameter depends on through . Therefore, in practice, the oracle inequalities will only hold for certain active sets . To find out for which the oracle inequality holds with high probability we proceed as follows.
We choose , , and . Then, an active set for which the oracle inequality holds has to satisfy the following requirements:
[TABLE]
and
[TABLE]
4 Total variation
4.1 Incidence matrices
Let be a general directed graph, where the set is the set of vertices and the set is the set of edges. Let be the incidence matrix of (for more details see Subsubsection 1.1.2). In this section we will set . It is known that the rank of is given by the number of vertices of minus its number of connected components.
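The rank statement can be checked on small examples. The sketch below computes the rank of an incidence matrix exactly, by Gaussian elimination over the rationals, and compares it with the number of vertices minus the number of connected components. All function names, and the sign convention of the incidence matrix, are assumptions made for this illustration.

```python
from fractions import Fraction


def incidence_matrix(n_vertices, edges):
    """Rows indexed by edges; assumed convention: -1 at the tail, +1 at the head."""
    rows = []
    for tail, head in edges:
        row = [0] * n_vertices
        row[tail], row[head] = -1, 1
        rows.append(row)
    return rows


def matrix_rank(matrix):
    """Exact rank via Gaussian elimination with rational arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def count_components(n_vertices, edges):
    """Number of connected components, ignoring edge directions (union-find)."""
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in range(n_vertices)})


# Cycle graph on 6 vertices: 6 edges, one connected component.
cycle = [(i, (i + 1) % 6) for i in range(6)]
rank_cycle = matrix_rank(incidence_matrix(6, cycle))

# Two disjoint paths on vertices {0, 1, 2} and {3, 4, 5}: two components.
two_paths = [(0, 1), (1, 2), (3, 4), (4, 5)]
rank_two_paths = matrix_rank(incidence_matrix(6, two_paths))
```

For the cycle the matrix has six rows but rank five, which is the situation in which not every subset of rows can serve as an active set.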
We now consider a set . Let us define the set of edges . The number of connected components of is . These connected components can be any sort of graph: tree or non-tree graphs.
Let be the number of vertices of each connected component of . Let us define and . The matrix can be rewritten as block matrix by rearranging rows and columns. From now on, when writing we intend the matrix in its block form.
By Lemma 1 in Ijiri (1965) we have that
[TABLE]
Remark
The restriction to the class can be seen as a requirement to have an active set which makes sense. The incidence matrix of every connected graph is of row rank . However, graphs containing cycles, such as the cycle graph or the two dimensional grid graph, have more than rows. The dimension of is the number of connected patches of the graph on which the signal is constant, if the active set is . A non-empty active set means that the signal should have at least two constant pieces, otherwise no edge would be active.
If the active set is , then the dimension of is one. Now consider for instance the cycle graph. If for an , then the dimension of is still one. Thus this active set does not make sense at all since it would imply that we have a constant signal on the cycle graph but yet also a non-empty active set. Indeed, it is impossible to find a constant signal on a graph which results in some active edges.
For tree graphs, we have that , while for graph structures containing cycles we have that . In particular, for the cycle graph it holds that .
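The cycle graph example in this Remark can be reproduced by counting the connected patches that remain once the edges of a candidate active set are removed. The sketch below is an illustration with hypothetical names; active sets are given as sets of edge indices.

```python
from collections import deque


def components_without(n_vertices, edges, active):
    """Connected components of the graph after removing the edges indexed by `active`."""
    adjacency = {v: [] for v in range(n_vertices)}
    for index, (a, b) in enumerate(edges):
        if index not in active:
            adjacency[a].append(b)
            adjacency[b].append(a)

    seen = set()
    components = 0
    for start in range(n_vertices):
        if start in seen:
            continue
        components += 1  # a new patch starts here
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search over the patch
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return components


n = 8
cycle = [(i, (i + 1) % n) for i in range(n)]
path = [(i, i + 1) for i in range(n - 1)]

cycle_one_edge = components_without(n, cycle, {3})      # the cycle stays connected
cycle_two_edges = components_without(n, cycle, {3, 6})  # two constant pieces
path_one_edge = components_without(n, path, {3})        # a path splits at once
```

On the cycle a single removed edge leaves one patch, so a signal that is constant on each patch is constant overall and cannot have an active edge. On the path, by contrast, every single edge already separates the graph into two pieces.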
4.1.1 Trees and cycles
If is a tree or a cycle graph, then the connected components of are tree graphs, i.e. connected graphs with . Let be the incidence matrices of the tree graphs .
Lemma 4.1 (Upper bound for the normalized inverse scaling factor)
Let be a tree graph. Then, , the normalized inverse scaling factor is bounded by
[TABLE]
Let and let be a cycle graph. Then
[TABLE]
Proof of Lemma 4.1.
See Appendix D. ∎
4.1.2 Two dimensional grid graph
We report and slightly adapt the bound on the normalized inverse scaling factor for the two dimensional grid graph by Hütter and Rigollet (2016).
Lemma 4.2 (Proposition 4 in Hütter and Rigollet (2016))
Let be a two dimensional $\sqrt{n}\times\sqrt{n}$ grid graph. Let be s.t. the connected components of are square two dimensional grid graphs. Then, for some sufficiently large constant , the normalized inverse scaling factor is bounded by
[TABLE]
4.2 Fast rates
To prove oracle inequalities with fast rates we need to find an explicit lower bound for the weighted compatibility constant.
Results for the analysis estimator on the path graph have already been obtained by Ortelli and van de Geer (2018). We extend them to the square root analysis estimator. Moreover, we also show that the tools developed in Ortelli and van de Geer (2018), together with the new framework presented here, allow us to handle the case of the cycle graph. We are aware of results treating the power graphs of cycles (Hütter and Rigollet (2016)) but not of any oracle inequality implying the convergence of the mean squared error for the case of the cycle graph.
4.2.1 Path graph
We now consider the path graph , for which and .
We see that is a block matrix, where the blocks are incidence matrices of some smaller path graphs. By recycling the proof of Lemma 4.1 we obtain that
[TABLE]
The following lemma by van de Geer (2018), later also used in Ortelli and van de Geer (2018), allows us to lower bound , for a diagonal matrix with and where by convention we choose .
Lemma 4.3 (Theorem 15 and Lemma 21 in van de Geer (2018))
Assume that is s.t. and . Then
[TABLE]
and the inequality is tight. Moreover
[TABLE]
Proof of Lemma 4.3.
The first statement follows from Theorem 15 and the second from Lemma 21 in van de Geer (2018). The proofs are also presented in Ortelli and van de Geer (2018), in Lemmas 5.3-5. ∎
In Ortelli and van de Geer (2018) it is explained that to bound the weak weighted compatibility constant for the path graph one needs to cut it into smaller modules. These modules lie around an edge in and consist of at least one additional edge on each side of the edge in , see Figure 1. Therefore we see that the assumption guarantees that we are in a situation where the bounds on the weak weighted compatibility constant apply. Indeed, if we have at least four vertices on the left and on the right of each edge in and thus we can decompose the path graph into modules at least as large as the one shown in Figure 1. Since the oracle inequalities with fast rates presented here are based on the bound on the weighted weak compatibility constant by Ortelli and van de Geer (2018), for fast rates we will require that .
The edges not in between modules can be ignored. Each module needs at least vertices, s.t. we need to hope to be able to upper bound by using the method proposed by van de Geer (2018). Moreover, a vertex not involved in an edge in can only be involved in one module to obtain the bounds exposed in Ortelli and van de Geer (2018).
Note also that the weights in have a direct correspondence to the edges of the graph, where the edges in are s.t. . Moreover, the weights for the edges between modules can be chosen arbitrarily when it comes to bounding from above, even if a value for them can be obtained by computation of the -norm of the corresponding columns of .
We take the arbitrary decision to use the convention , as in Lemma 4.3.
Lemma 4.4
Assume that is s.t. . We have that
[TABLE]
Proof of Lemma 4.4.
See the proof of Corollary 5.6 in Ortelli and van de Geer (2018). ∎
Let be the incidence matrix of the path graph with vertices. With the tools developed we can prove the following corollaries.
Analysis estimator on the path graph
Corollary 4.1 below is a result already found in Ortelli and van de Geer (2018). It is reported here for comparison with the analogous result obtained for the square root analysis estimator on the path graph (see Corollary 4.3). Corollary 4.2 follows directly from Corollary 4.1.
Corollary 4.1
Let be an arbitrary active set with s.t. and let . Choose \lambda\geq\sigma\sqrt{n_{\max}(\log(2n)+t)}/n. Then, , it holds that, with probability at least ,
[TABLE]
Proof of Corollary 4.1.
See Appendix D. ∎
Due to the use of the bound given by Lemma 4.4, Corollary 4.1 assumes a minimal length condition. This condition does not depend on and is therefore weaker than the one found in Guntuboyina et al. (2017). Note that the choice of the tuning parameter depends both on and .
The next corollary makes a stronger assumption on .
Corollary 4.2
Let be an arbitrary active set with s.t. and even. Let . Choose \lambda\geq\sigma\sqrt{\frac{\log(2n)+t}{r_{S}n}}. Then, , it holds that, with probability at least ,
[TABLE]
Proof of Corollary 4.2.
If , then . Moreover, and the statement of Corollary 4.2 follows by plugging in these insights into Corollary 4.1. ∎
Corollary 4.2 says that, if , then we can choose smaller than the universal choice \lambda\asymp\sigma\sqrt{\log n/n}. The choice of the constant-friendly tuning parameter in the two corollaries above assumes, however, knowledge of some aspects of the oracle signal minimizing the right hand side. It can be seen as a motivation to choose the tuning parameter smaller than the universal choice if we know or suspect a certain specific structure for it. These insights were already developed by Dalalyan, Hebiri and Lederer (2017) and applied to total variation on the path graph in the case of slow rates.
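As a rough numerical illustration of this comparison, the following sketch evaluates the lower bound on the tuning parameter stated in Corollary 4.2 next to the universal choice. The function names are hypothetical, the constant `c` in the universal choice is an assumption (the text only gives the order), and `r_S` is treated as a user-supplied quantity as defined earlier in the paper.

```python
import math


def lambda_universal(n, sigma, c=1.0):
    """Universal choice of order sigma * sqrt(log(n) / n).

    The constant c is an assumption; the text only specifies the order.
    """
    return c * sigma * math.sqrt(math.log(n) / n)


def lambda_corollary_4_2(n, sigma, r_S, t):
    """Lower bound sigma * sqrt((log(2n) + t) / (r_S * n)) from Corollary 4.2."""
    return sigma * math.sqrt((math.log(2 * n) + t) / (r_S * n))


# Illustrative values only: a larger r_S permits a smaller tuning parameter.
n, sigma, t = 1000, 1.0, 1.0
for r_S in (1, 4, 16):
    print(r_S, lambda_corollary_4_2(n, sigma, r_S, t), lambda_universal(n, sigma))
```

With these illustrative values the bound of Corollary 4.2 falls below the universal choice once `r_S` is moderately large, while for `r_S = 1` it does not.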
Square root analysis estimator on the path graph
We now extend the results obtained for the analysis estimator to the case of the square root analysis estimator.
Corollary 4.3
Let be an arbitrary active set having and satisfying Assumption 3.1. Let and . Choose \lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}. Then, under Assumption 3.1, , for the square root version of the total variation regularized estimator over the path graph it holds that, with probability at least ,
[TABLE]
Proof of Corollary 4.3.
The proof of Corollary 4.3 is analogous to the proof of Corollary 4.1. ∎
Corollary 4.4
Let be an arbitrary active set having with even and satisfying Assumption 3.1. Let and . Choose \lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}. Then, under Assumption 3.1, , for the square root version of the total variation regularized estimator over the path graph it holds that, with probability at least ,
[TABLE]
Proof of Corollary 4.4.
If , then . Moreover, and the statement of Corollary 4.4 follows by plugging in these insights into Corollary 4.3. ∎
Remark
We notice that there is a tradeoff in the choice of . A small will result in a narrower bound for in terms of and in smaller constants in the tuning parameter and in the oracle bound. However, it might result in a more restrictive condition on in Assumption 3.1.
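The effect of this parameter on the tuning parameter can be made concrete with a small sketch of the lower bound on \lambda_{0} appearing in Corollary 4.3. The function name is hypothetical, the range 0 < eta < 1 is an assumption suggested by the factor 1/(1-\eta), and `n_max` is taken as a given input as defined earlier in the paper. Note that the noise level does not enter the formula.

```python
import math


def lambda0_lower_bound(n, n_max, t, eta):
    """Lower bound (1 / (1 - eta)) * sqrt(n_max * (log(2n) + t) / (n * (n - 1))).

    Assumes 0 < eta < 1 so that the leading factor is finite and positive.
    """
    if not 0.0 < eta < 1.0:
        raise ValueError("eta is assumed to lie strictly between 0 and 1")
    return math.sqrt(n_max * (math.log(2 * n) + t) / (n * (n - 1))) / (1.0 - eta)


# Illustrative values only: the bound grows with eta through the factor 1 / (1 - eta).
for eta in (0.1, 0.5, 0.9):
    print(eta, lambda0_lower_bound(n=1000, n_max=250, t=1.0, eta=eta))
```

A smaller value of `eta` thus gives a smaller admissible tuning parameter, in line with the remark; the cost in terms of Assumption 3.1 is not captured by this sketch.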
4.2.2 Cycle graph
We consider the cycle graph and its incidence matrix . We have .
We bound the weighted compatibility constant by cutting the graph into smaller modules as we explained in Subsubsection 4.2.1 for the path graph.
By concatenating such modules, one can obtain a path graph. Whether or not the two ends of the path graph are joined by an edge is not relevant for the possibility to bound the compatibility constant and obtain an oracle inequality with fast rates, since the edges connecting such modules are neglected in the bound.
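To make the two graph structures concrete, here is a minimal sketch that builds the incidence matrices of a path graph and of a cycle graph and evaluates the total variation of a signal. The helper names and the edge orientation (-1 at the first vertex of an edge, +1 at the second) are assumptions made for illustration; the total variation does not depend on the orientation.

```python
def path_incidence(n):
    """Incidence matrix of the path graph on n vertices: n - 1 rows, n columns."""
    rows = []
    for i in range(n - 1):
        row = [0] * n
        row[i], row[i + 1] = -1, 1  # edge between vertex i and vertex i + 1
        rows.append(row)
    return rows


def cycle_incidence(n):
    """Incidence matrix of the cycle graph on n vertices: n rows, n columns."""
    rows = path_incidence(n)
    closing = [0] * n
    closing[n - 1], closing[0] = -1, 1  # extra edge joining the two ends of the path
    rows.append(closing)
    return rows


def total_variation(D, f):
    """l1-norm of D f, i.e. the sum of absolute differences across edges."""
    return sum(abs(sum(d * x for d, x in zip(row, f))) for row in D)


f = [0, 0, 0, 2, 2, 2, -1, -1]  # piecewise constant signal with two jumps
tv_path = total_variation(path_incidence(len(f)), f)    # 2 + 3 = 5
tv_cycle = total_variation(cycle_incidence(len(f)), f)  # 5 + |0 - (-1)| = 6
print(tv_path, tv_cycle)
```

The only difference between the two matrices is the closing edge, which is the edge that the module argument above neglects.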
Remark
Note that for the path graph we have that , while for the cycle graph it holds that .
Corollary 4.5
Assume that is s.t. . Then
[TABLE]
and the inequality is tight. Moreover
[TABLE]
Proof of Corollary 4.5.
Corollary 4.5 follows from Lemma 4.3 and from the considerations above. ∎
Remark
From Lemma 4.3 we get that, if is s.t. , then
[TABLE]
We now have all the tools to derive an oracle inequality for the total variation regularized estimator over the cycle graph and its square root version.
Analysis estimator on the cycle graph
Corollary 4.6
Let be an arbitrary active set with and let . Choose \lambda\geq\sigma\sqrt{n_{\max}(\log(2n)+t)}/n. Then, , for the total variation regularized estimator over the cycle graph it holds that, with probability at least ,
[TABLE]
An analogous version of Corollary 4.2 can be derived from Corollary 4.6.
Square root analysis estimator on the cycle graph
Corollary 4.7
Let be an arbitrary active set having and satisfying Assumption 3.1. Let and . Choose \lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}. Then, , for the square root version of the total variation regularized estimator over the cycle graph it holds that, with probability at least ,
[TABLE]
An analogous version of Corollary 4.4 can be derived from Corollary 4.7.
4.3 Slow rates
Note that in the case of the so-called slow rates we do not need to lower bound the compatibility constant.
4.3.1 Trees and cycles
In this subsection we identify the analysis operator with the incidence matrix of a general tree or cycle graph .
Analysis estimator on trees and cycles
Corollary 4.8
Let be a tree or a cycle graph. Let (and under the condition for cycle graphs) be arbitrary and let . Choose \lambda\geq\sigma\sqrt{n_{\max}(\log(2n)+t)}/n. Then, , we have that, with probability at least ,
[TABLE]
Proof of Corollary 4.8.
Corollary 4.8 follows by combining Theorem 2.2 and Lemma 4.1. ∎
Corollary 4.9
Let be a tree or a cycle graph. Let (with the condition for cycle graphs) having be arbitrary and let . Choose \lambda\geq\sigma\sqrt{\frac{\log(2n)+t}{r_{S}n}}. Then, , we have that, with probability at least ,
[TABLE]
Square root analysis estimator on trees and cycles
Corollary 4.10
Let be a tree or a cycle graph. Let (and under the condition for cycle graphs) be an arbitrary active set satisfying Assumption 3.1. Let and . Choose \lambda_{0}\geq\frac{1}{1-\eta}\sqrt{n_{\max}\frac{\log(2n)+t}{n(n-1)}}. Then, , it holds that under Assumption 3.1, with probability at least ,
[TABLE]
Proof of Corollary 4.10.
Corollary 4.10 follows by combining Theorem 3.2 and Lemma 4.1. ∎
Corollary 4.11
Let be a tree or a cycle graph. Let (and under the condition for cycle graphs) be an arbitrary active set having and satisfying Assumption 3.1. Let and . Choose \lambda_{0}\geq\frac{1}{1-\eta}\sqrt{\frac{\log(2n)+t}{r_{S}(n-1)}}. Then, , it holds that under Assumption 3.1, with probability at least ,
[TABLE]
4.3.2 Two dimensional grid graph
In this subsection we identify the analysis operator with the incidence matrix of a square two dimensional grid graph .
Analysis estimator on the two dimensional grid
Corollary 4.12
Let be a square two dimensional grid graph. Let be an arbitrary active set s.t. the connected components of are square two dimensional grid graphs and let . For a constant large enough, choose \lambda\geq C\sigma\sqrt{\log n(\log(2n)+t)}/n. Then, , we have that, with probability at least ,
[TABLE]
Proof of Corollary 4.12.
Corollary 4.12 follows by combining Theorem 2.2 and Lemma 4.2. ∎
Square root analysis estimator on the two dimensional grid
Corollary 4.13
Let be a square two dimensional grid graph. Let be an arbitrary active set s.t. the connected components of are square two dimensional grid graphs and satisfying Assumption 3.1. Let and . For a constant large enough, choose \lambda_{0}\geq\frac{C}{1-\eta}\sqrt{\frac{\log n(\log(2n)+t)}{n(n-1)}}. Then, , it holds that under Assumption 3.1, with probability at least ,
[TABLE]
Proof of Corollary 4.13.
Corollary 4.13 follows by combining Theorem 3.2 and Lemma 4.2. ∎
4.3.3 Comparison with other results
Consider Corollary 4.9 with the choice and assume that does not depend on . Then the following holds with probability at least .
- With , then and explicitly depends on .
- With , then and does not explicitly depend on .
One can reason analogously starting from Corollary 4.11 for the square root analysis estimator.
In both cases, if we obtain that . However, it is known that the minimax rate for that case (when the graph considered is the path graph) is and thus our results lead to a redundant log-term. The results about the minimax rate over the class of functions with bounded total variation obtained by entropy calculations (Mammen and van de Geer (1997) and references therein) are not constant-friendly, so that it may well be that, for small enough, the log-term is smaller than the constants of the entropy arguments.
The same remark applies to the case of tree graphs of bounded maximal degrees. For such graphs, Padilla et al. (2018) proved that the minimax rate of estimation of is . Moreover, they proved by entropy arguments that the total variation regularized estimator achieves the minimax rate. We prove that this minimax rate is achieved by the (square root) total variation regularized estimator up to a log term by using constant-friendly arguments (cf. Corollary 4.9 and 4.11).
We thus saw that for the path graph, the constant-friendly projection argument introduced by Dalalyan, Hebiri and Lederer (2017) to handle the increments of the empirical process might produce optimal rates up to a log-term for both the total variation regularized estimator and the square root total variation regularized estimator.
Another question is whether we can retrieve almost minimax rates by Corollary 4.12 for being the incidence matrix of a two dimensional grid graph. For that case, the minimax rate is (Sadhanala, Wang and Tibshirani (2016)) and an oracle inequality proved by Hütter and Rigollet (2016) almost retrieves it. Moreover, a natural scaling for that case is (Sadhanala, Wang and Tibshirani (2016)). Note that the part of Assumption 3.1 concerning , which translates to , is thus satisfied.
Thus, for fixed, from Corollaries 4.12 and 4.13 we get that, if is s.t. the connected components of are square two dimensional grid graphs and
[TABLE]
under the canonical scaling we have the rate
[TABLE]
which corresponds to the minimax rate up to a log term. Note however that, due to the utilization of Lemma 4.2, Corollaries 4.12 and 4.13, from which this insight is derived, are not constant-friendly.
5 Conclusion
We introduced a class of active sets dependent on the analysis operator , to which it is natural to restrict the attention. Indeed, as some examples from total variation regularization on graphs show, there can be some elements of which can not be seen as true active sets of any signal, depending on the graph structure.
We then derived oracle inequalities with fast rates under some compatibility conditions and oracle inequalities with slow rates. The results with fast rates show that, if one can find a suitable bound on the weighted weak compatibility constant, the analysis estimator and its square root version are adaptive, i.e. they can adapt to the unknown sparsity of . For both the analysis and the square root analysis estimators, the results with slow rates were used as a tool to retrieve, in a simple and constant-friendly way, minimax rates obtained by entropy calculations, at the price of an extra log factor. The choice of the tuning parameters and , which includes some information about the structure of the analysis operator and of the active set via the inverse scaling factor , seems to be advantageous in theoretical terms and allows us to show that the “slow” rates can almost match the minimax lower bound for the total variation regularized estimator on graph structures such as the path graph and tree graphs with bounded maximal degree.
We obtained parallel and very similar results for both the analysis and the square root analysis estimators. The differences in these results come from the fact that for the square root analysis estimator we first have to prove that the estimator does not overfit and that the KKT conditions hold. In spite of being mathematically more involved, the results for the square root analysis estimator tell us that we can get with high probability theoretical guarantees being very similar to the ones obtained for the analysis estimator by choosing a tuning parameter not depending on the unknown noise level. This fact might be helpful in practice and might speak in favor of the utilization of the square root analysis estimator.
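One way to see why the tuning parameter of the square root estimator need not depend on the noise level is a scaling argument, sketched numerically below for the path graph. The sketch assumes the two objectives have the form \lVert Y-f\rVert_{n}^{2}+2\lambda\lVert Df\rVert_{1} and \lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}; these definitions are not restated in this section, so the exact constants are an assumption, and all function names are hypothetical.

```python
import math


def norm_n(v):
    """Normalized Euclidean norm: sqrt(sum(v_i^2) / n)."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def tv_path(f):
    """Total variation of f over the path graph."""
    return sum(abs(b - a) for a, b in zip(f, f[1:]))


def objective_analysis(y, f, lam):
    """Assumed form of the analysis objective: squared loss plus 2 * lam * TV."""
    residual = [a - b for a, b in zip(y, f)]
    return norm_n(residual) ** 2 + 2.0 * lam * tv_path(f)


def objective_sqrt(y, f, lam0):
    """Assumed form of the square root objective: unsquared loss plus lam0 * TV."""
    residual = [a - b for a, b in zip(y, f)]
    return norm_n(residual) + lam0 * tv_path(f)


y = [0.3, -0.1, 0.2, 2.4, 1.9, 2.1]
f = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]
c = 10.0  # rescaling of the data, e.g. a tenfold larger noise level and signal
cy, cf = [c * v for v in y], [c * v for v in f]

# Square root objective: same lam0, objective simply multiplied by c.
print(objective_sqrt(cy, cf, 0.5), c * objective_sqrt(y, f, 0.5))
# Analysis objective: lam has to be multiplied by c to obtain a pure rescaling.
print(objective_analysis(cy, cf, c * 0.5), c ** 2 * objective_analysis(y, f, 0.5))
```

Under these assumed objectives, rescaling the data leaves the minimizer of the square root objective unchanged up to the same rescaling for a fixed \lambda_{0}, whereas the analysis objective needs \lambda to be rescaled as well.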
We then narrowed down our results to (square root) total variation regularized estimators over graphs. For fast rates we considered the cases of the path graph and of the cycle graph. In these cases we were able to show that the compatibility conditions are satisfied.
For the case of slow rates, we obtained oracle inequalities matching up to a log term the optimal rate over the path graph, the two dimensional grid graph and tree graphs of maximal bounded degree. These results do not require any compatibility condition.
These oracle inequalities can be interpreted in two ways. Either we choose a smaller tuning parameter depending on and obtain better rates, or we choose a larger tuning parameter not depending on and get worse rates. This might be a justification for incorporating possible prior knowledge of into the tuning parameter.
The main tool used to derive the oracle inequalities presented in this paper is a bound on the increments of the empirical process inspired by the projection arguments by Dalalyan, Hebiri and Lederer (2017). This bound is very simple and constant-friendly, while entropy bounds are more involved and can have large constants. There are two routes one can take after having bounded the increments of the empirical process by projection arguments. Either one uses a more refined version of the bound on the increments of the empirical process and then bounds the compatibility constant to derive fast rates. Or one bounds the increments of the empirical process in a rougher way and obtains oracle inequalities with slow rates. In this last case one only needs to bound the inverse scaling factor. Bounds on the inverse scaling factor can be very simple and constant-friendly, while bounds on the compatibility constant can sometimes lead to large constants (cf. Ortelli and van de Geer (2019b)). Moreover, results with slow rates have been shown to almost retrieve the minimax rate in a constant-friendly way also in other settings, for instance in higher order total variation regularization (Ortelli and van de Geer (2019b)). If we compare the results obtained by entropy calculations with our results with slow rates, we see that, at the expense of a log term, we are able to retrieve almost the same rate by two simple steps: the constant-friendly bound on the increments of the empirical process and the bound on the inverse scaling factor. The bound on the inverse scaling factor is constant-friendly for graph structures such as tree graphs and cycle graphs, while the bound on the inverse scaling factor for the two dimensional grid graph, which we borrow from Hütter and Rigollet (2016), is more involved.
For total variation regularized estimators on the path graph and on tree graphs of bounded maximal degree, we thus obtain nonasymptotic counterparts, in form of oracle inequalities with slow rates, to results found in the previous literature (Mammen and van de Geer (1997); Padilla et al. (2018)).
A question for further investigation is the possibility to use the framework presented here to obtain oracle inequalities with fast rates for other graph structures. The answer depends on the ability to lower bound the compatibility constant for graphs other than tree graphs and cycles. We leave this question to future research.
Appendix A Probability inequalities
We state three lemmas that help us deal with the random part of the oracle inequalities.
Lemma A.1 (The maximum of random variables, Lemma 17.5 in van de Geer (2016))
Let be real valued random variables. Assume and that . Then,
[TABLE]
Lemma A.2 (The special case of random variables, Lemma 1 in Laurent and Massart (2000), Lemma 8.6 in van de Geer (2016))
Let . Then,
[TABLE]
Remark
Note that from Lemma A.2 it follows that
[TABLE]
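A quick Monte Carlo check can help build intuition for this type of bound. The sketch below assumes the usual statement of Lemma 1 in Laurent and Massart (2000) for a chi-squared variable X with d degrees of freedom, namely P(X \geq d+2\sqrt{dt}+2t)\leq e^{-t} and P(X \leq d-2\sqrt{dt})\leq e^{-t}; the normalization used in Lemma A.2 may differ. Function names, sample size and seed are arbitrary illustrative choices.

```python
import math
import random


def chi2_sample(d, rng):
    """One draw from a chi-squared distribution with d degrees of freedom."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))


def tail_frequencies(d, t, n_draws=2000, seed=0):
    """Empirical frequencies of the two tail events in the assumed bounds."""
    rng = random.Random(seed)
    upper = d + 2.0 * math.sqrt(d * t) + 2.0 * t
    lower = d - 2.0 * math.sqrt(d * t)
    draws = [chi2_sample(d, rng) for _ in range(n_draws)]
    freq_upper = sum(x >= upper for x in draws) / n_draws
    freq_lower = sum(x <= lower for x in draws) / n_draws
    return freq_upper, freq_lower


freq_upper, freq_lower = tail_frequencies(d=5, t=1.0)
print(freq_upper, freq_lower, math.exp(-1.0))
```

Both empirical frequencies should come out well below exp(-t), reflecting that the bounds are not tight for small d.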
Lemma A.3 (Lemma 8.1 in van de Geer (2016))
For , let . Then, we have that, for ,
[TABLE]
Remark
Let be vectors. Then by the union bound and by Lemma A.3 we have that for
[TABLE]
Now select . Then we have that for ,
[TABLE]
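The union bound step in this remark can be written out as follows. This is a sketch under two assumptions: the vectors u_1,\dots,u_p have unit Euclidean norm, so that each Z_j is standard normal, and Lemma A.3 is used in the form of the standard Gaussian tail bound stated in the first comment. The symbols p, u_j and Z_j are introduced here for illustration only.

```latex
% Assumptions: Z_j := \epsilon^{\top}u_j/\sigma \sim \mathcal{N}(0,1), j = 1,\dots,p,
% and the Gaussian tail bound P(|Z_j| \ge \sqrt{2x}) \le 2e^{-x} for all x > 0.
\begin{align*}
P\Bigl(\max_{1\le j\le p}|Z_j| \ge \sqrt{2(\log(2p)+t)}\Bigr)
  &\le \sum_{j=1}^{p} P\bigl(|Z_j| \ge \sqrt{2(\log(2p)+t)}\bigr) \\
  &\le 2p\,e^{-(\log(2p)+t)} = e^{-t}.
\end{align*}
```

With p = n - r_{S} and after division by \sqrt{n}, the threshold has the same form as the lower bound on \lambda in Lemma B.2.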
Appendix B Proofs of Section 2
B.1 Basic inequality
The case of the analysis estimator is simpler than that of the square root analysis estimator, because the basic inequality holds without any extra conditions.
Lemma B.1 (Basic inequality)
For the analysis estimator we have the so called basic inequality, i.e.
[TABLE]
Proof of Lemma B.1.
The KKT conditions for the analysis estimator read
[TABLE]
Thanks to the chain rule of the subdifferential, is the subdifferential of with respect to at . We have that, for , and that, for a generic , , where the last inequality follows by the dual norm inequality and by the fact that .
By subtracting the first of the two above expressions from the second, we find that
[TABLE]
By polarization we obtain the basic inequality
[TABLE]
∎
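For reference, the polarization step used in the proof above relies on the following elementary identity. The first line holds for arbitrary vectors a and b. The second line is an illustrative substitution, assuming that the proof pairs the vectors \hat{f}-f^{0} and \hat{f}-f and that \lVert v\rVert_{n}^{2}=\lVert v\rVert_{2}^{2}/n.

```latex
% Polarization identity for arbitrary a, b \in \mathbb{R}^{n}:
2a^{\top}b = \lVert a\rVert_{2}^{2} + \lVert b\rVert_{2}^{2} - \lVert a-b\rVert_{2}^{2}.
% Illustrative substitution (assumption): a = \hat{f}-f^{0}, b = \hat{f}-f, so that a-b = f-f^{0}:
\frac{2}{n}(\hat{f}-f^{0})^{\top}(\hat{f}-f)
  = \lVert\hat{f}-f^{0}\rVert_{n}^{2} + \lVert\hat{f}-f\rVert_{n}^{2} - \lVert f-f^{0}\rVert_{n}^{2}.
```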
B.2 Bound on the increments of the empirical process
Lemma B.2
Let be arbitrary and . Choose \lambda\geq\gamma\sigma\sqrt{2\log(2(n-r_{S}))/n+2t/n}. Then, , it holds that, with probability at least ,
[TABLE]
Proof of Lemma B.2.
We have that
[TABLE]
We have that, since is of full rank,
[TABLE]
For define the set
[TABLE]
where , since .
Since , on we have that
[TABLE]
To find a lower bound on we apply Lemma A.1 to .
The moment generating function of is .
Choosing, for some , \lambda\geq\gamma\sigma\sqrt{2(\log(2(n-r_{S}))+t)/n}, e.g. \lambda=\gamma\sigma\sqrt{2(\log(2n)+t)/n}, and applying Lemma A.1 with and , we obtain that .
2.
We have that
[TABLE]
For , define the set
[TABLE]
On we have that
[TABLE]
Since is a linear space of dimension , we have that
[TABLE]
Moreover note that
[TABLE]
By applying Lemma A.2 for some we thus get that .
∎
Remark
To obtain fast rates by using compatibility conditions one makes use of the more refined bound given by Lemma B.2 involving . This term will flow into the weighted compatibility constant.
To obtain slow rates without needing compatibility conditions one utilizes the less refined version of the bound given by Lemma B.2 involving .
B.3 Proof of the oracle inequalities
Proof of Theorem 2.1.
By Lemma B.1 we have the basic inequality. By the triangle inequality, we have
[TABLE]
We now handle the random part, which is constituted by an increment of the empirical process, by using Lemma B.2. By Lemma B.2 we have that with probability at least ,
[TABLE]
Putting the pieces together, we get that,
[TABLE]
If we have that
[TABLE]
and thus
[TABLE]
where the last inequality follows by .
The term cancels out and we get the statement of the theorem.
∎
Proof of Theorem 2.2.
By Lemma B.1 we have the basic inequality. By Lemma B.2, we have that with probability at least ,
[TABLE]
We thus get that
[TABLE]
∎
Appendix C Proofs of Section 3
Define for
[TABLE]
For , define the sets ,
[TABLE]
and
[TABLE]
Note that on we have that, by the Cauchy-Schwarz inequality,
[TABLE]
Remark
By Lemma A.2 (Lemma 1 in Laurent and Massart (2000)) we have that for both and hold true.
Moreover by Lemma A.3 (Lemma 8.1 in van de Geer (2016)) and using the union bound, we see that if we choose
[TABLE]
we have that . Thus, by such a choice of we get that
[TABLE]
Remark
To keep the exposition simple, we chose the same parameter for the upper and lower bounds for both and . However, one could of course choose four different parameters, say , for the four different bounds and obtain results holding with probability resp. .
C.1 Proving that the square root analysis estimator does not overfit
Proof of Lemma 3.1.
Assumption 3.1 expresses a particular choice of the constant in Proposition C.1 below. For we have that and thus the choice of in Assumption 3.1 satisfies the upper bound given by Proposition C.1 (see below), which then holds, since all of its assumptions are satisfied and we consider the sets .
The choice of implies that and that . Thus the claim follows.
By the first Remark of Appendix C, if we choose and R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S}))), then .
∎
Proposition C.1 (The square root analysis estimator does not overfit)
Assume for some that and that for some
[TABLE]
where
[TABLE]
We assume that is s.t.
[TABLE]
Let
[TABLE]
Let . Choose R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S}))). Then with probability at least it holds that
[TABLE]
Proof of Proposition C.1, based on the proof of Lemma 3.1 by van de Geer (2016).
On the set we have that, by the Cauchy-Schwarz inequality,
[TABLE]
Thus,
[TABLE]
We now show an upper and a lower bound for .
Upper bound:
Since the estimator \hat{f}_{\sqrt{\,}} minimizes the objective function we have that
[TABLE]
It follows that
[TABLE]
Lower bound:
Note that, by the triangle inequality, we have that
[TABLE]
Thus the lemma follows if we can prove a bound of the type \lVert\hat{f}_{\mathchoice{{\hbox{\displaystyle\sqrt{,}}\lower 0.4pt\hbox{\vrule height=0.0pt,depth=0.0pt}}}{{\hbox{\textstyle\sqrt{,}}\lower 0.4pt\hbox{\vrule height=0.0pt,depth=0.0pt}}}{{\hbox{\scriptstyle\sqrt{,}}\lower 0.4pt\hbox{\vrule height=0.0pt,depth=0.0pt}}}{{\hbox{\scriptscriptstyle\sqrt{,}}\lower 0.4pt\hbox{\vrule height=0.0pt,depth=0.0pt}}}}-f^{0}\rVert_{n}\leq\text{const.}\lVert\epsilon\rVert_{n}, with leading constant in . We are not allowed to use the KKT conditions. Instead we use the convexity of the loss function and of the penalty.
Define for the convex combination \hat{f}_{t}:=t\hat{f}_{\sqrt{\,}}+(1-t)f^{0} and its residuals
[TABLE]
Choose
[TABLE]
Then
[TABLE]
We thus get that
[TABLE]
By the convexity of the loss and of the penalty and by the fact that \hat{f}_{\sqrt{\,}} is a minimizer of the objective function it follows that
[TABLE]
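The convexity step can be checked numerically. The Python sketch below assumes that the objective has the form \lVert Y-f\rVert_{n}+\lambda_{0}\lVert Df\rVert_{1}, with D the incidence matrix of a path graph. This form, the simulated data, the tuning value lam0=0.1 and all function names are illustrative assumptions. The sketch verifies that the objective at a convex combination tg+(1-t)f^{0} never exceeds the same convex combination of the objective values, for an arbitrary candidate g. When g is a minimizer, the right-hand side is in turn at most the objective value at f^{0}.

```python
import math
import random


def norm_n(v):
    """Normalized Euclidean norm ||v||_n = ||v||_2 / sqrt(n)."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def tv_path(f):
    """Total variation along a path graph: sum of |f[i+1] - f[i]|."""
    return sum(abs(b - a) for a, b in zip(f, f[1:]))


def objective(f, y, lam0):
    """Assumed square-root objective: ||y - f||_n + lam0 * TV(f)."""
    residual = [yi - fi for yi, fi in zip(y, f)]
    return norm_n(residual) + lam0 * tv_path(f)


def convex_combination(t, g, f0):
    """Entrywise t * g + (1 - t) * f0."""
    return [t * gi + (1 - t) * fi for gi, fi in zip(g, f0)]


def convexity_holds(t, g, f0, y, lam0, tol=1e-12):
    """Check objective(t g + (1-t) f0) <= t objective(g) + (1-t) objective(f0)."""
    lhs = objective(convex_combination(t, g, f0), y, lam0)
    rhs = t * objective(g, y, lam0) + (1 - t) * objective(f0, y, lam0)
    return lhs <= rhs + tol


# Illustrative data (assumption): a piecewise constant signal plus Gaussian noise.
random.seed(0)
n, lam0 = 50, 0.1
f0 = [0.0] * 25 + [1.0] * 25
y = [fi + random.gauss(0.0, 1.0) for fi in f0]
g = [sum(y) / n] * n  # an arbitrary candidate, here the constant fit

print(all(convexity_holds(k / 10, g, f0, y, lam0) for k in range(11)))
```

The check needs no solver because the inequality holds for any pair of points: the first term is a norm of an affine function of f and the penalty is a seminorm.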
By squaring the inequality we get that
[TABLE]
We have that
[TABLE]
By combining the squared inequality with the lower bound for and the expression for we get that
[TABLE]
On , for an satisfying the assumptions of the proposition, we have that
[TABLE]
Thus
[TABLE]
Moreover we have that
[TABLE]
Thus we obtain that
[TABLE]
and
[TABLE]
Note that
[TABLE]
By using the spectral decomposition, can be written as , where is s.t. . Moreover can be written as , where is s.t. and .
Let and . We have that , and and are independent. We have that and that . It follows that
[TABLE]
and thus the two terms are independent and can be handled separately.
On we have that
[TABLE]
Therefore
[TABLE]
It follows that
[TABLE]
By expressing more explicitly we get that
[TABLE]
We conclude that
[TABLE]
The last step is to find out how to choose s.t. . We get that , hence
[TABLE]
Note that we also get the assumption , which results in the assumption
[TABLE]
Note that the result holds on , which by Remark has probability at least for and R\geq\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S}))). ∎
C.2 Basic inequality
Lemma C.1
Let be an arbitrary active set satisfying Assumption 3.1 and let . For , choose \lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S}))). Under Assumption 3.1, it holds that , with probability at least ,
[TABLE]
Proof of Lemma C.1.
Under Assumption 3.1, on the KKT conditions hold
[TABLE]
We then obtain the basic inequality as in Lemma B.1 (cf. also Lemma 2 in Stucky and van de Geer (2017)). Note that by Remark, the choice of implies that . ∎
C.3 Bound on the increments of the empirical process
Lemma C.2
Let be an arbitrary active set satisfying Assumption 3.1 and let . For , choose \lambda_{0}\geq\frac{1}{1-\eta}\gamma\sqrt{\frac{2\log(2(n-r_{S}))+2t}{n-1}},\ t\in(0,(n-1)/2-\log(2(n-r_{S}))). Under Assumption 3.1 we have that , with probability at least
[TABLE]
Proof of Lemma C.2.
On , by using the decomposition into antiprojection and projection onto the nullspace of and by applying the dual norm inequality to the second term, we have that
[TABLE]
Moreover, on , under Assumption 3.1, by Corollary 3.1 we have that and thus the claim follows. Note that the choice of implies, by Remark, that . ∎
C.4 Proof of the oracle inequalities
Proof of Theorem 3.1.
We work under Assumption 3.1 on . By combining Lemma C.1 and Lemma C.2, we get that, in complete analogy to the proof of Theorem 2.1,
[TABLE]
Moreover, by Corollary 3.1, we have that on
[TABLE]
Thus we get that
[TABLE]
Since Assumption 3.1 implies that and we get that (1+\eta)(1+\sqrt{4a/n})\leq 4 and
[TABLE]
By Remark and the choice of in the statement of the theorem, we have that . ∎
Proof of Theorem 3.2.
We work under Assumption 3.1 on . By Lemma C.1 and Lemma C.2 we get that, in analogy with the proof of Theorem 2.2,
[TABLE]
By Corollary 3.1 we have that
[TABLE]
Moreover on
[TABLE]
We thus get that
[TABLE]
By Remark and the choice of in the statement of the theorem, we have that . ∎
Appendix D Proofs of Section 4
Proof of Lemma 4.1.
Notice that for a cycle graph, all elements of have at least (cf. Remark). Thus under the assumption , bounding for the cycle graph reduces to bounding for a tree graph.
Let be the incidence matrix of a directed tree graph rooted at vertex 1. Let be its Moore-Penrose pseudoinverse. By Lemma 2.2 in Ortelli and van de Geer (2019a) we have that can be obtained as , where . As pointed out in Ortelli and van de Geer (2018), has the meaning of the rooted path matrix of the tree graph considered. Thus, the columns of contain a minimum of and a maximum of entries having value 1, while the remaining entries are zeroes.
Let be the number of entries having value 1 of a column of . Let denote any vector with entries having value 1 and entries having value 0. Define . We have that . The maximum of for a given is reached at if is even and at if is odd.
Moreover, is increasing in and . Therefore, the -norm of a column of will never be greater than the greatest possible -norm of a column of . We thus have that
[TABLE]
∎
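The column norms can be checked on a small example. The Python sketch below takes a path graph on n vertices as the tree, builds the columns of its rooted path matrix with the root column removed, and centers each column. It checks that first differences of each centered column give a unit vector and that each column sums to zero, which are the properties of a column of the pseudoinverse of the path incidence matrix. A column with b ones then has squared Euclidean norm b(n-b)/n, which is largest for b closest to n/2. The symbols n and b, the closed form as written here, and the function names are assumptions of this sketch. Exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def centered_path_columns(n):
    """Centered columns of the rooted path matrix of a path graph on n vertices.

    The column with b ones (b = 1, ..., n-1) has zeros in its first n-b
    entries and ones in its last b entries; its mean b/n is subtracted.
    """
    columns = []
    for b in range(1, n):
        column = [Fraction(0)] * (n - b) + [Fraction(1)] * b
        mean = Fraction(b, n)
        columns.append([entry - mean for entry in column])
    return columns


def squared_norm(v):
    """Squared Euclidean norm."""
    return sum(entry * entry for entry in v)


def is_pseudoinverse_column(column, b):
    """First differences must form a unit vector and the column must sum to zero."""
    n = len(column)
    diffs = [column[i + 1] - column[i] for i in range(n - 1)]
    unit = [Fraction(int(i == n - b - 1)) for i in range(n - 1)]
    return diffs == unit and sum(column) == 0


def max_squared_column_norm(n):
    """Largest squared column norm over b = 1, ..., n-1."""
    return max(squared_norm(c) for c in centered_path_columns(n))


n = 9  # illustrative size (assumption)
columns = centered_path_columns(n)
assert all(is_pseudoinverse_column(c, b) for b, c in zip(range(1, n), columns))
assert [squared_norm(c) for c in columns] == [Fraction(b * (n - b), n) for b in range(1, n)]
assert max_squared_column_norm(n) <= Fraction(n, 4)
print(max_squared_column_norm(n))
```

The path graph is only one instance of a tree. It is convenient here because its columns realize every value of b between 1 and n-1.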
Proof of Corollary 4.1.
By Lemma 4.1 we have that
[TABLE]
By combining the above with Lemma 4.3, Lemma 4.4 and Theorem 2.1 we get Corollary 4.1. ∎
References
- Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 98 791–806.
- Birman, M. S. and Solomjak, M. Z. (1967). Piecewise-polynomial approximations of functions of the classes W^{\alpha}_{p}. Math. USSR Sb. 2.
- Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data. doi:10.1007/978-3-642-20192-9
- Bunea, F., Lederer, J. and She, Y. (2014). The Group Square-Root Lasso: Theoretical Properties and Fast Algorithms. IEEE Transactions on Information Theory 60 1313–1325.
- Chatterjee, S. and Goswami, S. (2019). New Risk Bounds for 2d Total Variation Denoising. arXiv:1902.01215v2 1–59.
- Dalalyan, A., Hebiri, M. and Lederer, J. (2017). On the prediction performance of the Lasso. Bernoulli 23 552–581.
- Derumigny, A. (2018). Improved bounds for Square-Root Lasso and Square-Root Slope. Electronic Journal of Statistics 12 741–766.
- Donoho, D. L. and Johnstone, I. M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics 26 879–921.
