Estimating Piecewise Monotone Signals
Kentaro Minami

TL;DR
This paper analyzes the nearly-isotonic regression for estimating piecewise monotone signals, providing risk bounds and an algorithm for general graphs, showing it performs nearly as well as an oracle estimator.
Contribution
It derives adaptive risk bounds for nearly-isotonic regression and introduces a versatile algorithm applicable to weighted graphs.
Findings
Risk bounds are adaptive to piecewise monotone signals.
Nearly-isotonic regression performs close to an oracle estimator.
The proposed algorithm works on general weighted graphs.
Abstract
We study the problem of estimating piecewise monotone vectors. This problem can be seen as a generalization of the isotonic regression that allows a small number of order-violating changepoints. We focus mainly on the performance of the nearly-isotonic regression proposed by Tibshirani et al. (2011). We derive risk bounds for the nearly-isotonic regression estimators that are adaptive to piecewise monotone signals. The estimator achieves a near minimax convergence rate over certain classes of piecewise monotone signals under a weak assumption. Furthermore, we present an algorithm that can be applied to the nearly-isotonic type estimators on general weighted graphs. The simulation results suggest that the nearly-isotonic regression performs as well as the ideal estimator that knows the true positions of changepoints.
The University of Tokyo
Preferred Networks, Inc.
(7 March 2020)
keywords: piecewise monotone function, isotonic regression, nearly-isotonic regression, adaptive risk bounds
Contents
- 3.2 Lower bound of isotonic regression with misspecified partitions
- D.2 Risk bounds for constrained estimators (Proof of Theorem 4.1)
- D.4 Risk bounds for penalized estimators (Proof of Theorem 4.7)
1 Introduction
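Isotonic regression is a popular statistical method based on partial order structures, which has a long history in statistics (Ayer et al. 1955, Brunk 1955, van Eeden 1956). Suppose that $\theta^* \in \mathbb{R}^n$ is a monotone vector satisfying $\theta^*_1 \le \theta^*_2 \le \cdots \le \theta^*_n$, and $y \in \mathbb{R}^n$ is a noisy observation of $\theta^*$. The goal of the isotonic regression is to find a least-squares fit under the monotone constraint:
$$\hat{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \frac{1}{2} \|y - \theta\|_2^2 \quad \text{subject to} \quad \theta_1 \le \theta_2 \le \cdots \le \theta_n. \tag{1}$$
In other words, the isotonic regression is the least squares estimator over the closed convex cone $\mathcal{K}_n^{\uparrow} := \{\theta \in \mathbb{R}^n : \theta_1 \le \theta_2 \le \cdots \le \theta_n\}$. Broadly speaking, the isotonic regression is an example of shape restricted regression. For comprehensive reviews on this field, see Robertson et al. (1988), Groeneboom and Jongbloed (2014), Chatterjee et al. (2015), Guntuboyina and Sen (2017) and references therein.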
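The least-squares fit under the monotone constraint can be computed with the classical pool adjacent violators idea: scan the observations from left to right and merge neighbouring blocks whenever their means are out of order. The following is a minimal sketch in plain Python; the function name `isotonic_regression` and the block representation are illustrative choices and do not come from the paper.

```python
def isotonic_regression(y):
    """Least-squares nondecreasing fit to y (pool adjacent violators sketch)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the mean of the previous block exceeds the
        # mean of the last block (compared without division).
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)  # every block is fitted by its mean
    return fit


print(isotonic_regression([1, 3, 2]))
```

In this small example the order-violating pair (3, 2) is pooled into its average, while the first observation is left unchanged.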
In this paper, we study the problem of estimating piecewise monotone vectors, which can be regarded as a generalization of isotonic regression that allows order-violating changepoints. We formulate the problem precisely as follows. Let us consider the Gaussian sequence model
$$y = \theta^* + \xi, \tag{2}$$
where $y \in \mathbb{R}^n$ is the observed vector, $\theta^* \in \mathbb{R}^n$ is the unknown parameter of interest, and $\xi$ is the unobserved noise distributed according to the Gaussian distribution $N(0, \sigma^2 I_n)$. Given the noisy observation $y$, the problem is to find a good piecewise monotone approximation of $\theta^*$. Here we define piecewise monotone vectors as follows.
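Definition 1.1.
Let $\Pi = \{A_1, \ldots, A_m\}$ be a connected partition of $[n] = \{1, \ldots, n\}$, that is, there exists a sequence $1 = \tau_1 < \tau_2 < \cdots < \tau_{m+1} = n + 1$ such that $A_i = \{\tau_i, \tau_i + 1, \ldots, \tau_{i+1} - 1\}$ ($i = 1, \ldots, m$). We say that a vector $\theta \in \mathbb{R}^n$ is piecewise monotone on $\Pi$ if the restriction on each $A_i$ is monotone:
$$\theta_{\tau_i} \le \theta_{\tau_i + 1} \le \cdots \le \theta_{\tau_{i+1} - 1} \quad (i = 1, \ldots, m).$$
We also say that $\theta$ is $m$-piecewise monotone if $\theta$ is piecewise monotone on some partition $\Pi$ with $|\Pi| = m$.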
We are particularly interested in the case where the number of pieces $m$ is larger than two but much smaller than $n$, because the problem otherwise reduces to simpler ones. From Definition 1.1, a monotone vector in $\mathbb{R}^n$ is $m$-piecewise monotone for any $m$. In particular, the least squares estimator over $1$-piecewise monotone vectors coincides with the isotonic regression. Besides, since any vector in $\mathbb{R}^n$ is $n$-piecewise monotone, the least squares estimator over $n$-piecewise monotone vectors is merely the identity function $\hat{\theta} = y$.
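To make the definition concrete, the sketch below checks whether a vector is piecewise monotone on a given connected partition and computes the smallest number of monotone pieces a vector admits. The partition is encoded by the 0-based start index of each piece; this encoding and both function names are assumptions made for illustration, not notation from the paper.

```python
def is_piecewise_monotone(theta, starts):
    """Check monotonicity of theta on each piece of a connected partition.

    `starts` holds the sorted 0-based start index of every piece (hypothetical
    encoding), so starts[0] must be 0.
    """
    bounds = list(starts) + [len(theta)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Within a piece, no entry may exceed its right neighbour.
        if any(theta[i] > theta[i + 1] for i in range(lo, hi - 1)):
            return False
    return True


def min_monotone_pieces(theta):
    """Smallest m such that theta is piecewise monotone on some m-piece partition."""
    # Every strict descent forces a cut, and cutting at each descent suffices.
    return 1 + sum(1 for a, b in zip(theta, theta[1:]) if a > b)


sawtooth = [1, 2, 3, 0, 1, 2]
print(is_piecewise_monotone(sawtooth, [0, 3]), min_monotone_pieces(sawtooth))
```

Order violations are allowed only across piece boundaries, which is why the sawtooth example needs exactly one changepoint.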
In real-world applications, there are many signals that can be approximated by piecewise monotone vectors. Here, we provide a few examples. First, in seismology, geological observations such as tide gauge records (Nagao et al. 2013) and GPS records (Rogers and Dragert 2003) often consist of a long-term monotonic trend and discontinuous jumps caused by tectonic activities. In particular, Rogers and Dragert (2003) reported that GPS measurements near a subduction zone in North America can be approximated by a sawtooth function. The top panel of Figure 1 shows an example of GPS measurements. Second, the numbers of search queries for some words related to seasons (e.g., “Christmas” and “gift”) can be seen as periodic piecewise monotone signals (see the bottom panel of Figure 1 for examples). Third, in the ranking systems of online shopping websites, sales ranks of rarely sold items behave like piecewise monotone signals because they suddenly rise every time the items are sold (Hattori and Hattori 2010).
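In this paper, we focus on the performance of nearly-isotonic regression proposed by Tibshirani et al. (2011). Given $y \in \mathbb{R}^n$ and a tuning parameter $\lambda \ge 0$, the nearly-isotonic regression estimator is defined as
$$\hat{\theta}_\lambda = \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \left\{ \frac{1}{2} \|y - \theta\|_2^2 + \lambda \sum_{i=1}^{n-1} (\theta_i - \theta_{i+1})_+ \right\}, \tag{3}$$
where $(x)_+ := \max\{x, 0\}$. Intuitively, the tuning parameter $\lambda$ controls the degree of monotonicity. The term $(\theta_i - \theta_{i+1})_+$ poses a positive penalty if and only if the directed edge $(i, i+1)$ is order violating, i.e., $\theta_i > \theta_{i+1}$. Hence, a large value of $\lambda$ makes the estimator close to a monotone vector. In particular, there is a sufficiently large $\lambda$ such that the solution becomes exactly the same as the isotonic regression (1).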
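The effect of the tuning parameter can be explored numerically. The sketch below assumes the objective of Tibshirani et al. (2011), namely half the squared error plus the tuning parameter times the sum of the positive parts of the adjacent differences $\theta_i - \theta_{i+1}$. It is not the Modified PAVA path algorithm; instead it runs projected gradient ascent on the box-constrained dual problem, where each dual variable lies between zero and the tuning parameter and the primal solution is recovered as $y$ minus the transposed difference operator applied to the dual variable. The function name, the iteration count, and the step size 1/4 (valid because the squared operator norm of the difference matrix is below 4) are choices made for this illustration.

```python
def nearly_isotonic(y, lam, n_iter=5000):
    """Approximate nearly-isotonic fit via dual projected gradient (sketch)."""
    y = [float(v) for v in y]
    u = [0.0] * (len(y) - 1)  # one dual variable per adjacent pair (i, i+1)

    def primal(dual):
        # theta = y - D^T u, where (D theta)_i = theta_i - theta_{i+1}.
        theta = list(y)
        for i, ui in enumerate(dual):
            theta[i] -= ui
            theta[i + 1] += ui
        return theta

    for _ in range(n_iter):
        theta = primal(u)
        # Gradient step on the dual, then projection onto the box [0, lam].
        u = [
            min(max(ui + 0.25 * (theta[i] - theta[i + 1]), 0.0), lam)
            for i, ui in enumerate(u)
        ]
    return primal(u)


for lam in (0.0, 0.25, 10.0):
    print(lam, nearly_isotonic([1, 3, 2], lam))
```

With the tuning parameter at zero the fit equals the data, with a large value it matches the isotonic fit, and intermediate values shrink only the order-violating pair.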
Our goal in this paper is to show that the nearly-isotonic regression can adapt to piecewise monotone vectors. As suggested in Tibshirani et al. (2011), the nearly-isotonic regression can fit to a “nearly monotone” vector that is close to in -sense. That is, the estimator performs well if has a small -misspecification error defined as
[TABLE]
Moreover, we can observe that the nearly-isotonic regression can fit to piecewise monotone vectors, even if is far from monotone in -sense. Figure 2 shows an example of the nearly-isotonic regression with . The true parameter (orange line) is 2-piecewise monotone. By varying the values of the tuning parameter , the nearly-isotonic regression behaves as follows: If , the nearly-isotonic regression is just the identity estimator , which clearly overfits to the noisy observation. If is set to a sufficiently large value, coincides with the isotonic regression. In this example, however, the -misspecification error is large compared with the normalized noise variance . We can see that the mean squared error (MSE) of the isotonic regression can be much worse than that of the identity estimator, which coincides with (see Section 3.2). Indeed, we can choose a 2-piecewise monotone vector with arbitrarily large -misspecification error. If we choose an intermediate value of , the nearly-isotonic regression seems to fit to the true parameter. This suggests the adaptation property to piecewise monotone vectors.
1.1 Summary of theoretical results
In this paper, we investigate the adaptation property of the nearly-isotonic regression estimators defined in (3).
In the monotone regression setting (i.e., ), it is known that the isotonic regression estimator achieves the risk bound
[TABLE]
where is the total variation of the monotone vector . It is also known that the rate is minimax optimal under the assumption that is monotone and (Zhang 2002). Hence, a natural question is whether a similar rate can be achieved in piecewise monotone regression.
In Section 3.1, we provide the minimax lower bound over the class of piecewise monotone vectors. Let be the set of -piecewise monotone vectors whose “upper” total variations are bounded by (a precise definition is provided in Section 3.1). Then, the minimax risk over is bounded from below by a constant multiple of
[TABLE]
In Section 5, we construct a concrete (but not computationally efficient) estimator that adaptively achieves this rate, and hence this lower bound is tight in the sense of the order in , and . Intuitively, this suggests that the cost of not knowing the true partition is of order .
In Section 4, we provide the following risk bound for the nearly-isotonic regression estimator (3). A precise statement is given in Corollary 4.12.
Claim 1.2.
Let be a piecewise monotone vector on a partition . Suppose that the following assumptions hold:
- (a) The partition is equi-spaced: .
- (b) For each segment , is monotone and the total variation is bounded as .
- (c) satisfies an appropriate “growth condition” for each .
Then, the estimator (3) with optimally tuned parameter satisfies the following risk bound:
[TABLE]
The above claim is obtained as a corollary of a more general risk bound in Section 4. In the above statement, we make somewhat restrictive assumptions. Here, (a) and (b) are introduced just for the sake of notational simplicity, whereas (c) is an essential assumption. If we assume only (a) and (b), the rate that appeared in (4) is minimax optimal up to a logarithmic multiplicative factor. However, we require an extra growth condition (c), which seems to be unavoidable for the estimator (3). We will provide a precise definition of the growth condition in Section 4.3.
1.2 Organization
The rest of this paper is organized as follows. In Section 2, we give a brief literature review on shape restricted regression and regularization-based estimators and relate our theoretical results to previous work. We provide lower bounds on the risks in the piecewise monotone regression problem in Section 3. In Section 4, we describe our main results on the risk upper bounds for the nearly-isotonic regression estimator and its constrained form variant. In particular, a precise statement of Claim 1.2 above is provided in Section 4.3. In Section 5, we discuss the attainability of the minimax lower bound; herein, we provide a concrete example of a model selection-based estimator that achieves the optimal rate. Furthermore, we present some numerical examples in Section 6. Finally, we present our conclusion in Section 7. We have also included appendices which contain additional numerical examples on two-dimensional signals, explanations of algorithms, and all proofs of the theoretical results.
1.3 Notation
Throughout this paper, we assume that is distributed according to an isotropic normal distribution , where is the true mean parameter of interest and is the noise vector. The symbol denotes the expectation with respect to .
We sometimes denote by an absolute positive constant whose value may vary.
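For any $\theta \in \mathbb{R}^n$, we define the total variation and the lower total variation by
$$\mathcal{V}(\theta) := \sum_{i=1}^{n-1} |\theta_{i+1} - \theta_i| \quad \text{and} \quad \mathcal{V}_-(\theta) := \sum_{i=1}^{n-1} (\theta_i - \theta_{i+1})_+,$$
where $(x)_+ := \max\{x, 0\}$ for any $x \in \mathbb{R}$. For example, if $\theta$ is monotone nondecreasing, then $\mathcal{V}(\theta) = \theta_n - \theta_1$ and $\mathcal{V}_-(\theta) = 0$. In this paper, the meaning of subscripts of $\theta$ depends on the context. If $A$ is a connected subset of $[n]$, we denote by $\theta_A$ a sub-vector $(\theta_i)_{i \in A}$. We also denote by $\mathcal{V}_A(\theta)$ the total variation of $\theta_A$.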
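As a quick numerical check of these two quantities, the following sketch computes the total variation as the sum of absolute adjacent differences and the lower total variation as the sum of the order-violating drops. Both formulas are our reading of the definitions above (consistent with the stated example for monotone vectors), and the function names are illustrative.

```python
def total_variation(theta):
    """Sum of absolute differences between adjacent entries."""
    return sum(abs(b - a) for a, b in zip(theta, theta[1:]))


def lower_total_variation(theta):
    """Sum of the drops theta_i - theta_{i+1} over order-violating pairs only."""
    return sum(max(a - b, 0.0) for a, b in zip(theta, theta[1:]))


monotone = [1.0, 2.0, 4.0]
sawtooth = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
print(total_variation(monotone), lower_total_variation(monotone))
print(total_variation(sawtooth), lower_total_variation(sawtooth))
```

For the monotone vector the total variation is the last entry minus the first and the lower total variation vanishes, whereas the sawtooth pays only for its single downward jump.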
2 Related work
There are two classes of estimators that are closely related to the nearly-isotonic regression (3): the isotonic regression and the fused lasso.
As we mentioned above, the isotonic regression is an instance of shape restricted regression. Many existing estimators in shape restricted regression can be formulated as least squares estimators (denoted by ) onto closed convex sets (denoted by ). Examples include, but are not limited to, the isotonic regression, the isotonic regression on a two-dimensional grid or more general partial orders (see e.g., Robertson and Wright (1975) and Kyng et al. (2015)), and convex regression (Hildreth 1954).
Recently, researchers have developed two important techniques for analyzing risk behaviors of least squares estimators. First, Chatterjee (2014) proved that the Euclidean norm is tightly concentrated around a certain quantity defined by the localized Gaussian width. As applications of Chatterjee’s method, non-asymptotic upper bounds that have similar rates to the minimax risks have been proved for the isotonic regression (Chatterjee 2014, Bellec 2018), the multi-isotonic regression in two or more dimensions (Chatterjee et al. 2018, Han et al. 2017), the multi-dimensional convex regression (Han and Wellner 2016), and the constrained form trend filtering estimator (Guntuboyina et al. 2017). See also Section 2.2 in Bellec (2018) for a related result. Second, risk bounds based on the statistical dimension of the tangent cone of have been developed by Oymak and Hassibi (2016) and Bellec (2018). This technique is useful because it takes into account the facial structure of , which leads to risk bounds that are adaptive to low dimensional sub-structures. It has been shown that some least squares estimators are adaptive to piecewise constant vectors: for example, the isotonic regression (Bellec 2018) and the multi-isotonic regression (Chatterjee et al. 2018, Han et al. 2017). In particular, for the one-dimensional isotonic regression, Chatterjee et al. (2015) and Bellec (2018) proved the following oracle inequality
[TABLE]
where is the number of constant pieces of . If is monotone and is small, the right-hand side can be much smaller than the worst-case rate of . However, the first term in the right-hand side can become arbitrarily large if is not included in .
The fused lasso (Tibshirani et al. 2005), also known as the total variation regularization (Rudin et al. 1992), is a penalized estimator defined as
$$\operatorname*{argmin}_{\theta \in \mathbb{R}^n} \left\{ \frac{1}{2} \|y - \theta\|_2^2 + \lambda \sum_{i=1}^{n-1} |\theta_{i+1} - \theta_i| \right\}, \tag{6}$$
where $\lambda \ge 0$ is the tuning parameter. The fused lasso poses the penalty whenever $\theta_i \ne \theta_{i+1}$, whereas the penalty of the nearly-isotonic regression (3) activates only if $\theta_i > \theta_{i+1}$. Theoretical risk bounds for the fused lasso have been studied by Mammen and van de Geer (1997), Dalalyan et al. (2017), Lin et al. (2017), and Guntuboyina et al. (2017). In particular, Guntuboyina et al. (2017) showed an oracle inequality of the following form:
[TABLE]
where is an optimally tuned parameter. One can control the quantity by assuming a mild regularity condition on so that the inequality (7) recovers the minimax rate for the piecewise constant vectors (see e.g., Gao et al. (2017)). However, even if is a monotone vector, (7) does not recover the rate of the isotonic regression (5) because becomes zero if and only if is just a constant vector.
Our risk bound for the nearly-isotonic regression in Section 4.2 fills the gap between the above risk bounds for the isotonic regression and the fused lasso. We will show an oracle inequality of the following form:
[TABLE]
Like in the case of the fused lasso (7), this inequality provides a meaningful risk bound even if we cannot approximate by a monotone vector. Furthermore, becomes zero for any monotone vector . Hence, our result can exactly recover the rate achieved by the isotonic regression (5).
3 Lower bounds
In this section, we provide lower bounds for the risk in one-dimensional piecewise monotone regression.
3.1 Minimax lower bound
We are interested in the lower bound for the minimax risk defined as
[TABLE]
where is a set of piecewise monotone vectors, and the infimum is taken over all (measurable) estimators . In particular, for , we consider the class of -piecewise monotone vectors with a bounded total variation that is defined as follows.
Definition 3.1.
Let and . For any , let denote the set of (at most) -piecewise monotone vectors such that the upper total variation is bounded by . In other words, a vector is an element of if and only if the following conditions hold:
- (i) is piecewise monotone on a connected partition of whose cardinality is not larger than .
- (ii) There exist numbers such that , , and for all .
In addition, we also define as the set of -piecewise monotone vectors such that the total variations for all pieces are uniformly bounded by . That is, is obtained by replacing (ii) by the following condition:
- (ii)’ for all .
First, suppose that is piecewise monotone on a known partition and that the total variation of the sub-vector is bounded as for each . Then, the problem decomposes into independent subproblems of estimating monotone vectors . The minimax risk lower bound for monotone vectors has been proved by Zhang (2002) and Chatterjee et al. (2015). For notational simplicity, we assume here that for all . The minimax risk can be written as
[TABLE]
Hence, the minimax risk over is clearly bounded from below by
[TABLE]
If the partition is known, then this convergence rate can be obtained by concatenating the least squares estimators on all pieces. By Jensen’s inequality, the quantity (9) is not larger than .
In the general setting, we have to deal with unknown partitions. The following proposition gives the lower bound over the class of piecewise monotone vectors in Definition 3.1.
Proposition 3.2.
Let , , and . Suppose that is either or in Definition 3.1. Then, for any estimator , we have the following lower bound:
[TABLE]
where is a universal constant.
It remains to verify that the lower bound (10) is tight. Thus, in Section 5, we will construct an estimator that adaptively achieves a similar rate.
3.2 Lower bound of isotonic regression with misspecified partitions
Suppose that is an -piecewise monotone vector. As we mentioned in the previous subsection, if we know the true partition on which is monotone, the least squares estimator can achieve the rate shown in (9). Here, we consider what happens if we underestimate the true number of the pieces.
We consider the risk behavior of the isotonic regression , which corresponds to the least squares estimator for the underestimated number of pieces as . If the true number of pieces is larger than or equal to two, may not be contained in . Recall that is the -misspecification error against the set of monotone vectors. Bellec (2018) showed that the isotonic regression is robust against a small -misspecification, that is, if , then
[TABLE]
where is the orthogonal projection of onto . Conversely, if the -misspecification error is large, we see that the isotonic regression can have an arbitrarily large risk.
Proposition 3.3.
There is a positive number that depends on and such that if the true parameter satisfies , then the MSE of the isotonic regression is bounded from below as
[TABLE]
In this case, the isotonic regression has a strictly larger MSE than that of the identity estimator .
We can easily check that there is a 2-piecewise monotone vector with an arbitrarily large -misspecification error. To see this, let be a piecewise constant vector defined as for and for . Then, it is easy to see that diverges as . Figure 2 shows an example of a 2-piecewise monotone vector such that the isotonic regression has a larger squared loss value than the identity estimator.
4 Risk bounds for nearly-isotonic regression
In this section, we develop the risk bound for the nearly-isotonic regression estimator (3). Proofs of all the theorems and propositions in this section are presented in Appendix D.
4.1 Risk bounds for constrained estimators
Before considering the original version of the nearly-isotonic regression (3), we consider the performance of the constrained form nearly-isotonic regression defined by the following constrained optimization problem:
[TABLE]
where is the tuning parameter. By the fundamental duality theorem in convex optimization, there exists a Lagrange multiplier such that the regularization type formulation (3) admits the same solution . Hence, the solution path of penalized estimators and that of constrained estimators are equivalent. However, the properties of estimators with fixed values of and can be different in the following sense:
- From a computational perspective, calculating the constrained estimator (11) for a given is more difficult than the regularization estimator (3). For the regularization estimator (3), we can use the Modified Pool Adjacent Violators Algorithm (Modified PAVA) proposed by Tibshirani et al. (2011), which outputs the solution path for every . In particular, given , we can always obtain an exact solution . However, to the best of our knowledge, there are no practical algorithms that obtain an exact solution for the constrained problem (11) that run as fast as the algorithms for the penalized problem (3). We present detailed explanations for the algorithms in Section A.
- From a statistical perspective, the correspondence between tuning parameters and is not deterministic (i.e., it depends on the realization of the data ). For this reason, a risk bound that is obtained for one of (3) or (11) cannot be directly applied to the other.
We show the main results on the adaptation property to piecewise monotone vectors in terms of sharp oracle inequality.
Before proceeding, we introduce some notation. Suppose that is piecewise constant on a connected partition of . We denote by the number of pieces on which is constant. That is, there are integers such that (i) for and (ii) for any , there exists such that for all . We define the sign associated with each knot () as
[TABLE]
In other words, if and only if the order violation occurs at . See Figure 3 for the graphical illustration. Then, we define as
[TABLE]
determines the non-monotonicity of a piecewise constant vector . If is -piecewise monotone, then it is clear that . In particular, for any monotone vector , we have . Based on these notations, we have the following sharp oracle inequality.
Theorem 4.1.
For any , the constrained nearly-isotonic regression (11) satisfies the following oracle inequality:
[TABLE]
Moreover, for any , we have
[TABLE]
with probability at least .
The following risk bound for the best choice of the tuning parameter is an immediate consequence of Theorem 4.1.
Corollary 4.2.
Suppose . Choose that minimizes the upper bound in (4.1) (thus, depends on the true parameter ). Then, we have
[TABLE]
Also, choosing or , we have
[TABLE]
Remark 4.3.
We briefly comment on the proof of Theorem 4.1 and Corollary 4.2. A key ingredient is to obtain a bound on the statistical dimension (Amelunxen et al. 2014) of the tangent cone of the constraint set . This methodology was first developed for the isotonic regression and the convex regression by Bellec (2018). In particular, our approach is inspired by the analysis of the constrained trend filtering estimators by Guntuboyina et al. (2017). See Appendix D for detailed proofs.
By restricting the region over which the infimum in (4.2) is taken, we have the oracle inequality for monotone vectors
[TABLE]
which recovers the existing results on the isotonic regression (Chatterjee et al. 2015, Bellec 2018) up to a constant multiplicative factor.
To understand the general upper bound in (4.2), we have to control the quantity defined in (13). To this end, we consider the minimal length condition; we say that satisfies the minimal length condition for a constant if it satisfies
[TABLE]
where the partition and the signs () are defined as in (13). Intuitively, a signal is well approximated by another signal that satisfies the minimal length condition if has “moderate slopes” around the order-violating jumps. For further discussion on such growth conditions, see Section 4.3.
Based on the minimal length condition, we have the following result from Theorem 4.1.
Corollary 4.4.
Suppose that satisfies the minimal length condition (18) for a constant . Assume that is -piecewise constant and -piecewise monotone. Then, the constrained nearly-isotonic regression (11) satisfies
[TABLE]
In particular, if the tuning parameter is chosen so that
[TABLE]
for a positive constant , we have
[TABLE]
where is a positive constant.
Remark 4.5.
If is -piecewise constant and -piecewise monotone, it is always true that . Hence, the inequality (4.4) can be simplified as
[TABLE]
where is a constant that depends on alone.
Remark 4.6.
We comment on the minimal length condition and its relation to the estimation of piecewise constant vectors. We conjecture that the minimal length condition (18) is essentially unavoidable for the risk bound of the nearly-isotonic regression due to the following analogy to the fused lasso. The minimal length condition for the fused lasso is considered by Guntuboyina et al. (2017). For the fused lasso, Fan and Guan (2017) showed that the minimal length condition cannot be removed in the sense that there is a lower bound depending on the minimum length (see also the experimental result by Guntuboyina et al. (2017), Remark 2.5).
4.2 Risk bounds for penalized estimators
In this section, we consider the risk bounds for the nearly-isotonic regression (3) in the original penalized form by Tibshirani et al. (2011).
Theorem 4.7.
For any , let denote the nearly-isotonic regression estimator defined in (3). Let and be any vectors in . Then, there exists a tuning parameter that depends only on such that, for any , we have the following risk bound:
[TABLE]
where and are defined similarly as in Theorem 4.1. Furthermore, for any , the inequality
[TABLE]
holds with probability .
We comment on some direct consequences of Theorem 4.7. In this theorem, is defined as a function of . To understand the risk bound (4.7), we consider the choice of the tuning parameter that depends on the true parameter . Let be a vector that minimizes the quantity
[TABLE]
among all . Then, taking , we have the following oracle inequality which has the same form as (4.2):
[TABLE]
Moreover, if or , we have
[TABLE]
Again, if we assume the minimal length condition (18) on , we obtain a simplified bound of the form (17).
We move on to discuss a precise expression of in Theorem 4.7. The next proposition provides an upper bound for .
Proposition 4.8.
Suppose . Let be the constant partition of , and be the associated signs defined in (4.1). Then, there is a universal constant such that in Theorem 4.7 is bounded from above by
[TABLE]
The purpose of the choice of in Proposition 4.8 is to derive the theoretical convergence rate in terms of and . However, different choices are possible if we are interested in other theoretical aspects (e.g., estimation consistency for changepoints). For the fused lasso estimator (6), several authors have studied theoretical choices of tuning parameters that result in risk upper bounds (Dalalyan et al. 2017, Lin et al. 2017, Guntuboyina et al. 2017).
Remark 4.9 (Example of parameter choice).
Here, we provide an example choice of the tuning parameter under a simple length condition. Let us assume that (i) is not globally monotone (i.e., ) and (ii) is of order , that is,
[TABLE]
holds for some . Then, we can see that is bounded from above by
[TABLE]
where is a constant that depends on . For the fused lasso, the theoretical choice has been suggested by Dalalyan et al. (2017) and Guntuboyina et al. (2017). For a detailed discussion, see Remark 2.7 by Guntuboyina et al. (2017) and references therein.
Remark 4.10.
In general, the choice of the tuning parameter that minimizes the risk can be different from the theoretical suggestion. More importantly, we cannot obtain the value of suggested in Proposition 4.8 because it depends on the unknown true parameter and the noise standard deviation . In practice, there are two typical data-dependent choices of :
- Stein’s unbiased risk estimate: If we know or its estimate , we can reasonably choose a parameter by minimizing Stein’s unbiased risk estimate (SURE)
[TABLE]
Here, is an unbiased estimate of the degrees of freedom. See Tibshirani et al. (2011) for the derivation.
- Cross-validation: We can also apply the cross-validation when the model (2) is interpreted as a discrete observation of a continuous signal. Specifically, suppose that the data is generated according to the following nonparametric regression model:
[TABLE]
where are given design points in and is an unknown piecewise monotone function. We define the nearly-isotonic regression estimator over the interval as follows: First, we determine the values () by solving
[TABLE]
Then, we define by interpolation. For instance, one can output a piecewise constant function so that . In this sense, given a new design point , we can predict the value of by . Hence, we can naturally apply the cross-validation in this situation.
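The SURE criterion from the first item can be sketched as follows. The code uses the generic form of Stein's unbiased risk estimate (residual sum of squares, minus the sample size times the noise variance, plus twice the noise variance times a degrees-of-freedom estimate) and takes the degrees-of-freedom estimate to be the number of constant pieces of the fitted vector, which is how we understand the result of Tibshirani et al. (2011); treat that choice, the tolerance `tol`, and the function names as assumptions of this illustration.

```python
def count_pieces(fit, tol=1e-8):
    """Number of maximal constant runs in fit (assumed degrees-of-freedom estimate)."""
    return 1 + sum(1 for a, b in zip(fit, fit[1:]) if abs(a - b) > tol)


def sure(y, fit, sigma):
    """Stein's unbiased risk estimate for a fitted vector under noise level sigma."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    return rss - n * sigma**2 + 2 * sigma**2 * count_pieces(fit)


y = [1.0, 3.0, 2.0]
print(sure(y, [1.0, 2.5, 2.5], sigma=1.0))  # pooled (isotonic) fit
print(sure(y, y, sigma=1.0))                # identity fit
```

In practice one would evaluate this criterion along a grid of tuning parameters and keep the minimizer; here the pooled fit scores lower than the identity fit because it trades a small residual for one fewer piece.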
4.3 Application to piecewise monotone vectors
To gain a deeper understanding of the adaptation property of the nearly-isotonic regression, we study the risk bound under a more specific assumption. We define the following moderate growth condition for piecewise monotone vectors.
Definition 4.11.
Let . We say that a monotone vector satisfies the moderate growth condition if
[TABLE]
and
[TABLE]
Figure 4 gives an illustration of the moderate growth condition. In words, the signal satisfying the moderate growth condition is not larger than the linear signal in the left half of the domain, and not less than that in the right half of the domain. Intuitively, the role of the moderate growth condition is to guarantee the minimal length condition (18) for a piecewise constant approximation.
Suppose that the true signal is piecewise monotone and every segment satisfies the moderate growth condition. Then, the nearly-isotonic regression achieves a nearly minimax convergence rate as follows.
Corollary 4.12.
Suppose that the following assumptions hold:
- (a) The partition is equi-spaced: .
- (b) is monotone and for each .
- (c) satisfies the moderate growth condition for each .
Then, the estimator (3) with optimally tuned parameter satisfies the following risk bound:
[TABLE]
The risk bound (25) achieves the minimax rate over in Proposition 3.2 up to a multiplicative factor of . We should note that the restrictive assumption (a) in Corollary 4.12 is employed merely for the sake of simplicity of the proof. We may relax this assumption as
[TABLE]
for some .
5 Model selection based estimators
Here, we consider estimators obtained by model selection among all partitions . The main purpose of this section is to discuss whether the minimax lower bound in Proposition 3.2 can be achieved without any additional assumption such as the moderate growth condition.
Given a connected partition of , we write for the set of piecewise monotone vectors on , i.e.,
[TABLE]
Let denote the projection estimator onto . By definition, is obtained by concatenating isotonic regression estimators defined in every segment.
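The projection onto piecewise monotone vectors on a known partition can be sketched by fitting each segment separately and concatenating the results. For brevity the per-segment fit below uses the classical min-max representation of isotonic regression (the fitted value at position i is the maximum over left endpoints j ≤ i of the minimum over right endpoints k ≥ i of the average of the observations from j to k) rather than a pool-adjacent-violators routine; this naive version has cubic cost per segment and is meant only as an illustration. The 0-based `starts` encoding of the partition and the function names are hypothetical.

```python
def isotonic_minmax(y):
    """Isotonic least-squares fit via the min-max formula (naive, cubic time)."""
    n = len(y)
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)

    def mean(j, k):
        # Average of y[j], ..., y[k] (inclusive) from prefix sums.
        return (prefix[k + 1] - prefix[j]) / (k - j + 1)

    return [
        max(min(mean(j, k) for k in range(i, n)) for j in range(i + 1))
        for i in range(n)
    ]


def piecewise_isotonic(y, starts):
    """Concatenate isotonic fits computed separately on each segment."""
    bounds = list(starts) + [len(y)]
    fit = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        fit.extend(isotonic_minmax(y[lo:hi]))
    return fit


print(piecewise_isotonic([1, 3, 2, 0, 2, 1], [0, 3]))
```

Because the segments are fitted independently, the downward jump between the third and fourth observations is preserved, which is exactly what an estimator that knows the true changepoints would do.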
If we know the true partition on which is piecewise monotone, then the risk of the projection estimator is bounded from above by
[TABLE]
If the true partition is unknown, a natural idea is to select a data-dependent partition by a penalized selection rule:
[TABLE]
Here, is a positive penalty for the partition .
The penalized selection rules have been well studied in statistics. In particular, Birgé and Massart (2001) and Massart (2007) developed non-asymptotic risk bounds for generic model selection settings in Gaussian sequence models. Hereafter, we construct a penalized selection estimator in the spirit of Theorem 4.18 in Massart (2007).
Instead of selecting according to (26), we introduce the total variation sieves. Namely, in addition to selecting partitions, we also select budgets of piecewise total variations as follows. Let be a connected partition. For any vector with (), we define the set of piecewise monotone vectors with bounded total variations as
[TABLE]
Then, we define as the projection estimator onto . Next, we define a countable set of vectors as
[TABLE]
where . Finally, we select a pair as the solution of the following minimization problem:
[TABLE]
With a careful choice of the penalty term , we have the following result:
Theorem 5.1.
There exists an absolute constant such that the following statement holds. For any pair , define the penalty so that
[TABLE]
Let be the minimizer in (27).
[TABLE]
In particular, if is piecewise monotone on , we have
[TABLE]
We emphasize that Theorem 5.1 does not require any additional assumptions on , e.g., the minimum length condition or the moderate growth condition introduced in the previous section. Therefore, it suggests the existence of a penalized model selection estimator that achieves the minimax rate in Proposition 3.2. However, the estimator (27) is not practical for a computational reason because it is obtained through the minimization over exponentially many possible partitions .
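To make the computational obstacle concrete, the sketch below enumerates all partitions of n consecutive indices into contiguous segments, assuming this is what "connected partition" means on a one-dimensional chain. There are 2^(n-1) of them, since each of the n-1 gaps between neighbours is either cut or not. The function name and the (start, stop) encoding are illustrative only.

```python
from itertools import combinations

def connected_partitions(n):
    """Yield every partition of range(n) into contiguous segments.

    Each partition is a list of (start, stop) pairs with stop exclusive.
    """
    for k in range(n):                        # k = number of interior cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield list(zip(bounds[:-1], bounds[1:]))

# Already for n = 10 there are 2**9 candidate partitions.
count = sum(1 for _ in connected_partitions(10))
```

An exhaustive search of this kind is feasible only for very small n, which is why the penalized selection estimator is of theoretical rather than practical interest.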
The dependence on the total variation of each segment in (5.1) is instead of . The additional constant is due to the minimal resolution of the sieve. It remains an open problem to establish a non-asymptotic risk bound for the penalized model selection estimator without sieves (i.e., (26)) and thereby remove the dependence on the sieve resolution.
6 Simulations
We provide some numerical examples for piecewise monotone regression problems.
6.1 Dealing with inconsistency at boundaries
Before presenting the simulation results, we here explain a well-known practical issue in the isotonic regression literature and a regularization method to cope with it.
In the study of statistical estimation under monotonicity constraints, it is known that the least squares estimator is inconsistent at the boundary points (see e.g., Groeneboom and Jongbloed (2014) and Woodroofe and Sun (1993)). A similar issue arises for the nearly-isotonic regression estimators. Since the penalty term in (3) does not activate if the orders are not violated at the boundary points (i.e., or ), the nearly-isotonic regression is not robust against a negative noise at the left boundary or a positive noise at the right boundary. To overcome this issue, we consider the following boundary correction regularization for the nearly-isotonic regression:
[TABLE]
where is an additional tuning parameter. It can easily be checked that the solution is equivalent to that of the ordinary nearly-isotonic regression (3) applied to . Similar regularization methods for isotonic regression have been studied by Chen et al. (2015), Wu et al. (2015) and Luss and Rosset (2017).
6.2 Simulation data
Here, we evaluate the performance of the nearly-isotonic regression and related estimators on simulated data. According to the one-dimensional regression model (23), we generated data with equi-spaced design points (). For the true function , we consider -piecewise monotone functions defined as
[TABLE]
where is a given monotone function and for . Following Meyer and Woodroofe (2000), we choose from the following two monotone functions:
[TABLE]
Figure 2 shows an example of and . It is worth noting that the former sigmoidal function satisfies the moderate growth condition (see Definition 4.11), whereas the latter cubic function does not. Hence, for the case of piecewise sigmoidal functions , the minimax rate of is achieved by both the nearly-isotonic regression and the fused lasso (see Corollary 4.12 above and Corollary 2.8 by Guntuboyina et al. (2017)).
In our experiments, the size of the signal is chosen from . The noise standard deviation is assumed to be known and fixed to . We evaluated the MSE for the following four estimators:
- Neariso: The nearly-isotonic regression (3).
- NearisoBC: The nearly-isotonic regression with boundary correction (29).
- Fused: The fused lasso (6).
- PO: The projection estimator with the partition oracle, i.e., the projection estimator onto provided with the true partition .
For Neariso and Fused, the tuning parameter is selected by generalized criteria (i.e., minimizing SURE (22)). For NearisoBC, the tuning parameters are selected by a similar criterion. To estimate the MSE, we generated 500 replications of the data and calculated the average value of the squared loss .
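The Monte Carlo estimate of the MSE described here can be sketched as follows. The sketch assumes Gaussian noise and a squared loss normalized by the signal length; the exact normalization is not reproduced in this text, so treat it as an assumption. `monte_carlo_mse` and its arguments are hypothetical names, and any estimator mapping a noisy vector to a fitted vector can be plugged in.

```python
import numpy as np

def monte_carlo_mse(theta_star, estimator, sigma, n_rep=500, seed=0):
    """Average of ||estimator(y) - theta_star||^2 / n over n_rep noisy draws."""
    rng = np.random.default_rng(seed)
    theta_star = np.asarray(theta_star, dtype=float)
    losses = []
    for _ in range(n_rep):
        # One replication: true signal plus independent Gaussian noise.
        y = theta_star + sigma * rng.standard_normal(theta_star.shape)
        losses.append(np.mean((estimator(y) - theta_star) ** 2))
    return float(np.mean(losses))

# Sanity check: the identity "estimator" has per-coordinate risk sigma^2.
theta_star = np.linspace(0.0, 1.0, 200)
mse_identity = monte_carlo_mse(theta_star, lambda y: y, sigma=0.5)
```

The identity map is used only as a check that the harness returns roughly sigma squared; it is not one of the estimators compared in the experiments.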
Figure 5 presents the results for and . The upper row shows log-log plots of the MSE versus . In each setting, the three regularization based estimators (i.e., Neariso, NearisoBC and Fused) performed as well as the ideal estimator PO, even though they do not use any information about the true partition. The risks of PO are well fitted by lines of slopes of , which means that the speed of the convergence is about the minimax optimal rate of .
Next, we provide more detailed comparisons of regularization based estimators. The lower row in Figure 5 shows the difference of MSEs from that of PO. For piecewise sigmoidal functions, NearisoBC and Fused performed better than Neariso. Notably, in the case of , the risks of Fused were even smaller than those of PO for large values of . A possible reason for the better performance of the fused lasso is that the sigmoidal function can be well approximated by a piecewise constant function near the boundaries. On the other hand, for piecewise cubic functions, Neariso performed slightly better than the other two estimators for small values of .
6.3 Geological data
We conducted experiments on GPS data related to a seismological phenomenon reported by Rogers and Dragert (2003). The aim here is to investigate the performance of the nearly-isotonic type estimators on real-world data for which piecewise monotone approximations have already been justified in previous work. For the signal , we used the difference of the east-west components of GPS measurements between two observatories, which are located in Victoria (British Columbia, Canada) and Seattle (United States). The GPS data is provided by Melbourne et al. (2018). The top panel in Figure 6 shows the plot. The data period starts on January 1, 2010, and ends on December 2, 2017. After removing missing records, the size of the signal is . The increasing trend of the signal is considered to be caused by the subduction process at the plate boundary. We can also see periodic reversals in the signal, and the entire signal may be approximated by a piecewise monotone signal. Such reversals may be related to the seismological phenomenon known as episodic tremor and slip. According to Rogers and Dragert (2003), such slip events were observed every 13 to 16 months in their data taken from 1997 to 2003.
GPS data contains several anomalous values. For the signal considered above, most of the values are between 20 and 50, except for a single outlier . The behavior of the estimators is strongly affected by the existence of such outliers. In our situation, we can manually remove the anomalous value (denoted by ). However, it is often difficult to identify outliers in practical situations. From this perspective, we also considered the robust M-estimation version of the nearly-isotonic regression defined as (34) with . Here, is the Huber loss:
[TABLE]
which is commonly used in the robust regression literature.
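For readers unfamiliar with it, a minimal sketch of the Huber loss follows. The displayed definition is not reproduced in this text, so the scaling below (quadratic with factor 1/2 inside the threshold, linear outside, continuous at the threshold) is one common convention and may differ from the paper's by constants. The names `huber` and `delta` are illustrative.

```python
import numpy as np

def huber(r, delta):
    """Huber loss: quadratic for |r| <= delta, linear beyond (one common scaling)."""
    r = np.asarray(r, dtype=float)
    a = np.abs(r)
    # Both branches agree at |r| = delta, so the loss is continuous and convex.
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
values = huber(residuals, delta=1.0)
```

Since large residuals are penalized linearly rather than quadratically, a single outlier contributes far less to the objective than under the squared loss.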
We applied the nearly-isotonic regression (3) and its robust variant to the signals and described above. The tuning parameters were determined by the -fold cross-validation, and in the Huber loss was fixed as .
First, we consider the case where the outlier is removed manually. The second panel in Figure 6 shows the result for the cross-validated nearly-isotonic regression. The vertical lines denote the locations of downward jumps in the estimators. We can see that the period of jump clusters is about 12 to 14 months, which is close to that of the seismological slip events suggested by Rogers and Dragert (2003).
Next, we consider the case where the signal contains an outlier. In this case, the value of the squared loss largely depends on the error at the coordinate of the outlier. Then, the cross-validation may choose a large tuning parameter, and the resulting estimator becomes close to a monotone signal. The third panel in Figure 6 shows that the number of downward jumps is considerably less than the number that is expected from the known frequency of the slip events. Conversely, the fourth panel in Figure 6 shows that the robust version of the nearly-isotonic regression outputs similar clusters of change points as in the second panel.
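The downward jumps marked by vertical lines in Figure 6 can be read off a fitted vector with a few lines of code. This is a sketch under the assumption that a downward jump at position i means that the fitted value at i strictly exceeds the one at i+1; the tolerance `tol` is a hypothetical numerical safeguard, not a quantity from the paper.

```python
import numpy as np

def downward_jumps(theta_hat, tol=1e-8):
    """Indices i such that theta_hat[i] > theta_hat[i + 1] + tol."""
    d = np.diff(np.asarray(theta_hat, dtype=float))
    return np.flatnonzero(d < -tol)

# Toy fitted vector with two order-violating drops.
jumps = downward_jumps([0.0, 1.0, 2.0, 0.5, 1.5, 1.5, 0.2])
```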
7 Discussion
In this paper, we studied the problem of estimating piecewise monotone signals. The classical isotonic regression estimator cannot be applied in this setting because of the existence of arbitrarily large downward jumps. We derived the minimax risk lower bound over piecewise monotone signals with bounded upper total variations. The minimax rate is tight up to a multiplicative constant because it can be achieved by a (computationally inefficient) model selection based estimator. Our main results show that the nearly-isotonic regression estimator achieves this rate under an additional growth condition. An advantage of the nearly-isotonic regression is that the estimator can be calculated efficiently on arbitrary directed graphs by parametric max-flow algorithms. The simulation results demonstrate that the nearly-isotonic regression has almost the same convergence rate as the ideal estimator that knows the true partition.
7.1 Non-Gaussian noises
In this paper, we provided risk bounds for the nearly-isotonic regression under the assumption that the noise distribution is Gaussian. In practice, however, this assumption can be too restrictive. We here briefly discuss the risk bound with non-Gaussian error distributions.
Suppose that are i.i.d. random variables with and . Then, we can see that the “expectation bound” (4.7) holds with a different constant . See Remark D.14 in the appendix for the key ingredients for the derivation. On the other hand, the “high-probability bound” (4.7) does not hold in general since it requires a stronger concentration property (i.e., the Gaussian concentration).
7.2 Future directions
An interesting direction for future work is to investigate the optimal rate of piecewise monotone regression on higher dimensional grids or general graphs. Recently, several researchers have analyzed the risk bounds for the isotonic regression estimators on grid graphs of dimension two or higher (Chatterjee et al. 2018, Han et al. 2017). It is natural to ask whether one can construct a computationally efficient estimator that is adaptive to piecewise monotone vectors on a given graph. We believe that the nearly-isotonic type estimator (32) is a candidate. A major difficulty is to determine an appropriate graph topology. Given a partial order on a set , the corresponding isotonic regression estimator is uniquely determined. However, there are many directed acyclic graphs that correspond to the partial order . Hence, the graph topology for the nearly-isotonic type estimators is not unique. To control the connectivity, it may be useful to introduce edge weightings proposed by Fan and Guan (2017).
Another direction is to develop a model selection method for least squares estimators over unbounded cones. We introduced sieves on the total variation in Section 5 to construct an estimator that is adaptive to piecewise monotone vectors. In practice, sieve-based methods can be computationally inefficient. On the other hand, if the true vector is monotone, the isotonic regression automatically achieves the minimax rate with respect to the total variation. We conjecture that it is also possible to select the least squares estimator without using sieves. In particular, we leave it as an open question whether the adaptive risk bound is achieved by the penalized selection rule of the form (26).
Appendix A Algorithms for nearly-isotonic estimators
In this section, we present algorithms for the nearly-isotonic regression and related estimators and discuss their computational complexities. Note that the main purpose of this section is to give a review of existing algorithms, and hence most results presented in this section are not new (except for Proposition A.1).
A.1 Penalized estimators
Here, we introduce two algorithms to solve the penalized form nearly-isotonic regression (3). In Section A.1.1, we introduce the solution path algorithm developed by Tibshirani et al. (2011). The advantage of the solution path algorithm is that it outputs the solutions for every simultaneously. However, the solution path algorithm cannot be applied to the estimators with general weights and graphs. In Section A.1.2, we provide another algorithm that outputs the exact solution for a single . The latter algorithm can be applied to the nearly-isotonic type estimators defined on any weighted directed graphs.
A.1.1 One-dimensional problem
The modified pool adjacent violators algorithm (modified PAVA, Tibshirani et al. (2011)) is the algorithm used to calculate the solution path for the problem (3). Here, we present a variant of the modified PAVA for the following weighted version of the estimator:
[TABLE]
where () are positive weight parameters. Letting , this formulation covers the nearly-isotonic regression for general increasing design points (24).
The derivation of Algorithm 1 is straightforward from the original paper of Tibshirani et al. (2011). We should note that the validity of this algorithm crucially depends on the property that the solution path is piecewise linear and “agglomerative”. It is well known that the piecewise linearity of the solution path holds for many classes of regularization estimators (Rosset and Zhu 2007). We say that the solution path is agglomerative if it satisfies the following condition: if holds for some , then the same equality holds for any . For the constant weights (), such agglomerative property was proved by Tibshirani et al. (2011). However, for general non-unitary edge weights (), this need not be true. Here, we provide the following proposition to ensure the agglomerative property for non-unitary edge weights.
Proposition A.1.
The solution path of weighted nearly-isotonic regression (30) is piecewise linear and agglomerative if the edge weights satisfy the following concavity condition.
[TABLE]
where we defined . In particular, this condition implies that Algorithm 1 outputs the exact solution path.
The condition (31) demands that can be written as for some concave function with and for all . In particular, for any , we have
[TABLE]
and
[TABLE]
Proof sketch of Proposition A.1.
We can prove the validity of Algorithm 1 by an argument similar to that of Tibshirani et al. (2011) if we assume the piecewise linearity and the agglomerative property. The piecewise linearity is already shown in Rosset and Zhu (2007). Hence, it remains to prove the agglomerative property under the condition (31). To this end, we leverage the “agglomerative clustering condition” defined in Appendix D.6. We defer the details to Remark D.25 and Remark D.27. ∎
A.1.2 General graphs
Let be a directed graph with . Suppose that each edge is equipped with a positive weight . We define the generalized nearly-isotonic regression as
[TABLE]
where is a nearly-isotonic type penalty defined as
[TABLE]
For any choices of and , becomes a convex function. Clearly, the lower total variation is a special case where and . Thus, (32) can be regarded as a generalization of the nearly-isotonic regression to general directed graphs.
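To fix ideas, here is a small sketch of how such a penalty can be evaluated on a weighted directed graph. The displayed definition is not reproduced in this text, so the form used below, a weighted sum of positive parts of differences along edges, is an assumption, as is the convention that an edge (i, j) asks for the value at i not to exceed the value at j. The function name is illustrative.

```python
import numpy as np

def nearly_isotonic_penalty(theta, edges, weights):
    """Sum over directed edges (i, j) of w_ij * max(theta[i] - theta[j], 0)."""
    theta = np.asarray(theta, dtype=float)
    i, j = np.asarray(edges).T
    # Only order-violating edges (theta[i] > theta[j]) contribute.
    violations = np.maximum(theta[i] - theta[j], 0.0)
    return float(np.sum(np.asarray(weights, dtype=float) * violations))

# Chain graph 0 -> 1 -> 2 -> 3 with unit weights: the lower total variation.
theta = [0.0, 2.0, 1.0, 3.0]
chain = [(0, 1), (1, 2), (2, 3)]
value = nearly_isotonic_penalty(theta, chain, [1.0, 1.0, 1.0])
```

Each summand is a maximum of two linear functions of the signal scaled by a positive weight, which is one way to see why the penalty is convex.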
The problem of the form (32) has been well studied in the optimization literature. In particular, we can see that solving (32) is equivalent to solving a certain parametrized family of minimum-cut problems. For detailed explanations of such an equivalence, see Obozinski and Bach (2016) and Chapter 8 in Bach (2013). Hence, (32) can be solved by the parametric max-flow algorithm (Gallo et al. 1989) that runs in . On the other hand, it has been pointed out by Mairal et al. (2011) that, for many practical instances, some simplified variants of the parametric max-flow algorithm output the solution faster than the original algorithm by Gallo et al. (1989). We remark that Hochbaum and Queyranne (2003) also studied the relationship between the isotonic regression and the parametric max-flow algorithm.
Algorithm 2 shows the Divide-and-Conquer algorithm (Chapter 9 of Bach (2013)) that solves (32). In the inner loop, the algorithm recursively solves max-flow problems by defining smaller networks (Algorithm 3). See Figure 7 for examples of networks used in the first two recursions in the algorithm.
A.1.3 General convex loss functions
In practice, we are often interested in general convex loss functions other than the squared loss. Here, we consider a generalized problem of the following form:
[TABLE]
where is a convex loss function for any . As an example, this formulation contains the M-estimator in the regression setting , where () are the observed data and is a convex function.
We can also obtain algorithms that output approximate minimizers of (34) as follows. First of all, note that Algorithm 2 outputs the proximal operator of the regularization term . Once we have an oracle for the proximal operator, we can apply proximal gradient methods to solve (34). In particular, if is convex and smooth, the Fast Iterative Shrinkage Thresholding Algorithm (FISTA, Beck and Teboulle (2009)) outputs an -optimal solution after evaluations of the proximal operator.
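The following is a minimal sketch of the FISTA iteration with a black-box proximal oracle. In the setting above the oracle would be the output of Algorithm 2; here soft-thresholding (the proximal operator of the l1 norm) is swapped in as a stand-in so that the example is self-contained and checkable. The function name, the argument convention `prox(v, step)` and the constant step size are assumptions of this sketch.

```python
import numpy as np

def fista(grad, prox, x0, step, n_iter=100):
    """Minimize f(x) + g(x), given grad f and prox(v, step) = prox_{step * g}(v)."""
    x = np.asarray(x0, dtype=float).copy()
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        x_new = prox(y - step * grad(y), step)             # proximal gradient step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))   # momentum schedule
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)      # extrapolation
        x, t = x_new, t_new
    return x

# Stand-in problem: f(x) = 0.5 * ||x - b||^2 and g = lam * ||x||_1.
b, lam = np.array([3.0, -0.5, 1.2]), 1.0
soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0.0)
x_hat = fista(lambda x: x - b, soft, np.zeros(3), step=1.0)
```

The step size should not exceed the reciprocal of the Lipschitz constant of the gradient of the smooth part; for the quadratic used here that constant is one.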
A.2 Constrained estimators
Consider the following generalized version of the constrained form of nearly-isotonic regression (11):
[TABLE]
Unlike the penalized estimators, it is difficult to find an exact solution of (35). However, since problem (35) is an instance of a quadratic programming problem, there are polynomial time algorithms to obtain approximate solutions. Here, we explain the existence of such algorithms. The following result is a direct application of Theorem 1 by Lee et al. (2018), which provides a convergence guarantee of a variant of cutting plane methods.
Proposition A.2.
Suppose that is a directed graph equipped with positive weights for every . Let be any vector and . Then, for any , there exists a randomized algorithm that outputs satisfying
[TABLE]
and
[TABLE]
with a probability of . The overall complexity of the algorithm is .
Remark A.3.
In practice, due to computational considerations, we recommend using the penalized estimator (33) instead of the constrained estimator (35). For the penalized estimator, we empirically observed that Algorithm 2 runs sufficiently fast on graphs with several hundreds of nodes. For the constrained estimator, Proposition A.2 theoretically guarantees polynomial time solvability of the constrained problem (35), but it does not provide a practical algorithm.
Appendix B Supplemental experiments
To understand the behavior of the nearly-isotonic regression in more generic settings, we present additional simulation results for the nearly-isotonic regression on general graphs (32). Here, we consider the problem of estimating piecewise monotone signals on two-dimensional grids.
We say that an matrix is monotone if whenever and . In other words, is monotone if it has no order-violating edges in the two-dimensional grid graph , where is the set of all subscripts and
[TABLE]
We say that is piecewise monotone if there is a partition of such that, for each , is a weakly connected component of and has no order-violating edges in the induced subgraph. For simplicity of experimental settings, we here only consider “block” type partitions, i.e., we say that is of block type if it can be represented as a product of two partitions of the two coordinates. The left panel in Figure 8 is an example of two-dimensional piecewise monotone signals on a block type partition.
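The monotonicity condition for matrices can be checked directly on the grid edges. The sketch below assumes the coordinatewise order, that is, an entry may not exceed any entry lying weakly below and to the right of it; this is equivalent to having non-decreasing rows and columns. The function name and the tolerance argument are illustrative.

```python
import numpy as np

def is_monotone_matrix(theta, tol=0.0):
    """True if entries are non-decreasing along every row and every column."""
    theta = np.asarray(theta, dtype=float)
    down = np.diff(theta, axis=0)    # theta[i + 1, j] - theta[i, j]
    right = np.diff(theta, axis=1)   # theta[i, j + 1] - theta[i, j]
    return bool(np.all(down >= -tol) and np.all(right >= -tol))

ok = is_monotone_matrix([[0.0, 1.0], [1.0, 2.0]])
bad = is_monotone_matrix([[0.0, 1.0], [2.0, 1.0]])   # second row decreases
```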
We compare the following three estimators:
- LSE: The bivariate isotonic regression (see e.g., Robertson et al. (1988)).
- Neariso2: The two-dimensional nearly-isotonic regression with -tuned parameter.
- PO: The bivariate isotonic regression applied to the true partition.
For monotone matrices, Chatterjee et al. (2018) proved that LSE is minimax rate optimal with respect to . Hence, the partition oracle estimator PO can be regarded as an ideal benchmark that is minimax optimal over piecewise monotone matrices. On the other hand, if the true matrix is piecewise monotone, the risk of LSE can be arbitrarily large for the same reason as Proposition 3.3. Neariso2 is the special case of the generalized nearly-isotonic regression (32) applied to the graph defined above. Neariso2 was originally discussed in Tibshirani et al. (2011), but no experimental results were presented there. Figure 8 shows examples of the solutions of the three estimators.
We construct an matrix as follows: We define a small monotone matrix , and then we define as an block matrix by repeating for times both in rows and columns (thus ). We choose the small matrix from
[TABLE]
or
[TABLE]
where we write for . With the former choice, becomes an -piecewise monotone matrix. With the latter choice, becomes an -piecewise monotone matrix such that does not depend on .
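The block construction can be sketched with `numpy.tile`. The small matrix used below is a hypothetical 2 x 2 monotone block chosen only for illustration; the blocks actually used in the experiments are given by the displays above, which are not reproduced in this text.

```python
import numpy as np

def repeat_blocks(small, m):
    """Tile a small matrix m times along both rows and columns."""
    return np.tile(np.asarray(small, dtype=float), (m, m))

# Hypothetical monotone block, repeated 3 times in each direction.
small = np.array([[0.0, 1.0], [1.0, 2.0]])
theta = repeat_blocks(small, 3)
```

Each copy of the block is monotone on its own, while the values drop when moving from one block to the next, which produces a piecewise monotone matrix on a block type partition.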
We generated noisy observations by adding independent Gaussian noises to every entry of . To estimate the MSE, we used 500 replications of the data. Figure 9 shows the results. Clearly, the risks of LSE (blue triangles) are much larger than those of the other two estimators. Neariso2 (green circles) has slightly larger risks compared to PO (magenta squares), while their slopes seem to be close.
To visualize convergence rates, we fit the risks of PO by monomials (), and plotted as dashed lines in Figure 9. The values of the exponent are respectively as follows: (cubic2d, ); (cubic2d, ); (cubic1d, ); (cubic2d, ). We should note that, in monotone matrix estimation, the theoretical convergence rate of LSE is known to be (Chatterjee et al. 2018).
Appendix C Proofs in Section 3
C.1 Proof of Proposition 3.2
Let be either or , which are defined in Definition 3.1. The minimax lower bound (10) is proved by combining the following two lower bounds:
- (i) (Lower bound for monotone vectors (Zhang 2002, Chatterjee et al. 2015)) Let be the set of monotone vectors with bounded total variations. There is a universal constant such that for any estimator ,
[TABLE]
- (ii) (Lower bound for piecewise constant vectors) Let be the set of -piecewise constant vectors in , i.e., if . The minimax lower bound over can be related to sparse estimation as follows. Let be an matrix whose entries are given as . Then, contains the set , and the lower bound for the minimax risk over follows from the well-known results for balls (e.g., Raskutti et al. (2011), Theorem 3-(b)). In particular, for any , the following lower bound is presented in Gao et al. (2017):
[TABLE]
where is a universal constant.
It remains to show that contains and . is obvious because an -piecewise constant vector is also an -piecewise monotone vector such that the piecewise total variations are zero. From the definition, it is also clear that . If , jumps that strictly exceed cannot occur more than times. Hence, we can choose a partition with so that each does not contain such large jumps, which implies that .
C.2 Proof of Proposition 3.3
The following theorem in the seminal paper of Chatterjee (2014) provides useful upper and lower bounds for the risk of the least squares estimator over any closed convex set .
Theorem C.1 (Chatterjee (2014), Corollary 1.2).
Let be any closed convex set, and let denote the least squares estimator over . For any , define the function as
[TABLE]
Here, if the set is empty, we define . Then, is strictly concave for and has a unique maximizer . Moreover, there are universal constants such that
[TABLE]
To prove Proposition 3.3, we use the lower bound in (36). Note that for a sufficiently large , is strictly increasing in . For any and , choose so that . Then, for any such that , we have
[TABLE]
Remark C.2.
We should note that the above proof is valid for any closed convex set . For the specific choice of , the lower bound of used in the proof can be quite conservative. In practice, the risk of the isotonic regression estimator can be larger than under a smaller value of -misspecification error.
Appendix D Proofs in Section 4
D.1 Preliminaries
To state the results for risk upper bounds, we first introduce some quantities related to Gaussian processes.
Definition D.1.
Let be a closed convex set in . Let denote the expectation with respect to an isotropic Gaussian random variable .
- (i) The Gaussian width of is defined as
[TABLE]
- (ii) The Gaussian mean squared distance is defined as
[TABLE]
where .
- (iii) Suppose that is a convex cone. The statistical dimension of is defined as
[TABLE]
We present some historical remarks on these definitions. The three quantities in Definition D.1 can be interpreted as complexity measures for the subset in the Euclidean space. The Gaussian width has been well studied in convex geometry, signal processing, high-dimensional statistics, and empirical process theory; see e.g., Section 7.8 in Vershynin (2018) for a literature review. The definition of the Gaussian mean squared distance is due to Oymak and Hassibi (2016). As we will see in Lemma D.4 below, the Gaussian mean squared distance is useful to provide the risk bounds for proximal denoising estimators. The statistical dimension was defined in Amelunxen et al. (2014). Recently, Bellec (2018) pointed out that the statistical dimension characterizes the adaptive risk bounds for some shape restricted estimators including the isotonic regression and the convex regression.
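As a numerical illustration of the statistical dimension, the sketch below estimates it by Monte Carlo for the cone of non-decreasing vectors. It assumes the definition of Amelunxen et al. (2014), namely the expected squared norm of the Euclidean projection of a standard Gaussian vector onto the cone, and uses PAVA as the projection. For this particular cone the value is known to equal the harmonic number 1 + 1/2 + ... + 1/n, which the estimate can be compared against. All function names are illustrative.

```python
import numpy as np

def pava(y):
    """Least squares projection of y onto non-decreasing vectors (PAVA)."""
    sums, counts = [], []
    for v in y:
        sums.append(float(v))
        counts.append(1)
        # Pool adjacent blocks while their means violate the order.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])

def statistical_dimension_monotone(n, n_rep=5000, seed=0):
    """Monte Carlo estimate of E || projection of Z onto the monotone cone ||^2."""
    rng = np.random.default_rng(seed)
    sq_norms = [np.sum(pava(rng.standard_normal(n)) ** 2) for _ in range(n_rep)]
    return float(np.mean(sq_norms))

n = 10
estimate = statistical_dimension_monotone(n)
harmonic = sum(1.0 / k for k in range(1, n + 1))   # exact value for this cone
```

The logarithmic growth of the harmonic number is what makes the monotone cone a low-complexity set despite having full dimension.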
As suggested by the definitions, these three quantities are closely related to each other. In particular, if is a convex cone, these are comparable as follows.
Proposition D.2.
Let be a closed convex cone.
- (i) (Amelunxen et al. (2014), Proposition 10.2) Let be the unit sphere in . Then, we have .
- (ii) (Amelunxen et al. (2014), Proposition 3.1) Let be the polar cone of defined as
[TABLE]
Then, we have .
Now, we introduce two general results for risk bounds for general projection estimators and proximal denoising estimators.
Let be a closed convex set in , and define the projection estimator onto as . Bellec (2018) proved the following oracle inequality that relates the risk of the projection estimator to the statistical dimension of the tangent cone of . Here, the tangent cone of at is defined as
[TABLE]
Lemma D.3 (Bellec (2018), Corollary 2.2).
Let be any vector, and suppose that the observation is drawn according to . Then, we have the following risk bound:
[TABLE]
Moreover, for any , the inequality
[TABLE]
holds with probability at least .
Next, we provide a general result for proximal denoising estimators. Let be a convex function, and . We define the proximal denoising estimator as
[TABLE]
The class of proximal denoising estimators contains the soft-thresholding estimator (Donoho et al. 1992), the total variation regularization (Rudin et al. 1992), the trend filtering (Kim et al. 2009) and the nearly-isotonic regression (Tibshirani et al. 2011). Oymak and Hassibi (2016) pointed out that the risk bound of proximal denoising estimators can be characterized by the Gaussian mean squared distance of the set . Remarkably, based on this technique, Guntuboyina et al. (2017) proved sharp adaptation results for the trend filtering estimators. The following oracle inequality can be regarded as a generalization of Theorem 2.2 in Oymak and Hassibi (2016). For the sake of completeness, we also provide its proof below.
Lemma D.4.
Let be any vector, and suppose that the observation is drawn according to . Let be a convex function, and let denote the proximal denoising estimator defined as (37). Then, we have
[TABLE]
Moreover, for any , the inequality
[TABLE]
holds with probability at least .
Proof.
Below, we write . To prove (38), it suffices to show that we have almost surely
[TABLE]
for any fixed vector . We will assume because otherwise the inequality is trivial.
From the first order optimality condition of the convex minimization problem (37), we have
[TABLE]
See Lemma 6.1 in van de Geer (2015) for a formal proof. Using the elementary fact that and substituting , we have
[TABLE]
Now, take arbitrarily. From the definition of the subgradient, we have
[TABLE]
Hence, the right-hand side of (40) is bounded from above by
[TABLE]
Since the choice of is arbitrary, we have
[TABLE]
By taking the expectation of both sides, (38) is proved.
To prove the high-probability bound (39), we use the well-known Gaussian concentration inequality (see e.g., Theorem 5.6 in Boucheron et al. (2013)); for any -Lipschitz function and , we have
[TABLE]
In fact, the map is a -Lipschitz function because, for any , we have
[TABLE]
where is the orthogonal projection map onto the set . Now, we take as
[TABLE]
Combining (41) and the Gaussian concentration applied for , we have the desired result. ∎
D.2 Risk bounds for constrained estimators (Proof of Theorem 4.1)
In this subsection, we provide the proof of Theorem 4.1 as an application of Lemma D.3. To this end, we have to evaluate the statistical dimension of the tangent cone of a convex set
[TABLE]
Unsurprisingly, the analysis of the tangent cone of is very similar to that of the set with bounded total variation in Guntuboyina et al. (2017). Our goal is to show the following upper bound for the statistical dimension:
Proposition D.5.
Suppose that is a vector with . Then, there exists a universal constant such that
[TABLE]
where is defined in (13).
We briefly outline the proof for this result. We divide the proof into four steps: First, we provide some useful characterizations of the tangent cone. Second, we decompose the tangent cone into finitely many pieces so that the Gaussian widths become easy to evaluate. Third, we provide concrete upper bounds on the Gaussian widths of these pieces. Lastly, we combine the upper bounds and apply Lemma D.3 to complete the proof.
**Step 1: Characterizing the tangent cone.** If , is contained in the interior of , and the tangent cone becomes the entire Euclidean space . Hereafter, we assume that lies on the boundary of , that is, . Let us recall the definition of the sign of jumps in (4.1). Roughly speaking, the tangent cone of is characterized by the sign of jumps.
Lemma D.6.
Let be a vector in such that . Let be any connected refinement of the constant partition of . (Here, we say that is a connected refinement of another connected partition if, for any , there exists a unique element such that .) Let be a sequence such that for any . We define the signs as
[TABLE]
For any and taken as above, we define a convex cone as
[TABLE]
where is the lower total variation for the restricted vector . Then, for the tangent cone , we have the following:
- (i) If , then .
- (ii) If is a connected refinement of and is taken arbitrarily as above, then .
Proof.
First, we show that . By the definition of the tangent cone, it suffices to show that holds for any . Note that is constant on every since is finer than the constant partition of . Since the lower total variation is not changed by adding any constant value to every coordinate, we have . Then, we have
[TABLE]
which proves and hence (ii).
Next, we prove that under the assumption . In this case, the definition of coincides with that in (4.1). Fix any . We want to show that is obtained as for some and . To this end, we check that there exists a (sufficiently small) such that . Here, we have
[TABLE]
Recall that are chosen so that . We can choose sufficiently small so that
[TABLE]
for every . Indeed, if we choose so that
[TABLE]
the signs of do not change by adding . Consequently, we have
[TABLE]
This proves that and hence (i). ∎
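The notions of constant partition and connected refinement used in Lemma D.6 can be made concrete with a minimal Python sketch. The encoding of a connected partition as a list of inclusive index intervals, zero-based indexing, and both function names are illustrative assumptions rather than notation from the paper.

```python
def constant_partition(theta):
    """Split indices 0..n-1 into maximal runs on which theta is constant.

    Each piece is returned as an inclusive interval (start, end).
    """
    pieces, start = [], 0
    for i in range(1, len(theta) + 1):
        if i == len(theta) or theta[i] != theta[start]:
            pieces.append((start, i - 1))
            start = i
    return pieces


def is_connected_refinement(fine, coarse):
    """Check that every interval of `fine` lies in exactly one interval of `coarse`.

    Both arguments are assumed to be valid partitions into intervals.
    """
    for a, b in fine:
        containing = [(c, d) for (c, d) in coarse if c <= a and b <= d]
        if len(containing) != 1:
            return False
    return True


theta = [0, 0, 2, 2, 2, 1]
coarse = constant_partition(theta)
fine = [(0, 0), (1, 1), (2, 3), (4, 4), (5, 5)]
print(coarse, is_connected_refinement(fine, coarse))
```

Any vector that is constant on the pieces of `fine` is in particular constant on a refinement of the constant partition, which is the situation considered in part (ii) of the lemma.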
From Proposition D.2-(i), we can bound the statistical dimension by the Gaussian width as follows:
[TABLE]
Here, is the unit ball in . Hence, it suffices to consider the set . In analogy to Lemma B.2 in Guntuboyina et al. (2017), we obtain the following characterization of this set.
**Lemma D.7.**
Let be a vector in such that . Let be any connected refinement of . Define the signs as in Lemma D.6, and let . Then, for every with , there exist indices such that
[TABLE]
where we define as
[TABLE]
Proof.
Fix . By Lemma D.6, we have
[TABLE]
Let be indices which will be specified later. Defining as in (45), we can rewrite (46) as
[TABLE]
Now, let denote the norm of for . By the assumption, . Then, for any , there exists such that . For these choices of , the right-hand side of (D.2) is bounded from above by
[TABLE]
which proves the desired result. ∎
**Remark D.8.**
Note that is always non-negative. This can be checked as follows. First, the lower total variation is always at least the difference of the boundary points, that is, for every , we have
[TABLE]
where is taken arbitrarily from . The equality holds if and only if is monotone non-increasing. Then, for any and , we have
[TABLE]
In particular, we obtain . If is monotone non-decreasing (i.e., ), then the right-hand side of (44) equals [math], and so .
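The properties of the lower total variation used in this remark can be checked numerically. The sketch below assumes that the lower total variation is the sum of the positive parts of the downward jumps, which is the nearly-isotonic penalty of Tibshirani et al. (2011). The displayed definition is not reproduced in this appendix, so this form is an assumption.

```python
def lower_total_variation(theta):
    """Sum of positive parts of downward jumps theta[i] - theta[i+1] (assumed form)."""
    return sum(max(theta[i] - theta[i + 1], 0) for i in range(len(theta) - 1))


general = [3, 1, 2, 0]
non_increasing = [5, 3, 3, 1]
non_decreasing = [1, 2, 2, 5]

# The penalty dominates the difference of the boundary points ...
print(lower_total_variation(general), general[0] - general[-1])
# ... with equality for a monotone non-increasing vector,
print(lower_total_variation(non_increasing), non_increasing[0] - non_increasing[-1])
# and it vanishes for a monotone non-decreasing vector.
print(lower_total_variation(non_decreasing))
# Adding a constant to every coordinate does not change the value.
print(lower_total_variation([x + 10 for x in general]))
```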
**Step 2: Quantizing the tangent cone.** Now, let be a connected refinement of . Lemma D.7 implies that is contained in the set such that and for some and . From this perspective, we consider finitely many allocation patterns of the budgets for and . To be more precise, we construct a cover of the tangent cone in the following way. Consider a triple such that:
- (a) and are vectors consisting of non-negative numbers, and
- (b) is a set of indices such that for .
For such a triple, we define a set
[TABLE]
where is taken as the right-hand side of (44):
[TABLE]
Then, quantizing the allocation vectors and , we can cover the set with finitely many s as in the following lemma.
**Lemma D.9.**
Suppose that is a connected refinement of . Define the signs as in Lemma D.7. Let be a set of allocation vectors satisfying the following condition: there exists an integer vector such that () and , and the allocation vector can be written as
[TABLE]
Let be a set of indices such that for all . Given and , we define a set as in (48). Then, we have
[TABLE]
Proof.
Fix any vector in . Since , there exists an integer such that
[TABLE]
Summing over , we have
[TABLE]
which implies .
Next, by Lemma D.7, there exist such that . Hence, for any , there exists an integer such that
[TABLE]
Suppose . Summing over , we have and thus . For the case of , it is clear that . ∎
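The rounding step in this proof can be illustrated with a small sketch. It assumes a non-negative budget vector with total at most one, rounded up to the grid of multiples of 1/m, where m is the number of pieces. The exact grid used in the lemma appears in a display that is not reproduced here, so the grid, the function name, and the example values are assumptions.

```python
from fractions import Fraction
from math import ceil


def quantize_allocation(t):
    """Round each budget t[i] up to a positive multiple of 1/m, m = len(t).

    Returns the integer vector r and the quantized budgets q = r / m.
    """
    m = len(t)
    r = [max(1, ceil(m * ti)) for ti in t]
    q = [Fraction(ri, m) for ri in r]
    return r, q


t = [Fraction(1, 2), Fraction(0), Fraction(1, 3), Fraction(1, 6)]
r, q = quantize_allocation(t)
print(r, sum(r), q)
```

Since each rounded integer exceeds m times the original budget by at most one, the integers sum to at most 2m whenever the budgets sum to at most one, and each quantized budget dominates the original one.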
We should note that the cardinalities of and are respectively bounded as follows:
**Proposition D.10.**
Let and be the sets defined in Lemma D.9. Then, we have:
- (i) , and
- (ii) .
Proof.
For the first part, we observe that is not larger than the cardinality of
[TABLE]
Then, we have
[TABLE]
The proof of inequality (a) above can be found in Proposition 4.3 of Dudley (2014).
The second part is obtained by Jensen’s inequality as
[TABLE]
∎
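A counting argument of this kind can be checked by brute force on a small instance. The sketch below counts positive integer vectors of length m whose entries sum to at most a given total, and compares the count with a binomial coefficient and with the standard estimate (eN/m)^m. Whether these are exactly the set and the inequality (a) of the proposition is an assumption, since the displays are not reproduced here.

```python
from itertools import product
from math import comb, e


def count_positive_integer_vectors(m, total):
    """Brute-force count of r in {1, 2, ...}^m with r_1 + ... + r_m <= total."""
    return sum(
        1 for r in product(range(1, total + 1), repeat=m) if sum(r) <= total
    )


m, total = 3, 6
brute = count_positive_integer_vectors(m, total)
closed_form = comb(total, m)      # hockey-stick identity
upper = (e * total / m) ** m      # standard bound on the binomial coefficient
print(brute, closed_form, upper)
```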
**Step 3: Controlling Gaussian widths.** As mentioned before, our goal is to obtain an upper bound on the Gaussian width
[TABLE]
where we use the convention that . Let be a pair of a partition and a sign vector of knots defined as in Lemma D.7. Using the decomposition in Lemma D.9, we have
[TABLE]
Besides, leveraging a general result for Gaussian suprema (see Lemma F.4 below), we have
[TABLE]
Here, we used Proposition D.10 to bound the cardinality of the set . More precisely, we used the following evaluation:
[TABLE]
Given and , we define
[TABLE]
Dividing the supremum into pieces , this quantity is bounded from above as , where
[TABLE]
Here, we write .
We now consider the quantity (53). In the set over which the supremum is taken, the lower total variation of is bounded from above as
[TABLE]
As mentioned in Remark D.8, the reverse inequality
[TABLE]
is always true, and the equality can hold only if two sub-vectors and are either monotone increasing or non-increasing. From this point of view, we may consider that the meaning of the condition (54) is that is approximated by two nearly monotone pieces. This suggests that the complexity of can be evaluated by that of the class of monotone functions.
Below, we provide the upper bound of the Gaussian width of the form (53). First, the following lemma treats a special case where is taken as the rightmost point in .
**Lemma D.11.**
For every , , and , we have
[TABLE]
Proof.
The proof is divided into two cases where and .
Case 1 (): By scaling properly, we need only consider the case where . For a vector , we define a monotone vector as
[TABLE]
We also define another monotone vector as
[TABLE]
It is easy to check that . Using these notations, we have
[TABLE]
Hence, the condition is equivalent to , which leads to
[TABLE]
and
[TABLE]
Denote by the left-hand side in (D.11) with . The argument in the previous paragraph implies that
[TABLE]
The expectation in the last line is bounded as
[TABLE]
Here, the first inequality is Jensen’s inequality, and the second inequality is a consequence of equation (D.12) in Amelunxen et al. (2014). Combining with (56), we have the desired result.
Case 2 (): We can assume w.l.o.g. . As in Case 1, we write a vector as a difference of monotone vectors. For , we define and as
[TABLE]
and
[TABLE]
respectively. Under this notation, the condition is equivalent to , and therefore we have
[TABLE]
Then, an argument similar to that of Case 1 yields the result. ∎
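The key device in this proof, writing a vector as a difference of two monotone vectors, can be sketched as follows. The cumulative-sum construction below is the standard Jordan-type decomposition. The precise monotone vectors used in the proof are given in displays not reproduced here, so this particular form and the names `up` and `down` are assumptions.

```python
def monotone_decomposition(theta):
    """Return non-decreasing vectors (up, down) with
    theta[i] = theta[0] + up[i] - down[i] for a non-empty vector theta.

    up accumulates upward jumps and down accumulates downward jumps.
    """
    up, down = [0], [0]
    for i in range(len(theta) - 1):
        step = theta[i + 1] - theta[i]
        up.append(up[-1] + max(step, 0))
        down.append(down[-1] + max(-step, 0))
    return up, down


theta = [3, 1, 2, 0]
up, down = monotone_decomposition(theta)
rebuilt = [theta[0] + u - d for u, d in zip(up, down)]
print(up, down, rebuilt)
```

Under the assumed form of the lower total variation (the sum of positive parts of downward jumps), the last entry of `down` equals the lower total variation of `theta`, so a bound on the lower total variation is a bound on the range of one of the two monotone pieces.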
Next, the following lemma provides an upper bound of for general choices of .
**Lemma D.12.**
Fix , , and . For every , the quantity
[TABLE]
is bounded from above as
[TABLE]
In particular, we deduce a simpler bound
[TABLE]
Proof.
Let be a pair of sub-vectors of defined as and . If either or (i.e., one of and becomes a singleton), the result is a direct consequence of Lemma D.11.
Henceforth, we assume that . Suppose that satisfies the assumption . Since , we have
[TABLE]
Similarly, we have
[TABLE]
Based on these observations, we reduce to
[TABLE]
in which both terms in the right-hand side can be bounded using Lemma D.11. ∎
Before going to the next step, we summarize the results in Step 3 as follows.
**Proposition D.13.**
Fix . Let be any connected refinement of , and be the signs associated with as in Lemma D.7. Define as in (49). Then, the quantity defined in (53) is bounded from above by
[TABLE]
Proof.
This is a direct consequence of (52) and (58). ∎
**Step 4: Applying Lemma D.3.** We are now ready to complete the proof of Theorem 4.1.
Recall that our goal is to obtain an upper bound for which is defined in (53). To this end, we will construct a suitable refinement of with moderate piece lengths so that we can control the first term in (59). In fact, from an argument parallel to that in Guntuboyina et al. (2017), there exists a refinement such that
[TABLE]
and . We also define the signs in a similar way as Lemma D.6, but if the knot is not contained in the original partition , the corresponding sign will be specified later.
We can bound the first term in (59) in the following two steps. First, from the Cauchy–Schwarz inequality and the fact that , we have
[TABLE]
Second, by the above construction of , we have
[TABLE]
Therefore, the right-hand side in (59) can be bounded from above by
[TABLE]
Here, to hide the constant term , we have also used the fact that for every integer .
Let be the signs associated with the constant partition (recall the definition (4.1)). Then, we can choose the values of so that the following inequality holds:
[TABLE]
In fact, this is possible if we choose as the sign for the nearest knot that is to the right of . Combining (D.2), (60) and Proposition D.2, the statistical dimension of is bounded from above as
[TABLE]
where we also used the elementary fact that . Consequently, applying Lemma D.3, we obtain the desired result.
**Remark D.14 (Non-Gaussian noise).**
In the non-Gaussian noise setting, we can prove a result analogous to Proposition D.5. We sketch the proof of such a generalization.
The proof of Proposition D.5 consists of (i) a decomposition argument for the tangent cone and (ii) bounds for some probabilistic quantities (i.e., the statistical dimension and the Gaussian width). The former argument is completely deterministic and independent from the distributional assumption on the noise variables. Regarding the probabilistic bounds, we used the following bound for (Gaussian) statistical dimension of :
[TABLE]
Hence, if we can obtain a similar bound for non-Gaussian random variables, we can prove an analogous result to Proposition D.5.
Let be i.i.d. random variables with and . For a convex cone , we define the statistical dimension as
[TABLE]
Here, we write , and the last equality holds from a deterministic relation
[TABLE]
(See Amelunxen et al. (2014) for details). Then, from Theorem 3.1 in Chatterjee et al. (2015), we can check that
[TABLE]
Therefore, by following a similar argument as the proof of Proposition D.5, we conclude that
[TABLE]
for some universal constant . As a consequence, we can prove the expected risk bound similar to (4.7) for non-Gaussian noise variables.
D.3 Proof of Corollary 4.4
Let be a number to be specified later. Define a vector as and
[TABLE]
Then, we have . Moreover, the constant partition and the sign of (defined in (4.1)) are the same as those of , and therefore and .
Now, we set so that . Applying the upper bound (4.1), we have
[TABLE]
The first term in the right-hand side is bounded from above as
[TABLE]
From the minimal length condition (18) and the definition of , we also have
[TABLE]
Combining the above inequalities, we have the desired result.
D.4 Risk bounds for penalized estimators (Proof of Theorem 4.7)
We prove Theorem 4.7 as an application of Lemma D.4. Let denote the set of subgradients (i.e., subdifferential) of the convex function at . The task is to provide a suitable upper bound for the Gaussian mean squared distance of the set . To do this, we use the technique developed in Guntuboyina et al. (2017). The idea is stated roughly as follows: Recall that the Gaussian mean squared distance of a convex cone can be written as the statistical dimension of the polar cone (Proposition D.2-(ii)). This motivates us to relate the Gaussian mean squared distance to that of an associated cone. In particular, we consider the conic hull of the subdifferential:
[TABLE]
As we explain later, can be evaluated by the results in the previous subsection. Then, we can complete the proof if we have an upper bound of the following form:
[TABLE]
where is a residual term that depends on and .
First, we show that has exactly the same value as the statistical dimension of the tangent cone of , for which we have already provided a bound in the previous part of this paper.
**Proposition D.15.**
For any , the following equality holds:
[TABLE]
In particular, we have the following upper bound:
[TABLE]
where is the same universal constant as in Proposition D.5.
Proof.
Let us write . In the light of Proposition D.2-(ii), it suffices to show that is the polar cone of . However, from fundamental results in convex geometry, we always have
[TABLE]
for any convex function (see Lemma A.5 and Lemma A.5 in Guntuboyina et al. (2017)). For the case where , the set above is
[TABLE]
which implies the desired result. ∎
Next, we provide an inequality of the form (62). Since holds for every , the definition of the Gaussian mean squared distance (Definition D.1-(ii)) suggests that . However, we need a reverse inequality (62). To this end, we use the following result proved by Guntuboyina et al. (2017).
**Lemma D.16 (Guntuboyina et al. (2017), Proposition B.5).**
Let be a convex function, and . Define a vector as
[TABLE]
where is the affine hull of the set . Suppose that . For any , define as
[TABLE]
Then, is well-defined, and has a finite expectation .
Further, define as
[TABLE]
Then, for every and , we have
[TABLE]
Before proceeding, we introduce additional terminology: A convex function is said to be weakly decomposable if we have
[TABLE]
for every . In other words, we can choose in (64) if is weakly decomposable. Under the assumption that is weakly decomposable, the inequality (64) can be simplified as follows:
**Corollary D.17.**
Suppose that is convex and weakly decomposable. Under the same notation as in Lemma D.16, we have
[TABLE]
Now, we apply Lemma D.16 to the case . The following proposition provides the structural information of that we need for evaluating the upper bound (64). The proof is postponed to Appendix D.6.
**Proposition D.18.**
- (i) is weakly decomposable.
- (ii) For any , let us define as in (63). Then, we have
[TABLE]
From Proposition D.18 and Corollary D.17, is bounded from above by
[TABLE]
provided that . Here, is a universal constant. Combining this bound with Lemma D.4, we obtain the desired risk bound.
Lastly, we provide an upper bound for the optimal tuning parameter . This is obtained from the following estimate of .
**Proposition D.19.**
Suppose that and . For any , define as
[TABLE]
Then, we have
[TABLE]
where is the expectation with respect to .
Proof.
Let be the conic hull of , and let denote the orthogonal projection map onto . By the definition of , there exists a vector such that .
First, we show a partial result
[TABLE]
As we will see in Appendix D.6, is the support function for a certain convex set. Then, by the fundamental fact for the support function that for all (see Corollary 8.25 in Rockafellar and Wets (1998)), we have
[TABLE]
Here, in the last line, is the polar cone of (see Proposition D.15), and we used the Moreau decomposition . Taking the expectation of both sides with respect to , we have
[TABLE]
which implies the desired result. Here, we used the equality between the statistical dimension and the expected squared norm of projection: (see Proposition 3.1 in Amelunxen et al. (2014)).
To prove the other inequality, we use the characterization of given in (72) in Appendix D.6 below. In particular, if we take as in (75), we have
[TABLE]
and
[TABLE]
and hence the result follows. ∎
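The Moreau decomposition used in this proof can be verified exactly on a simple closed convex cone. The sketch below takes the cone of non-decreasing vectors as an illustrative choice (it is not the tangent cone from the proof) and computes the projection with the pool-adjacent-violators algorithm. The residual is then the projection onto the polar cone, it is orthogonal to the projection, and the squared norms add up. The function name and the example vector are assumptions.

```python
from fractions import Fraction


def project_monotone_cone(z):
    """Euclidean projection of z onto non-decreasing vectors (PAVA)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in z:
        blocks.append([value, 1])
        # Merge while the previous block mean exceeds the last block mean
        # (compared by cross-multiplication; counts are positive).
        while (
            len(blocks) > 1
            and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out


def squared_norm(v):
    return sum(x * x for x in v)


z = [Fraction(v) for v in (3, 1, 2, 5, 4)]
proj = project_monotone_cone(z)
resid = [a - b for a, b in zip(z, proj)]          # projection onto the polar cone
inner = sum(p * r for p, r in zip(proj, resid))   # should vanish
print(proj, resid, inner)
```

Exact rational arithmetic is used so that orthogonality and the Pythagorean identity hold without rounding error.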
D.5 Proof of Corollary 4.12
First, we explain that a monotone vector satisfying the moderate growth condition is approximated by a piecewise-constant vector such that the segments at both ends have sufficient lengths. To this end, we need the following lemma. Here, the first two statements (i) and (ii) are shown in Lemma 2 in Bellec and Tsybakov (2015). The third statement (iii) ensures that the moderate growth condition implies the minimal length condition (18).
**Lemma D.20.**
Let be a monotone vector satisfying the moderate growth condition and . Then, there exists another monotone vector satisfying the following three conditions.
- (i) is -piecewise constant with
[TABLE]
Here, is the smallest integer that is not less than .
- (ii) We have
[TABLE]
and
[TABLE]
- (iii) Let be the partition on which is constant. Then, we have and .
Proof.
Let be an integer defined in (67). We construct a -piecewise constant monotone vector as follows: First, define an equi-spaced partition of the interval as
[TABLE]
and . Next, define a partition of as (). Then, let be a piecewise-constant vector such that for . See the right panel of Figure 4 for an illustrative example for and its piecewise-constant approximation . By an argument similar to that of Lemma 2 in Bellec and Tsybakov (2015), we can check (i) and (ii).
It remains to prove (iii) under the moderate growth condition. Below, we will only check that the maximal element in is not less than because can be checked in a similar way. Let . Note that we have since . By the moderate growth condition, we have
[TABLE]
which means and hence . ∎
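The construction in this proof, binning a monotone vector by an equi-spaced partition of its range, can be sketched in a few lines. The version below assigns to each bin the midpoint of the corresponding value interval. The value actually used in the lemma is given in a display not reproduced here, so the midpoint choice, the function name, and the example are assumptions.

```python
from fractions import Fraction


def piecewise_constant_approx(theta, k):
    """Approximate a non-decreasing vector by at most k constant pieces.

    The range [theta[0], theta[-1]] is split into k equal-width bins and
    each entry is replaced by the midpoint of its bin (assumed choice).
    Entries are expected to be integers or Fractions.
    """
    lo, hi = theta[0], theta[-1]
    width = Fraction(hi - lo) / k
    if width == 0:
        return list(theta)  # already constant
    approx = []
    for value in theta:
        j = min(int((value - lo) / width), k - 1)  # bin index, top edge clipped
        approx.append(lo + (j + Fraction(1, 2)) * width)
    return approx


theta = [0, 1, 2, 3, 7, 8, 10, 12]
approx = piecewise_constant_approx(theta, 3)
max_error = max(abs(a - t) for a, t in zip(approx, theta))
print(approx, max_error)
```

With this choice the result is again non-decreasing, takes at most k distinct values, and differs from the original vector by at most half a bin width in every coordinate.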
Now, we are ready to prove Corollary 4.12. Applying Lemma D.20 to every segment , we have a -piecewise constant and -piecewise monotone vector such that
[TABLE]
and
[TABLE]
Moreover, satisfies the minimum length condition (18) with . Therefore, we have and
[TABLE]
where we used an obvious inequality . Then, Theorem 4.7 implies that there exists such that
[TABLE]
for some universal constant . This is the desired conclusion. Note that an upper bound for such is suggested by Proposition 4.8.
D.6 Subdifferential and weak decomposability
In this subsection, we discuss the structure of the subdifferential of the nearly-isotonic type penalties. The main purpose is to discuss the weak decomposability (defined in Appendix D.4) of .
D.6.1 Characterization of the subdifferential
First, we observe that can be written as a support function of a certain convex set. In fact, by Theorem 8.24 in Rockafellar and Wets (1998), we can see that
[TABLE]
where is a closed convex set. Conversely, once we have a convex function , the set is specified as
[TABLE]
Many properties of the support function can be understood through the structure of the set ; in particular, we can characterize the subdifferential and weak decomposability. Below, we investigate the more detailed structure of the set in terms of submodular functions.
Let be a directed graph equipped with positive edge weights . For any , we define a nearly-isotonic type penalty for the weighted graph as in (33). For any subset , we also define by the total weights of outgoing edges:
[TABLE]
The function is called the cut function of the weighted graph .
It is well known that the cut function is a submodular function. Here, a function is called submodular if and
[TABLE]
holds for any subsets . We refer the reader to Bach (2013) for fundamental properties of submodular functions. For any submodular function , we define the base polyhedron as
[TABLE]
The Lovász extension of is defined as the support function of , that is, for any ,
We see that the nearly-isotonic type penalty (33) is actually the Lovász extension of the cut function (71).
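The cut function and its submodularity can be checked by brute force on a small graph. The sketch below encodes a weighted directed graph as a list of triples (tail, head, weight) and uses the directed chain with edges from i to i+1 as the graph of the nearly-isotonic penalty. The encoding, the edge orientation, the normalization at the empty set, and the function names are illustrative assumptions.

```python
from itertools import combinations


def cut_value(subset, edges):
    """Total weight of edges (i, j, w) leaving `subset`: i inside, j outside."""
    return sum(w for i, j, w in edges if i in subset and j not in subset)


def all_subsets(n):
    return [
        frozenset(c) for size in range(n + 1) for c in combinations(range(n), size)
    ]


def is_submodular(n, edges):
    """Check F(empty) = 0 and F(A) + F(B) >= F(A | B) + F(A & B) for all A, B."""
    subsets = all_subsets(n)
    if cut_value(frozenset(), edges) != 0:
        return False
    return all(
        cut_value(a, edges) + cut_value(b, edges)
        >= cut_value(a | b, edges) + cut_value(a & b, edges)
        for a in subsets
        for b in subsets
    )


chain = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]  # unit-weight directed chain on 4 nodes
print(is_submodular(4, chain))
print(cut_value(frozenset({0, 2}), chain), cut_value(frozenset({2, 3}), chain))
```

On the unit-weight chain, the cut value of a subset equals the number of its connected components that do not contain the rightmost node, which is the description used later for the nearly-isotonic case.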
**Proposition D.21.**
For any directed graph and edge weight , the function is the Lovász extension of the cut function .
Proof.
This is a consequence of the well-known result on the so-called greedy algorithm; see, e.g., Proposition 3.2 in Bach (2013). In particular, a derivation can be found in Section 6.2 of Bach (2013). ∎
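The statement of Proposition D.21 can be checked numerically with the greedy formula for the Lovász extension: sort the coordinates in decreasing order and weight each one by the marginal gain of the set function. The sketch below assumes that the penalty (33) is the weighted sum of positive parts of the differences along the directed edges. That display is not reproduced here, so this form, the graph encoding, and the function names are assumptions.

```python
def cut_value(subset, edges):
    """Total weight of edges (i, j, c) leaving `subset`: i inside, j outside."""
    return sum(c for i, j, c in edges if i in subset and j not in subset)


def lovasz_extension(w, edges):
    """Greedy formula: visit coordinates in decreasing order of w and
    accumulate w[i] times the marginal gain of the cut function."""
    order = sorted(range(len(w)), key=lambda i: -w[i])
    chosen, previous, total = set(), 0, 0
    for i in order:
        chosen.add(i)
        current = cut_value(chosen, edges)
        total += w[i] * (current - previous)
        previous = current
    return total


def graph_penalty(w, edges):
    """Assumed form of the nearly-isotonic type penalty on a weighted graph."""
    return sum(c * max(w[i] - w[j], 0) for i, j, c in edges)


edges = [(0, 1, 2), (1, 2, 1), (2, 3, 3)]  # weighted directed chain
w = [3, 1, 2, 0]
print(lovasz_extension(w, edges), graph_penalty(w, edges))
```

Both quantities agree on this example. Ties among the coordinates of `w` may be broken arbitrarily without changing the value of the greedy formula.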
Now, we have the following useful characterizations of the subdifferential.
**Proposition D.22.**
Let be a submodular function and be its Lovász extension. Suppose .
- (i) The subdifferential coincides with a face of given as
[TABLE]
- (ii) There is an (ordered) partition such that
[TABLE]
where (). In particular, we have .
- (iii) Let be any point in the relative interior of . Then, the normal cone of at is contained in the set of partition-wise constant vectors:
[TABLE]
Proof.
The first statement is just a well-known property for the support function (Corollary 8.25 in Rockafellar and Wets (1998)). The second statement follows from the characterization of faces for the base polyhedron (see Proposition 4.7 in Bach (2013)). The third statement follows from (ii) and the characterization of normal cones of polyhedra (see Theorem 6.46 in Rockafellar and Wets (1998)). ∎
D.6.2 Weak decomposability
Here, we discuss the weak decomposability of the Lovász extension.
Before describing the result, we introduce some terminology. Let be a submodular function. We say that a set is separable for if there is a non-empty proper subset of such that . We also say that is inseparable if it is not separable. For example, if is the cut function defined in (71), is inseparable if and only if it is a connected component in the graph . Furthermore, we define the following agglomerative clustering condition.
**Definition D.23.**
We say that a submodular function satisfies the agglomerative clustering (AC) condition if it has the following property: Let be any disjoint pair of subsets such that and is inseparable for the function defined by . Then, for any , we have
[TABLE]
Recall the definition of weak decomposability (65). The following proposition provides a sufficient condition for the weak decomposability of the Lovász extension.
**Proposition D.24.**
Let be a submodular function satisfying the AC condition in Definition D.23. Then, the Lovász extension of is weakly decomposable.
Proof.
Fix . Since is the support function of the base polyhedron , coincides with a face of . Let be a partition of such that is represented as (72). For , we write and . We should note that the above partition can be chosen so that is inseparable for the function defined as
[TABLE]
In this case, is an dimensional subset.
Define a vector as
[TABLE]
Since
[TABLE]
holds for any , we have . Moreover, is also contained in the normal cone of . Hence, if we prove , we have
[TABLE]
which implies that .
Now, our goal is to prove under the AC condition. If , then it is clear from (72) that . Below, we assume that . Since , it suffices to show that holds for any that determines a relative boundary of . The relative boundary of can be written as the union of all dimensional faces of that have non-empty intersection with . Such faces can be characterized as follows: Let be the partition defined above, and choose with . Let be any non-empty proper subset of . We define a new ordered partition of by inserting instead of :
[TABLE]
Then, defines an dimensional affine subspace by (72), which defines a part of the relative boundary of . Therefore, we have to show that for any that can be written as with . From the AC condition, we have
[TABLE]
This proves that , and hence is weakly decomposable. ∎
**Remark D.25.**
The AC condition was originally introduced in Bach (2011). In that paper, the author considers the proximal denoising estimators (37) where is the Lovász extension of a submodular function . The name “agglomerative clustering” captures the following property: Let us consider the solution path of the minimization problem (37) parametrized by , that is, the solution path is the collection calculated for all . In general, the solution path starts with for , and shrinks toward some piecewise constant vector as increases. Proposition 4 of Bach (2011) showed that the solution path is agglomerative if satisfies the AC condition.
We provide some examples of functions satisfying the AC condition:
- Let be a concave function with . A submodular function defined as satisfies the AC condition. Examples of solution paths for this class can be found in Bach (2011).
- The one-dimensional fused lasso has an agglomerative solution path. The corresponding submodular function is the cut function of the undirected one-dimensional grid graph, which satisfies the AC condition. Hence, by Proposition D.24, the penalty of the one-dimensional fused lasso is weakly decomposable. This provides an alternative proof for Lemma 2.7 in Guntuboyina et al. (2017). On the other hand, the fused lasso on the two-dimensional grid does not satisfy this condition. See Bach (2011) for details.
- The nearly-isotonic regression (3) has an agglomerative solution path. A direct proof for this property is provided in Lemma 1 in Tibshirani et al. (2011). Below, we prove that the cut function for the directed one-dimensional grid graph satisfies the AC condition, which provides an alternative proof for this fact.
The following proposition provides a proof for Proposition D.18.
**Proposition D.26.**
The cut function associated with the nearly-isotonic regression satisfies the AC condition. In particular, the lower total variation is weakly decomposable. Moreover, for any , the minimum value of the -norm in is given by (66).
Proof.
For any , is given by the number of connected components in that do not contain the rightmost point . Let be a connected subset, and . The value of depends on whether one or both of two endpoints of are adjacent to .
We will check the AC condition by considering all patterns of adjacency as in Table 1.
Here, represents any proper subset of , and “None” means that contains or . In each case, we can easily check that the inequality (73) is satisfied. Hence, satisfies the AC condition.
The second statement is a consequence of Proposition D.24.
The last statement follows from the fact that the minimizer of in coincides with that in , which is given as (74). In this case, we can choose as the constant partition of that is sorted by the values of . Thus, we have
[TABLE]
which proves the desired result. ∎
**Remark D.27 (Missing part in the proof of Proposition A.1).**
With a slight modification of the above argument, we can show the AC condition for the cut function of weighted graph
[TABLE]
where () are edge weights. As mentioned in Proposition A.1, we need this result to prove the validity of the modified PAVA algorithm (Algorithm 1). Here, we prove that (31) provides a sufficient condition for the AC condition, and hence the solution path of the weighted nearly-isotonic regression (30) is agglomerative.
Let be a non-empty connected subset, be a subset of , and be a proper subset of . Recall that our goal is to check the inequality (73). For clarity, we write . As in the proof of Proposition D.26, we consider all adjacency patterns of , and . Then, we can easily check the following cases:
1. Suppose that either “ and ” or “ and ” holds. Then, we have and . Now, we will check (73) under the concavity condition (31). First, (73) trivially holds when because in this case . Next, we assume . Let be the largest element in . Then, we have , . Under the assumption (31), we have
[TABLE]
which implies (73).
2. Suppose that and . Then, we have and . By an argument similar to the above, (73) trivially holds when . Let and let be the largest element in . Then, under the assumption (31), we have
[TABLE]
3. In the other cases, we have , which implies (73).
Appendix E Proofs in Section 5
The goal of this section is to prove Theorem 5.1. The outline of the proof is essentially the same as the framework of Theorem 4.18 in Massart (2007). We explain this framework in Section E.1. To complete the proof, we have to control the maximum value of a certain normalized Gaussian process. For this, we provide an upper bound in Section E.2.
E.1 Proof overview
Let be the selected pair in (27). Fix any connected partition and . By the definition of the estimator, we have
[TABLE]
for any vector that belongs to . In particular, we can choose as
[TABLE]
Substituting , we can deduce that
[TABLE]
Here, recall that is a random variable drawn from .
Let be a positive number and . Suppose that an inequality
[TABLE]
holds on some event that occurs with probability at least . Here, is a positive constant that can depend on . Combining this inequality with (76), we have on the same event
[TABLE]
where we used the elementary inequality .
E.2 Controlling the normalized process
Now, our goal is to provide an inequality of the form (77). Below, we fix .
First, we fix a partition and . For any , we define
[TABLE]
where is a positive constant which will be specified later. Define a random variable as
[TABLE]
Note that is the supremum of a sample-continuous Gaussian process. By the concentration inequality for Gaussian processes (Lemma F.1), we have
[TABLE]
for any and . Here, the variance is bounded as
[TABLE]
because , and is distributed according to for any .
We will provide an upper bound for . Let be the orthogonal projection of onto . Note that
[TABLE]
The second term (b) in the right-hand side of (80) is bounded from above by . Indeed, since
[TABLE]
we have
[TABLE]
To bound the term (a) in (80), we use the following lemma:
**Lemma E.1.**
Let be any partition and . Fix any . For any , we have
[TABLE]
where is a universal constant. Furthermore, for any , we have
[TABLE]
where is the same constant as in (81).
Proof.
We will prove the first inequality (81). Let denote the left-hand side of (81). We consider a collection of finitely many sets as follows: Let be a collection of vectors that can be written as for some integer vector such that and . Note that, by Proposition D.10, the cardinality of is bounded by . For any , define the set
[TABLE]
Then, we can easily check that
[TABLE]
From Lemma F.3 below, there exists a universal constant such that
[TABLE]
Here, by Hölder’s inequality, we have
[TABLE]
and by the Cauchy–Schwarz inequality, we also have
[TABLE]
Then, by Lemma F.4 below, we have
[TABLE]
for some . Thus, (81) has been proved.
The second inequality (82) is a consequence of the peeling lemma (Lemma F.2 below). ∎
Combining (79), (80) and (82), we conclude that
[TABLE]
holds with probability at least , where is the constant in (82). Now, we choose the two constants and as
[TABLE]
and
[TABLE]
respectively. Then, it is elementary to check that the right-hand side of (E.2) is not larger than .
Applying the union bound over all pairs , we have
[TABLE]
Here, we can show that
[TABLE]
and hence we conclude that (77) holds with . Indeed, (85) follows from the fact that, for any ,
[TABLE]
and
[TABLE]
E.3 Proof of Theorem 5.1
Now, we are ready to complete the proof of Theorem 5.1. Define as
[TABLE]
where is the constant in (82). Let be the pair that minimizes
[TABLE]
among all possible pairs. Applying (78) and (77) for this choice of , we conclude that
[TABLE]
holds with probability at least . Moreover, by integrating both sides with respect to , we have
[TABLE]
Appendix F Auxiliary lemmas
Here, we present several auxiliary lemmas that are used in the proofs in the previous sections.
**Lemma F.1 (Borel–Tsirelson–Ibragimov–Sudakov inequality; see Proposition 3.19 in Massart (2007)).**
Suppose that is a Gaussian process on a totally bounded metric space such that for any and the sample path is almost surely continuous. Let . Then, for any , we have
[TABLE]
**Lemma F.2 (Peeling lemma; see e.g. Lemma 4.23 in Massart (2007)).**
Let be a set in and . Assume that there is a function such that is non-increasing and
[TABLE]
for any . Then, for any , we have
[TABLE]
**Lemma F.3 (Guntuboyina et al. (2017), Lemma B.1).**
For any and , let
[TABLE]
There exists a universal constant such that
[TABLE]
**Lemma F.4 (Guntuboyina et al. (2017), Lemma D.1).**
Suppose and let be subsets of each containing the origin and each contained in the closed Euclidean ball of radius centered at the origin. Then, for , we have
[TABLE]
Acknowledgment
This work was supported by JSPS KAKENHI Grant Number JP17J06640. The author would like to thank three anonymous reviewers for their valuable comments and suggestions. The author also thanks Hiromichi Nagao for suggesting the example of a seismological phenomenon, and Fumiyasu Komaki and Keisuke Yano for helpful discussions.
References
- Amelunxen et al. [2014] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transition in convex programs with random data. Information and Inference: A Journal of IMA, 3:224–294, 2014.
- Ayer et al. [1955] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26:641–647, 1955.
- Bach [2011] F. Bach. Shaping level sets with submodular functions. In NIPS, 2011.
- Bach [2013] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2–3):143–373, 2013.
- Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- Bellec [2018] P. C. Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.
- Bellec and Tsybakov [2015] P. C. Bellec and A. B. Tsybakov. Sharp oracle bounds for monotone and convex regression through aggregation. Journal of Machine Learning Research, 16:1879–1892, 2015.
- Birgé and Massart [2001] L. Birgé and P. Massart. Gaussian model selection. Journal of the European Mathematical Society, 3:203–268, 2001.
