Estimating Piecewise Monotone Signals

Kentaro Minami

arXiv:1905.01840·math.ST·March 10, 2020

TL;DR

This paper analyzes the nearly-isotonic regression for estimating piecewise monotone signals, providing risk bounds and an algorithm for general graphs, showing it performs nearly as well as an oracle estimator.

Contribution

It derives adaptive risk bounds for nearly-isotonic regression and introduces a versatile algorithm applicable to weighted graphs.

Findings

1. Risk bounds are adaptive to piecewise monotone signals.

2. Nearly-isotonic regression performs close to an oracle estimator.

3. The proposed algorithm works on general weighted graphs.

Abstract

We study the problem of estimating piecewise monotone vectors. This problem can be seen as a generalization of the isotonic regression that allows a small number of order-violating changepoints. We focus mainly on the performance of the nearly-isotonic regression proposed by Tibshirani et al. (2011). We derive risk bounds for the nearly-isotonic regression estimators that are adaptive to piecewise monotone signals. The estimator achieves a near minimax convergence rate over certain classes of piecewise monotone signals under a weak assumption. Furthermore, we present an algorithm that can be applied to the nearly-isotonic type estimators on general weighted graphs. The simulation results suggest that the nearly-isotonic regression performs as well as the ideal estimator that knows the true positions of changepoints.

Tables

Table 1: The values of $F_{B}^{A}$ for the cut function $F$ of the one-dimensional grid graph.

Node to the left of $A$	Node to the right of $A$	$F(B\cup A)-F(B)$	$F(B\cup C)-F(B)$
None	None	$0$	$\geq 0$
None	$B$	$0$	$\geq 0$
None	$V\setminus B$	$1$	$\geq 1_{\{C\neq\emptyset\}}$
$B$	None	$-1$	$\geq 0$
$B$	$B$	$-1$	$\geq 0$
$B$	$V\setminus B$	$0$	$\geq 0$
$V\setminus B$	None	$0$	$\geq 0$
$V\setminus B$	$B$	$0$	$\geq 0$
$V\setminus B$	$V\setminus B$	$1$	$\geq 1_{\{C\neq\emptyset\}}$

Equations

$$\text{minimize}\ \lVert y-\theta\rVert_{2}\quad\text{subject to}\quad\theta_{1}\leq\theta_{2}\leq\cdots\leq\theta_{n}.$$

$$y_{i}=\theta_{i}^{*}+\xi_{i},\quad i=1,2,\ldots,n,$$

$$\theta_{\tau_{i}}\leq\theta_{\tau_{i}+1}\leq\cdots\leq\theta_{\tau_{i+1}-1},\quad\text{for } i=1,2,\ldots,m.$$

$$\hat{\theta}_{\lambda}\in\operatorname*{argmin}_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{2}\lVert y-\theta\rVert_{2}^{2}+\lambda\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+}\right\},$$

$$\mathrm{dist}(\theta^{*},K_{n}^{\uparrow}):=\inf_{\theta\in K_{n}^{\uparrow}}\lVert\theta^{*}-\theta\rVert_{2}.$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{K_{n}^{\uparrow}}-\theta^{*}\rVert_{2}^{2}\leq C\left(\frac{\sigma^{2}\mathcal{V}(\theta^{*})}{n}\right)^{2/3}+\frac{C\sigma^{2}\log\mathrm{e}n}{n},$$

$$\max\left\{\left(\frac{\sigma^{2}\mathcal{V}}{n}\right)^{2/3},\ \frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m}\right\}.$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\lambda}-\theta^{*}\rVert_{2}^{2}\leq C\left\{\left(\frac{\sigma^{2}\mathcal{V}\log\mathrm{e}n}{n}\right)^{2/3}+\frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m}\right\}.$$

$$\mathcal{V}(\theta):=\sum_{i=1}^{n-1}\lvert\theta_{i}-\theta_{i+1}\rvert\quad\text{and}\quad\mathcal{V}_{-}(\theta):=\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+},$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{K_{n}^{\uparrow}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in K_{n}^{\uparrow}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+\frac{\sigma^{2}k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}\right\},$$

$$\hat{\theta}_{\mathrm{fused},\lambda}=\operatorname*{argmin}_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{2}\lVert y-\theta\rVert_{2}^{2}+\lambda\sum_{i=1}^{n-1}\lvert\theta_{i}-\theta_{i+1}\rvert\right\},$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathrm{fused},\lambda^{*}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\frac{\sigma^{2}k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\Delta_{\mathrm{fused}}(\theta)\right\},$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathrm{neariso},\lambda^{*}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\frac{\sigma^{2}k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\Delta_{\mathrm{neariso}}(\theta)\right\}.$$

$$\inf_{\hat{\theta}}\sup_{\theta^{*}\in\Theta}\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}-\theta^{*}\rVert_{2}^{2},$$

$$\inf_{\hat{\theta}_{i}}\ \sup_{\theta_{A_{i}}^{*}\in K_{A_{i}}^{\uparrow}:\,\mathcal{V}(\theta_{i}^{*})\leq\mathcal{V}_{i}}\frac{1}{n_{i}}\mathbb{E}_{\theta_{A_{i}}^{*}}\lVert\hat{\theta}_{i}-\theta_{i}^{*}\rVert_{2}^{2}\geq C_{1}\left(\frac{\sigma^{2}\mathcal{V}_{i}}{n_{i}}\right)^{2/3}\quad\text{for all } i=1,\ldots,m.$$

$$C_{1}\sum_{i=1}^{m^{*}}\frac{n_{i}}{n}\left(\frac{\sigma^{2}\mathcal{V}_{i}}{n_{i}}\right)^{2/3}.$$

$$\sup_{\theta^{*}\in\Theta}\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}-\theta^{*}\rVert_{2}^{2}\geq C\max\left\{\left(\frac{\sigma^{2}\mathcal{V}}{n}\right)^{2/3},\ \frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m}\right\},$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{K_{n}^{\uparrow}}-\theta^{*}\rVert_{2}^{2}\leq\epsilon^{2}+\frac{\sigma^{2}k(\bar{\theta})}{n}\log\frac{\mathrm{e}n}{k(\bar{\theta})},$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{K_{n}^{\uparrow}}-\theta^{*}\rVert_{2}^{2}>\sigma^{2}.$$

$$\text{minimize}\ \lVert y-\theta\rVert_{2}^{2}\quad\text{subject to}\quad\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+}\leq\mathcal{V},$$

$w_{1}$

$w_{i}$

$$M(\theta):=\sum_{j=2}^{k}\max\left\{\frac{1}{\lvert A_{j}\rvert},\ \frac{k}{n}\right\}1_{\{w_{j-1}\neq w_{j}\}}.$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathcal{V}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}:\,\mathcal{V}_{-}(\theta)=\mathcal{V}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\sigma^{2}\frac{k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\sigma^{2}\frac{M(\theta)}{k(\theta)}\log\frac{\mathrm{e}n}{k(\theta)}\right\}.$$

$$\frac{1}{n}\lVert\hat{\theta}_{\mathcal{V}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}:\,\mathcal{V}_{-}(\theta)=\mathcal{V}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\sigma^{2}\frac{k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\sigma^{2}\frac{M(\theta)}{k(\theta)}\log\frac{\mathrm{e}n}{k(\theta)}\right\}+\frac{4\sigma^{2}\log\eta^{-1}}{n}$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathcal{V}^{*}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\sigma^{2}\frac{k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\sigma^{2}\frac{M(\theta)}{k(\theta)}\log\frac{\mathrm{e}n}{k(\theta)}\right\}.$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathcal{V}}-\theta^{*}\rVert_{2}^{2}\leq C\sigma^{2}\left\{\frac{k(\theta^{*})}{n}\log\frac{\mathrm{e}n}{k(\theta^{*})}+\frac{M(\theta^{*})}{k(\theta^{*})}\log\frac{\mathrm{e}n}{k(\theta^{*})}\right\}.$$

$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathcal{V}^{*}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in K_{n}^{\uparrow}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\sigma^{2}\frac{k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}\right\},$$

$$\min\left\{\lvert A_{i}\rvert:1\leq i\leq k,\ w_{i}\neq w_{i+1}\right\}\geq\frac{cn}{k},$$

Taxonomy

Topics: Statistical Methods and Inference · Control Systems and Identification · Advanced Statistical Methods and Models

Full text

Estimating Piecewise Monotone Signals

Kentaro Minami

The University of Tokyo

Preferred Networks, Inc.

(7 March 2020)

keywords: piecewise monotone function, isotonic regression, nearly-isotonic regression, adaptive risk bounds

1 Introduction
1.1 Summary of theoretical results
1.2 Organization
1.3 Notation
2 Related work
3 Lower bounds
3.1 Minimax lower bound
3.2 Lower bound of isotonic regression with misspecified partitions
4 Risk bounds for nearly-isotonic regression
4.1 Risk bounds for constrained estimators
4.2 Risk bounds for penalized estimators
4.3 Application to piecewise monotone vectors
5 Model selection based estimators
6 Simulations
6.1 Dealing with inconsistency at boundaries
6.2 Simulation data
6.3 Geological data
7 Discussion
7.1 Non-Gaussian noises
7.2 Future directions
A Algorithms for nearly-isotonic estimators
A.1 Penalized estimators
A.1.1 One-dimensional problem
A.1.2 General graphs
A.1.3 General convex loss functions
A.2 Constrained estimators
B Supplemental experiments
C Proofs in Section 3
C.1 Proof of Proposition 3.2
C.2 Proof of Proposition 3.3
D Proofs in Section 4
D.1 Preliminaries
D.2 Risk bounds for constrained estimators (Proof of Theorem 4.1)
D.3 Proof of Corollary 4.4
D.4 Risk bounds for penalized estimators (Proof of Theorem 4.7)
D.5 Proof of Corollary 4.12
D.6 Subdifferential and weak decomposability
D.6.1 Characterization of the subdifferential
D.6.2 Weak decomposability
E Proofs in Section 5
E.1 Proof overview
E.2 Controlling the normalized process
E.3 Proof of Theorem 5.1
F Auxiliary lemmas

1 Introduction

Isotonic regression is a popular statistical method based on partial order structures, which has a long history in statistics (Ayer et al. 1955, Brunk 1955, van Eeden 1956). Suppose that $\theta^{*}\in\mathbb{R}^{n}$ is a monotone vector satisfying $\theta^{*}_{1}\leq\theta^{*}_{2}\leq\cdots\leq\theta^{*}_{n}$, and $y$ is a noisy observation of $\theta^{*}$. The goal of isotonic regression is to find a least-squares fit under the monotonicity constraint:

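$$\text{minimize}\ \lVert y-\theta\rVert_{2}\quad\text{subject to}\quad\theta_{1}\leq\theta_{2}\leq\cdots\leq\theta_{n}.\tag{1}$$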

In other words, the isotonic regression is the least squares estimator $\hat{\theta}=\hat{\theta}_{K_{n}^{\uparrow}}$ over a closed convex cone $K^{\uparrow}_{n}:=\{\theta\in\mathbb{R}^{n}:\theta_{1}\leq\theta_{2}\leq\cdots\leq\theta_{n}\}$ . Broadly speaking, the isotonic regression is an example of shape restricted regression. For comprehensive reviews on this field, see Robertson et al. (1988), Groeneboom and Jongbloed (2014), Chatterjee et al. (2015), Guntuboyina and Sen (2017) and references therein.
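As a concrete illustration of the projection onto $K_{n}^{\uparrow}$, the following is a minimal sketch of the classical pool-adjacent-violators algorithm (PAVA), a standard way to compute the isotonic regression. The function name `isotonic_regression` is a choice made for this sketch and does not come from the paper.

```python
def isotonic_regression(y):
    """Least-squares nondecreasing fit to y via pool-adjacent-violators."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block mean exceeds the last block mean.
        # Means are compared by cross-multiplication to avoid divisions.
        while len(blocks) > 1 and (
            blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)  # every block is fitted by its mean
    return fit


print(isotonic_regression([1, 3, 2, 4]))
```

In this example the order-violating pair $(3,2)$ is pooled into its average, while the remaining entries are left unchanged.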

In this paper, we study the problem of estimating piecewise monotone vectors, which can be regarded as a generalization of isotonic regression that allows order-violating changepoints. We formulate the problem precisely as follows. Let us consider the Gaussian sequence model

$$y_{i}=\theta_{i}^{*}+\xi_{i},\quad i=1,2,\ldots,n,$$

where $y=(y_{1},y_{2},\ldots,y_{n})^{\top}\in\mathbb{R}^{n}$ is the observed vector, $\theta^{*}=(\theta^{*}_{1},\theta^{*}_{2},\ldots,\theta^{*}_{n})^{\top}\in\mathbb{R}^{n}$ is the unknown parameter of interest, and $\xi=(\xi_{1},\xi_{2},\ldots,\xi_{n})^{\top}$ is the unobserved noise distributed according to the Gaussian distribution $N(0,\sigma^{2}I_{n})$ . Given the noisy observation $y$ , the problem is to find a good piecewise monotone approximation of $\theta^{*}$ . Here we define piecewise monotone vectors as follows.

Definition 1.1.

Let $\Pi=(A_{1},A_{2},\ldots,A_{m})$ be a connected partition of $[n]=\{1,2,\ldots,n\}$ , that is, there exists a sequence $1=\tau_{1}<\tau_{2}<\cdots<\tau_{m}<\tau_{m+1}=n+1$ such that $A_{i}=\{\tau_{i},\tau_{i}+1,\ldots,\tau_{i+1}-1\}$ ( $i=1,2,\ldots,m$ ). We say that a vector $\theta\in\mathbb{R}^{n}$ is piecewise monotone on $\Pi$ if the restriction on each $A_{i}$ is monotone:

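$$\theta_{\tau_{i}}\leq\theta_{\tau_{i}+1}\leq\cdots\leq\theta_{\tau_{i+1}-1},\quad\text{for } i=1,2,\ldots,m.$$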

We also say that $\theta$ is $m$ -piecewise monotone if $\theta$ is piecewise monotone on some partition $\Pi$ with $|\Pi|=m$ .
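Definition 1.1 implies that the smallest $m$ for which a vector is $m$-piecewise monotone is one plus the number of descents $\theta_{i}>\theta_{i+1}$, since every descent must fall on a boundary between two pieces. The sketch below computes this number and the corresponding changepoints $\tau_{1}<\tau_{2}<\cdots<\tau_{m}$; the function names are illustrative and not taken from the paper.

```python
def min_monotone_pieces(theta):
    """Smallest m such that theta is m-piecewise monotone."""
    descents = sum(1 for a, b in zip(theta, theta[1:]) if a > b)
    return descents + 1


def greedy_changepoints(theta):
    """1-based start indices tau_1 < ... < tau_m of the coarsest valid partition."""
    starts = [1]
    for i in range(1, len(theta)):
        if theta[i - 1] > theta[i]:  # order violation: a new piece starts here
            starts.append(i + 1)
    return starts


theta = [1, 2, 3, 0, 1, 5, 2]
print(min_monotone_pieces(theta), greedy_changepoints(theta))
```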

We are particularly interested in the case where the number of pieces $m$ is larger than two but much smaller than $n$, because the problem otherwise reduces to simpler ones. By Definition 1.1, a monotone vector in $K_{n}^{\uparrow}$ is $m$-piecewise monotone for any $m\geq 1$. In particular, the least squares estimator over $1$-piecewise monotone vectors coincides with the isotonic regression. Moreover, since any vector in $\mathbb{R}^{n}$ is $n$-piecewise monotone, the least squares estimator over $n$-piecewise monotone vectors is merely the identity estimator $\hat{\theta}_{\mathrm{id}}=y$.

In real-world applications, many signals can be approximated by piecewise monotone vectors. Here, we provide a few examples. First, in seismology, geological observations such as tide gauge records (Nagao et al. 2013) and GPS records (Rogers and Dragert 2003) often consist of a long-term monotonic trend and discontinuous jumps caused by tectonic activities. In particular, Rogers and Dragert (2003) reported that GPS measurements near a subduction zone in North America can be approximated by a sawtooth function. The top panel of Figure 1 shows an example of GPS measurements. Second, the numbers of search queries for some season-related words (e.g., “Christmas” and “gift”) can be seen as periodic piecewise monotone signals (see the bottom panel of Figure 1 for examples). Third, in the ranking systems of online shopping websites, sales ranks of rarely sold items behave like piecewise monotone signals because they suddenly rise every time the items are sold (Hattori and Hattori 2010).

In this paper, we focus on the performance of nearly-isotonic regression proposed by Tibshirani et al. (2011). Given $y\in\mathbb{R}^{n}$ and a tuning parameter $\lambda\geq 0$ , the nearly-isotonic regression estimator $\hat{\theta}_{\lambda}$ is defined as

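$$\hat{\theta}_{\lambda}\in\operatorname*{argmin}_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{2}\lVert y-\theta\rVert_{2}^{2}+\lambda\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+}\right\},\tag{3}$$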

where $(z)_{+}:=\max\{z,0\}$. Intuitively, the tuning parameter $\lambda$ controls the degree of monotonicity. The term $(\theta_{i}-\theta_{i+1})_{+}$ imposes a positive penalty if and only if the directed edge $(i,i+1)$ is order-violating, i.e., $\theta_{i}>\theta_{i+1}$. Hence, a large value of $\lambda>0$ makes the estimator $\hat{\theta}_{\lambda}$ close to a monotone vector. In particular, there is a sufficiently large $\lambda$ for which the solution $\hat{\theta}_{\lambda}$ coincides exactly with the isotonic regression (1).
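To make the role of $\lambda$ concrete, here is a minimal numerical sketch for small problems. It is neither the path algorithm of Tibshirani et al. (2011) nor the algorithm described in the appendix of this paper. Instead, it assumes the standard dual reformulation $\hat{\theta}_{\lambda}=y-D^{\top}\hat{u}$, where $D\theta=(\theta_{i}-\theta_{i+1})_{i}$ and $\hat{u}$ minimizes $\frac{1}{2}\lVert y-D^{\top}u\rVert_{2}^{2}$ over the box $0\leq u_{i}\leq\lambda$, and solves the dual by projected gradient descent. The name `nearly_isotonic`, the fixed step size $1/4$ (valid because the largest eigenvalue of $DD^{\top}$ is below 4), and the iteration count are choices made for this sketch.

```python
def nearly_isotonic(y, lam, n_iter=5000):
    """Approximate nearly-isotonic fit via projected gradient on the dual."""
    n = len(y)
    u = [0.0] * (n - 1)  # dual variable, one entry per edge (i, i+1)
    theta = [float(v) for v in y]
    for _ in range(n_iter):
        # Gradient step on the dual followed by projection onto [0, lam].
        # u_i grows only while the edge is order-violating (theta_i > theta_{i+1}).
        u = [
            min(max(u[i] + 0.25 * (theta[i] - theta[i + 1]), 0.0), lam)
            for i in range(n - 1)
        ]
        # Recover the primal iterate theta = y - D^T u.
        theta = [
            y[j] - (u[j] if j < n - 1 else 0.0) + (u[j - 1] if j > 0 else 0.0)
            for j in range(n)
        ]
    return theta


y = [1, 3, 2, 4]
for lam in (0.0, 0.25, 10.0):
    print(lam, nearly_isotonic(y, lam))
```

With $\lambda=0$ the output is $y$ itself, an intermediate $\lambda$ shrinks the order-violating pair only partially, and a large $\lambda$ reproduces the isotonic fit.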

Our goal in this paper is to show that the nearly-isotonic regression can adapt to piecewise monotone vectors. As suggested in Tibshirani et al. (2011), the nearly-isotonic regression can fit a “nearly monotone” vector that is close to $K^{\uparrow}_{n}$ in the $\ell_{2}$ sense. That is, the estimator performs well if $\theta^{*}$ has a small $\ell_{2}$-misspecification error $\mathrm{dist}(\theta^{*},K_{n}^{\uparrow})$, defined as

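$$\mathrm{dist}(\theta^{*},K_{n}^{\uparrow}):=\inf_{\theta\in K_{n}^{\uparrow}}\lVert\theta^{*}-\theta\rVert_{2}.$$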

Moreover, we can observe that the nearly-isotonic regression can fit piecewise monotone vectors, even if $\theta^{*}$ is far from monotone in the $\ell_{2}$ sense. Figure 2 shows an example of the nearly-isotonic regression with $n=100$. The true parameter $\theta^{*}$ (orange line) is 2-piecewise monotone. As the tuning parameter $\lambda\geq 0$ varies, the nearly-isotonic regression behaves as follows. If $\lambda=0$, the nearly-isotonic regression is just the identity estimator $\hat{\theta}_{\mathrm{id}}=y$, which clearly overfits the noisy observation. If $\lambda$ is set to a sufficiently large value, $\hat{\theta}_{\lambda}$ coincides with the isotonic regression. In this example, however, the $\ell_{2}$-misspecification error $\mathrm{dist}^{2}(\theta^{*},K_{n}^{\uparrow})$ is large compared with the normalized noise variance $\sigma^{2}/n$. In such a case, the mean squared error (MSE) $\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}-\theta^{*}\rVert_{2}^{2}$ of the isotonic regression can be much worse than that of the identity estimator, which equals $\sigma^{2}$ (see Section 3.2). Indeed, we can choose a 2-piecewise monotone vector $\theta^{*}\in K^{\uparrow}_{n/2}\times K^{\uparrow}_{n/2}$ with an arbitrarily large $\ell_{2}$-misspecification error. If we choose an intermediate value of $\lambda$, the nearly-isotonic regression appears to fit the true parameter. This suggests an adaptation property with respect to piecewise monotone vectors.

1.1 Summary of theoretical results

In this paper, we investigate the adaptation property of the nearly-isotonic regression estimators defined in (3).

In the monotone regression setting (i.e., $m=1$ ), it is known that the isotonic regression estimator $\hat{\theta}_{K_{n}^{\uparrow}}$ achieves the risk bound

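$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{K_{n}^{\uparrow}}-\theta^{*}\rVert_{2}^{2}\leq C\left(\frac{\sigma^{2}\mathcal{V}(\theta^{*})}{n}\right)^{2/3}+\frac{C\sigma^{2}\log\mathrm{e}n}{n},$$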

where $\mathcal{V}(\theta)=\theta_{n}-\theta_{1}$ is the total variation of the monotone vector $\theta$ . It is also known that the rate $\mathord{\mathrm{O}}((\sigma^{2}\mathcal{V}/n)^{2/3})$ is minimax optimal under the assumption that $\theta^{*}$ is monotone and $\mathcal{V}(\theta^{*})\leq\mathcal{V}$ (Zhang 2002). Hence, a natural question is whether a similar rate can be achieved in piecewise monotone regression.

In Section 3.1, we provide the minimax lower bound over the class of piecewise monotone vectors. Let $\Theta_{n}(m,\mathcal{V})$ be the set of $m$ -piecewise monotone vectors whose “upper” total variations are bounded by $\mathcal{V}$ (a precise definition is provided in Section 3.1). Then, the minimax risk over $\Theta_{n}(m,\mathcal{V})$ is bounded from below by a constant multiple of

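$$\max\left\{\left(\frac{\sigma^{2}\mathcal{V}}{n}\right)^{2/3},\ \frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m}\right\}.$$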

In Section 5, we construct a concrete (but not computationally efficient) estimator that adaptively achieves this rate, and hence this lower bound is tight in terms of the order in $n$, $m$, and $\mathcal{V}$. Intuitively, this suggests that the cost of not knowing the true partition is of order $\mathrm{O}(\frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m})$.
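To get a feeling for which of the two terms in the lower bound dominates, the rate can be evaluated numerically. The sketch below drops the unspecified constant, so only the relative sizes of the two terms are meaningful; the function name `minimax_rate` and the parameter values are illustrative assumptions.

```python
import math


def minimax_rate(n, m, V, sigma2):
    """max{(sigma^2 V / n)^(2/3), (sigma^2 m / n) log(e n / m)}, constant omitted."""
    monotone_term = (sigma2 * V / n) ** (2.0 / 3.0)
    partition_term = sigma2 * m / n * math.log(math.e * n / m)
    return max(monotone_term, partition_term)


# Small total variation: the cost of the unknown partition dominates.
print(minimax_rate(n=1000, m=10, V=1.0, sigma2=1.0))
# Large total variation: the monotone-regression term dominates.
print(minimax_rate(n=1000, m=3, V=1000.0, sigma2=1.0))
```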

In Section 4, we provide the following risk bound for the nearly-isotonic regression estimator (3). A precise statement is given in Corollary 4.12.

Claim 1.2.

Let $\theta^{*}$ be a piecewise monotone vector on a partition $\Pi=(A_{1},A_{2},\ldots,A_{m})$ . Suppose that the following assumptions hold:

(a) The partition is equi-spaced: $|A_{1}|=|A_{2}|=\cdots=|A_{m}|\ (=\frac{n}{m})$.

(b) For each segment $A_{j}$, $\theta_{A_{j}}^{*}$ is monotone and the total variation is bounded as $\mathcal{V}(\theta_{A_{j}}^{*})\leq\mathcal{V}/m$.

(c) $\theta^{*}_{A_{j}}$ satisfies an appropriate “growth condition” for each $j=1,\ldots,m$.

Then, the estimator (3) with optimally tuned parameter $\lambda$ satisfies the following risk bound:

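$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\lambda}-\theta^{*}\rVert_{2}^{2}\leq C\left\{\left(\frac{\sigma^{2}\mathcal{V}\log\mathrm{e}n}{n}\right)^{2/3}+\frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m}\right\}.\tag{4}$$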

The above claim is obtained as a corollary of a more general risk bound in Section 4. The assumptions in the above statement are somewhat restrictive. Here, (a) and (b) are introduced just for notational simplicity, whereas (c) is an essential assumption. Under (a) and (b) alone, the rate in (4) is minimax optimal up to a multiplicative logarithmic factor. However, we require the extra growth condition (c), which seems to be unavoidable for the estimator (3). We will provide a precise definition of the growth condition in Section 4.3.

1.2 Organization

The rest of this paper is organized as follows. In Section 2, we give a brief literature review on shape restricted regression and regularization-based estimators and relate our theoretical results to previous work. We provide lower bounds on the risks in the piecewise monotone regression problem in Section 3. In Section 4, we describe our main results on the risk upper bounds for the nearly-isotonic regression estimator and its constrained-form variant. In particular, a precise statement of Claim 1.2 above is provided in Section 4.3. In Section 5, we discuss the attainability of the minimax lower bound; there, we provide a concrete example of a model selection-based estimator that achieves the optimal rate. Furthermore, we present some numerical examples in Section 6. Finally, we present our conclusion in Section 7. The appendices contain additional numerical examples on two-dimensional signals, explanations of algorithms, and all proofs of the theoretical results.

1.3 Notation

Throughout this paper, we assume that $y=\theta^{*}+\xi$ is distributed according to an isotropic normal distribution $N(\theta^{*},\sigma^{2}I_{n})$ , where $\theta^{*}\in\mathbb{R}^{n}$ is the true mean parameter of interest and $\xi\sim N(0,\sigma^{2}I_{n})$ is the noise vector. The symbol $\mathbb{E}_{\theta^{*}}$ denotes the expectation with respect to $y$ .

We sometimes denote by $C$ an absolute positive constant whose value may vary.

For any $\theta\in\mathbb{R}^{n}$ , we define the total variation $\mathcal{V}(\theta)$ and the lower total variation $\mathcal{V}_{-}(\theta)$ by

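$$\mathcal{V}(\theta):=\sum_{i=1}^{n-1}\lvert\theta_{i}-\theta_{i+1}\rvert\quad\text{and}\quad\mathcal{V}_{-}(\theta):=\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+},$$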

where $(z)_{+}:=\max\{z,0\}$ for any $z\in\mathbb{R}$ . For example, if $\theta$ is monotone nondecreasing, then $\mathcal{V}(\theta)=\theta_{n}-\theta_{1}$ and $\mathcal{V}_{-}(\theta)=0$ . In this paper, the meaning of subscripts of $\theta$ depends on the context (e.g., $\theta_{i}$ , $\theta_{A}$ , $\hat{\theta}_{\lambda}$ , and $\hat{\theta}_{K_{n}^{\uparrow}}$ ). If $A=\{\tau,\tau+1,\ldots,\tau+J-1\}$ is a connected subset of $[n]$ , we denote by $\theta_{A}$ a sub-vector $(\theta_{\tau},\theta_{\tau+1},\ldots,\theta_{\tau+J-1})^{\top}\in\mathbb{R}^{J}$ . We also denote by $\mathcal{V}^{A}(\theta_{A})$ the total variation of $\theta_{A}$ .
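The two variation functionals are straightforward to compute. The short sketch below (with illustrative function names) checks the identities stated above for a nondecreasing vector and shows that only descents contribute to $\mathcal{V}_{-}$.

```python
def total_variation(theta):
    """V(theta): sum of absolute successive differences."""
    return sum(abs(a - b) for a, b in zip(theta, theta[1:]))


def lower_total_variation(theta):
    """V_-(theta): sum of the positive parts of theta_i - theta_{i+1}."""
    return sum(max(a - b, 0) for a, b in zip(theta, theta[1:]))


monotone = [0, 1, 3]
sawtooth = [0, 2, 1, 3]
print(total_variation(monotone), lower_total_variation(monotone))
print(total_variation(sawtooth), lower_total_variation(sawtooth))
```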

2 Related work

There are two classes of estimators that are closely related to the nearly-isotonic regression (3): the isotonic regression and the fused lasso.

As we mentioned above, the isotonic regression is an instance of shape restricted regression. Many existing estimators in shape restricted regression can be formulated as least squares estimators (denoted by $\hat{\theta}_{K}$) onto closed convex sets (denoted by $K$). Examples include, but are not limited to, the isotonic regression, the isotonic regression on two-dimensional grids or more general partial orders (see e.g., Robertson and Wright (1975) and Kyng et al. (2015)), and convex regression (Hildreth 1954).

Recently, researchers have developed two important techniques for analyzing the risk behavior of least squares estimators. First, Chatterjee (2014) proved that the Euclidean norm $\lVert\hat{\theta}_{K}-\theta^{*}\rVert_{2}$ is tightly concentrated around a certain quantity defined by the localized Gaussian width. As applications of Chatterjee’s method, non-asymptotic upper bounds that have similar rates to the minimax risks have been proved for the isotonic regression (Chatterjee 2014, Bellec 2018), the multi-isotonic regression in two or higher dimensions (Chatterjee et al. 2018, Han et al. 2017), the multi-dimensional convex regression (Han and Wellner 2016), and the constrained-form trend filtering estimator (Guntuboyina et al. 2017). See also Section 2.2 in Bellec (2018) for a related result. Second, risk bounds based on the statistical dimension of the tangent cone of $K$ have been developed by Oymak and Hassibi (2016) and Bellec (2018). This technique is useful because it takes into account the facial structure of $K$, which leads to risk bounds that are adaptive to low-dimensional sub-structures. It has been shown that some least squares estimators are adaptive to piecewise constant vectors: for example, the isotonic regression (Bellec 2018) and the multi-isotonic regression (Chatterjee et al. 2018, Han et al. 2017). In particular, for the one-dimensional isotonic regression, Chatterjee et al. (2015) and Bellec (2018) proved the following oracle inequality:

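$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{K_{n}^{\uparrow}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in K_{n}^{\uparrow}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+\frac{\sigma^{2}k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}\right\},\tag{5}$$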

where $k(\theta)$ is the number of constant pieces of $\theta$ . If $\theta^{*}$ is monotone and $k(\theta^{*})$ is small, the right-hand side can be much smaller than the worst-case rate of $\mathord{\mathrm{O}}((\sigma^{2}\mathcal{V}/n)^{2/3})$ . However, the first term in the right-hand side can become arbitrarily large if $\theta^{*}$ is not included in $K_{n}^{\uparrow}$ .

The fused lasso (Tibshirani et al. 2005), also known as the total variation regularization (Rudin et al. 1992), is a penalized estimator defined as

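$$\hat{\theta}_{\mathrm{fused},\lambda}=\operatorname*{argmin}_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{2}\lVert y-\theta\rVert_{2}^{2}+\lambda\sum_{i=1}^{n-1}\lvert\theta_{i}-\theta_{i+1}\rvert\right\},\tag{6}$$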

where $\lambda\geq 0$ is the tuning parameter. The fused lasso poses the penalty whenever $\theta_{i}\neq\theta_{i+1}$ , whereas the penalty of the nearly-isotonic regression (3) activates only if $\theta_{i}>\theta_{i+1}$ . Theoretical risk bounds for the fused lasso have been studied by Mammen and van de Geer (1997), Dalalyan et al. (2017), Lin et al. (2017), and Guntuboyina et al. (2017). In particular, Guntuboyina et al. (2017) showed an oracle inequality of the following form:

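$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathrm{fused},\lambda^{*}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\frac{\sigma^{2}k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\Delta_{\mathrm{fused}}(\theta)\right\},\tag{7}$$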

where $\lambda^{*}$ is an optimally tuned parameter. One can control the quantity $\Delta_{\mathrm{fused}}(\theta)$ by assuming a mild regularity condition on $\theta^{*}$ so that the inequality (7) recovers the minimax rate for the piecewise constant vectors (see e.g., Gao et al. (2017)). However, even if $\theta^{*}$ is a monotone vector, (7) does not recover the rate of the isotonic regression (5) because $\Delta_{\mathrm{fused}}(\theta)$ becomes zero if and only if $\theta$ is just a constant vector.

Our risk bound for the nearly-isotonic regression in Section 4.2 fills the gap between the above risk bounds for the isotonic regression and the fused lasso. We will show an oracle inequality of the following form:

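$$\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}_{\mathrm{neariso},\lambda^{*}}-\theta^{*}\rVert_{2}^{2}\leq\inf_{\theta\in\mathbb{R}^{n}}\left\{\frac{1}{n}\lVert\theta-\theta^{*}\rVert_{2}^{2}+C\frac{\sigma^{2}k(\theta)}{n}\log\frac{\mathrm{e}n}{k(\theta)}+C\Delta_{\mathrm{neariso}}(\theta)\right\}.$$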

As in the case of the fused lasso (7), this inequality provides a meaningful risk bound even if we cannot approximate $\theta^{*}$ by a monotone vector. Furthermore, $\Delta_{\mathrm{neariso}}(\theta)$ becomes zero for any monotone vector $\theta\in K_{n}^{\uparrow}$. Hence, our result can exactly recover the rate achieved by the isotonic regression (5).

3 Lower bounds

In this section, we provide lower bounds for the risk in one-dimensional piecewise monotone regression.

3.1 Minimax lower bound

We are interested in the lower bound for the minimax risk defined as

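$$\inf_{\hat{\theta}}\sup_{\theta^{*}\in\Theta}\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}-\theta^{*}\rVert_{2}^{2},$$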

where $\Theta\subset\mathbb{R}^{n}$ is a set of piecewise monotone vectors, and the infimum is taken over all (measurable) estimators $\hat{\theta}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ . In particular, for $1\leq m\leq n$ , we consider the class of $m$ -piecewise monotone vectors with a bounded total variation that is defined as follows.

Definition 3.1.

Let $n\geq 2$ and $1\leq m\leq n$ . For any $\mathcal{V}>0$ , let $\tilde{\Theta}_{n}(m,\mathcal{V})$ denote the set of (at most) $m$ -piecewise monotone vectors such that the upper total variation is bounded by $\mathcal{V}$ . In other words, a vector $\theta\in\mathbb{R}^{n}$ is an element of $\tilde{\Theta}_{n}(m,\mathcal{V})$ if and only if the following conditions hold:

(i) $\theta$ is piecewise monotone on a connected partition $\Pi=\{A_{1},\ldots,A_{m^{*}}\}$ of $[n]$ whose cardinality $|\Pi|=m^{*}$ is not larger than $m$.

(ii) There exist numbers $\mathcal{V}_{1},\mathcal{V}_{2},\ldots,\mathcal{V}_{m^{*}}$ such that $\sum_{i=1}^{m^{*}}\mathcal{V}_{i}=\mathcal{V}$, $\mathcal{V}_{i}\geq 0$, and $\mathcal{V}(\theta_{A_{i}})\leq\mathcal{V}_{i}$ for all $i=1,\ldots,m^{*}$.

In addition, we also define $\Theta_{n}(m,\mathcal{V})$ as the set of $m$ -piecewise monotone vectors such that the total variations for all pieces are uniformly bounded by $\mathcal{V}/m$ . That is, $\Theta_{n}(m,\mathcal{V})$ is obtained by replacing (ii) by the following condition:

(ii)' $\mathcal{V}(\theta_{A_{i}})\leq\mathcal{V}/m$ for all $i=1,\ldots,m^{*}$.

First, suppose that $\theta^{*}$ is piecewise monotone on a known partition $\Pi^{*}=\{A_{1},A_{2},\ldots,A_{m^{*}}\}$ and that the total variation of the sub-vector $\theta^{*}_{i}:=\theta^{*}_{A_{i}}$ is bounded as $\mathcal{V}(\theta_{i}^{*})\leq\mathcal{V}_{i}$ for each $i=1,2,\ldots,m^{*}$. Then, the problem decomposes into $m^{*}$ independent subproblems of estimating the monotone vectors $\theta_{i}^{*}$. The minimax risk lower bound for monotone vectors has been proved by Zhang (2002) and Chatterjee et al. (2015). For notational simplicity, we assume here that $n_{i}=|A_{i}|\geq 2$ for all $i=1,2,\ldots,m$. The minimax risk of each subproblem can be bounded as

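$$\inf_{\hat{\theta}_{i}}\ \sup_{\theta_{A_{i}}^{*}\in K_{A_{i}}^{\uparrow}:\,\mathcal{V}(\theta_{i}^{*})\leq\mathcal{V}_{i}}\frac{1}{n_{i}}\mathbb{E}_{\theta_{A_{i}}^{*}}\lVert\hat{\theta}_{i}-\theta_{i}^{*}\rVert_{2}^{2}\geq C_{1}\left(\frac{\sigma^{2}\mathcal{V}_{i}}{n_{i}}\right)^{2/3}\quad\text{for all } i=1,\ldots,m.$$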

Hence, the minimax risk over $\tilde{\Theta}_{n}(m,\mathcal{V})$ is clearly bounded from below by

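$$C_{1}\sum_{i=1}^{m^{*}}\frac{n_{i}}{n}\left(\frac{\sigma^{2}\mathcal{V}_{i}}{n_{i}}\right)^{2/3}.\tag{9}$$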

If the partition $\Pi^{*}$ is known, then this convergence rate can be obtained by concatenating the least squares estimators on all pieces. By Jensen’s inequality, the quantity (9) is not larger than $(\sigma^{2}\sum_{i}\mathcal{V}_{i}/n)^{2/3}$ .
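The Jensen step can be spelled out as a short worked derivation. Write $p_{i}:=n_{i}/n$ and $x_{i}:=\sigma^{2}\mathcal{V}_{i}/n_{i}$ (shorthand introduced here only for this calculation). Since $\sum_{i}p_{i}=1$ and $x\mapsto x^{2/3}$ is concave on $[0,\infty)$,

$$\sum_{i=1}^{m^{*}}\frac{n_{i}}{n}\left(\frac{\sigma^{2}\mathcal{V}_{i}}{n_{i}}\right)^{2/3}=\sum_{i=1}^{m^{*}}p_{i}x_{i}^{2/3}\leq\left(\sum_{i=1}^{m^{*}}p_{i}x_{i}\right)^{2/3}=\left(\frac{\sigma^{2}\sum_{i=1}^{m^{*}}\mathcal{V}_{i}}{n}\right)^{2/3}.$$

Equality holds when all the ratios $\mathcal{V}_{i}/n_{i}$ coincide, for instance for equi-spaced partitions with equal total variation on every piece.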

In the general setting, we have to deal with unknown partitions. The following proposition gives the lower bound over the class of piecewise monotone vectors in Definition 3.1.

Proposition 3.2.

Let $n\geq 3$ , $3\leq m\leq n$ , and $\mathcal{V}>0$ . Suppose that $\Theta$ is either $\tilde{\Theta}_{n}(m,\mathcal{V})$ or $\Theta_{n}(m,\mathcal{V})$ in Definition 3.1. Then, for any estimator $\hat{\theta}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ , we have the following lower bound:

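$$\sup_{\theta^{*}\in\Theta}\frac{1}{n}\mathbb{E}_{\theta^{*}}\lVert\hat{\theta}-\theta^{*}\rVert_{2}^{2}\geq C\max\left\{\left(\frac{\sigma^{2}\mathcal{V}}{n}\right)^{2/3},\ \frac{\sigma^{2}m}{n}\log\frac{\mathrm{e}n}{m}\right\},\tag{10}$$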

where $C>0$ is a universal constant.

It remains to verify that the lower bound (10) is tight. Thus, in Section 5, we will construct an estimator that adaptively achieves a similar rate.

3.2 Lower bound of isotonic regression with misspecified partitions

Suppose that $\theta^{*}$ is an $m$ -piecewise monotone vector. As we mentioned in the previous subsection, if we know the true partition on which $\theta^{*}$ is monotone, the least squares estimator can achieve the rate shown in (9). Here, we consider what happens if we underestimate the true number of the pieces.

We consider the risk behavior of the isotonic regression $\hat{\theta}_{K_{n}^{\uparrow}}$, which corresponds to the least squares estimator when the number of pieces is underestimated as $m=1$. If the true number of pieces is two or more, $\theta^{*}$ may not be contained in $K_{n}^{\uparrow}$. Recall that $\mathrm{dist}(\theta^{*},K_{n}^{\uparrow})$ is the $\ell_{2}$-misspecification error against the set of monotone vectors. Bellec (2018) showed that the isotonic regression is robust against a small $\ell_{2}$-misspecification; that is, if $\mathrm{dist}(\theta^{*},K_{n}^{\uparrow})\leq\epsilon$, then

[TABLE]

where $\bar{\theta}$ is the orthogonal projection of $\theta^{*}$ onto $K_{n}^{\uparrow}$ and $k(\bar{\theta})$ is the number of constant pieces of $\bar{\theta}$. Conversely, if the $\ell_{2}$-misspecification error is large, we see that the isotonic regression can have an arbitrarily large risk.

Proposition 3.3.

There is a positive number $t=t_{n,\sigma^{2}}$ that depends on $n$ and $\sigma^{2}$ such that if the true parameter $\theta^{*}$ satisfies $\mathrm{dist}(\theta^{*},K_{n}^{\uparrow})>t$ , then the MSE of the isotonic regression is bounded from below as

[TABLE]

In this case, the isotonic regression has a strictly larger MSE than that of the identity estimator $\hat{\theta}_{\mathrm{id}}=y$ .

We can easily check that there is a 2-piecewise monotone vector with an arbitrarily large $\ell_{2}$ -misspecification error. To see this, let $\theta^{*}\in\mathbb{R}^{2n}$ be a piecewise constant vector defined as $\theta^{*}_{i}=M>0$ for $i=1,\ldots,n$ and $\theta^{*}_{i}=0$ for $i=n+1,\ldots,2n$ . Then, it is easy to see that $\mathrm{dist}(\theta^{*},K^{\uparrow}_{2n})=\sqrt{nM^{2}/2}$ diverges as $M\to\infty$ . Figure 2 shows an example of a 2-piecewise monotone vector $\theta^{*}$ such that the isotonic regression has a larger squared loss value than the identity estimator.
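The distance stated above can be checked by hand. For any nondecreasing $\theta\in K^{\uparrow}_{2n}$ we have $\theta_{i}\leq\theta_{n+i}$ for $i=1,\ldots,n$, so pairing the coordinates $i$ and $n+i$ gives

$$\lVert\theta^{*}-\theta\rVert_{2}^{2}=\sum_{i=1}^{n}\left\{(M-\theta_{i})^{2}+\theta_{n+i}^{2}\right\}\geq\sum_{i=1}^{n}\min_{a\leq b}\left\{(M-a)^{2}+b^{2}\right\}=\frac{nM^{2}}{2}.$$

The inner minimum is attained at $a=b=M/2$ because the unconstrained minimizer $(a,b)=(M,0)$ violates $a\leq b$ when $M>0$. Equality holds for the constant vector $\theta\equiv M/2$, which is monotone, and hence $\mathrm{dist}(\theta^{*},K^{\uparrow}_{2n})=M\sqrt{n/2}$.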

4 Risk bounds for nearly-isotonic regression

In this section, we develop the risk bound for the nearly-isotonic regression estimator (3). Proofs of all the theorems and propositions in this section are presented in Appendix D.

4.1 Risk bounds for constrained estimators

Before considering the original version of the nearly-isotonic regression (3), we consider the performance of the constrained form nearly-isotonic regression $\hat{\theta}_{\mathcal{V}}$ defined by the following constrained optimization problem:

[TABLE]

where $\mathcal{V}\geq 0$ is the tuning parameter. By the fundamental duality theorem in convex optimization, there exists a Lagrange multiplier $\lambda_{\mathcal{V}}\geq 0$ such that the regularization type formulation (3) admits the same solution $\hat{\theta}_{\lambda_{\mathcal{V}}}=\hat{\theta}_{\mathcal{V}}$ . Hence, the solution path of penalized estimators $\{\hat{\theta}_{\lambda}:\lambda\geq 0\}$ and that of constrained estimators $\{\hat{\theta}_{\mathcal{V}}:\mathcal{V}\geq 0\}$ are equivalent. However, the properties of estimators with fixed values of $\lambda\geq 0$ and $\mathcal{V}\geq 0$ can be different in the following sense:

•

From a computational perspective, calculating the constrained estimator (11) for a given $\mathcal{V}\geq 0$ is more difficult than calculating the regularization estimator (3). For the regularization estimator (3), we can use the Modified Pool Adjacent Violators Algorithm (Modified PAVA) proposed by Tibshirani et al. (2011), which outputs the solution path for every $\lambda\geq 0$. In particular, given $\lambda\geq 0$, we can always obtain an exact solution $\hat{\theta}_{\lambda}$. However, to the best of our knowledge, there are no practical algorithms that obtain an exact solution for the constrained problem (11) and run as fast as the algorithms for the penalized problem (3). We present detailed explanations of the algorithms in Appendix A.

•

From a statistical perspective, the correspondence between tuning parameters $\lambda$ and $\mathcal{V}$ is not deterministic (i.e., it depends on the realization of the data $y$ ). For this reason, a risk bound that is obtained for one of (3) or (11) cannot be directly applied to the other.

We show the main results on the adaptation property to piecewise monotone vectors in terms of sharp oracle inequality.

Before proceeding, we introduce some notation. Suppose that $\theta\in\mathbb{R}^{n}$ is piecewise constant on a connected partition $\Pi_{\mathrm{const}}=\{A_{1},\ldots,A_{k}\}$ of $[n]$. We denote by $k(\theta):=|\Pi_{\mathrm{const}}|$ the number of pieces on which $\theta$ is constant. That is, there are integers $1=\tau_{1}<\cdots<\tau_{k+1}=n+1$ such that (i) $A_{i}=\{\tau_{i},\tau_{i}+1,\ldots,\tau_{i+1}-1\}$ for $i=1,\ldots,k$ and (ii) for any $i\in[k]$, there exists $t_{i}\in\mathbb{R}$ such that $\theta_{j}=t_{i}$ for all $j\in A_{i}$. We define the sign $w_{i}\in\{0,1\}$ associated with each knot $\tau_{i}$ ($i=1,\ldots,k+1$) as

[TABLE]

In other words, $w_{i}=1$ if and only if the order violation $\theta_{j-1}>\theta_{j}$ occurs at $j=\tau_{i}$ . See Figure 3 for the graphical illustration. Then, we define $M(\theta)$ as

[TABLE]

$M(\theta)$ quantifies the non-monotonicity of a piecewise constant vector $\theta$. If $\theta$ is $m$-piecewise monotone, then it is clear that $M(\theta)\leq 2(m-1)$. In particular, for any monotone vector $\theta$, we have $M(\theta)=0$. With this notation, we have the following sharp oracle inequality.

Theorem 4.1.

For any $\theta^{*}\in\mathbb{R}^{n}$ , the constrained nearly-isotonic regression (11) satisfies the following oracle inequality:

[TABLE]

Moreover, for any $\eta\in(0,1)$ , we have

[TABLE]

with probability at least $1-\eta$ .

The following risk bound for the best choice of the tuning parameter $\mathcal{V}\geq 0$ is an immediate consequence of Theorem 4.1.

Corollary 4.2.

Suppose $\theta^{*}\in\mathbb{R}^{n}$ . Choose $\mathcal{V}^{*}\geq 0$ that minimizes the upper bound in (4.1) (thus, $\mathcal{V}^{*}$ depends on the true parameter $\theta^{*}$ ). Then, we have

[TABLE]

Also, choosing $\mathcal{V}:=\mathcal{V}^{*}$ or $\mathcal{V}:=\mathcal{V}_{-}(\theta^{*})$ , we have

[TABLE]

Remark 4.3.

We briefly comment on the proof of Theorem 4.1 and Corollary 4.2. A key ingredient is to obtain a bound on the statistical dimension (Amelunxen et al. 2014) of the tangent cone of the constraint set $\{\theta\in\mathbb{R}^{n}:\mathcal{V}_{-}(\theta)\leq\mathcal{V}\}$ . This methodology was first developed for the isotonic regression and the convex regression by Bellec (2018). In particular, our approach is inspired by the analysis of the constrained trend filtering estimators by Guntuboyina et al. (2017). See Appendix D for detailed proofs.

By restricting the region over which the infimum in (4.2) is taken, we have the oracle inequality for monotone vectors

[TABLE]

which recovers the existing results on the isotonic regression (Chatterjee et al. 2015, Bellec 2018) up to a constant multiplicative factor.

To understand the general upper bound in (4.2), we have to control the quantity $M(\theta)$ defined in (13). To this end, we consider the minimal length condition; we say that $\theta\in\mathbb{R}^{n}$ satisfies the minimal length condition for a constant $c>0$ if it satisfies

[TABLE]

where the partition $\Pi_{\mathrm{const}}=\{A_{1},A_{2},\ldots,A_{k}\}$ and the signs $w_{i}$ ( $i=1,\ldots,k+1$ ) are defined as in (13). Intuitively, a signal $\theta\in\mathbb{R}^{n}$ is well approximated by another signal that satisfies the minimal length condition if $\theta$ has “moderate slopes” around the order-violating jumps. For further discussion on such growth conditions, see Section 4.3.

Based on the minimal length condition, we have the following result from Theorem 4.1.

Corollary 4.4.

Suppose that $\theta^{*}\in\mathbb{R}^{n}$ satisfies the minimal length condition (18) for a constant $c>0$ . Assume that $\theta^{*}$ is $k(\theta^{*})$ -piecewise constant and $m(\theta^{*})$ -piecewise monotone. Then, the constrained nearly-isotonic regression (11) satisfies

[TABLE]

In particular, if the tuning parameter $\mathcal{V}$ is chosen so that

[TABLE]

for a positive constant $C^{\prime}$ , we have

[TABLE]

where $C^{\prime\prime}$ is a positive constant.

Remark 4.5.

If $\theta$ is $k$-piecewise constant and $m$-piecewise monotone, it is always true that $k\geq m$, because every order-violating changepoint is also a knot of the constant partition. Hence, the inequality (4.4) can be simplified as

[TABLE]

where $C(c)>0$ is a constant that depends on $c$ alone.

Remark 4.6.

We comment on the minimal length condition and its relation to the estimation of piecewise constant vectors. We conjecture that the minimal length condition (18) is essentially unavoidable for the risk bound of the nearly-isotonic regression, owing to the following analogy with the fused lasso. The minimal length condition for the fused lasso is considered by Guntuboyina et al. (2017). For the fused lasso, Fan and Guan (2017) showed that the minimal length condition cannot be removed in the sense that there is a lower bound depending on the minimum length $\Delta=\min_{i}|A_{i}|$ (see also the experimental result by Guntuboyina et al. (2017), Remark 2.5).

4.2 Risk bounds for penalized estimators

In this section, we consider the risk bounds for the nearly-isotonic regression (3) in the original penalized form by Tibshirani et al. (2011).

Theorem 4.7.

For any $\lambda\geq 0$ , let $\hat{\theta}_{\lambda}$ denote the nearly-isotonic regression estimator defined in (3). Let $\theta^{*}$ and $\theta$ be any vectors in $\mathbb{R}^{n}$ . Then, there exists a tuning parameter $\lambda^{*}=\lambda^{*}(\theta)\geq 0$ that depends only on $\theta$ such that, for any $\lambda\geq\lambda^{*}$ , we have the following risk bound:

[TABLE]

where $M(\theta)$ and $k(\theta)$ are defined similarly as in Theorem 4.1. Furthermore, for any $\eta\in(0,1)$ , the inequality

[TABLE]

holds with probability at least $1-\eta$.

We comment on some direct consequences of Theorem 4.7. In this theorem, $\lambda^{*}(\theta)$ is defined as a function of $\theta$ . To understand the risk bound (4.7), we consider the choice of the tuning parameter $\lambda\geq 0$ that depends on the true parameter $\theta^{*}$ . Let $\bar{\theta}$ be a vector that minimizes the quantity

[TABLE]

among all $\theta\in\mathbb{R}^{n}$ . Then, taking $\lambda^{**}:=\lambda^{*}(\bar{\theta})$ , we have the following oracle inequality which has the same form as (4.2):

[TABLE]

Moreover, if $\lambda:=\lambda^{**}$ or $\lambda:=\lambda^{*}(\theta^{*})$ , we have

[TABLE]

Again, if we assume the minimal length condition (18) on $\theta^{*}$ , we obtain a simplified bound of the form (17).

We move on to discuss a precise expression of $\lambda^{*}(\theta)$ in Theorem 4.7. The next proposition provides an upper bound for $\lambda^{*}(\theta)$ .

Proposition 4.8.

Suppose $\theta\in\mathbb{R}^{n}$ . Let $\Pi_{\mathrm{const}}(\theta):=\{A_{1},A_{2},\ldots,A_{k}\}$ be the constant partition of $\theta$ , and $w_{1},w_{2},\ldots,w_{k+1}$ be the associated signs defined in (4.1). Then, there is a universal constant $C>0$ such that $\lambda^{*}(\theta)$ in Theorem 4.7 is bounded from above by

[TABLE]

The purpose of the choice of $\lambda^{*}$ in Proposition 4.8 is to derive the theoretical convergence rate in terms of $k(\theta)$ and $M(\theta)$ . However, different choices are possible if we are interested in other theoretical aspects (e.g., estimation consistency for changepoints). For the fused lasso estimator (6), several authors have studied theoretical choices of tuning parameters that result in risk upper bounds (Dalalyan et al. 2017, Lin et al. 2017, Guntuboyina et al. 2017).

Remark 4.9 (Example of parameter choice).

Here, we provide an example choice of the tuning parameter $\lambda$ under a simple length condition. Let us assume that (i) $\theta^{*}$ is not globally monotone (i.e., $M(\theta^{*})>0$) and (ii) $|A_{i}|$ is of order $n/k$, that is,

[TABLE]

holds for some $0<c_{1}<c_{2}$ . Then, we can see that $\lambda^{*}(\theta^{*})$ is bounded from above by

[TABLE]

where $C^{\prime}$ is a constant that depends on $C,c_{1},c_{2}$ . For the fused lasso, the theoretical choice $\lambda=O(\sigma\sqrt{n\log\mathrm{e}n})$ has been suggested by Dalalyan et al. (2017) and Guntuboyina et al. (2017). For a detailed discussion, see Remark 2.7 by Guntuboyina et al. (2017) and references therein.

Remark 4.10.

In general, the choice of the tuning parameter that minimizes the risk can be different from the theoretical suggestion. More importantly, we cannot obtain the value of $\lambda$ suggested in Proposition 4.8 because it depends on the unknown true parameter $\theta^{*}$ and the noise standard deviation $\sigma$ . In practice, there are two typical data-dependent choices of $\lambda$ :

•

Stein’s unbiased risk estimate: If we know $\sigma$ or have an estimate $\hat{\sigma}$ of it, we can reasonably choose a parameter $\lambda$ by minimizing Stein’s unbiased risk estimate (SURE)

[TABLE]

Here, $\hat{\mathrm{df}}(\hat{\theta}_{\lambda}):=k(\hat{\theta}_{\lambda})$ is an unbiased estimate of the degrees of freedom. See Tibshirani et al. (2011) for the derivation.

•

Cross-validation: We can also apply the cross-validation when the model (2) is interpreted as a discrete observation of a continuous signal. Specifically, suppose that the data is generated according to the following nonparametric regression model:

[TABLE]

where $x_{1}<x_{2}<\ldots<x_{n}$ are given design points in $[0,1]$ and $f^{*}:[0,1]\to\mathbb{R}$ is an unknown piecewise monotone function. We define the nearly-isotonic regression estimator $\hat{f}_{\lambda}$ over the interval $[0,1]$ as follows: First, we determine the values $\hat{\theta}_{\lambda,i}$ ( $i=1,2,\ldots,n$ ) by solving

[TABLE]

Then, we define $\hat{f}_{\lambda}:[0,1]\to\mathbb{R}$ by interpolation. For instance, one can output a piecewise constant function so that $\hat{f}_{\lambda}(x_{i})=\hat{\theta}_{\lambda,i}$ . In this sense, given a new design point $x^{\mathrm{new}}$ , we can predict the value of $f^{*}(x^{\mathrm{new}})$ by $\hat{f}_{\lambda}(x^{\mathrm{new}})$ . Hence, we can naturally apply the cross-validation in this situation.
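The SURE-based choice in the first item can be sketched as follows. This is a minimal illustration, assuming the standard Gaussian form $\lVert y-\hat{\theta}_{\lambda}\rVert_{2}^{2}-n\sigma^{2}+2\sigma^{2}\,\hat{\mathrm{df}}(\hat{\theta}_{\lambda})$ (the display (22) may use a different normalization) and assuming that candidate fits $\hat{\theta}_{\lambda}$ have already been computed by some solver. The function names `num_constant_pieces`, `sure`, and `select_by_sure` are hypothetical.

```python
import numpy as np

def num_constant_pieces(theta_hat, tol=1e-10):
    """k(theta_hat): number of maximal constant pieces (used as the df estimate)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    return 1 + int(np.sum(np.abs(np.diff(theta_hat)) > tol))

def sure(y, theta_hat, sigma):
    """Standard Gaussian SURE (assumed form; normalization may differ from (22))."""
    y = np.asarray(y, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    rss = float(np.sum((y - theta_hat) ** 2))
    return rss - y.size * sigma**2 + 2.0 * sigma**2 * num_constant_pieces(theta_hat)

def select_by_sure(y, fits, sigma):
    """fits: dict mapping each candidate lambda to its fitted vector."""
    return min(fits, key=lambda lam: sure(y, fits[lam], sigma))

# Toy usage with two hand-made candidate fits (not produced by a real solver).
y = np.array([0.0, 1.0, 0.0, 1.0])
fits = {0.0: y.copy(), 10.0: np.full(4, y.mean())}
best_small_noise = select_by_sure(y, fits, sigma=0.1)
best_large_noise = select_by_sure(y, fits, sigma=1.0)
```

With a small noise level the interpolating fit is preferred, whereas with a large noise level the constant fit is preferred, since each additional piece costs $2\sigma^{2}$.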

4.3 Application to piecewise monotone vectors

To gain a deeper understanding of the adaptation property of the nearly-isotonic regression, we study the risk bound under a more specific assumption. We define the following moderate growth condition for piecewise monotone vectors.

Definition 4.11.

Let $n\geq 2$ . We say that a monotone vector $\theta\in K^{\uparrow}_{n}$ satisfies the moderate growth condition if

[TABLE]

and

[TABLE]

Figure 4 gives an illustration of the moderate growth condition. In words, the signal $\theta\in\mathbb{R}^{n}$ satisfying the moderate growth condition is not larger than the linear signal in the left half of the domain, and not less than that in the right half of the domain. Intuitively, the role of the moderate growth condition is to guarantee the minimal length condition (18) for a piecewise constant approximation.

Suppose that the true signal $\theta^{*}$ is piecewise monotone and every segment satisfies the moderate growth condition. Then, the nearly-isotonic regression achieves a nearly minimax convergence rate as follows.

Corollary 4.12.

Suppose that the following assumptions hold:

(a) The partition is equi-spaced: $|A_{1}|=|A_{2}|=\cdots=|A_{m}|\;(=\frac{n}{m})$.

(b) $\theta_{A_{j}}^{*}$ is monotone and $\mathcal{V}(\theta_{A_{j}}^{*})\leq\mathcal{V}/m$ for each $j=1,\ldots,m$.

(c) $\theta^{*}_{A_{j}}$ satisfies the moderate growth condition for each $j=1,2,\ldots,m$.

Then, the estimator (3) with optimally tuned parameter $\lambda$ satisfies the following risk bound:

[TABLE]

The risk bound (25) achieves the minimax rate over $\Theta_{n}(m,\mathcal{V})$ in Proposition 3.2 up to a multiplicative factor of $\log^{2/3}\frac{\mathrm{e}n}{m}$ . We should note that the restrictive assumption (a) in Corollary 4.12 is employed merely for the sake of simplicity of the proof. We may relax this assumption as

[TABLE]

for some $c^{\prime}>0$ .

5 Model selection based estimators

Here, we consider estimators obtained by model selection among all partitions $\Pi$ . The main purpose of this section is to discuss whether the minimax lower bound in Proposition 3.2 can be achieved without any additional assumption such as the moderate growth condition.

Given a connected partition $\Pi=(A_{1},A_{2},\ldots,A_{m})$ of $[n]$ , we write $K_{\Pi}^{\uparrow}$ for the set of piecewise monotone vectors on $\Pi$ , i.e.,

[TABLE]

Let $\hat{\theta}_{\Pi}$ denote the projection estimator onto $K_{\Pi}^{\uparrow}$ . By definition, $\hat{\theta}_{\Pi}$ is obtained by concatenating isotonic regression estimators defined in every segment.
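The concatenation described above can be sketched in a few lines. This is a minimal illustration, assuming the partition is given by 1-based knots $1=\tau_{1}<\cdots<\tau_{m+1}=n+1$ and using the classical pool adjacent violators algorithm on each segment; the function names `pava` and `projection_on_partition` are hypothetical.

```python
import numpy as np

def pava(y):
    """Least squares projection of y onto nondecreasing vectors (classical PAVA)."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in np.asarray(y, dtype=float):
        blocks.append([v, 1])
        # Pool adjacent blocks while their means violate the order.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def projection_on_partition(y, taus):
    """Concatenate isotonic fits on the segments {tau_i, ..., tau_{i+1} - 1}."""
    y = np.asarray(y, dtype=float)
    return np.concatenate([pava(y[a - 1:b - 1]) for a, b in zip(taus[:-1], taus[1:])])

# A 2-piecewise monotone vector: the oracle partition reproduces it exactly,
# whereas the plain isotonic regression (m = 1) pools across the downward jump.
y = np.array([1.0, 2.0, 3.0, 0.0, 1.0, 2.0])
fit_oracle = projection_on_partition(y, [1, 4, 7])
fit_isotonic = pava(y)
```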

If we know the true partition $\Pi^{*}$ on which $\theta^{*}$ is piecewise monotone, then the risk of the projection estimator $\hat{\theta}_{\Pi^{*}}$ is bounded from above by

[TABLE]

If the true partition is unknown, a natural idea is to select a data-dependent partition $\hat{\Pi}$ by a penalized selection rule:

[TABLE]

Here, $\mathrm{pen}(\Pi)$ is a positive penalty for the partition $\Pi$ .

The penalized selection rules have been well studied in statistics. In particular, Birgé and Massart (2001) and Massart (2007) developed non-asymptotic risk bounds for generic model selection settings in Gaussian sequence models. Hereafter, we construct a penalized selection estimator in the spirit of Theorem 4.18 in Massart (2007).

Instead of selecting $\hat{\theta}_{\Pi}$ according to (26), we introduce the total variation sieves. Namely, in addition to selecting partitions, we also select budgets of piecewise total variations as follows. Let $\Pi=(A_{1},A_{2},\ldots,A_{m})$ be a connected partition. For any vector $\mathbf{V}=(\mathcal{V}_{1},\mathcal{V}_{2},\ldots,\mathcal{V}_{m})$ with $\mathcal{V}_{i}\geq 0$ ($i=1,2,\ldots,m$), we define the set of piecewise monotone vectors with bounded total variations as

[TABLE]

Then, we define $\hat{\theta}_{\Pi,\mathbf{V}}$ as the projection estimator onto $K_{\Pi}^{\uparrow}(\mathbf{V})$ . Next, we define a countable set of vectors $\mathbf{V}$ as

[TABLE]

where $v(j):=j^{3/2}$ . Finally, we select a pair $(\hat{\Pi},\hat{\mathbf{V}})$ as the solution of the following minimization problem:

[TABLE]

With a careful choice of the penalty term $\mathrm{pen}(\Pi,\mathbf{V})$ , we have the following result:

Theorem 5.1.

There exists an absolute constant $C_{\mathrm{pen}}>0$ such that the following statement holds. For any pair $(\Pi,\mathbf{V})$ , define the penalty $\mathrm{pen}(\Pi,\mathbf{V})$ so that

[TABLE]

Let $(\hat{\Pi},\hat{\mathbf{V}})$ be the minimizer in (27).

[TABLE]

In particular, if $\theta^{*}$ is piecewise monotone on $\Pi=(A_{1},A_{2},\ldots,A_{m})$ , we have

[TABLE]

We emphasize that Theorem 5.1 does not require any additional assumptions on $\theta^{*}$, e.g., the minimal length condition or the moderate growth condition introduced in the previous section. Therefore, it suggests the existence of a penalized model selection estimator that achieves the minimax rate in Proposition 3.2. However, the estimator (27) is not practical for computational reasons because it is obtained through minimization over exponentially many possible partitions $\Pi$.

The dependence on the total variation of each segment in (5.1) is $(\mathcal{V}^{A_{i}}(\theta^{*}_{A_{i}})+1)^{2/3}$ instead of $(\mathcal{V}^{A_{i}}(\theta^{*}_{A_{i}}))^{2/3}$. The additional constant $1$ is due to the minimal resolution of the sieve. Establishing a non-asymptotic risk bound for the penalized model selection estimator without sieves (i.e., (26)) and removing the dependence on the sieve resolution remain open problems.

6 Simulations

We provide some numerical examples for piecewise monotone regression problems.

6.1 Dealing with inconsistency at boundaries

Before presenting the simulation results, we explain a well-known practical issue in the isotonic regression literature and a regularization method to cope with it.

In the study of statistical estimation under monotonicity constraints, it is known that the least squares estimator $\hat{\theta}_{K_{n}^{\uparrow}}$ is inconsistent at the boundary points (see e.g., Groeneboom and Jongbloed (2014) and Woodroofe and Sun (1993)). A similar issue arises for the nearly-isotonic regression estimators. Since the penalty term in (3) does not activate if the orders are not violated at the boundary points (i.e., $y_{1}<y_{2}$ or $y_{n-1}<y_{n}$ ), the nearly-isotonic regression is not robust against a negative noise at the left boundary or a positive noise at the right boundary. To overcome this issue, we consider the following boundary correction regularization for the nearly-isotonic regression:

[TABLE]

where $\mu>0$ is an additional tuning parameter. It can easily be checked that the solution is equivalent to that of the ordinary nearly-isotonic regression (3) applied to $\tilde{y}=(y_{1}+\mu,y_{2},\ldots,y_{n-1},y_{n}-\mu)$. Similar regularization methods for isotonic regression have been studied by Chen et al. (2015), Wu et al. (2015) and Luss and Rosset (2017).
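Because of this equivalence, the boundary correction requires no new solver: it suffices to shift the two boundary observations and pass the result to any routine for (3). A minimal sketch of the preprocessing step follows; the function name `boundary_corrected_response` is hypothetical, and the solver for (3) is assumed to be available separately.

```python
import numpy as np

def boundary_corrected_response(y, mu):
    """Return y_tilde = (y_1 + mu, y_2, ..., y_{n-1}, y_n - mu) for n >= 2."""
    y_tilde = np.array(y, dtype=float)  # copy, so the input is left untouched
    y_tilde[0] += mu    # raise the left boundary observation
    y_tilde[-1] -= mu   # lower the right boundary observation
    return y_tilde

y = np.array([1.0, 2.0, 3.0])
y_tilde = boundary_corrected_response(y, mu=0.5)
```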

6.2 Simulation data

Here, we evaluate the performance of the nearly-isotonic regression and related estimators on simulated data. According to the one-dimensional regression model (23), we generated data with equi-spaced design points $x_{i}=(i-1)/n$ ( $i=1,2,\ldots,n$ ). For the true function $f^{*}$ , we consider $m$ -piecewise monotone functions defined as

[TABLE]

where $f:[0,1)\to\mathbb{R}$ is a given monotone function and $I_{j}:=[(j-1)/m,j/m)$ for $j=1,2,\ldots,m$ . Following Meyer and Woodroofe (2000), we choose $f$ from the following two monotone functions:

[TABLE]

Figure 2 shows an example of $f=f_{\mathrm{sigmoid}}$ and $m=2$ . It is worth noting that the former sigmoidal function $f_{\mathrm{sigmoid}}$ satisfies the moderate growth condition (see Definition 4.11), whereas the latter cubic function $f_{\mathrm{cube}}$ does not. Hence, for the case of piecewise sigmoidal functions $f_{\mathrm{sigmoid}}^{(m)}$ , the minimax rate of $\mathord{\mathrm{O}}(n^{-2/3})$ is achieved by both the nearly-isotonic regression and the fused lasso (see Corollary 4.12 above and Corollary 2.8 by Guntuboyina et al. (2017)).

In our experiments, the size $n$ of the signal is chosen from $\{2^{6},2^{7},\ldots,2^{10}\}$ . The noise standard deviation $\sigma$ is assumed to be known and fixed to $0.25$ . We evaluated the MSE for the following four estimators:

•

Neariso: The nearly-isotonic regression (3).

•

NearisoBC: The nearly-isotonic regression with boundary correction (29).

•

Fused: The fused lasso (6).

•

PO: The projection estimator with the partition oracle, i.e., the projection estimator onto $K_{\Pi}^{\uparrow}$ provided with the true partition $\Pi$ .

For Neariso and Fused, the tuning parameter $\lambda$ is selected by generalized $C_{p}$ criteria (i.e., minimizing SURE (22)). For NearisoBC, the tuning parameters $(\lambda,\mu)$ are selected by a similar criterion. To estimate the MSE, we generated 500 replications of the data and calculated the average value of the squared loss $\frac{1}{n}\lVert\hat{\theta}-\theta^{*}\rVert_{2}^{2}$ .
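The Monte Carlo estimate of the MSE described above can be sketched as follows. This is a minimal illustration, assuming Gaussian noise and an arbitrary estimator passed as a callable. The function name `monte_carlo_mse` and the placeholder linear signal are hypothetical, and the identity estimator $\hat{\theta}_{\mathrm{id}}=y$ is used only because its risk is known to equal $\sigma^{2}$.

```python
import numpy as np

def monte_carlo_mse(theta_star, estimator, sigma, n_rep=500, seed=0):
    """Average of (1/n) * ||estimator(y) - theta_star||^2 over n_rep replications."""
    rng = np.random.default_rng(seed)
    theta_star = np.asarray(theta_star, dtype=float)
    n = theta_star.size
    losses = np.empty(n_rep)
    for r in range(n_rep):
        y = theta_star + sigma * rng.standard_normal(n)  # y = theta* + noise
        losses[r] = np.sum((estimator(y) - theta_star) ** 2) / n
    return float(losses.mean())

# Sanity check with the identity estimator, whose MSE equals sigma^2.
n = 64
theta_star = np.linspace(0.0, 1.0, n)  # placeholder signal, not the paper's f
mse_identity = monte_carlo_mse(theta_star, lambda y: y, sigma=0.25)
```

Any of the four estimators can be plugged in as `estimator` once a solver is available; the identity estimator gives a quick check that the estimate is close to $\sigma^{2}=0.0625$.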

Figure 5 presents the results for $m=2,4$ and $f=f_{\mathrm{sigmoid}},f_{\mathrm{cube}}$. The upper row shows log-log plots of the MSE versus $n$. In each setting, the three regularization-based estimators (i.e., Neariso, NearisoBC, and Fused) performed as well as the ideal estimator PO, even though they do not use any information about the true partition. The risks of PO are well fitted by lines of slope $-2/3$, which means that the convergence rate is close to the minimax optimal rate of $\mathord{\mathrm{O}}(n^{-2/3})$.

Next, we provide more detailed comparisons of the regularization-based estimators. The lower row in Figure 5 shows the difference of MSEs from that of PO. For piecewise sigmoidal functions, NearisoBC and Fused performed better than Neariso. Notably, in the case of $m=2$, the risks of Fused were even better than PO for large values of $n$. A possible reason for the better performance of the fused lasso is that the sigmoidal function can be well approximated by a piecewise constant function near the boundaries. On the other hand, for piecewise cubic functions, Neariso performed slightly better than the other two estimators for small values of $n$.

6.3 Geological data

We conducted experiments on GPS data related to a seismological phenomenon reported by Rogers and Dragert (2003). The aim here is to investigate the performance of the nearly-isotonic type estimators on real-world data for which piecewise monotone approximations have already been justified in previous work. For the signal $y$, we used the difference of the east-west components of GPS measurements between two observatories, which are located in Victoria (British Columbia, Canada) and Seattle (United States). The GPS data is provided by Melbourne et al. (2018). The top panel in Figure 6 shows the plot. The data period starts on January 1, 2010, and ends on December 2, 2017. After removing missing records, the size of the signal is $n=2885$. The increasing trend of the signal is considered to be caused by the subduction process at the plate boundary. We can also see periodic reversals in the signal, and the entire signal may be approximated by a piecewise monotone signal. Such reversals may be related to the seismological phenomenon known as episodic tremor and slip. According to Rogers and Dragert (2003), such slip events were observed every 13 to 16 months in their data taken from 1997 to 2003.

GPS data contains several anomalous values. For the signal $y$ considered above, most of the values $y_{i}$ are between 20 and 50, except for a single outlier $y_{2344}=139.34$. The behavior of the estimators is strongly affected by the existence of such outliers. In our situation, we can manually remove the anomalous value (we denote the resulting signal by $\tilde{y}$). However, it is often difficult to distinguish outliers in practical situations. From this perspective, we also considered the robust $M$-estimation version of the nearly-isotonic regression defined as (34) with $\mathcal{L}(\theta;y)=\sum_{i=1}^{n}\ell_{\delta}(\theta_{i}-y_{i})$. Here, $\ell_{\delta}$ is the Huber loss:

[TABLE]

which is commonly used in the robust regression literature.
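To illustrate why this loss limits the influence of an outlier, the following sketch implements the Huber loss in its standard parameterization, which is quadratic for $|x|\leq\delta$ and linear beyond. This parameterization is an assumption: the display above may differ by a constant scaling. The function name `huber` is hypothetical.

```python
import numpy as np

def huber(r, delta):
    """Standard Huber loss: r^2 / 2 if |r| <= delta, else delta * (|r| - delta / 2)."""
    r = np.asarray(r, dtype=float)
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

# A large residual contributes linearly rather than quadratically.
residuals = np.array([0.005, -1.0, 100.0])
huber_values = huber(residuals, delta=0.01)
squared_values = 0.5 * residuals**2
```

For a residual of size 100, the squared loss contributes 5000 to the objective, whereas the Huber loss with $\delta=0.01$ contributes less than 1, so a single anomalous observation no longer dominates the fit.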

We applied the nearly-isotonic regression (3) and its robust variant to the signals $y$ and $\tilde{y}$ described above. The tuning parameters $\lambda$ were determined by $5$-fold cross-validation, and $\delta$ in the Huber loss was fixed as $\delta=0.01$.

First, we consider the case where the outlier is removed manually. The second panel in Figure 6 shows the result for the cross-validated nearly-isotonic regression. The vertical lines denote the locations of downward jumps in the estimators. We can see that the period of jump clusters is about 12 to 14 months, which is close to that of the seismological slip events suggested by Rogers and Dragert (2003).

Next, we consider the case where the signal contains an outlier. In this case, the value of the squared loss largely depends on the error at the coordinate of the outlier. Then, the cross-validation may choose a large tuning parameter, and the resulting estimator becomes close to a monotone signal. The third panel in Figure 6 shows that the number of downward jumps is considerably less than the number that is expected from the known frequency of the slip events. Conversely, the fourth panel in Figure 6 shows that the robust version of the nearly-isotonic regression outputs similar clusters of change points as in the second panel.

7 Discussion

In this paper, we studied the problem of estimating piecewise monotone signals. The classical isotonic regression estimator cannot be applied in this setting because of the existence of arbitrarily large downward jumps. We derived the minimax risk lower bound over piecewise monotone signals with bounded upper total variations. The minimax rate is tight up to a multiplicative constant because it can be achieved by a (computationally inefficient) model selection based estimator. Our main results show that the nearly-isotonic regression estimator achieves this rate, up to a logarithmic factor, under an additional growth condition. An advantage of the nearly-isotonic regression is that the estimator can be calculated efficiently on arbitrary directed graphs by parametric max-flow algorithms. The simulation results demonstrate that the nearly-isotonic regression has nearly the same convergence rate as the ideal estimator that knows the true partition.

7.1 Non-Gaussian noises

In this paper, we provided risk bounds for the nearly-isotonic regression under the assumption that the noise distribution is Gaussian. However, in practice, this assumption is too restrictive. Here, we briefly discuss the risk bound with non-Gaussian error distributions.

Suppose that $\xi_{1},\ldots,\xi_{n}$ are i.i.d. random variables with $\mathbb{E}[\xi_{1}]=0$ and $\mathrm{Var}(\xi_{1})=\sigma^{2}$. Then, we can see that the “expectation bound” (4.7) holds with a different constant $C^{\prime}>0$. See Remark D.14 in the appendix for the key ingredients for the derivation. On the other hand, the “high-probability bound” (4.7) does not hold in general since it requires a stronger concentration property (i.e., the Gaussian concentration).

7.2 Future directions

An interesting direction for future work is to investigate the optimal rate of piecewise monotone regression on higher dimensional grids or general graphs. Recently, several researchers have analyzed the risk bounds for the isotonic regression estimators on two- or higher-dimensional grid graphs (Chatterjee et al. 2018, Han et al. 2017). It is natural to ask whether one can construct a computationally efficient estimator that is adaptive to piecewise monotone vectors on a given graph. We believe that the nearly-isotonic type estimator (32) is a candidate. A major difficulty is to determine an appropriate graph topology. Given a partial order $\preceq$ on a set $V=[n]$, the corresponding isotonic regression estimator is uniquely determined. However, there are many directed acyclic graphs that correspond to the partial order $\preceq$. Hence, the graph topology for the nearly-isotonic type estimators is not unique. To control the connectivity, it may be useful to introduce edge weightings proposed by Fan and Guan (2017).

Another direction is to develop a model selection method for least squares estimators over unbounded cones. We introduced sieves on the total variation in Section 5 to construct an estimator that is adaptive to piecewise monotone vectors. In practice, sieve-based methods can be computationally inefficient. Conversely, if the true vector $\theta^{*}$ is monotone, the isotonic regression automatically achieves the minimax rate with respect to the total variation. We conjecture that it is also possible to select the least squares estimator $\hat{\theta}_{\Pi}$ without using sieves. In particular, we leave it as an open question whether the adaptive risk bound is achieved by the penalized selection rule of the form (26).

Appendix A Algorithms for nearly-isotonic estimators

In this section, we present algorithms for the nearly-isotonic regression and related estimators and discuss their computational complexities. Note that the main purpose of this section is to give a review of existing algorithms, and hence most results presented in this section are not new (except for Proposition A.1).

A.1 Penalized estimators

Here, we introduce two algorithms to solve the penalized form nearly-isotonic regression (3). In Section A.1.1, we introduce the solution path algorithm developed by Tibshirani et al. (2011). The advantage of the solution path algorithm is that it outputs the solutions $\hat{\theta}_{\lambda}$ for every $\lambda\geq 0$ simultaneously. However, the solution path algorithm cannot be applied to the estimators with general weights and graphs. In Section A.1.2, we provide another algorithm that outputs the exact solution for a single $\lambda$. The latter algorithm can be applied to the nearly-isotonic type estimators defined on any weighted directed graph.

A.1.1 One-dimensional problem

The modified pool adjacent violators algorithm (modified PAVA, Tibshirani et al. (2011)) is the algorithm used to calculate the solution path for the problem (3). Here, we present a variant of the modified PAVA for the following weighted version of the estimator:

[TABLE]

where $c_{i}>0$ ( $i=1,2,\ldots,n-1$ ) are positive weight parameters. Letting $c_{i}=(x_{i+1}-x_{i})^{-1}$ , this formulation covers the nearly-isotonic regression for general increasing design points (24).
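For orientation, the classical pool adjacent violators algorithm can be sketched as follows. It computes the ordinary isotonic regression, which is the limit of the nearly-isotonic path as $\lambda\to\infty$ ; it is not the modified PAVA of Algorithm 1 and does not produce the solution path. The function name `pava` and the use of unit observation weights are assumptions made for illustration.

```python
def pava(y):
    """Classical PAVA: least squares fit under theta_1 <= ... <= theta_n."""
    blocks = []  # each block is [sum of values, number of points]
    for value in y:
        blocks.append([float(value), 1])
        # Pool while the previous block mean exceeds the last block mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit
```

Pooled blocks are never split again, which is the same agglomerative behaviour that the solution path algorithm relies on.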

The derivation of Algorithm 1 is straightforward from the original paper of Tibshirani et al. (2011). We should note that the validity of this algorithm crucially depends on the property that the solution path is piecewise linear and “agglomerative”. It is well known that the piecewise linearity of the solution path holds for many classes of regularization estimators (Rosset and Zhu 2007). We say that the solution path $\{\hat{\theta}_{\lambda}\}_{\lambda\geq 0}$ is agglomerative if it satisfies the following condition: if $\hat{\theta}_{\lambda,i}=\hat{\theta}_{\lambda,j}$ holds for some $\lambda=\lambda_{0}$ , then the same equality holds for any $\lambda\geq\lambda_{0}$ . For constant weights ( $c_{i}\equiv 1$ ), this agglomerative property was proved by Tibshirani et al. (2011). However, for general non-unit edge weights ( $c_{i}\neq 1$ ), it need not hold. Here, we provide the following proposition, which ensures the agglomerative property for non-unit edge weights.

Proposition A.1.

The solution path of weighted nearly-isotonic regression (30) is piecewise linear and agglomerative if the edge weights satisfy the following concavity condition.

[TABLE]

where we defined $c_{0}:=0$ . In particular, this condition implies that Algorithm 1 outputs the exact solution path.

The condition (31) demands that $c_{j}$ can be written as $c_{j}=f(j)$ for some concave function $f:\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ with $f(0)=0$ and $f(x)>0$ for all $x>0$ . In particular, for any $i\leq j\leq k$ , we have

[TABLE]

and

[TABLE]

Proof sketch of Proposition A.1.

We can prove the validity of Algorithm 1 by an argument similar to that of Tibshirani et al. (2011) if we assume the piecewise linearity and the agglomerative property. The piecewise linearity is already shown in Rosset and Zhu (2007). Hence, it remains to prove the agglomerative property under the condition (31). To this end, we leverage the “agglomerative clustering condition” defined in Appendix D.6. In particular, we defer the details to Remark D.25 as well as Remark D.27. ∎

A.1.2 General graphs

Let $G=(V,E)$ be a directed graph with $V:=[n]$ . Suppose that each edge $(i,j)\in E$ is equipped with a positive weight $c_{(i,j)}>0$ . We define the generalized nearly-isotonic regression as

[TABLE]

where $\mathcal{V}_{G}$ is a nearly-isotonic type penalty defined as

[TABLE]

For any choice of $G$ and $c$ , $\mathcal{V}_{G}$ is a convex function. Clearly, the lower total variation $\mathcal{V}_{-}$ is a special case where $E=\{(i,i+1):i=1,2,\ldots,n-1\}$ and $c_{(i,i+1)}\equiv 1$ . Thus, (32) can be regarded as a generalization of the nearly-isotonic regression to general directed graphs.
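As a minimal sketch, the penalty can be evaluated edge by edge. The code assumes that $\mathcal{V}_{G}(\theta)$ is the weighted sum of $(\theta_{i}-\theta_{j})_{+}$ over directed edges $(i,j)$ , which is consistent with the chain special case described above; the function names and the 0-based indexing are illustrative choices.

```python
def graph_penalty(theta, edges, weights=None):
    """Sum of c_(i,j) * max(theta_i - theta_j, 0) over directed edges (i, j)."""
    if weights is None:
        weights = [1.0] * len(edges)
    return sum(c * max(theta[i] - theta[j], 0.0)
               for (i, j), c in zip(edges, weights))

def chain_edges(n):
    """Edges (i, i+1) of the one-dimensional chain, 0-indexed."""
    return [(i, i + 1) for i in range(n - 1)]

theta = [0.0, 2.0, 1.0, 3.0, 0.5]
# With unit weights on the chain this is the lower total variation.
lower_tv = graph_penalty(theta, chain_edges(len(theta)))
```

Only the two downward jumps ( $2\to 1$ and $3\to 0.5$ ) contribute, so a non-decreasing vector has zero penalty.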

Problems of the form (32) have been well studied in the optimization literature. In particular, we can see that solving (32) is equivalent to solving a certain parametrized family of minimum-cut problems. For detailed explanations of such an equivalence, see Obozinski and Bach (2016) and Chapter 8 in Bach (2013). Hence, (32) can be solved by the parametric max-flow algorithm (Gallo et al. 1989), which runs in $\mathord{\mathrm{O}}(n|E|\log\frac{n^{2}}{|E|})$ time. On the other hand, it has been pointed out by Mairal et al. (2011) that, for many practical instances, some simplified variants of the parametric max-flow algorithm output the solution faster than the original algorithm by Gallo et al. (1989). We remark that Hochbaum and Queyranne (2003) also established a relationship between the isotonic regression and the parametric max-flow algorithm.

Algorithm 2 shows the Divide-and-Conquer algorithm (Chapter 9 of Bach (2013)) that solves (32). In the inner loop, the algorithm recursively solves max-flow problems by defining smaller networks (Algorithm 3). See Figure 7 for examples of networks used in the first two recursions in the algorithm.

A.1.3 General convex loss functions

In practice, we are often interested in general convex loss functions other than the squared loss. Here, we consider a generalized problem of the following form:

[TABLE]

where $\theta\mapsto\mathcal{L}(\theta;y)$ is a convex loss function for any $y\in\mathbb{R}^{n}$ . As an example, this formulation contains the $M$ -estimator in the regression setting $\mathcal{L}(\theta;y)=\frac{1}{2}\sum_{i=1}^{n}\ell(y_{i}-\langle x_{i},\theta\rangle)$ , where $(y_{i},x_{i})\in\mathbb{R}\times\mathbb{R}^{p}$ ( $i=1,2,\ldots,n$ ) are the observed data and $\ell:\mathbb{R}\to\mathbb{R}$ is a convex function.

We can also obtain algorithms that output approximate minimizers of (34) as follows. First of all, note that Algorithm 2 computes the proximal operator of the regularization term $\mathcal{V}_{G}(\theta)$ . Once we have an oracle for the proximal operator, we can apply proximal gradient methods to solve (34). In particular, if $\mathcal{L}(\theta;y)$ is convex and smooth, the Fast Iterative Shrinkage Thresholding Algorithm (FISTA, Beck and Teboulle (2009)) outputs an $\mathord{\mathrm{O}}(\epsilon)$ -optimal solution after $\mathord{\mathrm{O}}(\epsilon^{-1/2})$ evaluations of the proximal operator.
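The proximal gradient scheme can be sketched generically. In the code below the proximal oracle is passed in as a function; for a runnable illustration it is instantiated with soft-thresholding (the proximal operator of an $\ell_{1}$ penalty) as a stand-in, whereas in the setting of this section the oracle would be Algorithm 2. The names `fista`, `soft_threshold`, and the fixed step size $1/L$ are assumptions of this sketch.

```python
import numpy as np

def fista(grad, prox, lipschitz, x0, n_iter=200):
    """Minimize L(x) + R(x) given grad of smooth L and prox(v, step) of R."""
    x = np.asarray(x0, dtype=float)
    z, t = x.copy(), 1.0
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        x_new = prox(z - step * grad(z), step)          # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x

# Stand-in oracle: soft-thresholding, the prox of lam * ||.||_1.
def soft_threshold(v, step, lam=0.5):
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

y = np.array([2.0, -0.2, 1.0])
# Squared loss L(x) = 0.5 * ||x - y||^2 has gradient x - y and Lipschitz constant 1.
x_hat = fista(lambda x: x - y, soft_threshold, lipschitz=1.0, x0=np.zeros(3))
```

For the squared loss the iteration reduces to a single proximal step, so the result coincides with soft-thresholding of `y`; for other smooth losses only `grad` and `lipschitz` change.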

A.2 Constrained estimators

Consider the following generalized version of the constrained form of nearly-isotonic regression (11):

[TABLE]

Unlike the penalized estimators, it is difficult to find an exact solution of (35). However, since problem (35) is an instance of a quadratic programming problem, there are polynomial time algorithms to obtain approximate solutions. Here, we explain the existence of such algorithms. The following result is a direct application of Theorem 1 by Lee et al. (2018), which provides a convergence guarantee of a variant of cutting plane methods.

Proposition A.2.

Suppose that $G=([n],E)$ is a directed graph equipped with positive weights $c_{(i,j)}$ for every $(i,j)\in E$ . Let $y\in\mathbb{R}^{n}$ be any vector and $\mathcal{V}>0$ . Then, for any $\epsilon>0$ , there exists a randomized algorithm that outputs $\tilde{\theta}$ satisfying

[TABLE]

and

[TABLE]

with a probability of $0.99$ . The overall complexity of the algorithm is $\mathord{\mathrm{O}}((n+|E|)n^{2}\log^{\mathord{\mathrm{O}}(1)}\frac{n}{\epsilon|E|})$ .

Remark A.3.

In practice, due to computational considerations, we recommend using the penalized estimator (32) instead of the constrained estimator (35). For the penalized estimator, we empirically observed that Algorithm 2 runs sufficiently fast on graphs with several hundred nodes. For the constrained estimator, Proposition A.2 theoretically guarantees polynomial time solvability of the constrained problem (35), but it does not provide a practical algorithm.

Appendix B Supplemental experiments

To understand the behavior of the nearly-isotonic regression in more generic settings, we present additional simulation results for the nearly-isotonic regression on general graphs (32). Here, we consider the problem of estimating piecewise monotone signals on two-dimensional grids.

We say that an $n_{1}\times n_{2}$ matrix $\theta$ is monotone if $\theta_{ij}\leq\theta_{kl}$ whenever $i\leq k$ and $j\leq l$ . In other words, $\theta$ is monotone if it has no order-violating edges in the two-dimensional grid graph $G_{2}=(V_{2},E_{2})$ , where $V_{2}=[n_{1}]\times[n_{2}]$ is the set of all subscripts $(i,j)$ and

[TABLE]

We say that $\theta$ is piecewise monotone if there is a partition $\Pi$ of $V_{2}$ such that, for each $A\in\Pi$ , $A$ is a weakly connected component of $G_{2}$ and $\theta_{A}$ has no order-violating edges in the induced subgraph. For simplicity of the experimental settings, we only consider “block” type partitions here, i.e., we say that $\Pi$ is of block type if it can be represented as a product of two partitions of the two coordinates. The left panel in Figure 8 is an example of a two-dimensional piecewise monotone signal on a block type partition.
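The definition of monotonicity via order-violating edges can be checked directly. The sketch below assumes that $E_{2}$ consists of the edges $(i,j)\to(i+1,j)$ and $(i,j)\to(i,j+1)$ ; the function names and the example block `U` are hypothetical.

```python
import numpy as np

def grid_violations(theta):
    """Count order-violating edges of the two-dimensional grid graph."""
    theta = np.asarray(theta, dtype=float)
    down = np.sum(theta[:-1, :] > theta[1:, :])   # edges (i, j) -> (i+1, j)
    right = np.sum(theta[:, :-1] > theta[:, 1:])  # edges (i, j) -> (i, j+1)
    return int(down + right)

def is_monotone(theta):
    return grid_violations(theta) == 0

U = np.add.outer(np.arange(3), np.arange(3))   # monotone 3 x 3 block
tiled = np.tile(U, (2, 2))                     # block type partition with 4 pieces
```

Here `U` is monotone, while `tiled` is only piecewise monotone: all of its violating edges cross the boundaries between blocks.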

We compare the following three estimators:

• LSE: The bivariate isotonic regression (see e.g., Robertson et al. (1988)).

• Neariso2: The two-dimensional nearly-isotonic regression with $C_{p}$ -tuned parameter.

• PO: The bivariate isotonic regression applied to the true partition.

For monotone matrices, Chatterjee et al. (2018) proved that LSE is minimax rate optimal with respect to $n=n_{1}n_{2}$ . Hence, the partition oracle estimator PO can be regarded as an ideal benchmark that is minimax optimal over piecewise monotone matrices. On the other hand, if the true matrix $\theta^{*}$ is piecewise monotone, the risk of LSE can be arbitrarily large for the same reason as in Proposition 3.3. Neariso2 is the special case of the generalized nearly-isotonic regression (32) applied to the graph $G_{2}$ defined above. Neariso2 was originally discussed in Tibshirani et al. (2011), but no experimental results were presented there. Figure 8 shows examples of the solutions of the three estimators.

We construct an $n\times n$ matrix $\theta^{*}$ as follows: We define a small $k\times k$ monotone matrix $U$ , and then we define $\theta^{*}$ as an $mk\times mk$ block matrix by repeating $U$ $m$ times both in rows and columns (thus $n=mk$ ). We choose the small matrix $U=(U_{ij})$ from

[TABLE]

or

[TABLE]

where we write $x_{i}=\frac{i-1}{k-1}$ for $i=1,2,\ldots,k$ . With the former choice, $\theta^{*}$ becomes an $m^{2}$ -piecewise monotone matrix. With the latter choice, $\theta^{*}$ becomes an $m$ -piecewise monotone matrix such that $\theta^{*}_{ij}$ does not depend on $j$ .

We generated noisy observations $y$ by adding independent Gaussian noise $\xi_{ij}\sim N(0,(0.25)^{2})$ to every entry of $\theta^{*}$ . To estimate the MSE, we used 500 replications of the data. Figure 9 shows the results. Clearly, the risks of LSE (blue triangles) are much larger than those of the other two estimators. Neariso2 (green circles) has slightly larger risks compared to PO (magenta squares), while their slopes seem to be close.
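The data-generating step can be sketched as follows. The block `U` used here is a hypothetical monotone matrix built from the grid $x_{i}=\frac{i-1}{k-1}$ ; it is not one of the two choices used in the experiments, and the function names and the seed are illustrative.

```python
import numpy as np

def make_signal(U, m):
    """Repeat the k x k block U m times along rows and columns."""
    return np.tile(U, (m, m))

def simulate(theta_star, sigma=0.25, seed=0):
    """Add independent N(0, sigma^2) noise to every entry."""
    rng = np.random.default_rng(seed)
    return theta_star + sigma * rng.standard_normal(theta_star.shape)

k, m = 4, 2
x = np.arange(k) / (k - 1)           # x_i = (i - 1) / (k - 1) for i = 1, ..., k
U = np.add.outer(x, x) / 2.0         # hypothetical monotone block, not the paper's
theta_star = make_signal(U, m)       # (m k) x (m k) piecewise monotone signal
y = simulate(theta_star)
```

Averaging the squared error of an estimator over independent draws of `y` (with different seeds) gives a Monte Carlo estimate of the MSE.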

To visualize convergence rates, we fitted the risks of PO by monomials $\propto n^{-a}$ ( $a>0$ ) and plotted the fits as dashed lines in Figure 9. The values of the exponent $a$ are as follows: $0.58$ (cubic2d, $m=2$ ); $0.56$ (cubic2d, $m=4$ ); $0.50$ (cubic1d, $m=2$ ); $0.45$ (cubic1d, $m=4$ ). We should note that, in monotone matrix estimation, the theoretical convergence rate of LSE is known to be $\tilde{\mathord{\mathrm{O}}}(n^{-1/2})$ (Chatterjee et al. 2018).
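One common way to fit a monomial $c\,n^{-a}$ is ordinary least squares on the log-log scale. The sketch below assumes this procedure (the text does not specify the fitting method) and uses synthetic risks rather than the simulation results.

```python
import numpy as np

def fit_exponent(n_values, risks):
    """Least squares fit of log(risk) = log(c) - a * log(n); returns (a, c)."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(risks), deg=1)
    return -slope, float(np.exp(intercept))

n_values = np.array([50.0, 100.0, 200.0, 400.0])
risks = 3.0 * n_values ** -0.5       # synthetic risks, not the paper's data
a, c = fit_exponent(n_values, risks)
```

On exact power-law input the fit recovers the exponent and the constant up to rounding error.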

Appendix C Proofs in Section 3

C.1 Proof of Proposition 3.2

Let $\Theta$ be either $\tilde{\Theta}_{n}(m,\mathcal{V})$ or $\Theta_{n}(m,\mathcal{V})$ , which are defined in Definition 3.1. The minimax lower bound (10) is proved by combining the following two lower bounds:

(i)

(Lower bound for monotone vectors (Zhang 2002, Chatterjee et al. 2015)) Let $\mathcal{K}(\mathcal{V})=\{\theta\in K_{n}^{\uparrow}:\mathcal{V}(\theta)\leq\mathcal{V}\}$ be the set of monotone vectors with bounded total variations. There is a universal constant $C_{1}>0$ such that for any estimator $\hat{\theta}$ ,

[TABLE]

(ii)

(Lower bound for piecewise constant vectors) Let $\mathcal{C}(m)$ be the set of $m$ -piecewise constant vectors in $\mathbb{R}^{n}$ , i.e., $\theta\in\mathcal{C}(m)$ if $|\{i:\theta_{i}\neq\theta_{i+1}\}|\leq m-1$ . The minimax lower bound over $\mathcal{C}(m)$ can be related to sparse estimation as follows. Let $X$ be an $n\times n$ matrix whose $(i,j)$ entries are given as $1_{\{i\geq j\}}$ . Then, $\mathcal{C}(m)$ contains the set $\{\theta=X\beta:\lVert\beta\rVert_{0}\leq m\}$ , and the lower bound for the minimax risk over $\mathcal{C}(m)$ follows from the well-known results for $\ell_{0}$ balls (e.g., Raskutti et al. (2011), Theorem 3-(b)). In particular, for any $m\geq 3$ , the following lower bound is presented in Gao et al. (2017):

[TABLE]

where $C_{2}>0$ is a universal constant.

It remains to show that $\Theta$ contains $\mathcal{K}(\mathcal{V})$ and $\mathcal{C}(m)$ . $\mathcal{C}(m)\subseteq\Theta$ is obvious because an $m$ -piecewise constant vector is also an $m$ -piecewise monotone vector such that the piecewise total variations are zero. From the definition, it is also clear that $\mathcal{K}(\mathcal{V})\subseteq\tilde{\Theta}_{n}(m,\mathcal{V})$ . If $\theta\in\mathcal{K}(\mathcal{V})$ , the jumps $\theta_{i+1}-\theta_{i}$ that strictly exceed $\mathcal{V}/m$ cannot occur more than $m-1$ times. Hence, we can choose a partition $\Pi$ with $|\Pi|\leq m$ so that each $A\in\Pi$ does not contain such large jumps, which implies that $\theta\in\Theta_{n}(m,\mathcal{V})$ .
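The counting step for the large jumps can be made explicit. The short derivation below is a sketch for the case $\mathcal{V}>0$ ; the symbol $J$ for the number of large jumps is introduced here only for illustration.

```latex
% theta is monotone non-decreasing with total variation at most V.
J := \bigl|\{\, i : \theta_{i+1}-\theta_{i} > \mathcal{V}/m \,\}\bigr| \ge 1
\;\Longrightarrow\;
J\cdot\frac{\mathcal{V}}{m}
  \;<\; \sum_{i:\,\theta_{i+1}-\theta_{i}>\mathcal{V}/m}(\theta_{i+1}-\theta_{i})
  \;\le\; \sum_{i=1}^{n-1}(\theta_{i+1}-\theta_{i})
  \;=\; \theta_{n}-\theta_{1}
  \;\le\; \mathcal{V}.
```

The second inequality uses that all jumps of a monotone vector are non-negative. Dividing by $\mathcal{V}/m$ gives $J<m$ , that is, $J\leq m-1$ , so cutting the index set at the large jumps produces at most $m$ connected pieces.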

C.2 Proof of Proposition 3.3

The following theorem in the seminal paper of Chatterjee (2014) provides useful upper and lower bounds for the risk of the least squares estimator over any closed convex set $K$ .

Theorem C.1 (Chatterjee (2014), Corollary 1.2).

Let $K\subseteq\mathbb{R}^{n}$ be any closed convex set, and let $\hat{\theta}_{K}$ denote the least squares estimator over $K$ . For any $\theta^{*}\in\mathbb{R}^{n}$ , define the function $g_{\theta^{*}}:\mathbb{R}_{+}\to\mathbb{R}\cup\{-\infty\}$ as

[TABLE]

Here, if the set $\{\theta\in K:\lVert\theta-\theta^{*}\rVert_{2}\leq t\}$ is empty, we define $g_{\theta^{*}}(t)=-\infty$ . Then, $g_{\theta^{*}}$ is strictly concave for $t\geq\mathrm{dist}(\theta^{*},K)$ and has a unique maximizer $t_{\theta^{*}}$ . Moreover, there are universal constants $C_{1},C_{2}>0$ such that

[TABLE]

To prove Proposition 3.3, we use the lower bound in (36). Note that for a sufficiently large $t_{0}>0$ , $t\mapsto t^{2}-Ct^{3/2}$ is strictly increasing on $[t_{0},\infty)$ . For any $n$ and $\sigma^{2}$ , choose $t\geq t_{0}$ so that $t^{2}-Ct^{3/2}\geq n\sigma^{2}$ . Then, for any $\theta^{*}$ such that $\mathrm{dist}(\theta^{*},K)\geq t$ , we have

[TABLE]

Remark C.2.

We should note that the above proof is valid for any closed convex set $K$ . For the specific choice of $K=K_{n}^{\uparrow}$ , the lower bound of $t_{n,\sigma^{2}}$ used in the proof can be quite conservative. In practice, the risk of the isotonic regression estimator can be larger than $\sigma^{2}$ under a smaller value of $\ell_{2}$ -misspecification error.

Appendix D Proofs in Section 4

D.1 Preliminaries

To state the results for risk upper bounds, we first introduce some quantities related to Gaussian processes.

Definition D.1.

Let $C$ be a closed convex set in $\mathbb{R}^{n}$ . Let $\mathbb{E}$ denote the expectation with respect to an isotropic Gaussian random variable $Z\sim N(0,I_{n})$ .

(i)

The Gaussian width of $C$ is defined as

[TABLE]

(ii)

The Gaussian mean squared distance is defined as

[TABLE]

where $\mathord{\mathrm{dist}}(z,C):=\inf_{x\in C}\lVert x-z\rVert_{2}$ .

(iii)

Suppose that $C$ is a convex cone. The statistical dimension of $C$ is defined as

[TABLE]

We present some historical remarks on these definitions. The three quantities in Definition D.1 can be interpreted as complexity measures for the subset $C$ of the Euclidean space. The Gaussian width has been well studied in convex geometry, signal processing, high-dimensional statistics, and empirical process theory; see, e.g., Section 7.8 in Vershynin (2018) for a literature review. The definition of the Gaussian mean squared distance is due to Oymak and Hassibi (2016). As we will see in Lemma D.4 below, the Gaussian mean squared distance is useful for deriving risk bounds for proximal denoising estimators. The statistical dimension was defined in Amelunxen et al. (2014). Recently, Bellec (2018) pointed out that the statistical dimension characterizes the adaptive risk bounds for some shape restricted estimators including the isotonic regression and the convex regression.

As suggested by the definitions, these three quantities are closely related to each other. In particular, if $C$ is a convex cone, these are comparable as follows.

Proposition D.2.

Let $C$ be a closed convex cone.

(i)

(Amelunxen et al. (2014), Proposition 10.2) Let $S_{n-1}=\{x\in\mathbb{R}^{n}:\lVert x\rVert_{2}=1\}$ be the unit sphere in $\mathbb{R}^{n}$ . Then, we have $w^{2}(C\cap S_{n-1})\leq\delta(C)\leq w^{2}(C\cap S_{n-1})+1$ .

(ii)

(Amelunxen et al. (2014), Proposition 3.1) Let $C^{\circ}$ be the polar cone of $C$ defined as

[TABLE]

Then, we have $\mathbf{D}(C)=\delta(C^{\circ})$ .
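These quantities are easy to estimate by Monte Carlo for simple cones. The sketch below takes $C$ to be the non-negative orthant, whose polar cone is the non-positive orthant, and uses the characterization $\delta(C)=\mathbb{E}\lVert\Pi_{C}(Z)\rVert_{2}^{2}$ from Amelunxen et al. (2014), which may differ in form from the definition displayed above. The function name, sample size, and seed are assumptions of this illustration.

```python
import numpy as np

def orthant_dimensions(n, n_samples=20000, seed=0):
    """Monte Carlo estimates for C = non-negative orthant in R^n."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n))
    proj = np.maximum(z, 0.0)                       # projection onto C
    delta = np.mean(np.sum(proj ** 2, axis=1))      # E ||Pi_C(Z)||^2
    msd = np.mean(np.sum((z - proj) ** 2, axis=1))  # E dist^2(Z, C)
    return delta, msd

delta_hat, msd_hat = orthant_dimensions(n=10)
```

By symmetry both cones have statistical dimension $n/2$ , so both estimates should be close to $5$ for $n=10$ , in line with $\mathbf{D}(C)=\delta(C^{\circ})$ .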

Now, we introduce two general results for risk bounds for general projection estimators and proximal denoising estimators.

Let $K$ be a closed convex set in $\mathbb{R}^{n}$ , and define the projection estimator onto $K$ as $\hat{\theta}_{K}=\operatornamewithlimits{argmin}_{\theta\in K}\lVert y-\theta\rVert_{2}$ . Bellec (2018) proved the following oracle inequality that relates the risk of the projection estimator to the statistical dimension of the tangent cone of $K$ . Here, the tangent cone $T_{K}(\theta)$ of $K$ at $\theta\in K$ is defined as

[TABLE]

Lemma D.3 (Bellec (2018), Corollary 2.2).

Let $\theta^{*}\in\mathbb{R}^{n}$ be any vector, and suppose that the observation $y$ is drawn according to $N(\theta^{*},\sigma^{2}I_{n})$ . Then, we have the following risk bound:

[TABLE]

Moreover, for any $\eta\in(0,1)$ , the inequality

[TABLE]

holds with probability at least $1-\eta$ .

Next, we provide a general result for proximal denoising estimators. Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function, and $\lambda\geq 0$ . We define the proximal denoising estimator $\hat{\theta}_{\lambda}$ as

[TABLE]

The class of proximal denoising estimators contains the soft-thresholding estimator (Donoho et al. 1992), the total variation regularization (Rudin et al. 1992), the trend filtering (Kim et al. 2009) and the nearly-isotonic regression (Tibshirani et al. 2011). Oymak and Hassibi (2016) pointed out that the risk bound of proximal denoising estimators can be characterized by the Gaussian mean squared distance of the set $\lambda\partial f(\theta^{*})$ . Remarkably, based on this technique, Guntuboyina et al. (2017) proved sharp adaptation results for the trend filtering estimators. The following oracle inequality can be regarded as a generalization of Theorem 2.2 in Oymak and Hassibi (2016). For the sake of completeness, we also provide its proof below.

Lemma D.4.

Let $\theta^{*}\in\mathbb{R}^{n}$ be any vector, and suppose that the observation $y$ is drawn according to $N(\theta^{*},\sigma^{2}I_{n})$ . Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function, and let $\hat{\theta}_{\lambda}$ denote the proximal denoising estimator defined as (37). Then, we have

[TABLE]

Moreover, for any $\eta\in(0,1)$ , the inequality

[TABLE]

holds with probability at least $1-\eta$ .

Proof.

Below, we write $\hat{\theta}:=\hat{\theta}_{\lambda}$ . To prove (38), it suffices to show that we have almost surely

[TABLE]

for any fixed vector $\theta\in\mathbb{R}^{n}$ . We will assume $\theta\neq\hat{\theta}$ because otherwise the inequality is trivial.

From the first order optimality condition of the convex minimization problem (37), we have

[TABLE]

See Lemma 6.1 in van de Geer (2015) for a formal proof. Using the elementary fact that $2\langle u,v\rangle=\lVert u\rVert_{2}^{2}+\lVert v\rVert_{2}^{2}-\lVert u-v\rVert_{2}^{2}$ and substituting $y=\theta^{*}+\sigma z$ , we have

[TABLE]

Now, take $v\in\partial f(\theta)$ arbitrarily. From the definition of the subgradient, we have

[TABLE]

Hence, the right-hand side of (40) is bounded from above by

[TABLE]

Since the choice of $v\in\partial f(\theta)$ is arbitrary, we have

[TABLE]

By taking the expectation of both sides, (38) is proved.

To prove the high-probability bound (39), we use the well-known Gaussian concentration inequality (see e.g., Theorem 5.6 in Boucheron et al. (2013)); for any $L$ -Lipschitz function $h:\mathbb{R}^{n}\to\mathbb{R}$ and $\eta\in(0,1)$ , we have

[TABLE]

In fact, the map $z\mapsto\mathord{\mathrm{dist}}(z,\lambda\partial f(\theta))$ is a $1$ -Lipschitz function because, for any $z_{1},z_{2}\in\mathbb{R}^{n}$ , we have

[TABLE]

where $P$ is the orthogonal projection map onto the set $\lambda\partial f(\theta)$ . Now, we take $\bar{\theta}$ as

[TABLE]

Combining (41) and the Gaussian concentration applied for $\theta=\bar{\theta}$ , we have the desired result. ∎
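The Lipschitz property used in the concentration step holds for the distance to any closed convex set: $|\mathord{\mathrm{dist}}(z_{1},K)-\mathord{\mathrm{dist}}(z_{2},K)|\leq\lVert z_{1}-z_{2}\rVert_{2}$ . A quick numerical check is sketched below, with a box standing in for the set $\lambda\partial f(\theta)$ ; the box, the dimension, and the sampling scheme are arbitrary choices for illustration.

```python
import numpy as np

def dist_to_box(z, lo=-1.0, hi=1.0):
    """Euclidean distance from z to the box [lo, hi]^n (a closed convex set)."""
    return np.linalg.norm(z - np.clip(z, lo, hi))

rng = np.random.default_rng(0)
worst_ratio = 0.0
for _ in range(1000):
    z1, z2 = 3.0 * rng.standard_normal((2, 5))
    gap = abs(dist_to_box(z1) - dist_to_box(z2))
    # Ratio of the change in distance to the change in the argument.
    worst_ratio = max(worst_ratio, gap / np.linalg.norm(z1 - z2))
```

The largest observed ratio never exceeds one, as the triangle inequality argument in the proof predicts.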

D.2 Risk bounds for constrained estimators (Proof of Theorem 4.1)

In this subsection, we provide the proof of Theorem 4.1 as an application of Lemma D.3. To this end, we have to evaluate the statistical dimension of the tangent cone of a convex set

[TABLE]

It is not surprising that the analysis of the tangent cone of $K_{-}(\mathcal{V})$ is very similar to that of the set with bounded total variation $K(\mathcal{V})=\{\theta\in\mathbb{R}^{n}:\mathcal{V}(\theta)\leq\mathcal{V}\}$ in Guntuboyina et al. (2017). Our goal is to show the following upper bound for the statistical dimension:

Proposition D.5.

Suppose that $\theta$ is a vector with $\mathcal{V}_{-}(\theta)=\mathcal{V}$ . Then, there exists a universal constant $C>0$ such that

[TABLE]

where $M(\theta)$ is defined in (13).

We briefly outline the proof of this result. We divide the proof into four steps: First, we provide some useful characterizations of the tangent cone. Second, we decompose the tangent cone into finitely many pieces so that the Gaussian widths become easy to evaluate. Third, we provide concrete upper bounds on the Gaussian widths of these pieces. Lastly, we combine the upper bounds and apply Lemma D.3 to complete the proof.

**Step 1: Characterizing the tangent cone.** If $\mathcal{V}_{-}(\theta)<\mathcal{V}$ , $\theta$ is contained in the interior of $K_{-}(\mathcal{V})$ , and the tangent cone becomes the entire Euclidean space $\mathbb{R}^{n}$ . Hereafter, we assume that $\theta$ lies on the boundary of $K_{-}(\mathcal{V})$ , that is, $\mathcal{V}_{-}(\theta)=\mathcal{V}$ . Let us recall the definition of the sign of jumps $w_{i}$ in (4.1). Roughly speaking, the tangent cone of $K_{-}(\mathcal{V})$ is characterized by the sign of jumps.

Lemma D.6.

Let $\theta$ be a vector in $\mathbb{R}^{n}$ such that $\mathcal{V}_{-}(\theta)=\mathcal{V}$ . Let $\Pi=\{B_{1},B_{2},\ldots,B_{k^{\prime}}\}$ be any connected refinement of the constant partition $\Pi_{\mathrm{const}}(\theta)$ of $\theta$ . Here, we say that $\Pi$ is a connected refinement of another connected partition $\Pi^{\prime}$ if, for any $B\in\Pi$ , there exists a unique element $A\in\Pi^{\prime}$ such that $B\subseteq A$ . Let $1=\tau_{1}<\tau_{2}<\cdots<\tau_{k^{\prime}}<\tau_{k^{\prime}+1}=n+1$ be a sequence such that $B_{i}=\{\tau_{i},\tau_{i}+1,\ldots,\tau_{i+1}-1\}$ for any $i\in\{1,2,\ldots,k^{\prime}\}$ . We define the signs $w_{2},w_{3},\ldots,w_{k^{\prime}}\in\{0,1\}$ as

[TABLE]

For any $\Pi$ and $w_{2},w_{3},\ldots,w_{k^{\prime}}$ taken as above, we define a convex cone $T(\Pi,w)$ as

[TABLE]

where $\mathcal{V}_{-}^{B_{i}}(v_{B_{i}})$ is the lower total variation for the restricted vector $v_{B_{i}}$ . Then, for the tangent cone $T_{K_{-}(\mathcal{V})}(\theta)$ , we have the following:

(i)

If $\Pi=\Pi_{\mathrm{const}}(\theta)$ , then $T_{K_{-}(\mathcal{V})}(\theta)=T(\Pi,w)$ .

(ii)

If $\Pi$ is a connected refinement of $\Pi_{\mathrm{const}}(\theta)$ and $w$ is taken arbitrarily as above, then $T_{K_{-}(\mathcal{V})}(\theta)\subseteq T(\Pi,w)$ .

Proof.

First, we show that $T_{K_{-}(\mathcal{V})}(\theta)\subseteq T(\Pi,w)$ . By the definition of the tangent cone, it suffices to show that $v:=z-\theta\in T(\Pi,w)$ holds for any $z\in K_{-}(\mathcal{V})$ . Note that $\theta$ is constant on every $B_{i}\in\Pi$ since $\Pi$ is finer than the constant partition of $\theta$ . Since the lower total variation is not changed by adding the same constant to every coordinate, we have $\mathcal{V}_{-}^{B_{i}}(z_{B_{i}}-\theta_{B_{i}})=\mathcal{V}_{-}^{B_{i}}(z_{B_{i}})$ . Then, we have

[TABLE]

which proves $v\in T(\Pi,w)$ and hence (ii).

Next, we prove that $T(\Pi,w)\subseteq T_{K_{-}(\mathcal{V})}(\theta)$ under the assumption $\Pi=\Pi_{\mathrm{const}}(\theta)=\{B_{1},B_{2},\ldots,B_{k}\}$ . In this case, the definition of $w_{2},\ldots,w_{k}$ coincides with that in (4.1). Fix any $v\in T(\Pi,w)$ . We want to show that $v$ can be written as $v=t(z-\theta)$ for some $t>0$ and $z\in K_{-}(\mathcal{V})$ . To this end, we check that there exists a (sufficiently small) $t^{-1}>0$ such that $\theta+t^{-1}v\in K_{-}(\mathcal{V})$ . Here, we have

[TABLE]

Recall that $w_{2},\ldots,w_{k}$ are chosen so that $(\theta_{\tau_{i}-1}-\theta_{\tau_{i}})_{+}=w_{i}(\theta_{\tau_{i}-1}-\theta_{\tau_{i}})$ . We can choose sufficiently small $t^{-1}>0$ so that

[TABLE]

for every $i=2,3,\ldots,k$ . Indeed, if we choose $t^{-1}>0$ so that

[TABLE]

the signs of $\theta$ do not change by adding $t^{-1}v$ . Consequently, we have

[TABLE]

This proves that $T(\Pi,w)\subseteq T_{K_{-}(\mathcal{V})}(\theta)$ and hence (i). ∎

From Proposition D.2-(i), we can bound the statistical dimension by the Gaussian width as follows:

[TABLE]

Here, $B_{n}:=\{v\in\mathbb{R}^{n}:\lVert v\rVert_{2}\leq 1\}$ is the unit ball in $\mathbb{R}^{n}$ . Hence, it suffices to consider the set $T_{K_{-}(\mathcal{V})}(\theta)\cap B_{n}$ . In analogy to Lemma B.2 in Guntuboyina et al. (2017), we obtain the following characterization of this set.

Lemma D.7.

Let $\theta$ be a vector in $\mathbb{R}^{n}$ such that $\mathcal{V}_{-}(\theta)=\mathcal{V}$ . Let $\Pi=\{B_{1},B_{2},\ldots,B_{k^{\prime}}\}$ be any connected refinement of $\Pi_{\mathrm{const}}(\theta)$ . Define the signs $w_{2},w_{3},\ldots,w_{k^{\prime}}$ as in Lemma D.6, and let $w_{1}=w_{k^{\prime}+1}=0$ . Then, for every $v\in T_{K_{-}(\mathcal{V})}(\theta)$ with $\lVert v\rVert_{2}\leq 1$ , there exist indices $\ell_{1}\in B_{1},\ell_{2}\in B_{2},\ldots,\ell_{k^{\prime}}\in B_{k^{\prime}}$ such that

[TABLE]

where we define $\Gamma_{i}(v,\ell_{i})$ as

[TABLE]

Proof.

Fix $v\in T_{K_{-}(\mathcal{V})}(\theta)\cap B_{n}$ . By Lemma D.6, we have

[TABLE]

Let $\ell_{1}\in B_{1},\ell_{2}\in B_{2},\ldots,\ell_{k^{\prime}}\in B_{k^{\prime}}$ be indices which will be specified later. Defining $\Gamma_{i}(v,\ell_{i})$ as in (45), we can rewrite (46) as

[TABLE]

Now, let $t_{i}$ denote the $\ell_{2}$ norm of $v_{B_{i}}$ for $i=1,2,\ldots,k^{\prime}$ . By the assumption, $\sum_{i=1}^{k^{\prime}}t_{i}^{2}=\lVert v\rVert_{2}^{2}\leq 1$ . Then, for any $i\in\{1,2,\ldots,k^{\prime}\}$ , there exists $\ell_{i}\in B_{i}$ such that $|v_{\ell_{i}}|\leq t_{i}/\sqrt{|B_{i}|}$ . For these choices of $\ell_{i}$ , the right-hand side of (D.2) is bounded from above by

[TABLE]

which proves the desired result. ∎

Remark D.8.

Note that $\Gamma_{i}(v,\ell_{i})$ is always non-negative. This is checked as follows: First, the lower total variation is always at least as large as the difference of the boundary points, that is, for every $v\in\mathbb{R}^{m}$ , we have

[TABLE]

where $w$ is taken arbitrarily from $\{0,1\}$ . The equality holds if and only if $v$ is monotone non-increasing. Then, for any $\ell\in[m]$ and $w_{1},w_{2}\in\{0,1\}$ , we have

[TABLE]

In particular, we obtain $\Gamma_{i}(v,\ell_{i})\geq 0$ . If $\theta$ is monotone non-decreasing (i.e., $w_{0}=w_{1}=\cdots=w_{k+1}=0$ ), then the right-hand side of (44) equals [math], and so $\Gamma_{i}(v,\ell_{i})=0$ .
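The two elementary facts about the lower total variation used so far can be checked numerically. The sketch assumes the reading $\mathcal{V}_{-}(v)=\sum_{i}(v_{i}-v_{i+1})_{+}$ and the boundary inequality $\mathcal{V}_{-}(v)\geq w(v_{1}-v_{m})$ for $w\in\{0,1\}$ ; the random test vectors are arbitrary.

```python
import numpy as np

def lower_tv(v):
    """Lower total variation: sum of the downward jumps (v_i - v_{i+1})_+."""
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.maximum(v[:-1] - v[1:], 0.0)))

rng = np.random.default_rng(1)
ok = True
for _ in range(500):
    v = rng.standard_normal(8)
    for w in (0, 1):
        # Boundary inequality: V_-(v) >= w * (v_1 - v_m).
        ok = ok and lower_tv(v) >= w * (v[0] - v[-1]) - 1e-12
    # Adding a constant to every coordinate leaves the value unchanged.
    ok = ok and abs(lower_tv(v + 2.5) - lower_tv(v)) < 1e-12
```

For a non-increasing vector the inequality with $w=1$ is an equality, and for a non-decreasing vector the lower total variation vanishes.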

**Step 2: Quantizing the tangent cone.** Now, let $\Pi=\{B_{1},B_{2},\ldots,B_{k^{\prime}}\}$ be a connected refinement of $\Pi_{\mathrm{const}}(\theta)$ . Lemma D.7 implies that $T_{K_{-}(\mathcal{V})}(\theta)\cap B_{n}$ is contained in the set such that $\sum_{i=1}^{k^{\prime}}\lVert v_{B_{i}}\rVert_{2}^{2}\leq 1$ and $\sum_{i=1}^{k^{\prime}}\Gamma_{i}(v,\ell_{i})\leq\gamma$ for some $\ell_{i}\in B_{i}$ and $\gamma>0$ . From this perspective, we consider finitely many allocation patterns of the budgets for $\lVert v_{B_{i}}\rVert_{2}^{2}$ and $\Gamma_{i}(v,\ell_{i})$ . To be more precise, we construct a cover of the tangent cone in the following way. Consider a triple $(\mathbf{t},\mathbf{q},\mathbf{l})$ such that:

(a)

$\mathbf{t}=(t_{1},t_{2},\ldots,t_{k^{\prime}})$ and $\mathbf{q}=(q_{1},q_{2},\ldots,q_{k^{\prime}})$ are vectors consisting of non-negative numbers, and

(b)

$\mathbf{l}=(\ell_{1},\ell_{2},\ldots,\ell_{k^{\prime}})$ is a set of indices such that $\ell_{i}\in B_{i}$ for $i=1,2,\ldots,k^{\prime}$ .

For such a triple, we define a set

[TABLE]

where $\gamma$ is taken as the right-hand side of (44):

[TABLE]

Then, by quantizing the allocation vectors $\mathbf{t}$ and $\mathbf{q}$ , we can cover the set $T_{K_{-}(\mathcal{V})}(\theta)\cap B_{n}$ with finitely many sets of the form $T(\mathbf{t},\mathbf{q},\mathbf{l})$ , as stated in the following lemma.

Lemma D.9.

Suppose that $\Pi=(B_{1},B_{2},\ldots,B_{k^{\prime}})$ is a connected refinement of $\Pi_{\mathrm{const}}(\theta)$ . Define the signs $w_{1},w_{2},\ldots,w_{k^{\prime}}$ as in Lemma D.7. Let $\mathcal{Q}$ be the set of allocation vectors satisfying the following condition: there exists an integer vector $\mathbf{m}=(m_{1},m_{2},\ldots,m_{k^{\prime}})\in\mathbb{N}^{k^{\prime}}$ such that $1\leq m_{i}\leq k^{\prime}$ ( $i=1,2,\ldots,k^{\prime}$ ) and $\sum_{i=1}^{k^{\prime}}m_{i}\leq 2k^{\prime}$ , and the allocation vector $q=(q_{1},q_{2},\ldots,q_{k^{\prime}})\in\mathcal{Q}$ can be written as

[TABLE]

Let $\mathcal{L}$ be a set of indices $\mathbf{l}=(\ell_{1},\ell_{2},\ldots,\ell_{k^{\prime}})$ such that $\ell_{i}\in B_{i}$ for all $i=1,2,\ldots,k^{\prime}$ . Given $\mathbf{t},\mathbf{q}\in\mathcal{Q}$ and $\mathbf{l}\in\mathcal{L}$ , we define a set $T(\mathbf{t},\mathbf{q},\mathbf{l})$ as (48). Then, we have

[TABLE]

Proof.

Fix any vector $v$ in $T(\Pi,w)\cap B_{n}$ . Since $\lVert v_{B_{i}}\rVert_{2}^{2}\leq\lVert v\rVert_{2}^{2}\leq 1$ , there exists an integer $1\leq m_{i}\leq k^{\prime}$ such that

[TABLE]

Summing over $i=1,2,\ldots,k^{\prime}$ , we have

[TABLE]

which implies $\mathbf{t}=(m_{1}/k^{\prime},\ldots,m_{k^{\prime}}/k^{\prime})\in\mathcal{Q}$ .

Next, by Lemma D.7, there exist $\mathbf{l}=(\ell_{1},\ldots,\ell_{k^{\prime}})\in\mathcal{L}$ such that $\sum_{i=1}^{k^{\prime}}\Gamma_{i}(v,\ell_{i})\leq\gamma.$ Hence, for any $i$ , there exists an integer $1\leq l_{i}\leq k^{\prime}$ such that

[TABLE]

Suppose $\gamma>0$ . Summing over $i=1,2,\ldots,k^{\prime}$ , we have $\sum_{i=1}^{k^{\prime}}l_{i}\leq 2k^{\prime}$ and thus $\mathbf{q}=(l_{1}/k^{\prime},\ldots,l_{k^{\prime}}/k^{\prime})\in\mathcal{Q}$ . For the case of $\gamma=0$ , it is clear that $\mathbf{q}=(1/k^{\prime},1/k^{\prime},\ldots,1/k^{\prime})\in\mathcal{Q}$ . ∎
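The quantization used in this proof can be illustrated numerically. The sketch assumes that $m_{i}$ is chosen as the smallest integer in $\{1,\ldots,k^{\prime}\}$ with $\lVert v_{B_{i}}\rVert_{2}^{2}\leq m_{i}/k^{\prime}$ , so that $m_{i}\leq k^{\prime}\lVert v_{B_{i}}\rVert_{2}^{2}+1$ and hence $\sum_{i}m_{i}\leq 2k^{\prime}$ ; the function name and the example budgets are hypothetical.

```python
import math

def quantize(values, k):
    """Smallest integers m_i in {1, ..., k} with values[i] <= m_i / k."""
    return [min(k, max(1, math.ceil(k * s))) for s in values]

block_energy = [0.40, 0.05, 0.30, 0.0, 0.25]   # ||v_{B_i}||_2^2, summing to 1
k = len(block_energy)
m = quantize(block_energy, k)
```

Each budget is rounded up to the next multiple of $1/k^{\prime}$ , which costs at most $1/k^{\prime}$ per block and therefore at most one unit in total.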

We should note that the cardinalities of $\mathcal{Q}$ and $\mathcal{L}$ are respectively bounded as follows:

Proposition D.10.

Let $\mathcal{Q}$ and $\mathcal{L}$ be the sets defined in Lemma D.9. Then, we have:

(i)

$\log|\mathcal{Q}|\leq 2k^{\prime}\log 2\mathrm{e}$ , and

(ii)

$\log|\mathcal{L}|\leq k^{\prime}\log\frac{n}{k^{\prime}}$ .

Proof.

For the first part, we observe that $|\mathcal{Q}|$ is not larger than the cardinality of

[TABLE]

Then, we have

[TABLE]

The proof of the inequality (a) in the above can be found in Proposition 4.3 of Dudley (2014).

The second part is obtained by Jensen’s inequality as

[TABLE]

∎
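Both cardinality bounds can be verified by brute force for small instances. The sketch below enumerates the integer vectors from the definition of $\mathcal{Q}$ and compares their number with the binomial coefficient $\binom{2k^{\prime}}{k^{\prime}}$ , which is at most $(2\mathrm{e})^{k^{\prime}}$ ; the block sizes are hypothetical.

```python
import itertools
import math

def count_allocations(k):
    """Number of integer vectors with 1 <= m_i <= k and sum(m) <= 2k."""
    return sum(1 for m in itertools.product(range(1, k + 1), repeat=k)
               if sum(m) <= 2 * k)

def log_index_count(block_sizes):
    """log |L| = sum of log |B_i|, since one index is chosen per block."""
    return sum(math.log(b) for b in block_sizes)

k = 4
n_alloc = count_allocations(k)
sizes = [1, 5, 2, 12]                 # hypothetical block sizes, n = 20
```

The second bound is Jensen's inequality applied to the concave logarithm: the average of $\log|B_{i}|$ is at most the logarithm of the average block size $n/k^{\prime}$ .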

**Step 3: Controlling Gaussian widths.** As mentioned before, our goal is to obtain an upper bound of the Gaussian width

[TABLE]

where we adopt the convention that $\mathbb{E}=\mathbb{E}_{Z\sim N(0,I_{n})}$ . Let $(\Pi,w)$ be a pair of a partition and a sign vector of knots defined as in Lemma D.7. Using the decomposition in Lemma D.9, we have

[TABLE]

Besides, leveraging a general result for Gaussian suprema (see Lemma F.4 below), we have

[TABLE]

Here, we used Proposition D.10 to bound the cardinality of the set $\mathcal{Q}^{2}\times\mathcal{L}$ . More precisely, we used the following evaluation:

[TABLE]

Given $\mathbf{t},\mathbf{q}\in\mathcal{Q}$ and $\mathbf{l}\in\mathcal{L}$ , we define

[TABLE]

Dividing the supremum into $k^{\prime}$ pieces $v_{B_{1}},v_{B_{2}},\ldots,v_{B_{k^{\prime}}}$ , this quantity is bounded from above as $\tilde{W}(\mathbf{t},\mathbf{q},\mathbf{l})\leq\sum_{i=1}^{k^{\prime}}\tilde{W}_{i}(t_{i},q_{i},\ell_{i})$ , where

[TABLE]

Here, we write $T_{i}(t_{i},q_{i},\ell_{i}):=\{v_{B_{i}}\in\mathbb{R}^{B_{i}}:\ \lVert v_{B_{i}}\rVert_{2}^{2}\leq t_{i},\ \Gamma_{i}(v,\ell_{i})\leq q_{i}\gamma\}$ .

We now consider the quantity (53). In the set $T_{i}(t_{i},q_{i},\ell_{i})$ over which the supremum is taken, the lower total variation of $v_{B_{i}}$ is bounded from above as

[TABLE]

As mentioned in Remark D.8, the reverse inequality

[TABLE]

is always true, and the equality can hold only if the two sub-vectors $(v_{\tau_{i}},v_{\tau_{i}+1},\ldots,v_{\ell_{i}})$ and $(v_{\ell_{i}},v_{\ell_{i}+1},\ldots,v_{\tau_{i+1}-1})$ are each monotone (either non-decreasing or non-increasing). From this point of view, the condition (54) means that $v_{B_{i}}$ is approximated by two nearly monotone pieces. This suggests that the complexity of $T_{i}(t_{i},q_{i},\ell_{i})$ can be evaluated by that of the class of monotone functions.

Below, we provide the upper bound of the Gaussian width of the form (53). First, the following lemma treats a special case where $\ell_{i}$ is taken as the rightmost point in $B_{i}$ .

Lemma D.11.

For every $n\geq 1$ , $t>0$ , $w\in\{0,1\}$ and $\gamma\geq 0$ , we have

[TABLE]

Proof.

The proof is divided into two cases where $w=1$ and $w=0$ .

Case 1 ( $w=1$ ): By scaling properly, we need only consider the case where $t=1$ . For a vector $v\in\mathbb{R}^{n}$ , we define a monotone vector $v^{+}$ as

[TABLE]

We also define another monotone vector $v^{-}$ as

[TABLE]

It is easy to check that $v=v^{+}-v^{-}$ . Using these notations, we have

[TABLE]

Hence, the condition $\mathcal{V}_{-}(v)\leq v_{1}-v_{n}+\gamma$ is equivalent to $v^{+}_{n}\leq\gamma$ , which leads to

[TABLE]

and

[TABLE]

Denote by $\tilde{W}$ the left-hand side in (D.11) with $t=1$ . The argument in the previous paragraph implies that

[TABLE]

The expectation in the last line is bounded as

[TABLE]

Here, the first inequality is Jensen’s inequality, and the second inequality is a consequence of equation (D.12) in Amelunxen et al. (2014). Combining with (56), we have the desired result.

Case 2 ( $w=0$ ): We can assume w.l.o.g. $t=1$ . As in Case 1, we write a vector as a difference of monotone vectors. For $v\in\mathbb{R}^{n}$ , we define $v^{+}$ and $v^{-}$ as

[TABLE]

and

[TABLE]

respectively. Under this notation, the condition $\mathcal{V}_{-}(v)\leq\gamma$ is equivalent to $v^{-}_{n}\leq\gamma$ , and therefore we have

[TABLE]

Then, a similar argument as Case 1 yields the result. ∎
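The decomposition used in both cases can be illustrated concretely. The sketch below is one consistent choice of $v^{+}$ and $v^{-}$ for Case 1, not necessarily the exact normalization of the paper (whose displays are not reproduced here): $v^{+}$ accumulates the upward increments starting from $0$, and $v^{-}$ accumulates the downward increments starting from $-v_{1}$. With this choice $v^{+}_{n}$ equals the sum of upward jumps, and since $v_{n}-v_{1}$ is the sum of upward jumps minus $\mathcal{V}_{-}(v)$, the condition $\mathcal{V}_{-}(v)\leq v_{1}-v_{n}+\gamma$ is the same as $v^{+}_{n}\leq\gamma$. The function names are hypothetical.

```python
def pos(x):
    """Positive part (x)_+."""
    return x if x > 0 else 0.0

def upper_variation(v):
    """Sum of upward jumps of v."""
    return sum(pos(v[i + 1] - v[i]) for i in range(len(v) - 1))

def lower_variation(v):
    """Lower total variation V_-(v): sum of downward jumps of v."""
    return sum(pos(v[i] - v[i + 1]) for i in range(len(v) - 1))

def decompose(v):
    """Return non-decreasing vectors (v_plus, v_minus) with v = v_plus - v_minus.

    Assumed Case 1 normalization: v_plus starts at 0, v_minus starts at -v[0].
    """
    v_plus, v_minus = [0.0], [-v[0]]
    for i in range(len(v) - 1):
        v_plus.append(v_plus[-1] + pos(v[i + 1] - v[i]))
        v_minus.append(v_minus[-1] + pos(v[i] - v[i + 1]))
    return v_plus, v_minus

v = [1.0, 3.0, 2.0, 2.5, 0.0]
v_plus, v_minus = decompose(v)
# The slack V_-(v) - (v_1 - v_n) coincides with the last entry of v_plus.
slack = lower_variation(v) - (v[0] - v[-1])
```

For Case 2, moving the offset to the other vector (so that $v^{-}$ starts at $0$ and $v^{+}$ starts at $v_{1}$) makes $v^{-}_{n}$ equal to $\mathcal{V}_{-}(v)$, which matches the equivalence stated there.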

Next, the following lemma provides an upper bound of $\tilde{W}_{i}$ for general choices of $\ell_{i}\in B_{i}$ .

Lemma D.12.

Fix $n\geq 1$ , $1\leq\ell\leq n$ , $t>0$ and $\gamma\geq 0$ . For every $w_{1},w_{2}\in\{0,1\}$ , the quantity

[TABLE]

is bounded from above as

[TABLE]

In particular, we deduce a simpler bound

[TABLE]

Proof.

Let $(A_{1},A_{2})$ be a pair of subsets of $[n]$ defined as $A_{1}=\{1,2,\ldots,\ell\}$ and $A_{2}=\{\ell,\ell+1,\ldots,n\}$ . If either $\ell=1$ or $\ell=n$ (i.e., one of $A_{1}$ and $A_{2}$ becomes a singleton), the result is a direct consequence of Lemma D.11.

Henceforth, we assume that $1<\ell<n$ . Suppose that $v\in\mathbb{R}^{n}$ satisfies the assumption $\mathcal{V}_{-}(v)\leq w_{1}(v_{1}-v_{\ell})+w_{2}(v_{\ell}-v_{n})+\gamma$ . Since $\mathcal{V}_{-}(v)\geq\mathcal{V}_{-}^{A_{1}}(v_{A_{1}})+w_{2}(v_{\ell}-v_{n})$ , we have

[TABLE]

Similarly, we have

[TABLE]

Based on these observations, we reduce to

[TABLE]

in which both terms in the right-hand side can be bounded using Lemma D.11. ∎

Before going to the next step, we summarize the results in Step 3 as follows.

Proposition D.13.

Fix $\theta\in\mathbb{R}^{n}$ . Let $\Pi=(B_{1},B_{2},\ldots,B_{k^{\prime}})$ be any connected refinement of $\Pi_{\mathrm{const}}(\theta)$ , and $w_{1},w_{2},\ldots,w_{k^{\prime}}$ be the signs associated with $\Pi$ as in Lemma D.7. Define $\gamma\geq 0$ as (49). Then, the quantity $\tilde{W}(\theta)$ defined in (53) is bounded from above by

[TABLE]

Proof.

This is a direct consequence of (52) and (58). ∎

**Step 4: Applying Lemma D.3** We are now ready to complete the proof of Theorem 4.1.

Recall that our goal is to obtain an upper bound for $\tilde{W}(\theta)$ which is defined in (53). To this end, we will construct a suitable refinement of $\Pi_{\mathrm{const}}(\theta)$ with moderate piece lengths so that we can control the first term in (59). In fact, from an argument parallel to that in Guntuboyina et al. (2017), there exists a refinement $\Pi=(B_{1},B_{2},\ldots,B_{k^{\prime}})$ such that

[TABLE]

and $k(\theta)\leq k^{\prime}\leq 2k(\theta)$ . We also define the signs $w_{1},w_{2},\ldots,w_{k^{\prime}}$ in a similar way as Lemma D.6, but if the knot $\tau_{i}$ is not contained in the original partition $\Pi_{\mathrm{const}}(\theta)$ , the corresponding sign $w_{i}$ will be specified later.
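One way to build such a refinement is to cut every long block into nearly equal consecutive pieces. The sketch below assumes that the required length condition is of the form $|B_{i}|\leq\lceil n/k\rceil$ with $k=k(\theta)$; the exact condition in the display above is not reproduced here, so this bound and the helper name `refine` are assumptions made for illustration.

```python
def refine(block_lengths):
    """Split each block into nearly equal consecutive pieces of length <= ceil(n / k).

    block_lengths: lengths of the k blocks of a connected partition of [n].
    Returns the lengths of the pieces of the refined partition, in order.
    """
    n, k = sum(block_lengths), len(block_lengths)
    pieces = []
    for length in block_lengths:
        p = -(-length * k // n)            # ceil(length * k / n) pieces for this block
        base, extra = divmod(length, p)    # 'extra' pieces get one additional element
        pieces.extend([base + 1] * extra + [base] * (p - extra))
    return pieces

lengths = [10, 1, 1, 8]                    # n = 20, k = 4
refined = refine(lengths)
```

Since block $j$ is cut into $\lceil |A_{j}|k/n\rceil$ pieces and these numbers sum to less than $2k$, the refined partition has between $k$ and $2k$ pieces, in line with $k(\theta)\leq k^{\prime}\leq 2k(\theta)$.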

We can bound the first term in (59) in the following two steps. First, from the Cauchy–Schwarz inequality and the fact that $\mathbf{t}\in\mathcal{Q}$ , we have

[TABLE]

Second, by the above construction of $\Pi$ , we have

[TABLE]

Therefore, the right-hand side in (59) can be bounded from above by

[TABLE]

Here, to hide the constant term $\sqrt{\pi/2}$ , we have also used the fact that $\sqrt{m\log(\mathrm{e}n/m)}\geq 1$ for every integer $1\leq m\leq n$ .

Let $w^{0}_{1},w^{0}_{2},\ldots,w^{0}_{k(\theta)+1}$ be the signs associated with the constant partition $\Pi_{\mathrm{const}}(\theta)=(A_{1},A_{2},\ldots,A_{k(\theta)})$ (recall the definition (4.1)). Then, we can choose the values of $w_{i}$ so that the following inequality holds:

[TABLE]

In fact, this is possible if we choose $w_{i}$ as the sign $w_{j}^{0}$ for the nearest knot that is to the right of $\tau_{i}$ . Combining (D.2), (60) and Proposition D.2, the statistical dimension of $T_{K_{-}(\mathcal{V})}(\theta)$ is bounded from above as

[TABLE]

where we also used the elementary fact that $(a+b)^{2}\leq 2(a^{2}+b^{2})$ . Consequently, applying Lemma D.3, we obtain the desired result.

Remark D.14 (Non-Gaussian noises).

In the non-Gaussian noise setting, we can prove a result analogous to Proposition D.5. We sketch the proof of such a generalization.

The proof of Proposition D.5 consists of (i) a decomposition argument for the tangent cone and (ii) bounds for some probabilistic quantities (i.e., the statistical dimension and the Gaussian width). The former argument is completely deterministic and independent from the distributional assumption on the noise variables. Regarding the probabilistic bounds, we used the following bound for (Gaussian) statistical dimension of $K_{n}^{\uparrow}$ :

[TABLE]

Hence, if we can obtain a similar bound for non-Gaussian random variables, we can prove an analogous result to Proposition D.5.

Let $\xi_{1},\ldots,\xi_{n}$ be i.i.d. random variables with $\mathbb{E}[\xi_{1}]=0$ and $\mathrm{Var}(\xi_{1})=\sigma^{2}$ . For a convex cone $C$ , we define the statistical dimension as

[TABLE]

Here, we write $\mathrm{Proj}_{C}(x)=\operatornamewithlimits{argmin}_{z\in C}\lVert z-x\rVert_{2}$ , and the last equality holds from a deterministic relation

[TABLE]

(See Amelunxen et al. (2014) for details). Then, from Theorem 3.1 in Chatterjee et al. (2015), we can check that

[TABLE]

Therefore, by following a similar argument as the proof of Proposition D.5, we conclude that

[TABLE]

for some universal constant $C^{\prime}>0$ . As a consequence, we can prove the expected risk bound similar to (4.7) for non-Gaussian noise variables.
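The quantity $\mathbb{E}\lVert\mathrm{Proj}_{C}(\xi)\rVert_{2}^{2}$ for the monotone cone $C=K_{n}^{\uparrow}$ can be estimated by simulation. The following sketch uses standard Gaussian noise, for which the statistical dimension of $K_{n}^{\uparrow}$ is known to equal the harmonic number $\sum_{i=1}^{n}1/i\leq\log(\mathrm{e}n)$ (see Amelunxen et al. (2014)). The projection is computed with a plain pool-adjacent-violators routine; the function names, sample size and seed are arbitrary choices for illustration.

```python
import math
import random

def project_monotone(y):
    """Euclidean projection of y onto {x_1 <= ... <= x_n} by pool adjacent violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge while the previous block mean exceeds the last block mean.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [s / c for s, c in blocks for _ in range(c)]

def estimate_statistical_dimension(n, draws, seed=0):
    """Monte Carlo estimate of E ||Proj(Z)||^2 for Z ~ N(0, I_n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += sum(x * x for x in project_monotone(z))
    return total / draws

n = 20
harmonic = sum(1.0 / i for i in range(1, n + 1))
estimate = estimate_statistical_dimension(n, draws=2000)
print(harmonic, math.log(math.e * n), estimate)
```

Replacing `rng.gauss` with another centered, unit-variance sampler gives a quick empirical look at how the same quantity behaves for non-Gaussian noise, which is the setting of this remark.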

D.3 Proof of Corollary 4.4

Let $\alpha>0$ be a number to be specified later. Define a vector $\theta^{\prime}\in\mathbb{R}^{n}$ as $\theta^{\prime}_{1}=\theta^{*}_{1}$ and

[TABLE]

Then, we have $\mathcal{V}_{-}(\theta^{\prime})=\alpha\mathcal{V}_{-}(\theta^{*})$ . Moreover, the constant partition and the sign of $\theta^{\prime}$ (defined in (4.1)) are the same as those of $\theta^{*}$ , and therefore $k(\theta^{\prime})=k(\theta^{*})$ and $M(\theta^{\prime})=M(\theta^{*})$ .
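A construction with these properties can be sketched as follows. It assumes that $\theta^{\prime}$ is obtained by keeping every upward increment of $\theta^{*}$ and multiplying every downward increment by $\alpha$; the defining display is not reproduced here, so this rule is an assumption chosen to be consistent with the stated properties, and the function names are hypothetical.

```python
def lower_variation(theta):
    """Lower total variation: sum of downward jumps of theta."""
    return sum(max(theta[i] - theta[i + 1], 0.0) for i in range(len(theta) - 1))

def rescale_drops(theta, alpha):
    """Keep upward increments and multiply downward increments by alpha > 0.

    Assumed construction of theta'; the first coordinate is left unchanged.
    """
    out = [theta[0]]
    for i in range(len(theta) - 1):
        step = theta[i + 1] - theta[i]
        out.append(out[-1] + (step if step >= 0 else alpha * step))
    return out

theta_star = [0.0, 2.0, 1.0, 1.0, 3.0, 0.0]
target = 2.0                                   # plays the role of V
alpha = target / lower_variation(theta_star)   # here alpha = 0.5
theta_prime = rescale_drops(theta_star, alpha)
```

Because every increment keeps its sign (and zero increments stay zero), the constant pieces and the direction of each jump are unchanged, while the lower total variation is multiplied by $\alpha$.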

Now, we set $\alpha=\mathcal{V}/\mathcal{V}_{-}(\theta^{*})$ so that $\mathcal{V}_{-}(\theta^{\prime})=\mathcal{V}$ . Applying the upper bound (4.1), we have

[TABLE]

The first term in the right-hand side is bounded from above as

[TABLE]

From the minimal length condition (18) and the definition of $M(\theta)$ , we also have

[TABLE]

Combining the above inequalities, we have the desired result.

D.4 Risk bounds for penalized estimators (Proof of Theorem 4.7)

We prove Theorem 4.7 as an application of Lemma D.4. Let $\partial\mathcal{V}_{-}(\theta)$ denote the set of subgradients (i.e., subdifferential) of the convex function $\mathcal{V}_{-}(\cdot)$ at $\theta\in\mathbb{R}^{n}$ . The task is to provide a suitable upper bound for the Gaussian mean squared distance of the set $\lambda\partial\mathcal{V}_{-}(\theta)$ . To do this, we use the technique developed in Guntuboyina et al. (2017). The idea is stated roughly as follows: Recall that the Gaussian mean squared distance of a convex cone can be written as the statistical dimension of the polar cone (Proposition D.2-(ii)). This motivates us to relate the Gaussian mean squared distance $\mathbf{D}(\lambda\partial\mathcal{V}_{-}(\theta))$ to that of an associated cone. In particular, we consider the conic hull of the subdifferential:

[TABLE]

As we explain later, $\mathbf{D}(\mathord{\mathrm{cone}}(\partial\mathcal{V}_{-}(\theta)))$ can be evaluated by the results in the previous subsection. Then, we can complete the proof if we have an upper bound of the following form:

[TABLE]

where $\Delta(\theta,\lambda)$ is a residual term that depends on $\theta$ and $\lambda$ .

First, we show that $\mathbf{D}(\mathord{\mathrm{cone}}(\partial\mathcal{V}_{-}(\theta)))$ has exactly the same value as the statistical dimension of the tangent cone $T_{K_{-}(\mathcal{V}_{-}(\theta))}(\theta)$ , for which we have already provided a bound in the previous subsection.

Proposition D.15.

For any $\theta\in\mathbb{R}^{n}$ , the following equality holds:

[TABLE]

In particular, we have the following upper bound:

[TABLE]

where $C$ is the same universal constant as in Proposition D.5.

Proof.

Let us write $T:=T_{K_{-}(\mathcal{V}_{-}(\theta))}(\theta)$ . In light of Proposition D.2-(ii), it suffices to show that $T$ is the polar cone of $\mathord{\mathrm{cone}}(\partial\mathcal{V}_{-}(\theta))$ . By fundamental results in convex geometry, we always have

[TABLE]

for any convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ (see Lemma A.5 and Lemma A.5 in Guntuboyina et al. (2017)). For the case where $f=\mathcal{V}_{-}$ , the set $K(\theta)$ above is

[TABLE]

which implies the desired result. ∎

Next, we provide an inequality of the form (62). Since $\mathord{\mathrm{cone}}(\partial\mathcal{V}_{-}(\theta))\supseteq\lambda\partial\mathcal{V}_{-}(\theta)$ holds for every $\lambda\geq 0$ , the definition of the Gaussian mean squared distance (Definition D.1-(ii)) suggests that $\mathbf{D}(\mathord{\mathrm{cone}}(\partial\mathcal{V}_{-}(\theta)))\leq\mathbf{D}(\lambda\partial\mathcal{V}_{-}(\theta))$ . However, we need a reverse inequality (62). To this end, we use the following result proved by Guntuboyina et al. (2017).

Lemma D.16 (Guntuboyina et al. (2017), Proposition B.5).

Let $f:\mathbb{R}^{n}\to\mathbb{R}$ be a convex function, and $\theta\in\mathbb{R}^{n}$ . Define a vector $v_{0}$ as

[TABLE]

where $\mathord{\mathrm{aff}}(C)$ is the affine hull of the set $C\subseteq\mathbb{R}^{n}$ . Suppose that $v_{0}\neq 0$ . For any $z\in\mathbb{R}^{n}$ , define $\lambda(z)\geq 0$ as

[TABLE]

Then, $\lambda(z)$ is well-defined, and has a finite expectation $\mathbb{E}_{Z\sim N(0,I_{n})}[\lambda(Z)]<\infty$ .

Further, define $\lambda^{*}$ as

[TABLE]

Then, for every $\lambda\geq\lambda^{*}$ and $v^{*}\in\partial f(\theta)$ , we have

[TABLE]

Before proceeding, we introduce an additional terminology: A convex function $f:\mathbb{R}^{n}\to\mathbb{R}$ is said to be weakly decomposable if we have

[TABLE]

for every $\theta\in\mathbb{R}^{n}$ . In other words, we can choose $v_{0}\equiv v^{*}$ in (64) if $f$ is weakly decomposable. Under the assumption that $f$ is weakly decomposable, the inequality (64) can be simplified as follows:

Corollary D.17.

Suppose that $f:\mathbb{R}^{n}\to\mathbb{R}$ is convex and weakly decomposable. Under the same notation as in Lemma D.16, we have

[TABLE]

Now, we apply Lemma D.16 to the case $f=\mathcal{V}_{-}$ . The following proposition provides the structural information of $\partial\mathcal{V}_{-}(\theta)$ that we need for evaluating the upper bound (64). The proof is postponed to Appendix D.6.

Proposition D.18.

(i)

$\theta\mapsto\mathcal{V}_{-}(\theta)$ is weakly decomposable.

(ii)

For any $\theta\in\mathbb{R}^{n}$ , let us define $v_{0}$ as in (63). Then, we have

[TABLE]

From Proposition D.18 and Corollary D.17, $\mathbf{D}(\lambda\partial\mathcal{V}_{-}(\theta))$ is bounded from above by

[TABLE]

provided that $\lambda\geq\lambda^{*}$ . Here, $C^{\prime}>0$ is a universal constant. Combining this bound with Lemma D.4, we obtain the desired risk bound.

Lastly, we provide an upper bound for the optimal tuning parameter $\lambda^{*}$ . This is obtained from the following estimate of $\mathbb{E}[\lambda(Z)]$ .

Proposition D.19.

Suppose that $\theta\in\mathbb{R}^{n}$ and $\mathcal{V}_{-}(\theta)>0$ . For any $z\in\mathbb{R}^{n}$ , define $\lambda(z)$ as

[TABLE]

Then, we have

[TABLE]

where $\mathbb{E}$ is the expectation with respect to $Z\sim N(0,I_{n})$ .

Proof.

Let $C:=\mathord{\mathrm{cone}}(\partial\mathcal{V}_{-}(\theta))$ be the conic hull of $\partial\mathcal{V}_{-}(\theta)$ , and let $P_{C}$ denote the orthogonal projection map onto $C$ . By the definition of $\lambda(z)$ , there exists a vector $v(z)\in\partial\mathcal{V}_{-}(\theta)$ such that $\lambda(z)v(z)=P_{C}(z)$ .

First, we show a partial result

[TABLE]

As we will see in Appendix D.6, $\mathcal{V}_{-}$ is the support function of a certain convex set. Then, by the fundamental fact for support functions that $\langle\theta,v\rangle=\mathcal{V}_{-}(\theta)$ for all $v\in\partial\mathcal{V}_{-}(\theta)$ (see Corollary 8.25 in Rockafellar and Wets (1998)), we have

[TABLE]

Here, in the last line, $T:=T_{K_{-}(\mathcal{V}_{-}(\theta))}(\theta)$ is the polar cone of $C$ (see Proposition D.15), and we used the Moreau decomposition $z=P_{C}(z)+P_{T}(z)$ . Taking the expectation of both sides with respect to $z\sim N(0,I_{n})$ , we have

[TABLE]

which implies the desired result. Here, we used the equality between the statistical dimension and the expected squared norm of projection: $\delta(T)=\mathbb{E}_{Z\sim N(0,I_{n})}\lVert P_{T}(Z)\rVert_{2}^{2}$ (see Proposition 3.1 in Amelunxen et al. (2014)).

To prove the other inequality, we use the characterization of $\mathord{\mathrm{aff}}(\partial\mathcal{V}_{-}(\theta))$ given in (72) in Appendix D.6 below. In particular, if we take $v^{*}$ as in (75), we have

[TABLE]

and

[TABLE]

and hence the result follows. ∎

D.5 Proof of Corollary 4.12

First, we explain that a monotone vector satisfying the moderate growth condition is approximated by a piecewise-constant vector such that the segments at both ends have sufficient lengths. To this end, we need the following lemma. Here, the first two statements (i) and (ii) are shown in Lemma 2 in Bellec and Tsybakov (2015). The third statement (iii) ensures that the moderate growth condition implies the minimal length condition (18).

Lemma D.20.

Let $\theta\in K^{\uparrow}_{n}$ be a monotone vector satisfying the moderate growth condition and $\theta_{n}-\theta_{1}=\mathcal{V}$ . Then, there exists another monotone vector $\theta^{\prime}\in K^{\uparrow}_{n}$ satisfying the following three conditions.

(i)

$\theta^{\prime}$ is $k$ -piecewise constant with

[TABLE]

Here, $\lceil t\rceil$ is the smallest integer that is not less than $t$ .

(ii)

We have

[TABLE]

and

[TABLE]

(iii)

Let $\Pi^{\prime}=\{A_{1},A_{2},\ldots,A_{k}\}$ be the partition on which $\theta^{\prime}$ is constant. Then, we have $|A_{1}|\geq n/k$ and $|A_{k}|\geq n/k$ .

Proof.

Let $k$ be an integer defined in (67). We construct a $k$ -piecewise constant monotone vector $\theta^{\prime}\in K_{n}^{\uparrow}$ as follows: First, define an equi-spaced partition $I_{1},I_{2},\ldots,I_{k}$ of the interval $[\theta_{1},\theta_{n}]$ as

[TABLE]

and $I_{k}:=[\theta_{1}+\frac{k-1}{k}\mathcal{V},\theta_{n}]$ . Next, define a partition $\Pi=(A_{1},A_{2},\ldots,A_{k})$ of $[n]$ as $A_{j}:=\{i\in[n]:\theta_{i}\in I_{j}\}$ ( $j=1,2,\ldots,k$ ). Then, let $\theta^{\prime}$ be a piecewise-constant vector such that $\theta^{\prime}_{i}:=\theta_{1}+\frac{j-1/2}{k}\mathcal{V}$ for $i\in A_{j}$ . See the right panel of Figure 4 for an illustrative example for $\theta$ and its piecewise-constant approximation $\theta^{\prime}$ . By a similar argument as Lemma 2 in Bellec and Tsybakov (2015), we can check (i) and (ii).

It remains to prove (iii) under the moderate growth condition. Below, we will only check that the maximal element in $A_{1}$ is not less than $n/k$ because $|A_{k}|\geq n/k$ can be checked in a similar way. Let $i^{*}:=\lceil n/k\rceil$ . Note that we have $i^{*}\leq\lceil n/2\rceil$ since $k\geq 3$ . By the moderate growth condition, we have

[TABLE]

which means $i^{*}\in A_{1}$ and hence $|A_{1}|\geq\lceil n/k\rceil$ . ∎
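The construction of $\theta^{\prime}$ in this proof amounts to quantizing the values of $\theta$ on an equi-spaced grid. The sketch below assumes that the intervals are $I_{j}=[\theta_{1}+\frac{j-1}{k}\mathcal{V},\theta_{1}+\frac{j}{k}\mathcal{V})$ for $j<k$ (the display defining them is not reproduced here) and that $\mathcal{V}=\theta_{n}-\theta_{1}>0$. The function name is hypothetical.

```python
def piecewise_constant_approx(theta, k):
    """Quantize a non-decreasing vector theta into at most k levels.

    Entry i is replaced by the midpoint of the interval I_j that contains theta[i].
    Assumes theta[-1] > theta[0].
    """
    lo, total = theta[0], theta[-1] - theta[0]   # total plays the role of V
    approx = []
    for value in theta:
        # Index j in {1, ..., k} of the interval containing value;
        # the right endpoint theta_n is assigned to the last interval I_k.
        j = min(int((value - lo) / total * k) + 1, k)
        approx.append(lo + (j - 0.5) / k * total)
    return approx

theta = [0.0, 0.1, 0.2, 0.5, 0.9, 1.0]
k = 4
theta_prime = piecewise_constant_approx(theta, k)
max_error = max(abs(a - b) for a, b in zip(theta, theta_prime))
```

Each entry moves by at most half the width of an interval, that is, by at most $\mathcal{V}/(2k)$, and $\theta^{\prime}$ inherits the monotonicity of $\theta$.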

Now, we are ready to prove Corollary 4.12. Applying Lemma D.20 to every segment $A_{1},A_{2},\ldots,A_{m}$ , we have a $k$ -piecewise constant and $m$ -piecewise monotone vector $\theta^{\prime}\in\mathbb{R}^{n}$ such that

[TABLE]

and

[TABLE]

Moreover, $\theta^{\prime}$ satisfies the minimum length condition (18) with $c=1$ . Therefore, we have $M(\theta^{\prime})\leq 2(m-1)k/n$ and

[TABLE]

where we used an obvious inequality $m\leq k$ . Then, Theorem 4.7 implies that there exists $\lambda$ such that

[TABLE]

for some universal constant $C^{\prime}>0$ . This is the desired conclusion. Note that an upper bound for such $\lambda$ is suggested by Proposition 4.8.

D.6 Subdifferential and weak decomposability

In this subsection, we discuss the structure of the subdifferential of the nearly-isotonic type penalties. The main purpose is to discuss the weak decomposability (defined in Appendix D.4) of $\mathcal{V}_{-}$ .

D.6.1 Characterization of the subdifferential

First, we observe that $\mathcal{V}_{-}(\theta)=\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+}$ can be written as a support function of a certain convex set. In fact, by Theorem 8.24 in Rockafellar and Wets (1998), we can see that

[TABLE]

where $\mathcal{B}$ is a closed convex set. Conversely, once we have a convex function $\mathcal{V}_{-}$ , the set $\mathcal{B}$ is specified as

[TABLE]

Many properties of the support function can be understood through the structure of the set $\mathcal{B}$ ; in particular, we can characterize the subdifferential and weak decomposability. Below, we investigate the more detailed structure of the set $\mathcal{B}$ in terms of submodular functions.

Let $G=(V,E)$ be a directed graph equipped with positive edge weights $\{c_{(i,j)}\}$ . For any $\theta\in\mathbb{R}^{n}$ , we define a nearly-isotonic type penalty $\mathcal{V}_{G}(\theta)$ for the weighted graph $G$ as in (33). For any subset $A\subseteq[n]$ , we also define $\kappa_{G}(A)$ as the total weight of outgoing edges:

[TABLE]

The function $A\mapsto\kappa_{G}(A)$ is called the cut function of the weighted graph $G$ .

It is well known that the cut function is a submodular function. Here, a function $F:2^{[n]}\to\mathbb{R}$ is called submodular if $F(\emptyset)=0$ and

[TABLE]

holds for any subsets $A,B\subseteq[n]$ . We refer the reader to Bach (2013) for fundamental properties of submodular functions. For any submodular function $F:2^{[n]}\to\mathbb{R}$ , we define the base polyhedron $\mathcal{B}(F)\subseteq\mathbb{R}^{n}$ as

[TABLE]

The Lovász extension $f:\mathbb{R}^{n}\to\mathbb{R}$ of $F$ is defined as the support function of $\mathcal{B}(F)$ , that is, for any $\theta\in\mathbb{R}^{n}$ , $f(\theta):=\max_{v\in\mathcal{B}(F)}\langle v,\theta\rangle.$

We see that the nearly-isotonic type penalty (33) is actually the Lovász extension of the cut function (71).

Proposition D.21.

For any directed graph $G$ and edge weight $c_{(i,j)}$ , the function $\mathcal{V}_{G}$ is the Lovász extension of the cut function $\kappa_{G}$ .

Proof.

This is a consequence of the well-known result on the so-called greedy algorithm; see e.g., Proposition 3.2 in Bach (2013). In particular, we can find a derivation in Section 6.2 of Bach (2013). ∎
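Proposition D.21 can be checked numerically on the directed chain $1\to 2\to\cdots\to n$ with unit weights, where the penalty reduces to $\mathcal{V}_{-}(\theta)=\sum_{i=1}^{n-1}(\theta_{i}-\theta_{i+1})_{+}$. The sketch below evaluates the Lovász extension through the greedy formula (sort the coordinates in decreasing order and accumulate marginal gains of $F$) and compares it with the direct formula. The function names are hypothetical, and the unit-weight chain is the only graph covered.

```python
def cut(subset, n):
    """Cut function of the directed chain 1 -> 2 -> ... -> n with unit weights:
    the number of edges (i, i + 1) leaving the subset."""
    return sum(1 for i in subset if i < n and i + 1 not in subset)

def lovasz_extension(theta):
    """Greedy evaluation of the Lovasz extension of the chain cut function."""
    n = len(theta)
    order = sorted(range(1, n + 1), key=lambda i: -theta[i - 1])  # decreasing theta
    value, prefix, previous = 0.0, set(), 0
    for i in order:
        prefix.add(i)
        current = cut(prefix, n)
        value += theta[i - 1] * (current - previous)  # marginal gain of adding node i
        previous = current
    return value

def lower_variation(theta):
    """Direct formula: sum of downward jumps."""
    return sum(max(theta[i] - theta[i + 1], 0.0) for i in range(len(theta) - 1))

theta = [3.0, 1.0, 2.0, 0.0]
greedy_value = lovasz_extension(theta)
direct_value = lower_variation(theta)
```

The two values agree for this vector; the greedy formula does not depend on how ties among equal coordinates are broken.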

Now, we have the following useful characterizations of the subdifferential.

Proposition D.22.

Let $F:2^{[n]}\to\mathbb{R}$ be a submodular function and $f:\mathbb{R}^{n}\to\mathbb{R}$ be its Lovász extension. Suppose $\theta\in\mathbb{R}^{n}$ .

(i)

The subdifferential $\partial f(\theta)$ coincides with a face of $\mathcal{B}(F)$ given as

[TABLE]

(ii)

There is an (ordered) partition $(A_{1},A_{2},\ldots,A_{k})$ of $[n]$ such that

[TABLE]

where $S_{i}:=\bigcup_{j=1}^{i}A_{j}$ ( $i=1,2,\ldots,k$ ). In particular, we have $\partial f(\theta)=\mathcal{B}(F)\cap\mathord{\mathrm{aff}}(\partial f(\theta))$ .

(iii)

Let $v$ be any point in the relative interior of $\partial f(\theta)$ . Then, the normal cone of $\partial f(\theta)$ at $v$ is contained in the set of partition-wise constant vectors:

[TABLE]

Proof.

The first statement is a well-known property of the support function (Corollary 8.25 in Rockafellar and Wets (1998)). The second statement follows from the characterization of faces of the base polyhedron (see Proposition 4.7 in Bach (2013)). The third statement follows from (ii) and the characterization of normal cones of polyhedra (see Theorem 6.46 in Rockafellar and Wets (1998)). ∎

D.6.2 Weak decomposability

Here, we discuss the weak decomposability of the Lovász extension.

Before describing the result, we introduce some terminology. Let $F:2^{[n]}\to\mathbb{R}$ be a submodular function. We say that a set $A\subseteq[n]$ is separable for $F$ if there is a non-empty proper subset $B$ of $A$ such that $F(A)=F(B)+F(A\setminus B)$ . We also say that $A$ is inseparable if it is not separable. For example, if $F=\kappa_{G}$ is the cut function defined in (71), $A$ is inseparable if and only if it is a connected component in the graph $G$ . Furthermore, we define the following agglomerative clustering condition.

Definition D.23.

We say that a submodular function $F:2^{[n]}\to\mathbb{R}$ satisfies the agglomerative clustering (AC) condition if it has the following property: Let $A,B\subseteq[n]$ be any disjoint pair of subsets such that $A\neq\emptyset$ and $A$ is inseparable for the function $F_{B}^{A}:2^{A}\to\mathbb{R}$ defined by $F_{B}^{A}(C):=F(B\cup C)-F(B)$ . Then, for any $C\subset A$ , we have

[TABLE]

Recall the definition of weak decomposability (65). The following proposition provides a sufficient condition for the weak decomposability of the Lovász extension.

Proposition D.24.

Let $F:2^{[n]}\to\mathbb{R}$ be a submodular function satisfying the AC condition in Definition D.23. Then, the Lovász extension $f$ of $F$ is weakly decomposable.

Proof.

Fix $\theta\in\mathbb{R}^{n}$ . Since $f$ is the support function of the base polyhedron $\mathcal{B}(F)$ , $\partial f(\theta)$ coincides with a face of $\mathcal{B}(F)$ . Let $A_{1},A_{2},\ldots,A_{k}$ be a partition of $[n]$ such that $\mathord{\mathrm{aff}}(\partial f(\theta))$ is represented as (72). For $i=1,2,\ldots,k$ , we write $S_{0}:=\emptyset$ and $S_{i}:=A_{1}\cup A_{2}\cup\cdots\cup A_{i}$ . We should note that the above partition can be chosen so that $A_{i}$ is inseparable for the function defined as

[TABLE]

In this case, $\partial f(\theta)$ is an $n-k$ dimensional subset.

Define a vector $v^{*}$ as

[TABLE]

Since

[TABLE]

holds for any $i=1,\ldots,k$ , we have $v^{*}\in\mathord{\mathrm{aff}}(\partial f(\theta))$ . Moreover, $v^{*}$ is also contained in the normal cone of $\mathord{\mathrm{aff}}(\partial f(\theta))$ . Hence, if we prove $v^{*}\in\partial f(\theta)$ , we have

[TABLE]

which implies that $v^{*}\in\operatornamewithlimits{argmin}_{v\in\partial f(\theta)}\lVert v\rVert_{2}^{2}$ .

Now, our goal is to prove $v^{*}\in\partial f(\theta)$ under the AC condition. If $k=n$ , then it is clear from (72) that $\partial f(\theta)=\{v^{*}\}$ . Below, we assume that $k<n$ . Since $v^{*}\in\mathord{\mathrm{aff}}(\partial f(\theta))$ , it suffices to show that $\sum_{i\in S}v_{i}^{*}\leq F(S)$ holds for any $S\subseteq[n]$ that determines a relative boundary of $\partial f(\theta)$ . The relative boundary of $\partial f(\theta)$ can be written as the union of all $n-k-1$ dimensional faces of $\mathcal{B}(F)$ that have non-empty intersection with $\partial f(\theta)$ . Such faces can be characterized as follows: Let $\Pi=(A_{1},A_{2},\ldots,A_{k})$ be the partition defined above, and choose $A_{i}$ with $|A_{i}|\geq 2$ . Let $A^{\prime}_{i}$ be any non-empty proper subset of $A_{i}$ . We define a new ordered partition of $[n]$ by inserting $(A^{\prime}_{i},A_{i}\setminus A^{\prime}_{i})$ in place of $A_{i}$ :

[TABLE]

Then, $\Pi^{\prime}$ defines an $n-k-1$ dimensional affine subspace by (72), which defines a part of the relative boundary of $\partial f(\theta)$ . Therefore, we have to show that $\sum_{i\in S}v_{i}^{*}\leq F(S)$ for any $S$ that can be written as $S=S_{i-1}\cup A^{\prime}_{i}$ with $A^{\prime}_{i}\subset A_{i}$ . From the AC condition, we have

[TABLE]

This proves that $v^{*}\in\partial f(\theta)$ , and hence $f$ is weakly decomposable. ∎

Remark D.25.

The AC condition was originally introduced in Bach (2011). In that paper, the author considers the proximal denoising estimators (37) where $f$ is the Lovász extension of a submodular function $F$ . The name “agglomerative clustering” captures the following property: Let us consider the solution path of the minimization problem (37) parametrized by $\lambda$ , that is, the solution path is the collection $\{\hat{\theta}_{\lambda}\}_{\lambda\geq 0}$ calculated for all $\lambda\geq 0$ . In general, the solution path starts with $\hat{\theta}_{\lambda}=y$ for $\lambda=0$ , and $\hat{\theta}_{\lambda}$ shrinks toward some piecewise constant vector as $\lambda$ increases. Proposition 4 of Bach (2011) showed that the solution path is agglomerative if $F$ satisfies the AC condition.

We provide some examples of functions satisfying the AC condition:

•

Let $h:\mathbb{R}\to\mathbb{R}$ be a concave function with $h(0)=0$ . A submodular function defined as $F(A):=h(|A|)$ satisfies the AC condition. Examples of solution paths for this class can be found in Bach (2011).

•

The one-dimensional fused lasso has an agglomerative solution path. The corresponding submodular function is the cut function of the undirected one-dimensional grid graph, which satisfies the AC condition. Hence, by Proposition D.24, the penalty of the one-dimensional fused lasso is weakly decomposable. This provides an alternative proof for Lemma 2.7 in Guntuboyina et al. (2017). On the other hand, the fused lasso on the two-dimensional grid does not satisfy this condition. See Bach (2011) for details.

•

The nearly-isotonic regression (3) has an agglomerative solution path. A direct proof for this property is provided in Lemma 1 in Tibshirani et al. (2011). Below, we prove that the cut function for directed one-dimensional grid graph satisfies the AC condition, which provides an alternative proof for this fact.

The following proposition provides a proof for Proposition D.18.

Proposition D.26.

The cut function $F$ associated with the nearly-isotonic regression satisfies the AC condition. In particular, the lower total variation $\mathcal{V}_{-}(\theta)$ is weakly decomposable. Moreover, for any $\theta\in\mathbb{R}^{n}$ , the minimum value of the $\ell_{2}$ -norm in $\partial\mathcal{V}_{-}(\theta)$ is given by (66).

Proof.

For any $A\subseteq V:=[n]$ , $F(A)$ is given by the number of connected components in $A$ that do not contain the rightmost point $n$ . Let $A\subseteq[n]$ be a connected subset, and $B\subseteq[n]\setminus A$ . The value of $F(B\cup A)-F(B)$ depends on whether one or both of the two endpoints of $A$ are adjacent to $B$ .

We will check the AC condition by considering all patterns of adjacency as in Table 1.

Here, $C$ represents any proper subset of $A$ , and “None” means that $A$ contains $1$ or $n$ . In each case, we can easily check that the inequality (73) is satisfied. Hence, $F$ satisfies the AC condition.

The second statement is a consequence of Proposition D.24.

The last statement follows from the fact that the minimizer of $\lVert v\rVert_{2}^{2}$ in $\partial f(\theta)$ coincides with that in $\mathord{\mathrm{aff}}(\partial f(\theta))$ , which is given as (74). In this case, we can choose $A_{1},A_{2},\ldots,A_{k}$ as the constant partition of $\theta$ that is sorted by the values of $\theta$ . Thus, we have

[TABLE]

which proves the desired result. ∎
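The case analysis of Table 1 can be complemented by a brute-force check on small chains. The sketch below assumes that inequality (73) has the form used in Bach (2011), namely $\frac{|C|}{|A|}\{F(B\cup A)-F(B)\}\leq F(B\cup C)-F(B)$; the display itself is not reproduced here, so this form is an assumption. The check runs over every interval $A$, every $B\subseteq[n]\setminus A$ and every $C\subseteq A$, without testing inseparability of $A$ for $F_{B}^{A}$, so it verifies slightly more than the condition requires. The function names are hypothetical.

```python
import itertools

def cut(subset, n):
    """Cut function of the directed chain 1 -> 2 -> ... -> n with unit weights."""
    return sum(1 for i in subset if i < n and i + 1 not in subset)

def subsets(items):
    """All subsets of items, as frozensets."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)

def ac_holds_on_chain(n):
    """Check |C| * (F(B u A) - F(B)) <= |A| * (F(B u C) - F(B)) for all intervals A."""
    nodes = set(range(1, n + 1))
    for left in range(1, n + 1):
        for right in range(left, n + 1):
            a = frozenset(range(left, right + 1))
            for b in subsets(nodes - a):
                gain_a = cut(b | a, n) - cut(b, n)
                for c in subsets(a):
                    gain_c = cut(b | c, n) - cut(b, n)
                    if len(c) * gain_a > len(a) * gain_c:
                        return False
    return True

results = {n: ac_holds_on_chain(n) for n in range(1, 6)}
```

A finite check of this kind does not replace the proof, but it is a cheap way to catch a mistake in the table entries.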

Remark D.27 (Missing part in the proof of Proposition A.1).

With a slight modification of the above argument, we can show the AC condition for the cut function of the weighted graph

[TABLE]

where $c_{j}>0$ ( $j=1,\ldots,n-1$ ) are edge weights. As mentioned in Proposition A.1, we need this result to prove the validity of the modified PAVA algorithm (Algorithm 1). Here, we prove that (31) provides a sufficient condition for the AC condition, and hence the solution path of the weighted nearly-isotonic regression (30) is agglomerative.

Let $A\subseteq[n]$ be a non-empty connected subset, $B$ be a subset of $[n]\setminus A$ , and $C$ be a proper subset of $A$ . Recall that our goal is to check the inequality (73). For clarity, we write $A=\{j_{L},j_{L}+1,\ldots,j_{R}\}$ . As in the proof of Proposition D.26, we consider all adjacency patterns of $A$ , $B$ and $C$ . Then, we can easily check the following cases:

1. Suppose that either “ $j_{L}=1$ and $j_{R}+1\notin B$ ” or “ $j_{L}-1\notin B$ and $j_{R}+1\notin B$ ” holds. Then, we have $F(B\cup A)-F(B)=F(A)=c_{j_{R}}$ and $F(B\cup C)-F(B)=F(C)$ . Now, we will check (73) under the concavity condition (31). First, (73) trivially holds when $j_{R}\in C$ because in this case $F(C)\geq c_{j_{R}}=F(A)$ . Next, we assume $j_{R}\notin C$ . Let $i$ be the largest element in $C$ . Then, we have $F(C)\geq c_{i}$ and $|C|\leq i-j_{L}+1$ . Under the assumption (31), we have

[TABLE]

which implies (73).

2. Suppose that $j_{L}-1\in B$ and $j_{R}+1\notin B$ . Then, we have $F(B\cup A)-F(B)=c_{j_{R}}-c_{j_{L}-1}$ and $F(B\cup C)-F(B)\geq F(C)-c_{j_{L}-1}$ . By an argument similar to the above, (73) trivially holds when $j_{R}\in C$ . Let $j_{R}\notin C$ and let $i$ be the largest element in $C$ . Then, under the assumption (31), we have

[TABLE]

3. In the other cases, we have $F(B\cup A)-F(B)\leq F(B\cup C)-F(B)$ , which implies (73).

Appendix E Proofs in Section 5

The goal of this section is to prove Theorem 5.1. The outline of the proof is essentially the same as the framework of Theorem 4.18 in Massart (2007). We explain this framework in Section E.1. To complete the proof, we have to control the maximum value of a certain normalized Gaussian process. For this, we provide an upper bound in Section E.2.

E.1 Proof overview

Let $(\hat{\Pi},\hat{\mathbf{V}})$ be the selected pair in (27). Fix any connected partition $\Pi$ and $\mathbf{V}\in\mathscr{V}(|\Pi|)$ . By the definition of the estimator, we have

[TABLE]

for any vector $\theta^{\prime}$ that belongs to $K_{\Pi^{\prime}}^{\uparrow}(\mathbf{V}^{\prime})$ . In particular, we can choose $\theta^{\prime}$ as

[TABLE]

Substituting $y=\theta^{*}+\xi$ , we can deduce that

[TABLE]

Here, recall that $\xi$ is a random variable drawn from $N(0,\sigma^{2}I_{n})$ .

Let $z>0$ be a positive number and $c\in(0,1)$ . Suppose that an inequality

[TABLE]

holds on some event $\Omega_{z}$ that occurs with probability at least $1-\mathrm{e}^{-z}$ . Here, $\eta(\Pi,\mathbf{V},z)>0$ is a positive constant that can depend on $\Pi,\mathbf{V},z$ . Combining this inequality with (76), we have on the same event

[TABLE]

where we used the elementary inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$ .

E.2 Controlling the normalized process

Now, our goal is to provide an inequality of the form (77). Below, we fix $\theta^{\prime}:=\theta^{*}_{\Pi^{\prime},\mathbf{V}^{\prime}}$ .

First, we fix a partition $\Pi$ and $\mathbf{V}\in\mathscr{V}(|\Pi|)$ . For any $\theta\in K_{\Pi}^{\uparrow}(\mathbf{V})$ , we define

[TABLE]

where $\eta>0$ is a positive constant which will be specified later. Define a random variable $Z_{\Pi,\mathbf{V}}$ as

[TABLE]

Note that $Z_{\Pi,\mathbf{V}}$ is the supremum of a sample-continuous Gaussian process. By the concentration inequality for Gaussian processes (Lemma F.1), we have

[TABLE]

for any $x>0$ and $z>0$ . Here, the variance $v$ is bounded as

[TABLE]

because $\omega(\theta)\geq\lVert\theta-\theta^{\prime}\rVert_{2}^{2}+\eta\geq 2\eta^{1/2}\lVert\theta-\theta^{\prime}\rVert_{2}$ , and $\langle u,\xi\rangle$ is distributed according to $N(0,\sigma^{2}\lVert u\rVert_{2}^{2})$ for any $u\in\mathbb{R}^{n}$ .
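The two facts in the last sentence can be combined numerically. As a minimal sketch, assume that $\omega(\theta)$ is replaced by its lower bound $\lVert\theta-\theta^{\prime}\rVert_{2}^{2}+\eta$ and write $r=\lVert\theta-\theta^{\prime}\rVert_{2}$. Then the normalized variable $\langle\theta-\theta^{\prime},\xi\rangle/\omega(\theta)$ has variance at most $\sigma^{2}r^{2}/(r^{2}+\eta)^{2}$, and the inequality $r^{2}+\eta\geq 2\eta^{1/2}r$ caps this by $\sigma^{2}/(4\eta)$. The function names below are hypothetical.

```python
import math


def normalized_variance(r, eta, sigma):
    """Variance of <u, xi> / (||u||^2 + eta) when ||u|| = r and xi ~ N(0, sigma^2 I)."""
    return (sigma * r / (r * r + eta)) ** 2


def variance_cap(eta, sigma):
    """Cap sigma^2 / (4 eta) implied by r^2 + eta >= 2 sqrt(eta) r."""
    return sigma ** 2 / (4.0 * eta)


def worst_case_over_grid(eta, sigma, r_max=50.0, steps=5001):
    """Largest normalized variance over an equally spaced grid of radii in [0, r_max]."""
    radii = [r_max * k / (steps - 1) for k in range(steps)]
    return max(normalized_variance(r, eta, sigma) for r in radii)


eta, sigma = 2.0, 1.5
print(worst_case_over_grid(eta, sigma), variance_cap(eta, sigma))
```

The cap is attained at $r=\eta^{1/2}$, so under this assumption it cannot be improved by a constant factor.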

We will provide an upper bound for $\mathbb{E}[Z_{\Pi,\mathbf{V}}]$ . Let $\theta^{*}_{\Pi,\mathbf{V}}$ be the orthogonal projection of $\theta^{*}$ onto $K_{\Pi}^{\uparrow}(\mathbf{V})$ . Note that

[TABLE]

The second term (b) in the right-hand side of (80) is bounded from above by $\sigma\eta^{-1/2}$ . Indeed, since

[TABLE]

we have

[TABLE]

To bound the term (a) in (80), we use the following lemma:

Lemma E.1.

Let $\Pi=(A_{1},A_{2},\ldots,A_{m})$ be any partition and $\mathbf{V}=(\mathcal{V}_{1},\mathcal{V}_{2},\ldots,\mathcal{V}_{m})$ . Fix any $\bar{\theta}\in K_{\Pi}^{\uparrow}(\mathbf{V})$ . For any $t>0$ , we have

[TABLE]

where $C>0$ is a universal constant. Furthermore, for any $\eta>0$, we have

[TABLE]

where $C$ is the same constant as in (81).

Proof.

We will prove the first inequality (81). Let $W:=W(\Pi,\mathbf{V})$ denote the left-hand side of (81). We consider a collection of finitely many sets $S(\mathbf{q})$ as follows: Let $\mathcal{Q}:=\mathcal{Q}(m)$ be a collection of vectors $\mathbf{q}=(q_{1},q_{2},\ldots,q_{m})$ that can be written as $\mathbf{q}=t^{2}\mathbf{a}/m$ for some integer vector $\mathbf{a}=(a_{1},a_{2},\ldots,a_{m})$ such that $1\leq a_{i}\leq m$ and $\sum_{i=1}^{m}a_{i}\leq 2m$ . Note that, by Proposition D.10, the cardinality of $\mathcal{Q}$ is bounded by $(2\mathrm{e})^{m}$ . For any $\mathbf{q}\in\mathcal{Q}$ , define the set

[TABLE]

Then, we can easily check that

[TABLE]

From Lemma F.3 below, there exists a universal constant $C>0$ such that

[TABLE]

Here, by Hölder’s inequality, we have

[TABLE]

and by the Cauchy–Schwarz inequality, we also have

[TABLE]

Then, by Lemma F.4 below, we have

[TABLE]

for some $C^{\prime}>0$ . Thus, (81) has been proved.

The second inequality (82) is a consequence of the peeling lemma (Lemma F.2 below). ∎
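The cardinality bound $|\mathcal{Q}|\leq(2\mathrm{e})^{m}$ used in the proof of (81) can be checked directly for small $m$. The sketch below counts the integer vectors $\mathbf{a}$ with $1\leq a_{i}\leq m$ and $\sum_{i}a_{i}\leq 2m$ by dynamic programming. The comparison with $\binom{2m}{m}$ is our own elementary remark (it counts positive integer vectors with sum at most $2m$, ignoring the cap $a_{i}\leq m$) and is not claimed to be the argument of Proposition D.10. The name `count_Q` is hypothetical.

```python
import math


def count_Q(m):
    """Count integer vectors a in {1, ..., m}^m with a_1 + ... + a_m <= 2m."""
    ways = {0: 1}  # ways[s] = number of partial vectors whose coordinates sum to s
    for _ in range(m):
        nxt = {}
        for s, w in ways.items():
            for a in range(1, m + 1):
                if s + a <= 2 * m:
                    nxt[s + a] = nxt.get(s + a, 0) + w
        ways = nxt
    return sum(ways.values())


for m in range(1, 9):
    size = count_Q(m)
    # Dropping the cap a_i <= m gives comb(2m, m) vectors, which is at most 4^m.
    print(m, size, math.comb(2 * m, m), (2 * math.e) ** m)
```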

Combining (79), (80) and (82), we conclude that

[TABLE]

holds with probability at least $1-\exp(-(x+z))$, where $C$ is the constant in (82). Now, we choose the two constants $\eta:=\eta(\Pi,\mathbf{V},z)$ and $x:=x(\Pi,\mathbf{V})$ as

[TABLE]

and

[TABLE]

respectively. Then, it is elementary to check that the right-hand side of the above inequality is not larger than $1/8$.

Applying the union bound over all pairs $(\Pi,\mathbf{V})$ , we have

[TABLE]

Here, we can show that

[TABLE]

and hence we conclude that (77) holds with $c=1/2$ . Indeed, (85) follows from the fact that, for any $\Pi$ ,

[TABLE]

and

[TABLE]

E.3 Proof of Theorem 5.1

Now, we are ready to complete the proof of Theorem 5.1. Define $\mathrm{pen}(\Pi,\mathbf{V})$ as

[TABLE]

where $C$ is the constant in (82). Let $(\Pi^{\prime},\mathbf{V}^{\prime})$ be the pair that minimizes

[TABLE]

among all possible pairs. Applying (78) and (77) for this choice of $(\Pi^{\prime},\mathbf{V}^{\prime})$ , we conclude that

[TABLE]

holds with probability at least $1-\exp(-z)$ . Moreover, by integrating both sides with respect to $z$ , we have

[TABLE]
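The integration step can be illustrated in a generic form. Suppose, as an assumption about the shape of the bound rather than a restatement of it, that a random variable $X$ satisfies $X\leq a+bz$ with probability at least $1-\mathrm{e}^{-z}$ for every $z>0$, with constants $a$ and $b>0$. Then $\mathbb{E}[(X-a)_{+}]=\int_{0}^{\infty}\mathbb{P}(X-a>t)\,dt\leq\int_{0}^{\infty}\mathrm{e}^{-t/b}\,dt=b$, so $\mathbb{E}[X]\leq a+b$. The sketch evaluates the tail integral numerically; the function names are hypothetical.

```python
import math


def tail_integral(b, upper=60.0, steps=20000):
    """Trapezoidal approximation of the integral of exp(-t / b) over [0, upper * b]."""
    h = upper * b / steps
    total = 0.5 * (1.0 + math.exp(-upper))  # endpoint values of the integrand
    total += sum(math.exp(-k * upper / steps) for k in range(1, steps))
    return total * h


def expectation_bound(a, b):
    """Bound on E[X] when X <= a + b z holds with probability >= 1 - exp(-z) for all z > 0."""
    return a + tail_integral(b)


print(expectation_bound(1.0, 2.0))  # close to a + b
```

The bound is attained when $X=a+bE$ with $E$ a standard exponential variable, so the constant $b$ in front of the integrated term cannot be reduced in general.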

Appendix F Auxiliary lemmas

Here, we present several auxiliary lemmas that are used in the proofs in the previous sections.

Lemma F.1 (Borel–Tsirelson–Ibragimov–Sudakov inequality; see Proposition 3.19 in Massart (2007)).

Suppose that $(X_{t})_{t\in T}$ is a Gaussian process on a totally bounded metric space $(T,d)$ such that $\mathbb{E}[X_{t}]=0$ for any $t\in T$ and the sample path $t\mapsto X_{t}$ is almost surely continuous. Let $v:=\sup_{t\in T}\mathbb{E}[X^{2}_{t}]$ . Then, for any $z>0$ , we have

[TABLE]
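As a rough Monte Carlo illustration, the sketch below assumes the usual form of this inequality, $\mathbb{P}\bigl(\sup_{t}X_{t}\geq\mathbb{E}[\sup_{t}X_{t}]+\sqrt{2vz}\bigr)\leq\mathrm{e}^{-z}$, and tests it on a finite index set with $X_{t}=\langle t,\xi\rangle$ and $\xi\sim N(0,\sigma^{2}I)$. The index set, the sample size, and the function names are arbitrary choices made for the example, and $\mathbb{E}[\sup_{t}X_{t}]$ is replaced by its empirical mean.

```python
import math
import random


def sup_process(points, xi):
    """Supremum over the finite index set of X_t = <t, xi>."""
    return max(sum(t_i * x_i for t_i, x_i in zip(t, xi)) for t in points)


def tail_frequency(points, sigma, z, trials=4000, seed=0):
    """Empirical frequency of the event sup X_t >= mean(sup X_t) + sqrt(2 v z)."""
    rng = random.Random(seed)
    dim = len(points[0])
    sups = [
        sup_process(points, [rng.gauss(0.0, sigma) for _ in range(dim)])
        for _ in range(trials)
    ]
    mean_sup = sum(sups) / trials
    # v = sup_t E[X_t^2] = sigma^2 * max_t ||t||^2 for this linear process.
    v = sigma ** 2 * max(sum(coord * coord for coord in t) for t in points)
    threshold = mean_sup + math.sqrt(2.0 * v * z)
    return sum(s >= threshold for s in sups) / trials


points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.8, 0.0), (-1.0, 0.0, 0.0)]
for z in (0.5, 1.0, 2.0):
    print(z, tail_frequency(points, 1.0, z), math.exp(-z))
```

In such small examples the empirical frequency is typically far below $\mathrm{e}^{-z}$, since the bound is dimension-free and does not exploit the structure of the index set.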

Lemma F.2 (Peeling lemma; see e.g. Lemma 4.23 in Massart (2007)).

Let $K$ be a set in $\mathbb{R}^{n}$ and $\bar{\theta}\in K$ . Assume that there is a function $\psi:[0,\infty)\to\mathbb{R}$ such that $\psi(t)/t$ is non-increasing and

[TABLE]

for any $t\geq\bar{t}\geq 0$ . Then, for any $x\geq\bar{t}$ , we have

[TABLE]

Lemma F.3 (Guntuboyina et al. (2017), Lemma B.1).

For any $t>0$ and $\mathcal{V}>0$ , let

[TABLE]

There exists a universal constant $C>0$ such that

[TABLE]

Lemma F.4 (Guntuboyina et al. (2017), Lemma D.1).

Suppose $p,n\geq 1$ and let $\Theta_{1},\ldots,\Theta_{p}$ be subsets of $\mathbb{R}^{n}$, each containing the origin and each contained in the closed Euclidean ball of radius $D$ centered at the origin. Then, for $\xi\sim N(0,\sigma^{2}I)$, we have

[TABLE]

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP17J06640. The author would like to thank three anonymous reviewers for their valuable comments and suggestions. The author also thanks Hiromichi Nagao for suggesting the example of a seismological phenomenon, and Fumiyasu Komaki and Keisuke Yano for helpful discussions.

Bibliography

Amelunxen et al. [2014] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: Phase transition in convex programs with random data. Information and Inference: A Journal of the IMA, 3:224–294, 2014.
Ayer et al. [1955] M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, 26:641–647, 1955.
Bach [2011] F. Bach. Shaping level sets with submodular functions. In NIPS, 2011.
Bach [2013] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2–3):143–373, 2013.
Beck and Teboulle [2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Bellec [2018] P. C. Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.
Bellec and Tsybakov [2015] P. C. Bellec and A. B. Tsybakov. Sharp oracle bounds for monotone and convex regression through aggregation. Journal of Machine Learning Research, 16:1879–1892, 2015.
Birgé and Massart [2001] L. Birgé and P. Massart. Gaussian model selection. Journal of the European Mathematical Society, 3:203–268, 2001.
