Post-Selection Inference for Changepoint Detection Algorithms with   Application to Copy Number Variation Data

Sangwon Hyun; Kevin Lin; Max G'Sell; Ryan J. Tibshirani

arXiv:1812.03644·stat.ME·December 11, 2018


TL;DR

This paper develops tailored post-selection inference methods for changepoint detection algorithms, especially in copy number variation data, enhancing uncertainty quantification and practical usability.

Contribution

It adapts post-selection inference techniques for specific changepoint algorithms, incorporating randomization and MCMC methods to improve test power and usability.

Findings

01

Improved power in post-selection tests using auxiliary randomization.

02

Effective application of methods to copy number variation data.

03

Guidelines for practical implementation and analysis.

Abstract

Changepoint detection methods are used in many areas of science and engineering, e.g., in the analysis of copy number variation data, to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or presence) of given changepoints, post-detection, is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods towards changepoint detection, focusing as our main scientific application on copy number variation data. As for changepoint algorithms, we study binary segmentation, and two of its most popular variants, wild and circular, and the fused lasso. We…

Equations

Y_{i} \sim N(\theta_{i}, \sigma^{2}), \quad i = 1, \dots, n,

\theta_{b_{j}+1} = \dots = \theta_{b_{j+1}}, \quad j = 0, \dots, t.

1 \leq \hat{c}_{1} < \dots < \hat{c}_{k} \leq n-1,

v_{j}^{T}y = \hat{d}_{j}\bigg(\frac{1}{\hat{c}_{j+1}-\hat{c}_{j}}\sum_{i=\hat{c}_{j}+1}^{\hat{c}_{j+1}}y_{i} - \frac{1}{\hat{c}_{j}-\hat{c}_{j-1}}\sum_{i=\hat{c}_{j-1}+1}^{\hat{c}_{j}}y_{i}\bigg),

\big\{\hat{j}_{\ell},\hat{b}_{\ell}\big\} = \mathop{\mathrm{argmax}}_{\substack{j\in\{1,\ldots,\ell\}\\ b\in\{s_{j},\ldots,e_{j}-1\}}} \big|g^{T}_{(s_{j},b,e_{j})}y\big|, \quad\text{where}

g_{(s,b,e)}^{T}y = \sqrt{\frac{1}{\frac{1}{|e-b|}+\frac{1}{|b+1-s|}}}\,\big(\bar{y}_{(b+1):e}-\bar{y}_{s:b}\big).

\big\{\hat{j}_{\ell},\hat{b}_{\ell}\big\} = \mathop{\mathrm{argmax}}_{\substack{j\in J_{\ell}\\ b\in\{s_{j},\ldots,e_{j}-1\}}} \big|g^{T}_{(s_{j},b,e_{j})}y\big|,

\big\{\hat{j}_{\ell},\hat{a}_{\ell},\hat{b}_{\ell}\big\} = \mathop{\mathrm{argmax}}_{\substack{j\in\{1,\ldots,2(\ell-1)+1\}\\ a<b\in\{s_{j},\ldots,e_{j}-1\}}} \big|g^{T}_{(s_{j},a,b,e_{j})}y\big|, \quad\text{where}

g_{(s,a,b,e)}^{T}y = \sqrt{\frac{1}{\frac{1}{|b-a|}+\frac{1}{|e-s-b+a|}}}\,\Big(\bar{y}_{(a+1):b}-\bar{y}_{\{s:a\}\cup\{(b+1):e\}}\Big).

\min_{\theta\in\mathbb{R}^{n}}\;\sum_{i=1}^{n}(y_{i}-\theta_{i})^{2}+\lambda\sum_{i=1}^{n-1}|\theta_{i}-\theta_{i+1}|,

v^{T}Y\;\big|\;\Big(M(Y)=M(y_{\mathrm{obs}}),\;q(Y)=q(y_{\mathrm{obs}})\Big),

v^{T}Y\;\big|\;\Big(M(Y)=M(y_{\mathrm{obs}}),\;\Pi_{v}^{\perp}Y=\Pi_{v}^{\perp}y_{\mathrm{obs}}\Big).

\theta_{\hat{c}_{j}+1}=\dots=\theta_{\hat{c}_{j+1}},\quad j\in\{0,\dots,k\}.

\big(\bar{Y}_{(\hat{c}_{j}+1):\hat{c}_{j+1}}-\bar{Y}_{(\hat{c}_{j-1}+1):\hat{c}_{j}}\big)\;\big|\;\Big(M(Y)=M(y_{\mathrm{obs}}),\;\bar{Y}_{(\hat{c}_{\ell}+1):\hat{c}_{\ell+1}}=\big(\bar{y}_{\mathrm{obs}}\big)_{(\hat{c}_{\ell}+1):\hat{c}_{\ell+1}},\;\ell\neq j\Big).

\big(\bar{Y}_{(\hat{c}_{j}+1):\hat{c}_{j+1}}-\bar{Y}_{(\hat{c}_{j-1}+1):\hat{c}_{j}}\big)\;\big|\;\Big(M(Y)=M(y_{\mathrm{obs}}),\;\bar{Y}_{(\hat{c}_{\ell}+1):\hat{c}_{\ell+1}}=\big(\bar{y}_{\mathrm{obs}}\big)_{(\hat{c}_{\ell}+1):\hat{c}_{\ell+1}},\;\ell\neq j,\;\|Y\|_{2}=\|y_{\mathrm{obs}}\|_{2}\Big).

M^{\mathrm{BS}}_{1:k}(y_{\mathrm{obs}})=\big\{\hat{b}_{1:k}(y_{\mathrm{obs}}),\;\hat{d}_{1:k}(y_{\mathrm{obs}})\big\},

\big\{y:M_{1:k}^{\mathrm{BS}}(y)=\{b_{1:k},d_{1:k}\}\big\}=\{y:\Gamma y\geq 0\},

d_{1}\cdot g_{(1,b_{1},n)}^{T}y\geq g_{(1,b,n)}^{T}y \quad\text{and}\quad d_{1}\cdot g_{(1,b_{1},n)}^{T}y\geq -g_{(1,b,n)}^{T}y,\quad b\in\{1,\dots,n-1\}\setminus\{b_{1}\}.

d_{k}\cdot g_{(s_{k},b_{k},e_{k})}^{T}y\geq g_{(s_{k},b,e_{k})}^{T}y \quad\text{and}\quad d_{k}\cdot g_{(s_{k},b_{k},e_{k})}^{T}y\geq -g_{(s_{k},b,e_{k})}^{T}y,\quad b\in\{s_{k},\dots,e_{k}-1\}\setminus\{b_{k}\}.

d_{k}\cdot g_{(s_{k},b_{k},e_{k})}^{T}y\geq g_{(s_{\ell},b,e_{\ell})}^{T}y \quad\text{and}\quad d_{k}\cdot g_{(s_{k},b_{k},e_{k})}^{T}y\geq -g_{(s_{\ell},b,e_{\ell})}^{T}y,\quad b\in\{s_{\ell},\dots,e_{\ell}-1\}.

M^{\mathrm{WBS}}_{1:k}(y_{\mathrm{obs}},w)=\big\{\hat{b}_{1:k}(y_{\mathrm{obs}}),\;\hat{d}_{1:k}(y_{\mathrm{obs}}),\;\hat{j}_{1:k}(y_{\mathrm{obs}})\big\},

\big\{y:M_{1:k}^{\mathrm{WBS}}(y,w)=\{b_{1:k},d_{1:k},j_{1:k}\}\big\}=\big\{y:\Gamma y\geq 0\big\}.

M^{\mathrm{CBS}}_{1:k}(y_{\mathrm{obs}})=\big\{\hat{a}_{1:k}(y_{\mathrm{obs}}),\;\hat{b}_{1:k}(y_{\mathrm{obs}}),\;\hat{d}_{1:k}(y_{\mathrm{obs}})\big\},

\big\{y:M_{1:k}^{\mathrm{CBS}}(y,w)=\{a_{1:k},b_{1:k},d_{1:k}\}\big\}=\big\{y:\Gamma y\geq 0\big\}.

2\sum_{\ell=1}^{k}\Big[C(|I^{(\ell)}_{j_{k}}|-1,2)-1+\sum_{j'\neq j_{k}}C(|I^{(\ell)}_{j'}|-1,2)\Big].

M^{\mathrm{FL}}_{1:k}(y_{\mathrm{obs}})=\big\{\hat{b}_{1:k}(y_{\mathrm{obs}}),\;\hat{d}_{1:k}(y_{\mathrm{obs}}),\;\hat{R}_{1:k}(y_{\mathrm{obs}})\big\},

\big\{y:M_{1:k}^{\mathrm{FL}}(y)=\{b_{1:k},d_{1:k},R_{1:k}\}\big\}=\{y:\Gamma y\geq 0\},

\frac{\Phi(\mathcal{V}_{\text{up}}/\tau)-\Phi(v^{T}y_{\mathrm{obs}}/\tau)}{\Phi(\mathcal{V}_{\text{up}}/\tau)-\Phi(\mathcal{V}_{\text{lo}}/\tau)}

\mathcal{V}_{\text{lo}}=v^{T}y_{\mathrm{obs}}-\min_{j:\rho_{j}>0}\big(\Gamma y_{\mathrm{obs}}\big)_{j}/\rho_{j},\quad\text{and}\quad\mathcal{V}_{\text{up}}=v^{T}y_{\mathrm{obs}}-\max_{j:\rho_{j}<0}\big(\Gamma y_{\mathrm{obs}}\big)_{j}/\rho_{j}.

\Big\{v^{T}Y:M(Y)=M(y_{\mathrm{obs}}),\;\|Y\|_{2}=\|y_{\mathrm{obs}}\|_{2},\;\bar{Y}_{(\hat{c}_{\ell}+1):\hat{c}_{\ell+1}}=\bar{y}_{\mathrm{obs},(\hat{c}_{\ell}+1):\hat{c}_{\ell+1}},\;\ell\neq j\Big\}.

\Big\{y\;:\;y=y^{(m-1)}+r(\omega)\sin(\omega)\cdot s+r(\omega)\cos(\omega)\cdot t\quad\text{for any }\omega\in[-\pi/2,\pi/2]\Big\},


Full text


Post-Selection Inference for Changepoint Detection Algorithms

with Application to Copy Number Variation Data

SANGWON HYUN∗, KEVIN LIN, MAX G’SELL, RYAN J. TIBSHIRANI

Department of Statistics, Carnegie Mellon University, 132 Baker Hall, Pittsburgh, PA 15213.

Abstract

Changepoint detection methods are used in many areas of science and engineering, e.g., in the analysis of copy number variation data, to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or presence) of given changepoints, post-detection, is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods towards changepoint detection, focusing as our main scientific application on copy number variation data. As for changepoint algorithms, we study binary segmentation, and two of its most popular variants, wild and circular, and the fused lasso. We implement some of the latest developments in post-selection inference theory: we use auxiliary randomization to improve power, which requires implementations of MCMC algorithms (importance sampling and hit-and-run sampling) to carry out our tests. We also provide recommendations for improving practical usability, detailed simulations, and an example analysis on array comparative genomic hybridization (CGH) data.

Keywords: CGH analysis; changepoint detection; copy number variation; hypothesis tests; post-selection inference; segmentation algorithms

∗ To whom correspondence should be addressed: [email protected].

1 Introduction

Changepoint detection is the problem of identifying changes in data distribution along a sequence of observations. We study the canonical changepoint problem, where changes occur only in the mean: let $Y=(Y_{1},\ldots,Y_{n})\in\mathbb{R}^{n}$ be a data vector with independent entries following

$$Y_{i}\sim N(\theta_{i},\sigma^{2}),\quad i=1,\ldots,n, \tag{1}$$

where the unknown mean vector $\theta\in\mathbb{R}^{n}$ forms a piecewise constant sequence. That is, for locations $1\leq b_{1}<\cdots<b_{t}\leq n-1$,

$$\theta_{b_{j}+1}=\cdots=\theta_{b_{j+1}},\quad j=0,\ldots,t,$$

where for convenience we write $b_{0}=0$ and $b_{t+1}=n$ . We call $b_{1},\ldots,b_{t}$ changepoint locations of $\theta$ . Changepoint detection algorithms typically focus on estimating the number of changepoints $t$ (which could possibly be 0), as well as the locations $b_{1},\ldots,b_{t}$ , from a single realization $Y$ . Roughly speaking, changepoint methodology (and its associated literature) can be divided into two classes of algorithms: segmentation algorithms and penalization algorithms. The former class includes binary segmentation (BS) (Vostrikova, 1981) and popular variants like wild binary segmentation (WBS) (Fryzlewicz, 2014) and circular binary segmentation (CBS) (Olshen et al., 2004); the latter class includes the fused lasso (FL) (Tibshirani et al., 2005) (also called total variation denoising (Rudin et al., 1992) in signal processing), and the Potts estimator (Boysen et al., 2009). These two classes have different strengths; see, e.g., Lin et al. (2016) for more discussion.

Having estimated changepoint locations, a natural follow-up goal would be to conduct statistical inference on the significance of the changes in mean at these locations. Despite the large number of segmentation algorithms and penalization algorithms available for changepoint detection, there has been very little focus on formally valid inferential tools to use post-detection. In this work, we describe a suite of inference tools to use after a changepoint algorithm has been applied—namely, BS, WBS, CBS, or FL. We work in the framework of post-selection inference, also called selective inference. The specific machinery that we build off was first introduced in Lee et al. (2016); Tibshirani et al. (2016), and further developed in various works, notably Fithian et al. (2014); Fithian et al. (2015); Tian and Taylor (2018), whose extensions we rely on in particular. The basic inference procedure we develop can be outlined as follows.

1. Given data $Y$, apply a changepoint algorithm to detect some fixed number of changepoints $k$. Denote the sorted estimated changepoint locations by

$$1\leq\hat{c}_{1}<\cdots<\hat{c}_{k}\leq n-1, \tag{2}$$

and their respective changepoint directions (whether the estimated change in mean was positive or negative) by $\hat{d}_{1},\ldots,\hat{d}_{k}\in\{-1,1\}$. For notational convenience, we set $\hat{c}_{0}=0$ and $\hat{c}_{k+1}=n$. The specifics of the changepoint algorithms that we consider are given in Section 2.1.

2. Form contrast vectors $v_{1},\ldots,v_{k}\in\mathbb{R}^{n}$, defined so that for arbitrary $y\in\mathbb{R}^{n}$,

$$v_{j}^{T}y=\hat{d}_{j}\bigg(\frac{1}{\hat{c}_{j+1}-\hat{c}_{j}}\sum_{i=\hat{c}_{j}+1}^{\hat{c}_{j+1}}y_{i}-\frac{1}{\hat{c}_{j}-\hat{c}_{j-1}}\sum_{i=\hat{c}_{j-1}+1}^{\hat{c}_{j}}y_{i}\bigg), \tag{3}$$

that is, $\hat{d}_{j}$ times the difference between the sample means of the segments to the right and left of $\hat{c}_{j}$, for $j=1,\ldots,k$.

3. For each $j=1,\ldots,k$, we test the hypothesis $H_{0}:v_{j}^{T}\theta=0$ by rejecting for large values of a statistic $T(Y,v_{j})$, which is computed based on knowledge of the changepoint algorithm that produced (2) in Step 1, and the desired contrast vector (3) formed in Step 2. Each statistic yields an exact p-value under the null (assuming Gaussian errors (1)). The details are given in Sections 2.2 and 3.

4. Optionally, we can use Bonferroni correction and multiply our p-values by $k$, to account for multiplicity.
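As an illustration of Step 2, the following is a minimal sketch of how a contrast vector could be assembled from sorted changepoint locations and directions. The function names `contrast_vector` and `dot` are hypothetical and not taken from the authors' software; the sketch assumes 1-based locations as in the text and weights each segment by its own length, so that $v_{j}^{T}y$ equals $\hat{d}_{j}$ times the difference of the two neighboring sample means.

```python
def contrast_vector(changepoints, directions, j, n):
    """Return v_j with v_j^T y = d_j * (right segment mean - left segment mean).

    changepoints: sorted estimated locations c_1 < ... < c_k (1-based)
    directions:   signs d_1, ..., d_k in {-1, +1}
    j:            1-based index of the changepoint under test
    """
    c = [0] + list(changepoints) + [n]         # pad with c_0 = 0 and c_{k+1} = n
    left_lo, left_hi = c[j - 1] + 1, c[j]      # left segment  (c_{j-1}+1):c_j
    right_lo, right_hi = c[j] + 1, c[j + 1]    # right segment (c_j+1):c_{j+1}
    d = directions[j - 1]
    v = [0.0] * n
    for i in range(left_lo, left_hi + 1):
        v[i - 1] = -d / (left_hi - left_lo + 1)
    for i in range(right_lo, right_hi + 1):
        v[i - 1] = d / (right_hi - right_lo + 1)
    return v


def dot(u, w):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, w))


# Toy data with an upward change after position 3 and a downward change after 7.
y = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0]
v1 = contrast_vector([3, 7], [1, -1], j=1, n=len(y))
v2 = contrast_vector([3, 7], [1, -1], j=2, n=len(y))
print(dot(v1, y), dot(v2, y))
```

Because each direction is the sign of the estimated jump, both contrasts come out positive on this toy input, which is why the tests can be one-sided.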

It is worth mentioning that several variants of this basic procedure are possible. For example, the number of changepoints $k$ in Step 1 need not be seen as fixed and may itself be estimated from data; the set of estimated changepoints (2) may be pruned after Step 1 to eliminate changepoints that lie too close to others; and alternative contrast vectors to (3) in Step 2 may be used to measure more localized mean changes. These are all briefly described in the section on practicalities. Though not covered in our paper, the p-values from our tests can be inverted to form confidence intervals for population contrasts $v_{j}^{T}\theta$ for $j=1,\ldots,k$ (Lee et al., 2016; Tibshirani et al., 2016).

At a more comprehensive level, our contributions in this work are to implement theoretically valid inference tools and practical guidance for each combination of the following choices that a typical user might face in a changepoint analysis: the algorithm (BS, WBS, CBS, or FL), the number of estimated changepoints $k$ (fixed or data-driven), the null hypothesis model (saturated or selected model, to be explained in Section 2.2), the type of conditioning (plain or marginalized, to be explained in the section on randomization), and the error variance $\sigma^{2}$ (known or unknown). In the section on practicalities, we summarize the tradeoffs underlying each of these choices.

Finally, as the primary application of our inference tools, we study comparative genomic hybridization (CGH) data, making particular suggestions geared towards this problem throughout the paper. We begin with a motivating CGH data example in the next subsection, and return to it at the end of the paper.

1.1 Motivating example: array CGH data analysis

We examine array CGH data from the 14th chromosome of cell line GM01750, one of the 15 datasets from Snijders et al. (2001); more background can be found in Lai et al. (2005) and references therein. Array CGH data are $\log_{2}$ ratios of dye intensities of diseased to healthy subjects' measurements, mixed across many samples. Normal regions of the gene are thought to have an underlying mean $\log_{2}$ ratio of zero, and aberrations are regions of upward or downward departures from zero, arising because the gene in that region has been mutated (duplicated or deleted). The presence and locations of aberrations are known in the biomedical literature to be associated with a wide range of genetically driven diseases, such as many types of cancer, Alzheimer's disease, and autism (Fanciulli et al., 2007; Sebat et al., 2007; Consortium et al., 2008; Stefansson et al., 2008; Walters et al., 2010; Bochukova et al., 2010). Accurate changepoint analysis of array CGH data is thus useful in studying association with diseases, and for medical diagnosis.

The data are plotted in the left panel of Figure 1. Two locations $\hat{c}_{1}<\hat{c}_{2}$, marked A and B respectively, were detected by running 2-step WBS. Ground truth in this data set can be defined via an external process called karyotyping; this is done by Snijders et al. (2001), who find only one true changepoint, at location A. (To be precise, they do not report exact locations of abnormalities, but find a single start-to-middle deviation from zero level.)

Without access to any post-selection inference tools, we might treat locations A and B as fixed, and simply run t-tests for equality of means of neighboring data segments, to the left and right of each location. This is precisely testing the null hypothesis $H_{0}:v_{j}^{T}\theta=0$, $j=1,2$, where the contrast vectors are as defined in (3). P-values from the t-tests are reported in the first row of the table in Figure 1: we see that location A has a p-value of $<10^{-5}$, but location B also has a small p-value of $5\times 10^{-4}$, which is troublesome. The problem is that location B was specifically selected by WBS because (loosely put) the sample means to the left and right of B are well separated, thus a t-test at location B is bound to be optimistic.

Using the tools we describe shortly, we test $H_{0}:v_{j}^{T}\theta=0$, $j=1,2$ in two ways: using a saturated model and a selected model on the mean vector $\theta$. The saturated model assumes nothing about $\theta$, while the selected model assumes $\theta$ is constant on each of the intervals delimited by A and B. Both tests yield a p-value $<10^{-5}$ at location A, but only a moderately small p-value at location B. If we were to use the Bonferroni correction at a nominal significance level $\alpha=0.05$, then in neither case would we reject the null at location B.

1.2 Related work

In addition to the references on general post-selection inference methodology given previously, we highlight the recent work of Hyun et al. (2018), who study post-selection inference for the generalized lasso, a special case of which is the fused lasso. These authors already characterize the polyhedral form of fused lasso selection events, and study inference using contrasts as in (3). While writing the current paper, we became aware of the independent contributions of Umezu and Takeuchi (2017), who study multi-dimensional changepoint sequences, but focus on problems in which the mean $\theta$ has only one changepoint. Aside from these papers, there is little focus on valid inference methods to apply post-detection in changepoint analysis. On the other hand, there is a huge literature on changepoint estimation, and inference for fixed hypotheses in changepoint problems; we refer to Jandhyala et al. (2013); Aue and Horvath (2013); Horvath and Rice (2014), which collectively summarize a good deal of the literature.

2 Preliminaries

2.1 Review: changepoint algorithms

Below we describe the changepoint algorithms that we will study in this paper. For the first three segmentation algorithms, we will focus on formulations that run the algorithm for a given number of steps $k$ ; these algorithms are typically described in the literature as being run until internally calculated statistics do not exceed a given threshold level $\tau$ . The reason that we choose the former formulation is twofold: first, we feel it is easier for a user to specify a priori a reasonable number of steps $k$ , versus a threshold level $\tau$ ; second, we can use the method in Hyun et al. (2018) to adaptively choose the number of steps $k$ and still perform valid inferences. In what follows, we use the notation $y_{a:b}=(y_{a},y_{a+1},\ldots,y_{b})$ and $\bar{y}_{a:b}=(b-a+1)^{-1}\sum_{i=a}^{b}y_{i}$ for a vector $y$ .

Binary segmentation (BS).

Given a data vector $y\in\mathbb{R}^{n}$, the $k$-step BS algorithm (Vostrikova, 1981) sequentially splits the data based on the cumulative sum (CUSUM) statistics, defined below. At a step $\ell=1,\ldots,k$, let $\hat{b}_{1:(\ell-1)}$ be the changepoints estimated so far, and let $I_{j}$, $j=1,\ldots,\ell$ be the partition of $\{1,\ldots,n\}$ induced by $\hat{b}_{1:(\ell-1)}$ (the $\ell-1$ changepoints found so far split the sequence into $\ell$ intervals). Intervals of length 1 are discarded. Let $s_{j}$ and $e_{j}$ be the start and end indices of $I_{j}$. The next changepoint $\hat{b}_{\ell}$ and maximizing interval $\hat{j}_{\ell}$ are chosen to maximize the absolute CUSUM statistic:

$$\begin{aligned}\big\{\hat{j}_{\ell},\hat{b}_{\ell}\big\}&=\mathop{\mathrm{argmax}}_{\substack{j\in\{1,\ldots,\ell\}\\ b\in\{s_{j},\ldots,e_{j}-1\}}}\big|g^{T}_{(s_{j},b,e_{j})}y\big|,\quad\text{where}\\ g_{(s,b,e)}^{T}y&=\sqrt{\frac{1}{\frac{1}{|e-b|}+\frac{1}{|b+1-s|}}}\,\big(\bar{y}_{(b+1):e}-\bar{y}_{s:b}\big).\end{aligned} \tag{4}$$

Additionally, the direction $\hat{d}_{\ell}$ of the new changepoint is given by the sign of the maximizing CUSUM statistic, $\hat{d}_{\ell}=\mathrm{sign}(g_{(s_{j},\hat{b}_{\ell},e_{j})}^{T}y)$ for $j=\hat{j}_{\ell}$.
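The BS recursion can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the names `cusum` and `binary_segmentation` are hypothetical, indices are 1-based as in the text, and ties in the argmax are broken in favor of the first maximizer found, which is an assumption the text does not specify.

```python
import math


def cusum(y, s, b, e):
    """CUSUM statistic g_(s,b,e)^T y for 1-based indices s <= b < e."""
    left = y[s - 1:b]    # y_s, ..., y_b
    right = y[b:e]       # y_{b+1}, ..., y_e
    scale = math.sqrt(1.0 / (1.0 / len(right) + 1.0 / len(left)))
    return scale * (sum(right) / len(right) - sum(left) / len(left))


def binary_segmentation(y, k):
    """Run k steps of BS; return locations and directions in order of detection."""
    intervals = [(1, len(y))]
    locations, directions = [], []
    for _ in range(k):
        best = None
        for (s, e) in intervals:
            # b ranges over {s, ..., e-1}; intervals of length 1 contribute nothing.
            for b in range(s, e):
                g = cusum(y, s, b, e)
                if best is None or abs(g) > abs(best[0]):
                    best = (g, s, b, e)
        if best is None:
            break                      # nothing left to split
        g, s, b, e = best
        locations.append(b)
        directions.append(1 if g > 0 else -1)
        # Replace the winning interval by its two children.
        intervals.remove((s, e))
        intervals += [(s, b), (b + 1, e)]
    return locations, directions


y = [0.1, -0.2, 0.0, 2.1, 1.9, 2.0, 2.2, 1.0, 0.9, 1.1]
locs, dirs = binary_segmentation(y, 2)
print(locs, dirs)
```

On this toy input the first split is the upward jump after position 3, and the second is the downward jump after position 7.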

Wild binary segmentation (WBS).

The $k$-step WBS algorithm (Fryzlewicz, 2014) is a modification of BS that calculates CUSUM statistics over randomly drawn segments of the data. Denote by $w=\{w_{1},\ldots,w_{B}\}=\{(s_{1},\ldots,e_{1}),\ldots,(s_{B},\ldots,e_{B})\}$ a set of $B$ uniformly randomly drawn intervals with $1\leq s_{i}<e_{i}\leq n$, $i=1,\ldots,B$. At a step $\ell=1,\ldots,k$, let $J_{\ell}$ be the index set of the intervals in $w$ that do not intersect the changepoints $\hat{b}_{1:(\ell-1)}$ estimated so far. The next changepoint $\hat{b}_{\ell}$ and the maximizing interval $\hat{j}_{\ell}$ are obtained by:

$$\big\{\hat{j}_{\ell},\hat{b}_{\ell}\big\}=\mathop{\mathrm{argmax}}_{\substack{j\in J_{\ell}\\ b\in\{s_{j},\ldots,e_{j}-1\}}}\big|g^{T}_{(s_{j},b,e_{j})}y\big|, \tag{5}$$

where $g_{(s,b,e)}^{T}y$ is as defined in (4). Similar to BS, the direction of the changepoint $\hat{d}_{\ell}$ is defined by the sign of the maximizing CUSUM statistic.
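A sketch of the WBS step under stated assumptions follows. The names `draw_intervals` and `wild_binary_segmentation` are hypothetical. An interval $(s,e)$ is treated as intersecting an earlier changepoint $\hat{b}$ when $s\leq\hat{b}<e$, which is one reasonable reading of the text rather than a confirmed detail of the authors' code, and intervals are passed in explicitly so that the random draw $w$ stays fixed, as it is in the selection event.

```python
import math
import random


def cusum(y, s, b, e):
    """CUSUM statistic g_(s,b,e)^T y for 1-based indices s <= b < e."""
    left, right = y[s - 1:b], y[b:e]
    scale = math.sqrt(1.0 / (1.0 / len(right) + 1.0 / len(left)))
    return scale * (sum(right) / len(right) - sum(left) / len(left))


def draw_intervals(n, B, seed=0):
    """Draw B intervals (s, e) with 1 <= s < e <= n, uniformly over such pairs."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(1, n + 1), 2))) for _ in range(B)]


def wild_binary_segmentation(y, k, intervals):
    """Run k steps of WBS on fixed intervals; return locations, directions, interval indices."""
    locations, directions, chosen = [], [], []
    for _ in range(k):
        best = None
        for j, (s, e) in enumerate(intervals):
            # Assumption: skip intervals that contain an earlier changepoint.
            if any(s <= b < e for b in locations):
                continue
            for b in range(s, e):
                g = cusum(y, s, b, e)
                if best is None or abs(g) > abs(best[0]):
                    best = (g, b, j)
        if best is None:
            break
        g, b, j = best
        locations.append(b)
        directions.append(1 if g > 0 else -1)
        chosen.append(j)
    return locations, directions, chosen


y = [0.1, -0.2, 0.0, 2.1, 1.9, 2.0, 2.2, 1.0, 0.9, 1.1]
fixed = [(1, 6), (5, 10), (2, 4), (8, 10)]          # hand-picked for a reproducible demo
result = wild_binary_segmentation(y, 2, fixed)
print(result)
random_result = wild_binary_segmentation(y, 2, draw_intervals(len(y), 20))
```

Returning the index of the winning interval alongside the location and direction mirrors the fact that WBS models need to record which interval produced each changepoint.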

Circular binary segmentation (CBS).

The $k$-step CBS algorithm (Olshen et al., 2004) specializes in detecting pairs of changepoints that have alternating directions. At a step $\ell=1,\ldots,k$, let $\hat{a}_{1:(\ell-1)}$, $\hat{b}_{1:(\ell-1)}$ be the changepoints estimated so far (with the pair $\hat{a}_{j}$, $\hat{b}_{j}$ estimated at step $j$), and let $I_{j}$, $j=1,\ldots,2(\ell-1)+1$ be the associated partition of $\{1,\ldots,n\}$. Intervals of length 2 are discarded. Let $s_{j}$ and $e_{j}$ denote the start and end index of $I_{j}$. The next changepoint pair $\hat{a}_{\ell}$ and $\hat{b}_{\ell}$, and the maximizing interval $\hat{j}_{\ell}$, are found by:

$$\begin{aligned}\big\{\hat{j}_{\ell},\hat{a}_{\ell},\hat{b}_{\ell}\big\}&=\mathop{\mathrm{argmax}}_{\substack{j\in\{1,\ldots,2(\ell-1)+1\}\\ a<b\in\{s_{j},\ldots,e_{j}-1\}}}\big|g^{T}_{(s_{j},a,b,e_{j})}y\big|,\quad\text{where}\\ g_{(s,a,b,e)}^{T}y&=\sqrt{\frac{1}{\frac{1}{|b-a|}+\frac{1}{|e-s-b+a|}}}\,\Big(\bar{y}_{(a+1):b}-\bar{y}_{\{s:a\}\cup\{(b+1):e\}}\Big).\end{aligned} \tag{6}$$

As before, the new changepoint direction $\hat{d}_{\ell}$ is defined based on the sign of the (modified) CUSUM statistic, $\hat{d}_{\ell}=\mathrm{sign}(g^{T}_{(s_{j},\hat{a}_{\ell},\hat{b}_{\ell},e_{j})}y)$ for $j=\hat{j}_{\ell}$.
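For intuition, a single CBS search within one interval might look like the sketch below. The names `cbs_statistic` and `cbs_step` are hypothetical; the normalization is copied from the displayed modified CUSUM formula, and ties are broken by the first maximizer, which is an assumption.

```python
import math


def cbs_statistic(y, s, a, b, e):
    """Modified CUSUM g_(s,a,b,e)^T y for 1-based indices s <= a < b <= e-1."""
    inner = y[a:b]                   # y_{a+1}, ..., y_b
    outer = y[s - 1:a] + y[b:e]      # y_s, ..., y_a together with y_{b+1}, ..., y_e
    # Scale factor as written in the displayed formula.
    scale = math.sqrt(1.0 / (1.0 / abs(b - a) + 1.0 / abs(e - s - b + a)))
    return scale * (sum(inner) / len(inner) - sum(outer) / len(outer))


def cbs_step(y, s, e):
    """Best pair (a, b) and direction inside the interval [s, e]; None if too short."""
    best = None
    for a in range(s, e - 1):        # a in {s, ..., e-2}
        for b in range(a + 1, e):    # b in {a+1, ..., e-1}
            g = cbs_statistic(y, s, a, b, e)
            if best is None or abs(g) > abs(best[0]):
                best = (g, a, b)
    if best is None:
        return None
    g, a, b = best
    return a, b, (1 if g > 0 else -1)


# A bump of height 3 occupying positions 4 to 6.
y = [0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]
pair = cbs_step(y, 1, len(y))
print(pair)
```

The search is quadratic in the interval length, since every pair $a<b$ is scored.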

Fused lasso.

The fused lasso (FL) estimator (Rudin et al., 1992; Tibshirani et al., 2005) is defined by solving the convex optimization problem:

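$$\min_{\theta\in\mathbb{R}^{n}}\;\sum_{i=1}^{n}(y_{i}-\theta_{i})^{2}+\lambda\sum_{i=1}^{n-1}|\theta_{i}-\theta_{i+1}|, \tag{7}$$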

for a tuning parameter $\lambda\geq 0$ . The fused lasso can be seen as a $k$ -step algorithm by sweeping the tuning parameter from $\lambda=\infty$ down to $\lambda=0$ . Then, at given values of $\lambda$ (called knots), the FL estimator introduces an additional changepoint in the solution in (7) (Hoefling, 2010).
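To make the path interpretation concrete, the sketch below evaluates the fused lasso criterion and computes the first knot, i.e., the largest $\lambda$ at which the constant fit $\theta_{i}=\bar{y}$ stops being optimal. The function names are hypothetical. The knot formula $2\max_{b}|\sum_{i\leq b}(y_{i}-\bar{y})|$ follows from the optimality conditions of the criterion with the scaling used here (no factor $1/2$ on the squared loss); with a different scaling the constant changes.

```python
def fused_lasso_objective(y, theta, lam):
    """Squared-error loss plus lam times the total variation of theta."""
    fit = sum((yi - ti) ** 2 for yi, ti in zip(y, theta))
    penalty = sum(abs(theta[i] - theta[i + 1]) for i in range(len(theta) - 1))
    return fit + lam * penalty


def first_knot(y):
    """Return (b, lam): the first changepoint location (1-based) and its knot value."""
    n = len(y)
    ybar = sum(y) / n
    best_b, best_val, running = None, -1.0, 0.0
    for b in range(1, n):
        running += y[b - 1] - ybar          # cumulative sum of centered data up to b
        if abs(running) > best_val:
            best_b, best_val = b, abs(running)
    return best_b, 2.0 * best_val


y = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]
b1, lam1 = first_knot(y)
print(b1, lam1)

constant = [1.0] * 6                        # the fit at very large lam
small_jump = [0.9] * 3 + [1.1] * 3          # a fit with one changepoint after position 3
# Below the knot the one-changepoint candidate beats the constant fit; above it, it does not.
print(fused_lasso_objective(y, small_jump, 5.0) < fused_lasso_objective(y, constant, 5.0))
print(fused_lasso_objective(y, small_jump, 7.0) > fused_lasso_objective(y, constant, 7.0))
```

This only locates the first changepoint along the path; computing all $k$ knots requires a path algorithm such as the one in the cited reference.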

2.2 Review: post-selection inference

We briefly review post-selection inference as developed in Lee et al. (2016); Tibshirani et al. (2016); Fithian et al. (2014). For a more thorough and general treatment, we refer to these papers or to Hyun et al. (2018). Our description here will be cast towards changepoint problems. For clarity, we notationally distinguish between a random vector $Y$ distributed as in (1), and $y_{\mathrm{obs}}$, a single data vector we observe for changepoint analysis. When a changepoint algorithm (such as BS, WBS, CBS, or FL) is applied to the data $y_{\mathrm{obs}}$, it selects a particular changepoint model $M(y_{\mathrm{obs}})$. The specific forms of such models are described in Section 3.1; for now, loosely, we may think of $M(y_{\mathrm{obs}})$ as the estimated changepoint locations and directions made by the algorithm on the data at hand. Post-selection inference revolves around the selective distribution, i.e., the law of

$$v^{T}Y\;\Big|\;\Big(M(Y)=M(y_{\mathrm{obs}}),\;q(Y)=q(y_{\mathrm{obs}})\Big), \tag{8}$$

under the null hypothesis $H_{0}:v^{T}\theta=0$, for any $v$ that is a measurable function of $M(y_{\mathrm{obs}})$. Here $q(Y)$ is a vector of sufficient statistics for the nuisance parameters, which must be conditioned on in order to tractably compute inferences based on (8). The explicit form of $q(Y)$ differs based on the assumptions imposed on $\theta$ under the null model. Broadly, there are two classes of null models we may study: saturated and selected models (Fithian et al., 2014). Computationally, under either null model, it is important for the selection event $\{y:M(y)=M(y_{\mathrm{obs}})\}$ to be polyhedral. This is described in detail in Section 3.1, where we show that this holds for BS, WBS, CBS, and FL.

Saturated model.

The saturated model assumes that $Y$ is distributed as in (1) with known error variance $\sigma^{2}$ , and assumes nothing about the mean vector $\theta$ . We set $q(Y)=\Pi_{v}^{\perp}Y$ , the projection of $Y$ onto the hyperplane orthogonal to $v$ . The selective distribution becomes the law of

$$v^{T}Y\;\Big|\;\Big(M(Y)=M(y_{\mathrm{obs}}),\;\Pi_{v}^{\perp}Y=\Pi_{v}^{\perp}y_{\mathrm{obs}}\Big).$$
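Conditioning on $\Pi_{v}^{\perp}Y=\Pi_{v}^{\perp}y_{\mathrm{obs}}$ leaves only the scalar $v^{T}Y$ free, so $Y$ is confined to a line through $y_{\mathrm{obs}}$ in the direction of $v$. The following sketch, with hypothetical helper names, makes that decomposition explicit; it illustrates the geometry only and does not compute a p-value.

```python
def dot(u, w):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, w))


def project_out(y, v):
    """Pi_v^perp y: remove from y its component along v."""
    coef = dot(v, y) / dot(v, v)
    return [yi - coef * vi for yi, vi in zip(y, v)]


def point_on_line(y_obs, v, t):
    """The unique y with v^T y = t and Pi_v^perp y = Pi_v^perp y_obs."""
    shift = (t - dot(v, y_obs)) / dot(v, v)
    return [yi + shift * vi for yi, vi in zip(y_obs, v)]


y_obs = [0.1, -0.2, 0.0, 2.1, 1.9, 2.0, 2.2, 1.0, 0.9, 1.1]
# Contrast between segments 1:3 and 4:7 (a hypothetical first changepoint at 3).
v = [-1.0 / 3] * 3 + [1.0 / 4] * 4 + [0.0] * 3

y_new = point_on_line(y_obs, v, 0.5)
print(dot(v, y_new))
residual_gap = max(abs(p - q) for p, q in zip(project_out(y_obs, v), project_out(y_new, v)))
print(residual_gap)
```

Moving $t$ along this line while checking whether the algorithm still selects the same model is what turns the selection event into an interval constraint on $v^{T}Y$.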

Selected model.

The selected model again assumes that $Y$ follows (1), but additionally assumes that the mean vector $\theta$ is piecewise constant with changepoints at the sorted estimated locations $\hat{c}_{1:k}=\hat{c}_{1:k}(y_{\mathrm{obs}})$ (assuming we have run our changepoint algorithm for $k$ steps). That is, we assume

$$\theta_{\hat{c}_{j}+1}=\cdots=\theta_{\hat{c}_{j+1}},\quad j\in\{0,\ldots,k\},$$

where for convenience we use $\hat{c}_{0}=0$ and $\hat{c}_{k+1}=n$ . Under this assumption, the law of $Y$ becomes a $(k+1)$ -parameter Gaussian distribution. Additionally, with the contrast vector $v_{j}$ defined as in (3), for any fixed $j=1,\ldots,k$ , the quantity $v_{j}^{T}\theta$ of interest is simply the difference between two of the parameters in this distribution. Assuming $\sigma^{2}$ is known, the sufficient statistics $q(Y)$ for the nuisance parameters in the Gaussian family are simply sample averages of the appropriate data segments, and the selective distribution becomes the law of

[TABLE]

Part of the strength of the selected model is that we can properly treat $\sigma^{2}$ as unknown; in this case, we must only additionally condition on the Euclidean norm of $y_{\mathrm{obs}}$ to cover this nuisance parameter, and the selective distribution becomes the law of

[TABLE]

3 Inference for changepoint algorithms

We describe our contributions that enable post-selection inference for changepoint analyses, beginning with the form of model selection events for common changepoint algorithms. We then describe computational details for saturated and selected model tests, and auxiliary randomization.

3.1 Polyhedral selection events

We show that, for each of the BS, WBS, and CBS algorithms, there is a parametrization for their models such that the event $\{y:M(y)=M(y_{\mathrm{obs}})\}$ is a polyhedron (in fact, a convex cone) of the form $\{y:\Gamma y\geq 0\}$ , for a matrix $\Gamma\in\mathbb{R}^{m\times n}$ that depends on $M(y_{\mathrm{obs}})$ (and we interpret the inequality $\Gamma y\geq 0$ componentwise). Throughout the description of the polyhedra for each algorithm, we display the number of rows in $\Gamma$ since it loosely indicates how “complex” each model selection event is. The same was already shown for FL in Hyun et al. (2018), and we omit details, but briefly comment on it below. Overall, the number of rows of $\Gamma$ is linear in $n$ for FL and BS, quadratic in $n$ for CBS, and $O(Bkp)$ for WBS using intervals of length $p$ . The WBS count can grow faster than linearly in $n$ if $B\geq n$ , which is recommended in practice (Fryzlewicz, 2014).

Selection event for BS.

We define the model for the $k$ -step BS estimator as

[TABLE]

where $\hat{b}_{1:k}(y_{\mathrm{obs}})$ and $\hat{d}_{1:k}(y_{\mathrm{obs}})$ are the changepoint locations and directions when the algorithm is run on $y_{\mathrm{obs}}$ , as described in Section 2.1.

Proposition 1.

Given any fixed $k\geq 1$ and $b_{1:k},d_{1:k}$ , we can explicitly construct $\Gamma$ where

[TABLE]

and where $\Gamma$ has $2\sum_{\ell=1}^{k}(n-\ell-1)$ rows.

Proof.

When $k=1$ , $2(n-2)$ linear inequalities characterize the single changepoint model $\{b_{1},d_{1}\}$ :

[TABLE]

Now by induction, assume we have constructed a polyhedral representation of the selection event up through step $k-1$ . All that remains is to characterize the $k$ th estimated changepoint and direction $\{b_{k},d_{k}\}$ by inequalities that are linear in $y$ . This can be done with $2(n-k-1)$ inequalities. To see this, assume without loss of generality that the maximizing interval is $j_{k}=k$ ; then $\{b_{k},d_{k}\}$ must satisfy the $2(|I_{k}|-2)$ inequalities

[TABLE]

For each interval $I_{\ell}$ , $\ell=1,\ldots,k-1$ , we also have $2(|I_{\ell}|-1)$ inequalities

[TABLE]

The last two displays together completely determine $\{b_{k},d_{k}\}$ , and as $\sum_{\ell=1}^{k}|I_{\ell}|=n$ , we get our desired total of $2(n-k-1)$ inequalities. ∎
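The construction in this proof can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard unit-norm CUSUM contrast for each candidate split, uses 0-indexed positions, and the helper names (`cusum_contrast`, `bs_polyhedron`) are hypothetical. At each step it adds two rows per losing candidate, encoding that the signed winning statistic dominates every other candidate in absolute value.

```python
import math

def cusum_contrast(n, s, b, e):
    """Unit-norm g such that g.y is the CUSUM statistic for splitting
    y[s..e] (0-indexed, inclusive) between positions b and b + 1."""
    n_left, n_right = b - s + 1, e - b
    scale = math.sqrt(1.0 / (1.0 / n_left + 1.0 / n_right))
    g = [0.0] * n
    for i in range(s, b + 1):
        g[i] = -scale / n_left
    for i in range(b + 1, e + 1):
        g[i] = scale / n_right
    return g

def dot(a, y):
    return sum(ai * yi for ai, yi in zip(a, y))

def bs_polyhedron(y, k):
    """Run k steps of binary segmentation and collect the rows of Gamma."""
    n = len(y)
    intervals = [(0, n - 1)]
    changepoints, signs, gamma = [], [], []
    for _ in range(k):
        # Every (interval, split) pair that competes at this step.
        candidates = [(s, b, e) for (s, e) in intervals for b in range(s, e)]
        contrasts = [cusum_contrast(n, s, b, e) for (s, b, e) in candidates]
        stats = [dot(g, y) for g in contrasts]
        win = max(range(len(stats)), key=lambda i: abs(stats[i]))
        d = 1.0 if stats[win] >= 0 else -1.0
        g_win = contrasts[win]
        # d * g_win.y >= +/- g.y for each losing candidate: two rows per loser.
        for i, g in enumerate(contrasts):
            if i != win:
                gamma.append([d * a - c for a, c in zip(g_win, g)])
                gamma.append([d * a + c for a, c in zip(g_win, g)])
        s, b, e = candidates[win]
        intervals.remove((s, e))
        intervals.extend([(s, b), (b + 1, e)])
        changepoints.append(b)
        signs.append(d)
    return changepoints, signs, gamma

# Toy data: the mean jumps up after index 7 and back down after index 13.
y_obs = [(3.0 if 8 <= i <= 13 else 0.0) + 0.01 * math.sin(i) for i in range(20)]
cps, dirs, Gamma = bs_polyhedron(y_obs, k=2)
n_rows_expected = 2 * sum(len(y_obs) - l - 1 for l in range(1, 3))
```

On this toy input the number of rows of `Gamma` agrees with the count $2\sum_{\ell=1}^{k}(n-\ell-1)$ from Proposition 1, and every row satisfies $\Gamma y_{\mathrm{obs}}\geq 0$ up to floating-point error.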

Selection event for WBS.

We define the model of the $k$ -step WBS estimator as

[TABLE]

where $w$ is the set of $B$ intervals that the algorithm uses, $\hat{b}_{1:k}(y_{\mathrm{obs}})$ and $\hat{d}_{1:k}(y_{\mathrm{obs}})$ are the changepoint locations and directions, and $\hat{j}_{1:k}(y_{\mathrm{obs}})$ are the maximizing intervals.

Proposition 2.

Given any fixed $k\geq 1$ , and $\{w,b_{1:k},d_{1:k},j_{1:k}\}$ , we can explicitly construct $\Gamma$ where

[TABLE]

The number of rows in $\Gamma$ will vary depending on the configuration of $w$ and $b_{1:k}$ , but if each of the $B$ intervals in $w$ has length $p$ , it will be at most $2\sum_{\ell=1}^{k}((B-\ell)\cdot(p-1)+(p-2))$ .

The proof of Proposition 2 is only slightly more complicated than that of Proposition 1, and is deferred until Appendix A. Note that unlike BS, the maximizing intervals $\hat{j}_{1:k}$ are part of WBS’s model.
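To get a feel for how the row counts in Propositions 1 and 2 compare, the two expressions can be evaluated directly. This is only arithmetic on the stated formulas; the function names and the example values of $B$ and $p$ are illustrative assumptions.

```python
def bs_rows(n, k):
    """Number of rows of Gamma for k-step BS: 2 * sum_{l=1}^{k} (n - l - 1)."""
    return 2 * sum(n - l - 1 for l in range(1, k + 1))

def wbs_rows_upper_bound(B, p, k):
    """Upper bound for k-step WBS with B intervals, each of length p:
    2 * sum_{l=1}^{k} ((B - l) * (p - 1) + (p - 2))."""
    return 2 * sum((B - l) * (p - 1) + (p - 2) for l in range(1, k + 1))

# Example: n = 200 points and k = 2 steps, with B = n intervals of length 50.
rows_bs = bs_rows(200, 2)
rows_wbs = wbs_rows_upper_bound(200, 50, 2)
```

With these illustrative values, the WBS bound is roughly fifty times the BS count, which matches the remark that the WBS selection event can grow faster than linearly in $n$ when $B\geq n$.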

Selection event for CBS.

Finally, we define the model for the $k$ -step CBS estimator as

[TABLE]

where now $\hat{a}_{1:k}(y_{\mathrm{obs}})$ and $\hat{b}_{1:k}(y_{\mathrm{obs}})$ are the pairs of estimated changepoint locations, and $\hat{d}_{1:k}(y_{\mathrm{obs}})$ are the changepoint directions, as described in Section 2.1.

Proposition 3.

Given any fixed $k\geq 1$ and $\{a_{1:k},b_{1:k},d_{1:k}\}$ , we can explicitly construct $\Gamma$ where

[TABLE]

Let $I_{j}^{(\ell)}$ denote the $j$ th interval formed and $j_{\ell}$ be the selected interval defined in (5) for an intermediate step $\ell\in\{1,\ldots,k\}$ , and let $C(x,2)={x\choose 2}$ . Then $\Gamma$ has a number of rows equal to

[TABLE]

The proof of Proposition 3 is only slightly more complicated than that of Proposition 1, and is deferred until Appendix A.

Selection events for FL, and a brief comparison.

The model for the $k$ -step FL estimator is:

[TABLE]

where $\hat{b}_{1:k}(y)$ and $\hat{d}_{1:k}(y)$ are changepoint locations and directions, and $\hat{R}_{\ell}(y)\in\mathbb{R}^{n-\ell}$ , $\ell=1,\ldots,k$ , are vectors whose elements represent the signs of a certain statistic $h_{i}(y)$ calculated at each location $i$ in competition for maximization with $\hat{b}_{\ell}$ at step $\ell$ . These statistics $h_{i}(y)$ are weighted mean differences at location $i$ and are analogous to CUSUM statistics in BS. Hyun et al. (2018) make this representation more explicit, proving that for any fixed $k\geq 1$ and $b_{1:k},d_{1:k},R_{1:k}$ , we can explicitly construct $\Gamma$ such that

[TABLE]

where $\Gamma$ has the same number of rows as a $k$ -step BS event.

3.2 Computation of p-values

Given a precise description of the polyhedral selection event $\{y:M(y)=M(y_{\mathrm{obs}})\}$ , we can describe the methods to compute the p-value, i.e., the tail probability of the selective distributions described in Section 2. Without loss of generality, all of our descriptions will be specialized to testing the null hypothesis $H_{0}:v^{T}\theta=0$ against the one-sided alternative $H_{1}:v^{T}\theta>0$ . For saturated model tests, this exact calculation has been developed in previous work, and we review it because it is relevant to our contributions on increasing its power. For selected model tests, an approximation was described in previous work, but we develop a new hit-and-run sampler that has not been implemented before.

Saturated model tests: exact formulae.

As shown in Lee et al. (2016) and Tibshirani et al. (2016), the saturated selective distribution (9) takes a particularly convenient form when $Y$ is Gaussian and the model selection event $\{y:M(y)=M(y_{\mathrm{obs}})\}$ is a polyhedral set in $y$ . In this case, the law of (9) is a truncated Gaussian (TG), whose truncation limits depend only on $\Pi_{v}^{\perp}y_{\mathrm{obs}}$ , and can be computed explicitly. Its tail probability can be computed in closed form (without Monte Carlo sampling). That is, the probability that $v^{T}Y\geq v^{T}y_{\mathrm{obs}}$ under the law of (9) is exactly equal to

[TABLE]

where $\Phi(\cdot)$ represents the standard Gaussian CDF, $\tau=\sigma^{2}\|v\|^{2}_{2}$ , $\rho=\Gamma v/\|v\|^{2}_{2}$ and

[TABLE]

The quantity above is commonly referred to as the TG statistic. Since this statistic is a pivot, it is the p-value used for the saturated model test.
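The TG computation can be sketched in a few lines. The code below is a minimal illustration under stated assumptions rather than the authors' implementation: it uses the usual polyhedral-lemma truncation limits from Lee et al. (2016) and Tibshirani et al. (2016) for the cone $\{y:\Gamma y\geq 0\}$, works with the standard deviation $\sigma\|v\|_{2}$ (the square root of $\tau$), and the function names are hypothetical.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def truncation_limits(gamma, v, y):
    """Interval [v_lo, v_up] for v.Y implied by Gamma Y >= 0 when the
    projection of Y orthogonal to v is held at its observed value."""
    vy, vv = dot(v, y), dot(v, v)
    v_lo, v_up = -math.inf, math.inf
    for row in gamma:
        rho = dot(row, v) / vv       # entry of rho = Gamma v / ||v||^2
        slack = dot(row, y)          # nonnegative on the selection event
        if rho > 0:
            v_lo = max(v_lo, vy - slack / rho)
        elif rho < 0:
            v_up = min(v_up, vy - slack / rho)
    return v_lo, v_up

def tg_pvalue(gamma, v, y, sigma):
    """One-sided truncated Gaussian p-value for H0: v.theta = 0."""
    sd = sigma * math.sqrt(dot(v, v))
    v_lo, v_up = truncation_limits(gamma, v, y)
    upper = std_normal_cdf(v_up / sd)
    numerator = upper - std_normal_cdf(dot(v, y) / sd)
    denominator = upper - std_normal_cdf(v_lo / sd)
    return numerator / denominator

# Toy check: selection event {y_1 >= 0}, contrast v = e_1, observed y_1 = 1.
p_selected = tg_pvalue([[1.0, 0.0]], [1.0, 0.0], [1.0, 0.3], sigma=1.0)
# With no selection constraints the p-value reduces to the usual z-test.
p_unselected = tg_pvalue([], [1.0, 0.0], [1.0, 0.3], sigma=1.0)
```

In the toy check, conditioning on $y_{1}\geq 0$ doubles the unconditional one-sided p-value, since the truncation removes half of the null distribution's mass. Differences of Gaussian CDFs lose precision far in the tails, so a production implementation would need more careful numerics.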

Selected model tests: hit-and-run sampling.

To compute the p-value for selected model tests, Fithian et al. (2015) proposed a hit-and-run strategy for sampling from the distribution for the known $\sigma^{2}$ setting, (10). This was implemented by the authors, and we briefly review the details in Appendix B. For the unknown $\sigma^{2}$ setting, Fithian et al. (2014) suggested an importance sampling strategy for sampling the distribution (11). However, we find that an intuitive hit-and-run strategy can be adapted to the unknown $\sigma^{2}$ setting and implement this as a new algorithm.

Given a changepoint $j=1,\ldots,k$ , observe that we can design a segment test contrast $v$ where sampling from (11) is equivalent to sampling uniformly from the set

[TABLE]

Note that the above set no longer depends on $\theta$ or $\sigma^{2}$ . This is because we conditioned on all the relevant sufficient statistics under the selected model. Our hit-and-run sampler then sequentially draws samples $v^{T}Y$ from the above set. For notational convenience, observe that the last $k$ constraints in (14) can be rewritten as $AY=Ay_{\mathrm{obs}}$ for some matrix $A\in\mathbb{R}^{k\times n}$ . Our new hit-and-run algorithm is then shown in Algorithm 1.
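One way to picture the matrix $A$ is as a stack of segment-averaging rows. The sketch below is an illustration under assumptions, not the paper's construction: it takes the segment test contrast to be the right-segment mean minus the left-segment mean, and it assumes that under $H_{0}:v_{j}^{T}\theta=0$ the two segments adjacent to the tested changepoint merge, which leaves $k$ segments and hence $k$ rows. All function names are hypothetical, and changepoints follow the conventions $\hat{c}_{0}=0$ and $\hat{c}_{k+1}=n$.

```python
def segment_bounds(changepoints, n):
    """Half-open 0-indexed segments [start, end) for the given changepoints,
    using the conventions c_0 = 0 and c_{k+1} = n."""
    bounds = [0] + sorted(changepoints) + [n]
    return list(zip(bounds[:-1], bounds[1:]))

def segment_contrast(changepoints, n, j):
    """Mean of the segment to the right of the j-th changepoint minus the
    mean of the segment to its left (j is 0-indexed)."""
    segs = segment_bounds(changepoints, n)
    (ls, le), (rs, re) = segs[j], segs[j + 1]
    v = [0.0] * n
    for i in range(ls, le):
        v[i] = -1.0 / (le - ls)
    for i in range(rs, re):
        v[i] = 1.0 / (re - rs)
    return v

def null_averaging_matrix(changepoints, n, j):
    """Rows average y over the segments that remain once the j-th
    changepoint is removed (assumed form of the null model)."""
    kept = sorted(changepoints)
    kept = kept[:j] + kept[j + 1:]
    return [[1.0 / (end - start) if start <= i < end else 0.0
             for i in range(n)]
            for start, end in segment_bounds(kept, n)]

# Toy example: n = 10, changepoints after positions 3 and 7, test the first.
cpts = [3, 7]
v = segment_contrast(cpts, 10, j=0)
A = null_averaging_matrix(cpts, 10, j=0)
Av = [sum(a * b for a, b in zip(row, v)) for row in A]
```

Under these assumptions every row of `A` is orthogonal to `v`, so fixing $AY=Ay_{\mathrm{obs}}$ leaves the test statistic $v^{T}Y$ free to vary.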

3.3 Randomization and marginalization

We apply the randomization ideas of Tian and Taylor (2015), which improve the power of selective inference, to changepoint algorithms, and devise explicit samplers. We investigate two specific forms of randomization: randomization over additive noise and randomization over random intervals. We specialize the following descriptions to saturated models. We note that similar randomization of selected model inferences is also possible but is doubly computationally burdensome.

Marginalization over additive noise.

Tian and Taylor (2015) show that performing inference based on the selected model $M(y_{\mathrm{obs}}+w_{\mathrm{obs}})$ , where $w_{\mathrm{obs}}$ is additive noise, and then marginalizing over $W$ leads to improved power. Here, $w_{\mathrm{obs}}$ is a realization of a random component $W$ sampled from $\mathcal{N}(0,\sigma_{\text{add}}^{2}I_{n})$ , where $\sigma_{\text{add}}^{2}>0$ is set by the user. Fithian et al. (2014) provide a mathematical basis for pursuing such randomization, stating that less conditioning results in an increase in Fisher information. For additive noise, the above model selection event is:

[TABLE]

This means the new polyhedron formed by the model selection event based on perturbed data $y_{\mathrm{obs}}+w_{\mathrm{obs}}$ is slightly shifted.

Porting the ideas of Tian and Taylor (2015) to our setting, to test the one-sided null hypothesis $H_{0}:v^{T}\theta=0$ , we want to compute the following tail probability of the marginalized selective distribution,

[TABLE]

It is hard to directly compute this. However, the formulas in (12) and (13) give us exact formulas to compute the non-marginalized tail-probabilities,

[TABLE]

The following proposition shows that we can compute $T(y_{\mathrm{obs}},v)$ by reweighting instances of $T(y_{\mathrm{obs}},v,w_{\mathrm{obs}})$ via importance sampling. Here, let $E_{1}=\mathds{1}[M(Y+W)=M(y_{\mathrm{obs}}+W)]$ and $E_{2}=\mathds{1}[\Pi_{v}^{\perp}Y=\Pi_{v}^{\perp}y_{\text{obs}}]$ .

Proposition 4.

Let $\Omega$ denote the support of the random component $W$ . If the distribution of $W$ is independent of the random event $E_{2}$ , (15) can be exactly computed as

[TABLE]

where the weighting factor is $a(w_{\mathrm{obs}})=\mathbb{P}(W=w_{\mathrm{obs}}|E_{1},E_{2})/\mathbb{P}(W=w_{\mathrm{obs}})$ .

The first equality in (16) demonstrates the reweighting of $T(y_{\mathrm{obs}},v,w_{\mathrm{obs}})$ , while the second equality gives a sampling strategy in which we approximate the integrals. Algorithm 2 describes this, where for one realization $w_{\mathrm{obs}}$ , we let $k(w_{\mathrm{obs}})$ and $g(w_{\mathrm{obs}})$ denote the integrands of the last term’s numerator and denominator in (16), respectively.
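The ratio form of the estimator can be sketched as follows. This is a simplified illustration under explicit assumptions, not the authors' sampler: it holds $\Gamma$ fixed at the matrix for the observed model, treats the selection event for a draw $w$ as the shifted cone $\{y:\Gamma(y+w)\geq 0\}$, draws $W\sim\mathcal{N}(0,\sigma_{\text{add}}^{2}I_{n})$, and returns the sum of sampled numerators divided by the sum of sampled denominators. The function names are hypothetical.

```python
import math
import random

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def tg_pieces(gamma, v, y_obs, w, sigma):
    """Numerator k(w) and denominator g(w) of the TG tail probability for
    the shifted event {y : Gamma (y + w) >= 0}."""
    vy, vv = dot(v, y_obs), dot(v, v)
    sd = sigma * math.sqrt(vv)
    shifted = [yi + wi for yi, wi in zip(y_obs, w)]
    v_lo, v_up = -math.inf, math.inf
    for row in gamma:
        rho = dot(row, v) / vv
        slack = dot(row, shifted)
        if rho > 0:
            v_lo = max(v_lo, vy - slack / rho)
        elif rho < 0:
            v_up = min(v_up, vy - slack / rho)
        elif slack < 0:
            return 0.0, 0.0          # the event is empty for this draw
    if v_up <= v_lo:
        return 0.0, 0.0
    upper = std_normal_cdf(v_up / sd)
    g_w = upper - std_normal_cdf(v_lo / sd)
    # Mass of the truncation interval lying at or above the observed v.y.
    k_w = max(0.0, upper - std_normal_cdf(max(v_lo, vy) / sd))
    return k_w, g_w

def marginalized_pvalue(gamma, v, y_obs, sigma, sigma_add, n_draws, rng):
    """Ratio-of-sums Monte Carlo estimate of the marginalized p-value."""
    num = den = 0.0
    for _ in range(n_draws):
        w = [rng.gauss(0.0, sigma_add) for _ in y_obs]
        k_w, g_w = tg_pieces(gamma, v, y_obs, w, sigma)
        num += k_w
        den += g_w
    return num / den if den > 0 else float("nan")

toy = dict(gamma=[[1.0, 0.0]], v=[1.0, 0.0], y_obs=[1.0, 0.3], sigma=1.0)
p_marg = marginalized_pvalue(sigma_add=0.5, n_draws=2000,
                             rng=random.Random(0), **toy)
p_limit = marginalized_pvalue(sigma_add=1e-9, n_draws=10,
                              rng=random.Random(0), **toy)
```

As a sanity check on the sketch, letting $\sigma_{\text{add}}$ shrink toward zero recovers the plain (non-randomized) TG p-value for the same toy event. Taking a ratio of sums, rather than averaging per-draw p-values, is what implements the reweighting by $a(w)$.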

Marginalization over WBS intervals.

In contrast to the above setting where $W$ represents Gaussian noise, in wild binary segmentation described in Section 2.1, $W$ represents the set of $B$ randomly drawn intervals. Observe that Proposition 4 still applies to this setting, where $M(y_{\mathrm{obs}}+w_{\mathrm{obs}})$ is now replaced with $M(y_{\mathrm{obs}},w_{\mathrm{obs}})$ , as described in Section 3.1. However, one additional complication is that the maximizing intervals $\hat{j}_{1:k}$ in the model $M(y_{\mathrm{obs}},w_{\mathrm{obs}})$ are embedded in the construction of the matrix $\Gamma$ representing the polyhedron. This prevents a naive resampling of all $B$ intervals.

We describe how to overcome this complication. Let $\{W_{\hat{j}_{1}},\ldots,W_{\hat{j}_{k}}\}$ be the maximizing intervals. We resample all other intervals, $W_{\ell}$ for $\ell\in\{1,\ldots,B\}\backslash\{\hat{j}_{1},\ldots,\hat{j}_{k}\}$ . Specifically, for each such interval $W_{\ell}=(s_{\ell},\ldots,e_{\ell})$ , the endpoints $s_{\ell}$ and $e_{\ell}$ are sampled uniformly from $1$ to $n$ subject to $s_{\ell}<e_{\ell}$ . After all $B-k$ intervals are resampled, a check is performed to ensure that $\{W_{\hat{j}_{1}},\ldots,W_{\hat{j}_{k}}\}$ are still the maximizing intervals when WBS is applied again to $y_{\mathrm{obs}}$ . The full algorithm is in Algorithm 3.
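The resampling step alone can be sketched as below. This is a hypothetical helper, not the authors' code: it keeps the maximizing intervals fixed and redraws the rest by sampling two distinct endpoints from $\{1,\ldots,n\}$ and sorting them, which is uniform over pairs with $s<e$. The subsequent check that the kept intervals are still maximizing requires rerunning WBS and is left to the caller.

```python
import random

def resample_intervals(intervals, maximizing_idx, n, rng):
    """Redraw every non-maximizing interval (s, e) with 1 <= s < e <= n,
    leaving the intervals listed in maximizing_idx untouched."""
    keep = set(maximizing_idx)
    resampled = []
    for idx, interval in enumerate(intervals):
        if idx in keep:
            resampled.append(interval)
        else:
            # Two distinct endpoints, sorted, give a uniform pair with s < e.
            s, e = sorted(rng.sample(range(1, n + 1), 2))
            resampled.append((s, e))
    return resampled

# Toy example with B = 4 intervals on n = 10 points; interval 1 is maximizing.
original = [(1, 5), (3, 9), (2, 4), (6, 10)]
new_intervals = resample_intervals(original, maximizing_idx=[1], n=10,
                                   rng=random.Random(0))
```

In a full sampler, a draw would be used only if rerunning WBS on $y_{\mathrm{obs}}$ with `new_intervals` selects the same maximizing intervals.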

4 Practicalities and extensions

The above sections formalize the mechanisms to perform selective inference with respect to the basic procedure highlighted in Section 1. We now briefly summarize the combinations of choices that the user faces based on the methods developed in the above sections, and their practical impact.

4.1 Practical considerations

There are some practical choices that the user needs to make when implementing the procedure. Here, we outline a few, each related with a key element of the broader inference procedure.

•

Algorithm (BS, WBS, CBS and FL): It is useful for the user to be able to compare algorithms. CBS is specialized for pairs of changepoints, and WBS specializes in localized changepoint detection compared to BS. FL and BS have similar mechanisms, which sequentially admit changepoints by maximizing a statistic. However, BS has a simpler mechanism and a less complex selection event, potentially giving higher post-selection conditional power.

•

Conditioning (Plain or marginalized): Marginalizing over a source of randomness yields tests with higher power than plain inference, but at two costs: increased computational burden due to MCMC sampling being required, and worsened detection ability when using additive noise marginalization. Also, the marginalized p-values are subject to the sampling randomness, and the number of trials $T$ needed to reduce the p-values’ intrinsic variability scales with $\sigma^{2}_{\text{add}}$ .

•

Number of estimated changepoints $k$ (Fixed or data-driven): As currently described in Section 2.1, the changepoint algorithms discussed in our paper require the user to pre-specify the number of estimated changepoints $k$ . However, we can adopt local stopping rules from Hyun et al. (2018) to adaptively choose $k$ . This variation increases the complexity of the polyhedra compared to those in Section 3.1, leading to lower statistical power than its fixed- $k$ counterpart. This is shown in Appendix D.

•

Assumed null model (Saturated or selected): As mentioned in Section 2, selected model tests are valid under a stricter set of assumptions but often yield higher power. Computationally, saturated model tests are often simpler to perform than selected model tests due to the closed-form expression of the tail probability.

•

Error variance $\sigma^{2}$ (Known or unknown): Saturated model tests require $\sigma^{2}$ to be known. In practice, we need to estimate it in-sample from a reasonable changepoint mean fitted to the same data, or estimate it out-of-sample on left-out data. Selected model tests have the advantage of not requiring knowledge of $\sigma^{2}$ .

4.2 Extensions

As mentioned in Hyun et al. (2018), there are many practically-motivated extensions to the baseline procedure mentioned in Section 1 to either improve power or interpretability. We highlight these below. All of these extensions will still give proper Type-I error control under the appropriate null hypotheses.

•

Designing linear contrasts: The user can construct many types of contrast vectors $v$ to fit their analysis, in addition to the segment test contrasts (3), as long as they are measurable with respect to $M(y_{\mathrm{obs}})$ . One example is the spike test from Hyun et al. (2018) of single location mean changes. For CNV analysis, it could be useful to test regions between an adjacent pair of changepoints away from the immediately surrounding regions. Also, a step-sign plot (a plot that shows the locations and direction of the changepoints, but not their magnitude) can help the user design contrasts (Hyun et al., 2018).

•

Post-processing the estimated changepoints: Multiple detected changepoints too close to one another can hurt the power of segment tests. Post-processing the estimated changepoints based on decluttering (Hyun et al., 2018) or filtering (Lin et al., 2017), so that the new set of changepoints is well separated, can lead to contrasts that yield higher power. We show empirical evidence that this improves the power of the fused lasso in Appendix C.1.

•

Pre-cutting: We can also modify all the algorithms in Section 2.1 to start with an initial existing set of changepoints. This is useful in CGH analyses, when it is not meaningful to consider segments that start in one chromosome and end in another. By pooling information in this manner from separate chromosomal regions, the pre-cut analysis is an improvement over conducting separate analyses in individual chromosomes.

5 Simulations

5.1 Gaussian simulations

In this section, we show simulation examples to demonstrate properties of the segmentation post-selection inference tools presented in the current paper. The mean $\theta$ consists of two alternating-direction changepoints of size $\delta$ in the middle as in (17), chosen to be a realistic example of mutation phenomena as observed in array CGH datasets (Snijders et al., 2001). We vary the signal size $\delta\in(0,4)$ , while generating Gaussian data from a fixed noise level $\sigma^{2}=1$ .

This is the duplication mutation scenario. The sample size $n=200$ is chosen to be in the scale of the chromosomal data. An example of this synthetic dataset can be seen in Figure 2.

[TABLE]

Methodology.

In the following simulations, we consider four estimators (BS, WBS, CBS and FL), each run for two steps. From each, we perform both saturated and selected model tests. For the latter, we only include the results of BS and FL for simplicity, for both settings of known and unknown noise parameter $\sigma^{2}$ . We use the basic procedure outlined in Section 1 with a significance level of $\alpha=0.05$ . We verify the Type-I error control of our methods next. Throughout the entire simulation suite to come, the standard deviation in each of the power curves and detection probabilities is less than 0.02. For each method, for each signal-to-noise size $\delta$ , we run more than 250 trials.

Type-I error control verification.

We examine all our statistical inferences under the global null where $\theta=0$ to demonstrate their validity, that is, uniformity of null p-values (Type-I error control). Specifically, any simulations from the no-signal regime $\delta=0$ from the middle mutation (17) can be used. When there is no signal, the null scenario $v^{T}\theta=0$ is always true, so we expect all p-values to be uniformly distributed between 0 and 1. We verify this expected behavior in Figure 3. We notice that the methods that require MCMC (marginalized saturated and selected model tests) require more trials to converge towards the uniform distribution compared to their counterparts that have exact calculations.

Calculating power.

Since the tests are performed only when a changepoint is selected, it is necessary to separate the detection ability of the estimator from power of the test. To that end, we define the following quantities,

[TABLE]

The overall power of an inference tool can only be assessed by examining the conditional and unconditional power together. We consider a detection to be correct if it is within $\pm 2$ of the true changepoint locations.

Power comparison across signal sizes $\delta$ .

For saturated model tests, we perform additive-noise inferences using Gaussian $\mathcal{N}(0,\sigma_{\text{add}}^{2})$ with $\sigma_{\text{add}}=0.2$ for BS, FL, and CBS. For WBS, we employ the randomization scheme as described in Section 3.3 with $B=n$ . With the metrics in (19)-(20), we examine the performance of the four methods. The solid lines in Figure 4 show the “plain” method, where model selection is based on $M(y_{\mathrm{obs}})$ . The dotted lines show the marginalized counterparts, where the model selection is $M(y_{\mathrm{obs}},W)$ , marginalized over $W$ .

WBS and CBS have higher conditional and unconditional power than BS. This is as expected since the former two are better suited to localized changepoints of alternating directions. FL noticeably under-performs in power compared to segmentation methods. This is partially caused by FL’s detection behavior, and can be explained by examining alternative measures of detection and improved with post-processing. This investigation is deferred to Appendix C.1. The marginalized versions of each algorithm have noticeably improved power, but almost unnoticeably worse detection than their non-randomized, plain versions (middle panel of Figure 4). Combined, in terms of unconditional power, marginalized inferences clearly dominate their plain counterparts.

Selected model inference simulations are shown in Figure 5. Surprisingly, the power is almost indistinguishable between the unknown $\sigma^{2}$ and known $\sigma^{2}$ settings. Compared to the saturated model tests in Figure 4, there is a smaller power gap between FL and BS. Also, selected model tests appear to have higher power than saturated model tests. In general, however, it is hard to compare the power of saturated and selected models due to the clear difference in model assumptions.

Comparison with sample-splitting.

Sample splitting is another valid inference technique. After splitting the dataset in half based on even and odd indices, we run a changepoint algorithm on one half and conduct a classical one-sided t-test on the other. This is the most comparable test, as it does not assume $\sigma^{2}$ is known and conducts a one-sided test of the null $H_{0}:v^{T}\theta=0$ . Instead of the $\pm 2$ slack used for calculating detection in selective inference (dotted and dashed lines), a slack of $\pm 1$ was used for sample splitting inference (solid line). The loss in detection accuracy in the middle panel of Figure 6 shows the downside of halving data size for detection. Unconditional power for marginalized saturated model tests and selected model tests is noticeably higher than for the other two.

5.2 Pseudo-real simulation with heavy tails

We present pseudo-real datasets based on a single chromosome (chromosome 9 in GM01750) in order to investigate how heavy-tailed distributions affect our inferences. We only present saturated model tests for brevity. From the original data, we estimate a 1-changepoint mean $\theta$ , shown in the bold red line in Figure 7, and residuals $r$ , both based on a fitted 1-step wild binary segmentation model. The QQ plot shows that these residuals have heavier tails than a Gaussian (top middle panel of Figure 7), and are close in distribution to a Laplacian. This motivates us to generate synthetic data $y=\theta+\epsilon$ by adding noise $\epsilon$ in three ways:

1. Gaussian noise $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$ (black),

2. Laplace noise $\epsilon\sim\operatorname{Laplace}(0,\sigma/\sqrt{2})$ (green), and

3. Bootstrapped residuals, $\epsilon=b(r)$ , where $b(\cdot)$ samples the residuals with replacement (red).
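The three noise mechanisms can be sketched with the standard library alone. This is an illustrative sketch with hypothetical function names and toy inputs, not the code used for the paper's experiments. The Laplace scale $\sigma/\sqrt{2}$ is chosen so that the variance, $2b^{2}$ for scale $b$, matches $\sigma^{2}$.

```python
import math
import random

def gaussian_noise(n, sigma, rng):
    """n independent N(0, sigma^2) draws."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def laplace_noise(n, sigma, rng):
    """n Laplace(0, b) draws with b = sigma / sqrt(2), so the variance
    2 * b^2 equals sigma^2."""
    b = sigma / math.sqrt(2.0)
    # A Laplace variable is an exponential magnitude (mean b) with a random sign.
    return [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / b)
            for _ in range(n)]

def bootstrap_noise(residuals, n, rng):
    """n draws from the empirical residuals, sampled with replacement."""
    return rng.choices(residuals, k=n)

def synthesize(theta, noise):
    """Synthetic data y = theta + epsilon."""
    return [t + e for t, e in zip(theta, noise)]

# Toy stand-ins for the fitted 1-changepoint mean and its residuals.
rng = random.Random(0)
theta = [0.0] * 50 + [1.0] * 50
residuals = laplace_noise(len(theta), 0.5, rng)
y_gauss = synthesize(theta, gaussian_noise(len(theta), 0.5, rng))
y_laplace = synthesize(theta, laplace_noise(len(theta), 0.5, rng))
y_boot = synthesize(theta, bootstrap_noise(residuals, len(theta), rng))
```

Matching the variance across the Gaussian and Laplace cases isolates the effect of tail weight from the effect of noise level.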

We then investigate the behavior of saturated model tests after a 3-step binary segmentation across all three types of noise when the null hypothesis $H_{0}:v^{T}\theta=0$ is true. To set $\sigma^{2}$ for these saturated model tests, we compute the empirical variance after fitting a pre-cut 10-step wild binary segmentation across the entire cell line. The results are shown in Figure 7. Exactly valid null p-values would follow the theoretical $U(0,1)$ distribution, optimistic (superuniform) p-values would lie below the diagonal, and conservative (subuniform) p-values would lie above the diagonal. We see that the inferences are exactly valid with Gaussian noise but are optimistic with both Laplacian noise and bootstrapped residuals (panel B of Figure 7).

To overcome this optimism, we modify the bootstrap substitution method (Tibshirani et al., 2018). Let $\beta$ denote $\bar{\theta}$ , the grand mean of $\theta$ . The authors’ original idea is to approximate the law of $v^{T}Y$ used to construct the TG statistic (12) with the bootstrapped distribution of $v^{T}(Y-\beta)$ by bootstrapping the residuals, $y-\bar{y}$ . Here, the empirical grand mean $\bar{y}$ represents the simplest model with no changepoints. While this estimate will usually restore validity, it is expected to produce overly conservative p-values if there exist any changepoints (panel C of Figure 7).

Hence, we instead consider the bootstrapped distribution of $v^{T}(Y-\theta)$ , by bootstrapping the residuals, $y-\hat{\theta}$ , where $\hat{\theta}$ is a piecewise constant estimate of $\theta$ . For our instance, we use a $k$ -step binary segmentation model to estimate $\hat{\theta}$ , where we choose $k$ using two-fold cross validation from a two-fold split of the data $y$ into odd and even indices. This procedure is not valid in general and should be used with caution. In order to combat the main risk of over-fitting of $\hat{\theta}$ , we may further modify this procedure by excluding shorter segments in $\hat{\theta}$ prior to bootstrapping. For our dataset, these potential downsides do not seem to come to fruition in practice. At the sample size $n\simeq 100$ and signal-to-noise ratio of our current dataset, the resulting p-values in both heavy-tailed and Gaussian data are convincingly uniform (panel D of Figure 7).

6 Copy Number Variation (CNV) data application

Array CGH analyses detect changes in expression levels (measured as a log ratio in fluorescence intensity between test and reference samples) across the genome. Aberrations found are linked with the presence of a wide range of genetically driven diseases, such as many types of cancer, Alzheimer’s disease, and autism; see, e.g., Consortium et al. (2008); Bochukova et al. (2010).

The datasets we study in this paper are originally from Snijders et al. (2001), and have been studied by numerous works in the statistics literature, e.g., Hao et al. (2013); Lai et al. (2008). The datasets consist of individual cell lines, each with $2,000$ measurements or more across 23 chromosomes. Our analysis focuses on middle-to-middle duplication, the setting that was studied in Section 5.

In our analysis, we use a 4-step wild binary segmentation and perform marginalized saturated model tests on two cell lines, GM01524 and GM01750, in Figure 8. Recall that the 14th chromosome of the latter cell line was shown in Figure 1. As described in Section 4, we pre-cut both analyses at chromosome boundaries since the ordering of 1 through 23 is essentially arbitrary. In GM01524, we can see that our choice of methods (segment test inferences on changepoints recovered from pre-cut wild binary segmentation, after decluttering) deems two changepoint locations A and B of alternating directions in chromosome 6 to be significant, and two other locations to be spurious, at the significance level $\alpha=0.05$ after Bonferroni correction. This result is consistent with karyotyping results of a single middle-to-middle duplication. Likewise, in GM01750, the wild binary segmentation inference correctly identified the two start-to-middle duplications in chromosomes 9 and 14 which were confirmed with karyotyping, and correctly invalidated the rest.

7 Conclusions

We have described an approach to conduct post-selection inference on changepoints detected by common segmentation algorithms, using the same data for detection and testing. Through simulations, we demonstrated the detection probability and power over signal-to-noise ratios in a variety of settings, as well as our tools’ robustness to heavy-tailed data. Finally, we demonstrated the application in array CGH data, where we show that our methods effectively provide a statistical filter that retains the changepoints that are validated by karyotyping and discards the rest.

Future work in this area could improve the practical applicability of these methods. One useful extension would be to incorporate more complex and realistic noise models. For example, the selected model testing framework can be extended to include other exponential family models. The methodology for inference after changepoint detection may also be extended to multiple streams of copy number variation data in order to make more powerful inferences about changepoint locations. These and other methodological extensions can be useful for newer types of copy number variation data from recent technology, such as next-generation sequencing.

8 Code and supplemental material

The code to perform estimation as well as saturated model tests is in https://github.com/robohyun66/binseginf, while the code to perform selected model tests is additionally in https://github.com/linnylin92/selectiveModel.

The following is a brief summary of the supplements. Appendix A contains the proofs omitted from the main text. Appendix B contains the algorithmic details for the selected model test sampler in the known $\sigma^{2}$ setting. Appendix C contains numerous additional simulation results and details. Appendix D contains a description of the procedure to choose $k$ adaptively and its corresponding simulation results. Appendix E contains additional results on our array CGH application.

9 Acknowledgment

The authors used Pittsburgh Supercomputing Center resources (Proposal/Grant Number: DMS180016P). Sangwon Hyun was supported by NSF grants DMS-1554123 and DMS-1613202. Max G’Sell was supported by NSF grant DMS-1613202. Ryan Tibshirani was supported by NSF grant DMS-1554123.

Appendix A Additional proofs

A.1 Proof of Proposition 2 (WBS)

Proof.

The construction of $\Gamma$ is basically the same as that for BS in Proposition 1; the only difference is that, at step $k$ , the inequalities defining the new rows of $\Gamma$ are based on the intervals $w_{j_{k}}$ and $w_{\ell}$ , $\ell\in J_{k}\backslash\{j_{k}\}$ , instead of $I_{j_{k}}$ and $I_{\ell}$ , $\ell\neq j_{k}$ , respectively. To compute the upper bound on the number of rows $m$ , observe that in step $\ell\in\{1,\ldots,k\}$ , there are at most $B-\ell+1$ intervals remaining. Among these, the maximizing interval $j_{\ell}$ contributes $p-2$ inequalities, and each of the remaining $B-\ell$ intervals contributes $p-1$ inequalities, for each of the two signs. ∎

A.2 Proof of Proposition 3 (CBS)

Proof.

The proof follows similarly to the proof of Proposition 1. Observe that for any $k^{\prime}<k$ , the model $M^{\mathrm{CBS}}_{1:k^{\prime}}(y_{\mathrm{obs}})$ is strictly contained in the model $M^{\mathrm{CBS}}_{1:k}(y_{\mathrm{obs}})$ . Hence, we can proceed using induction, and let $b_{i}$ for $i\in\{1,\ldots,k\}$ denote $\hat{b}_{i}$ for simplicity, and do the same for $a_{i}$ , $d_{i}$ and $j_{i}$ . Let $C(x,2)={x\choose 2}$ for simplicity as well.

For $k=1$ , the following $2\cdot(C(n-1,2)-1)$ inequalities characterize the selection of the changepoint model $\{a_{1},b_{1},d_{1}\}$ ,

[TABLE]

for all $r,t\in\{1,\ldots,n-1\}$ where $r<t$ , $r\neq a_{1}$ and $t\neq b_{1}$ .

By induction, assume we have constructed the polyhedron for the model, $M^{\mathrm{CBS}}_{1:(k-1)}(y_{\mathrm{obs}})=\{a_{1:(k-1)},b_{1:(k-1)},d_{1:(k-1)}\}$ . To construct $M^{\mathrm{CBS}}_{1:k}(y_{\mathrm{obs}})$ , all that remains is to characterize the $k$ th parameters $\{a_{k},b_{k},d_{k}\}$ . To do this, assume that $j_{k}$ corresponds to the interval $I_{j_{k}}$ having the form $\{s_{k},\ldots,e_{k}\}$ . Within this interval, we form the first $2\cdot(C(|I_{j_{k}}|-1,2)-1)$ inequalities of the form,

[TABLE]

for all $r,t\in\{s_{k},\ldots,e_{k}-1\}$ with $r<t$ and $(r,t)\neq(a_{k},b_{k})$. The remaining inequalities originate from the remaining intervals. For each $\ell\in\{1,\ldots,2k-1\}\backslash\{j_{k}\}$, let the interval $I_{\ell}$ have the form $\{s_{\ell},\ldots,e_{\ell}\}$. We form the next $2\cdot C(|I_{\ell}|-1,2)$ inequalities of the form

[TABLE]

for all $r,t\in\{s_{\ell},\ldots,e_{\ell}-1\}$ where $r<t$. ∎
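As a quick sanity check on the count in the $k=1$ case, the following minimal Python sketch enumerates the competing pairs $(r,t)$. It assumes that only the selected pair $(a_{1},b_{1})$ itself is excluded and that each competing pair contributes two inequalities (one per sign of the comparison), which is the reading consistent with the stated total $2\cdot(C(n-1,2)-1)$. The function names are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations
from math import comb


def cbs_competitor_pairs(n, a, b):
    """Pairs (r, t) with 1 <= r < t <= n - 1 that compete with the selected (a, b)."""
    return [(r, t) for r, t in combinations(range(1, n), 2) if (r, t) != (a, b)]


def cbs_first_step_inequalities(n, a, b):
    """Number of inequalities at step k = 1, assuming two per competing pair."""
    return 2 * len(cbs_competitor_pairs(n, a, b))


if __name__ == "__main__":
    n, a, b = 10, 3, 7  # illustrative values only
    # Enumerated count next to the closed form 2 * (C(n - 1, 2) - 1).
    print(cbs_first_step_inequalities(n, a, b), 2 * (comb(n - 1, 2) - 1))
```

The same enumeration, restricted to $\{s_{k},\ldots,e_{k}-1\}$, gives the count for the winning interval at later steps.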

A.3 Proof of Proposition 4 (Marginalization)

Proof.

For concreteness, we write the proof for the case where $W$ represents additive noise; the proof generalizes easily to the setting where $W$ represents random intervals. First write $T(y_{\mathrm{obs}},v)$ as an integral over the joint density of $W$ and $Y$,

[TABLE]

Then the joint density $f_{W,Y|E_{1},E_{2}}(w,y)$ factors into two components, the latter of which (a probability mass function) can be rewritten using Bayes' rule. For convenience, denote $g(w)=\mathbb{P}(E_{1}|W=w,E_{2})$.

[TABLE]

where we used the independence between $W$ and $E_{2}$ in the last equality. With this, $T(y_{\mathrm{obs}},v)$ from (21) becomes:

[TABLE]

Now, rearranging, we get:

[TABLE]

This proves the first equality in Proposition 4. To show what the weighting factor $a(w)$ equals, observe that by applying Bayes' rule to the numerator of $a(w_{\mathrm{obs}})$ and rearranging:

[TABLE]

Finally, to show the second equality in Proposition 4, observe that we can also represent $a(w)$ as

[TABLE]

by definition, where the denominator is the expectation taken with respect to the random variable $W$. Leveraging the geometric theorems of Lee et al. (2016) and Tibshirani et al. (2016), it can be shown that

[TABLE]

Also, from the same references and as stated in Section 3.3, we know that

[TABLE]

Putting (23), (24) and (25) together into (22), we complete the proof by obtaining

[TABLE]

∎

Appendix B Selected model tests, hit-and-run sampling for known $\sigma^{2}$

The following is the hit-and-run sampler used to estimate the tail probability of the law of (9). This is for the known $\sigma^{2}$ setting, which differs from the setting described in the main text in Section 3.2. The sampler was briefly described in Fithian et al. (2015), but the authors have since implemented it more efficiently, in ways not described in that work. We do not claim novelty for the following algorithm, but simply state it for completeness. The original code can be found in the repository https://github.com/selective-inference, and we reimplemented it to suit our coding framework and simulation setup.

We specialize our description to testing the null hypothesis $H_{0}:v^{T}\theta=0$ against the one-sided alternative $H_{1}:v^{T}\theta>0$. Some notation needs to be clarified before describing the algorithm. Let $v\in\mathbb{R}^{n}$ denote the vector such that

[TABLE]

As in Section 3.2, let $A\in\mathbb{R}^{k\times n}$ denote the matrix such that the last $k$ equations in the above display are satisfied if and only if $AY=Ay_{\mathrm{obs}}$. Based on Section 3.1, observe that our goal reduces to sampling from the $n$-dimensional distribution

[TABLE]

where $I_{n}$ is the $n\times n$ identity matrix.

The first stage of the algorithm removes the nullspace of $A$ in the following sense. Construct any full-rank matrix $B\in\mathbb{R}^{n\times n}$ whose last $k$ rows are equal to $A$. Then, consider the following $n$-dimensional distribution.

[TABLE]

Note that $B^{-1}Y^{\prime}$ has the same law as (26). Observe that the above distribution is a conditional Gaussian, meaning we can remove the last conditioning event. To that end, let $\Gamma^{\prime\prime}$ denote the first $n-k$ columns of the matrix $\Gamma B^{-1}$, and let $u^{\prime\prime}$ denote the product of the last $k$ columns of $\Gamma B^{-1}$ with $Ay_{\mathrm{obs}}$. Also, consider the following partitioning of the matrix $B^{T}B$,

[TABLE]

where $B_{11}$ is an $(n-k)\times(n-k)$ submatrix, $B_{12}$ is an $(n-k)\times k$ submatrix, and $B_{22}$ is a $k\times k$ submatrix. Then, consider the following $(n-k)$-dimensional distribution.

[TABLE]

Note that $Y^{\prime\prime}$ has the same law as the first $n-k$ coordinates of (27).

The next stage of the algorithm whitens the above distribution so that its covariance is the identity. Let $\mu^{\prime\prime}$ and $\Sigma^{\prime\prime}$ denote the mean and covariance of the unconditional form of the above distribution (28). Let $\Theta$ be a matrix such that $\Theta\Sigma^{\prime\prime}\Theta^{T}=I_{n-k}$. Such a matrix must exist since $\Sigma^{\prime\prime}$ is positive definite. Consider the following $(n-k)$-dimensional distribution,

[TABLE]

Note that $\Theta^{-1}Z+\mu^{\prime\prime}$ has the same law as (28). Hence, we have constructed linear mappings $F$ and $G$ between (26) and (29) such that $F(Y)\overset{d}{=}Z$ and $G(Z)\overset{d}{=}Y$.
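The text only requires that some whitening matrix $\Theta$ exists. One common choice, sketched below in plain Python, is the inverse of the Cholesky factor: if $\Sigma^{\prime\prime}=LL^{T}$, then $\Theta=L^{-1}$ whitens $\Sigma^{\prime\prime}$. This is an illustrative assumption and not necessarily the choice made in the authors' implementation; the helper names are hypothetical.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma, for a symmetric positive definite sigma."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                low[i][j] = (sigma[i][j] - partial) / low[j][j]
    return low


def invert_lower(low):
    """Inverse of a lower-triangular matrix, by forward substitution per column."""
    n = len(low)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for row in range(col, n):
            rhs = 1.0 if row == col else 0.0
            partial = sum(low[row][k] * inv[k][col] for k in range(col, row))
            inv[row][col] = (rhs - partial) / low[row][row]
    return inv


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def whitening_matrix(sigma):
    """Theta = L^{-1}, so that Theta sigma Theta^T is the identity."""
    return invert_lower(cholesky(sigma))


if __name__ == "__main__":
    sigma = [[4.0, 2.0, 0.0], [2.0, 3.0, 1.0], [0.0, 1.0, 2.0]]  # toy covariance
    theta = whitening_matrix(sigma)
    theta_t = [list(col) for col in zip(*theta)]
    for row in matmul(matmul(theta, sigma), theta_t):
        print([round(x, 10) for x in row])
```

Any other square root of $\Sigma^{\prime\prime}$ (for example, a symmetric one from an eigendecomposition) would serve equally well, since the sampler only needs the whitened covariance to be the identity.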

To set up a hit-and-run sampler, generate $p$ unit vectors $g_{1},\ldots,g_{p}$. (The choice of $p$ is arbitrary, as is the specific method of generating these $p$ vectors.) Our hit-and-run sampler will move in the linear directions dictated by $g_{1},\ldots,g_{p}$. We are now ready to describe the hit-and-run sampler for known $\sigma^{2}$, which leverages many of the same calculations as in (12) and (13). The similarity arises since $\Pi_{g_{i}}^{\perp}Z=\Pi_{g_{i}}^{\perp}(Z+g_{i})$ by definition of projection.

The computational efficiency of the above algorithm comes from the fact that little multiplication needs to be done with the polyhedron matrix $\Gamma^{\prime\prime}\Theta^{-1}$, a potentially huge matrix. The vectors $U$ and $\rho_{1},\ldots,\rho_{p}$, all of the same length, carry all the information needed about the polyhedron throughout the entire procedure of generating $M$ samples.
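The bookkeeping described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the whitened target is $Z\sim N(0,I)$ restricted to a polyhedron written as $\{z:\Gamma z\geq u\}$ (the sign convention and right-hand side are assumptions here, with `gamma` standing in for $\Gamma^{\prime\prime}\Theta^{-1}$), and it picks one of the $p$ directions uniformly at random at each iteration. The variable `big_u` plays the role of $U$ and `rhos` holds $\rho_{1},\ldots,\rho_{p}$; all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def matvec(mat, vec):
    """Matrix-vector product for a list of rows."""
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]


def random_unit_vector(dim, rng):
    """A uniformly distributed direction on the unit sphere."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


def hit_and_run(gamma, u, z0, num_samples, num_directions, rng):
    """Sample Z ~ N(0, I) restricted to {z : gamma z >= u}, from a feasible z0."""
    directions = [random_unit_vector(len(z0), rng) for _ in range(num_directions)]
    # The only products with the (possibly huge) matrix gamma happen here.
    rhos = [matvec(gamma, g) for g in directions]
    z, big_u = list(z0), matvec(gamma, z0)
    samples = []
    for _ in range(num_samples):
        i = rng.randrange(num_directions)
        g, rho = directions[i], rhos[i]
        c0 = sum(gi * zi for gi, zi in zip(g, z))  # current coordinate along g
        lo, hi = -math.inf, math.inf
        # Each row j requires big_u[j] + (c - c0) * rho[j] >= u[j].
        for u_j, cur_j, rho_j in zip(u, big_u, rho):
            if rho_j > 0:
                lo = max(lo, c0 + (u_j - cur_j) / rho_j)
            elif rho_j < 0:
                hi = min(hi, c0 + (u_j - cur_j) / rho_j)
        # The coordinate along g is N(0, 1), independent of the orthogonal part,
        # so draw it from N(0, 1) truncated to [lo, hi] by inverting the CDF.
        p_lo, p_hi = STD_NORMAL.cdf(lo), STD_NORMAL.cdf(hi)
        p = p_lo + rng.random() * (p_hi - p_lo)
        p = min(max(p, 1e-15), 1.0 - 1e-15)  # crude guard against rounding
        c = min(max(STD_NORMAL.inv_cdf(p), lo), hi)
        step = c - c0
        z = [zi + step * gi for zi, gi in zip(z, g)]
        big_u = [cur_j + step * rho_j for cur_j, rho_j in zip(big_u, rho)]
        samples.append(list(z))
    return samples


if __name__ == "__main__":
    # Toy polyhedron: the box [-1, 1]^2 written as gamma z >= u.
    gamma = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    u = [-1.0, -1.0, -1.0, -1.0]
    draws = hit_and_run(gamma, u, [0.0, 0.0], 1000, 5, random.Random(0))
    print(len(draws), draws[-1])
```

After the $p$ initial products $\rho_{i}=\Gamma g_{i}$, each iteration costs only vector operations whose length is the number of rows of the polyhedron matrix. The inverse-CDF draw is numerically fragile far in the Gaussian tail, so a production sampler would need a more careful truncated-normal routine.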

Appendix C Additional simulation results

C.1 Power comparison using unique detection

The fused lasso appeared to have a large drop in power compared to the segmentation algorithms. In addition to the three measures shown in Section 5, for multiple changepoint problems like middle mutations it is useful to measure performance using an alternative measure of detection called unique detection. This is useful because some algorithms (mainly the fused lasso, but also binary segmentation to some extent, primarily in later steps) admit “clumps” of nearby points. If this clumped detection pattern occurs in early steps, the algorithm requires more steps than others to fully admit the correct changepoints. In this case, detection alone is not an adequate metric, and unique detection can be used in its place.

[TABLE]

In plain words, unique detection measures how many of the true changepoint locations have been approximately recovered.
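The displayed definition is not reproduced in this text, so the following Python sketch is only one plausible reading of the plain-words description: the proportion of true changepoints that have at least one estimated changepoint within a tolerance. The default tolerance of 2 mirrors the $\pm 2$ window used for approximate detection elsewhere in the paper, but both the tolerance and the normalization by the number of true changepoints are assumptions, and the function name is hypothetical.

```python
def unique_detection(true_cps, estimated_cps, tol=2):
    """Proportion of true changepoints with at least one estimate within +/- tol."""
    if not true_cps:
        raise ValueError("need at least one true changepoint")
    hits = sum(
        1
        for truth in true_cps
        if any(abs(est - truth) <= tol for est in estimated_cps)
    )
    return hits / len(true_cps)


if __name__ == "__main__":
    # A clump of three estimates near 100 recovers only one of two true locations.
    print(unique_detection([100, 140], [99, 101, 102]))
```

Under this reading, a clump of estimates near one true location counts once, which is what distinguishes unique detection from counting how many estimates fall near some true changepoint.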

We present a simple case study. In addition to a 2-step fused lasso, imagine using a 3-step fused lasso, but with post-processing. For post-processing, declutter by centroid clustering with a maximum distance of 2, and test the $k_{0}<3$ changepoints, comparing the resulting segment test p-values against $0.05/k_{0}$. A 2-step fused lasso’s detection does not reach 1 even at high signal sizes ($\delta=4$) because of the aforementioned clumped detection behavior. The resulting segment tests are also not powerful, since the segment test contrast vectors consist of left and right segments that do not closely resemble the true underlying piecewise constant segments in the data. However, when detection is replaced with unique detection, two things are noticeable. First, the decluttered lasso’s detection performance is noticeably improved when going from 2 to 3 steps. Second, when unconditional power is calculated using unique detection, binary segmentation does not have as large an advantage over the several variants of the fused lasso. This is shown in the unique-detection power comparison figure: the right panel (compared to the left) shows that a “decluttered” version of the 2- or 3-step fused lasso has unconditional power much closer to that of binary segmentation.
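The post-processing step in this case study can be sketched as below. The text does not spell out the clustering procedure, so this sketch assumes a single pass that merges sorted changepoints whose consecutive gaps are at most the maximum distance and replaces each group by its rounded centroid. The Bonferroni comparison against $0.05/k_{0}$ follows the text. Function names are hypothetical.

```python
def declutter(changepoints, max_dist=2):
    """Merge runs of sorted changepoints whose consecutive gaps are <= max_dist."""
    clusters = []
    for cp in sorted(changepoints):
        if clusters and cp - clusters[-1][-1] <= max_dist:
            clusters[-1].append(cp)
        else:
            clusters.append([cp])
    # Represent each cluster by its centroid, rounded to an integer location.
    return [round(sum(cluster) / len(cluster)) for cluster in clusters]


def bonferroni_rejections(p_values, alpha=0.05):
    """Compare each p-value against alpha / k0, where k0 is the number of tests."""
    k0 = len(p_values)
    return [p <= alpha / k0 for p in p_values]


if __name__ == "__main__":
    # A 3-step output with a clump near 100 collapses to k0 = 2 locations.
    print(declutter([98, 100, 140]))
    print(bonferroni_rejections([0.01, 0.04]))
```

Note that inference after decluttering should use contrasts built from the decluttered locations, as the text describes; this sketch covers only the clustering and the multiplicity correction.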

C.2 Power comparison with different mean shape

The synthetic mean discussed here is piecewise constant with a single upward changepoint, as shown in (31) and the accompanying data figure. This is chosen to be another realistic example of the mutation phenomenon as observed in array CGH datasets from Snijders et al. (2001), in addition to the case shown in the main text. We focus on the duplication mutation scenario, but the results apply similarly to deletions. As before, the sample size $n=200$ was chosen to be on the scale of the data length of a single chromosome in a typical array CGH dataset. An example of this synthetic dataset can be seen in Figure 2. For saturated model tests, WBS no longer outperforms binary segmentation in power. This is expected since there is only a single changepoint, not accompanied by opposing-direction changepoints.

[TABLE]

C.3 Sample splitting (continued)

The results in the sample-splitting comparison of the main text were based on approximate detection: for methods used on the entire dataset of length $n$, we defined a detection event as estimating a location within $\pm 2$ of a true changepoint location. For sample splitting, a detection event was defined as estimating a location within $\pm 1$ of a true changepoint location, based on half the dataset. This choice of approximate detection is somewhat arbitrary, and it is informative to see whether the results would change if we considered only exact detection. We can see from the exact-detection version of this comparison that randomized TG p-values have power comparable to that of sample splitting inferences, among tests concerning exactly the right changepoints.

Appendix D Model size selection using information criteria

Throughout the paper we assume that the number of algorithm steps $k$ is fixed. Hyun et al. (2018) introduce a stopping rule based on information criteria (IC) which can be characterized as a polyhedral selection event. The IC for the sequence of models $M_{1:\ell},\ell=1,\ldots,n-1$ is

[TABLE]

We omit the dependency on $y$ when obvious. We use the BIC complexity penalty $p(M_{k})=\sigma^{2}\cdot k\cdot\log(n)$ in this paper. Also define $S_{\ell}(y)=\mathrm{sign}\left(J(M_{1:\ell})-J(M_{1:(\ell-1)})\right)$ to be the sign of the difference in IC between step $\ell-1$ and $\ell$. This is $+1$ for a rise and $-1$ for a decline. A data-dependent stopping rule $\hat{k}$ is defined as

[TABLE]

which is a local minimization of the IC, defined as the first time $q$ consecutive rises occur. As discussed in Hyun et al. (2018), $q=2$ is a reasonable choice for changepoint detection. To carry out valid selective inference, we condition on the selection event $\mathds{1}[S_{1:(k+q)}(y)=S_{1:(k+q)}(y_{\mathrm{obs}})]$, which is enough to determine $\hat{k}$. A $k$-step model for $k$ chosen by (33) can be understood to be $M_{1:\hat{k}}(Y)=M_{1:k}(y_{\mathrm{obs}})$. The corresponding selection event $P_{M_{1:\hat{k}}}$ is obtained by adding the additional halfspaces, as outlined in Hyun et al. (2018). Simulations in Figure 13 show that inference with IC stopping is valid, in that type-I error is controlled, but comes at the cost of considerable power loss.
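The stopping rule can be illustrated with a short Python sketch. Since the displayed definition of $\hat{k}$ is not reproduced in this text, the sketch assumes the reading suggested by the prose and by the conditioning on $S_{1:(k+q)}$: $\hat{k}$ is the smallest $k$ such that $S_{k+1}=\cdots=S_{k+q}=+1$. It also assumes that the IC sequence starts at the model with no changepoints (index 0). Function names are hypothetical.

```python
import math


def bic_penalty(k, n, sigma2):
    """BIC complexity penalty sigma^2 * k * log(n) for a k-changepoint model."""
    return sigma2 * k * math.log(n)


def ic_signs(ic_values):
    """S_l = sign(J_l - J_{l-1}) for l = 1, ..., len(ic_values) - 1."""
    diffs = [cur - prev for prev, cur in zip(ic_values, ic_values[1:])]
    return [(d > 0) - (d < 0) for d in diffs]


def stopping_step(ic_values, q=2):
    """Smallest k with S_{k+1} = ... = S_{k+q} = +1, or None if there is no such k."""
    signs = ic_signs(ic_values)  # signs[l - 1] holds S_l
    for k in range(len(signs) - q + 1):
        if all(s == 1 for s in signs[k:k + q]):
            return k
    return None


if __name__ == "__main__":
    # Toy IC path J(M_{1:0}), ..., J(M_{1:6}): a single rise at step 3 is ignored.
    ic_path = [10.0, 8.0, 7.0, 7.5, 7.2, 7.6, 7.9]
    print(ic_signs(ic_path))
    print(stopping_step(ic_path, q=2))
```

In the toy path, the isolated rise at step 3 does not trigger stopping; the rule stops only once $q=2$ consecutive rises are seen, which is why the rule is described as a local rather than global minimization of the IC.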

Bibliography

Aue, A. and Horvath, L. (2013). Structural breaks in time series. Journal of Time Series Analysis 34, 1–16.
Bochukova, E. G., Huang, N., Keogh, J., Henning, E., Purmann, C., Blaszczyk, K., Saeed, S., Hamilton-Shield, J., Clayton-Smith, J., O’Rahilly, S., et al. (2010). Large, rare chromosomal deletions associated with severe early-onset obesity. Nature 463, 666.
Boysen, L., Kempe, A., Liebscher, V., Munk, A., and Wittich, O. (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. Annals of Statistics 37, 157–183.
Consortium, I. S. et al. (2008). Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455, 237.
Fanciulli, M., Norsworthy, P. J., Petretto, E., Dong, R., Harper, L., Kamesh, L., Heward, J. M., Gough, S. C., De Smith, A., Blakemore, A. I., et al. (2007). FCGR3B copy number variation is associated with susceptibility to systemic, but not organ-specific, autoimmunity. Nature Genetics 39, 721.
Fithian, W., Sun, D., and Taylor, J. (2014). Optimal inference after model selection. arXiv:1410.2597.
Fithian, W., Taylor, J., Tibshirani, R., and Tibshirani, R. J. (2015). Selective sequential model selection. arXiv:1512.02565.
Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. Annals of Statistics 42, 2243–2281.
