Selective inference after feature selection via multiscale bootstrap

Yoshikazu Terada; Hidetoshi Shimodaira

arXiv:1905.10573·stat.ME·June 2, 2022

TL;DR

This paper introduces a multiscale bootstrap method for selective inference that provides more accurate and less biased p-values after feature selection, applicable to complex algorithms beyond traditional methods like Lasso.

Contribution

It proposes a novel resampling approach using multiscale bootstrap to compute unbiased p-values for feature selection, overcoming limitations of existing methods.

Findings

1. Multiscale bootstrap yields more accurate p-values than classical bootstrap.

2. The method is effective for complex feature selection algorithms like non-convex regularization.

3. Numerical experiments confirm the method's robustness and applicability.

Equations

$$\exists\beta^{\ast}\in\mathbb{R}^{p};\;\xi=X\beta^{\ast},$$

$$\beta^{(M)}$$

$$\beta$$

$$P_{H_{0}}(H_{0}\text{ is rejected}\mid H_{0}\text{ is selected})\leq\alpha,$$

$$
\begin{aligned}
&P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid j\in\hat{M})\\
&\quad=\sum_{M:j\in M}P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid\hat{M}=M)\,\frac{P_{H_{0}}(\hat{M}=M)}{P_{H_{0}}(j\in\hat{M})}\\
&\quad\leq\max_{M:j\in M}P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid\hat{M}=M).
\end{aligned}
$$

$$Y:=f_{n}(\mathcal{X}_{n})\sim N_{m+1}(\mu,I_{m+1}).$$

$$Y^{\ast}=f_{n}(\mathcal{X}_{n^{\prime}}^{\ast})\sim N_{m+1}(y,\sigma^{2}I_{m+1}),\quad\sigma^{2}=\frac{n}{n^{\prime}}.$$

$$\forall\mu\in\partial H;\;P(p(H|Y)<\alpha\mid\mu)\approx\alpha$$

$$p_{\mathrm{BP}}(H|y):=P_{1}(Y^{\ast}\in H|y)$$

$$\alpha_{\sigma^{2}}(H|y):=P_{\sigma^{2}}(Y^{\ast}\in H|y),$$

$$\psi_{\sigma^{2}}(H|y):=\sigma\bar{\Phi}^{-1}(\alpha_{\sigma^{2}}(H|y)),$$

$$\psi_{\sigma^{2}}(H|y)=v(H|y)+\gamma(H|y)\sigma^{2}+O_{p}(n^{-1}),$$

$$p_{\mathrm{AU}}(H|y)$$

$$\forall\mu\in\partial H;\;P(p_{\mathrm{AU}}(H|Y)<\alpha\mid\mu)=\alpha+O(n^{-1}).$$

$$\forall\mu\in\partial H;\;\frac{P(p(H|S,Y)<\alpha\mid\mu)}{P(Y\in S\mid\mu)}\approx\alpha.$$

$$p_{\mathrm{SI}}(H|S,y):=\frac{\bar{\Phi}(\varphi_{H}(-1|\theta_{H}))}{\bar{\Phi}(\varphi_{H}(-1|\theta_{H})+\varphi_{S}(0|\theta_{S}))},$$

$$\forall\mu\in\partial H;\;\frac{P(p_{\mathrm{SI}}(H|S,Y)<\alpha\mid\mu)}{P(Y\in S\mid\mu)}=\alpha+O(n^{-1}).$$

$$Z\sim N_{n}(\xi,\tau^{2}I_{n}),$$

$$Y:=Z/\tau,\quad\mu:=\xi/\tau$$

$$Y:=\tau^{-1}BZ,\quad\mu:=\tau^{-1}B\xi,$$

$$P(p_{j}(Y)<\alpha\mid j\in\hat{M}\text{ and }\hat{s}_{j}=s_{j})=\alpha$$

$$H$$

$$\psi_{\sigma^{2}}(H|y)=v(H|y)=\pm\tau^{-1}a_{j}^{T}z/\|a_{j}\|_{2},$$

$$z_{\sigma}^{\ast}=X\hat{\beta}^{(\mathrm{LS})}+\sigma\hat{\epsilon}^{\ast},$$

$$p_{\mathrm{SI}}(H|S,y)=\frac{\bar{\Phi}(z_{H})}{\bar{\Phi}(z_{H}+z_{S})},$$

$$\forall\mu\in\partial H;\;\frac{P(p_{\mathrm{SI}}(H|S,Y)<\alpha\mid\mu)}{P(Y\in S\mid\mu)}=\alpha+O(\lambda^{2}).$$

$$p_{\mathrm{SI\text{-}BP}}(H|S,y):=\frac{\bar{\Phi}(z_{H})}{\bar{\Phi}(z_{H}+z_{S}^{\prime})}.$$

$$\forall\mu\in\partial H;\;\frac{P(p_{\mathrm{SI\text{-}BP}}(H|S,Y)<\alpha\mid\mu)}{P(Y\in S\mid\mu)}=\alpha+O(\lambda),$$

$$\mathrm{MCP}(\beta_{j}\mid\rho,\gamma)$$

Taxonomy

Topics: Statistical Methods and Inference

Full text

Yoshikazu Terada

Graduate School of Engineering Science, Osaka University

1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, Japan

Hidetoshi Shimodaira

Graduate School of Informatics, Kyoto University

Yoshida Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan

Abstract

It is common to show the confidence intervals or $p$-values of selected features, or predictor variables in regression, but they often involve selection bias. The selective inference approach solves this bias by conditioning on the selection event. Most existing studies of selective inference consider a specific algorithm, such as Lasso, for feature selection, and thus they have difficulties in handling more complicated algorithms. Moreover, existing studies often consider unnecessarily restrictive events, leading to over-conditioning and lower statistical power. Our novel and widely applicable resampling method via multiscale bootstrap addresses these issues to compute an approximately unbiased selective $p$-value for the selected features. As a simplification of the proposed method, we also develop a simpler method via the classical bootstrap. We prove that the $p$-value computed by our multiscale bootstrap method is more accurate than that of the classical bootstrap method. Furthermore, numerical experiments demonstrate that our algorithm works well even for more complicated feature selection methods such as non-convex regularization.

Jointly affiliated with the RIKEN Center for Advanced Intelligence Project (AIP), 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.

1 Introduction

In classical statistical inference, the specification of a hypothesis is assumed to be independent of the obtained data. In recent years, as large and complicated data have become common in various fields, it has become difficult to set hypotheses in advance. Thus, in modern data analysis, we commonly find useful hypotheses from the obtained data using exploratory data analysis, and then perform classical inference for the selected hypotheses. However, the effects of hypothesis selection are often ignored in classical inference, and thus this naive approach does not provide valid statistical inference. Recently, selective (or post-selection) inference, which deals with the hypothesis selection effect appropriately, has drawn considerable attention not only in the statistical community but also in the machine learning community (e.g., Yang et al., 2016; Suzumura et al., 2017; Slim et al., 2019; Lim et al., 2020).

In this paper, we focus on selective inference after feature selection, i.e., predictor variable selection, in regression analysis. The most intuitive and straightforward approach to selective inference was proposed by Cox (1975) and is called data splitting. In data splitting, an i.i.d. sample is divided into two subsamples: one is used for feature selection, and the other is used for inference on the selected features. However, this approach reduces the data available for both feature selection and inference. Fithian, Sun and Taylor (2014) provides the theoretical foundation for considering the optimality of selective inference in the sense of statistical power. Berk et al. (2013) develops, without assuming a specific feature selection method, a valid selective inference for the submodel parameters after feature selection in the regression problem. Importantly, Berk et al. (2013) also introduces both the submodel view and the full-model view of the targets of selective inference after feature selection. Under the setting of Berk et al. (2013), Lee et al. (2016) characterizes the selection event in which a specific model is selected by Lasso (Tibshirani, 1996). More precisely, this selection event is represented as a union of polyhedra in the space of the response variable. Based on this fact, Lee et al. (2016) proposes exact selective inference for feature selection via Lasso. The significance levels conditioned on the selection event are computed from truncated normal distributions, as justified by the polyhedral lemma. Tibshirani et al. (2016) develops a general framework for selective inference after any selection event that can be represented as the response vector falling into a polyhedral set. Tibshirani et al. (2018) proves that this selective inference is asymptotically valid even for non-normal error distributions.

On the other hand, exact selective inference approaches such as Lee et al. (2016) and Tibshirani et al. (2016) assume that the selection event is explicitly represented as a union of polyhedra in the space of the response variable. Although the idea of the polyhedral lemma is, in fact, valid for any selective set beyond a union of polyhedra (Liu, Markovic and Tibshirani, 2018), the existing approaches have computational difficulties in handling more complicated algorithms with non-convex penalties such as MCP (minimax concave penalty; Zhang, 2010) and SCAD (smoothly clipped absolute deviation; Fan and Li, 2001), where the selective sets become more complicated than for the ordinary Lasso. Although the selective inference of Berk et al. (2013) is not limited to specific feature selection methods, its computational cost may be prohibitive when the number of variables exceeds $20$. In addition, it controls type-I errors simultaneously under all submodels, thus leading to very conservative confidence intervals. Moreover, most existing selective inference with the full-model view over-conditions unnecessarily and therefore loses statistical power, because the inference is conditioned on a selected model, whereas it could be minimally conditioned on a selected feature. The selective set of the minimally conditioning event is more complicated and computationally difficult, and valid post-selection inference for it was implemented for the first time only recently by Liu, Markovic and Tibshirani (2018), and only for the ordinary Lasso.

Recently, Terada and Shimodaira (2017) extends the general hypothesis testing framework, called the problem of regions (Efron and Tibshirani, 1998), to selective inference, and proposes a new selective inference approach via the multiscale bootstrap of Shimodaira (2002, 2004, 2008). This approach is not based on the polyhedral lemma, and we can easily compute approximately unbiased selective $p$-values for hypotheses conditioned on complicated selective sets. Moreover, Terada and Shimodaira (2017) provides the theoretical justification for this approach in two asymptotic theories. In this framework, we consider the general setting in which the hypothesis and the selection event are represented as regions in some parameter space. This approach can be widely applied because we do not need to know the shapes of these regions, but only need to prepare functions that tell whether these regions include a realization of the parameter estimate. In fact, Shimodaira and Terada (2019) describes an application of this approach to testing trees and edges in phylogenetics. Moreover, based on the idea described in this paper and in Terada and Shimodaira (2017), Lim et al. (2020) develops a powerful selective inference method after feature selection using the Hilbert-Schmidt Independence Criterion and the Maximum Mean Discrepancy.

In the original form of the multiscale bootstrap method, we change the sample size of the bootstrap samples and then compute a bias-corrected $p$-value using geometric quantities (curvature and signed distance of the region) estimated from the scaling law of the bootstrap probabilities of the hypothesis and the selection event. However, this multiscale bootstrap method cannot be directly applied to selective inference after feature selection, since the shape of the selective region unavoidably depends on the sample size in the feature selection problem. To overcome this difficulty, we propose resampling the residuals with a change of scale. The advantage of our method is that it can be applied to almost any feature selection algorithm. In addition, the computational complexity of our method is of the same order as that of the classical bootstrap method.
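The idea of resampling residuals with a scale change can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a single predictor without intercept so that least squares has a closed form, and it assumes each replicate is formed as fitted values plus $\sigma$ times resampled residuals. The function name `scaled_residual_resample` and the toy data are hypothetical.

```python
import random

def scaled_residual_resample(x, z, sigma, rng):
    """Return one replicate: fitted values + sigma * resampled residuals."""
    # Closed-form least squares for one predictor, no intercept (assumption).
    beta_ls = sum(xi * zi for xi, zi in zip(x, z)) / sum(xi * xi for xi in x)
    fitted = [beta_ls * xi for xi in x]
    residuals = [zi - fi for zi, fi in zip(z, fitted)]
    # Resample residuals with replacement; the sample size n stays fixed,
    # and only the scale sigma of the perturbation changes.
    resampled = rng.choices(residuals, k=len(z))
    return [fi + sigma * ei for fi, ei in zip(fitted, resampled)]

rng = random.Random(0)
x = [0.5, 1.0, 1.5, 2.0, 2.5]   # toy predictor (hypothetical)
z = [0.4, 1.3, 1.2, 2.4, 2.3]   # toy response (hypothetical)
replicates = {s: scaled_residual_resample(x, z, s, rng) for s in (0.5, 1.0, 1.5)}
```

Because every replicate has the same length $n$ as the observed response, a feature selection routine can be applied to each replicate unchanged, whatever the scale.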

This paper is organized as follows. In Section 2, we describe the setting of selective inference after feature selection. In Section 3, we give a brief exposition of multiscale bootstrap and the general selective inference via multiscale bootstrap. In Section 4, we develop a new selective inference algorithm via multiscale bootstrap in regression analysis. In Section 5, the usefulness of our approach is demonstrated through numerical experiments.

2 Selective inference after feature selection

We briefly describe selective inference after feature selection in linear regression; the detailed setting is given in Section 4. Let $Z=(Z_{1},\dots,Z_{n})^{T}$ be the response variable with mean $\xi\in\mathbb{R}^{n}$ and variance $\tau^{2}I_{n}$. Let $x_{1},\dots,x_{p}\in\mathbb{R}^{n}$ be non-random features (i.e., predictor variables) and $X=(x_{1},\dots,x_{p})=(x_{ij})_{n\times p}$. First, we need to clarify the target of statistical inference. If we assume the following first-order correctness:

$$\exists\beta^{\ast}\in\mathbb{R}^{p};\;\xi=X\beta^{\ast},$$

the target of the estimators is clearly the “true” coefficients $\beta^{\ast}$. However, in real data analysis, it is difficult to assume this correctness, as expressed in Box’s famous quote “All models are wrong”. Even under the first-order correctness, the selected models may be wrong. Without the first-order correctness, the target of inference is ambiguous. Berk et al. (2013) clarifies the target of statistical inference in linear regression, and there are two possible types of target:

• Let $M\subseteq\{1,\dots,p\}$ be a specified set of features. We consider a submodel using the features in $M$. The target for the submodel view with respect to $M$ is

$$\beta^{(M)}:=(X_{M}^{T}X_{M})^{-1}X_{M}^{T}\xi,$$

where $X_{M}$ is the predictor matrix consisting of the features of $M$. The true value of $\beta_{j}^{(M)}\;(j\in M)$ in the submodel view depends on $M$.

• The target for the full-model view is $\beta=\beta^{(M)}$ with $M=\{1,\ldots,p\}$. Thus

$$\beta=(X^{T}X)^{-1}X^{T}\xi.$$

The true value of $\beta_{j}$ in the full-model view does not depend on $M$. In this paper, the target for our method is the coefficients in the full-model view, while both views are discussed below.

Now, we describe the basic concepts of selective inference in regression. Let $\alpha$ be the significance level in selective inference. When the null hypothesis $H_{0}$ is selected based on data, we should control the selective type-I error rate:

$$P_{H_{0}}(H_{0}\text{ is rejected}\mid H_{0}\text{ is selected})\leq\alpha,\tag{1}$$

where $P_{H_{0}}$ is a probability distribution under $H_{0}$. The event $\{H_{0}\text{ is selected}\}$ is called the selection event. After feature selection based on data, we obtain the selected model $\hat{M}\subseteq\{1,\dots,p\}$. Then, depending on whether the target is in the submodel view or the full-model view, the null hypothesis for $j\in\hat{M}$ is $H_{0}:\beta_{j}^{(\hat{M})}=0$ or $H_{0}:\beta_{j}=0$, respectively. Here $\beta_{j}^{(\hat{M})}=0$ is a simplified notation for $\beta_{j}^{(M)}=0$ with $\hat{M}=M$. Moreover, we may consider two different types of events, $\{j\in\hat{M}\}$ or $\{\hat{M}=M\}$, as the selection event.

However, the event $\{j\in\hat{M}\}$ is not an appropriate selection event for the submodel-view hypothesis $H_{0}:\beta_{j}^{(\hat{M})}=0$, because this hypothesis depends on the other selected features $\hat{M}\setminus\{j\}$ and thus the probability (1) does not make sense. Therefore, for the submodel-view hypothesis, the event $\{\hat{M}=M\}$ or more restrictive events are appropriate as selection events; the event $\{\hat{M}=M,\mathrm{sign}(\hat{\beta}^{(M)})=s\}$ is sometimes used as the selection event (e.g., Lee et al., 2016) for computational reasons.

On the other hand, we can consider the two different types of conditioning $\{j\in\hat{M}\}$ and $\{\hat{M}=M\}\;(j\in M)$ as a selection event for the full-model-view hypothesis $H_{0}:\beta_{j}=0$. Since both events are appropriate, we may wonder which of them is more desirable. This is answered by the argument of the monotonicity of selective error in Proposition 3 of Fithian, Sun and Taylor (2014). Here we see its adaptation to our setting. Since $\{j\in\hat{M}\}=\bigcup_{M:j\in M}\{\hat{M}=M\}$, we have

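$$
\begin{aligned}
&P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid j\in\hat{M})\\
&\quad=\sum_{M:j\in M}P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid\hat{M}=M)\,\frac{P_{H_{0}}(\hat{M}=M)}{P_{H_{0}}(j\in\hat{M})}\\
&\quad\leq\max_{M:j\in M}P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid\hat{M}=M).
\end{aligned}
$$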

If we control the selective type-I error $P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid\hat{M}=M)$ at level $\alpha$ for all models $M\;(j\in M)$, the selective type-I error $P_{H_{0}}(H_{0}:\beta_{j}=0\text{ is rejected}\mid j\in\hat{M})$ is also automatically controlled at level $\alpha$. Thus, this monotonicity tells us that over-conditioning leads to a loss of information and that, for the hypothesis $H_{0}:\beta_{j}=0$, the minimal selection event $\{j\in\hat{M}\}$ (i.e., minimal conditioning and thus the largest event set) is the most desirable in the sense of statistical power.
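The monotonicity argument is a statement about weighted averages: the error conditioned on $\{j\in\hat{M}\}$ is a mixture of the errors conditioned on each $\{\hat{M}=M\}$ with $j\in M$, so it cannot exceed the largest of them. A small numeric sketch, with made-up probabilities and a hypothetical helper name `marginal_selective_error`:

```python
def marginal_selective_error(cond_error, model_prob):
    """Error conditioned on {j in M_hat}, as a weighted average over models containing j."""
    total = sum(model_prob.values())  # P(j in M_hat)
    return sum(cond_error[m] * model_prob[m] / total for m in model_prob)

# Hypothetical values for the models that contain feature j = 1.
cond_error = {(1,): 0.05, (1, 2): 0.03, (1, 2, 3): 0.04}  # P(reject | M_hat = M)
model_prob = {(1,): 0.20, (1, 2): 0.10, (1, 2, 3): 0.10}  # P(M_hat = M)

marginal = marginal_selective_error(cond_error, model_prob)
```

Here every conditional error is at most 0.05, so the marginal error is as well. The converse does not hold in general: controlling only the marginal error does not control each conditional error, and this slack is what the minimal selection event exploits to gain power.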

3 An overview of multiscale bootstrap

First, we describe the framework of the problem of regions in Section 3.1, followed by the basic idea of multiscale bootstrap for non-selective inference (Shimodaira, 2002, 2004, 2008) in Section 3.2. Then we briefly introduce the general selective inference framework proposed by Terada and Shimodaira (2017) in Section 3.3.

3.1 The problem of regions

The general statistical inference framework, in which the hypothesis is represented by a general region in some parameter space, is called the problem of regions (Efron and Tibshirani, 1998). This framework is an abstraction of many applications, e.g., phylogenetic inference, in which a confidence level is assigned for each clade of the estimated phylogenetic tree (Felsenstein, 1985; Efron, Halloran and Holmes, 1996).

Let $\mathcal{X}_{n}=(X_{1},\dots,X_{n})$ be data of sample size $n$. In the problem of regions, it is assumed that there exists a transform $f_{n}$ of $\mathcal{X}_{n}$ such that the transformed data follows the $(m+1)$-dimensional Gaussian distribution with unknown mean parameter $\mu\in\mathbb{R}^{m+1}$ and identity covariance matrix $I_{m+1}$:

$$Y:=f_{n}(\mathcal{X}_{n})\sim N_{m+1}(\mu,I_{m+1}).$$

Typically, $f_{n}$ involves multiplying a form of sample average by the factor $\sqrt{n}$ so that the covariance matrix of $Y$ is properly rescaled. Here, the $(m+1)$-dimensional space of $Y$ will be referred to as the model space in this paper. Figure 1 illustrates the model space. In addition, let $y\in\mathbb{R}^{m+1}$ be an observed value of $Y$, and suppose that a bootstrap sample $\mathcal{X}_{n^{\prime}}^{\ast}=(X_{1}^{\ast},\dots,X_{n^{\prime}}^{\ast})$ with sample size $n^{\prime}$ is represented as a realization of the following Gaussian distribution in the model space:

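$$Y^{\ast}=f_{n}(\mathcal{X}_{n^{\prime}}^{\ast})\sim N_{m+1}(y,\sigma^{2}I_{m+1}),\quad\sigma^{2}=\frac{n}{n^{\prime}}.$$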

We will denote by $P_{\sigma^{2}}(\cdot|y)$ the probability measure of the bootstrap sample $Y^{\ast}$ with scale $\sigma>0$. This framework is a simplification of reality and is justified by the central limit theorem in many situations.
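The scale relation $\sigma^{2}=n/n^{\prime}$ can be checked numerically in the simplest case. The sketch below assumes $f_{n}$ is $\sqrt{n}$ times the sample mean of scalar data; the helper name `bootstrap_scaled_mean` and the simulated data are illustrative only.

```python
import random
import statistics

def bootstrap_scaled_mean(data, n_prime, rng):
    """f_n applied to a bootstrap sample of size n_prime: sqrt(n) * resample mean."""
    n = len(data)
    resample = rng.choices(data, k=n_prime)  # sampling with replacement
    return (n ** 0.5) * statistics.fmean(resample)

rng = random.Random(0)
n, n_prime = 50, 25
data = [rng.gauss(0.0, 1.0) for _ in range(n)]
replicates = [bootstrap_scaled_mean(data, n_prime, rng) for _ in range(4000)]

# Variance of the replicates relative to the plug-in variance of the data.
# Under the scaling relation this ratio should be close to n / n_prime = 2.
variance_ratio = statistics.pvariance(replicates) / statistics.pvariance(data)
```

Halving the bootstrap sample size doubles the variance of the replicate in the model space, which is the mechanism that lets multiscale bootstrap probe the region at several scales.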

Let $H\subset\mathbb{R}^{m+1}$ be a general region and let us consider $H_{0}:\mu\in H$ as a null hypothesis. It is assumed that the region $H$, in a neighbourhood of the model space, is locally represented as $H=\{(u,v)\mid v\leq-h(u),\;u\in\mathbb{R}^{m},v\in\mathbb{R}\}$ using some continuous function $h:\mathbb{R}^{m}\rightarrow\mathbb{R}$. Let $\partial H:=\{(u,v)\mid v=-h(u),\;u\in\mathbb{R}^{m}\}$ be the boundary surface of the region $H$. In this setting, our main goal is to compute an approximately unbiased $p$-value $p(H|y)$ for the null hypothesis $H_{0}:\mu\in H$ against the alternative hypothesis $H_{1}:\mu\not\in H$. The approximately unbiased $p$-value should satisfy

$$\forall\mu\in\partial H;\;P(p(H|Y)<\alpha\mid\mu)\approx\alpha\tag{2}$$

for a given significance level $\alpha>0$. In other words, the $p$-value is approximately distributed as uniform over $(0,1)$, i.e., $p(H|Y)\sim U(0,1)$ when $\mu\in\partial H$. The difference between $P(p(H|Y)<\alpha\mid\mu)$ and $\alpha$ in (2) is called bias (or error). The bootstrap probability

$$p_{\mathrm{BP}}(H|y):=P_{1}(Y^{\ast}\in H|y)$$

is considered the simplest $p$-value satisfying (2); see Efron, Halloran and Holmes (1996); Efron and Tibshirani (1998). More formally, in the classical large sample theory, if the region $H$ has a smooth boundary surface, the bootstrap probability $p_{\mathrm{BP}}(H|y)$ has first-order accuracy: $\forall\mu\in\partial H;\;P(p_{\mathrm{BP}}(H|Y)<\alpha\mid\mu)=\alpha+O(n^{-1/2}).$ However, in many practical situations, the bootstrap probability $p_{\mathrm{BP}}$ has a non-negligible bias.
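A bootstrap probability only needs a membership test for the region, not its shape. The following sketch estimates $P_{\sigma^{2}}(Y^{\ast}\in H\mid y)$ by Monte Carlo directly in the model space, assuming the Gaussian model $Y^{\ast}\sim N(y,\sigma^{2}I)$; setting $\sigma^{2}=1$ gives the ordinary bootstrap probability. The half-space region and the names `bootstrap_probability` and `in_half_space` are toy choices for illustration.

```python
import random

def bootstrap_probability(in_region, y, sigma2, n_boot, rng):
    """Monte Carlo estimate of P(Y* in H | y) with Y* ~ N(y, sigma2 * I)."""
    sigma = sigma2 ** 0.5
    hits = 0
    for _ in range(n_boot):
        y_star = [yi + sigma * rng.gauss(0.0, 1.0) for yi in y]
        hits += in_region(y_star)  # True counts as 1
    return hits / n_boot

def in_half_space(point):
    """Toy region with a flat boundary: H = {(u, v) : v <= 0}; last coordinate is v."""
    return point[-1] <= 0.0

rng = random.Random(0)
# Data point at distance 1 from the boundary; the exact value is Phi(-1), about 0.159.
p_bp = bootstrap_probability(in_half_space, [0.3, 1.0], 1.0, 20000, rng)
```

For a flat boundary the bootstrap probability is already exact. The bias discussed above appears once the boundary is curved, which is what the scale dependence is used to detect.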

3.2 Basic idea of multiscale bootstrap

To obtain more accurate $p$-values, geometric quantities in the model space, such as distance and curvature, play a key role. In fact, Efron and Tibshirani (1998) shows that we can compute a more accurate $p$-value using the signed distance $v(H|y)$ from the data point $y$ to the region $H$. More precisely, the $p$-value $p_{\mathrm{sign}}(H|y):=P_{1}(v(H|Y^{\ast})\geq v(H|y)\mid\hat{\mu}(y))$ is proposed, where $\hat{\mu}(y)\in\partial H$ is the projection point of $y$ onto $\partial H$. This $p$-value $p_{\mathrm{sign}}(H|y)$ has third-order accuracy (Efron, 1985; Efron and Tibshirani, 1998).

However, in most practical situations, it is difficult to access the model space and to obtain an explicit formula for the hypothesis region in the model space. Thus, we cannot compute the signed distance $v(H|y)$ in general. To overcome this difficulty, Shimodaira (2002, 2004, 2008) proposes a new bootstrap method, called multiscale bootstrap. In multiscale bootstrap, geometric quantities such as the signed distance $v(H|y)$ and the mean curvature of $\partial H$ are estimated from the scaling law of the bootstrap probabilities, and an accurate $p$-value is computed from these estimated quantities.

We consider the bootstrap probability with scale $\sigma>0$

$$\alpha_{\sigma^{2}}(H|y):=P_{\sigma^{2}}(Y^{\ast}\in H|y),$$

which reduces to $p_{\mathrm{BP}}(H|y)$ when $\sigma=1$. When we change $n^{\prime}$ in the data space, and in effect change $\sigma$ in the model space, the bootstrap probability changes. This change is simply expressed in terms of the normalized bootstrap $z$-value defined as

$$\psi_{\sigma^{2}}(H|y):=\sigma\bar{\Phi}^{-1}(\alpha_{\sigma^{2}}(H|y)),$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution and $\bar{\Phi}(x)=1-\Phi(x)$, i.e., $\bar{\Phi}^{-1}(\alpha)$ is the upper $\alpha$ quantile. Shimodaira (2002, 2004) shows the following scaling law of the bootstrap probabilities:

$$\psi_{\sigma^{2}}(H|y)=v(H|y)+\gamma(H|y)\sigma^{2}+O_{p}(n^{-1}),$$

where $\gamma(H|y)$ is the mean curvature of the boundary surface $\partial H$ at $\hat{\mu}$. This scaling law can be modelled as the simple linear regression $\theta_{H,0}+\theta_{H,1}\sigma^{2}$ with $\sigma^{2}$ as the predictor. We will denote by $\varphi_{H}(\sigma^{2}|\theta_{H})$ the model for the normalized bootstrap $z$-value, such as $\theta_{H,0}+\theta_{H,1}\sigma^{2}$ with parameter $\theta_{H}=(\theta_{H,0},\theta_{H,1})$. The bootstrap probabilities at several scales $\{\sigma_{j}\}$ can be computed by using bootstrap samples with different sample sizes, say $n^{\prime}_{j}\in\{\lceil 0.5n\rceil,\cdots,\lceil 1.0n\rceil,\cdots,\lceil 1.5n\rceil\}$. Let $B$ be the number of bootstrap replicates, and let $C_{H}=\#\{Y^{\ast}\in H\}$ be the number of replicates with $Y^{\ast}\in H$. Let $\hat{\psi}_{\sigma_{j}^{2}}(H|y)$ be the normalized bootstrap $z$-value estimated from the bootstrap probability estimate $\hat{\alpha}_{\sigma^{2}}(H|y)=C_{H}/B$. We can estimate the values of $v(H|y)=\theta_{H,0}$ and $\gamma(H|y)=\theta_{H,1}$ by simple regression on the observed pairs $\{(\sigma_{j}^{2},\hat{\psi}_{\sigma_{j}^{2}}(H|y))\}$.
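The regression step can be sketched as follows. This is a simplified illustration: it uses unweighted least squares, whereas a practical implementation would likely weight each scale by the sampling variance of its estimated $z$-value. The bootstrap probabilities are synthetic, generated from an exactly linear scaling law with hypothetical values $v=1.2$ and $\gamma=0.3$; in practice they would be the frequencies $C_{H}/B$. The function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def normalized_z(alpha, sigma2):
    """psi = sigma * Phi_bar^{-1}(alpha), where Phi_bar^{-1}(a) = Phi^{-1}(1 - a)."""
    return (sigma2 ** 0.5) * STD_NORMAL.inv_cdf(1.0 - alpha)

def fit_scaling_law(sigma2s, alphas):
    """Unweighted least-squares fit of psi on sigma^2; returns (theta0, theta1)."""
    psis = [normalized_z(a, s2) for a, s2 in zip(alphas, sigma2s)]
    mean_x = sum(sigma2s) / len(sigma2s)
    mean_y = sum(psis) / len(psis)
    sxx = sum((x - mean_x) ** 2 for x in sigma2s)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sigma2s, psis))
    theta1 = sxy / sxx
    return mean_y - theta1 * mean_x, theta1

# Scales sigma^2 = n / n' for bootstrap sample sizes between 0.5n and 1.5n.
n = 100
sigma2s = [n / n_prime for n_prime in (50, 75, 100, 125, 150)]
# Synthetic bootstrap probabilities from psi = 1.2 + 0.3 * sigma^2 (hypothetical values).
alphas = [1.0 - STD_NORMAL.cdf((1.2 + 0.3 * s2) / s2 ** 0.5) for s2 in sigma2s]

theta0, theta1 = fit_scaling_law(sigma2s, alphas)  # estimates of v and gamma
```

With noise-free inputs the fit recovers the intercept and slope exactly. With Monte Carlo frequencies, probabilities of exactly 0 or 1 must be excluded because the inverse normal CDF is undefined there.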

Shimodaira (2002) proposes the following $p$-value:

$$p_{\mathrm{AU}}(H|y):=\bar{\Phi}(\varphi_{H}(-1|\theta_{H})).$$

This $p$-value $p_{\mathrm{AU}}(H|y)$ has second-order accuracy (Shimodaira, 2004; Efron and Tibshirani, 1998):

$$\forall\mu\in\partial H;\;P(p_{\mathrm{AU}}(H|Y)<\alpha\mid\mu)=\alpha+O(n^{-1}).$$

It becomes third-order accurate, with an error of only $O(n^{-3/2})$, when $\varphi_{H}(\sigma^{2}|\theta_{H})$ is properly estimated from observed values of $\psi_{\sigma^{2}}(H|y)$ including terms of order $O_{p}(n^{-1})$.
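Given a fitted linear model, the two $p$-values differ only in the scale at which the model is evaluated. The sketch below assumes the usual multiscale bootstrap construction, in which the approximately unbiased $p$-value is the upper tail probability of the model extrapolated to $\sigma^{2}=-1$, while the bootstrap probability corresponds to $\sigma^{2}=1$. The parameter values are hypothetical.

```python
from statistics import NormalDist

def au_p_value(theta0, theta1):
    """Upper tail probability at the linear model extrapolated to sigma^2 = -1."""
    return 1.0 - NormalDist().cdf(theta0 - theta1)

def bp_p_value(theta0, theta1):
    """Bootstrap probability implied by the same model at sigma^2 = 1."""
    return 1.0 - NormalDist().cdf(theta0 + theta1)

# Hypothetical fitted values: signed distance 1.2, curvature term 0.3.
p_au = au_p_value(1.2, 0.3)  # upper tail at 0.9
p_bp = bp_p_value(1.2, 0.3)  # upper tail at 1.5
```

With positive curvature the bootstrap probability is smaller than the extrapolated value, so using it as a $p$-value would reject too often. With zero curvature the two coincide.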

In the classical large sample theory, the shape of $H$ in the model space is magnified by $\sqrt{n}$, and thus the key property is that the smooth boundary surface $\partial H$ approaches a flat surface in a neighborhood of any point on $\partial H$. In contrast, for non-smooth surfaces, this key property is not satisfied. For example, if the region $H$ is cone-shaped, the shape of $H$ is scale-invariant in a neighborhood of the vertex of $H$. It is well known that there is no unbiased test for a hypothesis region with a non-smooth boundary (Lehmann, 1952). To deal with general regions with non-smooth boundary surfaces, Shimodaira (2008) develops a new theoretical framework, called the asymptotic theory of nearly flat surfaces. In this framework, we require that the magnitude of boundary surfaces is small. Thus, this framework works well even for non-smooth boundary surfaces when the magnitude of boundary surfaces is not so large, at least locally. We can interpret that the given surface is on the way to approaching the flat surface in this framework. This idea is similar to the one behind the local alternative framework (Lehmann, 1999), but the rescaling is applied only to the direction normal to the boundary surface while the scale is fixed for the other directions. A brief introduction of this theory is provided in Appendix A.

3.3 General selective inference via multiscale bootstrap

Here, we describe an extended framework of the problem of regions for selective inference. In the model space, two regions $H=\{(u,v)\mid v\leq-h(u),\;u\in\mathbb{R}^{m},v\in\mathbb{R}\}$ and $S=\{(u,v)\mid v>-s(u),\;u\in\mathbb{R}^{m},v\in\mathbb{R}\}$ are considered. Suppose that the selection event is represented as $\{y\in S\}$, and we consider the selective inference in which the null hypothesis $H_{0}:\mu\in H$ is selected if and only if $y\in S$. In this setting, for a given significance level $\alpha$, we want to compute selective $p$-values $p(H|S,y)$ satisfying

$$\forall\mu\in\partial H;\;\frac{P(p(H|S,Y)<\alpha\mid\mu)}{P(Y\in S\mid\mu)}\approx\alpha.$$

In other words, $p(H|S,Y)\sim U(0,1)$ conditioned on $Y\in S$ when $\mu\in\partial H$. Terada and Shimodaira (2017) proposes the following approximately unbiased selective $p$-value $p_{\mathrm{SI}}(H|S,y)$ for regions $H$ and $S$ with smooth boundary surfaces:

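$$p_{\mathrm{SI}}(H|S,y):=\frac{\bar{\Phi}(\varphi_{H}(-1|\theta_{H}))}{\bar{\Phi}(\varphi_{H}(-1|\theta_{H})+\varphi_{S}(0|\theta_{S}))},$$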

where $\varphi_{S}$ is the model for the normalized bootstrap $z$-value related to the selective region $S$, and $\theta_{S}$ is the parameter of the model $\varphi_{S}$.
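The selective $p$-value is the ratio $\bar{\Phi}(\varphi_{H}(-1|\theta_{H}))\,/\,\bar{\Phi}(\varphi_{H}(-1|\theta_{H})+\varphi_{S}(0|\theta_{S}))$, which can be evaluated directly once both models are fitted. The sketch below assumes both $\varphi_{H}$ and $\varphi_{S}$ are linear in $\sigma^{2}$; the parameter values and function names are hypothetical.

```python
from statistics import NormalDist

def upper_tail(x):
    """Phi_bar(x) = 1 - Phi(x)."""
    return 1.0 - NormalDist().cdf(x)

def selective_p_value(theta_h, theta_s):
    """p_SI under linear models phi(sigma^2 | theta) = theta[0] + theta[1] * sigma^2."""
    z_h = theta_h[0] - theta_h[1]  # phi_H evaluated at sigma^2 = -1
    z_s = theta_s[0]               # phi_S evaluated at sigma^2 = 0
    return upper_tail(z_h) / upper_tail(z_h + z_s)

# Hypothetical fitted parameters for the hypothesis and selection regions.
p_si = selective_p_value((1.2, 0.3), (-0.5, 0.1))
p_au = upper_tail(1.2 - 0.3)  # non-selective counterpart for comparison
```

Because the denominator is a probability, the selective $p$-value is never smaller than its non-selective counterpart. When the selection event is almost certain (a very negative value of $\varphi_{S}(0|\theta_{S})$), the denominator approaches one and the two agree.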

Theorem 1.

(Theorem 4.3 in Terada and Shimodaira (2017)) The boundary surfaces $\partial H$ and $\partial S$ are assumed to be sufficiently smooth and nearly parallel in the sense that the first derivatives of $h$ and $s$ differ only by $O(n^{-1})$. Then, the selective $p$-value $p_{\mathrm{SI}}(H|S,y)$ has second-order accuracy:

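$$\forall\mu\in\partial H;\;\frac{P(p_{\mathrm{SI}}(H|S,Y)<\alpha\mid\mu)}{P(Y\in S\mid\mu)}=\alpha+O(n^{-1}).$$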

The detailed calculation of $p_{\mathrm{SI}}(H|S,y)$ is provided as Algorithm 1.

In Terada and Shimodaira (2017), the selective $p$-value for regions with non-smooth boundary surfaces is also proposed, and the theoretical justification of this $p$-value is provided using the asymptotic theory of nearly flat surfaces. For more details about the case in which the regions $H$ and $S$ have possibly non-smooth boundary surfaces, see Appendix A. Since the boundary surface $\partial S$ of the selection region is generally not smooth in the selective inference after feature selection, the theory of nearly flat surfaces is used to derive the properties of the proposed method described in Section 4.3. Roughly speaking, both theories essentially assume that the boundary surfaces $\partial H$ and $\partial S$ are more or less flat and parallel to each other, at least locally around the data point. In analogy with the local alternative framework, we consider that the given surfaces are on the way to approaching mutually parallel flat surfaces.

4 Selective inference after feature selection via multiscale bootstrap

4.1 Model space for regression analysis

Here, we describe the setting of the selective inference after feature selection in regression analysis. We employ the general assumption used in Berk et al. (2013), Lee et al. (2016), and Tibshirani et al. (2016). Consider the response variable $Z=(Z_{1},\dots,Z_{n})$ drawn from the multivariate Gaussian distribution:

$$Z\sim N_{n}(\xi,\tau^{2}I_{n}),$$

where $\xi\in\mathbb{R}^{n}$ is an unknown parameter, $I_{n}$ is the $n$-dimensional identity matrix, and $\tau^{2}$ is assumed to be known. We will denote by $z\in\mathbb{R}^{n}$ the observed value of $Z$. Let $X=(x_{1},\dots,x_{p})=(x_{ij})_{n\times p}$ be a non-random full rank matrix whose columns represent the features. Note that the error variance $\tau^{2}$ can be estimated if $\xi$ is modeled as a function of features $x_{1},\dots,x_{p}\in\mathbb{R}^{n}$. Assuming a specific feature selection method, such as Lasso or MCP, is applied to $(X,z)\in\mathbb{R}^{n\times p}\times\mathbb{R}^{n}$, let $\hat{M}\subseteq\{1,\dots,p\}$ be the set of selected features, and $\hat{s}_{j}\in\{+,-\}$ be the sign of the estimated coefficient $\hat{\beta}_{j}$ of the feature $j\in\hat{M}$.

First, note that the general selective inference approach described as Algorithm 1 cannot be directly applied to regression analysis. Since we need to change the sample size $n^{\prime}$ of bootstrap samples in the usual multiscale bootstrap, it is assumed that the hypothesis and selective regions can be represented as specific regions in the model space that are independent of $n^{\prime}$. For the selective inference in regression analysis, however, the shape of the selective region inevitably depends on $n^{\prime}$ because it is the dimensionality of the model space, as explained below.

Recall that $Z\sim N_{n}(\xi,\tau^{2}I_{n})$ and that the selection event can be represented as a region in the space of $Z$. Then the normalized space of

$$Y:=Z/\tau,\quad\mu:=\xi/\tau\tag{3}$$

can be considered as the model space described in Section 3 with $m+1=n$. Thus, the selective region for multiscale bootstrap inevitably depends on $n^{\prime}$. Another choice of model space is given by

[TABLE]

where $B=(X^{T}X)^{-1/2}X^{T}\in\mathbb{R}^{p\times n}$ . The selective region represented in this model space also depends on $n^{\prime}$ because feature selection algorithms take account of sample size. Although the latter model space is preferable for the asymptotic theory because it has the fixed dimensionality $m+1=p$ , we use the former model (3) below for easy illustration.

4.2 Appropriate selection event

Recently, Liu, Markovic and Tibshirani (2018) suggested the use of the selection event $\{j\in\hat{M}\}$ for a specified $j\in\{1,\ldots,p\}$, which increases the statistical power and thus leads to shorter confidence intervals. This is explained by the monotonicity of the selective error provided in Fithian, Sun and Taylor (2014), as mentioned in Section 2. Conditioning on the event $\{\hat{M}=M\}$ for a specified $M\subset\{1,\ldots,p\}$ is over-conditioning and reduces the statistical power, because the other features $M\setminus\{j\}$ are not relevant for the null hypothesis $H_{0}:\beta_{j}=0$ of the feature $j$ in the full-model view.

Here, we actually consider testing feature $j$ with its sign. More precisely, whenever the feature $j$ is selected and $\hat{\beta}_{j}>0$ (or $<0$ ), the hypothesis $H_{0}:\beta_{j}\leq 0$ (or $\geq 0$ ) is tested. The minimal selection event is then $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}$ where $s_{j}\in\{+,-\}$ . Hence, the main goal of our selective inference is to compute the unbiased selective $p$ -value $p_{j}(y)$ , which satisfies

[TABLE]

for any $\mu$ with $\beta_{j}=0$ .

4.3 Computing selective $p$ -values by multiscale bootstrap

In this section, we describe our proposed method. We develop a new algorithm to compute the approximately unbiased selective $p$-value, which approximately satisfies the unbiasedness condition stated at the end of Section 4.2. We will update the computation of $\psi_{\sigma^{2}}(H|y)$ and $\psi_{\sigma^{2}}(S|y)$ in Algorithm 1 to obtain Algorithm 2.

For the feature selection via Lasso, the selection event $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}\;(j\in\{1,\ldots,p\},\,s_{j}\in\{+,-\})$ can be represented as a union of polyhedra in the $n$ -dimensional space of the response variable (Lee et al., 2016). The left panel of Figure 2 shows the relationship between the selected model by Lasso and the corresponding region in the response vector space when $n=2$ . In contrast, for more complicated feature selection methods such as MCP and SCAD, the region $S$ of the selective event $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}$ will become complicated, and the explicit shape of the selective region $S$ may not be obtained easily. The right panel of Figure 2 shows the relationship between the selected model by MCP and the corresponding region in the response vector space when $n=2$ . We had to numerically evaluate which features are selected at each point since no explicit representation of the selection event is available. Although Lee et al. (2016) and Tibshirani et al. (2016) consider exact selective inference for Lasso, it is difficult to consider exact selective inference for these complicated feature selection methods.

As shown in Figure 2, the selective region $S$, which represents the selection event $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}$, can be complicated and generally has non-smooth boundary surfaces. In contrast, for $\eta\in\mathbb{R}$, the hypotheses $H_{0}:\beta_{j}\lesseqgtr\eta$, namely the two cases of $\beta_{j}\leq\eta$ and $\beta_{j}\geq\eta$, can be represented as the following regions in the space of $Y$:

[TABLE]

Since the hypothesis region $H$ has a flat boundary surface with mean curvature $\gamma(H|y)=0$ , we can easily obtain the expression of $\psi_{\sigma^{2}}(H|y)$ without multiscale bootstrap. In particular, for $H_{0}:\beta_{j}\lesseqgtr 0$ ,

[TABLE]

where $v(H|y)$ is the signed distance from $y$ to the hypothesis region $H$ . Thus, once we obtain the expression of $\psi_{\sigma^{2}}(S|y)$ , we can compute the selective $p$ -value $p_{\mathrm{SI}}(H|S,y)$ .
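As a concrete illustration of the signed distance $v(H|y)$, the following minimal sketch assumes that the model space is the rescaled response $y=z/\tau$ and that the full-model target is $\beta=(X^{T}X)^{-1}X^{T}\xi$; both are assumptions made here for illustration, and the function name is hypothetical. Under these assumptions the boundary $\beta_{j}=0$ is the hyperplane $a^{T}y=0$ with $a=X(X^{T}X)^{-1}e_{j}$, so the signed distance is $a^{T}y/\|a\|$, which coincides with the usual $z$-statistic of the full-model least-squares fit.

```python
import numpy as np

def signed_distance_to_hypothesis(X, z, j, tau=1.0):
    """Signed distance from y = z / tau to the hyperplane {beta_j = 0}.

    ASSUMPTION: beta = (X^T X)^{-1} X^T xi, so beta_j = a^T xi with
    a = X (X^T X)^{-1} e_j. A positive value means that y lies on the
    beta_j > 0 side of the boundary.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(z, dtype=float) / tau
    e_j = np.zeros(X.shape[1])
    e_j[j] = 1.0
    a = X @ np.linalg.solve(X.T @ X, e_j)   # normal vector of the boundary
    return float(a @ y / np.linalg.norm(a))

# Toy data: feature 0 has a non-zero coefficient.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
z = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0]) + rng.standard_normal(50)
v = signed_distance_to_hypothesis(X, z, j=0)
```

No resampling is involved in this quantity, which is why only the selective region requires the bootstrap.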

Next, we consider how to obtain the expression of $\psi_{\sigma^{2}}(S|y)$ for the selective region $S$ by multiscale bootstrap; the bootstrap probability of $S$ is computed at several scales via an adaptation of Step 2 of Algorithm 1. In the usual setting of multiscale bootstrap, changing $n^{\prime}$ in the data space corresponds to changing $\sigma$ in the model space. Thus, based on bootstrap probabilities with several $n^{\prime}$ , we can estimate the expression of $\psi_{\sigma^{2}}(S|y)$ . However, as described in Section 4.1, this framework is not applicable here. Fortunately, since we can access the model space, we directly change the scale $\sigma$ in the model space and compute the bootstrap probabilities with several scales. With the normality of response $Z$ , the parametric bootstrap method, i.e., sampling directly from $N_{n}(y,\sigma^{2}I_{n})$ , could be applied to the computation of bootstrap probabilities $\alpha_{\sigma^{2}}(S|y)$ at several scales $\sigma>0$ . To relax the Gaussian assumption, here we consider the resampling of residuals with scale change. More formally, we resample the scaled residuals to compute $\alpha_{\sigma^{2}}(S|y)$ at several $\sigma>0$ as follows. Let $\hat{\beta}^{(\mathrm{LS})}=(X^{T}X)^{-1}X^{T}z$ be the least-squares estimator based on the full-model. Write $\hat{e}:=z-X\hat{\beta}^{(\mathrm{LS})}$ and $(h_{1},\dots,h_{n}):=\mathrm{diag}(X(X^{T}X)^{-1}X^{T})$ . Then, the adjusted residuals $\hat{\epsilon}=(\hat{\epsilon}_{1},\dots,\hat{\epsilon}_{n})^{T}$ are defined as $\hat{\epsilon}_{i}=\hat{e}_{i}/\sqrt{1-h_{i}}$ . To compute the bootstrap probability $\alpha_{\sigma^{2}}(S|y)$ at $\sigma>0$ , we use the following bootstrap sample:

[TABLE]

where $\hat{\epsilon}^{\ast}=(\hat{\epsilon}_{1}^{\ast},\dots,\hat{\epsilon}_{n}^{\ast})^{T}$ is a bootstrap sample of size $n$ from $(\hat{\epsilon}_{1},\dots,\hat{\epsilon}_{n})$. For each $\sigma>0$, we generate $z_{\sigma}^{\ast}$ $B$ times (here $B$ denotes the number of bootstrap replicates, not the matrix of Section 4.1) and apply the feature selection procedure to each replicate. The bootstrap probability is then computed as $\alpha_{\sigma^{2}}(S|y)=C_{S}/B$, where $C_{S}$ is the number of replicates in which the selective event $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}$ occurs.
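The resampling step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the bootstrap response is assumed to be the full-model fitted values plus $\sigma$ times the resampled adjusted residuals, and `select` and `threshold_selector` are hypothetical helpers (the toy selector is neither Lasso nor MCP; any selector returning the signs of the selected features could be plugged in).

```python
import numpy as np

def bootstrap_probability(X, z, select, j, sign, sigma2, B=1000, seed=0):
    """Estimate alpha_{sigma^2}(S|y) = C_S / B by resampling scaled residuals.

    `select(X, z)` is a user-supplied (hypothetical) selector returning a
    dict {feature index: +1 or -1} for the selected features.
    ASSUMPTION: z* = X beta_hat_LS + sigma * (resampled adjusted residuals).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)
    fitted = X @ (XtX_inv_Xt @ z)                    # X beta_hat^(LS)
    leverage = np.einsum("ij,ji->i", X, XtX_inv_Xt)  # diagonal of the hat matrix
    adj_resid = (z - fitted) / np.sqrt(1.0 - leverage)
    sigma = np.sqrt(sigma2)
    count = 0                                        # C_S
    for _ in range(B):
        eps_star = rng.choice(adj_resid, size=n, replace=True)
        z_star = fitted + sigma * eps_star
        if select(X, z_star).get(j) == sign:
            count += 1
    return count / B

def threshold_selector(X, z, cutoff=0.5):
    """Toy selector: keep features whose |least-squares coefficient| > cutoff."""
    beta = np.linalg.solve(X.T @ X, X.T @ z)
    return {k: int(np.sign(b)) for k, b in enumerate(beta) if abs(b) > cutoff}

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
z = X @ np.array([2.0, 0.0, 0.0, 0.0, 0.0]) + rng.standard_normal(50)
alphas = {s2: bootstrap_probability(X, z, threshold_selector, j=0, sign=1,
                                    sigma2=s2, B=200)
          for s2 in (0.5, 1.0, 1.5)}
```

The selector is called once per replicate and per scale, so in practice the replicates at a given scale would be reused for every selected feature rather than regenerated.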

With the updated computation of $\psi_{\sigma^{2}}(H|y)$ and $\psi_{\sigma^{2}}(S|y)$ for regression analysis, we may use Step 3 and Step 4 of Algorithm 1 to compute the selective $p$-value. This becomes Algorithm 2 for computing an approximately unbiased selective $p$-value for the selected feature $j\in\hat{M}$. It is worth noting that Algorithm 2 can be applied to almost any feature selection method, including MCP and SCAD, in addition to Lasso. Note also that the multiscale bootstrap is not very sensitive to the choice of the scales. For $B$, several thousand replications are enough in practice. The computational cost is of the same order as that of the classical bootstrap method. Also note that, in an actual implementation of Algorithm 2, Steps 1 to 3 are shared by all the features $j\in\hat{M}$. Thus, this algorithm works even for large $p$ such as $p>20$.
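The remaining steps can be sketched as below. The steps of Algorithm 1 are not reproduced in this section, so the sketch rests on assumptions that are consistent with the quantities used in Appendix B: the normalized bootstrap $z$-value is taken to be $\psi_{\sigma^{2}}(S|y)=\sigma\bar{\Phi}^{-1}(\alpha_{\sigma^{2}}(S|y))$, the model $\varphi_{S}$ is a polynomial in $\sigma^{2}$, $z_{S}$ is its value at $\sigma^{2}=0$, and the $p$-value is $\bar{\Phi}(z_{H})/\bar{\Phi}(z_{H}+z_{S})$. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()

def upper_tail(x):
    """bar-Phi(x) = 1 - Phi(x)."""
    return _STD.cdf(-x)

def upper_tail_inv(a):
    """Inverse of bar-Phi; requires 0 < a < 1."""
    return -_STD.inv_cdf(a)

def selective_p_value(z_H, sigma2, alpha_S, degree=1):
    """Fit psi(S) over the scales, extrapolate to sigma^2 = 0, and combine.

    ASSUMPTIONS: psi = sigma * bar-Phi^{-1}(alpha), a polynomial model in
    sigma^2, and p_SI = bar-Phi(z_H) / bar-Phi(z_H + z_S).
    """
    sigma2 = np.asarray(sigma2, dtype=float)
    psi = np.array([np.sqrt(s2) * upper_tail_inv(a)
                    for s2, a in zip(sigma2, alpha_S)])
    coef = np.polyfit(sigma2, psi, deg=degree)   # regression on sigma^2
    z_S = float(np.polyval(coef, 0.0))           # extrapolated value at 0
    return upper_tail(z_H) / upper_tail(z_H + z_S)

# Sanity check with a flat selective boundary {v >= v_s}: the bootstrap
# probability is known exactly, so the result should be the exact
# truncated-normal p-value bar-Phi(v) / bar-Phi(v_s).
v, v_s = 2.0, 1.0
scales = np.arange(0.5, 1.51, 0.1)
exact_alphas = [upper_tail((v_s - v) / np.sqrt(s2)) for s2 in scales]
p = selective_p_value(v, scales, exact_alphas)
```

With estimated bootstrap probabilities, values equal to 0 or 1 would have to be excluded or regularized before the inverse tail function is applied.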

Now, we provide the theoretical justification of the proposed algorithm. Since the boundary surface of the selective region is generally non-smooth as shown in Figure 2, we consider the asymptotic theory of nearly flat surfaces. In the model space, we take the coordinate system $(u,v)\in\mathbb{R}^{n-1}\times\mathbb{R}$ such that the hypothesis region $H$ can be written as $\{(u,v)\mid v\leq 0\}$. Using this coordinate system, let us denote the selective region by $S=\{(u,v)\mid v_{s}-s(u)\leq v\}$ at least locally in a neighborhood of $y$, where $v_{s}\in\mathbb{R}$ and $s$ is a function from $\mathbb{R}^{n-1}$ to $\mathbb{R}$ which represents the boundary surface of the selective region. Here, the $L^{1}$-norm and $L^{\infty}$-norm of the function $s$ are defined as $\|s\|_{1}=\int|s(u)|\,\mathrm{d}u$ and $\|s\|_{\infty}=\sup_{u}|s(u)|$, respectively. Let $\lambda=\|s\|_{\infty}$ be the magnitude of the boundary surface $\partial S$ of the selective region. Even in regression analysis, we assume that the selection event can be written as $S=\{(u,v)\mid v_{s}-s(u)\leq v\}$ and that the magnitude of the boundary surface $\partial S$ is relatively small, at least around the data point $y$ in the model space. That is, we consider the asymptotic theory in which $\lambda\rightarrow 0$. As in the local alternative framework, this assumption can be interpreted as saying that the given surface is approaching a flat surface parallel to $\partial H$. In this paper, this assumption is called near flatness of the boundary surface. In the asymptotic theory of nearly flat surfaces, the proposed $p$-value $p_{\mathrm{SI}}(H|S,y)$ has second-order accuracy.

Theorem 2.

Let us denote the selective region as $S=\{(u,v)\mid v\geq v_{s}-s(u)\}$ . Let $\tilde{s}$ be the Fourier transform of the function $s$ . Suppose that the $L^{1}$ -norms $\|s\|_{1}$ and $\|\tilde{s}\|_{1}$ are bounded. Let us assume that $\lambda=\|s\|_{\infty}$ is sufficiently small. Then, the selective $p$ -value described in Algorithm 2 has the second-order accuracy:

[TABLE]

Theorem 2 can be considered as a special case of Theorem 5.3 in Terada and Shimodaira (2017). In the current situation, we can directly obtain the signed distance from $y$ to the boundary surface $\partial H$. Thus, the proof of Theorem 2 is much simpler than that of Theorem 5.3 in Terada and Shimodaira (2017). The proof is given in Appendix B. This theorem provides a theoretical justification for the proposed $p$-value when the magnitude of the boundary surface is small, at least locally around the data point. The assumption requires that the boundary surface $\partial S$ of the selection event is nearly flat and that its limiting hyperplane is parallel to the surface $\partial H$ of the hypothesis region, at least locally around the data point. The numerical simulations in the next section suggest that this assumption is reasonably satisfied in practice. Of course, when the two surfaces $\partial H$ and $\partial S$ are far from parallel to each other, the proposed $p$-value has a non-negligible bias. Addressing this problem is important future work.

Remark 3.

In contrast with the $p$ -value $p_{\mathrm{SI}}(H|S,y)$ in Algorithm 2, we also propose a simple selective $p$ -value based on the classical bootstrap probability $\alpha_{1}(S|y)$ . We replace $z_{S}=\varphi_{S}(0|\hat{\theta}_{S}(y))$ in Step 5 by $z_{S}^{\prime}=\bar{\Phi}^{-1}(\alpha_{1}(S|y))$ to define

[TABLE]

It is also computed with $z_{S}^{\prime}=\psi_{1}(S|y)\approx\varphi_{S}(1|\hat{\theta}_{S}(y))$ . If the boundary surface of $S$ is flat, $z_{S}^{\prime}=z_{S}$ and thus $p_{\text{SI-BP}}=p_{\mathrm{SI}}$ . Under the assumption of Theorem 2, we can obtain

[TABLE]

indicating that $p_{\text{SI-BP}}$ has a larger bias than $p_{\mathrm{SI}}$ . This result can be proved in much the same way as Theorem 2.

5 Numerical experiments

Here, we show some numerical experiments to demonstrate the usefulness of our method. For Lasso, the exact unbiased selective test conditioned on $j\in\hat{M}$ and $\hat{s}_{j}=s_{j}\;(j\in\{1,\ldots,p\},\,s_{j}\in\{+,-\})$ can be constructed (Lee et al., 2016; Liu, Markovic and Tibshirani, 2018). First, we show that our selective $p$-value with multiscale bootstrap (i.e., $p_{\mathrm{SI}}$) approximates the exact selective $p$-value for Lasso. Here, Lasso is defined as $\min_{\beta\in\mathbb{R}^{p}}\|z-X\beta\|_{2}^{2}/2+\sum_{j=1}^{p}\rho|\beta_{j}|$. Set $(n,p)=(50,25)$ and $\beta=(2,2,2,2,2,0,\dots,0)^{T}\in\mathbb{R}^{p}$; that is, $\beta_{j}=0$ for the features $j=6,\ldots,25$. The elements $x_{ij}$ of the input matrix $X$ were independently generated from the standard normal distribution $N(0,1)$. Then, the response $z\in\mathbb{R}^{n}$ was generated as $z=X\beta+\epsilon$, where $\epsilon$ was generated from the $n$-dimensional standard normal distribution $N_{n}(0_{n},I_{n})$. In our algorithm, we used $\sigma^{2}=0.5,0.6,\dots,1.5$ as the scales and $B=10^{4}$ as the number of bootstrap replicates. The choice of scales has little effect on the stability of the result; an experiment on this point can be found in Appendix C. We simulated $2000$ independent datasets and set the significance level to $\alpha=5\%$. To compute false positive rates accurately, the variables with zero coefficients need to be selected several hundred times. Here, we used Lasso with the penalty parameter $\rho=10$ as the feature selection method. Each variable with zero coefficient was selected approximately $250$ times out of $2000$ datasets in this experiment.
In each dataset, we performed the classical $t$ -test, the selective (one-sided) test conditioned on $\hat{M}=M$ and $\hat{s}_{M}=s_{M}$ for $M\subset\{1,\ldots,p\},\,s_{M}\in\{+,-\}^{|M|}$ (Lee et al., 2016; only for Lasso), the selective (two-sided) test conditioned on $j\in\hat{M}$ for $j\in\{1,\ldots,p\}$ (Liu, Markovic and Tibshirani, 2018; only for Lasso) and our approximately unbiased (one-sided) test with multiscale bootstrap for each selected feature. We count how many times, say $N_{j}$ , the feature $j$ is selected. For each test, we also count how many times (say $R_{j}$ ) the null hypothesis $H_{0}:\beta_{j}\lesseqgtr\eta$ is rejected, and the selective rejection probability is estimated by $R_{j}/N_{j}$ .
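The data-generating design and the tally $R_{j}/N_{j}$ described above can be sketched as follows. The helper names and the record format (one dictionary of selective $p$-values per dataset, containing only the features selected in that dataset) are illustrative assumptions; the selection step and the tests themselves are not included.

```python
import numpy as np

def simulate_dataset(n=50, p=25, n_signal=5, signal=2.0, rng=None):
    """Draw (X, z, beta) with z = X beta + eps and eps ~ N_n(0, I_n)."""
    rng = np.random.default_rng() if rng is None else rng
    beta = np.zeros(p)
    beta[:n_signal] = signal
    X = rng.standard_normal((n, p))
    z = X @ beta + rng.standard_normal(n)
    return X, z, beta

def selective_rejection_rates(records, p, alpha=0.05):
    """Return (N, R, R / N) per feature.

    `records` is an iterable of dicts {feature j: selective p-value}, one
    dict per dataset. Features that are never selected get a rate of NaN.
    """
    N = np.zeros(p, dtype=int)   # number of times feature j is selected
    R = np.zeros(p, dtype=int)   # number of rejections among those
    for pvals in records:
        for j, pv in pvals.items():
            N[j] += 1
            R[j] += pv < alpha
    with np.errstate(invalid="ignore", divide="ignore"):
        rates = np.where(N > 0, R / N, np.nan)
    return N, R, rates

X, z, beta = simulate_dataset(rng=np.random.default_rng(0))
records = [{0: 0.01, 5: 0.20}, {0: 0.03}, {5: 0.04}]
N, R, rates = selective_rejection_rates(records, p=6)
```

Because the rate for a null feature is computed only over the datasets in which it was selected, $N_{j}$ must be reasonably large for the estimate to be stable.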

Panel (a) of Figure 3 shows the selective rejection probabilities of each feature for Lasso, where the four test methods are compared. In this plot, we can see that the selective rejection probabilities of our test with $p_{\mathrm{SI}}$ for the features $6$ to $25$ with $\beta_{j}=0$ are around $5\%$. Thus, our multiscale bootstrap method approximately satisfies unbiasedness in the sense of the condition stated at the end of Section 4.2. We can also see that the classical inference does not provide a valid inference after feature selection; the classical $t$-test gives more false positives than expected from the specified $\alpha$ level. Moreover, instead of the Lasso penalty $\rho|\beta_{j}|$, we also used the following MCP and SCAD penalties with the tuning parameter $(\rho,\gamma)=(10,3.7)$ as the feature selection methods:

[TABLE]
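For reference, the two penalties can be written in code as below. The sketch uses the standard forms of MCP and SCAD, which is an assumption: the exact parameterization used in the experiments may differ. Both functions act elementwise on the coefficients.

```python
import numpy as np

def mcp_penalty(beta, rho, gamma):
    """Standard MCP: rho*|b| - b^2 / (2*gamma) for |b| <= gamma*rho, then constant."""
    b = np.abs(np.asarray(beta, dtype=float))
    return np.where(b <= gamma * rho,
                    rho * b - b**2 / (2.0 * gamma),
                    0.5 * gamma * rho**2)

def scad_penalty(beta, rho, gamma):
    """Standard SCAD (requires gamma > 2): linear, quadratic, then constant."""
    b = np.abs(np.asarray(beta, dtype=float))
    middle = (2.0 * gamma * rho * b - b**2 - rho**2) / (2.0 * (gamma - 1.0))
    return np.where(b <= rho,
                    rho * b,
                    np.where(b <= gamma * rho,
                             middle,
                             0.5 * (gamma + 1.0) * rho**2))

rho, gamma = 10.0, 3.7
grid = np.array([0.0, 10.0, 37.0, 100.0])
mcp_values = mcp_penalty(grid, rho, gamma)
scad_values = scad_penalty(grid, rho, gamma)
```

Both penalties agree with the Lasso penalty $\rho|\beta_{j}|$ near zero and level off for large coefficients, which is the source of the non-convexity.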

For non-convex penalties, there are issues of local minima and of multiple global minima. In general, we assume that the selection event can be represented as a fixed set in the data space. Thus, both issues have unexpected effects in selective inference. To overcome these issues, we used fixed initial values for MCP and SCAD (another approach is to extend the data space to include the space of initial values). In this experiment, we used the R package plus (Zhang and Melnik, 2012) for MCP and SCAD. The algorithm of this package generates a piecewise linear path of coefficients, starting with zero coefficients for an infinite penalty.

Our multiscale bootstrap method works well not only for Lasso but also for more complicated feature selection methods such as MCP and SCAD. Panels (b) and (c) of Figure 3 show the selective rejection probabilities in the cases of MCP and SCAD, respectively. Here, we note that the existing selective inference methods (shown in green and blue) cannot be applied to MCP and SCAD. Each feature with zero coefficient was selected approximately $250$ times by MCP and SCAD in this experiment. In this setting, although no exact unbiased selective inference has been proposed, the selective rejection probabilities of our test for the features $6$ to $25$ are around $5\%$. In general, we can get a more accurate result as the sample size increases. In fact, panel (d) of Figure 3 shows the result of the same experiment for MCP with the larger sample size $n=100$. Compared with the case of $n=50$ in panel (b), the result is more accurate (i.e., shows less variation) in the case of $n=100$.

Moreover, we also computed the simpler selective $p$ -value $p_{\text{SI-BP}}$ based on the classical bootstrap in both settings with MCP and SCAD. For MCP with $n=50$ , the selective rejection probabilities of $p_{\text{SI}}$ and $p_{\text{SI-BP}}$ under the null hypotheses are $4.62\%$ and $3.45\%$ , respectively. For SCAD with $n=50$ , the selective rejection probabilities of $p_{\text{SI}}$ and $p_{\text{SI-BP}}$ under the null hypotheses are $4.62\%$ and $3.52\%$ , respectively. These results show that the multiscale bootstrap method is more accurate than the classical bootstrap method in accordance with theory. In the setting with $n=100$ , the selective rejection probabilities of $p_{\text{SI}}$ and $p_{\text{SI-BP}}$ under the null hypotheses are $5.16\%$ and $5.02\%$ , respectively. As the sample size increases, both $p_{\text{SI}}$ and $p_{\text{SI-BP}}$ provide almost unbiased results. It could be because the boundary surface of the selection event is almost flat, at least locally around the data point, in a larger sample case.

In addition, set $\beta=(\theta,\theta,\theta,\theta,\theta,0,\dots,0)^{T}\in\mathbb{R}^{p}$ and $(n,p)=(50,10)$. We compare the true positive rates (TPRs; i.e., statistical powers) and the false positive rates (FPRs; i.e., type-I errors) of these tests as $\theta$ changes in the case of Lasso. Here, TPR is defined as the proportion of selected non-zero features that are correctly identified, and FPR is defined as the proportion of selected zero features that are incorrectly detected. Figure 4 shows that both the proposed method via multiscale bootstrap and the exact selective test conditioned on $j\in\hat{M}$ (Liu, Markovic and Tibshirani, 2018) not only have desirably high TPRs but also control FPRs at the significance level $\alpha=5\%$. The non-selective $t$-test (the black line) attains the highest TPR but does not control the FPR, and thus it is not valid. Focusing on the TPR of the over-conditioning selective test (Lee et al., 2016), we can see that the unnecessarily restrictive selection event $\{\hat{M}=M,\hat{s}_{M}=s_{M}\}$ leads to lower statistical power.

Next, we deal with the prostate cancer data (Stamey et al., 1989), which is available in the R package ElemStatLearn (Halvorsen, 2015). Stamey et al. (1989) studied the relation between the level of prostate-specific antigen (PSA) and $8$ clinical measures: the log cancer volume (lcavol), the log prostate weight (lweight), and so on. Here, we fit a linear regression model for the log of PSA (lpsa) with the $8$ clinical measures. In this application, we preprocessed the data so that each feature has mean zero and unit variance.

The main purpose is to provide the selective confidence intervals (CIs) for the coefficients of the $6$ features selected by Lasso with the penalty $\rho=5$. Here, we also set $\alpha=5\%$. We computed four types of confidence intervals with confidence level $1-\alpha$ as shown in Figure 5: the non-selective CI $[L_{j}^{(a)},U_{j}^{(a)}]$ using the $t$-distribution, the selective CI $[L_{j}^{(b)},U_{j}^{(b)}]$ conditioned on $\{\hat{M}=M,\hat{s}_{M}=s_{M}\}$ (Lee et al., 2016), the selective CI $[L_{j}^{(c)},U_{j}^{(c)}]$ conditioned on $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}$ (Liu, Markovic and Tibshirani, 2018), and the approximate selective CI $[L_{j}^{(d)},U_{j}^{(d)}]$ based on our approximately unbiased $p$-values $p_{\mathrm{SI}}$ computed by the multiscale bootstrap. We note that the first three CIs satisfy the following equations, respectively, for $j\in\{1,\ldots,p\}$, $s_{j}\in\{+,-\}$, $M\subseteq\{1,\dots,p\}$, $s_{M}\in\{+,-\}^{|M|}$:

[TABLE]

From the plot, we can see that our selective CIs $[L_{j}^{(d)},U_{j}^{(d)}]$ approximate the exact selective CIs $[L_{j}^{(c)},U_{j}^{(c)}]$ very well. Moreover, the over-conditioning on the selection event $\{\hat{M}=M,\hat{s}_{M}=s_{M}\}$ made the CIs $[L_{j}^{(b)},U_{j}^{(b)}]$ wider than $[L_{j}^{(c)},U_{j}^{(c)}]$, which indicates that the less restrictive selection event $\{j\in\hat{M},\hat{s}_{j}=s_{j}\}$ is preferable.

6 Discussion

A new multiscale bootstrap method (Algorithm 2) is proposed to compute approximately unbiased selective $p$ -values and confidence intervals for regression coefficients after feature selection. The new method is useful in particular for complicated feature selection algorithms such as MCP and SCAD, while existing methods are only available for simpler feature selection algorithms such as Lasso. The new method also computes shorter confidence intervals than most existing methods by minimally-conditioning on each selected feature instead of over-conditioning on all selected features.

The proposed method is closely related to the exact selective inference such as Lee et al. (2016) and Liu, Markovic and Tibshirani (2018). Here, in addition to the Gaussian assumption, we assume that the boundary surface of the hypothesis region is flat. Let us consider the line passing through the point $y$ and perpendicular to the boundary of $H$ . By setting the projection point $\hat{\mu}(y)\in\partial H$ of $y$ as the origin, we can consider the one-dimensional coordinate system $z$ on the line. If we know the distance from $y$ to $\partial H$ as well as intervals representing the intersection of the line and the selective region $S$ , we can perform the exact selective inference. For example, in Figure 6, the distance is $z_{H}$ and the interval is $[z_{S},\infty)$ . As with the polyhedral lemma, the following $p$ -value provides the exact selective inference:

[TABLE]

The explicit forms of the intervals can be obtained for the Lasso case, but this may not be possible for more complicated cases. In the proposed method, we estimate the geometric quantities $z_{H}$ and $z_{S}$ indirectly via the multiscale bootstrap. As an alternative to this approach, we can use a grid search on the one-dimensional coordinate system $z$ to obtain the intervals. In connection with this approach, Duy and Takeuchi (2021) propose a parametric programming-based method that can perform the exact selective inference for the Lasso very efficiently without conditioning on signs.

Our method is applicable to the case of $p>n$, provided that $(X^{T}X)^{+}X^{T}\mu$ is the target of inference, where $A^{+}$ denotes the pseudo-inverse matrix of $A$. However, in this case, it is difficult to estimate the error variance $\tau^{2}$ or the residuals reasonably well. Thus, the selective inference in the case of $p>n$ is important future work. In theory, we may consider the application of the proposed framework to the submodel setting. However, the selection event $\{\hat{M}=M\}$ is often too small, i.e., the bootstrap probability of the selection event becomes very small, and thus the proposed framework does not work well. By combining it with the randomized response method of Tian and Taylor (2018), we may be able to construct an appropriate selective inference based on the multiscale bootstrap method for the submodel setting.

In this paper, the multiscale bootstrap is used only for the selective region, because the hypothesis region with a flat boundary surface (i.e., $\beta_{j}=0$ ) is easily expressed in the model space. However, Algorithm 1 is valid even for a general hypothesis region with a curved surface. Therefore, we may extend our method for, say, non-linear regression or multiple comparisons of regression coefficients in future work.

Acknowledgments

The authors would like to thank an associate editor and two reviewers for valuable comments and suggestions that improve the quality of the paper considerably. This research was supported in part by JSPS KAKENHI Grant (JP16K16024, JP20K19756, and JP20H00601 to YT; JP16H02789, JP20H04148, and JP20H04243 to HS) and MEXT Project for Seismology toward Research Innovation with Data of Earthquake (STAR-E) Grant Number JPJ010217 (to YT).

Appendix A Asymptotic theory for nearly flat surfaces

In Section 3, we only describe the multiscale bootstrap method for the hypothesis and selective regions with smooth boundaries. In the classical large sample theory, an important point is that the smooth boundary surface of the hypothesis region approaches a flat surface in a neighborhood of any point on its boundary surface. However, this claim cannot be true for regions with non-smooth boundaries since cone-shaped regions are scale-invariant in the neighborhood of the vertex. In many practical situations, the hypothesis and selective regions could have non-smooth surfaces. Thus, Shimodaira (2008) develops a new theoretical framework, called the asymptotic theory of nearly flat surfaces. In this theory, we consider the situation in which the magnitude of the boundary surfaces, say $\lambda$, becomes small, that is, any boundary surface approaches a flat surface at least locally in a neighborhood. The artificial parameter $\lambda$ is introduced, and we consider the situation of $\lambda\rightarrow 0$ instead of $n\rightarrow\infty$. More precisely, suppose that the $L^{1}$-norms of a function $h$ and its Fourier transform $\tilde{h}$, i.e., $\|h\|_{1}$ and $\|\tilde{h}\|_{1}$, are bounded and that the $L^{\infty}$-norm $\|h\|_{\infty}$ of $h$ has the same order as $\lambda$. A function satisfying these properties is called nearly flat. Then, we consider the asymptotic theory as $\lambda\rightarrow 0$. Note that $\lambda$ in this theory corresponds to $1/\sqrt{n}$ in the classical large sample theory. Here, we assume that the hypothesis and selective regions are defined as follows, respectively:

[TABLE]

where $h$ and $s$ are nearly flat functions, and $v_{s}\in\mathbb{R}$ .

Even in this theory, the bootstrap probability also has the first-order accuracy:

[TABLE]

Write $y=(u,v)\in\mathbb{R}^{m}\times\mathbb{R}$ . Then, the distribution of the bootstrap sample $Y^{\ast}=(U^{\ast},V^{\ast})$ with scale $\sigma^{2}$ is given as

[TABLE]

Let $\mathcal{E}_{\sigma^{2}}$ denote the expectation operator related to $U^{\ast}$ , that is,

[TABLE]

where $E_{\sigma^{2}}[\cdot|u]$ is the expectation related to $U^{\ast}$ and $\mathcal{F}^{-1}$ is the inverse Fourier transform operator. For the normalized bootstrap $z$-value, we have the following scaling law, which parallels that of the large sample theory:

[TABLE]

We note that, for $\sigma_{1}^{2},\sigma_{2}^{2}>0$ , it follows that $\mathcal{E}_{\sigma_{1}^{2}}\mathcal{E}_{\sigma_{2}^{2}}h(u)=\mathcal{E}_{\sigma_{1}^{2}+\sigma_{2}^{2}}h(u)$ . Hence, at least formally, the expected value with a negative variance is defined as

[TABLE]
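The additivity of the scales and the formal inverse can be checked numerically in a simple case. The sketch below assumes $m=1$, interprets $\mathcal{E}_{\sigma^{2}}h(u)$ as $E[h(U^{\ast})]$ with $U^{\ast}\sim N(u,\sigma^{2})$, and uses the illustrative function $h(u)=0.1\cos(1.5u)$, for which Gaussian smoothing multiplies $h$ by $\exp(-\sigma^{2}\omega^{2}/2)$ with $\omega=1.5$. The helper `smooth` is hypothetical and evaluates the expectation by Gauss-Hermite quadrature.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Nodes and weights for the weight function exp(-x^2 / 2), rescaled so that
# the weights sum to one (i.e., they integrate against the N(0, 1) density).
_NODES, _WEIGHTS = hermegauss(60)
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)

def smooth(h, sigma2):
    """Return the function u -> E[h(U*)] with U* ~ N(u, sigma2), scalar u."""
    sigma = np.sqrt(sigma2)
    return lambda u: float(np.sum(
        _WEIGHTS * np.array([h(u + sigma * x) for x in _NODES])))

omega = 1.5
h = lambda u: 0.1 * np.cos(omega * u)

# Additivity: smoothing with 0.3 and then 0.7 equals smoothing with 1.0.
two_steps = smooth(smooth(h, 0.3), 0.7)(0.4)
one_step = smooth(h, 1.0)(0.4)
closed_form = 0.1 * np.exp(-0.5 * omega**2) * np.cos(omega * 0.4)

# Formal inverse: for this h, the "negative variance" operator multiplies by
# exp(+omega^2 / 2); smoothing the result with variance 1 recovers h.
h_minus_one = lambda u: 0.1 * np.exp(0.5 * omega**2) * np.cos(omega * u)
recovered = smooth(h_minus_one, 1.0)(0.4)
```

The inverse amplifies high-frequency components exponentially, which gives some intuition for why it need not exist for general $h$.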

Note that $\mathcal{E}_{-\sigma^{2}}h(u)$ may not be well-defined in general. For a detailed discussion about $\mathcal{E}_{-\sigma^{2}}h(u)$ , we refer the reader to Shimodaira (2008). If $\mathcal{E}_{-1}h(u)$ can be defined, the $p$ -value $p_{\mathrm{AU}}(H|y)=\bar{\Phi}(\psi_{-1}(H|y))=\bar{\Phi}(v+\mathcal{E}_{-1}h(u))$ has the second-order accuracy for non-selective test (Shimodaira, 2008):

[TABLE]

As with the classical large sample theory, if $\mathcal{E}_{-1}h^{2}(u)$ also exists, it can be shown that $p_{\mathrm{AU}}(H|y)$ has third-order accuracy with bias of only $O(\lambda^{3})$.

For a smooth $h$, it follows that $\mathcal{E}_{\sigma^{2}}h(u)=\sum_{j=0}^{\infty}\sigma^{2j}\theta_{j}(u)$. That is, letting $\theta_{H,0}=v+\theta_{0}(u)$ and $\theta_{H,j}=\theta_{j}(u)\;(j\geq 1)$, $\varphi_{H}(\sigma^{2}|\theta_{H})$ can be modeled as $\theta_{H,0}+\theta_{H,1}\sigma^{2}+\theta_{H,2}(\sigma^{2})^{2}+\cdots$. Thus, using a polynomial regression with predictor $\sigma^{2}$, we can compute the $p$-value $p_{\mathrm{AU}}(H|y)=\bar{\Phi}(\theta_{H,0}-\theta_{H,1}+\theta_{H,2}-\cdots)$ by formally letting $\sigma^{2}=-1$. In contrast, for a cone-shaped region $H$, it is shown that $\mathcal{E}_{\sigma^{2}}h(u)=\sum_{j=0}^{\infty}\sigma^{1-j}\theta_{j}(u)$. Since we have $\theta_{j}(u)=O(\|u\|^{j})$ as $\|u\|\rightarrow 0$, focusing on the first two terms, we obtain

[TABLE]

In this model, we cannot take $\sigma^{2}=-1$ , and $\mathcal{E}_{-1}h(u)$ does not exist for a cone-shaped region $H$ . This observation is related to the important fact proved by Lehmann (1952) that an unbiased test cannot exist for a cone-shaped hypothesis region. Set $\theta_{H,0}=v+\theta_{1}(u)$ and $\theta_{H,1}=\theta_{0}(u)$ , and then the normalized bootstrap $z$ -value $\psi_{\sigma^{2}}(H|y)$ can be approximated by the model $\theta_{H,0}+\theta_{H,1}\sigma$ ; note the predictor is $\sigma=\sqrt{\sigma^{2}}$ instead of $\sigma^{2}$ . Here, we also denote by $\varphi_{H}(\sigma^{2}|\theta_{H})$ the model which approximates the normalized bootstrap $z$ -value $\psi_{\sigma^{2}}(H|y)$ . For fixed $\sigma_{0}^{2}>0$ , let $\varphi_{H,k}(\sigma^{2}|\theta_{H},\sigma_{0}^{2})$ be the truncated Taylor expansion of $\varphi_{H}(\sigma^{2}|\theta_{H})$ with the first $k$ terms at $\sigma_{0}^{2}$ :

[TABLE]

We can always use the above formula for extrapolating $\psi_{\sigma^{2}}(H|y)$ to $\sigma^{2}\leq 0$ . Therefore, Algorithm 1 is updated to Algorithm 3. In practice, Algorithm 3 with $k=2$ can be simply implemented as Algorithm 1 with the linear model $\theta_{H,0}+\theta_{H,1}\sigma^{2}$ and a narrow range of $\sigma^{2}=n/n^{\prime}$ values around $\sigma^{2}_{0}$ .
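The extrapolation for the cone-type model $\theta_{H,0}+\theta_{H,1}\sigma$ can be made concrete as follows. The sketch interprets "the first $k$ terms" as the Taylor terms of order $0,\dots,k-1$ in the variable $\sigma^{2}$ around $\sigma_{0}^{2}$, which is an assumption about the displayed formula; the function name is hypothetical.

```python
import math

def taylor_extrapolate_sqrt_model(theta0, theta1, sigma2, sigma0_2=1.0, k=2):
    """k-term Taylor expansion, in sigma^2 around sigma0_2, of
    phi(sigma^2) = theta0 + theta1 * sqrt(sigma^2).

    The expansion is a polynomial in sigma^2, so it can be evaluated at
    sigma2 <= 0, where the square-root model itself is undefined.
    """
    total = theta0 + theta1 * math.sqrt(sigma0_2)    # order-0 term
    coef = 1.0                                       # falling factorial of 1/2
    for j in range(1, k):
        coef *= 0.5 - (j - 1)
        deriv = theta1 * coef * sigma0_2 ** (0.5 - j)  # j-th derivative at sigma0_2
        total += deriv * (sigma2 - sigma0_2) ** j / math.factorial(j)
    return total

theta0, theta1 = 1.2, 0.4
z_k1 = taylor_extrapolate_sqrt_model(theta0, theta1, sigma2=-1.0, k=1)
z_k2 = taylor_extrapolate_sqrt_model(theta0, theta1, sigma2=-1.0, k=2)
z_k3 = taylor_extrapolate_sqrt_model(theta0, theta1, sigma2=-1.0, k=3)
```

With $\sigma_{0}^{2}=1$ and $k=2$, the extrapolated value at $\sigma^{2}=-1$ is simply $\theta_{H,0}$, since the linear term cancels $\theta_{H,1}\sigma_{0}$; this is the same as fitting a straight line in $\sigma^{2}$ near $\sigma_{0}^{2}$ and reading it off at $-1$.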

For fixed $k\in\mathbb{N}$ , we consider the following $p$ -value:

[TABLE]

Under some regularity conditions, Shimodaira (2008) proves that

[TABLE]

at each $\mu\in\partial H$. Moreover, for general selective inference with possibly non-smooth boundary surfaces, Terada and Shimodaira (2017) propose the following selective $p$-value:

[TABLE]

where $\sigma_{-1}^{2},\sigma_{0}^{2}>0$ . In addition, it is shown that the selective $p$ -value has the second-order accuracy:

[TABLE]

at each $\mu\in\partial H$ .

Appendix B Proof of Theorem 2

Proof.

First, we show that, for given $\alpha\in(0,1)$ , there exists a nearly flat function $r$ such that

[TABLE]

where $R=\{(u,v)\mid v>v_{r}-r(u)\}$ and $v_{r}=\bar{\Phi}^{-1}(\alpha\bar{\Phi}(v_{s}))$ . By Lemma 5.1 in Terada and Shimodaira (2017) or equivalently eq. (5.3) in Shimodaira (2008), we have

[TABLE]

for $\mu=(\theta,0)\in\partial H$ . Let us temporarily assume that $r$ is nearly flat. Then, we also have

[TABLE]

From Eq. (8), it follows that

[TABLE]

Since $\bar{\Phi}(v_{r})=\alpha\bar{\Phi}(v_{s})$ , we have $\mathcal{E}_{1}r(\theta)=\alpha C\mathcal{E}_{1}s(\theta)+O(\lambda^{2})$ , where $C=\phi(v_{s})/\phi(v_{r})$ . Thus, applying the inverse operator $\mathcal{E}_{-1}$ to both sides, we obtain $r(u)=\alpha Cs(u)+O(\lambda^{2})$ . Since $s$ is nearly flat, $r$ should be nearly flat. Similarly, we can show that the above $r$ actually satisfies Eq. (8). Combining $\phi(v_{r})r(u)=\alpha\phi(v_{s})s(u)+O(\lambda^{2})$ with $\bar{\Phi}(v_{r})=\alpha\bar{\Phi}(v_{s})$ , we obtain $\bar{\Phi}(v_{r})+\phi(v_{r})r(u)=\alpha\left\{\bar{\Phi}(v_{s})+\phi(v_{s})s(u)\right\}+O(\lambda^{2})$ . By Taylor’s theorem, we deduce that

[TABLE]

Now, we consider the rejection region $R$ based on $p_{\mathrm{SI}}$ , that is, $R=\{(u,v)\mid p_{\mathrm{SI}}(H|S,y)<\alpha\}$ . By Lemma 5.1 in Terada and Shimodaira (2017) or equivalently eq. (5.3) in Shimodaira (2008), for $y=(u,v)$ , we have $\psi_{\sigma^{2}}(S|y)=-v+v_{s}-\mathcal{E}_{\sigma^{2}}s(u)+O(\lambda^{2})$ . Since $\mathcal{E}_{0}s(u)=s(u)$ , we have $z_{S}=\psi_{0}(S|y)=-v+v_{s}-s(u)+O(\lambda^{2})$ . Let us recall that $\psi_{\sigma^{2}}(H|y)=v(H|y)=v$ for $H=\{(u^{\prime},v^{\prime})\mid u^{\prime}\in\mathbb{R}^{m},v^{\prime}\leq 0\}$ , and so $z_{H}=v.$ Thus, we obtain

[TABLE]

By substituting $v=v_{r}-r(u)$ above, it follows from Eq. (9) that $p_{\mathrm{SI}}(H|S,y)=\alpha+O(\lambda^{2})$ for $y=(u,v_{r}-r(u))\in\partial R$ . This finishes the proof. ∎

Appendix C Choice of the tuning parameters in multiscale bootstrap

The multiscale bootstrap is not very sensitive to the choice of the scales. To confirm this, we additionally performed experiments with two settings of scales, as shown in Figure 7. We chose $\ell$ ($=5$ and $10$) scales from the interval between $0.1$ and $2$, equally spaced on the log scale. We also changed the number of replications $B$ ($=500,1000,5000,10000$) in the multiscale bootstrap. The other parameters are the same as in the experiment in Section 5. For a simulated dataset, we computed selective $p$-values under the various settings ($B=500,1000,5000,10000;\ell=5,10$) of the multiscale bootstrap. Under each setting, we computed the selective $p$-value 10 times for the two selected features No. 10 and No. 11, whose true coefficients are zero.
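Such a grid of scales and settings can be generated as in the sketch below. Whether the endpoints $0.1$ and $2$ refer to $\sigma^{2}$ or to $\sigma$ is not stated here; the sketch assumes $\sigma^{2}$, and the helper name is hypothetical.

```python
import numpy as np

def log_spaced_scales(n_scales, lo=0.1, hi=2.0):
    """Scales equally spaced on the log scale between lo and hi (inclusive).

    ASSUMPTION: the returned values are used as sigma^2.
    """
    return np.geomspace(lo, hi, num=n_scales)

# All combinations of the number of scales and the number of replications.
settings = [(n_scales, B, log_spaced_scales(n_scales))
            for n_scales in (5, 10)
            for B in (500, 1000, 5000, 10000)]
scales_5 = log_spaced_scales(5)
```

Consecutive scales differ by a constant ratio, so small scales are sampled more densely than on a linear grid.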

In Figure 7, the red box plots correspond to $\ell=5$ and the blue ones to $\ell=10$. The averages of the 10 $p$-values are almost the same across all settings, and thus the $p$-values are not sensitive to $\ell$. On the other hand, the variance of the $p$-values decreases as $B$ increases. Several thousand replications are enough in practice.
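The dependence of the variance on $B$ can be illustrated with a toy model, which is not the experiment reported above. The sketch assumes that a bootstrap probability is estimated as the fraction of $B$ independent replicates falling in a region whose true probability is `true_prob` (a hypothetical value), so that the Monte Carlo standard deviation of the estimate is about $\sqrt{p(1-p)/B}$.

```python
import numpy as np

rng = np.random.default_rng(0)
true_prob = 0.3  # hypothetical bootstrap probability of a region


def bootstrap_probability_sd(n_replicates: int, n_repeats: int = 2000) -> float:
    """Empirical standard deviation of a bootstrap-probability estimate
    based on n_replicates independent replicates (toy binomial model)."""
    estimates = rng.binomial(n_replicates, true_prob, size=n_repeats) / n_replicates
    return float(estimates.std())


sds = {B: bootstrap_probability_sd(B) for B in (500, 1000, 5000, 10000)}
for B, sd in sds.items():
    # Compare the empirical value with the binomial formula sqrt(p(1-p)/B).
    print(B, round(sd, 4), round(np.sqrt(true_prob * (1 - true_prob) / B), 4))
```

Under this toy model the standard deviation shrinks like $1/\sqrt{B}$, so moving from $B=500$ to $B=10000$ reduces it by a factor of roughly 4.5, in line with the qualitative pattern described for Figure 7.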

Bibliography


1. Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics 41 802–837.
2. Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika 62 441–444.
3. Duy, V. N. L. and Takeuchi, I. (2021). Parametric Programming Approach for More Powerful and General Lasso Selective Inference. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021) 901–909.
4. Efron, B. (1985). Bootstrap Confidence Intervals for a Class of Parametric Problems. Biometrika 72 45–58.
5. Efron, B., Halloran, E. and Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences 93 13429–13434.
6. Efron, B. and Tibshirani, R. (1998). The problem of regions. Annals of Statistics 26 1687–1718.
7. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360.
8. Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39 783–791.
