Selective inference after feature selection via multiscale bootstrap
Yoshikazu Terada, Hidetoshi Shimodaira

TL;DR
This paper introduces a multiscale bootstrap method for selective inference that provides more accurate and less biased p-values after feature selection, applicable to complex algorithms beyond traditional methods like Lasso.
Contribution
It proposes a novel resampling approach using multiscale bootstrap to compute unbiased p-values for feature selection, overcoming limitations of existing methods.
Findings
Multiscale bootstrap yields more accurate p-values than classical bootstrap.
The method is effective for complex feature selection algorithms like non-convex regularization.
Numerical experiments confirm the method's robustness and applicability.
Yoshikazu Terada
Graduate School of Engineering Science, Osaka University
1-3 Machikaneyama-cho, Toyonaka, Osaka 560-8531, Japan
Hidetoshi Shimodaira
Graduate School of Informatics, Kyoto University
Yoshida Honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
Abstract
It is common to show the confidence intervals or $p$-values of selected features, or predictor variables in regression, but they often involve selection bias. The selective inference approach solves this bias by conditioning on the selection event. Most existing studies of selective inference consider a specific algorithm, such as Lasso, for feature selection, and thus they have difficulties in handling more complicated algorithms. Moreover, existing studies often consider unnecessarily restrictive events, leading to over-conditioning and lower statistical power. Our novel and widely applicable resampling method via multiscale bootstrap addresses these issues to compute an approximately unbiased selective $p$-value for the selected features. As a simplification of the proposed method, we also develop a simpler method via the classical bootstrap. We prove that the $p$-value computed by our multiscale bootstrap method is more accurate than that of the classical bootstrap method. Furthermore, numerical experiments demonstrate that our algorithm works well even for more complicated feature selection methods such as non-convex regularization.
Jointly affiliated with RIKEN Center for Advanced Intelligence Project (AIP), 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.
1 Introduction
In classical statistical inference, the specification of a hypothesis is assumed to be independent of the obtained data. In recent years, as big and complicated data have become common in various fields, it has become difficult to set hypotheses in advance. Thus, in modern data analysis, we commonly find useful hypotheses from the obtained data using exploratory data analysis, and then perform classical inference for the selected hypotheses. However, the effects of hypothesis selection are often ignored in the classical inference, and thus this naive approach does not provide valid statistical inference. Recently, selective (or post-selection) inference, which deals with the hypothesis selection effect appropriately, has drawn considerable attention not only in the statistical community but also in the machine learning community (e.g., Yang et al., 2016; Suzumura et al., 2017; Slim et al., 2019; Lim et al., 2020).
In this paper, we focus on selective inference after feature selection, i.e., predictor variable selection, in regression analysis. The most intuitive and straightforward approach to selective inference was proposed by Cox (1975) and is called data splitting. In data splitting, an i.i.d. sample is divided into two subsamples: one is used for the feature selection, and the other is used for the inference on the selected features. However, this approach reduces the data available for both feature selection and inference. Fithian, Sun and Taylor (2014) provides the theoretical foundation for considering the optimality of selective inference in the sense of statistical power. In Berk et al. (2013), a valid selective inference after feature selection for the submodel parameters is developed for the regression problem without assuming a specific feature selection method. Importantly, Berk et al. (2013) also introduces both the submodel view and the full-model view of the targets of selective inference after feature selection. Under the setting of Berk et al. (2013), Lee et al. (2016) characterizes the selection event in which a specific model is selected by Lasso (Tibshirani, 1996). More precisely, this selection event is represented as a union of polyhedra in the space of the response variable. Based on this fact, Lee et al. (2016) proposes exact selective inference for feature selection via Lasso. The significance levels conditioned on the selection event are computed from truncated normal distributions, as justified by the polyhedral lemma. Tibshirani et al. (2016) develops a general framework to perform selective inference after any selection event that is represented as a response vector falling into a polyhedral set. Tibshirani et al. (2018) proves that this selective inference is asymptotically valid even for non-normal error distributions.
On the other hand, the exact selective inference approaches such as Lee et al. (2016) and Tibshirani et al. (2016) assume that the selection event is explicitly represented as a union of polyhedra in the space of the response variable. Although the idea of the polyhedral lemma is, in fact, valid for any selective sets beyond a union of polyhedra (Liu, Markovic and Tibshirani, 2018), the existing approaches have computational difficulties in handling more complicated algorithms with non-convex penalties such as MCP (minimax concave penalty; Zhang, 2010) and SCAD (smoothly clipped absolute deviation; Fan and Li, 2001), where the selective sets become more complicated than for the ordinary Lasso. Although the selective inference of Berk et al. (2013) is not limited to specific feature selection methods, its computational cost may be prohibitive unless the number of variables is small. In addition, it controls type-I errors simultaneously under all submodels, thus leading to very conservative confidence intervals. Moreover, most existing selective inference with the full-model view over-conditions unnecessarily and thus loses statistical power, because the inference is conditioned on a selected model, whereas it could be minimally conditioned on a selected feature. The selective set of the minimally conditioning event is more complicated and computationally difficult, and thus valid post-selection inference for it was implemented only recently, for the first time, by Liu, Markovic and Tibshirani (2018), and only for the ordinary Lasso case.
Recently, Terada and Shimodaira (2017) extends the general hypothesis testing framework, called the problem of regions (Efron and Tibshirani, 1998), to selective inference, and proposes a new selective inference approach via the multiscale bootstrap of Shimodaira (2002, 2004, 2008). This approach is not based on the polyhedral lemma, and we can easily compute approximately unbiased selective $p$-values for hypotheses conditioned on complicated selective sets. Moreover, Terada and Shimodaira (2017) provides the theoretical justification for this approach in two asymptotic theories. In this framework, we consider the general setting in which the hypothesis and the selection event are represented as regions in some parameter space. This approach can be widely applied because we do not need to know the shapes of these regions, but only need to prepare functions that tell whether these regions include a realization of the parameter estimate. In fact, Shimodaira and Terada (2019) describes an application of this approach to testing trees and edges in phylogenetics. Moreover, based on our idea described in this paper and in Terada and Shimodaira (2017), Lim et al. (2020) develops a powerful selective inference after feature selection using the Hilbert-Schmidt Independence Criterion and the Maximum Mean Discrepancy.
In the original form of the multiscale bootstrap method, we change the sample size of bootstrap samples and then compute a bias-corrected $p$-value using geometric quantities (curvature and signed distance of the region) estimated from the scaling law of the bootstrap probabilities of the hypothesis and the selection event. However, this multiscale bootstrap method cannot be directly applied to selective inference after feature selection, since the shape of the selective region is unavoidably tied to the sample size in the feature selection problem. To overcome this difficulty, we propose resampling the residuals with a change of scale. The advantage of our method is that it can be applied to almost any feature selection algorithm. In addition, the computational complexity of our method is of the same order as that of the classical bootstrap method.
This paper is organized as follows. In Section 2, we describe the setting of selective inference after feature selection. In Section 3, we give a brief exposition of multiscale bootstrap and the general selective inference via multiscale bootstrap. In Section 4, we develop a new selective inference algorithm via multiscale bootstrap in regression analysis. In Section 5, the usefulness of our approach is demonstrated through numerical experiments.
2 Selective inference after feature selection
We briefly describe selective inference after feature selection in linear regression; the detailed setting is given in Section 4. Let $Y \in \mathbb{R}^n$ be the response variable with mean $\mu$ and variance $\sigma^2 I_n$. Let $X_1, \dots, X_p \in \mathbb{R}^n$ be non-random features (i.e., predictor variables) and $X = (X_1, \dots, X_p)$. First, we need to clarify the target of statistical inference. If we assume the following first-order correctness:
$$\mu = X\beta,$$
the target of the estimators is clearly the "true" coefficients $\beta$. However, in real data analysis, it is difficult to assume this correctness, as expressed in Box's famous quote "All models are wrong". Even under the first-order correctness, the selected models may be wrong. Without the first-order correctness, the target of inference is ambiguous. Berk et al. (2013) clarifies the target of statistical inference in linear regression, and there are two possible types of target:
- Let $M \subseteq \{1, \dots, p\}$ be a specified set of features. We consider a submodel using the features in $M$. The target for the submodel view with respect to $M$ is
$$\beta^{M} = \operatorname*{arg\,min}_{b} \|\mu - X_M b\|^2 = (X_M^\top X_M)^{-1} X_M^\top \mu,$$
where $X_M$ is the predictor matrix consisting of the features of $M$. The true value of $\beta_j^{M}$ in the submodel view depends on $M$.
- The target for the full-model view is $\beta^{M_F}$ with $M_F = \{1, \dots, p\}$. Thus
$$\beta = (X^\top X)^{-1} X^\top \mu.$$
The true value of $\beta_j$ in the full-model view does not depend on $M$.

In this paper, the target for our method is the coefficients in the full-model view, while both views are discussed below.
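Now, we describe the basic concepts of selective inference in regression. Let $\alpha \in (0,1)$ be a significance level in selective inference. When the null hypothesis $H_0$ is selected based on data, we should control the selective type-I error rate:
$$P_{H_0}(\text{reject } H_0 \mid H_0 \text{ is selected}) \le \alpha, \tag{1}$$
where $P_{H_0}$ is a probability distribution under $H_0$. The event $\{H_0 \text{ is selected}\}$ is called the selection event. After feature selection based on data, we obtain the selected model $\hat M \subseteq \{1, \dots, p\}$. Then, depending on whether the target is in the submodel view or the full-model view, the null hypothesis for $j \in \hat M$ is $H_{0,j}^{\hat M}: \beta_j^{\hat M} = 0$ or $H_{0,j}: \beta_j = 0$, respectively. Here $\beta_j$ is a simplified notation of $\beta_j^{M_F}$ with $M_F = \{1, \dots, p\}$. Moreover, we may consider two different types of events, $\{\hat M = M\}$ or $\{j \in \hat M\}$, as the selection event.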
However, the event $\{j \in \hat M\}$ is not an appropriate selection event for the hypothesis of the submodel view $H_{0,j}^{\hat M}$, because this hypothesis depends on the other selected features and thus the probability (1) does not make sense. Therefore, for the hypothesis of the submodel view, the event $\{\hat M = M\}$ or more restrictive events are appropriate as a selection event; the event that additionally fixes the signs of the selected coefficients is sometimes considered as the selection event (e.g., Lee et al., 2016) for computational reasons.
On the other hand, we can consider the two different types of conditioning, $\{\hat M = M\}$ and $\{j \in \hat M\}$, as a selection event for the hypothesis of the full-model view $H_{0,j}$. Since both of these events are appropriate, we may wonder which of them is more desirable. This is answered by the argument of the monotonicity of selective error in Proposition 3 of Fithian, Sun and Taylor (2014). Here we see its adaptation to our setting. Since $\{j \in \hat M\} = \bigcup_{M \ni j} \{\hat M = M\}$, we have
$$P_{H_{0,j}}(\text{reject } H_{0,j} \mid j \in \hat M) = \sum_{M \ni j} P_{H_{0,j}}(\text{reject } H_{0,j} \mid \hat M = M)\, P_{H_{0,j}}(\hat M = M \mid j \in \hat M).$$
If we control the selective type-I error conditioned on $\{\hat M = M\}$ at level $\alpha$ for all models $M$ containing $j$, the selective type-I error conditioned on $\{j \in \hat M\}$ is also automatically controlled at level $\alpha$. Thus, this monotonicity tells us that over-conditioning leads to a loss of information and that, for the hypothesis $H_{0,j}$, the minimal selection event $\{j \in \hat M\}$ (i.e., the minimal conditioning and thus the maximal event set) is the most desirable in the sense of statistical power.
3 An overview of multiscale bootstrap
First, we describe the framework of the problem of regions in Section 3.1, followed by the basic idea of multiscale bootstrap for non-selective inference (Shimodaira, 2002, 2004, 2008) in Section 3.2. Then we briefly introduce the general selective inference framework proposed by Terada and Shimodaira (2017) in Section 3.3.
3.1 The problem of regions
The general statistical inference framework, in which the hypothesis is represented by a general region in some parameter space, is called the problem of regions (Efron and Tibshirani, 1998). This framework is an abstraction of many applications, e.g., phylogenetic inference, in which a confidence level is assigned for each clade of the estimated phylogenetic tree (Felsenstein, 1985; Efron, Halloran and Holmes, 1996).
Let $\mathcal{X}_n$ be data with sample size $n$. In the problem of regions, it is assumed that there exists a transform $f_n$ of $\mathcal{X}_n$ such that the transformed data $Y = f_n(\mathcal{X}_n)$ follows the $(m+1)$-dimensional Gaussian distribution with unknown mean parameter $\mu$ and identity covariance $I_{m+1}$:
$$Y \sim N_{m+1}(\mu, I_{m+1}).$$
Typically, $f_n$ involves multiplying a form of sample average by the factor $\sqrt{n}$ so that the covariance matrix of $Y$ is properly rescaled. Here, the $(m+1)$-dimensional space of $Y$ will be referred to as the model space in this paper. Figure 1 shows the image of the model space. In addition, let $y$ be an observed value of $Y$, and suppose that a bootstrap sample with sample size $n'$ is represented as a realization of the following Gaussian distribution in the model space:
$$Y^* \mid y \sim N_{m+1}(y, \sigma^2 I_{m+1}), \qquad \sigma^2 = n/n'.$$
We will denote by $P_{\sigma^2}$ the probability measure of the bootstrap sample with scale $\sigma^2$. This framework is a simplification of reality and is justified by the central limit theorem in many situations.
Let $H \subset \mathbb{R}^{m+1}$ be a general region and let us consider $\mu \in H$ as a null hypothesis. It is assumed that the region $H$ is locally represented in the model space as $H = \{(u, v) : v \le -h(u),\ u \in \mathbb{R}^m\}$ using some continuous function $h$. Let $\partial H$ be the boundary surface of the region $H$. In this setting, our main goal is to compute an approximately unbiased $p$-value $p(y)$ for the null hypothesis $\mu \in H$ against the alternative hypothesis $\mu \notin H$. The approximately unbiased $p$-value should satisfy
$$P(p(Y) < \alpha \mid \mu) \approx \alpha, \qquad \forall \mu \in \partial H, \tag{2}$$
for a given significance level $\alpha$. In other words, the $p$-value is approximately distributed as uniform over $(0,1)$, i.e., $p(Y) \sim U(0,1)$ when $\mu \in \partial H$. The difference between $P(p(Y) < \alpha \mid \mu)$ and $\alpha$ in (2) is called bias (or error). The bootstrap probability
$$BP(H \mid y) = P_{1}(Y^* \in H \mid y)$$
is considered the simplest $p$-value satisfying (2); see Efron, Halloran and Holmes (1996); Efron and Tibshirani (1998). More formally, in the classical large sample theory, if the region $H$ has a smooth boundary surface, the bootstrap probability has first-order accuracy: $P(BP(H \mid Y) < \alpha \mid \mu) = \alpha + O(n^{-1/2})$ for $\mu \in \partial H$. However, in many practical situations, the bootstrap probability often has a non-negligible bias.
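As a minimal sketch of the bootstrap probability in the model space, the following Python snippet estimates $P_{\sigma^2}(Y^* \in H \mid y)$ by Monte Carlo. It assumes that the region is available only through a membership function; the names `bootstrap_probability` and `in_half_space`, the flat example region, and the number of replicates are illustrative assumptions and are not taken from the paper.

```python
import random

def bootstrap_probability(y, in_region, n_boot=2000, scale=1.0, seed=0):
    """Monte Carlo estimate of P(Y* in H | y) with Y* ~ N(y, scale * I).

    `in_region` is any function that reports whether a point lies in H;
    the shape of H itself never needs to be known.
    """
    rng = random.Random(seed)
    sd = scale ** 0.5
    hits = 0
    for _ in range(n_boot):
        y_star = [yi + sd * rng.gauss(0.0, 1.0) for yi in y]
        hits += bool(in_region(y_star))
    return hits / n_boot

# Hypothetical flat region H = {(u, v) : v <= 0}; v is the last coordinate.
def in_half_space(point):
    return point[-1] <= 0.0

# Observed point at signed distance 1 from H; scale 1 is the classical bootstrap.
bp = bootstrap_probability([0.3, -0.2, 1.0], in_half_space)
print(f"estimated bootstrap probability: {bp:.3f}")
```

For this flat region the exact value is $\bar\Phi(1) \approx 0.159$, so the estimate can be checked directly; for curved or non-smooth regions the same function applies unchanged, which is the point of working with membership functions only.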
3.2 Basic idea of multiscale bootstrap
To obtain more accurate $p$-values, geometric quantities in the model space, such as distance and curvature, play a key role. In fact, Efron and Tibshirani (1998) shows that we can compute a more accurate $p$-value using the signed distance $t$ from the data point $y$ to the region $H$. More precisely, the proposed $p$-value is built from this signed distance and the projection point of $y$ onto $\partial H$. This $p$-value has third-order accuracy (Efron, 1985; Efron and Tibshirani, 1998).
However, in most practical situations, it is difficult to access the model space and to obtain the explicit formula of the hypothesis region in the model space. Thus, we cannot compute the signed distance in general. To overcome this difficulty, Shimodaira (2002, 2004, 2008) proposes a new bootstrap method, called multiscale bootstrap. In multiscale bootstrap, geometric quantities such as the signed distance and the mean curvature of $\partial H$ are estimated based on the scaling law of the bootstrap probabilities, and an accurate $p$-value is computed based on these estimated quantities.
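We consider the bootstrap probability with scale $\sigma^2$,
$$BP_{\sigma^2}(H \mid y) = P_{\sigma^2}(Y^* \in H \mid y),$$
which reduces to $BP(H \mid y)$ when $\sigma^2 = 1$. When we change $n'$ in the data space, and in effect change $\sigma^2$ in the model space, the bootstrap probability changes. This change is simply expressed in terms of the normalized bootstrap $z$-value defined as
$$\psi_{\sigma^2}(H \mid y) = \sigma\, \bar\Phi^{-1}\bigl(BP_{\sigma^2}(H \mid y)\bigr),$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\bar\Phi(x) = 1 - \Phi(x)$ is the corresponding upper tail probability. Shimodaira (2002, 2004) shows the following scaling law of the bootstrap probabilities:
$$\psi_{\sigma^2}(H \mid y) \approx t + \hat\gamma\, \sigma^2,$$
where $t$ is the signed distance from $y$ to $H$ and $\hat\gamma$ is the mean curvature of the boundary surface $\partial H$ at the projection point of $y$ onto $\partial H$. This scaling law can be modelled as a simple linear regression with $\sigma^2$ as the predictor. We will denote by $\varphi_H(\sigma^2 \mid \beta)$ the model for the normalized bootstrap $z$-value, such as $\varphi_H(\sigma^2 \mid \beta) = \beta_0 + \beta_1 \sigma^2$ with parameter $\beta = (\beta_0, \beta_1)$. The bootstrap probabilities at several values of the scale can be computed by using bootstrap samples with different sample sizes, say $n'_1, \dots, n'_K$, which correspond to the scales $\sigma_k^2 = n / n'_k$. Let $B$ be the number of bootstrap replicates, and let $C_k$ be the frequency of $Y^* \in H$ at scale $\sigma_k^2$. Let $\hat\psi_{\sigma_k^2} = \sigma_k \bar\Phi^{-1}(C_k / B)$ be the estimated normalized bootstrap $z$-value obtained from the estimated bootstrap probability $C_k / B$. We can estimate the values of $\beta_0$ and $\beta_1$ by a simple regression on the observed $\hat\psi_{\sigma_k^2}$.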
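Shimodaira (2002) proposes the following $p$-value, obtained by extrapolating the fitted model to $\sigma^2 = -1$:
$$p_{AU}(H \mid y) = \bar\Phi\bigl(\varphi_H(-1 \mid \hat\beta)\bigr).$$
This $p$-value has second-order accuracy (Shimodaira, 2004; Efron and Tibshirani, 1998):
$$P(p_{AU}(H \mid Y) < \alpha \mid \mu) = \alpha + O(n^{-1}), \qquad \forall \mu \in \partial H.$$
It becomes third-order accurate, erring only by $O(n^{-3/2})$, when the extrapolated value $\varphi_H(-1 \mid \beta)$ is properly estimated from the observed normalized bootstrap $z$-values with the relevant higher-order terms included.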
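The regression step and the extrapolation to $\sigma^2 = -1$ can be sketched as follows. This is an illustration under stated assumptions: the linear model $\beta_0 + \beta_1 \sigma^2$ is fitted by ordinary (unweighted) least squares, which is a simplification and not necessarily the estimation scheme used by the authors, and the helper names `upper_quantile`, `fit_scaling_law`, and `p_au` are hypothetical. The bootstrap probabilities in the example are the exact values for a flat region at signed distance 1, so the fit can be checked by hand.

```python
from statistics import NormalDist

_STD = NormalDist()

def upper_tail(x):
    """Upper tail probability of the standard normal distribution."""
    return 1.0 - _STD.cdf(x)

def upper_quantile(prob):
    """Inverse of upper_tail; the probability is clipped away from 0 and 1."""
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)
    return -_STD.inv_cdf(prob)

def fit_scaling_law(scales, bps):
    """Least-squares fit of psi = b0 + b1 * scale, with psi = sqrt(scale) * upper_quantile(BP)."""
    psi = [s ** 0.5 * upper_quantile(bp) for s, bp in zip(scales, bps)]
    k = len(scales)
    mean_s = sum(scales) / k
    mean_psi = sum(psi) / k
    sxx = sum((s - mean_s) ** 2 for s in scales)
    sxy = sum((s - mean_s) * (z - mean_psi) for s, z in zip(scales, psi))
    b1 = sxy / sxx
    b0 = mean_psi - b1 * mean_s
    return b0, b1

def p_au(b0, b1):
    """Approximately unbiased p-value: the fitted line evaluated at scale -1."""
    return upper_tail(b0 - b1)

# Exact bootstrap probabilities of a flat region at signed distance t = 1.
scales = [0.5, 0.75, 1.0, 1.5, 2.0]
t = 1.0
bps = [upper_tail(t / s ** 0.5) for s in scales]
b0, b1 = fit_scaling_law(scales, bps)
p_value = p_au(b0, b1)
print(f"b0={b0:.4f}, b1={b1:.4f}, p_AU={p_value:.4f}")
```

In the flat case the intercept recovers the signed distance and the slope (the curvature term) vanishes, so the extrapolated $p$-value agrees with the bootstrap probability at scale 1. With a curved region the slope is non-zero and the two differ.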
In the classical large sample theory, the shape of $H$ in the model space is magnified by $\sqrt{n}$, and thus the key property is that the smooth boundary surface approaches a flat surface in a neighborhood of any point on $\partial H$. In contrast, for non-smooth surfaces, this key property is not satisfied. For example, if the region $H$ is cone-shaped, the shape of $H$ is scale-invariant in a neighborhood of the vertex of $H$. It is well known that there is no unbiased test for a hypothesis region with a non-smooth boundary (Lehmann, 1952). To deal with general regions with non-smooth boundary surfaces, Shimodaira (2008) develops a new theoretical framework, called the asymptotic theory of nearly flat surfaces. In this framework, we require that the magnitude of the boundary surface is small. Thus, this framework works well even for non-smooth boundary surfaces when the magnitude of the boundary surface is not so large, at least locally. We can interpret the given surface as being on the way to approaching a flat surface in this framework. This idea is similar to the one behind the local alternative framework (Lehmann, 1999), but the rescaling is applied only in the direction normal to the boundary surface while the scale is fixed in the other directions. A brief introduction to this theory is provided in Appendix A.
3.3 General selective inference via multiscale bootstrap
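Here, we describe an extended framework of the problem of regions for selective inference. In the model space, two regions $H$ and $S$ are considered. Suppose that the selection event is represented as $y \in S$, and we consider the selective inference in which the null hypothesis $\mu \in H$ is selected if and only if $y \in S$. In this setting, for a given significance level $\alpha$, we want to compute selective $p$-values $p(y)$ satisfying
$$P(p(Y) < \alpha \mid Y \in S, \mu) \approx \alpha, \qquad \forall \mu \in \partial H.$$
In other words, $p(Y) \sim U(0,1)$ conditioned on $Y \in S$ when $\mu \in \partial H$. Terada and Shimodaira (2017) proposes the following approximately unbiased selective $p$-value for regions $H$ and $S$ with smooth boundary surfaces:
$$p_{SI}(H \mid S, y) = \frac{\bar\Phi\bigl(\varphi_H(-1 \mid \hat\beta_H)\bigr)}{\bar\Phi\bigl(\varphi_H(-1 \mid \hat\beta_H) + \varphi_S(0 \mid \hat\beta_S)\bigr)},$$
where $\varphi_S(\sigma^2 \mid \beta_S)$ is the model for the normalized bootstrap $z$-value related to the selective region $S$, and $\beta_S$ is the parameter of the model $\varphi_S$.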
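Theorem 1.
(Theorem 4.3 in Terada and Shimodaira (2017)) The boundary surfaces $\partial H$ and $\partial S$ are assumed to be sufficiently smooth and nearly parallel in the sense that the first derivatives of the functions representing them differ only slightly. Then, the selective $p$-value $p_{SI}(H \mid S, y)$ has second-order accuracy:
$$P(p_{SI}(H \mid S, Y) < \alpha \mid Y \in S, \mu) = \alpha + O(n^{-1}), \qquad \forall \mu \in \partial H.$$
The detailed calculation of $p_{SI}(H \mid S, y)$ is provided as Algorithm 1.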
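To make the role of the two extrapolated $z$-values concrete, the sketch below evaluates a selective $p$-value of the form $\bar\Phi(\varphi_H(-1)) / \bar\Phi(\varphi_H(-1) + \varphi_S(0))$, which is how we read the description in this section. The function name `p_si` and the flat, parallel example are illustrative assumptions. In that example the hypothesis region is $\{v \le 0\}$ and the selective region is $\{v \ge c\}$, so the exact selective $p$-value is the truncated-normal tail ratio $\bar\Phi(v)/\bar\Phi(c)$.

```python
from statistics import NormalDist

_STD = NormalDist()

def upper_tail(x):
    """Upper tail probability of the standard normal distribution."""
    return 1.0 - _STD.cdf(x)

def p_si(z_hypothesis, z_selection):
    """Selective p-value from two extrapolated normalized z-values.

    z_hypothesis: model for the hypothesis region evaluated at scale -1.
    z_selection:  model for the selective region evaluated at scale 0.
    """
    return upper_tail(z_hypothesis) / upper_tail(z_hypothesis + z_selection)

# Flat, parallel example: H = {v <= 0}, S = {v >= c}, observed v.
v, c = 2.0, 1.5
z_h = v        # signed distance from y to H (no curvature, so no scale dependence)
z_s = c - v    # normalized z-value of S, also scale-free for a flat boundary
selective_p = p_si(z_h, z_s)
naive_p = upper_tail(v)
print(f"naive p = {naive_p:.4f}, selective p = {selective_p:.4f}")
```

The selective $p$-value is larger than the naive one because the denominator removes the part of the null distribution that could never have been selected.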
In Terada and Shimodaira (2017), the selective $p$-value for regions with non-smooth boundary surfaces is also proposed, and the theoretical justification of this $p$-value is provided using the asymptotic theory of nearly flat surfaces. For more details about the case in which the regions $H$ and $S$ have possibly non-smooth boundary surfaces, see Appendix A. Since the boundary surface of the selection region is generally not smooth in selective inference after feature selection, the theory of nearly flat surfaces is used to derive the properties of the proposed method described in Section 4.3. Roughly speaking, both theories essentially assume that the boundary surfaces $\partial H$ and $\partial S$ are more or less flat and parallel to each other, at least locally around the data point. In analogy with the local alternative framework, we consider the given surfaces to be on the way to approaching mutually parallel flat surfaces.
4 Selective inference after feature selection via multiscale bootstrap
4.1 Model space for regression analysis
Here, we describe the setting of selective inference after feature selection in regression analysis. We employ the general assumption used in Berk et al. (2013), Lee et al. (2016), and Tibshirani et al. (2016). Consider the response variable $Y$ drawn from the multivariate Gaussian distribution:
$$Y \sim N_n(\mu, \sigma^2 I_n),$$
where $\mu \in \mathbb{R}^n$ is an unknown parameter, $I_n$ is the $n$-dimensional identity matrix, and $\sigma^2$ is assumed to be known. We will denote by $y$ the observed value of $Y$. Let $X \in \mathbb{R}^{n \times p}$ be a non-random full-rank matrix whose columns represent the features. Note that the error variance $\sigma^2$ can be estimated if $\mu$ is modeled as a function of the features $X$. Assuming that a specific feature selection method, such as Lasso or MCP, is applied to $(y, X)$, let $\hat M$ be the set of selected features, and let $\hat s_j$ be the sign of the estimated coefficient of the feature $j \in \hat M$.
First, note that the general selective inference approach described as Algorithm 1 cannot be directly applied to regression analysis. Since we need to change the sample size of bootstrap samples in the usual multiscale bootstrap, it is assumed that the hypothesis and selective regions can be represented as specific regions in the model space that are independent of $n$. For selective inference in regression analysis, however, the shape of the selective region inevitably depends on $n$ because $n$ is the dimensionality of the model space, as explained below.
We recall that it is assumed that $Y \sim N_n(\mu, \sigma^2 I_n)$ and that the selection event can be represented as a region of the space of $Y$. Then, the normalized space of
$$Y / \sigma \sim N_n(\mu / \sigma, I_n) \tag{3}$$
can be considered as the model space described in Section 3 with $m + 1 = n$. Thus, the selective region for multiscale bootstrap inevitably depends on $n$. Another choice of model space is given by
[TABLE]
The selective region represented in this model space also depends on $n$ because feature selection algorithms take the sample size into account. Although the latter model space is preferable for the asymptotic theory because it has the fixed dimensionality $p$, we use the former model (3) below for ease of illustration.
4.2 Appropriate selection event
Recently, Liu, Markovic and Tibshirani (2018) suggests the use of the selection event $\{j \in \hat M\}$ for a specified $j$, which increases the statistical power and thus leads to shorter confidence intervals. This is explained by the monotonicity of the selective error provided in Fithian, Sun and Taylor (2014), as mentioned in Section 2. The event $\{\hat M = M\}$ for a specified $M$ over-conditions and reduces the statistical power because the other features are not relevant for the null hypothesis of the feature $j$ in the full-model view.
Here, we actually consider testing feature $j$ with its sign. More precisely, whenever the feature $j$ is selected and $\hat s_j = +1$ (or $\hat s_j = -1$), the hypothesis $\beta_j \le 0$ (or $\beta_j \ge 0$) is tested. The minimal selection event is then $\{j \in \hat M, \hat s_j = s_j\}$ where $s_j \in \{+1, -1\}$. Hence, the main goal of our selective inference is to compute the unbiased selective $p$-value $p_j(y)$, which satisfies
$$P\bigl(p_j(Y) < \alpha \mid j \in \hat M, \hat s_j = s_j\bigr) = \alpha$$
for any $\mu$ with $\beta_j = 0$.
4.3 Computing selective $p$-values by multiscale bootstrap
In this section, we describe our proposed method. We develop a new algorithm to compute the approximately unbiased selective $p$-value, which approximately satisfies the unbiasedness condition of Section 4.2. We will update the computation of $\varphi_H$ and $\varphi_S$ in Algorithm 1 to obtain Algorithm 2.
For feature selection via Lasso, the selection event can be represented as a union of polyhedra in the $n$-dimensional space of the response variable (Lee et al., 2016). The left panel of Figure 2 shows the relationship between the model selected by Lasso and the corresponding region in the response vector space. In contrast, for more complicated feature selection methods such as MCP and SCAD, the region of the selective event becomes complicated, and the explicit shape of the selective region may not be obtained easily. The right panel of Figure 2 shows the relationship between the model selected by MCP and the corresponding region in the response vector space. We had to numerically evaluate which features are selected at each point since no explicit representation of the selection event is available. Although Lee et al. (2016) and Tibshirani et al. (2016) consider exact selective inference for Lasso, it is difficult to consider exact selective inference for these complicated feature selection methods.
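As shown in Figure 2, the selective region which represents the selection event can be complicated and generally has non-smooth boundary surfaces. In contrast, for $j \in \hat M$, the hypotheses, namely the two cases $\hat s_j = +1$ and $\hat s_j = -1$, can be represented as the following half-spaces, written with $e_j$ denoting the $j$-th standard basis vector of $\mathbb{R}^p$:
$$H_j^{+} = \{\mu : e_j^\top (X^\top X)^{-1} X^\top \mu \le 0\}, \qquad H_j^{-} = \{\mu : e_j^\top (X^\top X)^{-1} X^\top \mu \ge 0\}.$$
Since the hypothesis region has a flat boundary surface with mean curvature $0$, we can easily obtain the expression of $\varphi_H$ without multiscale bootstrap. In particular, the normalized bootstrap $z$-value does not depend on the scale:
$$\varphi_H(\sigma^2 \mid \beta_H) = t_j,$$
where $t_j$ is the signed distance from $y$ to the hypothesis region in the model space. Thus, once we obtain the expression of $\varphi_S$, we can compute the selective $p$-value.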
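Because the hypothesis region is a half-space, its signed distance can be written in closed form. The sketch below is our own derivation under the model space $y/\sigma$ of Section 4.1 and is not a formula quoted from the paper: with $\eta = X (X^\top X)^{-1} e_j$, the signed distance from $y/\sigma$ to the hyperplane $\{\eta^\top z = 0\}$ is $\hat s_j \hat\beta_j / (\sigma \sqrt{[(X^\top X)^{-1}]_{jj}})$, i.e., the usual $z$-statistic oriented by the selected sign. The function name `signed_distance` and the toy design matrix are illustrative assumptions.

```python
import numpy as np

def signed_distance(X, y, j, sign, sigma):
    """Signed distance from y / sigma to the flat hypothesis region of feature j.

    sign = +1 tests beta_j <= 0 and sign = -1 tests beta_j >= 0; the result is
    positive when the least-squares estimate lies outside the hypothesis region.
    Derived here for illustration, assuming a known error standard deviation sigma.
    """
    gram_inv = np.linalg.inv(X.T @ X)
    beta_hat = gram_inv @ X.T @ y          # full-model least-squares estimate
    return sign * beta_hat[j] / (sigma * np.sqrt(gram_inv[j, j]))

# Toy design with orthogonal columns so the numbers can be checked by hand.
X = 2.0 * np.eye(4)[:, :2]
y = np.array([3.0, -1.0, 0.5, 2.0])
t0 = signed_distance(X, y, j=0, sign=+1, sigma=1.0)   # beta_hat_0 = 1.5
t1 = signed_distance(X, y, j=1, sign=-1, sigma=1.0)   # beta_hat_1 = -0.5
print(t0, t1)
```

Since the boundary is flat, this single number plays the role of $\varphi_H$ at every scale, so no bootstrap replicate is needed for the hypothesis region.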
Next, we consider how to obtain the expression of $\varphi_S$ for the selective region $S$ by multiscale bootstrap; the bootstrap probability of $S$ is computed at several scales via an adaptation of Step 2 of Algorithm 1. In the usual setting of multiscale bootstrap, changing $n'$ in the data space corresponds to changing $\sigma^2$ in the model space. Thus, based on bootstrap probabilities with several $n'$, we can estimate the expression of $\varphi_S$. However, as described in Section 4.1, this framework is not applicable here. Fortunately, since we can access the model space, we directly change the scale in the model space and compute the bootstrap probabilities at several scales. With the normality of the response $Y$, the parametric bootstrap method, i.e., sampling directly from the Gaussian distribution centered at $y$ with rescaled variance, could be applied to the computation of bootstrap probabilities at several scales $\sigma_1^2, \dots, \sigma_K^2$. To relax the Gaussian assumption, here we consider the resampling of residuals with scale change. More formally, we resample the scaled residuals to compute the bootstrap probability of $S$ at several scales as follows. Let $\hat\beta$ be the least-squares estimator based on the full model. Write $\hat\mu = X\hat\beta$ and $e = y - \hat\mu$. The adjusted residuals are then defined from $e$. To compute the bootstrap probability at scale $\sigma_k^2$, we use the following bootstrap sample:
[TABLE]
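where $e^*$ is a bootstrap sample of size $n$ drawn from the adjusted residuals. For each scale $\sigma_k^2$, we generate $y^*$ $B$ times and apply the particular model selection procedure to each of them, computing the bootstrap probability of $S$ by counting the frequency of the selective event $\{j \in \hat M, \hat s_j = s_j\}$.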
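The resampling step can be sketched as follows. Several details are assumptions made only for illustration: the bootstrap response is centered at the observed $y$ (mirroring the Gaussian model-space description of Section 3), the adjusted residuals are taken to be centered residuals rescaled by $\sqrt{n/(n-p)}$, and `select_with_signs` is a hypothetical stand-in selector based on marginal correlations. Any selector, such as Lasso, MCP, or SCAD, could be substituted as long as it returns the selected features with their signs.

```python
import numpy as np

def select_with_signs(X, y, threshold):
    """Hypothetical stand-in selector: keep feature j when |X_j^T y| > threshold.

    Returns a dict mapping each selected feature index to the sign of X_j^T y.
    """
    score = X.T @ y
    keep = np.flatnonzero(np.abs(score) > threshold)
    return {int(j): int(np.sign(score[j])) for j in keep}

def selection_frequencies(X, y, j, sign, scales, n_boot, threshold, seed=0):
    """Frequency of the event {j selected with the given sign} at each scale."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Assumed adjustment: center and rescale to offset the loss of p degrees of freedom.
    adjusted = (resid - resid.mean()) * np.sqrt(n / (n - p))
    freqs = []
    for scale in scales:
        hits = 0
        for _ in range(n_boot):
            e_star = rng.choice(adjusted, size=n, replace=True)
            y_star = y + np.sqrt(scale) * e_star   # scale change applied to the residuals
            hits += select_with_signs(X, y_star, threshold).get(j) == sign
        freqs.append(hits / n_boot)
    return freqs

# Toy data (sizes and coefficients are arbitrary choices for the example).
rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] + rng.standard_normal(n)
scales = [0.5, 1.0, 2.0]
freqs = selection_frequencies(X, y, j=1, sign=+1, scales=scales, n_boot=200, threshold=10.0)
print(dict(zip(scales, freqs)))
```

The returned frequencies are the estimated bootstrap probabilities of the selective region at each scale, which then feed the regression on $\sigma^2$.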
With the updated computation of $\varphi_H$ and $\varphi_S$ for regression analysis, we may use Step 3 and Step 4 of Algorithm 1 to compute the selective $p$-value. This becomes Algorithm 2 for computing an approximately unbiased selective $p$-value for the selected feature $j \in \hat M$. It is worth noting that Algorithm 2 can be applied to almost any feature selection method, including MCP and SCAD, in addition to Lasso. Note also that the multiscale bootstrap is not very sensitive to the choice of the scales. For $B$, several thousand replications are enough in practice. The computational cost is of the same order as that of the classical bootstrap method. Also note that, in an actual implementation of Algorithm 2, Steps 1 to 3 are shared by all the features $j \in \hat M$. Thus, this algorithm works even for large $p$.
Now, we provide the theoretical justification of the proposed algorithm. Since the boundary surface of the selective region is generally non-smooth, as shown in Figure 2, we consider the asymptotic theory of nearly flat surfaces. In the model space, we take the coordinate system $(u, v)$ such that the hypothesis region can be written as $H = \{(u, v) : v \le 0\}$. Using this coordinate system, the selective region $S$ is represented, at least locally in a neighborhood of $y$, through a function $h$ of $u$, where $u \in \mathbb{R}^{n-1}$ and $h$ is a function from $\mathbb{R}^{n-1}$ to $\mathbb{R}$ which represents the boundary surface of the selective region. Here, the $L^1$-norm and $L^\infty$-norm of the function $h$ are defined as $\|h\|_1 = \int |h(u)|\, du$ and $\|h\|_\infty = \sup_u |h(u)|$, respectively. Let $\lambda$ be the magnitude of the boundary surface of the selective region. Even in regression analysis, we assume that the selection event can be written in this form and that the magnitude of the boundary surface is relatively small, at least around the data point in the model space. That is, we will consider the asymptotic theory in which $\lambda \to 0$. In the same way as in the local alternative framework, this assumption can be interpreted as saying that the given surface is on the way to approaching the flat surface which is parallel to $\partial H$. In this paper, this assumption is called near flatness of the boundary surface. In the asymptotic theory of nearly flat surfaces, the proposed $p$-value has second-order accuracy.
Theorem 2.
Let the selective region $S$ be represented through the function $h$ as above, and let $\tilde h$ be the Fourier transform of the function $h$. Suppose that the $L^1$-norms $\|h\|_1$ and $\|\tilde h\|_1$ are bounded. Let us assume that $\lambda$ is sufficiently small. Then, the selective $p$-value described in Algorithm 2 has second-order accuracy:
[TABLE]
Theorem 2 can be considered a special case of Theorem 5.3 in Terada and Shimodaira (2017). In the current situation, we can directly obtain the signed distance from $y$ to the boundary surface $\partial H$. Thus, the proof of Theorem 2 is much simpler than that of Theorem 5.3 in Terada and Shimodaira (2017). The proof is given in Appendix B. This theorem provides a theoretical justification for the proposed $p$-value when the magnitude of the boundary surface is small, at least locally around the data point. The assumption requires that the boundary surface of the selection event is nearly flat and that its limiting hyperplane is parallel to the surface of the hypothesis region, at least locally around the data point. The numerical simulations in the next section suggest that this assumption is reasonably satisfied in practice. Of course, when the two surfaces $\partial H$ and $\partial S$ are far from parallel to each other, the proposed $p$-value has a non-negligible bias. Solving this problem is important future work.
Remark 3.
In contrast with the p-value in Algorithm 2, we also propose a simple selective p-value based on the classical bootstrap probability . We replace in Step 5 by to define
[TABLE]
It is also computed with . If the boundary surface of is flat, and thus . Under the assumption of Theorem 2, we can obtain
[TABLE]
indicating that has a larger bias than . This result can be proved in much the same way as TheoremĀ 2.
5 Numerical experiments
Here, we show some numerical experiments to demonstrate the usefulness of our method. For Lasso, the exact unbiased selective test conditioned on and can be constructed (Lee et al., 2016; Liu, Markovic and Tibshirani, 2018). First, we show that our selective p-value with multiscale bootstrap (i.e., ) approximates the exact selective p-value for Lasso. Here, Lasso is defined as . Set and ; for the features . The elements of the input matrix were independently generated from the standard normal distribution . Then, the response was generated as , where was generated from the -dimensional standard normal distribution . In our algorithm, we used as the scales and as the number of bootstrap replicates. We note that the choice of scales has only a small effect on the stability of the result. An experiment on this point can be found in Appendix C. We simulated independent datasets. We set the significance level . To compute false positive rates accurately, the variables with zero coefficients need to be selected several hundred times. Here, we used Lasso with the penalty parameter as the feature selection method. Each variable with zero coefficient was selected approximately times out of datasets in this experiment. In each dataset, we performed the classical t-test, the selective (one-sided) test conditioned on and for (Lee et al., 2016; only for Lasso), the selective (two-sided) test conditioned on for (Liu, Markovic and Tibshirani, 2018; only for Lasso), and our approximately unbiased (one-sided) test with multiscale bootstrap for each selected feature. We count how many times, say , the feature is selected. For each test, we also count how many times (say ) the null hypothesis is rejected, and the selective rejection probability is estimated by .
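The counting step at the end of this paragraph can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the array layout (one row per simulated dataset, one column per feature), and the rejection rule "p-value below the significance level" are assumptions made for the example.

```python
import numpy as np

def selective_rejection_probability(selected, p_values, alpha):
    """Estimate, per feature, (#rejections among selections) / (#selections).

    selected : (n_datasets, n_features) boolean array, True where the feature
               was selected in that dataset (hypothetical layout).
    p_values : (n_datasets, n_features) array of p-values; entries for
               unselected features are ignored.
    alpha    : significance level.
    """
    selected = np.asarray(selected, dtype=bool)
    p_values = np.asarray(p_values, dtype=float)
    n_selected = selected.sum(axis=0)
    # A rejection only counts when the feature was actually selected.
    n_rejected = (selected & (p_values < alpha)).sum(axis=0)
    # Features that were never selected have no defined rate; report NaN.
    rate = np.full(selected.shape[1], np.nan)
    mask = n_selected > 0
    rate[mask] = n_rejected[mask] / n_selected[mask]
    return rate, n_selected, n_rejected
```

For a feature whose true coefficient is zero, an approximately unbiased selective test should give a rate close to the significance level.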
Panel (a) of Figure 3 shows the selective rejection probabilities of each feature for Lasso, where the four test methods are compared. In this plot, we can see that the selective rejection probabilities of our test with for the features to with are around . Thus, our multiscale bootstrap method approximately satisfies the unbiasedness in the sense of equation . We can also see that the classical inference does not provide a valid inference after feature selection; the classical t-test gives more false positives than expected from the specified level. Moreover, instead of the Lasso penalty , we also used the following MCP and SCAD penalties with the tuning parameter as the feature selection methods:
[TABLE]
For non-convex penalties, we face the issues of local minima and multiple global minima. In general, we assume that the selection event can be represented as a fixed set in the data space. Thus, both issues have unexpected effects in selective inference. To overcome these issues, we used fixed initial values for MCP and SCAD. (Another approach is to extend the data space to include the space of initial values.) In this experiment, we used the R package plus (Zhang and Melnik, 2012) for MCP and SCAD. The algorithm of this package generates a piecewise linear path of coefficients, starting with zero coefficients for an infinite penalty.
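For reference, the two penalties are commonly written in the closed forms below (the standard textbook forms of MCP and of the SCAD penalty of Fan and Li, 2001). This is a sketch under assumptions: the parameter names lam and gamma are chosen for the example, the value of the tuning parameter used in the experiments is not reproduced here, and the parametrization used by the plus package may differ.

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """Standard MCP penalty (assumes lam > 0 and gamma > 0).

    lam*|t| - t^2/(2*gamma)  if |t| <= gamma*lam,
    gamma*lam^2/2            otherwise.
    """
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2.0 * gamma),
                    0.5 * gamma * lam**2)

def scad_penalty(t, lam, gamma):
    """Standard SCAD penalty (assumes lam > 0 and gamma > 2).

    lam*|t|                                          if |t| <= lam,
    (2*gamma*lam*|t| - t^2 - lam^2)/(2*(gamma - 1))  if lam < |t| <= gamma*lam,
    (gamma + 1)*lam^2/2                              otherwise.
    """
    a = np.abs(np.asarray(t, dtype=float))
    linear = lam * a
    quadratic = (2.0 * gamma * lam * a - a**2 - lam**2) / (2.0 * (gamma - 1.0))
    constant = 0.5 * (gamma + 1.0) * lam**2
    return np.where(a <= lam, linear,
                    np.where(a <= gamma * lam, quadratic, constant))
```

Both penalties coincide with the Lasso penalty near zero and become constant for large coefficients, which is what makes the resulting optimization problem non-convex.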
Our multiscale bootstrap method works well not only for Lasso but also for more complicated feature selection methods such as MCP and SCAD. Panels (b) and (c) of Figure 3 show the selective rejection probabilities in the cases of MCP and SCAD, respectively. We note that the existing selective inference methods (Green, Blue) cannot be applied to MCP and SCAD. Each feature with zero coefficient was selected approximately times by MCP and SCAD in this experiment. In this setting, although no exact unbiased selective inference has been proposed, the selective rejection probabilities of our test for the features to are around . In general, we can get a more accurate result as the sample size increases. In fact, panel (d) of Figure 3 shows the result of the same experiment with MCP and a larger sample size . Compared with the case of in panel (b), we can see a more accurate result (i.e., less variation) in the case of .
Moreover, we also computed the simpler selective p-value based on the classical bootstrap in both settings with MCP and SCAD. For MCP with , the selective rejection probabilities of and under the null hypotheses are and , respectively. For SCAD with , the selective rejection probabilities of and under the null hypotheses are and , respectively. These results show that the multiscale bootstrap method is more accurate than the classical bootstrap method, in accordance with theory. In the setting with , the selective rejection probabilities of and under the null hypotheses are and , respectively. As the sample size increases, both and provide almost unbiased results. This may be because the boundary surface of the selection event is almost flat, at least locally around the data point, when the sample size is larger.
In addition, set and . We compare the true positive rates (TPRs; i.e., statistical powers) and the false positive rates (FPRs; i.e., type-I errors) of these tests with changes of in the case of Lasso. Here, TPR is defined as the proportion of selected non-zero features that are correctly identified, and FPR is defined as the proportion of selected zero features that are incorrectly detected. Figure 4 shows that both the proposed method via multiscale bootstrap and the exact selective test conditioned on (Liu, Markovic and Tibshirani, 2018) not only have desirably high TPRs but also control FPRs at the significance level . The non-selective t-test (the black line) attains the highest TPR but does not control the FPR, and thus it is not valid. Focusing on the TPR of the over-conditioning selective test (Lee et al., 2016), we can see that the unnecessarily restrictive selection event leads to lower statistical power.
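The two rates defined in this paragraph can be computed from simulation output as in the following sketch. The function name and the boolean-array layout (datasets in rows, features in columns) are assumptions made for the example, not part of the original experiment code.

```python
import numpy as np

def selective_tpr_fpr(selected, rejected, nonzero):
    """TPR and FPR among selected features, pooled over all datasets.

    selected : (n_datasets, n_features) boolean array of selection indicators.
    rejected : (n_datasets, n_features) boolean array, True where the null
               hypothesis for that feature was rejected.
    nonzero  : (n_features,) boolean array, True where the true coefficient
               is non-zero.
    """
    selected = np.asarray(selected, dtype=bool)
    rejected = np.asarray(rejected, dtype=bool)
    nonzero = np.asarray(nonzero, dtype=bool)
    sel_nonzero = selected & nonzero    # selected features with a true signal
    sel_zero = selected & ~nonzero      # selected features with no signal
    tpr = ((rejected & sel_nonzero).sum() / sel_nonzero.sum()
           if sel_nonzero.any() else float("nan"))
    fpr = ((rejected & sel_zero).sum() / sel_zero.sum()
           if sel_zero.any() else float("nan"))
    return float(tpr), float(fpr)
```

Rejections recorded for unselected features are ignored, since both rates are conditional on selection.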
Next, we deal with the prostate cancer data (Stamey et al., 1989), which is available in the R package ElemStatLearn (Halvorsen, 2015). Stamey et al. (1989) studied the relation between the level of prostate-specific antigen (PSA) and clinical measures: the log cancer volume (lcavol), the log prostate weight (lweight), and so on. Here, we fit a linear regression model for the log of PSA (lpsa) with the clinical measures as features. In this application, we preprocessed the data so that each feature has mean zero and unit variance.
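The standardization step can be sketched as below. This is only an illustration: the function name is hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption, since the text does not say which variance estimator was used.

```python
import numpy as np

def standardize_columns(X, ddof=1):
    """Center each column to mean zero and scale it to unit variance.

    ddof=1 (sample variance) is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=ddof)
    if np.any(std == 0):
        raise ValueError("constant column cannot be scaled to unit variance")
    return (X - mean) / std
```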
The main purpose is to provide the selective confidence intervals (CIs) for the coefficients of the features selected by Lasso with the penalty . Here, we also set . We computed four types of confidence intervals with confidence level , as shown in Figure 5: the non-selective CI using the t-distribution, the selective CI conditioned on (Lee et al., 2016), the selective CI conditioned on (Liu, Markovic and Tibshirani, 2018), and the approximate selective CI based on our approximately unbiased p-values computed via the multiscale bootstrap. We note that the first three CIs satisfy the following equations, respectively, for , , , :
[TABLE]
From the plot, we can see that our selective CIs approximate the exact selective CIs very well. Moreover, the over-conditioning of the selection event made CIs wider than , and this indicates that the less restrictive selection event is preferable.
6 Discussion
A new multiscale bootstrap method (Algorithm 2) is proposed to compute approximately unbiased selective p-values and confidence intervals for regression coefficients after feature selection. The new method is useful in particular for complicated feature selection algorithms such as MCP and SCAD, while existing methods are only available for simpler feature selection algorithms such as Lasso. The new method also computes shorter confidence intervals than most existing methods by minimally conditioning on each selected feature instead of over-conditioning on all selected features.
The proposed method is closely related to exact selective inference such as Lee et al. (2016) and Liu, Markovic and Tibshirani (2018). Here, in addition to the Gaussian assumption, we assume that the boundary surface of the hypothesis region is flat. Let us consider the line passing through the point and perpendicular to the boundary of . By setting the projection point of as the origin, we can consider the one-dimensional coordinate system on the line. If we know the distance from to as well as the intervals representing the intersection of the line and the selective region , we can perform the exact selective inference. For example, in Figure 6, the distance is and the interval is . As with the polyhedral lemma, the following p-value provides the exact selective inference:
[TABLE]
The explicit forms of the intervals can be obtained for the Lasso case, but this may not be possible for more complicated cases. In the proposed method, we estimate the geometric quantities and indirectly via the multiscale bootstrap. As an alternative to this approach, we can use a grid search on the one-dimensional coordinate system to obtain the intervals. In association with this approach, Duy and Takeuchi (2021) propose a parametric programming-based method that can perform the exact selective inference for the Lasso very efficiently without conditioning on signs.
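The interval-based computation described above can be sketched as a truncated-normal tail probability. This is a generic illustration rather than the paper's exact formula: it assumes that the one-dimensional coordinate has been scaled to unit variance, that the null distribution on the line is standard normal with the origin on the boundary of the hypothesis region, and that larger coordinates are stronger evidence against the null. The function name and argument layout are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)

def truncated_normal_p_value(z_obs, intervals):
    """P(Z >= z_obs | Z in union of intervals) for Z ~ N(0, 1).

    z_obs     : observed coordinate on the line (signed distance).
    intervals : list of disjoint (low, high) pairs describing where the line
                intersects the selective region; infinite endpoints are
                allowed via float("inf").
    """
    total = 0.0   # probability mass of the selective region on the line
    upper = 0.0   # part of that mass lying at or beyond z_obs
    for low, high in intervals:
        if not low < high:
            raise ValueError("each interval must satisfy low < high")
        total += _STD_NORMAL.cdf(high) - _STD_NORMAL.cdf(low)
        tail_start = max(low, z_obs)
        if tail_start < high:
            upper += _STD_NORMAL.cdf(high) - _STD_NORMAL.cdf(tail_start)
    if total <= 0.0:
        raise ValueError("selective region has zero probability on the line")
    return upper / total
```

With the single interval from minus infinity to infinity, the function reduces to the ordinary one-sided normal p-value. Differences of normal CDFs lose precision far in the tails, so a production implementation would need more careful numerics.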
Our method is applicable to the case of , provided that is the target of inference, where denotes the pseudo-inverse matrix of . However, in this case, it is difficult to estimate the error variance tau or the residuals reasonably well. Thus, the selective inference in the case of is important future work. In theory, we may consider the application of the proposed framework for the submodel setting. However, the selection event is often too small, i.e., the bootstrap probability becomes very small for the selection event, and thus the proposed framework does not work well. Combining with the randomized response method by Tian and Taylor (2018), we may propose an appropriate selective inference based on the multiscale bootstrap method for the submodel setting.
In this paper, the multiscale bootstrap is used only for the selective region, because the hypothesis region with a flat boundary surface (i.e., ) is easily expressed in the model space. However, Algorithm 1 is valid even for a general hypothesis region with a curved surface. Therefore, we may extend our method for, say, non-linear regression or multiple comparisons of regression coefficients in future work.
Acknowledgments
The authors would like to thank an associate editor and two reviewers for valuable comments and suggestions that improved the quality of the paper considerably. This research was supported in part by JSPS KAKENHI Grant (JP16K16024, JP20K19756, and JP20H00601 to YT; JP16H02789, JP20H04148, and JP20H04243 to HS) and MEXT Project for Seismology toward Research Innovation with Data of Earthquake (STAR-E) Grant Number JPJ010217 (to YT).
Appendix A Asymptotic theory for nearly flat surfaces
In Section 3, we only describe the multiscale bootstrap method for the hypothesis and selective regions with smooth boundaries. In the classical large sample theory, an important point is that the smooth boundary surface of the hypothesis region approaches a flat surface in a neighborhood of any point on its boundary surface. However, this claim cannot be true for regions with non-smooth boundaries, since cone-shaped regions are scale-invariant in the neighborhood of the vertex. In many practical situations, the hypothesis and selective regions could have non-smooth surfaces. Thus, Shimodaira (2008) develops a new theoretical framework, called the asymptotic theory of nearly flat surfaces. In this theory, we consider the situation that the magnitude of boundary surfaces, say , becomes small, that is, any boundary surfaces approach flat surfaces at least locally in a neighborhood. The artificial parameter is introduced, and we consider the situation of instead of . More precisely, suppose that the -norms of function and its Fourier transform , i.e., and , are bounded and that the -norm of has the same order as . Here, a function satisfying these properties is called nearly flat. Then, we consider the asymptotic theory as . Note that in this theory corresponds to in the classical large sample theory. Here, we assume that the hypothesis and selective regions are defined as follows, respectively:
[TABLE]
where and are nearly flat functions, and .
Even in this theory, the bootstrap probability also has the first-order accuracy:
[TABLE]
Write . Then, the distribution of the bootstrap sample with scale is given as
[TABLE]
Let denote the expectation operator related to , that is,
[TABLE]
where is the expectation related to and is the inverse Fourier transform operator. For the normalized bootstrap z-value, we have the following scaling law, which parallels that of the large sample theory:
[TABLE]
We note that, for , it follows that . Hence, at least formally, the expected value with a negative variance is defined as
[TABLE]
Note that may not be well-defined in general. For a detailed discussion about , we refer the reader to Shimodaira (2008). If can be defined, the p-value has second-order accuracy for the non-selective test (Shimodaira, 2008):
[TABLE]
As with the classical large sample theory, if also exists, it can be shown that has the third-order accuracy with bias only .
For the smooth , it follows that . That is, letting and , can be modeled as . Thus, using a polynomial regression with predictor , we can compute the p-value by formally letting . In contrast, for a cone-shaped region , it is shown that . Since we have as , focusing on the first two terms, we obtain
[TABLE]
In this model, we cannot take , and does not exist for a cone-shaped region . This observation is related to the important fact proved by Lehmann (1952) that an unbiased test cannot exist for a cone-shaped hypothesis region. Set and , and then the normalized bootstrap z-value can be approximated by the model ; note the predictor is instead of . Here, we also denote by the model which approximates the normalized bootstrap z-value . For fixed , let be the truncated Taylor expansion of with the first terms at :
[TABLE]
We can always use the above formula for extrapolating to . Therefore, Algorithm 1 is updated to Algorithm 3. In practice, Algorithm 3 with can be simply implemented as Algorithm 1 with the linear model and a narrow range of values around .
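To make the regression-and-extrapolation idea concrete, the following sketch fits the linear model to normalized bootstrap z-values and extrapolates it. It follows the usual formulation of the multiscale bootstrap for a non-selective test with a smooth boundary, which is an assumption here because the formulas are not reproduced in this text: the normalized z-value at scale sigma^2 is taken to be sigma times the inverse upper-tail normal quantile of the bootstrap probability, the model is beta0 + beta1 * sigma^2, and the p-value is the upper-tail normal probability of the model evaluated at sigma^2 = -1. The function names are hypothetical, and this is not the selective p-value of Algorithm 2.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def normalized_z_values(bootstrap_probs, sigma2):
    """sigma * (upper-tail quantile of BP) for each scale sigma^2.

    bootstrap_probs must lie strictly between 0 and 1, otherwise
    NormalDist.inv_cdf raises statistics.StatisticsError.
    """
    # Upper-tail quantile: Phi_bar^{-1}(a) = -Phi^{-1}(a).
    return np.array([-np.sqrt(s2) * _STD_NORMAL.inv_cdf(bp)
                     for bp, s2 in zip(bootstrap_probs, sigma2)])

def extrapolated_p_value(bootstrap_probs, sigma2):
    """Fit psi(sigma^2) = beta0 + beta1 * sigma^2 and extrapolate to -1."""
    sigma2 = np.asarray(sigma2, dtype=float)
    psi = normalized_z_values(bootstrap_probs, sigma2)
    beta1, beta0 = np.polyfit(sigma2, psi, deg=1)  # highest degree first
    z_extrapolated = beta0 - beta1                 # model value at sigma^2 = -1
    p_value = 1.0 - _STD_NORMAL.cdf(z_extrapolated)
    return p_value, beta0, beta1
```

For a region with a flat boundary at distance d from the data point, the bootstrap probability at scale sigma^2 is the upper-tail normal probability of d / sigma, so the fitted slope is zero and the extrapolated p-value equals the ordinary normal p-value at d.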
For fixed , we consider the following p-value:
[TABLE]
Under some regularity conditions, Shimodaira (2008) proves that
[TABLE]
at each . Moreover, for general selective inference with possibly non-smooth boundary surfaces, Terada and Shimodaira (2017) propose the following selective p-value:
[TABLE]
where . In addition, it is shown that the selective p-value has second-order accuracy:
[TABLE]
at each .
Appendix B Proof of Theorem 2
Proof.
First, we show that, for given , there exists a nearly flat function such that
[TABLE]
where and . By Lemma 5.1 in Terada and Shimodaira (2017) or equivalently Eq. (5.3) in Shimodaira (2008), we have
[TABLE]
for . Let us temporarily assume that is nearly flat. Then, we also have
[TABLE]
From Eq. (8), it follows that
[TABLE]
Since , we have , where . Thus, applying the inverse operator to both sides, we obtain . Since is nearly flat, should be nearly flat. Similarly, we can show that the above actually satisfies Eq. (8). Combining with , we obtain . By Taylor's theorem, we deduce that
[TABLE]
Now, we consider the rejection region based on , that is, . By Lemma 5.1 in Terada and Shimodaira (2017) or equivalently Eq. (5.3) in Shimodaira (2008), for , we have . Since , we have . Let us recall that for , and so . Thus, we obtain
[TABLE]
By substituting above, it follows from Eq. (9) that for . This finishes the proof. ∎
Appendix C Choice of the tuning parameters in multiscale bootstrap
The multiscale bootstrap is not very sensitive to the choice of the scales. To confirm this, we additionally performed experiments with two settings of scales, as shown in Figure 7. We choose ( and 10) scales from the interval between 0.1 and 2, equally spaced on the log scale. We also changed the number of replications () in the multiscale bootstrap. The other parameters are the same as in the experiment in Section 5. For a simulated dataset, we computed selective p-values under the various settings () of the multiscale bootstrap. Under each setting, we computed the selective p-value 10 times for two selected features, No. 10 and No. 11, whose true coefficients are zero.
In Figure 7, the red box plots correspond to the setting and the blue ones to the setting . The averages of the 10 p-values are almost the same among all settings, and thus the p-values are not sensitive to . On the other hand, the variance of the p-values decreases as increases. Several thousand replications are enough in practice.
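A grid of scales that is equally spaced on the log scale between 0.1 and 2 can be generated as in the sketch below. The function name is hypothetical, and whether the grid values are interpreted as the scale or the squared scale is left to the caller, since this text does not settle it.

```python
import numpy as np

def log_spaced_scales(n_scales, low=0.1, high=2.0):
    """Return n_scales values from low to high, equally spaced in log scale."""
    if n_scales < 2:
        raise ValueError("need at least two scales to fit a regression model")
    return np.geomspace(low, high, num=n_scales)

# Example: the 10-scale setting mentioned in the text.
scales = log_spaced_scales(10)
```

Consecutive values in such a grid have a constant ratio, so small scales are sampled as densely as large ones in relative terms.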
References
- Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Annals of Statistics 41 802–837.
- Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika 62 441–444.
- Duy, V. N. L. and Takeuchi, I. (2021). Parametric Programming Approach for More Powerful and General Lasso Selective Inference. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021) 901–909.
- Efron, B. (1985). Bootstrap Confidence Intervals for a Class of Parametric Problems. Biometrika 72 45–58.
- Efron, B., Halloran, E. and Holmes, S. (1996). Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences 93 13429–13434.
- Efron, B. and Tibshirani, R. (1998). The problem of regions. Annals of Statistics 26 1687–1718.
- Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360.
- Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39 783–791.
