Selective Inference via Marginal Screening for High Dimensional Classification

Yuta Umezu; Ichiro Takeuchi

arXiv:1906.11382·stat.ME·June 28, 2019


TL;DR

This paper develops a selective inference framework for high-dimensional binary classification using logistic regression after marginal screening, enabling valid hypothesis testing with controlled error rates.

Contribution

It introduces a novel selective inference method for logistic regression post-marginal screening in high dimensions, extending existing Gaussian linear model techniques.

Findings

01

The method asymptotically controls selective type I error.

02

Simulation studies confirm the statistical power of the proposed test.

03

The proposed method compares favorably with data splitting and other approaches.


Tables

Table 1: Method comparison using simulated data based on 1,000 Monte-Carlo runs. Each cell denotes an average, with standard deviations of (14) in parentheses.

			sample size
	$d$	method	50	100	200	500	1,000	1,500
Case 1	200	ASICs	.029 (.168)	.049 (.216)	.038 (.191)	.031 (.173)	.028 (.165)	.033 (.179)
		DS	.012 (.109)	.015 (.122)	.004 (.063)	.004 (.063)	.011 (.104)	.011 (.104)
		NT	.184 (.388)	.226 (.418)	.219 (.414)	.261 (.439)	.255 (.436)	.256 (.437)
	500	ASICs	.028 (.165)	.043 (.203)	.039 (.194)	.039 (.194)	.032 (.176)	.036 (.186)
		DS	.012 (.109)	.006 (.077)	.008 (.089)	.009 (.094)	.005 (.071)	.008 (.089)
		NT	.267 (.044)	.273 (.446)	.304 (.460)	.301 (.459)	.326 (.469)	.325 (.469)
	1,000	ASICs	.041 (.198)	.044 (.205)	.023 (.150)	.032 (.176)	.038 (.191)	.044 (.205)
		DS	.006 (.077)	.011 (.104)	.010 (.100)	.009 (.094)	.013 (.113)	.010 (.100)
		NT	.294 (.456)	.345 (.476)	.390 (.488)	.402 (.491)	.411 (.492)	.405 (.491)
Case 2	200	ASICs	.038 (.191)	.038 (.191)	.040 (.196)	.032 (.176)	.028 (.165)	.031 (.173)
		DS	.012 (.109)	.007 (.083)	.012 (.109)	.010 (.100)	.012 (.109)	.004 (.063)
		NT	.177 (.382)	.207 (.405)	.234 (.424)	.211 (.408)	.219 (.414)	.210 (.408)
	500	ASICs	.049 (.216)	.038 (.191)	.030 (.171)	.030 (.171)	.039 (.194)	.034 (.181)
		DS	.007 (.083)	.006 (.077)	.010 (.100)	.009 (.094)	.007 (.083)	.007 (.083)
		NT	.247 (.431)	.269 (.443)	.291 (.454)	.295 (.456)	.309 (.462)	.318 (.466)
	1,000	ASICs	.049 (.216)	.047 (.212)	.031 (.173)	.034 (.181)	.024 (.153)	.046 (.210)
		DS	.009 (.094)	.008 (.089)	.013 (.113)	.006 (.077)	.006 (.077)	.010 (.100)
		NT	.290 (.454)	.350 (.477)	.375 (.484)	.396 (.489)	.407 (.492)	.414 (.493)

Equations

$$\bm{y}=X\bm{\beta}^{*}+\bm{\varepsilon},$$

$$\text{H}_{0,j}:\beta_{S,j}^{*}=0\quad\text{vs.}\quad\text{H}_{1,j}:\beta_{S,j}^{*}\neq 0.$$

$$\{\bm{y};A\bm{y}\leq\bm{b}\}=\{\bm{y};L(\bm{z})\leq\bm{\eta}^{\top}\bm{y}\leq U(\bm{z}),\;N(\bm{z})\geq 0\},$$

$$L(\bm{z})=\max_{j:(A\bm{c})_{j}<0}\frac{b_{j}-(A\bm{z})_{j}}{(A\bm{c})_{j}},\qquad U(\bm{z})=\min_{j:(A\bm{c})_{j}>0}\frac{b_{j}-(A\bm{z})_{j}}{(A\bm{c})_{j}}$$

$$F_{\mu,\sigma^{2}}^{[L,U]}(x)=\frac{\Phi((x-\mu)/\sigma)-\Phi((L-\mu)/\sigma)}{\Phi((U-\mu)/\sigma)-\Phi((L-\mu)/\sigma)},$$

$$\bigl[F_{\bm{\eta}^{\top}\bm{\mu},\,\bm{\eta}^{\top}\Sigma\bm{\eta}}^{[L(\bm{z}),U(\bm{z})]}(\bm{\eta}^{\top}\bm{y})\mid A\bm{y}\leq\bm{b}\bigr]\sim{\rm Unif}(0,1),$$

$$P_{j}=1-F_{0,\,\bm{\eta}^{\top}\Sigma\bm{\eta}}^{[L(\bm{z}_{0}),U(\bm{z}_{0})]}(\bm{\eta}^{\top}\bm{y}),$$

$$\tilde{P}_{j}=2\min\{P_{j},1-P_{j}\},$$

$${\rm P}(\text{H}_{0,j}\text{ is falsely rejected}\mid\hat{S}=S)={\rm P}(\tilde{P}_{j}\leq\alpha\mid\hat{S}=S)$$

$${\rm P}(P_{j}(\hat{S})\leq\alpha\ \text{for any}\ \hat{S}\subseteq\{1,\ldots,d\})\leq\alpha.$$

$${\rm P}(P_{j}(\hat{S})\leq\alpha\ \text{for any}\ \hat{S}\subseteq\{1,\ldots,d\})=\sum_{S\subseteq\{1,\ldots,d\}}{\rm P}(P_{j}(S)\leq\alpha\mid\hat{S}=S)\,{\rm P}(\hat{S}=S).$$

$$\hat{S}=\{j;\;|z_{j}|\text{ is among the first }K\text{ largest of all}\}.$$

$$|z_{j}|\geq|z_{k}|,\quad\forall(j,k)\in S\times S^{\bot},$$

$$-s_{j}z_{j}\leq z_{k}\leq s_{j}z_{j},\quad s_{j}z_{j}\geq 0,\quad\forall(j,k)\in S\times S^{\bot}.$$

$$\ell_{n}(\bm{\beta}_{S})=\sum_{i=1}^{n}\{y_{i}\bm{x}_{S,i}^{\top}\bm{\beta}_{S}-\psi(\bm{x}_{S,i}^{\top}\bm{\beta}_{S})\},$$

$$\hat{\bm{\beta}}_{S}=\mathop{\rm arg\,max}_{\bm{\beta}_{S}\in{\cal B}}\ell_{n}(\bm{\beta}_{S}),$$

$$\psi^{\prime}(\bm{x}_{S,i}^{\top}\bm{\beta}_{S}^{*})=\psi^{\prime}(\bm{x}_{S^{*},i}^{\top}\bm{\beta}_{S^{*}}^{*}),\qquad\psi^{\prime\prime}(\bm{x}_{S,i}^{\top}\bm{\beta}_{S}^{*})=\psi^{\prime\prime}(\bm{x}_{S^{*},i}^{\top}\bm{\beta}_{S^{*}}^{*}),$$

$${\rm P}(y_{i}=1)={\rm E}[y_{i}]=\psi^{\prime}(\bm{x}_{S^{*},i}^{\top}\bm{\beta}_{S^{*}}^{*}),\qquad{\rm V}[y_{i}]=\psi^{\prime\prime}(\bm{x}_{S^{*},i}^{\top}\bm{\beta}_{S^{*}}^{*}).$$

$$\Xi_{S,n}=\frac{1}{n}X_{S}^{\top}X_{S}=\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{S,i}\bm{x}_{S,i}^{\top}\in\mathbb{R}^{K\times K},$$

$$0<C_{1}<\lambda_{\min}(\Xi_{S,n})\leq\lambda_{\max}(\Xi_{S,n})<C_{2}<\infty,$$

$${\cal B}=\Bigl\{\bm{\beta}_{S}\in\mathbb{R}^{K};\;\max_{i}|\bm{x}_{S,i}^{\top}\bm{\beta}_{S}|<\tilde{\xi}\Bigr\}$$

$$\Bigl\|\frac{1}{\sqrt{n}}X_{S^{\bot}}^{\top}X_{S}\Bigr\|={\rm O}(K).$$

$$\bm{s}_{n}(\bm{\beta}_{S})$$

$$\Sigma_{n}(\bm{\beta}_{S})$$

$$\|\hat{\bm{\beta}}_{S}-\bm{\beta}_{S}^{*}\|={\rm O}_{p}(\sqrt{K/n}).$$

$$\bm{0}=\ell_{n}^{\prime}(\hat{\bm{\beta}}_{S})\approx\sqrt{n}\,\bm{s}_{n}-n\Sigma_{n}(\hat{\bm{\beta}}_{S}-\bm{\beta}_{S}^{*}),$$

$$\sqrt{n}(\hat{\bm{\beta}}_{S}-\bm{\beta}_{S}^{*})\approx\Sigma_{n}^{-1}\bm{s}_{n}.$$

$${\rm E}[\bm{s}_{n}]=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{x}_{S,i}({\rm E}[y_{i}]-\psi^{\prime}(\bm{x}_{S,i}^{\top}\bm{\beta}_{S}^{*}))=\bm{0}.$$

$${\rm V}[\bm{s}_{n}]=\frac{1}{n}\sum_{i=1}^{n}{\rm V}[y_{i}]\bm{x}_{S,i}\bm{x}_{S,i}^{\top}=\Sigma_{n}.$$

$$\sqrt{n}\,\sigma_{n}^{-1}\bm{\eta}^{\top}(\hat{\bm{\beta}}_{S}-\bm{\beta}_{S}^{*})=\sigma_{n}^{-1}\bm{\eta}^{\top}\Sigma_{n}^{-1}\bm{s}_{n}+{\rm o}_{p}(1)\overset{d}{\to}{\rm N}(0,1),$$


Taxonomy

Topics: Statistical Methods and Inference · Advanced Statistical Methods and Models · Statistical Methods and Bayesian Inference

Full text


Yuta Umezu: Nagoya Institute of Technology, Aichi, Japan

Ichiro Takeuchi: Nagoya Institute of Technology, Aichi, Japan / RIKEN Center for Advanced Intelligence Project, Tokyo, Japan / Center for Materials Research by Information Integration, National Institute for Materials Science, Ibaraki, Japan


Abstract

Post-selection inference is a statistical technique for determining salient variables after model or variable selection. Recently, selective inference, a kind of post-selection inference framework, has garnered attention in the statistics and machine learning communities. By conditioning on a specific variable selection procedure, selective inference can properly control the so-called selective type I error, which is a type I error conditional on a variable selection procedure, without imposing excessive additional computational costs. While selective inference can provide a valid hypothesis testing procedure, the main focus has hitherto been on Gaussian linear regression models. In this paper, we develop a selective inference framework for binary classification problems. We consider a logistic regression model after variable selection based on marginal screening, and derive the high dimensional statistical behavior of the post-selection estimator. This enables us to asymptotically control the selective type I error for the purposes of hypothesis testing after variable selection. We conduct several simulation studies to confirm the statistical power of the test, and compare our proposed method with data splitting and other methods.

Keywords: High Dimensional Asymptotics · Hypothesis Testing · Logistic Regression · Post-Selection Inference · Marginal Screening

1 Introduction

Discovering statistically significant variables in high dimensional data is an important problem for many applications such as bioinformatics, materials informatics, and econometrics, to name a few. To achieve this, for example in a regression model, data analysts often attempt to reduce the dimensionality of the model by utilizing a particular model selection or variable selection method. For example, the Lasso (Tibshirani, 1996) and marginal screening (Fan and Lv, 2008) are frequently used in model selection contexts. In many applications, data analysts conduct statistical inference based on the selected model as if it were known a priori, but this practice has been referred to as “a quiet scandal in the statistical community” in Breiman (1992). If we select a model based on the available data, then we have to pay heed to the effect of model selection when we conduct statistical inference. This is because the selected model is no longer deterministic but random, and statistical inference after model selection is affected by selection bias. In hypothesis testing of the selected variables, the validity of the inference is compromised when a test statistic is constructed without taking account of the model selection effect. As a consequence, we can no longer effectively control the type I error or the false positive rate. This kind of problem falls under the banner of post-selection inference in the statistical community and has recently attracted a lot of attention (see, e.g., Berk et al., 2013; Efron, 2014; Barber and Candès, 2016; Lee et al., 2016).

Post-selection inference consists of the following two steps:

Selection:

The analyst chooses a model or subset of variables and constructs a hypothesis based on the data.

Inference:

The analyst tests the hypothesis by using the selected model.

Broadly speaking, the selection step determines what issue to address, i.e., a hypothesis selected from the data, and the inference step conducts hypothesis testing to enable a conclusion to be drawn about the issue under consideration. To navigate the issue of selection bias, there are several approaches for conducting the inference step.

Data splitting is the most common procedure for selection bias correction. In a high dimensional linear regression model, Wasserman and Roeder (2009) and Meinshausen et al. (2009) succeed in assigning a $p$-value for each selected variable by splitting the data into two subsets. Specifically, they first reduce the dimensionality of the model using the first subset, and then make the final selection using the second subset of the data, by assigning a $p$-value based on classical least squares estimation. While such a data splitting method is mathematically valid and straightforward to implement, it leads to low power for extracting truly significant variables because only sub-samples, whose size is necessarily smaller than that of the full sample, can be used in each of the selection and inference steps.

As an alternative, simultaneous inference, which takes into account all possible subsets of variables, has been developed for correcting selection bias. Berk et al. (2013) showed that the type I error can be successfully controlled even if the full sample is used in both the selection and inference steps by adjusting for the multiplicity of model selection. Since the number of all possible subsets of variables increases exponentially, the computational costs associated with this method become excessive when the dimension of parameters is greater than 20.

On the other hand, selective inference, which only takes the selected model into account, is another approach for post-selection inference, and provides a new framework for combining selection and hypothesis testing. Since hypothesis testing is conducted only for the selected model, it makes sense to condition on the event that “a certain model is selected”. This event is referred to as a selection event, and we conduct hypothesis testing conditional on the event. Thus, we can avoid having to compare coefficients across two different models. Recently, Lee et al. (2016) succeeded in using this method to conduct hypothesis testing by constructing confidence intervals for variables selected by the Lasso in a linear regression modeling context. Once a specific confidence interval is constructed, the corresponding hypothesis test can be conducted. They also show that the type I error, which is also conditioned on the selection event and is called the selective type I error, can be appropriately controlled. It is noteworthy that by conditioning on the selection event in a certain class, we can construct exact $p$-values in the sense of conditional inference based on a truncated normal distribution.

Almost all studies that have followed the seminal work by Lee et al. (2016), however, focus on linear regression models. In particular, normality of the noise is crucial for controlling the selective type I error. To relax this assumption, Tian and Taylor (2017) developed an asymptotic theory for selective inference in a generalized linear modeling context. Although their results are applicable to high dimensional and low sample size data, we can only test a global null hypothesis, that is, the hypothesis that all regression coefficients are zero, just as with the covariance test (Lockhart et al., 2014). On the other hand, Taylor and Tibshirani (2018) proposed a procedure to test individual hypotheses in a logistic regression model with the Lasso. By debiasing the Lasso estimator for both the active and inactive variables, they derive a joint asymptotic distribution of the debiased Lasso estimator and conduct hypothesis testing for regression coefficients individually. However, the method is justified only for low dimensional scenarios since they exploit standard fixed dimensional asymptotics.

Our main contribution is that, by utilizing marginal screening as a variable selection method, we can show that the selective type I error rate for logistic regression model is appropriately controlled even in a high dimensional asymptotic scenario. In addition, our method is applicable not only with respect to testing the global null hypothesis but also hypotheses pertaining to individual regression coefficients. Specifically, we first utilize marginal screening for the selection step in a similar way to Lee and Taylor (2014). Then, by considering a logistic regression model for the selected variables, we derive a high dimensional asymptotic property of a maximum likelihood estimator. Using the asymptotic results, we can conduct selective inference of a high dimensional logistic regression, i.e., valid hypothesis testing for the selected variables from high dimensional data.

The rest of the paper is organized as follows. Section 2 briefly describes the notion of selective inference and introduces several related works. In Section 3, the model setting and assumptions are described. An asymptotic property of the maximum likelihood estimator of our model is discussed in Section 4. In Section 5, we conduct several simulation studies to explore the performance of the proposed method before application to real world empirical data sets in Section 6. Theorem proofs are relegated to Section 7. Finally, Section 8 offers concluding remarks and suggestions for future research in this domain.

Notation

Throughout the paper, row and column vectors of $X\in\mathbb{R}^{n\times d}$ are denoted by $\bm{x}_{i}$ $(i=1,\ldots,n)$ and $\tilde{\bm{x}}_{j}$ $(j=1,\ldots,d)$, respectively. An $n\times n$ identity matrix is denoted by $I_{n}$. The $\ell_{2}$-norm of a vector is denoted by $\|\cdot\|$ provided there is no confusion. For any subset $J\subseteq\{1,\ldots,d\}$, its complement is denoted by $J^{\bot}=\{1,\ldots,d\}\backslash J$. We also denote $\bm{v}_{J}=(v_{i})_{i\in J}\in\mathbb{R}^{|J|}$ and $X_{J}=(\bm{x}_{J,1},\ldots,\bm{x}_{J,n})^{\top}\in\mathbb{R}^{n\times|J|}$ as a sub-vector of $\bm{v}$ and a sub-matrix of $X$, respectively. For a differentiable function $f$, we denote $f^{\prime}$ and $f^{\prime\prime}$ as the first and second derivatives and so on.

2 Selective Inference and Related Works

In this section, we give an overview of the fundamental notion of selective inference through a simple linear regression model (Lee et al., 2016). We also review related existing works on selective inference.

2.1 Selective Inference in Linear Regression Model

Let $\bm{y}\in\mathbb{R}^{n}$ and $X\in\mathbb{R}^{n\times d}$ be a response and non-random regressor, respectively, and let us consider a linear regression model

$$\bm{y}=X\bm{\beta}^{*}+\bm{\varepsilon},$$

where $\bm{\beta}^{*}$ is the true regression coefficient vector and $\bm{\varepsilon}$ is distributed according to ${\rm N}(\bm{0},\sigma^{2}I_{n})$ with known variance $\sigma^{2}$ . Suppose that a subset of variables $S$ is selected in the selection step (e.g., Lasso or marginal screening as in Lee et al. (2016); Lee and Taylor (2014)) and let us consider hypothesis testing for $j\in\{1,\ldots,|S|\}$ :

$$\text{H}_{0,j}:\beta_{S,j}^{*}=0\quad\text{vs.}\quad\text{H}_{1,j}:\beta_{S,j}^{*}\neq 0.\tag{1}$$

If $S$ is non-random, a maximum likelihood estimator $\hat{\bm{\beta}}_{S}=(X_{S}^{\top}X_{S})^{-1}X_{S}^{\top}\bm{y}$ is distributed according to ${\rm N}(\bm{\beta}_{S}^{*},\sigma^{2}(X_{S}^{\top}X_{S})^{-1})$, as is well-known. However, we cannot use this sampling distribution when $S$ is selected based on the data, because the selected set $S$ is then itself random.

If a subset of variables, i.e., the active set, $\hat{S}$ is selected by the Lasso or marginal screening, the event $\{\hat{S}=S\}$ can be written as an affine set with respect to $\bm{y}$ , that is, in the form of $\{\bm{y};A\bm{y}\leq\bm{b}\}$ for some non-random matrix $A$ and vector $\bm{b}$ (Lee et al., 2016; Lee and Taylor, 2014), in which the event $\{\hat{S}=S\}$ is called a selection event. Lee et al. (2016) showed that if $\bm{y}$ follows a normal distribution and the selection event can be written as an affine set, the following lemma holds:

Lemma 1 (Polyhedral Lemma; Lee et al. (2016))

Suppose $\bm{y}\sim{\rm N}(\bm{\mu},\Sigma)$ . Let $\bm{c}=\Sigma\bm{\eta}(\bm{\eta}^{\top}\Sigma\bm{\eta})^{-1}$ for any $\bm{\eta}\in\mathbb{R}^{n}$ , and let $\bm{z}=(I_{n}-\bm{c}\bm{\eta}^{\top})\bm{y}$ . Then we have

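$$\{\bm{y};A\bm{y}\leq\bm{b}\}=\{\bm{y};L(\bm{z})\leq\bm{\eta}^{\top}\bm{y}\leq U(\bm{z}),\;N(\bm{z})\geq 0\},$$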

where

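$$L(\bm{z})=\max_{j:(A\bm{c})_{j}<0}\frac{b_{j}-(A\bm{z})_{j}}{(A\bm{c})_{j}},\qquad U(\bm{z})=\min_{j:(A\bm{c})_{j}>0}\frac{b_{j}-(A\bm{z})_{j}}{(A\bm{c})_{j}}$$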

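and $N(\bm{z})=\min_{j:(A\bm{c})_{j}=0}\{b_{j}-(A\bm{z})_{j}\}$, so that $N(\bm{z})\geq 0$ holds exactly when every constraint with $(A\bm{c})_{j}=0$ is satisfied. In addition, $(L(\bm{z}),U(\bm{z}),N(\bm{z}))$ is independent of $\bm{\eta}^{\top}\bm{y}$.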
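The truncation limits in the lemma can be computed directly from $A$, $\bm{b}$, $\bm{y}$ and $\bm{\eta}$. The following is a minimal sketch, assuming $\Sigma=\sigma^{2}I_{n}$ so that $\bm{c}=\bm{\eta}/\|\bm{\eta}\|^{2}$; the function names `dot` and `truncation_limits` and the tolerance `tol` are illustrative choices, not part of the original paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def truncation_limits(A, b, y, eta, tol=1e-12):
    """Return (L(z), U(z)) for the event {y : A y <= b}.

    Assumption: Sigma = sigma^2 * I, hence c = eta / ||eta||^2.
    A is a list of rows; b, y, eta are lists of floats.
    """
    eta_sq = dot(eta, eta)
    c = [e / eta_sq for e in eta]                # c = Sigma eta (eta' Sigma eta)^{-1}
    t = dot(eta, y)                              # eta' y
    z = [y_i - c_i * t for y_i, c_i in zip(y, c)]  # z = (I - c eta') y
    lower, upper = -math.inf, math.inf
    for row, b_j in zip(A, b):
        ac = dot(row, c)
        az = dot(row, z)
        if ac < -tol:      # constraint bounds eta' y from below
            lower = max(lower, (b_j - az) / ac)
        elif ac > tol:     # constraint bounds eta' y from above
            upper = min(upper, (b_j - az) / ac)
        # rows with (A c)_j = 0 do not restrict eta' y
    return lower, upper
```

For instance, with the constraints $0\leq y_{1}\leq 5$ and $y_{2}\leq 4$ and $\bm{\eta}=(1,0)^{\top}$, the sketch returns the interval $[0,5]$ for $\bm{\eta}^{\top}\bm{y}=y_{1}$.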

By using the lemma, we can find that the distribution of the pivotal quantity for $\bm{\eta}^{\top}\bm{\mu}$ is given by a truncated normal distribution. Specifically, let $F^{[L,U]}_{\mu,\sigma^{2}}$ be a cumulative distribution function of a truncated normal distribution ${\rm TN}(\mu,\sigma^{2},L,U)$ , that is,

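$$F_{\mu,\sigma^{2}}^{[L,U]}(x)=\frac{\Phi((x-\mu)/\sigma)-\Phi((L-\mu)/\sigma)}{\Phi((U-\mu)/\sigma)-\Phi((L-\mu)/\sigma)},$$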

where $\Phi$ is a cumulative distribution function of a standard normal distribution. Then, for any value of $\bm{z}$ , we have

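$$\bigl[F_{\bm{\eta}^{\top}\bm{\mu},\,\bm{\eta}^{\top}\Sigma\bm{\eta}}^{[L(\bm{z}),U(\bm{z})]}(\bm{\eta}^{\top}\bm{y})\mid A\bm{y}\leq\bm{b}\bigr]\sim{\rm Unif}(0,1),$$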

where $L(\bm{z})$ and $U(\bm{z})$ are defined in the above lemma. This pivotal quantity allows us to construct a so-called selective $p$ -value. Precisely, by choosing $\bm{\eta}=X_{S}(X_{S}^{\top}X_{S})^{-1}\bm{e}_{j}$ , we can construct a right-side selective $p$ -value as

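$$P_{j}=1-F_{0,\,\bm{\eta}^{\top}\Sigma\bm{\eta}}^{[L(\bm{z}_{0}),U(\bm{z}_{0})]}(\bm{\eta}^{\top}\bm{y}),$$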

where $\bm{e}_{j}\in\mathbb{R}^{|S|}$ is a unit vector whose $j$ -th element is 1 and 0 otherwise, and $\bm{z}_{0}$ is a realization of $\bm{z}$ . Note that the value of $P_{j}$ represents a right-side $p$ -value conditional on the selection event under the null hypothesis $\text{H}_{0,j}:\beta_{S,j}^{*}=\bm{\eta}^{\top}\bm{\mu}=0$ in (1). In addition, for the $j$ -th test in (1), a two-sided selective $p$ -value can be defined as

$$\tilde{P}_{j}=2\min\{P_{j},1-P_{j}\},$$

which also follows the standard uniform distribution under the null hypothesis. Therefore, we reject the $j$-th null hypothesis at level $\alpha$ when $\tilde{P}_{j}\leq\alpha$, and the probability

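$${\rm P}(\text{H}_{0,j}\text{ is falsely rejected}\mid\hat{S}=S)={\rm P}(\tilde{P}_{j}\leq\alpha\mid\hat{S}=S)$$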

is referred to as a selective type I error.
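As an illustration of the truncated normal pivot and the resulting selective $p$-values, the sketch below evaluates $F^{[L,U]}_{\mu,\sigma^{2}}$ and then the right-sided and two-sided $p$-values under the null ($\mu=0$). It uses only the Python standard library; the function names are hypothetical, `sigma` stands for $\sqrt{\bm{\eta}^{\top}\Sigma\bm{\eta}}$, and no special care is taken for extreme tails, where the difference of two $\Phi$ values can lose precision.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def truncated_normal_cdf(x, mu, sigma, lower, upper):
    """CDF of TN(mu, sigma^2, lower, upper) at x, for lower <= x <= upper."""
    a = std_normal_cdf((lower - mu) / sigma)
    b = std_normal_cdf((upper - mu) / sigma)
    return (std_normal_cdf((x - mu) / sigma) - a) / (b - a)

def selective_p_values(stat, sigma, lower, upper):
    """Right-sided and two-sided selective p-values under the null mu = 0.

    stat is the observed eta' y; lower and upper are L(z_0) and U(z_0).
    """
    p_right = 1.0 - truncated_normal_cdf(stat, 0.0, sigma, lower, upper)
    p_two_sided = 2.0 * min(p_right, 1.0 - p_right)
    return p_right, p_two_sided
```

With no truncation (`lower=-math.inf`, `upper=math.inf`) the right-sided value reduces to the classical one-sided normal $p$-value; truncating to $[0,\infty)$ roughly doubles it for a positive statistic, which is how conditioning on the selection event corrects for selection.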

2.2 Related Works

In selective inference, we use the same data in variable selection and statistical inference. Therefore, the selected model is not deterministic and we cannot apply classical hypothesis testing due to selection bias.

To navigate this problem, data splitting has been commonly utilized. In data splitting, the data are randomly divided into two disjoint sets, and one of them is used for variable selection and the other is used for hypothesis testing. This is a particularly versatile method and is widely applicable if we can divide the data randomly (see e.g., Cox, 1975; Wasserman and Roeder, 2009; Meinshausen et al., 2009). Since the data are split randomly, i.e., independently of the data, we can conduct hypothesis testing in the inference step independently of the selection step. Thus, we do not need to be concerned with selection bias. It is noteworthy that data splitting can be viewed as a method of selective inference because the inference is conducted only for the variables selected in the selection step. However, a drawback of data splitting is that only a part of the data is available for each step, precisely because the essence of this approach involves rendering some data available for the selection step and the remainder for the inference step. Because only a subset of the data can be used in variable selection, the risk of failing to select truly important variables increases. Similarly, the power of hypothesis testing would decrease since inference proceeds on the basis of a subset of the total data. In addition, since data splitting is executed at random, it is possible and plausible that the final results and conclusions will vary non-trivially depending on exactly how the data are split.

On the other hand, in the traditional statistical community, simultaneous inference has been developed for correcting selection bias (see e.g., Berk et al., 2013; Dickhaus, 2014). In simultaneous inference, type I error is controlled at level $\alpha$ by considering all possible subsets of variables. Specifically, let $\hat{S}\subseteq\{1,\ldots,d\}$ be the set of variables selected by a certain variable selection method and $P_{j}(\hat{S})$ be a $p$ -value for the $j$ -th selected variable in $\hat{S}$ . Then, in simultaneous inference, the following type I error should be adequately controlled:

$${\rm P}(P_{j}(\hat{S})\leq\alpha\ \text{for any}\ \hat{S}\subseteq\{1,\ldots,d\})\leq\alpha.\tag{3}$$

To examine the relationship between selective inference and simultaneous inference, note that the left-hand side in (3) can be rewritten as

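$${\rm P}(P_{j}(\hat{S})\leq\alpha\ \text{for any}\ \hat{S}\subseteq\{1,\ldots,d\})=\sum_{S\subseteq\{1,\ldots,d\}}{\rm P}(P_{j}(S)\leq\alpha\mid\hat{S}=S)\,{\rm P}(\hat{S}=S).$$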

The right-hand side in the above equality is simply a weighted sum of selective type I errors over all possible subsets of variables. Therefore, if we control selective type I errors for all possible subsets of variables, we can also control type I errors in the sense of simultaneous inference. However, because the number of all possible subsets of variables is $2^{d}$, it becomes overly cumbersome to compute the left-hand side in (3) even for $d=20$. In contrast to simultaneous inference, selective inference only considers the selected variables, and thus the computational cost is low compared to simultaneous inference.

Following the seminal work of Lee et al. (2016), selective inference for variable selection has been intensively studied (e.g., Fithian et al., 2014; Lee and Taylor, 2014; Taylor et al., 2016; Tian et al., 2018). All these methods, however, rely on the assumption of normality of the data.

2.3 Beyond Normality

It is important to relax the assumption of normality in order to apply selective inference to more general cases such as generalized linear models. To the best of our knowledge, there is a dearth of research into selective inference in such a generalized setting. Here, we discuss the few studies which do exist in this respect.

Fithian et al. (2014) derived an exact post-selection inference for a natural parameter of exponential family, and obtained the uniformly most powerful unbiased test in the framework of selective inference. However, as suggested in their paper, the difficulty in constructing exact inference in generalized linear models emanates from the discreteness of the response distribution.

Focusing on asymptotic behavior in a generalized linear model context with the Lasso penalty, Tian and Taylor (2017) directly considered the asymptotic property of a pivotal quantity. Although their work can be applied in high dimensional scenarios, we can only test a global null, that is, ${\rm H}_{0}:\bm{\beta}^{*}=\bm{0}$, except for the linear regression model case. This is because, when we conduct selective inference for individual coefficients, the selection event does not form a simple structure such as an affine set.

On the other hand, Taylor and Tibshirani (2018) proposed a procedure to test individual hypotheses in a logistic regression model context based on the Lasso. Their approach is fundamentally based on solving the Lasso by approximating the log-likelihood up to the second order, and on debiasing the Lasso estimator. Because the objective function now becomes quadratic as per the linear regression model, the selection event reduces to a relatively simple affine set. After debiasing the Lasso estimator, they derive an asymptotic joint distribution of active and inactive estimators. However, since they required $d$ dimensional asymptotics, high dimensional scenarios cannot be supported in their theory.

In this paper, we extend selective inference for logistic regression in Taylor and Tibshirani (2018) to high dimensional settings in the case where variable selection is conducted by marginal screening. We do not consider asymptotics for a $d$ dimensional original parameter space, but for a $K$ dimensional selected parameter space. Unfortunately, however, we cannot apply this asymptotic result directly to the polyhedral lemma (Lemma 1) in Lee et al. (2016). To tackle this problem, we consider a score function for constructing a test statistic for our selective inference framework. We first define a function $\bm{T}_{n}(\bm{\beta}_{S}^{*})$ based on a score function as a “source” for constructing a test statistic. To apply the polyhedral lemma to $\bm{T}_{n}(\bm{\beta}_{S}^{*})$ , we need to asymptotically ensure that i) the selection event is represented by affine constraints with respect to $\bm{T}_{n}(\bm{\beta}_{S}^{*})$ , and ii) the function in the form of $\bm{\eta}^{\top}\bm{T}_{n}(\bm{\beta}_{S}^{*})$ is independent of the truncation points. Our main technical contribution herein is that, by carefully analyzing problem configuration and by introducing reasonable additional assumptions, we can show that those two requirements for the polyhedral lemma are satisfied asymptotically.

Figure 1 shows the asymptotic distribution of selective $p$-values in our setting and in Taylor and Tibshirani (2018) based on 1,000 Monte-Carlo simulations. While the theory in Taylor and Tibshirani (2018) does not support high dimensionality, their selective $p$-value (red solid line) appears to be effective in high dimensional scenarios, although it is slightly more conservative compared to the approach developed in this paper (black solid line). Our high dimensional framework means that the number of selected variables grows with the sample size in an appropriate order, and the proposed method allows us to test (1) individually even in high dimensional contexts.

3 Setting and Assumptions

As already noted, our objective herein is to develop a selective inference approach applicable to logistic regression models when the variables are selected by marginal screening. Let $(y_{i},\bm{x}_{i})$ be the $i$ -th pair of the response and regressor. We assume that the $y_{i}$ ’s are independent random variables which take values in $\{0,1\}$ , and the $\bm{x}_{i}$ ’s are a $d$ dimensional vector of known constants. Further, let $X=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\top}\in\mathbb{R}^{n\times d}$ and $\bm{y}=(y_{1},\ldots,y_{n})^{\top}\in\{0,1\}^{n}$ . Unlike Taylor and Tibshirani (2018), we do not require that the dimension $d$ be fixed, that is, $d$ may increase, as well as the sample size $n$ .

3.1 Marginal Screening and Selection Event

In this study, we simply select variables based on a score between the regressor and response $\bm{z}=X^{\top}\bm{y}$ as per a linear regression problem. Specifically, we select the top $K$ coordinates of absolute values in $\bm{z}$ , that is,

$$\hat{S}=\{j;\;|z_{j}|\text{ is among the first }K\text{ largest of all}\}.$$

To avoid computational issues, we consider the event $\{(\hat{S},\bm{s}_{\hat{S}})=(S,\bm{s}_{S})\}$ as a selection event (see, e.g., Lee and Taylor (2014); Tian and Taylor (2017); Lee et al. (2016)). Here, $\bm{s}_{S}$ is the vector of signs of $z_{j}$ $(j\in S)$. Then, the selection event $\{(\hat{S},\bm{s}_{\hat{S}})=(S,\bm{s}_{S})\}$ can be rewritten as

$$|z_{j}|\geq|z_{k}|,\quad\forall(j,k)\in S\times S^{\bot},$$

which is equivalent to

$$-s_{j}z_{j}\leq z_{k}\leq s_{j}z_{j},\quad s_{j}z_{j}\geq 0,\quad\forall(j,k)\in S\times S^{\bot}.$$

Therefore, $\{(\hat{S},s_{\hat{S}})=(S,s_{S})\}$ is reduced to an affine set $\{\bm{z};\;A\bm{z}\leq\bm{0}\}$ for an appropriate $\{2K(d-K)+K\}\times d$ dimensional matrix $A$ .
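The screening step and the affine representation of the selection event can be sketched as follows. This is an illustrative standard-library implementation, not the authors' code: `marginal_screening` and `selection_matrix` are hypothetical names, ties in $|z_{j}|$ are broken by index order, and a zero score is assigned sign $+1$ as a convention.

```python
def marginal_screening(X, y, K):
    """Return z = X'y, the indices S of the K largest |z_j|, and their signs."""
    n, d = len(X), len(X[0])
    z = [sum(X[i][j] * y[i] for i in range(n)) for j in range(d)]
    order = sorted(range(d), key=lambda j: abs(z[j]), reverse=True)
    S = sorted(order[:K])
    signs = {j: (1.0 if z[j] >= 0 else -1.0) for j in S}
    return z, S, signs

def selection_matrix(S, signs, d):
    """Rows of A such that the selection event equals {z : A z <= 0}."""
    selected = set(S)
    rows = []
    for j in S:
        for k in range(d):
            if k in selected:
                continue
            for sign_k in (1.0, -1.0):
                row = [0.0] * d
                row[k] = sign_k        # encodes  +/- z_k - s_j z_j <= 0
                row[j] = -signs[j]
                rows.append(row)
    for j in S:
        row = [0.0] * d
        row[j] = -signs[j]             # encodes  -s_j z_j <= 0
        rows.append(row)
    return rows
```

The returned matrix has $2K(d-K)+K$ rows and $d$ columns, matching the dimension stated above, and the observed $\bm{z}$ satisfies $A\bm{z}\leq\bm{0}$ by construction.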

In the following, we assume that a sure screening property holds. This is a desirable property for variable selection (see e.g., Fan and Lv, 2008; Fan and Song, 2010) and the statement is as follows:

(C0)

For the true active set $S^{*}=\{j;\beta_{j}^{*}\neq 0\}$ , the probability ${\rm P}(\hat{S}\supset S^{*})$ converges to 1 as $n$ goes to infinity.

In the above assumption, we denote $\bm{\beta}^{*}\in\mathbb{R}^{d}$ as a true value of the coefficient vector. This assumption requires that the set of selected variables contain the set of true active variables with probability tending to 1. In the linear regression model, (C0) holds under some regularity conditions in high dimensional settings (see, e.g., Fan and Lv, 2008). A sufficient condition on the dimensionality for (C0) is $\log d=O(n^{\xi})$ for some $\xi\in(0,1/2)$, and thus we allow $d$ to be exponentially large. Because (C0) is not directly related to selective inference, we do not discuss it further.

3.2 Selective Test

For a subset of variables $\hat{S}~{}(=S)$ selected by marginal screening, we consider $K$ selective tests (1) for each variable $\beta_{j}^{*},~{}j\in S$ . Let us define the loss function of logistic regression with the selected variables as follows:

[TABLE]

where $\psi(\bm{x}_{S,i}^{\top}\bm{\beta}_{S})=\log(1+\exp(\bm{x}_{S,i}^{\top}\bm{\beta}_{S}))$ is a cumulant generating function. Observe that $\ell_{n}(\bm{\beta}_{S})$ is concave with respect to $\bm{\beta}_{S}$ . Thus we can define the maximum likelihood estimator of $\bm{\beta}_{S}$ as the optimal solution that attains the maximum of the following optimization problem:

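$$\hat{\bm{\beta}}_{S}=\mathop{\rm arg\,max}_{\bm{\beta}_{S}\in{\cal B}}\;\ell_{n}(\bm{\beta}_{S}),\tag{5}$$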

where ${\cal B}\subseteq\mathbb{R}^{K}$ is a parameter space.
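To make the estimation step concrete, the following sketch evaluates $\psi$, its derivative, and a logistic log-likelihood, and maximizes it by fixed-step gradient ascent. Several points are assumptions rather than statements from the paper: the log-likelihood is taken in the standard unnormalized form $\sum_{i}\{y_{i}\bm{x}_{S,i}^{\top}\bm{\beta}_{S}-\psi(\bm{x}_{S,i}^{\top}\bm{\beta}_{S})\}$, the constraint $\bm{\beta}_{S}\in{\cal B}$ is ignored, and gradient ascent is a simple stand-in for a production routine such as R's `glm`. All function names are hypothetical.

```python
import math


def psi(t):
    """Cumulant generating function log(1 + exp(t)), computed stably."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def psi_prime(t):
    """First derivative of psi, i.e. the logistic (sigmoid) function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_likelihood(X, y, beta):
    """Assumed form: sum_i { y_i x_i^T beta - psi(x_i^T beta) }."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        t = sum(a * b for a, b in zip(x_i, beta))
        total += y_i * t - psi(t)
    return total


def gradient(X, y, beta):
    """Gradient of the log-likelihood: X^T (y - psi'(X beta))."""
    g = [0.0] * len(beta)
    for x_i, y_i in zip(X, y):
        t = sum(a * b for a, b in zip(x_i, beta))
        r = y_i - psi_prime(t)
        for j, a in enumerate(x_i):
            g[j] += a * r
    return g


def fit_mle(X, y, n_iter=2000):
    """Maximize the concave log-likelihood by fixed-step gradient ascent."""
    n, K = len(X), len(X[0])
    # psi'' <= 1/4, so the Hessian of l_n / n is bounded by max_i ||x_i||^2 / 4.
    step = 4.0 / max(sum(a * a for a in x_i) for x_i in X)
    beta = [0.0] * K
    for _ in range(n_iter):
        g = gradient(X, y, beta)
        beta = [b + step * g_j / n for b, g_j in zip(beta, g)]
    return beta
```

Because the objective is concave, the fixed step size derived from the bound $\psi''\leq 1/4$ guarantees monotone ascent; the maximizer exists only when the two classes are not linearly separable in the selected variables.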

Remark 1

Suppose that $S~{}(\supset S^{*})$ is fixed. Then, it holds that

[TABLE]

and thus, we have

[TABLE]

We construct test statistics for (1) by deriving an asymptotic distribution of $\hat{\bm{\beta}}_{S}$ . To develop our asymptotic theory, we further assume the following conditions in addition to (C0) for a fixed $S$ with $|S|=K$ :

(C1)

$\max_{i}\|\bm{x}_{S,i}\|={\rm O}(\sqrt{K})$ . In addition, for a $K\times K$ dimensional matrix

[TABLE]

the following holds:

[TABLE]

where $C_{1}$ and $C_{2}$ are constants that depend on neither $n$ nor $K$ .

(C2)

There exists a constant $\xi\;(<\infty)$ such that $\max_{i}|\bm{x}_{S,i}^{\top}\bm{\beta}_{S}^{*}|<\xi$ . In addition, parameter space ${\cal B}$ is

[TABLE]

for some constant $\tilde{\xi}\;(\in(\xi,\infty))$ .

(C3)

$K^{3}/n={\rm o}(1)$ .

(C4)

For any $p\times q$ dimensional matrix $A$ , we denote the spectral norm of $A$ by $\|A\|=\sup_{\bm{v}\neq\bm{0}}\|A\bm{v}\|/\|\bm{v}\|$ . Then the following holds:

[TABLE]

The condition (C1) pertains to the design matrix. Note that we only consider a high dimensional and small sample size setting for the original data set, and not for the selected variables. This assumption is reasonable for high dimensional and large sample scenarios. (C2) requires that ${\rm P}(y_{i}=1)$ not converge to 0 or 1 for any $i=1,\ldots,n$. Observe that the parameter space ${\cal B}$ is an open and convex set with respect to $\bm{\beta}_{S}$. This assumption naturally holds when the space of regressors is compact and $\bm{\beta}_{S}$ does not diverge. In addition, if the maximum likelihood estimator $\hat{\bm{\beta}}_{S}$ is $\sqrt{n/K}$-consistent, then $\hat{\bm{\beta}}_{S}$ lies in ${\cal B}$ with probability converging to 1. The condition (C3) represents the relationship between the sample size and the number of selected variables for high dimensional asymptotics in our model. As related conditions, Fan and Peng (2004) employ $K^{5}/n\to 0$, and Dasgupta et al. (2014) employ $K^{6+\delta}/n\to 0$ for some $\delta>0$ to derive an asymptotic expansion of a posterior distribution in a Bayesian setting. Furthermore, Huber (1973) employs the same condition as (C3) in the context of $M$-estimation. Finally, (C4) requires that the regressors of selected variables and those of unselected variables be only weakly correlated. A similar assumption is required in Huang et al. (2008) for deriving an asymptotic distribution for a bridge estimator. This type of assumption, e.g., a restricted eigenvalue condition (Bickel et al., 2009), is essential for handling the high dimensional behavior of the estimator.

4 Proposed Method

In this section, we present the proposed method for selective inference for high dimensional logistic regression with marginal screening. We first consider a subset of features $\hat{S}=S~(\supset S^{*})$ as a fixed set, and derive an asymptotic distribution of $\hat{\bm{\beta}}_{S}$ under the assumptions (C1) – (C3). Then, we introduce the “source” of the test statistic $\bm{T}_{n}(\bm{\beta}_{S}^{*})$, which is defined by a score function, and apply the polyhedral lemma to it, showing that the truncation points are asymptotically independent of $\bm{\eta}^{\top}\bm{T}_{n}(\bm{\beta}_{S}^{*})$ under the assumption (C4).

To extend the selective inference framework to logistic regression, we first consider a subset of variables $\hat{S}=S~{}(\supset S^{*})$ as a fixed set. From (4), let us define a score function and observed information matrix by

[TABLE]

respectively. To simplify the notation, we denote $\bm{s}_{n}(\bm{\beta}_{S}^{*})$ and $\Sigma_{n}(\bm{\beta}_{S}^{*})$ by $\bm{s}_{n}$ and $\Sigma_{n}$ , respectively, for the true value of $\bm{\beta}_{S}^{*}$ . Because $\psi^{\prime\prime}(\bm{x}_{S,i}^{\top}\bm{\beta}_{S}^{*})$ is uniformly bounded on ${\cal B}$ from (C2), $\Sigma_{n}$ is a symmetric and positive definite matrix when (C1) holds. Then, by the same argument as in Fan and Peng (2004), if $K^{2}/n\to 0$ , we have

[TABLE]

By using Taylor’s theorem, we have

[TABLE]

and thus

[TABLE]

As per Remark 1, $S\supset S^{*}$ implies

[TABLE]

In addition, because the $y_{i}$ ’s are independent of each other, we observe that

[TABLE]

Therefore, by recalling asymptotic normality of the score function, we expect that a distribution of $\Sigma_{n}^{-1}\bm{s}_{n}$ can be approximated by a normal distribution with mean $\bm{0}$ and covariance matrix $\Sigma_{n}^{-1}$ . Indeed, if $S$ is fixed, this approximation is true under the conditions (C1) – (C3):

Theorem 1

Suppose that the conditions (C1) – (C3) hold. Then, for any fixed $S~{}(\supset S^{*})$ and $\bm{\eta}\in\mathbb{R}^{K}$ with $\|\bm{\eta}\|<\infty$ , we have

[TABLE]

where $\sigma_{n}^{2}=\bm{\eta}^{\top}\Sigma_{n}^{-1}\bm{\eta}$ and ${\rm o}_{{\rm p}}(1)$ is a term that converges to 0 in probability uniformly with respect to $\bm{\eta}$ and $S$ .

Note that, under the conditions (C1), (C2) and $d^{3}/n\to 0$ , Theorem 1 also holds when we do not enforce variable selection (see e.g., Fan and Peng (2004)). To formulate a selective test, let us consider

[TABLE]

as a “source” of a test statistic, where $\bm{\psi}^{\prime}(\bm{\beta}_{S}^{*})=(\psi^{\prime}(\bm{x}_{S,i}^{\top}\bm{\beta}_{S}^{*}))_{i=1,\ldots,n}$ . The term “source” means that we cannot use it as a test statistic directly because $\bm{T}_{n}(\bm{\beta}_{S}^{*})$ depends on $\bm{\beta}_{S}^{*}$ . In the following, for notational simplicity, we denote $\bm{T}_{n}(\bm{\beta}_{S}^{*})$ and $\bm{\psi}^{\prime}(\bm{\beta}_{S}^{*})$ by $\bm{T}_{n}$ and $\bm{\psi}^{\prime}$ , respectively.

As noted in Section 3.1, by using an appropriate non-random matrix $A\in\mathbb{R}^{K(2d-2K+1)\times d}$ , the marginal screening selection event can be expressed as an affine constraint with respect to $\bm{z}=X^{\top}\bm{y}$ , that is, $\{\bm{z};A\bm{z}\leq\bm{0}\}$ . Then, by appropriately dividing $A$ and $X$ based on the selected $S$ , we can rewrite it as follows:

[TABLE]

The last inequality is an affine constraint with respect to $\bm{T}_{n}$ , where

[TABLE]

Unlike the polyhedral lemma in Section 2.1, $\tilde{\bm{b}}$ depends on $\bm{y}$ and so is a random vector. By using (C4), we can prove that $\tilde{\bm{b}}$ is asymptotically independent of $\bm{\eta}^{\top}\bm{T}_{n}$ , which implies the polyhedral lemma holds asymptotically.

Theorem 2

Suppose that (C1) – (C4) all hold. Let $\bm{c}=\Sigma_{n}^{-1}\bm{\eta}/\sigma_{n}^{2}$ for any $\bm{\eta}\in\mathbb{R}^{K}$ with $\|\bm{\eta}\|<\infty$ , and $\bm{w}=(I_{K}-\bm{c}\bm{\eta}^{\top})\bm{T}_{n}$ , where $\sigma_{n}^{2}=\bm{\eta}^{\top}\Sigma_{n}^{-1}\bm{\eta}$ . Then, for any fixed $S~{}(\supset S^{*})$ , the selection event can be expressed as

[TABLE]

where

[TABLE]

and $N_{n}=\max_{l:(\tilde{A}\bm{c})_{l}=0}\tilde{b}_{l}-(\tilde{A}\bm{w})_{l}$ . In addition, $(L_{n},U_{n},N_{n})$ is asymptotically independent of $\bm{\eta}^{\top}\bm{T}_{n}$ .
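The mechanics behind truncation points of this kind can be sketched generically. The code below follows the convention of the polyhedral lemma of Lee et al. (2016) for an event of the form $\{A\bm{T}\leq\bm{b}\}$ with $\bm{T}=t\,\bm{c}+\bm{w}$ and $t=\bm{\eta}^{\top}\bm{T}$; it is an assumption that this matches the paper's expressions in (9), whose exact signs are not reproduced here. The names `decompose` and `truncation_interval` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def decompose(T, c, eta):
    """Split T into t = eta^T T and w = T - t * c (requires eta^T c = 1)."""
    t = dot(eta, T)
    w = [T_j - t * c_j for T_j, c_j in zip(T, c)]
    return t, w


def truncation_interval(A, b, c, w, tol=1e-12):
    """Return (lower, upper, slack) such that A (t c + w) <= b holds
    exactly when lower <= t <= upper and slack >= 0."""
    lower, upper, slack = -math.inf, math.inf, math.inf
    for row, b_l in zip(A, b):
        ac = dot(row, c)
        resid = b_l - dot(row, w)
        if ac > tol:
            upper = min(upper, resid / ac)   # rows that bound t from above
        elif ac < -tol:
            lower = max(lower, resid / ac)   # rows that bound t from below
        else:
            slack = min(slack, resid)        # rows that do not involve t
    return lower, upper, slack


A_demo = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
b_demo = [2.0, 1.0, 5.0]
t, w = decompose([0.5, 3.0], c=[1.0, 0.0], eta=[1.0, 0.0])
print(truncation_interval(A_demo, b_demo, [1.0, 0.0], w))  # (-1.0, 2.0, 2.0)
```

In the demo, the first two rows constrain $t$ to $[-1,2]$, while the third row does not involve $t$ and only contributes a non-negative slack.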

As a result of Theorem 1, Theorem 2 and (C0), we can asymptotically identify a pivotal quantity as a truncated normal distribution, that is, by letting $\bm{\eta}=\bm{e}_{j}\in\mathbb{R}^{K}$ ,

[TABLE]

for any $\bm{w}$ , under ${\rm H}_{0,j}$ . Therefore, we can define an asymptotic selective $p$ -value for selective test (1) under ${\rm H}_{0,j}$ as follows:

[TABLE]

where $L_{n}$ and $U_{n}$ are evaluated at the realization of $\bm{w}=\bm{w}_{0}$ . Unfortunately, because $\bm{T}_{n}$ , $\Sigma_{n}$ , $L_{n}$ and $U_{n}$ are still dependent on the true value of $\bm{\beta}_{S}^{*}$ , we construct a test statistic by introducing a maximum likelihood estimator (5), which is a consistent estimator of $\bm{\beta}_{S}^{*}$ .
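As a rough illustration of how a $p$-value is read off a truncated normal pivot, the sketch below computes the CDF of ${\rm N}(0,\sigma^{2})$ truncated to $[L,U]$ at an observed value $t$ and converts it into a two-sided $p$-value. The two-sided form $2\min\{F,1-F\}$ is an assumption made for illustration and is not claimed to be the exact definition in (10); the function names are hypothetical, and the plain `erf`-based CDF is not designed for extreme tails.

```python
import math


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selective_p_value(t, sigma, lower, upper):
    """Two-sided p-value of t under N(0, sigma^2) truncated to [lower, upper]."""
    cdf_lower = normal_cdf(lower / sigma)
    cdf_upper = normal_cdf(upper / sigma)
    mass = cdf_upper - cdf_lower
    if mass <= 0.0:
        raise ValueError("truncation interval has no numerical mass")
    F = (normal_cdf(t / sigma) - cdf_lower) / mass
    return 2.0 * min(F, 1.0 - F)


# Without truncation the usual two-sided p-value is recovered (about 0.05).
print(selective_p_value(1.96, 1.0, -math.inf, math.inf))
# Truncating to [0, inf) makes the same observation less surprising (about 0.10).
print(selective_p_value(1.96, 1.0, 0.0, math.inf))
```

The second call shows the effect of conditioning: once the statistic is known to be positive because of the selection, the same observed value carries weaker evidence against the null.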

4.1 Computing Truncation Points

In practice, we need to compute truncation points in (9). When we utilize marginal screening for variable selection, it becomes difficult to compute $L_{n}$ and $U_{n}$ because $\tilde{A}$ becomes a $\{2K(d-K)+K\}\times K$ dimensional matrix. For example, even when $d=1{,}000$ and $K=20$ , we need to handle a 39,220 dimensional vector. To reduce the computational burden, we derive a simple form of (9) in this section.
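The size of the constraint system quoted above follows directly from the row count $2K(d-K)+K$. A one-line helper (the name `num_constraints` is hypothetical) reproduces the figure in the text:

```python
def num_constraints(d, K):
    """Number of rows of the selection matrix: 2K(d - K) + K."""
    return 2 * K * (d - K) + K


print(num_constraints(1000, 20))  # 39220
```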

We first derive $A_{S}$. As noted in Section 3.1, the selection event $\{(\hat{S},\bm{s}_{\hat{S}})=(S,\bm{s}_{S})\}$ can be rewritten as

[TABLE]

where $s_{j}={\rm sgn}(z_{j})$ is the sign of the $j$ -th element of $\bm{z}=X^{\top}\bm{y}$ . Let $S=\{j_{1},\ldots,j_{K}\}$ and $q=2(d-K)+1$ . Then, by a simple calculation, we have

[TABLE]

where $J$ is a $K\times K$ diagonal matrix whose $j$-th diagonal element is $s_{j}$ and $\otimes$ denotes the Kronecker product. Since $\tilde{A}=A_{S}\Sigma_{n}$ and $\bm{c}=\Sigma_{n}^{-1}\bm{\eta}/\sigma_{n}^{2}$, the denominator in (9) reduces to $\tilde{A}\bm{c}=A_{S}\bm{\eta}/\sigma_{n}^{2}$. For $\bm{\eta}=\bm{e}_{j}$, we can further evaluate $A_{S}\bm{\eta}$ as

[TABLE]

Further, by the definition of $\tilde{A},~{}\tilde{\bm{b}}$ , and $\bm{w}$ , we have

[TABLE]

Because $\sigma_{n}^{2}$ , the $j$ -th diagonal element of $\Sigma_{n}^{-1}$ , is positive, it is straightforward to observe that

[TABLE]

for $j=1,\ldots,K$ . Note that, for each $j=1,\ldots,K$ , $(A\bm{z})_{l=(j-1)q+1,\ldots,jq}$ consists of $q$ elements of $z_{j}$ and $z_{j}\pm z_{k}$ for any $k\in S^{\bot}$ . Therefore, for each $j=1,\ldots,K$ , we have

[TABLE]

As a consequence, we obtain

[TABLE]

if $s_{j}=1$ , and $L_{n}=-\infty$ , otherwise. Similarly, we obtain

[TABLE]

if $s_{j}=-1$ , and $U_{n}=\infty$ , otherwise. Because of this simple form, we can calculate truncation points efficiently. We summarize the algorithm to compute selective $p$ -values of the $K$ selective test in Algorithm 1.

4.2 Controlling Family-wise Error Rate

Since selective test (1) consists of $K$ hypotheses, we may be concerned about multiplicity when $K>1$ . In this case, instead of selective type I error, we control the family-wise error rate (FWER) in the sense of selective inference and we term it selective FWER.

For the selected variables $\hat{S}=S$, let us denote the family of true nulls by ${\cal H}=\{{\rm H}_{0,j}:\;j\in S,\ {\rm H}_{0,j}\text{ is a true null}\}$. Then, let us define the selective FWER by

[TABLE]

in the same way as the classic FWER. Next, we asymptotically control the selective FWER at level $\alpha$ by utilizing Bonferroni correction for $K$ selective tests. Specifically, we adjust selective $p$ -values (10) as follows. Let us define $\tilde{\alpha}=\alpha/K$ . Since selective $p$ -value $P_{n,j}$ is asymptotically distributed according to ${\rm Unif}(0,1)$ , we have that a limit superior of (13) can be bounded as follows:

[TABLE]

In the last inequality, we simply use $|{\cal H}|\leq K$. Accordingly, letting $p_{n,j}$ be a realization of (10), we reject a null hypothesis when $\{p_{n,j}\leq\tilde{\alpha}\}$. In the following, we refer to $\tilde{p}_{n,j}=\min\{1,Kp_{n,j}\}$ as an adjusted selective $p$-value. Note that we can utilize not only Bonferroni’s method but also other methods for correcting multiplicity such as Scheffé’s method, Holm’s method, and so on. We use Bonferroni’s method for expository purposes.
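The Bonferroni adjustment described above is simple enough to state in code. The sketch below computes the adjusted selective $p$-values $\min\{1,Kp_{n,j}\}$ and the equivalent rejection rule $p_{n,j}\leq\alpha/K$; the function names are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Adjusted p-values min{1, K * p_j} for K simultaneous tests."""
    K = len(p_values)
    return [min(1.0, K * p) for p in p_values]


def bonferroni_reject(p_values, alpha=0.05):
    """Reject the j-th null when p_j <= alpha / K."""
    K = len(p_values)
    return [p <= alpha / K for p in p_values]


p_demo = [0.001, 0.02, 0.5]
print(bonferroni_adjust(p_demo))
print(bonferroni_reject(p_demo))  # [True, False, False]
```

Only the first hypothesis is rejected at $\alpha=0.05$ with $K=3$, since the adjusted level is $\alpha/K\approx 0.0167$.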

5 Simulation Study

Through simulation studies, we explore the performance of the proposed method in Section 4, which we term ASICs (Asymptotic Selective Inference for Classification) here.

We first examine whether ASICs can control selective type I error. We also check the selective type I error when data splitting (DS) and nominal test (NT) methods are used. In DS, we first randomly divide the data into two disjoint sets. Then, after selecting $\hat{S}=S$ with $|S|=K$ by using one of these sets, we construct a test statistic $\bm{T}_{n}(\hat{\bm{\beta}}_{S})$ based on the other set and reject the $j$-th selective test (1) when $|T_{n,j}/\sigma_{n}|\geq z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$-percentile of a standard normal distribution. In NT, we cannot control type I errors since selection bias is ignored: it selects $K$ variables by marginal screening first, then rejects the $j$-th selective test (1) when $|T_{n,j}/\sigma_{n}|\geq z_{\alpha/2}$, where the entire data set is used for both the selection and inference steps. Finally, we explore whether ASICs can effectively control selective FWER and, at the same time, confirm its statistical power by comparing it with that of DS.

The simulation settings are as follows. As the $d$-dimensional regressors $\bm{x}_{i}$ ($i=1,\ldots,n$), we used vectors drawn from ${\rm N}(\bm{0},\Sigma)$, where $\Sigma$ is a $d\times d$ covariance matrix whose $(j,k)$-th element is set to $\rho^{|j-k|}$. We set $\rho=0$ and $\rho=0.5$ in Case 1 and Case 2, respectively. Note that the elements of $\bm{x}_{i}$ are independent in Case 1 but correlated in Case 2. Then, for each $\bm{x}_{i}$, we generate $y_{i}$ from ${\rm Bi}(\psi^{\prime}(\bm{x}_{i}^{\top}\bm{\beta}^{*}))$, where $\bm{\beta}^{*}$ is a $d$-dimensional true coefficient vector and ${\rm Bi}(p)$ is a Bernoulli distribution with parameter $p$. In the following, we conduct simulations using 1,000 Monte-Carlo runs. We use the glm function in R for parameter estimation.
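A data-generating process of this kind can be sketched with the Python standard library alone. The sketch relies on the fact that a stationary first-order autoregressive recursion with unit marginal variance has covariance $\rho^{|j-k|}$, so no matrix factorization is needed. The function name `simulate` and the use of Python in place of the authors' R code are assumptions for illustration.

```python
import math
import random


def sigmoid(t):
    """psi'(t) = 1 / (1 + exp(-t)), computed stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def simulate(n, d, beta, rho, rng):
    """Draw x_i ~ N(0, Sigma) with Sigma_jk = rho^|j-k| and y_i ~ Bernoulli(psi'(x_i^T beta))."""
    scale = math.sqrt(1.0 - rho ** 2)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(1, d):
            # AR(1) recursion keeps unit variance and gives Cov(x_j, x_k) = rho^|j-k|.
            x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
        p = sigmoid(sum(x_j * b_j for x_j, b_j in zip(x, beta)))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


rng = random.Random(0)
X, y = simulate(n=100, d=10, beta=[2.0] * 5 + [0.0] * 5, rho=0.5, rng=rng)
print(len(X), len(X[0]), sum(y))
```

Setting `rho=0` gives the independent design of Case 1, and `rho=0.5` gives the correlated design of Case 2.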

5.1 Controlling Selective Type I Error

To check if ASICs can control selective type I error, we consider a selective test (1). Specifically, we first select $K=1$ variable by marginal screening and then conduct a selective test at the 5% level. By setting $\bm{\beta}^{*}=\bm{0}\in\mathbb{R}^{d}$ , we can confirm selective type I error because the selective null is always true. Therefore, we assess the following index as an estimator of the selective type I error: letting $\beta$ be the selected variable in each simulation, we evaluate an average and standard deviation of

[TABLE]

where $I$ is an indicator function and ${\rm H}_{0}:\beta^{*}=0$ is a selective null. We construct a selective test at the 5% level in all simulations. In the same manner as classical type I error, it is desirable when the above index is less than or equal to 0.05, with particularly small values indicating that the selective test is overly conservative.

Table 1 presents averages and standard deviations of (14) based on 1,000 runs. It is clear that NT cannot control selective type I error; it becomes larger as the dimension $d$ increases. In addition, NT does not improve even if the sample size becomes large, because there exists selection bias in the selection step. On the other hand, both ASICs and DS adequately control selective type I error, although the latter appears slightly more conservative than the former. Moreover, unlike NT, these two methods can adequately control selective type I error, even when the covariance structure of $\bm{x}_{i}$ and the number of dimensions change.
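The qualitative pattern (a nominal test that ignores selection over-rejects, while a conditional test does not) can be reproduced in a much simpler toy setting. The sketch below is not the paper's logistic experiment: it assumes $d$ independent standard normal scores under a global null, selects the single largest $|z_{j}|$ (so $K=1$), and compares a naive two-sided test with a test based on the normal distribution truncated at the runner-up value. All names and settings are hypothetical.

```python
import math
import random


def upper_tail(x):
    """P(Z >= x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def one_run(d, rng, alpha=0.05):
    """Return (naive rejects, selective rejects) for one simulated data set."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    abs_sorted = sorted((abs(v) for v in z), reverse=True)
    top, runner_up = abs_sorted[0], abs_sorted[1]
    # Naive test: treats the selected score as if it had been fixed in advance.
    naive_p = 2.0 * upper_tail(top)
    # Selective test: given the selection, the signed score is a standard
    # normal truncated to [runner_up, inf).
    F = 1.0 - upper_tail(top) / upper_tail(runner_up)
    selective_p = 2.0 * min(F, 1.0 - F)
    return naive_p <= alpha, selective_p <= alpha


def rejection_rates(d, n_runs, seed=0):
    """Monte-Carlo estimates of the type I error of both tests."""
    rng = random.Random(seed)
    naive = selective = 0
    for _ in range(n_runs):
        r_naive, r_selective = one_run(d, rng)
        naive += r_naive
        selective += r_selective
    return naive / n_runs, selective / n_runs


print(rejection_rates(d=20, n_runs=2000))
```

For $d=20$ the naive rejection rate is close to $1-0.95^{20}\approx 0.64$, whereas the truncated version stays near the nominal 5% level, mirroring the contrast between NT and the selective approach.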

5.2 FWER and Power

Here, we explore selective FWER and statistical power with respect to ASICs and DS for $K$ selective tests (1), where we set $K=5,10,15$, and $20$. Note that, as discussed in the above section, NT is disregarded here because it does not adequately control selective type I error. We adjust for multiplicity by utilizing Bonferroni’s method as noted in Section 4.2.

The true coefficient vector is set to be $\bm{\beta}^{*}=(2\times\bm{1}_{5}^{\top},\bm{0}_{d-5}^{\top})^{\top}$ and $\bm{\beta}^{*}=(2\times\bm{1}_{5}^{\top},-2\times\bm{1}_{5}^{\top},\bm{0}_{d-10}^{\top})^{\top}$ in Model 1 and Model 2, respectively. In the following, we assess the indices as an estimator of selective FWER and power. Letting $\hat{S}=S$ be the subset of selected variables for each simulation, we evaluate an average of

[TABLE]

and

[TABLE]

where, for each $j\in S$, ${\rm H}_{0,j}:\beta_{j}^{*}=0$ is the selective null and ${\cal H}$ is the family of true nulls. Note that, by using Bonferroni’s method, we use $\tilde{\alpha}=\alpha/K$ as an adjusted significance level for $\alpha=0.05$. Similar to the selective type I error, it is desirable that (15) be less than or equal to $\alpha$. In addition, higher values of (16) are desirable, in the same manner as for classical power. We evaluate (16) as the ratio of the number of rejected false nulls to the number of true active variables. We employ this performance index because it is important to identify how many truly active variables are extracted in practice.
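The two performance indices can be computed per simulation run as in the sketch below, which follows the verbal description above: the FWER indicator is 1 when at least one true null among the selected variables is rejected, and the power index is the number of rejected false nulls divided by the number of true active variables. The function names and the set-based representation are assumptions; averaging over Monte-Carlo runs gives the reported quantities.

```python
def fwer_indicator(rejected, active):
    """1 if at least one rejected hypothesis is a true null, else 0."""
    return int(any(j not in active for j in rejected))


def power_index(rejected, active):
    """Number of rejected false nulls divided by the number of true active variables."""
    return sum(1 for j in rejected if j in active) / len(active)


active_demo = {0, 1, 2, 3, 4}          # indices of truly nonzero coefficients
rejected_demo = [0, 1, 7]              # hypotheses rejected in one run
print(fwer_indicator(rejected_demo, active_demo))  # 1
print(power_index(rejected_demo, active_demo))     # 0.4
```

Note that the denominator is the number of true active variables rather than the number of selected ones, so the index is capped at $K/|S^{*}|$ whenever $K<|S^{*}|$.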

Figure 2 shows the average (15) for each method. ASICs and DS are both evaluated with respect to four values of $K$ , thus eight lines are plotted in each graph. Because of the randomness of simulation, some of the ASICs results are larger than 0.05 especially in small sample size and large variable dimension cases. For both methods, it is clear that selective FWER tends to be controlled at the desired significance level, although DS is more conservative than ASICs. To accord with our asymptotic theory, the number of selected variables must be $K={\rm o}(n^{1/3})$ , which means that the normal approximation is not ensured in the case of $K=15$ and $20$ . However, we observe that selective FWER is correctly controlled even in these cases, which suggests that assumptions (C3) and (C4) can be relaxed.

Figures 3 and 4 show the average of (16) for each method and setting in Model 1 and Model 2, respectively. In Case 1 of Figure 3, ASICs and DS have almost the same power for each $K$ and $d$. In addition, ASICs is clearly superior to DS in Case 2. This is reasonable since DS uses only half of the data for inference. On the other hand, in all cases, the power of ASICs becomes higher as the number of selected variables $K$ decreases. This can be explained by the condition (C3), that is, we need a much larger sample size when $K$ becomes large to ensure the asymptotic result in Theorem 2. In Figure 4, it is clear that the power of ASICs is superior in almost all settings. However, neither ASICs nor DS appears to perform well when $K=5$. In this case, the power of ASICs and DS cannot exceed $50\%$. This is because we can select at most $5$ of the $10$ truly nonzero variables.

6 Empirical Applications

We further explore the performance of the proposed method by applying it to several empirical data sets, all of which are available at LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). In all experiments, we standardize the design matrix $X$ to make the scale of each variable the same. We report adjusted selective $p$-values for selected variables. To explore the selection bias, we also report naive adjusted $p$-values. That is, we first compute $p$-values for selected variables based on NT, then we adjust these $p$-values by multiplying them by the number of selected variables. The results are plotted in Figures 5 – 7. The results show that almost all adjusted nominal $p$-values are smaller than those of selective inference, and the difference between these $p$-values is interpreted as the effect of selection bias.

7 Theoretical Analysis

In this section, we provide proofs of the theoretical results derived herein. We use the notation $p\lesssim q$, which means that there exists a constant $r>0$ such that $p\leq rq$; $p\gtrsim q$ is defined similarly. All proofs are based on a fixed $S~(\supset S^{*})$; thus we simply denote $\hat{\bm{\beta}}_{S}$ and $X_{S}$ by $\hat{\bm{\beta}}$ and $X$, respectively. This is because we need to verify several asymptotic conditions before selection in the same way as in Tian and Taylor (2017); Taylor and Tibshirani (2018).

7.1 Proof of (6)

Let $\alpha_{n}=\sqrt{K/n}$ and define a $K$ dimensional vector $\bm{u}$ satisfying $\|\bm{u}\|=C$ for a sufficiently large $C>0$ . The concavity of $\ell_{n}$ implies

[TABLE]

and thus we need to show that for any $\varepsilon>0$ , there exists a sufficiently large $C>0$ such that

[TABLE]

In fact, the above inequality implies that $\hat{\bm{\beta}}\in\{\bm{\beta}^{*}+\alpha_{n}\bm{u};\;\|\bm{u}\|\leq C\}$, that is, $\|\hat{\bm{\beta}}-\bm{\beta}^{*}\|={\rm O}_{\rm p}(\alpha_{n})$.

Observe that $|\psi^{\prime}(\bm{x}_{i}^{\top}\bm{\beta})|,|\psi^{\prime\prime}(\bm{x}_{i}^{\top}\bm{\beta})|$ and $|\psi^{\prime\prime\prime}(\bm{x}_{i}^{\top}\bm{\beta})|$ are bounded uniformly with respect to $\bm{\beta}\in{\cal B}$ and $i$ . By using Taylor’s theorem, we have

[TABLE]

where for $i=1,2,\ldots,n$ , $\theta_{i}$ is in the line segment between $\bm{x}_{i}^{\top}\bm{\beta}^{*}$ and $\bm{x}_{i}^{\top}(\bm{\beta}^{*}+\alpha_{n}\bm{u})$ . From (C1) and (C2), we observe that

[TABLE]

and thus we have $|I_{1}|={\rm O}_{\rm p}(\alpha_{n}\sqrt{n}\|\bm{u}\|)={\rm O}_{\rm p}(\sqrt{K}\|\bm{u}\|)$ . Next, by using (C1) again, $I_{2}$ can be bounded as

[TABLE]

Finally, for $I_{3}$ , we have

[TABLE]

Combining all the above, if $K^{2}/n\to 0$ is satisfied, we observe that for sufficiently large $C$, $I_{1}$ and $I_{3}$ are dominated by $I_{2}~(<0)$. As a result, we obtain (17).

Remark 2

From (6) and (2), we have

[TABLE]

and thus, with probability tending to 1, $\hat{\bm{\beta}}\in{\cal B}$ holds.

7.2 Proof of Theorem 1

First, we prove that $\sqrt{n}(\hat{\bm{\beta}}-\bm{\beta}^{*})$ is asymptotically equivalent to $\Sigma_{n}^{-1}\bm{s}_{n}$ . By using Taylor’s theorem, we have

[TABLE]

where for $i=1,2,\ldots,n$ , $\tilde{\theta}_{i}$ is in the line segment between $\bm{x}_{i}^{\top}\bm{\beta}^{*}$ and $\bm{x}_{i}^{\top}\hat{\bm{\beta}}$ . In addition, (18) can be rewritten as

[TABLE]

where

[TABLE]

Noting that, from (C1),

[TABLE]

(C1), (C3) and (6) imply

[TABLE]

Now we can prove the asymptotic normality of $\sigma_{n}^{-1}\Sigma_{n}^{-1}\bm{s}_{n}$ . For any $K$ dimensional vector $\bm{\eta}$ with $\|\bm{\eta}\|<\infty$ , define $\sigma_{n}^{2}=\bm{\eta}^{\top}\Sigma_{n}^{-1}\bm{\eta}$ and $\omega_{n}$ such that

[TABLE]

Then, since $S\supset S^{*}$ , we observe that

[TABLE]

and

[TABLE]

To state the asymptotic normality of $\sigma_{n}^{-1}\Sigma_{n}^{-1}\bm{s}_{n}$ , we check the Lindeberg condition for $\omega_{n}$ : for any $\varepsilon>0$ ,

[TABLE]

For any $\varepsilon>0$ , we have

[TABLE]

By using the Cauchy-Schwarz inequality and (C1),

[TABLE]

Noting that each $y_{i}$ is distributed according to a Bernoulli distribution with parameter $\psi^{\prime}$ , ${\rm E}[(y_{i}-\psi^{\prime}_{i})^{4}]$ is uniformly bounded on ${\cal B}$ for any $i=1,\ldots,n$ by a simple calculation. Thus, by using the Cauchy-Schwarz inequality and Chebyshev’s inequality, we have

[TABLE]

Finally, since

[TABLE]

we have

[TABLE]

From (C3), this implies the Lindeberg condition (19).

7.3 Proof of Theorem 2

First, we prove that, for any $K$ dimensional vector $\bm{\eta}$ , the selection event can be expressed as an inequality with respect to $\bm{\eta}^{\top}\bm{T}_{n}$ . Let us define $\bm{w}=(I_{K}-\bm{c}\bm{\eta}^{\top})\bm{T}_{n}$ , where $\bm{c}=\Sigma_{n}^{-1}\bm{\eta}/\sigma_{n}^{2}$ . Then, since $\bm{T}_{n}=(\bm{\eta}^{\top}\bm{T}_{n})\bm{c}+\bm{w}$ , we have

[TABLE]

and this implies the former result in Theorem 2.

To prove the theorem, we need to verify asymptotic independency between $(L_{n},U_{n},N_{n})$ and $\bm{\eta}^{\top}\bm{T}_{n}$ . By the definition of $\bm{w}$ and Theorem 1,

[TABLE]

is asymptotically distributed according to a Gaussian distribution. Thus, $\bm{w}$ and $\bm{\eta}^{\top}\bm{T}_{n}$ are asymptotically independent since

[TABLE]

Now we only need to prove asymptotic independency between $\tilde{\bm{b}}$ and $\bm{\eta}^{\top}\bm{T}_{n}$ . Letting $\bm{\psi}^{\prime}=\bm{\psi}^{\prime}(\bm{\beta}^{*})$ , the definition of $\bm{T}_{n}$ and $\Sigma_{n}$ imply

[TABLE]

and thus

[TABLE]

Then, we observe that

[TABLE]

Since $\tilde{\bm{b}}$ can be expressed as a linear combination of $\bm{T}_{n}$ as well as $\bm{w}$ , the theorem holds when the covariance between $\tilde{\bm{b}}$ and $\bm{\eta}^{\top}\bm{T}_{n}$ converges to 0 as $n$ goes to infinity. By noting that $\Sigma_{n}=X_{S}^{\top}\Psi X_{S}/n$ , we have

[TABLE]

In addition, letting $\bm{a}=(1,-1)^{\top}$ , it is straightforward that

[TABLE]

by the definition of the selection event, where $\tilde{J}=(\bm{0}_{d-K},I_{d-K}\otimes\bm{a}^{\top})^{\top}$ . This implies $A_{S^{\bot}}^{\top}A_{S^{\bot}}=2KI_{d-K}$ . Finally, (C1), (C3), and (C4) imply

[TABLE]

and this proves the asymptotic independency between $\tilde{\bm{b}}$ and $\bm{\eta}^{\top}\bm{T}_{n}$ .
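The identity $A_{S^{\bot}}^{\top}A_{S^{\bot}}=2KI_{d-K}$ used above can be checked numerically for small sizes. The sketch below assumes that $A_{S^{\bot}}$ stacks $K$ copies of $\tilde{J}$, each consisting of one zero row (the sign constraint) followed by the rows $\pm\bm{e}_{k}^{\top}$ for every unselected variable $k$; this block layout is inferred from the form of the selection event and is an assumption, as are the function names.

```python
def complement_block(K, m):
    """Rows of A restricted to the m = d - K unselected columns."""
    rows = []
    for _ in range(K):
        rows.append([0.0] * m)          # sign-constraint row: no unselected entries
        for k in range(m):
            for sgn in (1.0, -1.0):     # rows for z_k - s_j z_j and -z_k - s_j z_j
                row = [0.0] * m
                row[k] = sgn
                rows.append(row)
    return rows


def gram(B):
    """Compute B^T B for a matrix given as a list of rows."""
    m = len(B[0])
    return [[sum(row[a] * row[b] for row in B) for b in range(m)] for a in range(m)]


G = gram(complement_block(K=3, m=4))
print(G[0])  # [6.0, 0.0, 0.0, 0.0]
```

Each unselected column contains exactly $K$ entries equal to $+1$ and $K$ entries equal to $-1$, in rows that no other unselected column shares, which gives the diagonal value $2K$ and zero off-diagonal entries.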

8 Concluding Remarks and Future Research

Recently, methods for data driven science such as selective inference and adaptive data analysis have become increasingly important, as described by Barber and Candès (2016). Although there are several approaches for carrying out post-selection inference, we have developed a selective inference method for high dimensional classification problems, based on the work in Lee et al. (2016). In the same way as that seminal work, the polyhedral lemma (Lemma 1) plays an important role in our study. By considering high dimensional asymptotics concerning sample size and the number of selected variables, we have shown that a similar result to the polyhedral lemma holds even for high dimensional logistic regression problems. As a result, we could construct a pivotal quantity whose sampling distribution is represented as a truncated normal distribution which converges to a standard uniform distribution. In addition, through simulation experiments, it has been shown that the performance of our proposed method is, in almost all cases, superior to that of other methods such as data splitting.

As suggested by the results from the simulation experiments, conditions might be relaxed to accommodate more general settings. In terms of future research in this domain, while we considered the logistic model in this paper, it is important to extend the results to other models, for example, generalized linear models. Further, higher order interaction models are also crucial in practice. In this situation, the size of the matrix in the selection event becomes very large, and thus it is cumbersome to compute truncation points in the polyhedral lemma. Suzumura et al. (2017) have shown that selective inference can be constructed in such a model by utilizing a pruning algorithm. In this respect, it is desirable to extend their result not only to linear regression modeling contexts but also to other models.

Bibliography

Barber, R. F. and Candès, E. J. (2016) “A knockoff filter for high-dimensional selective inference,” arXiv preprint arXiv:1602.03574.
Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013) “Valid post-selection inference,” The Annals of Statistics, Vol. 41, pp. 802–837.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009) “Simultaneous analysis of Lasso and Dantzig selector,” The Annals of Statistics, Vol. 37, pp. 1705–1732.
Breiman, L. (1992) “The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error,” Journal of the American Statistical Association, Vol. 87, pp. 738–754.
Cox, D. (1975) “A note on data-splitting for the evaluation of significance levels,” Biometrika, Vol. 62, pp. 441–444.
Dasgupta, S., Khare, K., and Ghosh, M. (2014) “Asymptotic expansion of the posterior density in high dimensional generalized linear models,” Journal of Multivariate Analysis, Vol. 131, pp. 126–148.
Dickhaus, T. (2014) Simultaneous statistical inference. With applications in the life sciences, Heidelberg: Springer.
