The cumulative Kolmogorov filter for model-free screening in ultrahigh dimensional data

Arlene K. H. Kim; Seung Jun Shin

arXiv:1701.01560·stat.ME·August 20, 2019

TL;DR

This paper introduces a cumulative Kolmogorov filter that enhances model-free screening in ultrahigh-dimensional data by improving theoretical properties and demonstrating better finite sample performance.

Contribution

It develops a new cumulative Kolmogorov filter that extends the fused Kolmogorov filter with cumulative slicing, offering improved asymptotic results and practical performance.

Findings

1. Enhanced finite sample performance demonstrated numerically

2. Improved asymptotic results under relaxed assumptions

3. Extension of the fused Kolmogorov filter with cumulative slicing

Abstract

We propose a cumulative Kolmogorov filter to improve the fused Kolmogorov filter proposed by Mai and Zou (2015) via cumulative slicing. We establish an improved asymptotic result under relaxed assumptions and numerically demonstrate its enhanced finite sample performance.

Tables

Table 1: Average number of minimum variables needed to keep all informative ones over 100 independent repetitions. Standard deviations are in parentheses.

| Model | $d$ | SIS | DCS | FKF | CKF |
|---|---|---|---|---|---|
| 1 | 2 | 2.00 (0.00) | 2.00 (0.00) | 3.79 (6.28) | 2.00 (0.00) |
| 2 | 2 | 2038.12 (1348.05) | 1985.10 (1460.82) | 4.62 (9.14) | 2.00 (0.00) |
| 3 | 2 | 891.22 (1071.58) | 350.88 (794.67) | 3.88 (6.96) | 2.00 (0.00) |
| 4 | 10 | 10.04 (0.20) | 10.04 (0.20) | 10.26 (1.09) | 10.06 (0.24) |
| 5 | 10 | 150.10 (351.46) | 12.50 (10.42) | 10.23 (0.49) | 10.11 (0.35) |
| 6 | 10 | 1618.50 (1423.11) | 927.16 (916.20) | 10.81 (4.27) | 10.03 (0.17) |
| 7 | 2 | 1051.14 (1473.43) | 682.47 (965.43) | 2.00 (0.00) | 2.00 (0.00) |
| 8 | 3 | 2980.23 (1494.26) | 277.43 (606.47) | 9.05 (18.69) | 6.66 (11.27) |
| 9 | 8 | 3562.30 (1252.76) | 231.63 (526.51) | 60.84 (126.12) | 38.59 (52.58) |

Equations

$$S^{*}=\{j:F(y|{\mathbf{X}})\mbox{ functionally depends on }X_{j}\mbox{ for some }y\},$$

$$\kappa_{j}=\sup_{x}|P(X_{j}\leq x|Y=1)-P(X_{j}\leq x|Y=-1)|,\quad j=1,\ldots,p,$$

$$\kappa_{j}^{\mathcal{G}}=\max_{l,m}\sup_{x}|P(X_{j}\leq x|\tilde{Y}=m)-P(X_{j}\leq x|\tilde{Y}=l)|,\quad j=1,\ldots,p.$$

$$k_{j}(x)=\sup_{y}|F(y|X_{j}>x)-F(y|X_{j}\leq x)|,\quad j=1,\ldots,p.$$

$$k_{j}(x)=\frac{1}{P(X_{j}\leq x)(1-P(X_{j}\leq x))}\sup_{y}|P(X_{j}\leq x)P(Y\leq y)-P(Y\leq y,X_{j}\leq x)|,$$

$$K_{j}=E\left[k_{j}(\tilde{X}_{j})\right],\quad\mbox{for }j=1,\ldots,p,$$

$$\hat{K}_{j}=\frac{1}{n}\sum_{i=1}^{n}\hat{k}_{j}(X_{ij}).$$

$$\hat{S}(d_{n})=\{j:\hat{K}_{j}\mbox{ is among the first $d_{n}$ largest of all $\hat{K}_{j},j=1,\cdots,p$}\}.$$

$$\Delta_{S}=\min_{j\in S}K_{j}-\max_{j\notin S}K_{j}>0.$$

$$P(S^{*}\subset\hat{S}(d_{n}))\geq 1-\eta,$$

$$\eta=p\left(4n\exp(-n\Delta_{S}^{2}/128)+2\exp(-n\Delta_{S}^{2}/16)\right).$$

$$G(y):=\Phi(x)\Phi(y)-\int_{-\infty}^{y}\Phi\Big(\frac{x-\rho_{j}u}{\sqrt{1-\rho_{j}^{2}}}\Big)\phi(u)du.$$

$$k_{j}(x)=\frac{1}{\Phi(x)(1-\Phi(x))}\sup_{y}|G(y)|.$$

$$\frac{\partial k_{j}(x)}{\partial\rho_{j}}=\frac{\mathrm{sgn}(\rho_{j})}{\Phi(x)(1-\Phi(x))}\frac{(1-\rho_{j}^{2})^{-1/2}}{2\pi}\exp(-x^{2}h(\rho_{j})),$$

$$\hat{K}_{j}>K_{j}-\Delta_{S}\geq\min_{j\in S}K_{j}-\Delta_{S}$$

$$\hat{K}_{j}<K_{j}+\Delta_{S}\leq\max_{j\notin S}K_{j}+\Delta_{S}$$

$$P\Big(\max_{j\in\{1,\ldots,p\}}|\hat{K}_{j}-K_{j}|<\Delta_{S}\Big)\geq 1-p\left(4n\exp(-n\Delta_{S}^{2}/128)+2\exp(-n\Delta_{S}^{2}/16)\right).$$

$$P(|\hat{K}_{j}-K_{j}|\geq\epsilon)\leq 4n\exp(-n\epsilon^{2}/128)+2\exp(-n\epsilon^{2}/16).$$

$$P(|\hat{K}_{j}-K_{j}|\geq\epsilon)\leq P\Big(\Big|\sum_{\ell}\hat{k}_{j}(X_{\ell j})-\sum_{\ell}k_{j}(X_{\ell j})\Big|\geq\frac{n\epsilon}{2}\Big)+P\Big(\Big|\frac{1}{n}\sum_{\ell}k_{j}(X_{\ell j})-Ek_{j}(\tilde{X}_{j})\Big|\geq\frac{\epsilon}{2}\Big):=(i)+(ii).$$

$$(i)=P\Big(\Big|\sum_{\ell=1}^{n}\hat{k}_{j}(X_{\ell j})-\sum_{\ell=1}^{n}k_{j}(X_{\ell j})\Big|\geq\frac{n\epsilon}{2}\Big)\leq\sum_{\ell\neq\ell^{\prime}}P\Big(\Big|\hat{k}_{j}(X_{\ell j})-k_{j}(X_{\ell j})\Big|\geq\epsilon_{\tilde{\ell}}\Big)$$

$$P\left(\Big|\hat{k}_{j}(x)-k_{j}(x)\Big|\geq\epsilon_{\tilde{\ell}}\,\Big|\,X_{1j},\ldots,X_{nj}\right)\leq 2\exp(-n_{+}\epsilon_{\tilde{\ell}}^{2}/2)+2\exp(-n_{-}\epsilon_{\tilde{\ell}}^{2}/2)$$

$$\sum_{\ell\neq\ell^{\prime}}P\left(\Big|\hat{k}_{j}(X_{\ell j})-k_{j}(X_{\ell j})\Big|\geq\epsilon_{\tilde{\ell}}\right)\leq\sum_{\ell=1}^{n-1}\Big(2\exp(-(n-\ell)\epsilon_{\ell}^{2}/2)+2\exp(-\ell\epsilon_{\ell}^{2}/2)\Big).$$

$$(i)\leq 4\sum_{\ell=1}^{n-1}\exp(-\ell\epsilon_{\ell}^{2}/2)=4\sum_{\ell=1}^{n-1}\exp(-n\epsilon^{\prime 2}/8)\leq 4n\exp(-n\epsilon^{2}/128),$$

Full text

The cumulative Kolmogorov filter for model-free screening in ultrahigh dimensional data

Arlene K. H. Kim and Seung Jun Shin

University of Cambridge and Korea University

Abstract

We propose a cumulative Kolmogorov filter to improve the fused Kolmogorov filter proposed by Mai and Zou (2015) via cumulative slicing. We establish an improved asymptotic result under relaxed assumptions and numerically demonstrate its enhanced finite sample performance.

Keywords: cumulative slicing; Kolmogorov filter; model-free marginal screening

1 Introduction

Since Fan and Lv (2008), marginal feature screening has been regarded as a canonical tool in ultrahigh-dimensional data analysis. Let $Y$ be a univariate response and ${\mathbf{X}}=(X_{1},\ldots,X_{p})^{T}$ be a $p$-dimensional covariate. We assume that only a small subset of the covariates is informative for explaining $Y$. In particular, we assume $|S^{*}|=d\ll p$ where

$$S^{*}=\{j:F(y|{\mathbf{X}})\mbox{ functionally depends on }X_{j}\mbox{ for some }y\},$$

with $F(\cdot|{\mathbf{X}})$ being the conditional distribution function of $Y|{\mathbf{X}}$. Such an assumption is reasonable since including a large number of variables with weak signals often deteriorates model performance due to accumulated estimation errors.

Since the introduction of Fan and Lv (2008), numerous marginal screening methods have been developed (see Section 1 of Mai and Zou (2015) for a comprehensive summary). Among these methods, model-free screening (Zhu et al., 2011; Li et al., 2012; Mai and Zou, 2015) is desirable since the screening is a pre-processing procedure followed by a main statistical analysis.

For feature selection in binary classification, the Kolmogorov filter (KF) was proposed by Mai and Zou (2012). For each $X_{j},j=1,\ldots,p$, KF computes

$$\kappa_{j}=\sup_{x}|P(X_{j}\leq x|Y=1)-P(X_{j}\leq x|Y=-1)|,\quad j=1,\ldots,p, \tag{1}$$

and selects the variables with large $\kappa_{j}$'s among all $j=1,\ldots,p$. A sample version of $\kappa_{j}$ is obtained by replacing the probability measure with its empirical counterpart, which leads to the well-known Kolmogorov–Smirnov statistic, from which the filter takes its name. KF shows impressive performance in binary classification.
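The sample version of $\kappa_{j}$ can be illustrated with a short Python sketch that computes the two-sample Kolmogorov–Smirnov distance for each covariate and keeps the $d$ largest. The function names `ks_two_sample` and `kolmogorov_filter` and the list-of-rows data layout are assumptions made here for illustration; they are not taken from Mai and Zou (2012).

```python
from bisect import bisect_right

def ks_two_sample(a, b):
    """sup_x |F_a(x) - F_b(x)| computed from the two empirical cdfs."""
    a, b = sorted(a), sorted(b)
    # The supremum is attained at an observed point, so scan the pooled sample.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def kolmogorov_filter(X, y, d):
    """Rank the columns of X (a list of rows) by the sample kappa_j; keep the top d.

    y holds labels in {+1, -1}; both classes are assumed to be non-empty.
    """
    p = len(X[0])
    kappa = []
    for j in range(p):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == -1]
        kappa.append(ks_two_sample(pos, neg))
    order = sorted(range(p), key=lambda j: -kappa[j])
    return kappa, order[:d]

# Toy data: column 0 separates the classes perfectly, column 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [1.1, 2.0]]
y = [-1, -1, 1, 1]
kappa, selected = kolmogorov_filter(X, y, d=1)
print(kappa, selected)  # [1.0, 0.5] [0]
```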

Recently, Mai and Zou (2015) extended the idea of KF beyond the binary response by slicing the data into $G$ pieces according to the value of $Y$. In particular, given knots $\mathcal{G}=\{(-\infty=)a_{0}<a_{1}<\ldots<a_{G}(=\infty)\}$, a pseudo response $\tilde{Y}$ is defined to take the value $g$ if $Y\in(a_{g-1},a_{g}]$, for $g=1,\ldots,G$. Following the spirit of KF, one can select a set of variables with large values of

$$\kappa_{j}^{\mathcal{G}}=\max_{l,m}\sup_{x}|P(X_{j}\leq x|\tilde{Y}=m)-P(X_{j}\leq x|\tilde{Y}=l)|,\quad j=1,\ldots,p. \tag{2}$$

However, information loss is inevitable due to the lower resolution of the pseudo response $\tilde{Y}$ compared to $Y$, regardless of the choice of $\mathcal{G}$. To tackle this, Mai and Zou (2015) proposed the fused Kolmogorov filter (FKF), which combines $\kappa_{j}^{\mathcal{G}}$ over $N$ different sets of knots $\mathcal{G}_{1},\ldots,\mathcal{G}_{N}$ and selects variables with large values of $\kappa_{j}^{\text{fused}}=\sum_{\ell=1}^{N}\kappa_{j}^{\mathcal{G}_{\ell}}$, for $j=1,\ldots,p$. The source of improvement in FKF is clear; however, it cannot completely overcome the information loss caused by slicing. In addition, deciding how to slice the data is a subtle matter in finite samples. To this end, we propose the cumulative Kolmogorov filter (CKF). CKF minimizes the information loss from the slicing step and is free from the choice of slices. As a consequence, it improves on FKF.

2 Cumulative Kolmogorov filter

We let $F(\cdot|X_{j})$ denote the conditional distribution function of $Y$ given $X_{j}$ . Given $x$ such that $0<P(X_{j}\leq x)<1$ , define

$$k_{j}(x)=\sup_{y}|F(y|X_{j}>x)-F(y|X_{j}\leq x)|,\quad j=1,\ldots,p. \tag{3}$$

We remark that (3) is identical to (2) with $\mathcal{G}=\{-\infty,x,\infty\}$ except that the sliced variable in (3) is $X_{j}$ instead of $Y$. The choice of slicing variable between $X_{j}$ and $Y$ is not crucial; however, it is more natural to slice the independent variable in a regression setup whose target is $E(Y|{\mathbf{X}})$. Now,

$$k_{j}(x)=\frac{1}{P(X_{j}\leq x)(1-P(X_{j}\leq x))}\sup_{y}|P(X_{j}\leq x)P(Y\leq y)-P(Y\leq y,X_{j}\leq x)|,$$

which immediately yields $k_{j}(x)=0$ for all $x$ satisfying $0<P(X_{j}\leq x)<1$ if and only if $X_{j}$ and $Y$ are independent. In fact, $k_{j}(x)$ indicates the level of dependence as shown in the following lemma.
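The product form of $k_{j}(x)$ follows from the definition of the two conditional distribution functions. The short derivation below spells out this step; the shorthand $\pi_{x}=P(X_{j}\leq x)$ and $H(y,x)=P(Y\leq y,X_{j}\leq x)$ is introduced here only for readability and does not appear in the paper.

```latex
\begin{align*}
F(y\mid X_j\le x) &= \frac{H(y,x)}{\pi_x}, \qquad
F(y\mid X_j> x) = \frac{P(Y\le y)-H(y,x)}{1-\pi_x},\\[4pt]
F(y\mid X_j> x)-F(y\mid X_j\le x)
 &= \frac{\pi_x\{P(Y\le y)-H(y,x)\}-(1-\pi_x)H(y,x)}{\pi_x(1-\pi_x)}\\
 &= \frac{\pi_x P(Y\le y)-H(y,x)}{\pi_x(1-\pi_x)}.
\end{align*}
```

Taking absolute values and the supremum over $y$ gives the displayed expression. The numerator vanishes for every $y$ and every admissible $x$ exactly when the joint distribution function factorizes, which is the independence statement above.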

Lemma 2.1

Suppose $(X_{j},Y)$ has a bivariate Gaussian copula distribution, that is, $(g_{1}(X_{j}),g_{2}(Y))$ is jointly normal with correlation $\rho_{j}=\text{Cor}(g_{1}(X_{j}),g_{2}(Y))$ after transformation via two monotone functions $g_{1},g_{2}$, and $g_{1}(X_{j})$ and $g_{2}(Y)$ are marginally standard normal. Then

1. $k_{j}(x)=1$ if $|\rho_{j}|=1$ and $k_{j}(x)=0$ if $\rho_{j}=0$.

2. Denoting $y^{*}=x\big(\frac{1-\sqrt{1-\rho_{j}^{2}}}{\rho_{j}}\big)$,

[TABLE]

3. For each $x$, $k_{j}(x)$ is a strictly increasing function of $|\rho_{j}|$.

Nonetheless, (3) loses a considerable amount of information through the dichotomization of $X_{j}$. To overcome this, we define

$$K_{j}=E\left[k_{j}(\tilde{X}_{j})\right],\quad\mbox{for }j=1,\ldots,p, \tag{4}$$

where $\tilde{X}_{j}$ denotes an independent copy of $X_{j}$. At the population level, (4) fuses infinitely many KFs over all possible dichotomizations of $X_{j}$. By doing this, we not only minimize efficiency loss but also avoid the choice of knot sets. A similar idea was first proposed by Zhu et al. (2010) in the context of sufficient dimension reduction, where the slicing scheme has been regarded as a canonical approach.

Given $(Y_{i},{\mathbf{X}}_{i}),i=1,\ldots,n$ where ${\mathbf{X}}_{i}=(X_{i1},\ldots,X_{ip})^{T}$ , a sample version of (3) is $\hat{k}_{j}(x)=\sup_{y}\left|\hat{F}(y|X_{j}>x)-\hat{F}(y|X_{j}\leq x)\right|$ where $\hat{F}(y|X_{j}>x)=\frac{\sum_{i=1}^{n}\mathds{1}_{\{Y_{i}\leq y,X_{ij}>x\}}}{\sum_{i=1}^{n}\mathds{1}_{\{X_{ij}>x\}}}$ and $\hat{F}(y|X_{j}\leq x)$ is similarly defined. Following the convention, we regard $0/0=0$ . Now, an estimator of (4) is given by

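$$\hat{K}_{j}=\frac{1}{n}\sum_{i=1}^{n}\hat{k}_{j}(X_{ij}). \tag{5}$$

Finally, for $d_{n}\in\mathbb{N}$, we propose CKF to select the following set

$$\hat{S}(d_{n})=\{j:\hat{K}_{j}\mbox{ is among the first $d_{n}$ largest of all $\hat{K}_{j},j=1,\cdots,p$}\}.$$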
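The estimator $\hat{K}_{j}$ and the selected set $\hat{S}(d_{n})$ can be sketched in a few lines of Python. This is a direct, unoptimized reading of the definitions in this section, not the authors' implementation; the names `k_hat`, `K_hat` and `ckf_select` are hypothetical. The $0/0=0$ convention is handled by noting that an empty slice has an empirical distribution function identically equal to 0, so the supremum equals 1 at the sample maximum.

```python
from bisect import bisect_right

def k_hat(xs, ys, x):
    """Sample version of k_j(x): sup_y |F^(y | X_j > x) - F^(y | X_j <= x)|."""
    lo = sorted(y for xi, y in zip(xs, ys) if xi <= x)
    hi = sorted(y for xi, y in zip(xs, ys) if xi > x)
    if not lo or not hi:
        # With 0/0 = 0 the empty slice has cdf 0 while the other reaches 1.
        return 1.0
    # The supremum over y is attained at one of the observed responses.
    return max(
        abs(bisect_right(hi, y) / len(hi) - bisect_right(lo, y) / len(lo))
        for y in ys
    )

def K_hat(xs, ys):
    """Average of k_hat over the observed values of the covariate."""
    return sum(k_hat(xs, ys, x) for x in xs) / len(xs)

def ckf_select(X, ys, d_n):
    """Return all scores and the index set of the d_n largest ones."""
    p = len(X[0])
    scores = [K_hat([row[j] for row in X], ys) for j in range(p)]
    order = sorted(range(p), key=lambda j: -scores[j])
    return scores, set(order[:d_n])

# Toy data: column 0 is a monotone function of the response, column 1 is shuffled.
ys = [1, 2, 3, 4, 5, 6]
X = [[1, 3], [2, 6], [3, 1], [4, 4], [5, 2], [6, 5]]
scores, selected = ckf_select(X, ys, d_n=1)
```

Each call to `k_hat` sorts the two slices, so scoring one covariate costs on the order of $n^{2}\log n$ operations in this sketch; a practical implementation would reuse the ordering of $X_{j}$ across cut points.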

3 The Sure Screening Property

We assume a regularity condition.

Assumption 3.1

There exists a nondegenerate set $S$ such that $S^{*}\subseteq S$ and

$$\Delta_{S}=\min_{j\in S}K_{j}-\max_{j\notin S}K_{j}>0.$$

Assumption 3.1 is similar to the regularity condition (C1) for FKF (Mai and Zou, 2015). In fact, FKF requires one additional condition guaranteeing that the estimated slices are not very different from the oracle slices based on population quantiles of $Y$, which is not necessary for CKF since it is free from the slice choice. KF with a binary response requires only one assumption similar to Assumption 3.1.

Theorem 3.2

Under Assumption 3.1, when $d_{n}\geq|S|$ and $\Delta_{S}>4/n$ ,

$$P(S^{*}\subset\hat{S}(d_{n}))\geq 1-\eta,$$

where

$$\eta=p\left(4n\exp(-n\Delta_{S}^{2}/128)+2\exp(-n\Delta_{S}^{2}/16)\right).$$

The sure screening probability tends to 1 when $\Delta_{S}\gg\sqrt{\frac{\log(pn)}{n}}$.
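To get a feel for the rate in Theorem 3.2, the bound $\eta$ can be evaluated numerically. The sketch below only evaluates the stated formula; the function name `eta` and the values of $n$, $p$ and $\Delta_{S}$ are illustrative choices, not quantities reported in the paper.

```python
import math

def eta(n, p, delta):
    """Upper bound on the failure probability in Theorem 3.2."""
    return p * (
        4 * n * math.exp(-n * delta**2 / 128)
        + 2 * math.exp(-n * delta**2 / 16)
    )

# With p = 5000 and a gap of 0.5 the bound is vacuous (larger than 1) at n = 200
# and becomes informative only for much larger n.
for n in (200, 2000, 20000):
    print(n, eta(n, 5000, 0.5))
```

The constants 128 and 16 make the bound conservative, so a value above 1 at moderate $n$ says nothing about the finite-sample behavior studied in Section 4; the bound describes how $\Delta_{S}$, $p$ and $n$ must scale for sure screening.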

4 A simulation study

4.1 A toy example

Consider a simple regression model $Y=\beta X+\epsilon$ where $X$ and $\epsilon$ are independent $N(0,1)$ variables. In this setting, (5) can be thought of as a statistic for testing $H_{0}:\beta=0$. To demonstrate the performance of CKF, we compare its power to i) $\hat{\kappa}^{\text{binary}}=\sup_{y}|\hat{F}(y|X>\mbox{median}\{X\})-\hat{F}(y|X\leq\mbox{median}\{X\})|$ and ii) $\sum_{\ell=1}^{4}\hat{\kappa}^{\mathcal{G}_{\ell}}/4$ with four equally-spaced knot sets whose sizes are 3, 4, 5, and 6, as suggested by Mai and Zou (2015). Figure 1 depicts the numerically computed power functions of the three methods at significance level $\alpha=0.05$. As expected, CKF (5) performs best while the simplest statistic, $\hat{\kappa}^{\text{binary}}$, performs worst. This echoes the fact that screening performance can be improved by minimizing the information loss entailed in the slicing step, which CKF indeed achieves.

4.2 Comparison to other screening methods

We consider the following nine models with $(n,p)=(200,5000)$ and $\epsilon\sim N(0,1)$ independent of ${\mathbf{X}}$ :

1. $U(Y)=T({\mathbf{X}})^{T}\beta+\epsilon$, where $\beta=(2.8\times 1_{2}^{T},0_{p-2}^{T})^{T}$, $T({\mathbf{X}})\sim N_{p}(0_{p},{\boldsymbol{\Sigma}})$ with ${\boldsymbol{\Sigma}}=CS(0.7)$. $CS(0.7)$ is a compound symmetry correlation matrix with correlation coefficient $0.7$. Let $U(Y)=Y$, $T({\mathbf{X}})={\mathbf{X}}$.

2. $T({\mathbf{X}})={\mathbf{X}}^{1/9}$ and other settings are the same as Model 1.

3. $U(Y)=Y^{1/9}$ and other settings are the same as Model 1.

4. $U(Y)=T({\mathbf{X}})^{T}\beta+\epsilon$, where $\beta=(0.8\times 1_{10}^{T},0_{p-10}^{T})^{T}$, $T({\mathbf{X}})\sim N_{p}(0_{p},{\boldsymbol{\Sigma}})$ with ${\boldsymbol{\Sigma}}=AR(0.7)$. $AR(0.7)$ is an autoregressive correlation matrix with autoregressive correlation coefficient $0.7$. Let $U(Y)=Y$, $T({\mathbf{X}})={\mathbf{X}}$.

5. $T({\mathbf{X}})=\frac{1}{2}\log({\mathbf{X}})$ and other settings are the same as Model 4.

6. $U(Y)=\log(Y)$ and other settings are the same as Model 4.

7. $Y=(X_{1}+X_{2}+1)^{3}+\epsilon$, where $X_{j}\stackrel{iid}{\sim}Cauchy$.

8. $Y=4X_{1}+2\tan(\pi X_{2}/2)+5X_{3}+\epsilon$, where $X_{j}\stackrel{iid}{\sim}U(0,1)$.

9. $Y=2(X_{1}+0.8X_{2}+0.6X_{3}+0.4X_{4}+0.2X_{5})+\exp(X_{20}+X_{21}+X_{22})\epsilon$, where ${\mathbf{X}}\sim N(0,{\boldsymbol{\Sigma}})$ with ${\boldsymbol{\Sigma}}=CS(0.8)$.

To avoid a cutoff selection problem, we report the average of the minimum number of variables needed to recover all informative ones over 100 independent repetitions. Hence, a smaller value implies better performance. Table 1 contains the comparison results against correlation learning (SIS, Fan and Lv, 2008) and distance correlation learning (DCS, Li et al., 2012) as well as FKF. In every model except Model 4, where SIS and DCS are marginally smaller (10.04 against 10.06), CKF attains the smallest average, and it improves on FKF in all models but Model 7, where the two tie.
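The criterion reported in Table 1 can be computed from any vector of marginal screening scores. A minimal sketch follows, assuming larger scores indicate more important variables and that variables are indexed from 0; the function name `minimum_model_size` is hypothetical.

```python
def minimum_model_size(scores, informative):
    """Smallest d such that the d top-ranked variables contain every informative index."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = {j: r + 1 for r, j in enumerate(order)}  # rank 1 = largest score
    return max(rank[j] for j in informative)

# Variables 0 and 3 are informative; variable 2 outranks variable 3,
# so three variables must be kept to retain both.
print(minimum_model_size([0.9, 0.1, 0.8, 0.5], {0, 3}))  # 3
```

Averaging this quantity over independent repetitions gives the entries of Table 1; a value equal to $d$ means the informative variables were ranked ahead of all others.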

5 Discussions

We employ a cumulative slicing technique to extend a screening tool for a binary response to a continuous one. The idea is quite general and can be applied to t-test-based screening (Fan and Fan, 2008; Fan and Lv, 2008) as well as logistic-regression-based screening (Fan and Song, 2010). In addition, it is possible to extend the idea of CKF to a censored response by replacing the empirical distribution function with the Kaplan-Meier estimator.

Appendix A Proof of Lemma 2.1

Because $k_{j}$ is invariant under monotone transformations, it suffices to consider the case where $g_{1}(t)=t$, $g_{2}(t)=t$, and thus $X_{j}$ and $Y$ are jointly normal. If $\rho_{j}=0$, then $X_{j}$ is independent of $Y$ and $k_{j}(x)=0$. On the other hand, if $\rho_{j}\neq 0$, then $Y|X_{j}=x\sim N(\rho_{j}x,1-\rho_{j}^{2})$. Let

$$G(y):=\Phi(x)\Phi(y)-\int_{-\infty}^{y}\Phi\Big(\frac{x-\rho_{j}u}{\sqrt{1-\rho_{j}^{2}}}\Big)\phi(u)du.$$

Then we have

$$k_{j}(x)=\frac{1}{\Phi(x)(1-\Phi(x))}\sup_{y}|G(y)|.$$

Note that $\frac{\partial G}{\partial y}=\Phi(x)\phi(y)-\Phi\Big(\frac{x-\rho_{j}y}{\sqrt{1-\rho_{j}^{2}}}\Big)\phi(y)=\phi(y)\left(\Phi(x)-\Phi\Big(\frac{x-\rho_{j}y}{\sqrt{1-\rho_{j}^{2}}}\Big)\right)$, which gives $\frac{\partial G}{\partial y}\Big|_{y=y^{*}}=0$, where $y^{*}=x\big(\frac{1-\sqrt{1-\rho_{j}^{2}}}{\rho_{j}}\big)$. When $\rho_{j}<0$, $G^{\prime\prime}(y^{*})=\phi(y^{*})\frac{\rho_{j}}{\sqrt{1-\rho_{j}^{2}}}\phi(x)<0$, so $G$ attains its supremum at $y=y^{*}$. Similarly, when $\rho_{j}>0$, $-G$ attains its supremum at $y=y^{*}$. It follows that

$$k_{j}(x)=\frac{1}{\Phi(x)(1-\Phi(x))}|G(y^{*})|.$$

When $\rho_{j}=1$ , then $k_{j}(x)=\frac{1}{\Phi(x)(1-\Phi(x))}\left|\Phi(x)-\Phi(x)\Phi(x)\right|=1.$ When $\rho_{j}=-1$ , then $k_{j}(x)=\frac{1}{\Phi(x)(1-\Phi(x))}\left|-\Phi(x)\Phi(-x)\right|=1.$

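Now we show that $k_{j}(x)$ is an increasing function of $|\rho_{j}|$ by differentiating $k_{j}(x)$ with respect to $\rho_{j}$. After some tedious calculations,

$$\frac{\partial k_{j}(x)}{\partial\rho_{j}}=\frac{\mathrm{sgn}(\rho_{j})}{\Phi(x)(1-\Phi(x))}\frac{(1-\rho_{j}^{2})^{-1/2}}{2\pi}\exp(-x^{2}h(\rho_{j})),$$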

where $h(\rho_{j})=\frac{1-\sqrt{1-\rho_{j}^{2}}}{\rho_{j}^{2}}$ . Thus, $k_{j}(x)$ is increasing in $|\rho_{j}|$ since $h$ is symmetric.
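The conclusions of Lemma 2.1 can be checked numerically in the jointly normal case. The sketch below evaluates $|G(y^{*})|/\{\Phi(x)(1-\Phi(x))\}$ with Simpson's rule, using the fact established above that the supremum of $|G|$ is attained at $y^{*}$. The function name `k_gauss`, the truncation of the integral at $-10$, and the number of Simpson steps are choices made here for illustration. At $x=0$ the standard orthant probability $P(X_{j}\leq 0,Y\leq 0)=1/4+\arcsin(\rho_{j})/(2\pi)$ gives the closed form $k_{j}(0)=2\arcsin(|\rho_{j}|)/\pi$, which serves as a sanity check.

```python
import math

def Phi(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def k_gauss(x, rho, steps=4000, lower=-10.0):
    """|G(y*)| / (Phi(x) (1 - Phi(x))) for 0 < |rho| < 1; steps must be even."""
    s = math.sqrt(1.0 - rho * rho)
    y_star = x * (1.0 - s) / rho
    f = lambda u: Phi((x - rho * u) / s) * phi(u)
    # Composite Simpson rule on [lower, y_star]; the mass below -10 is negligible.
    h = (y_star - lower) / steps
    total = f(lower) + f(y_star)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lower + i * h)
    integral = total * h / 3.0
    G = Phi(x) * Phi(y_star) - integral
    return abs(G) / (Phi(x) * (1.0 - Phi(x)))

# Closed-form check at x = 0, then monotonicity in |rho| at x = 1.
print(k_gauss(0.0, 0.5), 2 * math.asin(0.5) / math.pi)
print([round(k_gauss(1.0, r), 4) for r in (0.3, 0.6, 0.9)])
```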

Appendix B Proof of Theorem 3.2

Under the event that $\max_{j\in\{1,\ldots,p\}}|\hat{K}_{j}-K_{j}|<\Delta_{S}$ , we know that

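$$\hat{K}_{j}>K_{j}-\Delta_{S}\geq\min_{j\in S}K_{j}-\Delta_{S}\quad\mbox{for }j\in S,$$

$$\hat{K}_{j}<K_{j}+\Delta_{S}\leq\max_{j\notin S}K_{j}+\Delta_{S}\quad\mbox{for }j\notin S.$$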

Hence, for any $d_{n}\geq|S|$ , we have $\hat{S}(|S|)\subset\hat{S}(d_{n})$ , which implies $S^{*}\subseteq\hat{S}(d_{n})$ . On the other hand, by the following Lemma B.1, we have for any $\Delta_{S}>4/n$ ,

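$$P\Big(\max_{j\in\{1,\ldots,p\}}|\hat{K}_{j}-K_{j}|<\Delta_{S}\Big)\geq 1-p\left(4n\exp(-n\Delta_{S}^{2}/128)+2\exp(-n\Delta_{S}^{2}/16)\right).$$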

It follows that when $\Delta_{S}\gg\sqrt{\log(pn)/n}$ , the probability tends to 1.

Lemma B.1 Consider $K_{j}$ in (4) and $\hat{K}_{j}$ in (5). Then for any $\epsilon>4/n$ ,

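$$P(|\hat{K}_{j}-K_{j}|\geq\epsilon)\leq 4n\exp(-n\epsilon^{2}/128)+2\exp(-n\epsilon^{2}/16).$$

Proof of Lemma B.1 Without loss of generality, we only need to consider $\epsilon<1$ since otherwise the probability on the left side is trivially 0. We also assume, for convenience, that all $X_{\ell j}$ are distinct. First, we use the triangle inequality to bound

$$\begin{aligned}P(|\hat{K}_{j}-K_{j}|\geq\epsilon)&\leq P\Big(\Big|\sum_{\ell}\hat{k}_{j}(X_{\ell j})-\sum_{\ell}k_{j}(X_{\ell j})\Big|\geq\frac{n\epsilon}{2}\Big)+P\Big(\Big|\frac{1}{n}\sum_{\ell}k_{j}(X_{\ell j})-Ek_{j}(\tilde{X}_{j})\Big|\geq\frac{\epsilon}{2}\Big)\\&:=(i)+(ii).\end{aligned} \tag{6}$$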

We first treat the second term $(ii)$. By Bernstein's inequality (e.g. Lemma 2.2.9 in Van Der Vaart and Wellner (1996)), and using the fact that the $X_{\ell j}$, $\ell=1,\ldots,n$, are independent with the same distribution as $\tilde{X}_{j}$, we have

[TABLE]

where the first inequality follows by bounding the variance of each $k_{j}(X_{\ell j})-Ek_{j}(\tilde{X}_{j})$ by 1 from the fact that $|k_{j}(X_{\ell j})-Ek_{j}(\tilde{X}_{j})|\leq 1$ for any $\ell=1,\ldots n$ .
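The constant in the bound for $(ii)$ can be traced as follows. The derivation assumes one standard form of Bernstein's inequality, stated in the comment; it is a sketch that is consistent with the term $2\exp(-n\epsilon^{2}/16)$ in Lemma B.1 rather than a reproduction of the authors' display.

```latex
% Assumed form: for independent, mean-zero Z_1,...,Z_n with |Z_l| <= M and
% Var(Z_1 + ... + Z_n) <= v,
%   P(|Z_1 + ... + Z_n| >= t) <= 2 exp{ -(1/2) t^2 / (v + M t / 3) }.
% Take Z_l = k_j(X_{lj}) - E k_j(\tilde X_j), M = 1, v = n, t = n\epsilon/2.
\begin{align*}
(ii) &\le 2\exp\left(-\frac{1}{2}\cdot\frac{n^{2}\epsilon^{2}/4}{n+n\epsilon/6}\right)
      = 2\exp\left(-\frac{n\epsilon^{2}}{8(1+\epsilon/6)}\right)\\
     &\le 2\exp\left(-\frac{n\epsilon^{2}}{16}\right),
      \qquad\text{since } 1+\epsilon/6\le 2 \text{ for } \epsilon<1.
\end{align*}
```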

Now we consider the first term $(i)$ in (6). First note that $|\hat{k}_{j}(X_{\ell j})-k_{j}(X_{\ell j})|\leq 1$ for any $\ell=1,\ldots,n$. We use this trivial bound for $\ell=\ell^{\prime}$, where $X_{\ell^{\prime}j}$ is the maximum of $X_{1j},\ldots,X_{nj}$. Let $\epsilon^{\prime}:=\epsilon/2-1/n$ and $\epsilon_{\ell}:=\frac{1}{2}\sqrt{\frac{n}{\ell}}\epsilon^{\prime}$. Using $\sum_{\ell=1}^{n}\epsilon_{\ell}=\sum_{\ell=1}^{n}\frac{1}{2}\sqrt{\frac{n}{\ell}}\epsilon^{\prime}\leq\frac{\sqrt{n}\epsilon^{\prime}}{2}\int_{0}^{n}x^{-1/2}dx=n\epsilon^{\prime}$, we have by the union bound that

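$$(i)=P\Big(\Big|\sum_{\ell=1}^{n}\hat{k}_{j}(X_{\ell j})-\sum_{\ell=1}^{n}k_{j}(X_{\ell j})\Big|\geq\frac{n\epsilon}{2}\Big)\leq\sum_{\ell\neq\ell^{\prime}}P\Big(\Big|\hat{k}_{j}(X_{\ell j})-k_{j}(X_{\ell j})\Big|\geq\epsilon_{\tilde{\ell}}\Big), \tag{7}$$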

where $\tilde{\ell}$ corresponds to the rank of $X_{\ell j}$ .

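We bound (7) from above using ideas similar to those in Lemma A1 of Mai and Zou (2012). Using the Dvoretzky–Kiefer–Wolfowitz inequality, for any $x$ in the support of $X_{j}$,

$$P\left(\Big|\hat{k}_{j}(x)-k_{j}(x)\Big|\geq\epsilon_{\tilde{\ell}}\,\Big|\,X_{1j},\ldots,X_{nj}\right)\leq 2\exp(-n_{+}\epsilon_{\tilde{\ell}}^{2}/2)+2\exp(-n_{-}\epsilon_{\tilde{\ell}}^{2}/2)$$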

where $n_{+}=\sum_{i=1}^{n}\mathds{1}_{\{X_{ij}>x\}}$ and $n_{-}=\sum_{i=1}^{n}\mathds{1}_{\{X_{ij}\leq x\}}$ . Thus by replacing $x$ by $X_{\ell j}$ followed by taking the expectation, we have

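$$\sum_{\ell\neq\ell^{\prime}}P\left(\Big|\hat{k}_{j}(X_{\ell j})-k_{j}(X_{\ell j})\Big|\geq\epsilon_{\tilde{\ell}}\right)\leq\sum_{\ell=1}^{n-1}\Big(2\exp(-(n-\ell)\epsilon_{\ell}^{2}/2)+2\exp(-\ell\epsilon_{\ell}^{2}/2)\Big).$$

It follows by symmetry

$$(i)\leq 4\sum_{\ell=1}^{n-1}\exp(-\ell\epsilon_{\ell}^{2}/2)=4\sum_{\ell=1}^{n-1}\exp(-n\epsilon^{\prime 2}/8)\leq 4n\exp(-n\epsilon^{2}/128),$$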

where the last inequality holds since $\epsilon^{\prime}=\epsilon/2-1/n\geq\epsilon/4$ . The proof is complete.
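For completeness, the arithmetic behind the final two steps can be written out, using only the definitions $\epsilon_{\ell}=\frac{1}{2}\sqrt{n/\ell}\,\epsilon^{\prime}$ and $\epsilon^{\prime}=\epsilon/2-1/n$ and the standing condition $\epsilon>4/n$ of Lemma B.1.

```latex
\begin{align*}
\frac{\ell\,\epsilon_{\ell}^{2}}{2}
  &= \frac{\ell}{2}\cdot\frac{1}{4}\cdot\frac{n}{\ell}\,\epsilon^{\prime 2}
   = \frac{n\epsilon^{\prime 2}}{8},\\[4pt]
\epsilon>\frac{4}{n}
  &\;\Longrightarrow\; \frac{1}{n}<\frac{\epsilon}{4}
   \;\Longrightarrow\; \epsilon^{\prime}=\frac{\epsilon}{2}-\frac{1}{n}>\frac{\epsilon}{4},\\[4pt]
\frac{n\epsilon^{\prime 2}}{8}
  &\ge \frac{n}{8}\cdot\frac{\epsilon^{2}}{16}
   = \frac{n\epsilon^{2}}{128}.
\end{align*}
```

Since the sum has $n-1\leq n$ identical terms, this yields $(i)\leq 4n\exp(-n\epsilon^{2}/128)$, and adding the bound $2\exp(-n\epsilon^{2}/16)$ for $(ii)$ gives the statement of Lemma B.1.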

Bibliography

Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence rules, The Annals of Statistics 36 (6): 2605.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space, Journal of the Royal Statistical Society: Series B 70 (5): 849–911.

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality, The Annals of Statistics 38 (6): 3567–3604.

Li, R., Zhong, W. and Zhu, L. (2012). Feature screening via distance correlation learning, Journal of the American Statistical Association 107 (499): 1129–1139.

Mai, Q. and Zou, H. (2012). The Kolmogorov filter for variable screening in high-dimensional binary classification, Biometrika 100: 229–234.

Mai, Q. and Zou, H. (2015). The fused Kolmogorov filter: A nonparametric model-free screening method, The Annals of Statistics 43 (4): 1471–1497.

Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence, Springer, New York.