Model-free Feature Screening and FDR Control with Knockoff Features
Wanjun Liu, Yuan Ke, Jingyuan Liu, Runze Li

TL;DR
This paper introduces a model-free, data-adaptive feature screening method for high-dimensional data using projection correlation, with an innovative two-step FDR control approach leveraging knockoff features, demonstrating strong empirical results.
Contribution
It develops a novel model-free screening method based on projection correlation and a two-step FDR control procedure using knockoff features, applicable to complex data scenarios.
Findings
Method achieves sure screening and rank consistency.
FDR control is effective when FDR level ≥ 1/s.
Numerical experiments show superior empirical performance.
Abstract
This paper proposes a model-free and data-adaptive feature screening method for ultra-high dimensional datasets. The proposed method is based on the projection correlation which measures the dependence between two random vectors. This projection correlation based method does not require specifying a regression model and applies to the data in the presence of heavy-tailed errors and multivariate response. It enjoys both sure screening and rank consistency properties under weak assumptions. Further, a two-step approach is proposed to control the false discovery rate (FDR) in feature screening with the help of knockoff features. It can be shown that the proposed two-step approach enjoys both sure screening and FDR control if the pre-specified FDR level is greater or equal to , where is the number of active features. The superior empirical performance of the proposed…
| Model 1.a | Model 1.b | |||||||||||
| 5% | 25% | 50% | 75% | 95% | 5% | 25% | 50% | 75% | 95% | |||
| PC-Screen | 5.0 | 5.0 | 5.0 | 5.0 | 8.0 | 5.0 | 5.0 | 6.0 | 14.0 | 125.0 | ||
| DC-SIS | 5.0 | 5.0 | 5.0 | 5.0 | 6.0 | 5.0 | 5.0 | 10.0 | 81.2 | 1305.0 | ||
| bcDC-SIS | 5.0 | 5.0 | 5.0 | 5.0 | 6.0 | 5.0 | 5.0 | 6.0 | 20.5 | 231.4 | ||
| SIS | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 6.0 | 238.0 | 1833.0 | 3878.5 | 4915.0 | ||
| PC-Screen | 5.0 | 5.0 | 5.0 | 5.0 | 7.0 | 5.0 | 5.0 | 8.0 | 23.0 | 233.5 | ||
| DC-SIS | 5.0 | 5.0 | 5.0 | 5.0 | 6.0 | 5.0 | 6.8 | 21.0 | 204.2 | 3511.2 | ||
| bcDC-SIS | 5.0 | 5.0 | 5.0 | 5.0 | 6.0 | 5.0 | 5.0 | 10.0 | 43.2 | 718.8 | ||
| SIS | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 16.0 | 703.2 | 3418.5 | 7432.0 | 9651.0 | ||
| Model 1.c | Model 1.d | |||||||||||
| 5% | 25% | 50% | 75% | 95% | 5% | 25% | 50% | 75% | 95% | |||
| PC-Screen | 5.0 | 5.0 | 6.0 | 8.0 | 50.9 | 5.0 | 5.0 | 6.0 | 10.2 | 139.9 | ||
| DC-SIS | 5.0 | 8.0 | 42.5 | 143.0 | 722.5 | 5.0 | 16.0 | 54.0 | 189.0 | 701.6 | ||
| bcDC-SIS | 5.0 | 5.0 | 6.0 | 14.2 | 80.4 | 5.0 | 5.0 | 8.0 | 21.8 | 156.8 | ||
| SIS | 5.0 | 39.0 | 81.5 | 374.0 | 3244.4 | 5.0 | 45.8 | 130.5 | 523.5 | 3241.0 | ||
| PC-Screen | 5.0 | 5.0 | 6.0 | 8.0 | 92.4 | 5.0 | 6.0 | 6.0 | 13.0 | 176.2 | ||
| DC-SIS | 5.0 | 18.2 | 78.0 | 277.5 | 1567.5 | 5.0 | 30.8 | 113.0 | 410.0 | 2036.5 | ||
| bcDC-SIS | 5.0 | 5.0 | 7.0 | 19.2 | 179.1 | 5.0 | 5.0 | 10.0 | 27.0 | 412.6 | ||
| SIS | 6.0 | 58.8 | 180.5 | 773.2 | 4189.0 | 8.9 | 73.0 | 244.5 | 959.8 | 5777.7 | ||
| Model 1.e | Model 1.f | |||||||||||
| 5% | 25% | 50% | 75% | 95% | 5% | 25% | 50% | 75% | 95% | |||
| PC-Screen | 5.0 | 5.0 | 5.0 | 5.0 | 17.2 | 5.0 | 5.0 | 5.0 | 5.0 | 7.0 | ||
| DC-SIS | 90.3 | 396.0 | 897.5 | 1762.5 | 3787.3 | 76.3 | 396.0 | 893.5 | 1764.8 | 3471.8 | ||
| bcDC-SIS | 22.7 | 82.2 | 259.5 | 669.2 | 2358.7 | 15.0 | 87.2 | 266.5 | 821.2 | 2471.5 | ||
| SIS | 178.6 | 604.8 | 1137.0 | 2319.8 | 4303.4 | 186.0 | 606.2 | 1210.5 | 2253.0 | 4261.6 | ||
| PC-Screen | 5.0 | 5.0 | 5.0 | 6.0 | 23.0 | 5.0 | 5.0 | 5.0 | 5.0 | 12.0 | ||
| DC-SIS | 138.0 | 788.8 | 1878.5 | 3301.0 | 6831.9 | 154.2 | 729.2 | 1725.5 | 3424.0 | 6832.8 | ||
| bcDC-SIS | 45.7 | 163.5 | 534.5 | 1520.5 | 5415.2 | 30.0 | 175.8 | 509.0 | 1513.2 | 5580.0 | ||
| SIS | 462.8 | 1276.8 | 2460.0 | 4164.5 | 8622.5 | 512.6 | 1271.2 | 2484.5 | 4281.0 | 8416.3 | ||
| Model 2.a | Model 2.b | |||||||||||
| 5% | 25% | 50% | 75% | 95% | 5% | 25% | 50% | 75% | 95% | |||
| PC-Screen | 4.0 | 4.0 | 4.0 | 5.2 | 19.1 | 4.0 | 5.0 | 9.5 | 26.5 | 261.9 | ||
| DC-SIS | 600.0 | 1880.2 | 2994.5 | 4052.2 | 4701.3 | 4.0 | 7.8 | 61.0 | 541.5 | 2310.9 | ||
| bcDC-SIS | 488.4 | 1480.8 | 2863.0 | 3967.8 | 4855.9 | 4.0 | 6.0 | 21.5 | 88.0 | 893.8 | ||
| SIS | 709.0 | 2065.8 | 3062.5 | 4160.0 | 4869.4 | 54.0 | 658.5 | 2692.5 | 4213.0 | 4829.1 | ||
| PC-Screen | 4.0 | 4.0 | 4.0 | 5.0 | 31.0 | 4.0 | 5.8 | 13.0 | 48.8 | 393.9 | ||
| DC-SIS | 664.0 | 3162.5 | 5655.5 | 7605.8 | 9490.1 | 4.0 | 13.8 | 86.0 | 843.2 | 5529.2 | ||
| bcDC-SIS | 663.4 | 2605.8 | 5578.5 | 7745.2 | 9208.4 | 4.0 | 8.0 | 24.5 | 150.5 | 1403.9 | ||
| SIS | 986.2 | 4048.2 | 6068.0 | 8298.5 | 9752.6 | 64.5 | 1193.8 | 4488.0 | 8139.5 | 9725.6 | ||
| Model 2.c | Model 2.d | |||||||||||
| 5% | 25% | 50% | 75% | 95% | 5% | 25% | 50% | 75% | 95% | |||
| PC-Screen | 4.0 | 4.0 | 4.0 | 6.0 | 21.3 | 4.0 | 5.0 | 9.5 | 36.0 | 169.4 | ||
| DC-SIS | 397.6 | 1536.8 | 2750.5 | 3930.0 | 4721.8 | 1639.1 | 3138.5 | 3851.5 | 4434.8 | 4902.4 | ||
| bcDC-SIS | 196.5 | 1267.5 | 2774.0 | 4236.0 | 4986.2 | 605.5 | 1251.8 | 2073.5 | 2972.5 | 4209.1 | ||
| SIS | 421.2 | 1615.8 | 2920.0 | 4057.0 | 4761.0 | 2065.0 | 3487.8 | 4083.5 | 4659.0 | 4950.3 | ||
| PC-Screen | 4.0 | 4.0 | 4.0 | 7.2 | 25.0 | 4.0 | 5.0 | 13.0 | 62.5 | 297.4 | ||
| DC-SIS | 668.8 | 3628.2 | 6243.5 | 7920.5 | 9531.6 | 3409.6 | 6380.8 | 7705.0 | 8756.5 | 9790.6 | ||
| bcDC-SIS | 288.2 | 2006.5 | 4714.5 | 7370.2 | 9869.9 | 776.4 | 2567.8 | 4154.5 | 5517.8 | 8251.0 | ||
| SIS | 760.9 | 3609.5 | 6155.5 | 8129.8 | 9577.1 | 3333.4 | 6387.8 | 8007.5 | 9252.8 | 9847.2 | ||
| Model 3.a | Model 3.b | |||||||||||
| 5% | 25% | 50% | 75% | 95% | 5% | 25% | 50% | 75% | 95% | |||
| PC-Screen | 4.0 | 4.0 | 4.0 | 4.0 | 6.0 | 4.0 | 4.0 | 6.0 | 18.0 | 114.4 | ||
| DC-SIS | 53.0 | 463.0 | 1211.0 | 2349.0 | 3774.2 | 641.1 | 2308.8 | 3307.0 | 4257.8 | 4838.0 | ||
| bcDC-SIS | 24.0 | 215.5 | 758.5 | 1965.0 | 3999.8 | 225.2 | 1270.2 | 2494.5 | 3596.8 | 4709.0 | ||
| PC-Screen | 4.0 | 4.0 | 4.0 | 4.0 | 9.0 | 4.0 | 4.0 | 8.0 | 30.2 | 302.9 | ||
| DC-SIS | 136.8 | 978.5 | 2237.5 | 4506.0 | 8106.6 | 1804.7 | 4510.5 | 6445.5 | 8153.5 | 9707.6 | ||
| bcDC-SIS | 83.0 | 546.0 | 1828.5 | 4264.5 | 7711.3 | 444.4 | 2217.0 | 4695.0 | 7122.5 | 9076.4 | ||
| All | |||||||||||||
| Model 4.a | |||||||||||||
| 0.10 | 11.245 | 0.995 | 0.995 | 1.000 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.990 | 0.990 | 0.097 |
| 0.15 | 11.935 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.990 | 0.130 |
| 0.20 | 12.860 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.188 |
| 0.25 | 14.095 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.244 |
| 0.30 | 15.270 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.285 |
| Model 4.b | |||||||||||||
| 0.10 | 10.545 | 0.945 | 0.945 | 0.960 | 0.950 | 0.955 | 0.950 | 0.955 | 0.955 | 0.955 | 0.945 | 0.915 | 0.079 |
| 0.15 | 11.350 | 0.985 | 0.990 | 0.995 | 0.995 | 0.995 | 1.000 | 0.995 | 1.000 | 0.990 | 0.970 | 0.920 | 0.100 |
| 0.20 | 12.460 | 0.990 | 0.995 | 1.000 | 0.995 | 0.995 | 1.000 | 0.995 | 1.000 | 1.000 | 0.985 | 0.955 | 0.164 |
| 0.25 | 13.570 | 0.995 | 1.000 | 1.000 | 0.995 | 0.995 | 1.000 | 0.995 | 1.000 | 1.000 | 0.985 | 0.965 | 0.216 |
| 0.30 | 14.705 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 0.990 | 0.980 | 0.260 |
| Model 4.c | |||||||||||||
| 0.10 | 11.165 | 0.995 | 0.995 | 0.995 | 1.000 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.092 |
| 0.15 | 11.660 | 1.000 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.116 |
| 0.20 | 13.075 | 1.000 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.193 |
| 0.25 | 14.420 | 1.000 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.254 |
| 0.30 | 15.570 | 1.000 | 0.995 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 | 0.299 |
| Model 4.d | |||||||||||||
| 0.10 | 10.630 | 0.940 | 0.945 | 0.940 | 0.945 | 0.955 | 0.940 | 0.950 | 0.940 | 0.950 | 0.940 | 0.925 | 0.092 |
| 0.15 | 11.660 | 0.985 | 0.980 | 0.995 | 0.985 | 0.995 | 0.990 | 1.000 | 1.000 | 0.990 | 0.990 | 0.930 | 0.122 |
| 0.20 | 12.870 | 0.990 | 0.990 | 1.000 | 1.000 | 1.000 | 0.995 | 1.000 | 1.000 | 0.990 | 0.990 | 0.965 | 0.193 |
| 0.25 | 13.830 | 0.990 | 0.995 | 1.000 | 1.000 | 1.000 | 0.995 | 1.000 | 1.000 | 0.995 | 0.995 | 0.975 | 0.242 |
| 0.30 | 15.020 | 0.995 | 0.995 | 1.000 | 1.000 | 1.000 | 0.995 | 1.000 | 1.000 | 0.995 | 0.995 | 0.980 | 0.288 |
| Model 4.e | |||||||||||||
| 0.10 | 9.790 | 0.855 | 0.870 | 0.860 | 0.860 | 0.865 | 0.870 | 0.875 | 0.870 | 0.865 | 0.865 | 0.815 | 0.086 |
| 0.15 | 11.590 | 0.955 | 0.985 | 0.975 | 0.965 | 0.980 | 0.995 | 0.980 | 0.970 | 0.965 | 0.970 | 0.825 | 0.123 |
| 0.20 | 12.895 | 0.970 | 0.990 | 0.995 | 0.995 | 1.000 | 1.000 | 0.995 | 0.990 | 0.985 | 0.980 | 0.935 | 0.189 |
| 0.25 | 14.025 | 0.980 | 0.995 | 0.995 | 0.995 | 1.000 | 1.000 | 0.995 | 0.990 | 0.990 | 0.990 | 0.935 | 0.244 |
| 0.30 | 15.040 | 0.985 | 1.000 | 0.995 | 0.995 | 1.000 | 1.000 | 0.995 | 0.990 | 0.990 | 0.990 | 0.945 | 0.283 |
| PC-Screen | PC-Knockoff | ||||||
| All | All | ||||||
| Model 4.a | 100 | 0.995 | 0.90 | 12.9 | 0.995 | 0.188 | |
| Model 4.b | 100 | 0.995 | 0.90 | 12.5 | 0.955 | 0.164 | |
| Model 4.c | 100 | 0.995 | 0.90 | 13.0 | 0.995 | 0.193 | |
| Model 4.d | 100 | 0.995 | 0.90 | 12.9 | 0.965 | 0.193 | |
| Model 4.e | 100 | 0.995 | 0.90 | 12.9 | 0.935 | 0.189 | |
| Training set | Test set | ||||
| Mean | SD | Mean | SD | ||
| PC-Knockoff | 0.8681 | 0.0042 | 0.8608 | 0.0281 | |
| DC-SIS with -test | 0.8534 | 0.0043 | 0.8488 | 0.0276 | |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Model-free Feature Screening and
FDR Control with Knockoff Features
Wanjun Liu, Yuan Ke, Jingyuan Liu
and Runze Li Department of Statistics, The Pennsylvania State University, University Park, PA 16802. E-mail: [email protected].Department of Statistics, University of Georgia, Athens, GA 30602. E-mail: [email protected].MOE Key Laboratory of Econometrics, Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, P.R China. E-mail: [email protected]Department of Statistics, The Pennsylvania State University, University Park, PA 16802. E-mail: [email protected].
Abstract
This paper proposes a model-free and data-adaptive feature screening method for ultra-high dimensional data. The proposed method is based on the projection correlation which measures the dependence between two random vectors. This projection correlation based method does not require specifying a regression model, and applies to data in the presence of heavy tails and multivariate responses. It enjoys both sure screening and rank consistency properties under weak assumptions. A two-step approach, with the help of knockoff features, is advocated to specify the threshold for feature screening such that the false discovery rate (FDR) is controlled under a pre-specified level. The proposed two-step approach enjoys both sure screening and FDR control simultaneously if the pre-specified FDR level is greater or equal to , where is the number of active features. The superior empirical performance of the proposed method is illustrated by simulation examples and real data applications.
Keywords: Data adaptive, multivariate response model, nonlinear model, projection correlation, rank consistency, sample splitting, sure screening
1 Introduction
The technological development has made extensive data collection and storage feasible in diverse fields. Datasets with ultra-high dimensional features characterize many contemporary research problems in machine learning, computer science, statistics, engineering, social science, finance and so on. When the features contain redundant or noisy information, estimating their functional relationship with the response may become quite challenging in terms of computational expediency, statistical accuracy, and algorithmic stability (Fan et al., 2009; Hall & Miller, 2009; Lv & Liu, 2014). To overcome such challenges caused by ultra-high dimensionality, Fan & Lv (2008) proposed a sure independence screening (SIS) method, which aims to screen out the redundant features by ranking their marginal Pearson correlations. The SIS method is named after the sure independence screening property, which states the selected subset of features contains all the active ones with probability approaching one. The promising numerical performance soon made SIS popular among ultra-high dimensional studies (Liu et al., 2015; Liu & Li, 2020). The sure screening idea has been applied to many important statistical problems including generalized linear model (Fan & Song, 2010), multi-index semi-parametric model (Zhu et al., 2011), nonparametric model (Fan et al., 2011; Liu et al., 2014), quantile regression (He et al., 2013; Wu & Yin, 2015) and compressed sensing (Xue & Zou, 2011) among others.
In addition to the sure screening property, we argue an appealing screening method should satisfy the following two properties. First, the screening method should be model-free in the sense that it can be implemented without specifying a regression model. In the ultra-high dimensional regime, it is challenging, if not impossible, to specify a correct regression model before removing a huge number of redundant features. Hence, the model-free property is desired as it guarantees the effectiveness of the screening method in the presence of model misspecification. The model-free screening method becomes a hot research topic in recent years (Zhu et al., 2011; Li et al., 2012; Mai & Zou, 2015). The second property is data-adaptive which means the screening method should not be sensitive to assumptions like independence, sub-Gaussianity, and univariate response. Such assumptions are usually not satisfied under ultra-high dimensional settings. Even met on the population level, they can be violated in the observed sample due to ultra-high dimensionality. Therefore, the screening methods that are sensitive to such assumptions may perform poorly in real applications. The data-adaptive screening methods also draw a certain amount of attention recently. For instance, He et al. (2013), Wu & Yin (2015) and Ma et al. (2017), among others, considered quantile-based screening which adapts to heavy-tailed data. In addition, Wang (2012) and Fan, Ke & Wang (2020) developed screening methods for strongly correlated features.
Unfortunately, none of the aforementioned screening methods enjoys sure screening, model-free and data-adaptive properties simultaneously. For example, the SIS is tailored to the linear regression and depends on the independence assumption. Li et al. (2012) developed a model-free sure independence screening procedure based on the distance correlation. However, its sure screening property requires sub-exponential assumption for features and response. The Kolmogorov distance based screening method proposed in Mai & Zou (2012) is robust against heavy-tailed data, but it only works for binary classification problems. Pan et al. (2016) proposed a pairwise sure screening procedure for linear discriminant analysis, which is not sensitive to the tail behavior of the features, but requires balanced categories.
In this paper, we propose a model-free and data-adaptive feature screening method named PC-Screen. The PC-Screen is based on ranking the projection correlations between features and response variables. The projection correlation, proposed by Zhu et al. (2017), is a measure of dependence between two random vectors which enjoys several nice probability properties. The PC-Screen does not require specifying any regression model and is insensitive to the correlations and moment conditions of the dataset. As the projection correlation is dimension-free to both random vectors, the PC-Screen can be applied to multi-task learning problems (Caruana, 1997). For instance, we can find a parsimonious set of features that are jointly dependent on the multivariate response. As existing asymptotic results are not suitable for the “large small problem” studied in this paper, we also develop non-asymptotic concentration-type inequalities for empirical projection correlation. Further, we demonstrate the proposed PC-Screen enjoys not only the sure screening property but also a stronger result called rank consistency property. The only condition required is a minimum signal strength gap between active and inactive features. The extensive numerical examples show the proposed method wins the horse racing against its competitors in various scenarios.
Most feature screening methods depend on some threshold parameter that controls the cut-off between active and inactive features. The optimal choice of the threshold typically depends on unknown parameters. Under a specific model assumption, the threshold can be selected by cross-validation or information criteria approaches. However, when no model assumption is imposed, such parameter selection approaches are not applicable as the loss function which measures the goodness of fit is not well defined. In addition, existing screening methods tend to sacrifice the false discovery rate for sure screening property by choosing a conservative threshold parameter, leading to an inflated model size. Recently, Zhao & Li (2012) studied the sure independence screening for Cox models with a principled method to select the threshold aiming at controlling the false positive rate. Song et al. (2014) proposed a censored rank independence screening for survival data and chose the threshold by estimating the proportion of active features. The validity of this procedure relies on the independence assumption of multiple test statistics which may easily get violated in the model-free setup. In this paper, we tackle the issue of threshold selection with a two-step procedure named PC-Knockoff. In the first step, we apply the PC-Screen method to obtain an over-fitted subset of moderate size from the ultra-high dimensional features. In the second step, we construct knockoff counterparts for the features which survive in the first step. Conditioning on the sure screening of the first step, we further select a parsimonious model with FDR controlled below a pre-specified level with a statistic that utilizing the knockoff features. Theoretical analysis shows that when the pre-specified level is not too aggressive, the PC-Knockoff procedure enjoys sure screening property, as well as conditional FDR control simultaneously with high probability. We also validate the theoretical findings with various numerical examples.
The rest of the paper is organized as follows. In Section 2, we briefly review the definition and properties of projection correlation and demonstrate its non-asymptotic properties. Then we propose the PC-Screen procedure and show its sure screening and rank consistency properties under very mild conditions. Section 3 studies the PC-Knockoff procedure which selects the threshold with FDR control as well as sure screening property. Section 4 assesses the finite sample performance of the PC-Screen and PC-Knockoff methods with several simulated examples and a real data application. We briefly summarize the paper in Section 5. Due to the limited space, we provide the proofs of some theoretical results in the Appendix, and relegate the proofs of remaining results as well as some additional numerical examples to the a supplementary file.
2 Model-free and data-adaptive screening procedure
2.1 Projection correlation
To pave the way for the proposed screening procedure, we first provide some background on the projection correlation and its properties introduced in Zhu et al. (2017). Let and be two random vectors. The projection correlation is elicited by the following independence testing problem,
[TABLE]
The null hypothesis holds if and only if and are independent for all unit vectors and . Let be the joint distribution of , and and be the marginal distributions of and . The squared projection covariance is defined as
[TABLE]
where is the indicator function. Furthermore, the projection correlation between and is defined as the square root of
[TABLE]
and we follow the convention .
In general , testing whether and are independent amounts to testing whether . The projection correlation is a measure of dependence between two random vectors and enjoys some appealing properties. Let and be two random vectors with continuous marginal and joint probability distributions, if and only if and are independent. We remark here this property does not hold in general without the assumption that is jointly continuous. When and are two dependent discrete random variables that are constructed in a similar fashion as in Hoeffding (1948), it is possible that we have .
Zhu et al. (2017) gives an explicit formula for the squared projection covariance in (2.1). Let be 5 independent random copies of , then
[TABLE]
where is the norm. Equation (2.3) shows that the projection covariance only depends on the vectors through forms and whose second moments are unity. This gives us the intuition that the projection covariance is free of the moment conditions on which are usually required by some other measurements, such as distance correlation (Li et al., 2012).
Let and be an observed sample of . Equation (2.3) leads to a straightforward estimator of based on a -statistic, yet it is difficult to calculate (Székely & Rizzo, 2010). An equivalent form of the -statistic is given in Zhu et al. (2017). In particular, the squared sample projection variance and covariance between and can be calculated as
[TABLE]
where
[TABLE]
Then the sample projection correlation between and is defined as the square root of
[TABLE]
Based on (2.4), the sample projection correlation can be computed in .
We first provide exponential-type deviation inequalities for squared sample projection covariance and correlation.
Theorem 1**.**
For any satisfying , there exists positive constants and , such that
[TABLE]
[TABLE]
where , , and .
The proof of Theorem 1 is based on an exponential-type deviation inequality for -statistic and can be found in the supplementary material. The above exponential inequalities do not depend on the dimensionality and moment conditions of both random vectors. The exception probability decays exponentially with sample size which guarantees good finite sample performance of the proposed estimator.
2.2 PC-Screen procedure
In this subsection, we propose a model-free and data-adaptive screening procedure utilizing the nice properties of projection correlation. Let be the response vector of variables and be the vector of features. To avoid the trivial discussion, we restrict ourselves to the non-degenerate case. That is, and for some .
Denote the conditional distribution function of given . Without specifying any regression model of on , we define the index set of active features by
[TABLE]
The number of active features is , the cardinality of . The features that do not belong to are inactive features. We use , the complement of , to denote the index set of inactive features. The above setting abstracts a large number of sparse regression problems including linear model, generalized linear model, additive model, semi-parametric model, non-linear model and so on. Moreover, it also allows the multivariate response and grouped predictors.
Suppose that is a random sample of . Denote and . In the ultra-high dimensional regime, it is natural to assume that the number of features greatly exceeds the sample size , but the number of active features is smaller than . For a given feature , , a sufficient condition for to be an inactive feature is the independence between and . This, together with Theorem 1, motivates us to screen out the features whose projection correlations with are close to [math]. As a result, we estimate the set of active features by
[TABLE]
where is a pre-specified positive threshold, and is the th column of . With a proper choice of , we show that the proposed feature screening procedure enjoys the sure screening property, which states that with probability approaching 1, all active features are included in . We name this feature screening procedure as projection correlation based screening, or PC-Screen.
Denote and to be the squared population and sample projection correlations between the th feature and response, respectively. To analyze the property of PC-Screen, we impose the following minimum signal strength condition.
Condition 1. (Minimum signal strength)
- (a)
For some and ,
- (b)
For some and ,
Remark 1**.**
Condition 1 (a) is a minimum signal strength condition that assumes the squared projection correlations between active features and response are uniformly bounded below, and cannot converge to zero too fast as diverges. Condition 1 (b) imposes an assumption on the gap of signal strength between active and inactive features. Condition 1 (a) is weaker than Condition 1 (b), since is always non-negative. Such minimum signal strength condition can be viewed as a sparsity assumption that guarantees active features to be distinguished from the inactive ones. Condition 1 is a very mild condition as we allow the minimum signal strength to converge to zero as the sample size diverges. Empirically, one may verify Condition 1 by plotting the sorted empirical projection correlations of features in descending order and visually identifying whether there exists an “elbow” shape, or whether the sequence can be segmented into two groups. A more rigorous verification approach is to consider a multiple-testing procedure. For each feature, one can calculate the projection correlation based test statistic proposed in Zhu et al. (2017). Since it is not the main focus of this paper, we do not pursue more details.
The following two theorems state the sure screening property and the rank consistency property of PC-Screen.
Theorem 2** (Sure screening).**
Under Condition 1 (a), take , we have
[TABLE]
where is a positive constant.
In Theorem 2, if we set , which satisfies the condition , we have
[TABLE]
From (2.7), we know that if , all active features are selected with probability approaching 1 as . In fact, any choice of leads to the sure screening property. With the same choice of , Li et al. (2012) showed that the distance correlation based screening method (DC-SIS) satisfies
[TABLE]
where , and are positive constants. Thus, PC-Screen achieves a faster rate than the DC-SIS since (1) we do not have the term and (2) we do not have an extra in the power of the first term. The faster rate of PC-Screen is due to the fact that projection correlation is not sensitive to dimensionality and free of moment conditions.
Theorem 3** (Rank consistency).**
Under Condition 1 (b), we have
[TABLE]
where is some positive constant. If with , then we have
[TABLE]
The rank consistency in Theorem 3 is a stronger result than the sure screening property. When the signal strength gap between active and inactive features satisfies Condition 1 (b), the active features are always ranked ahead of the inactive ones with high probability. In other words, there exists a choice of on the solution path that can perfectly separate the active and inactive sets with high probability.
3 Screening with FDR control
3.1 Motivation
In the PC-Screen procedure, the threshold controls the cut-off between active and inactive features. Theorem 2 suggests choosing for some positive constants and . With certain model assumptions, the threshold (or equivalently and ) can be selected by cross-validation or information criterion approaches. However, in the model-free setup, such approaches are not directly applicable as the loss functions that measure the goodness of fit are not well defined. More recently, Zhao & Li (2012) and Song et al. (2014) suggested to select the threshold by controlling the false positive rate and estimating the proportion of active features , respectively. These two methods are tailored for survival analysis and are not directly applicable to model-free setup.
In practice, one can arbitrarily select a conservative threshold to ensure that all active features are included with high probability. However, it will include too many inactive features and inflate the FDR. Therefore, selecting the threshold parameter for model-free screening methods will inevitably cause the issue of balancing the trade-off between the sure screening property and FDR control. In this section, we propose a two-step procedure to address this issue by utilizing knockoff features. The construction of knockoff features does not depend on the regression model, and hence is suitable for the model-free setting. Also, the second-order knockoff features can be easily computed by an equicorrelated construction or solving a semidefinite program (Barber & Candès, 2015). The proposed procedure enjoys sure screening property and controls FDR simultaneously with high probability.
3.2 Knockoff features: a brief review
Recently, knockoff has drawn huge attention due to its success in variable selection and many important applications such as genome-wide association studies. The concept of knockoff was first proposed in Barber & Candès (2015) for the fixed design matrix problem and then extended to the random design matrix setting known as Model-X knockoffs (Candès et al., 2018). For a more detailed development of knockoff, see Barber & Candès (2015); Candès et al. (2018); Fan, Demirkaya, Li & Lv (2020) and references therein. In this subsection, we briefly review the notations and definitions of knockoffs for further discussions.
Let be a response vector of variables and be a covariate of features. We say is a knockoff copy of if it satisfies the following conditions.
Condition 2. (Exact knockoff features)
- (a)
Swap with its knockoff counterpart does not change the joint distribution of for ;
- (b)
Given , is independent of , i.e., \widetilde{\mathbf{x}}\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\mathbf{y}|\mathbf{x}.
Remark 2**.**
Condition 2 (a) requires the original and knockoff features to be pairwise exchangeable. Condition 2 (b) indicates that knockoff features are conditionally independent of response variables; this is trivially satisfied if is generated without using the information of .
Constructing knockoff features that exactly follow Condition 2 is challenging, especially when the dimensionality of features is large. In general, generating exact knockoff features requires the knowledge of the underlying distribution of , which is usually not available in practice. Barber & Candès (2015) studied the variable selection problem with knockoffs for fixed design matrix . The construction of exact knockoffs is not necessary if assuming the response follows a linear regression model with Gaussian error and . A more recent study (Candès et al., 2018) proposed to construct exact knockoff features in a manner of generating sequential conditional independent pairs under the assumption that the distribution of is known.
Without the knowledge of the distribution of , one can construct approximate second-order knockoff features such that is pairwise exchangeable with respect to the first two moments. In other words, the mean vector and covariance matrix of is invariant if we swap and for any . The invariant of mean is trivial, and can be achieved by forcing . Suppose , the second-order pairwise exchangeable condition is equivalent to
[TABLE]
where is a vector that makes a positive semidefinite covariance matrix.
Barber & Candès (2015) introduced two approaches to construct the second-order knockoffs. The first approach is known as the equicorrelated construction, which sets
[TABLE]
where is the minimum eigenvalue of , and denotes the larger one between and . The second method, named semidefinite programme, finds by solving a semidefinite program of the following form
[TABLE]
However, both methods are not directly applicable in the high-dimensional scenarios since both (3.2) and (3.3) require .
Remark 3**.**
When is jointly Gaussian, the equivalence of the first two moments implies the equivalence of the joint distribution and hence (3.1) constructs exact knockoff features. However, when the Gaussian assumption does not hold, the accuracy of the second-order approximation depends on the impact of ignoring higher-order moments of . Also, the covariance matrix is usually unknown and needs to be estimated. Hence the estimation accuracy of also affects the validation of the second-order approximation of exact knockoff features. In addition to sample covariance estimator, more sophisticated estimators of can be obtained under additional structure or moment conditions, see Fan et al. (2013) and Ke et al. (2019), among others. As the above two issues are not of key interest in this paper, we do not pursue further in these directions.
3.3 FDR control with knockoff features
Suppose is a knockoff copy of , we propose to measure the population level dependence between and response by the following quantity
[TABLE]
where and are the projection correlations as defined in (2.2). When is an exact knockoff feature of , will be a non-negative quantity. Further, implies the distribution of depends on and if is independent with conditioning on active features.
Given a random sample drawn from , we estimate by
[TABLE]
where and are sample projection correlations defined in (2.5). Intuitively, a large positive value of provides some evidence that the distribution of depends on . On the other hand, if is an inactive feature, is likely be be small and is equally likely to be positive or negative. This intuition is justified by the following lemma.
Lemma 1**.**
Let be an exact knockoff copy of and . Then
- (i)
for all . 2. (ii)
Conditioning on , follow i.i.d. , where if and 0 otherwise.
For a fixed threshold , the false discovery proportion (FDP) is defined as
[TABLE]
where is the cardinality of a set and we follow the convention that . The false discovery rate (FDR) is defined as the expectation of FDP, i.e., . According to Lemma 1, is equally likely to be positive or negative if is an inactive feature. Therefore, we have
[TABLE]
which leads to a conservative estimation of ,
[TABLE]
To control FDR at a pre-specified level , we follow the knockoff+ procedure (Barber & Candès, 2015) to choose the threshold as
[TABLE]
where . The extra term in the numerator makes the choice of slightly more conservative. Then, the active set is selected as
[TABLE]
3.4 PC-Knockoff procedure
The knockoff feature construction methods discussed in Section 3.2 require and hence are not applicable to high-dimensional scenarios. To address this issue, we propose a two-step procedure, named PC-Knockoff, to screen active features and control the FDR. To avoid mathematical challenges caused by the reuse of sample, we follow the simple sample splitting idea, which has been widely used in statistics and recent examples include hypothesis testing (e.g. Fan et al., 2019), error variance estimation (e.g. Chen et al., 2018), variable selection (e.g. Barber & Candès, 2019), large scale inference (e.g. Fan, Demirkaya, Li & Lv, 2020), and so on. We partition the full sample into two disjoint subsamples with sample sizes and . More specifically, let and be a random partition of , and let follow the same partition. Without loss of generality, we write
[TABLE]
The two steps of PC-Knockoff procedure are introduced as follows:
(1) Screening step: We rank all features in descending order based on the sample projection correlation . Then, we select the top features such that . Denote the set of selected features by . In practice, one can set to be a relatively large value as long as it satisfies .
(2) Knockoff step: Let
[TABLE]
We construct knockoff features for by either the equicorrelated construction as in (3.2) or the semidefinite programme as in (3.3). Denoted the constructed knockoff features for . Then, we calculate
[TABLE]
where and are the th columns of and , respectively. For a pre-specified FDR level , we use (3.6) to choose the threshold and the set of selected active features is given by
[TABLE]
The proposed two-step approach has two advantages. First, we apply the PC-Screen procedure to reduce the number of features from to , which allows us to construct second-order knockoff features. Second, by ruling out inactive features in the screening step, we reduce the total computation cost from to . We summarize the PC-Knockoff procedure in Algorithm 1. In fact, the Algorithm 1 provides a general framework for feature screening with FDR control. One can easily modify Algorithm 1 by replacing PC with other measurement statistics such as Pearson correlation, distance correlation, etc.
Remark 4**.**
We want to add a note that Algorithm 1 is not tuning free as one needs to specify the subsample size and target dimension for the screening step. However, Algorithm 1 is not sensitive to the choice of these two hyper-parameters as Theorem 2 guarantees the sure screening property of the screening step under mild conditions. In practice, we suggest small . As a result, we leave a relatively large subsample for the knockoff step. A relatively large allows more features to be selected in the screening step (larger ) and more accurate second-order knockoff features constructed in the knockoff step.
Denote the event that the sure screening property is satisfied in the screening step, i.e.,
[TABLE]
The sure screening property in Theorem 2 ensures that, with a relatively large choice of , the event holds with high probability. To be specific, let be the order statistics of sample squared projection correlations based on . If Condition 1 (a) holds and , then event holds with probability at least . Conditioning on , the following theorem states that the PC-Knockoff procedure can control the FDR of selected features under the pre-specified level of .
Theorem 4**.**
Let be a knockoff copy of satisfying Condition 2. For any , the set of selected features given by Algorithm 1 satisfies
[TABLE]
Theorem 4 states that, with exact knockoff features and conditioning on , the PC-Knockoff procedure can control the FDR under the pre-specified level . As occurs with probability close to 1, the can be controlled even without conditioning on (Barber & Candès, 2019). We refer to Barber & Candès (2019) and Fan, Demirkaya, Li & Lv (2020) for more discussion regarding control in two-step procedures.
We hope the PC-Knockoff procedure can maintain sure screening property and control FDR simultaneously. This task is challenging as the procedure needs to balance the trade-off between type I and type II errors. To guarantee the sure screening property, the procedure is likely to select an over-fitted model which leads to higher type I errors as well as FDR. On the other hand, the FDR control forces a parsimonious but possibly under-fitted model that can increase type II errors and ruin the sure screening property. In the following, we study the conditions under which this challenging task can be achieved by the PC-Knockoff procedure. Conditioning on , the following theorem states that the simultaneous achievement of sure screening property and FDR control under level , which depends on the relationship between and .
Theorem 5**.**
Under the conditions of Theorem 4 and further assume for some and .
- (i)
If , we have . 2. (ii)
If , we have . Furthermore, if , we have , where is a constant that only depends on .
The part (i) of Theorem 5 together with Theorem 4 suggest that if is chosen to be greater or equal to , the PC-Knockoff procedure enjoys the sure screening property and controls FDR under with high probability. When is chosen to be smaller than , we either recover the active set or end up with an empty set with high probability. The probability of recovering the active set is upper bounded by some constant depending on . Therefore there is no guarantee that PC-Knockoff can select all active features while controlling FDR under . As we know, the value of controls the amount of type I errors that we can tolerate. One can imagine that the smaller the is, the more challenging the task is to achieve sure screening and FDR control simultaneously. It is because we allow fewer and fewer inactive features to be selected. There is a phase transition that happens when goes below . In order to satisfy the sure screening property, the procedure will fail to control the FDR at with non-negligible probability if any inactive feature is selected. In other words, the procedure has to exactly recover the true active set to satisfy FDR control and sure screening simultaneously. We numerically validate this phase transition phenomenon through a simulated example in the supplementary file.
Theorem 5 discourages us to pursue a too aggressive in practice as the PC-Knockoff procedure may lose the sure screening property. The phase transition between part (i) and part (ii) can also be used as a rule of thumb guideline to estimate in practice. For a sequence of grid points of in , we find the largest grid point such that the PC-Knockoff procedure selects an empty set. Then, we can roughly estimate as the integer part of .
Remark 5**.**
As a two-step procedure, the power of the PC-Knockoff is conditioned on the event that the sure screening property is satisfied in the screening step, i.e., . This together with the results in Theorem 5 yield that the probability that the PC-Knockoff procedure does not lose any power is at least . Empirically, one may follow the “data recycling ” idea proposed in Barber & Candès (2019) to improve the power of the PC-Knockoff procedure.
As a byproduct of the PC-Knockoff procedure, the statistic defined in (3.5) can also be used to measure the marginal dependence between the response and the th feature. Given a threshold , we can screen a model based on by considering the following set
[TABLE]
The next theorem states that the screening procedure based on also enjoys the sure screening and rank consistency properties.
Theorem 6**.**
Suppose is an exact knockoff copy for and for some positive constants and , then
- (i)
\mathrm{Pr}\Big{(}{\cal A}\subseteq\widehat{\cal A}_{W}(\delta)\Big{)}\geq 1-O\Big{(}s\exp\{-c_{4}n^{1-2\kappa}\}\Big{)} for . 2. (ii)
.
Recall that the rank consistency property in Theorem 3 requires a minimum signal gap between active and inactive sets, that is . However, the rank consistency result in Theorem 6 only requires a minimum signal strength of active features and no condition is imposed on the inactive set. Due to the construction of , signals of inactive features are canceled out by their knockoff counterparts. As a result, the rank consistency property holds even when some inactive features are spuriously correlated with the response. If we can construct high-quality knockoff features, the screening procedure based on can be more powerful than PC-Screen.
4 Numerical examples
4.1 Screening performance
In this subsection, we use simulated examples to assess the finite sample performance of the proposed projection correlation based feature screening procedure (PC-Screen) and compare it with sure independence screening (Fan & Lv, 2008, SIS), distance correlation based screening (Li et al., 2012, DC-SIS) and bias-corrected distance correlation based screening (Székely & Rizzo, 2014, bcDC-SIS). Within each replication, we rank the features in descending order by the above four screening criteria and record the minimum model size that contains all active features. The screening performance is measured by the and quantiles of the minimum model size over 200 replications. Throughout this subsection, we denote with . To mimic ultra-high dimensional scenario, we set and for each example.
Example 1: Linear and Poisson models
Consider the linear model with . We generate covariates and independently from the following 4 scenarios.
** Model 1.a:**
and .
** Model 1.b:**
and .
** Model 1.c:**
, and .
** Model 1.d:**
, and .
In above models, stands for the -dimensional standard Cauchy distribution which is heavy-tailed. Hence, in Models 1.b – 1.d, at least one of and is heavy-tailed. We also consider the following two Poisson regression models
** Model 1.e:**
(Continuous) , where .
** Model 1.f:**
(Discrete) .
Let and we draw from . Model 1.e is the Poisson regression model with continuous response while the response in Model 1.f is discrete.
The quantiles of the minimum model size that includes all 5 active features are summarized in Table 1. In the linear benchmark Model 1.a, all four competitors perform well. For Models 1.b – 1.d, the SIS completely fails at the presence of heavy-tailed features and errors. Both distance correlation-based methods struggle to maintain a reasonable model size at and quantiles. In contrast, PC-Screen works reasonably well in all scenarios and outperforms the other three methods by big margins. Similarly, for the Poisson models 1.e and 1.f, PC-Screen can recover the true active set with a model size close to 5 while the other three methods can perform as bad as random guesses at and quantiles.
Example 2: Nonlinear models
Consider the following four non-linear data generating models
** Model 2.a:**
** Model 2.b:**
** Model 2.c:**
** Model 2.d:**
Models 2.a and 2.b admit additive structure while Models 2.c and 2.d have more challenging nonlinear structures. In addition, we generate and . Hence, for each model above, the active set contains the first 4 covariates in .
The quantiles of the minimum model size that includes all 4 active features are presented in Table 2. Again, we observe that PC-Screen significantly outperforms the other three methods in all scenarios. For Model 2.a, the quantile of the minimum model size of PC-Screen is exactly four while the other three methods need much larger model sizes to recover the active set. For Model 2.b, DC-SIS and bcDC-SIS preform comparably to PC-Screen at and quantiles but much worse for other higher quantiles. For Models 2.c and 2.d, PC-Screen performs reasonably well while the other methods fail to effectively screen out the inactive features. The 95% quantiles of SIS, DC-SIS, and bcDC-SIS are almost as large as . This indicates, in the worst-case scenario, SIS, DC-SIS, and bcDC-SIS are hopeless to effectively reduce the dimensionality without missing any active feature.
Example 3: Multivariate response models
In this experiment, we investigate the performance of PC-Screen for multivariate response models. We omit SIS method in this experiment as it is not applicable to multivariate response problems. We generate from a bivariate normal distribution with conditional mean and covariance matrix , where and . Following the setting in Li et al. (2012), we set and generate , and from the two models below.
** Model 3.a:**
, and .
** Model 3.b:**
, and
.
The union of the active sets of , and contains the first four covariates in . The simulation results are summarized in Table 3. Again, the PC-Screen method performs strikingly well compared to the other two methods.
4.2 FDR control performance
In this subsection, we use simulated examples to numerically assess the FDR control as well as sure screening property of the proposed PC-Knockoff procedure. We refer to Algorithm 1 for implementation details.
Example 4: FDR control for linear and Poisson models
Consider the following five regression models with ten active variables.
** Model 4.a:**
Same as Model 1.a except that with .
** Model 4.b:**
Same as Model 4.a except that , the distribution with degrees of freedom 2.
** Model 4.c:**
Same as Model 4.a except that , where and are independently drawn from and , respectively.
** Model 4.d:**
Same as Model 1.e except that with .
** Model 4.e:**
Same as Model 1.f except that with .
In this example, we set , and repeat 200 replications for each scenario. In each replication, we randomly divide the sample into two non-overlapping subsamples. The sample size and target dimension in the screening step are set to be and , and the sample size used to construct knockoff features is . In addition, the covariance matrix is set to be with . The performance of FDR control is examined under a sequence of specified levels: and .
We summarize the results in Table 4 in which is the pre-specified FDR level, is the average number of selected variables, the column ‘’ represents the probability that the active variable is selected, the column ‘All’ represents the sure screening probability (i.e., the probability that all active variables are selected) and the column ‘’ is the empirical FDR (i.e., the average of empirical FDP). According to Table 4, the proposed PC-Knockoff procedure controls the empirical FDR under the pre-specified level for almost all scenarios. The only exception is in Model 4.c we have at . In Model 4.c, is drawn from a mixture of multivariate normal and multivariate distributions. As a result, the second-order knockoffs may not approximate the exact ones very well. For the other four models, the second-order knockoffs perform well as features are normally distributed. Besides FDR control, the PC-Knockoff procedure maintains the sure screening property reasonably well in all scenarios. For Models 4.a – 4.d, the probabilities of selecting all active variables are ranging from to regardless of the pre-specified level . Even for the more challenging scenario Model 4.e, the probability of selecting all active variables simultaneously is greater than when . Please notice that this is achieved under very small model sizes (i.e., small ), thanks to the knockoff features.
To illustrate how the two steps in PC-Knockoff work together, we also compare the active set selected by the screening step without knockoff features (i.e. selected by PC-Screen) and the active set selected by PC-Knockoff. Table 5 summarizes the average number of selected features , the sure screening probability ‘All’ and empirical false discovery rate for Models 4.a – 4.e. For the screening step, we conservatively select active features by including the top features. The results show that enjoys the sure screening property but has a very high empirical FDR (). On the contrary, the active set selected by the PC-Knockoff is able to control the FDR below the pre-specified level with only a little sacrifice of power.
4.3 Supermarket data
In this subsection, we apply the PC-Knockoff procedure to study a supermarket dataset (Wang, 2009; Chen et al., 2018). The dataset consists of observations of daily records from a supermarket. The response is the number of customers visited the supermarket on that day. The covariates are sale volumes of products. Due to data privacy, the detailed product codes are not released in the dataset. Instead, we name the covariates by their indices, i.e., . Both the response and predictors have been standardized to have zero mean and unit variance. The goal is to screen a parsimonious set of products whose sale volumes significantly contribute to the daily number of customers. Meanwhile, we want to control the FDR at level .
The implementation of PC-Knockoff follows Algorithm 1. We set and in the screening step. That is, we randomly choose observations and use PC-Screen to pre-screen 50 features in the screening step. Then we construct second-order knockoffs for the pre-screened 50 features using the remaining 264 observations in the knockoff step. Under the pre-specified FDR level , the PC-Knockoff procedure selects a model of variables: , and . For each selected variable, we draw a scatter plot between this variable and the response. We observe obvious outliers in the scatter plots of and . We report this observation in Figure 1, where red triangles denote potential outliers and blue dots denote the rest data points. Besides, red dashed curves and blue solid curves are the fitted local polynomial regression curves with and without potential outliers, receptively. The gray shaded areas are the confidence regions. By comparing blue and red curves, we find the existence of outliers visually alters the fitted curves. This justifies the PC-Knockoff procedure is insensitive to the presence of outliers. The scatter plots between the response and other nine variables (, , , , , , , and ) are presented in Figure 2, which includes the local polynomial regression curves with the confidence regions. Figure 2 indicates that PC-Knockoff can detect various functional relationships.
This supermarket dataset has also been analyzed by Chen et al. (2018). They first apply DC-SIS to screen variables and then fit an additive model with selected variables. They further employ the Wald’s -test with the refitted cross-validation error variance estimate to determine if the selected features are significant at a pre-specified level. As a result, Chen et al. (2018) selected seven significant variables and . Next, we compare the in-sample fitting and out-of-sample prediction performance between the model selected by PC-Knockoff and the one in Chen et al. (2018) through a bootstrap experiment. In each replication, we randomly split the dataset into a training set of size 400 and a test set of size 64. We fit two additive models with the features selected by the two competitive methods, respectively. Then we calculate and record the training and testing s for the two models. We repeat it for 200 replications. The sample mean and sample standard deviation of s are reported in Table 6, which show that the model selected by PC-Knockoff yields higher sample means of for both training set and test set.
Due to the limited space, we present an additional real application to microarray data in Section S.2 of the supplementary file.
5 Conclusion
We study the feature screening problems under a general setup. The proposed PC-Screen method developed in Section 2 is novel in the sense that it does not impose any regression model assumption and is robust against possibly heavy-tailed data. The response variable is also allowed to be multidimensional. Regarding the theoretical aspect, the asymptotic results developed in Zhu et al. (2017) do not apply to the “large small problem”, and hence we develop non-asymptotic results for the empirical projection correlation. The theoretical analysis shows that the PC-Screen method satisfies sure screening and rank consistency properties. Numerically, we show PC-Screen outperforms popular competitors over various data generative processes. Moreover, this paper tackles the false discovery rate control problem in feature screening. In Section 3, we propose a two-step procedure named PC-Knockoff to determine the threshold of active features. With the sample splitting idea, we first screen the ultra-high dimensional features to a moderate model size and then further select a model with FDR control using a statistic built upon knockoff features. The PC-Knockoff procedure controls FDR under a pre-specified level while maintaining sure screening property with high probability. In practice, the PC-Knockoff procedure works well in various scenarios.
Appendix: Proof of results in Section 3
In the appendix, we provide the proofs of Theorem 4 and Theorem 5 in Section 3. We relegate the proofs of remaining results as well as some additional numerical examples to a supplementary file.
A.1 Proof the Theorem 4
Denote , the set of features that are selected in the screening step. Throughout this proof, we restrict ourselves to the event . According to Theorem 2, the probability of is at least . To simplify the notations, we omit the condition on .
Denote by and the intersections between active and inactive sets with , respectively. Notice that any feature with will not be included in the final selected set . Hence, without loss of generality, we assume and . For ease of presentation, we further define . Then we have
[TABLE]
The first inequality holds since and the second inequality is due to the definition of in (3.6). In order to find , one can simply try different values of starting from the smallest value , then move to the second smallest value , then move to , and so on. The procedure stops as soon as it finds a value of satisfying (3.6). In this process, can be regarded as a stopping time. More rigorously, for , we define
[TABLE]
where and . Let be the -algebra generated by where and
[TABLE]
As a result, given , we know whether is in the active set or not.
Next we show that the process is a super-martingale running backward with respect to the filtration . On one hand, if , we have , and thus . On the other hand, if , we have
[TABLE]
where and is the indicator function. Let with the inactive set . From Lemma 1 part (ii), we know that are i.i.d. Bernoulli(0.5) random variables. Thus conditioning on , (hence , are known), we have
[TABLE]
Thus in the case ,
[TABLE]
Therefore, , implying that is a super-martingale with respect to . By definition, is a stopping time with respect to the backward filtration . According to the optional stopping theorem for super-martingale, we know
[TABLE]
where . Since , we have
[TABLE]
Therefore . From (A.1), we have
[TABLE]
Since , we have . Then, we conclude
[TABLE]
A.2 Proof of Theorem 5
Now we restrict ourselves to the subset . For , define , , , and . Theorem 1 implies that
[TABLE]
Thus we have
[TABLE]
On the other hand, part (i) in Lemma 1 implies that for all . We have
[TABLE]
where the last equality is implied by Theorem 1 and the assumption . Since and , (A.2) implies that with probability at least . Together with (A.3), we know that with probability . In other words, with probability at least , the active features will be ranked ahead of the inactive features. Recall that in order to find the cutoff value , we start from the smallest value , then move to the second smallest value , then move to , and so on. The procedure stops once it finds a value of satisfying (3.6). Restrict on the event , which holds with probability at least and let . If , then
[TABLE]
From the inequality above, we known this process must stop no later than reaches and hence . As a result,
[TABLE]
If , in order to satisfy (3.6), we must have or . indicates all active features are selected and leads to an empty , as stated in the second part. Notice that satisfies the rule (3.6) is equivalent to that at most of the signs of are negative. Let be the probability that satisfies (3.6) and be the probability that the process stops at for . Let , we have
[TABLE]
It is easy to verify that for ,
[TABLE]
where is some constant only depending on . This completes the proof.
Acknowledgment: We would like to thank the AE and reviewers for their constructive comments, which lead to significant improvement of this work. Liu’s research was supported by National Natural Science Foundation of China (No. 11771361, 11671334, 11871409), Basic Scientific Center Project 71988101 of National Science Foundation of China and JAS14007. Li’s research was supported by NSF grants DMS 1820702 and 1953196. All authors equally contributed to this work, and the authors are listed in seniority. Both Yuan Ke and Jingyuan Liu are designated to be the corresponding authors of this paper.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1)
- 2Barber & Candès (2015) Barber, R. F. & Candès, E. J. (2015), ‘Controlling the false discovery rate via knockoffs’, The Annals of Statistics 43 (5), 2055–2085.
- 3Barber & Candès (2019) Barber, R. F. & Candès, E. J. (2019), ‘A knockoff filter for high-dimensional selective inference’, The Annals of Statistics 47 (5), 2504–2537.
- 4Candès et al. (2018) Candès, E., Fan, Y., Janson, L. & Lv, J. (2018), ‘Panning for gold: ‘Model-X’ knockoffs for high-dimensional controlled variable selection’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (3), 551–577.
- 5Caruana (1997) Caruana, R. (1997), ‘Multitask learning’, Machine learning 28 (1), 41–75.
- 6Chen et al. (2018) Chen, Z., Fan, J. & Li, R. (2018), ‘Error variance estimation in ultrahigh-dimensional additive models’, Journal of the American Statistical Association 113 (521), 315–327.
- 7Fan et al. (2011) Fan, J., Feng, Y. & Song, R. (2011), ‘Nonparametric independence screening in sparse ultra-high-dimensional additive models’, Journal of the American Statistical Association 106 (494), 544–557.
- 8Fan et al. (2019) Fan, J., Ke, Y., Sun, Q. & Zhou, W.-X. (2019), ‘Farmtest: Factor-adjusted robust multiple testing with approximate false discovery control’, Journal of the American Statistical Association 114 (528), 1880–1893.
