Iterated Feature Screening based on Distance Correlation for Ultrahigh-Dimensional Censored Data with Covariates Measurement Error
Li-Pang Chen

TL;DR
This paper introduces an iterated feature screening method based on distance correlation designed for ultrahigh-dimensional survival data with covariate measurement error, effectively identifying important variables despite complex data issues.
Contribution
It proposes a novel iterative feature screening approach that handles both censoring and measurement error in ultrahigh-dimensional survival data, improving variable selection accuracy.
Findings
The method outperforms existing screening techniques in simulations.
It effectively detects important covariates with complex dependencies.
Application to real datasets demonstrates practical utility.

| Model | | Method | Feature screening | | | | | Iterated feature screening | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PH | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.998 | 0.998 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| PO | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | | | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |

| Model | | Method | Feature screening | | | | | Iterated feature screening | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PH | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.994 | 0.994 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | | | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 | 1.000 | 1.000 | 1.000 | 0.998 | 0.998 |
| PO | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | | | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |

| Model | | Method | Feature screening | | | | | Iterated feature screening | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PH | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.004 | 0.003 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.001 | 0.001 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.001 | 0.001 | 1.000 | 1.000 | 1.000 | 0.994 | 0.994 |
| | | | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.998 | 0.998 |
| PO | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.008 | 0.008 | 1.000 | 1.000 | 1.000 | 0.009 | 0.009 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.008 | 0.008 | 1.000 | 1.000 | 1.000 | 0.998 | 0.998 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.997 | 0.997 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 1.000 | 1.000 | 1.000 | 0.996 | 0.996 |
| | | | 1.000 | 1.000 | 1.000 | 0.006 | 0.006 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | 0.15 | Naive | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | 0.50 | Naive | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.002 | 0.002 | 1.000 | 1.000 | 1.000 | 0.995 | 0.995 |
| | 0.75 | Naive | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 |
| | | Propose | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 | 0.994 | 0.994 |
| | | | 1.000 | 1.000 | 1.000 | 0.003 | 0.003 | 1.000 | 1.000 | 1.000 | 0.998 | 0.998 |

| # | FS (naive) | IFS (naive) | FS | IFS | FS | IFS | FS | IFS |
|---|---|---|---|---|---|---|---|---|
| 1 | 16587 | 16587 | 16587 | 16587 | 16587 | 16587 | 16587 | 16587 | |||
| 2 | 24719 | 24719 | 24719 | 24719 | 24719 | 24719 | 24719 | 24719 | |||
| 3 | 27057 | 27057 | 27057 | 27057 | 27057 | 27057 | 27057 | 27057 | |||
| 4 | 28581 | 28581 | 28581 | 28581 | 28581 | 28581 | 28581 | 28581 | |||
| 5 | 31420 | 31420 | 31420 | 31420 | 31420 | 31420 | 31420 | 31420 | |||
| 6 | 34790 | 34790 | 34790 | 34790 | 34790 | 34790 | 34790 | 34790 | |||
| 7 | 28581 | 28581 | 28581 | 28581 | 28581 | 28581 | 28581 | 28581 | |||
| 8 | 16312 | 29357 | 16312 | 29357 | 16312 | 29357 | 30157 | 30157 | |||
| 9 | 34771 | 29897 | 26537 | 29897 | 17053 | 29897 | 27116 | 28872 | |||
| 10 | 28346 | 30620 | 29637 | 30620 | 30917 | 30620 | 30334 | 32699 | |||
| 11 | 26521 | 30898 | 16587 | 30898 | 30929 | 30898 | 27762 | 27095 | |||
| 12 | 34375 | 32699 | 17053 | 32699 | 31972 | 32699 | 17326 | 24710 | |||
| 13 | 29642 | 15843 | 28346 | 15843 | 29637 | 15844 | 27019 | 19325 | |||
| 14 | 26537 | 15924 | 28908 | 15924 | 17605 | 15924 | 27762 | 30282 | |||
| 15 | 17605 | 27927 | 32519 | 27927 | 28346 | 27931 | 17176 | 32187 | |||
| 16 | 28920 | 28929 | 26521 | 28929 | 34771 | 28929 | 23887 | 29209 | |||
| 17 | 29657 | 34339 | 34364 | 34375 | 28908 | 34375 | 17343 | 16528 | |||
| 18 | 32519 | 34913 | 34667 | 32475 | 34651 | 32475 | 32699 | 27019 | |||
| 19 | 34651 | 26510 | 34771 | 26510 | 16079 | 26510 | 30157 | 23887 | |||
| 20 | 28908 | 27530 | 27931 | 34913 | 26537 | 27530 | 17917 | 16020 | |||

| # | FS (naive) | IFS (naive) | FS | IFS | FS | IFS | FS | IFS |
|---|---|---|---|---|---|---|---|---|
| 1 | NM 016359 | NM 016359 | NM 016359 | NM 016359 | NM 016359 | NM 016359 | NM 016359 | NM 016359 | |||
| 2 | AA555029 RC | AA555029 RC | AA555029 RC | AA555029 RC | AA555029 RC | AA555029 RC | AA555029 RC | AA555029 RC | |||
| 3 | NM 003748 | NM 003748 | NM 003748 | NM 003748 | NM 003748 | NM 003748 | NM 003748 | NM 003748 | |||
| 4 | Contig38288 RC | Contig38288 RC | Contig38288 RC | Contig38288 RC | Contig38288 RC | Contig38288 RC | Contig38288 RC | Contig38288 RC | |||
| 5 | NM 003862 | NM 003862 | NM 003862 | NM 003862 | NM 003862 | NM 003862 | NM 003862 | NM 003862 | |||
| 6 | Contig28552 RC | Contig28552 RC | Contig28552 RC | Contig28552 RC | Contig28552 RC | Contig28552 RC | Contig28552 RC | Contig28552 RC | |||
| 7 | Contig32125 RC | Contig32125 RC | Contig32125 RC | Contig32125 RC | Contig32125 RC | Contig32125 RC | Contig32125 RC | Contig32125 RC | |||
| 8 | AB037863 | Contig036649 RC | Contig55725 RC | Contig036649 RC | Contig55725 RC | Contig036649 RC | NM 000599 | NM 000599 | |||
| 9 | Contig036649 RC | Contig46218 RC | AF201905 | Contig46218 RC | AB037863 | Contig46218 RC | Contig46223 | NM 005915 | |||
| 10 | X05610 | AB037863 | AB037863 | AB037863 | AF201905 | AB037863 | AF257175 | Contig46223 | |||
| 11 | AL080079 | NM 020188 | Contig48328 RC | NM 020188 | Contig036649 RC | NM 020188 | NM 006931 | X05610 | |||
| 12 | NM 006931 | Contig55377 RC | Contig036649 RC | Contig25991 | X05610 | Contig25991 | AK000745 | AK000745 | |||
| 13 | AF201905 | Contig48328 RC | AL080079 | Contig55377 RC | NM 018354 | Contig48328 RC | NM 005915 | NM 005915 | |||
| 14 | NM 003875 | Contig25991 | X05610 | Contig46223 RC | AL080079 | Contig55377 RC | NM 001282 | NM 001282 | |||
| 15 | Contig55725 RC | NM 003875 | Contig55725 RC | NM 003875 | Contig55725 RC | NM 003875 | AL080079 | NM 614321 | |||
| 16 | Contig48328 RC | NM 006101 | NM 018354 | NM 006101 | NM 006931 | NM 006101 | NM 014889 | AF257175 | |||
| 17 | NM 000599 | NM 003882 | NM 003875 | NM 003607 | Contig48328 RC | NM 000849 | Contig55725 RC | NM 014889 | |||
| 18 | NM 018354 | NM 016577 | NM 006931 | NM 003882 | NM 003875 | NM 016577 | NM 614321 | AF201905 | |||
Li-Pang Chen (corresponding author), Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1, [email protected]
**Abstract**
Feature screening is an important method for reducing dimension and capturing informative variables in ultrahigh-dimensional data analysis. Many feature screening methods have been developed. These methods, however, are challenged by complex features arising from the data collection as well as from the nature of the data themselves. In survival analysis, an incomplete response caused by right-censoring is often accompanied by measurement error in the covariates. Even though many methods have been proposed for censored data, little work is available for the case where incomplete response and measurement error occur simultaneously. In addition, conventional feature screening methods may fail to detect truly important covariates that are marginally independent of the response variable because of correlations among covariates. In this paper, we explore this important problem and propose a valid feature screening method for survival data with measurement error. We also develop an iteration method to improve the accuracy of selecting all important covariates. Numerical studies are reported to assess the performance of the proposed method. Finally, we apply the proposed method to two real datasets.
**Keywords**: Distance correlation; feature screening; measurement error; survival data; ultrahigh-dimensional data.
**Short title**: Iterated feature screening for censored data and measurement error
1 Introduction
Ultrahigh-dimensional data appear in various scientific research areas, including genetics, finance, and survival analysis. In regression analysis, ultrahigh-dimensional data are very difficult to analyze because they contain many unimportant variables, in the sense that those variables are not highly correlated with the response. In addition, the sample covariance matrix of the variables is usually singular because the number of variables far exceeds the sample size. As a result, we should select the informative variables before constructing regression models. Moreover, to cope with ultrahigh dimensionality, the assumption of sparsity is imposed: only a small number of predictors are associated with the response.
In the early development of variable selection, Akaike’s Information Criterion (AIC) (Akaike 1973) and the Bayesian Information Criterion (BIC) (Schwarz 1978) were two well-known conventional criteria. Both search over all possible combinations of variables so that the optimal solution is achieved. For ultrahigh-dimensional data, however, it is nearly impossible to search for the final model through all possible combinations of variables. In the past two decades, several regularization methods have been proposed to select variables, including the LASSO (Tibshirani 1996), SCAD (Fan and Li 2001), LARS (Efron et al. 2004), the elastic net (Zou and Hastie 2005), the adaptive LASSO (Zou 2006), and the Dantzig selector (Candes and Tao 2007). However, those methods are mainly designed for high-dimensional data in which the number of variables is still smaller than the sample size, and they may perform worse for ultrahigh-dimensional data.
To handle ultrahigh-dimensional data with stable computation and accurate selection, Fan and Lv (2008) first proposed the sure independence screening (SIS) procedure for the ultrahigh-dimensional linear model, which uses the Pearson correlation to rank the importance of each predictor. Hall and Miller (2009) developed a bootstrap procedure to rank the importance of each predictor based on the Pearson correlation between the response and the predictors. Fan et al. (2009) and Fan and Song (2010) ranked the importance of each predictor through the marginal maximum likelihood. Unlike the SIS method, which specifies the model structure, Zhu et al. (2011) and Li et al. (2012) proposed model-free feature screening to capture the informative covariates in ultrahigh-dimensional data.
Even though feature screening methods for ultrahigh-dimensional data have been developed, research gaps remain. Specifically, in survival analysis with genetic data, the response (failure time) is usually incomplete due to right-censoring, and the covariates are usually contaminated with measurement error. It is not trivial to apply conventional feature screening methods to such data. When the response is incomplete (survival data) but the covariates are measured precisely, some valid methods have been proposed. To name a few, Fan et al. (2010) proposed an SIS method restricted to the Cox model. Song et al. (2014) proposed censored rank independence screening. Yan et al. (2017) proposed Spearman rank correlation screening. Chen et al. (2018) developed robust feature screening based on distance correlation. Chen et al. (2019) considered model-free survival conditional feature screening. In the presence of measurement error, however, it is unknown whether those existing methods can determine the “correct” features from the surrogate version of the covariates.
The other crucial issue is the accuracy of feature screening. Because conventional SIS methods rank the importance of each predictor through marginal utilities, they may fail to detect truly important predictors that are marginally independent of the response because of correlations among predictors. A detailed example is deferred to Section 2.4. To overcome this problem, Fan and Lv (2008) proposed the iterative SIS method, and Zhong and Zhu (2015) developed the iterated distance correlation to improve the accuracy of variable screening. These methods, however, assume complete data that are free of mismeasurement. For ultrahigh-dimensional survival data with measurement error in covariates, no method is available to deal with this problem. We therefore explore this problem with both survival data and covariate measurement error incorporated. In our development, we first present the distance correlation with error correction for feature screening. Under this approach, the set of selected surrogate variables is the same as the set of selected unobserved variables. We then propose a valid iterated procedure with error correction to improve the accuracy of feature screening. In particular, our proposed method requires neither a model specification nor a specification of the distribution of the covariates.
The remainder of the article is organized as follows. In Section 2, we introduce survival data with right-censoring, the measurement error model, and the distance correlation method. In Section 3, we propose the iteration algorithm of the feature screening procedure for censored data with covariate measurement error. Empirical studies, including simulation results and real data analysis, are provided in Sections 4 and 5, respectively. We conclude the article with a discussion in Section 6.
2 Notation and Model
2.1 Survival Data
In survival analysis, the response is usually incomplete due to the presence of the censoring time. Specifically, let be the failure time and be the censoring time. Then let and denote , where is the indicator function. Let be the -dimensional random vector of covariates. Suppose that we have a sample of subjects and that for , has the same distribution as and represents realizations of . Let denote a maximum support of the failure time. Some regular conditions are imposed.
- (C1)
, where is an upper bound of failure times which is assumed to be finite and is the risk set.
- (C2)
Censoring time is non-informative. That is, the failure time and the censoring time are independent.
2.2 Measurement Error Model
Let denote the surrogate, or observed covariate, of . Let and be the covariance matrices of and , respectively. For , has the same distribution as . Let denote the realizations of . In this paper, we focus on the following measurement error model
[TABLE]
for , where is independent of , with covariance matrix . Here can be known or unknown. Hence, to discuss and its estimation, we consider the following three scenarios:
Scenario I: is known.
In this scenario, is a constant matrix, so the analysis is straightforward.
Scenario II: is unknown and repeated measurements are available.
The measurement error model (1) with repeated measurements is given by
[TABLE]
for and , where the and independent to . Using the method of moments, we estimate by
[TABLE]
where .
Scenario III: is unknown and validation data are available.
Suppose that is the set of subjects in the main study, containing subjects, and is the set of subjects in the external validation study, containing subjects. Assume that and do not overlap. Therefore, the available data contain measurements from the main study and from the validation sample. Hence, for the measurement error model, we have
[TABLE]
for , where the and independent to . In this case, applying the least square regression method gives
[TABLE]
where .
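For Scenario II, the method-of-moments idea can be illustrated numerically. The sketch below is not the paper's estimator (2) verbatim, since that display is not reproduced here; it assumes the usual additive model in which every replicate equals the true covariate plus independent mean-zero error, and it uses the pooled within-subject covariance. The function name `estimate_error_cov` and the array layout are hypothetical choices made for illustration.

```python
import numpy as np

def estimate_error_cov(x_star):
    """Pooled within-subject covariance of replicated surrogate measurements.

    x_star has shape (n, m, p): n subjects, m replicates, p covariates.
    Assumption: x_star[i, r] = x[i] + eps[i, r] with independent, mean-zero eps.
    """
    x_star = np.asarray(x_star, dtype=float)
    n, m, p = x_star.shape
    if m < 2:
        raise ValueError("need at least two replicates per subject")
    # Deviations from each subject's replicate mean cancel the unobserved x[i].
    dev = x_star - x_star.mean(axis=1, keepdims=True)
    # Sum the outer products over subjects and replicates.
    pooled = np.einsum("irj,irk->jk", dev, dev)
    # Each subject contributes (m - 1) degrees of freedom.
    return pooled / (n * (m - 1))

# Illustrative data: the error variances below are arbitrary example values.
rng = np.random.default_rng(0)
n, m, p = 2000, 2, 3
sigma_eps = np.diag([0.15, 0.50, 0.75])
x = rng.normal(size=(n, 1, p))                      # true covariates, unobserved in practice
eps = rng.multivariate_normal(np.zeros(p), sigma_eps, size=(n, m))
sigma_hat = estimate_error_cov(x + eps)
print(np.round(sigma_hat, 2))
```

Because the subject-level mean is subtracted before pooling, the estimate does not depend on the distribution of the true covariates, which is consistent with the distribution-free spirit of the method.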
2.3 Review of the Distance Correlation Method
In this section, we briefly review the distance correlation (DC) method, which was first proposed by Székely et al. (2007).
Let and denote the characteristic functions of two random vectors and , respectively, and let be the joint characteristic function of and . Let for any complex function , where is the conjugate of . The distance covariance between and is defined as
[TABLE]
where and are dimensions of and , respectively, and
[TABLE]
with and is the Euclidean norm of any vector . Therefore, the DC is defined as
[TABLE]
Székely et al. (2007) showed that two random vectors and are independent if and only if . This property motivates the use of DC for feature screening, to identify which covariates are associated with the response (e.g., Li et al. 2012). The detailed estimation of (4) can be found in Li et al. (2012).
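To make the independence property concrete, the following sketch computes the standard sample distance correlation of Székely et al. (2007) from double-centered pairwise distance matrices. It is a plain illustration without censoring or error correction, and the function names are hypothetical. The example uses a response that is uncorrelated with the covariate yet fully determined by it, which is the kind of dependence the Pearson correlation misses.

```python
import numpy as np

def _centered_distances(z):
    """Double-centered matrix of pairwise Euclidean distances."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(u, v):
    """Sample distance correlation between two samples with the same number of rows."""
    a, b = _centered_distances(u), _centered_distances(v)
    dcov2_uv = (a * b).mean()          # squared sample distance covariance
    dcov2_uu = (a * a).mean()
    dcov2_vv = (b * b).mean()
    denom = np.sqrt(dcov2_uu * dcov2_vv)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2_uv, 0.0) / denom))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(size=500)
print(np.corrcoef(x, x ** 2)[0, 1])          # Pearson correlation, close to zero
print(distance_correlation(x, x ** 2))       # clearly positive: dependence detected
print(distance_correlation(x, noise))        # small: independent draws
```

The sample statistic is never exactly zero for independent data, which is why screening works by ranking or thresholding rather than testing for an exact zero.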
2.4 Potential Problem in Conventional Screening Method
As discussed in Section 1, even though many feature screening methods have been proposed, those methods may fail to capture all important variables because some variables are highly correlated with others. To see this problem explicitly, we consider the following regression model, which was adopted by Fan and Lv (2008)
[TABLE]
where with is a vector of covariates and each is generated from the normal distribution with mean zero and unit variance. The correlations of all except are , while has the correlation with all other variables.
Clearly, variables with are included in model (5). With feature screening based on the conventional distance correlation method, we can only identify and , while there is a large probability that cannot be identified due to .
This simple example shows that the conventional feature screening method can fail to select all important variables. To successfully identify the variable , Fan and Lv (2008) proposed the iterated SIS method, and Zhong and Zhu (2015) considered the iterated distance correlation method. In survival analysis, however, the response in model (5) is usually incomplete due to right-censoring, so it is not trivial to apply those conventional methods to survival data. The other challenge comes from the mismeasurement of covariates. Specifically, variables - in model (5) may be contaminated with measurement error, and we only have the surrogate variables -. It is expected that the important variables labeled 1 to 4 cannot be identified if we ignore the impact of mismeasurement. As a result, it is also crucial to account for the measurement error effect.
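The marginal invisibility of the fourth covariate can be checked with a few lines of arithmetic. The sketch below assumes that model (5) takes the form used in Fan and Lv (2008), namely a linear model with coefficients 5, 5, 5 and -15·sqrt(rho) on the first four covariates, with rho = 0.5; the equation itself is not reproduced in this text, so these values are an assumption. The covariance between the covariates and the response is then the covariance matrix times the coefficient vector.

```python
import numpy as np

rho = 0.5            # assumed common correlation
p = 6                # four active covariates plus two inactive ones, for illustration

# Correlation rho between all pairs, except that covariate 4 (index 3)
# has correlation sqrt(rho) with every other covariate.
sigma = np.full((p, p), rho)
sigma[3, :] = sigma[:, 3] = np.sqrt(rho)
np.fill_diagonal(sigma, 1.0)

# Assumed coefficients of model (5), following Fan and Lv (2008).
beta = np.zeros(p)
beta[:3] = 5.0
beta[3] = -15.0 * np.sqrt(rho)

# Cov(X, Y) = Sigma @ beta when the noise is independent of X.
cov_xy = sigma @ beta
print(np.round(cov_xy, 6))
```

Under these assumptions the fourth entry is 3 · 5 · sqrt(rho) - 15 · sqrt(rho) = 0, so the fourth covariate is marginally uncorrelated with the response even though its coefficient is the largest in magnitude. With jointly normal variables, zero covariance implies independence, so a marginal utility such as distance correlation cannot separate it from the inactive covariates.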
3 The Proposed Method
3.1 Feature Screening for Censored Data and Measurement Error
To present the setting, we start from the unobserved covariate .
Let denote the conditional distribution function of given , and let
[TABLE]
denote the active set which contains all relevant predictors for the response with and , and is the complement of which contains all irrelevant predictors for the response . In this case, let denote the vector containing all the active predictors, and let be the vector containing all the irrelevant predictors.
If is complete, i.e., , then it is straightforward to implement conventional methods to determine the active set. However, if is incomplete, i.e., right-censoring occurs, then we impute by (Buckley and James 1979)
[TABLE]
indicating that (Miller 1981, p. 151). In addition, by Condition (C1) in Section 2.1, can be written as
[TABLE]
where and are the density and distribution functions of , respectively. Moreover, can be estimated by
[TABLE]
where is the Kaplan-Meier estimator of . As a result, the estimator of , denoted as , is determined by replacing with , and thus, we have
[TABLE]
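The imputation step can be sketched as follows. This is a minimal illustration, assuming the Buckley and James (1979) form in which a censored observation is replaced by the conditional mean of the failure time beyond the censoring point, with the distribution estimated by Kaplan-Meier; the exact displays above are not reproduced here, so the details are assumptions. The function names `kaplan_meier` and `impute_censored` are hypothetical.

```python
import numpy as np

def kaplan_meier(y, delta):
    """Distinct event times and the Kaplan-Meier survival estimate at each of them."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    times = np.unique(y[delta == 1])
    at_risk = np.array([(y >= t).sum() for t in times])
    events = np.array([((y == t) & (delta == 1)).sum() for t in times])
    surv = np.cumprod(1.0 - events / at_risk)
    return times, surv

def impute_censored(y, delta):
    """Replace each censored time by an estimated conditional mean of later event times."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    times, surv = kaplan_meier(y, delta)
    # Probability mass that the Kaplan-Meier estimate puts on each event time.
    mass = -np.diff(np.concatenate(([1.0], surv)))
    y_star = y.copy()
    for i in np.flatnonzero(delta == 0):
        later = times > y[i]
        # Estimated survival probability at the censoring time y[i].
        idx = np.searchsorted(times, y[i], side="right")
        s_yi = 1.0 if idx == 0 else surv[idx - 1]
        if later.any() and s_yi > 0:
            y_star[i] = (times[later] * mass[later]).sum() / s_yi
        # Otherwise no event time lies beyond y[i]; the observed value is kept.
    return y_star

y = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 1, 1])        # the second subject is censored at time 2
print(impute_censored(y, delta))
```

In this toy example the censored time 2 is replaced by (3 · 0.375 + 4 · 0.375) / 0.75 = 3.5, the Kaplan-Meier conditional mean of the later event times. When the largest observation is censored, the Kaplan-Meier estimate does not reach zero and some tail-handling convention is needed; the sketch simply leaves such an observation unchanged.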
Finally, the crucial target is the determination of the active set . In the presence of measurement error, we adopt the DC method described in Section 2.3 with modification. Let denote the characteristic function of , where is a complex number with . Define
[TABLE]
and
[TABLE]
If is unknown, then it can be estimated based on repeated measurement or validation in (2) or (3). Therefore, we define
[TABLE]
and
[TABLE]
As a result, to select features, it suffices to consider
[TABLE]
for , and the corresponding estimator is
[TABLE]
As suggested by Li et al. (2012), let the threshold value be for some constants and , then the estimated active set is given by
[TABLE]
To see the validity of the criterion (7), we have the following theorem:
Theorem 3.1
Active features based on and are the same. That is, for every ,
[TABLE]
or
[TABLE]
where is determined by implementing and to (4).
Theorem 3.1 shows that, under the feature selection criterion (7), the true and surrogate covariates share the same active set . Furthermore, a derivation similar to that in Li et al. (2012) shows that has the sure screening property in the sense that as . We can therefore decompose the measurement error model (1) as
[TABLE]
where , , and . The covariance matrix can be further decomposed as
[TABLE]
where is the covariance matrix based on (10a), is the covariance matrix based on (10b), and is the covariance matrix based on the interaction of (10a) and (10b).
3.2 Iteration Algorithm
As motivated by the example in Section 2.4, directly implementing (7) may miss some important variables. To increase the probability of selecting all important variables, we modify the selection criterion (7) and develop an iterated feature screening procedure.
The key idea is as follows. We first implement the feature screening criterion (7) to determine and . Note that some potentially important variables may remain in without having been identified. Therefore, to determine the other important variables in , a natural approach is to remove the correlation between and by regressing onto . The residuals obtained from this linear regression are then uncorrelated with , so other important variables in can be identified from the residuals and .
Specifically, to present the idea explicitly, we provide the following iteration algorithm:
Step 1: Initial determination of the active set.
Let denote the covariate matrix, where is a -dimensional vector of the th covariate with .
In this stage, we first implement (7) to determine the initial active set and the corresponding relevant covariate is with dimension . Let denote the irrelevant covariates matrix with dimension such that . In addition, based on feature selection criterion (7) and Theorem 3.1, the active set based on the surrogate variables is equal to the set based on the true covariates. Therefore, we also have .
Step 2: Improvement.
In this stage, we aim to search for other important variables in . Our main approach is to regress onto and update the active set through the residuals.
In this paper, we consider the multivariate linear regression model, and the ordinary least squares criterion is given by
[TABLE]
where is the -norm and is the parameter matrix with dimension . The corresponding score function is
[TABLE]
However, in the presence of covariate measurement error , we only observe and , and the score function becomes
[TABLE]
It is well known that directly solving may yield an estimator of with substantial bias (e.g., Carroll et al. 2006). Instead, by a simple calculation, we obtain
[TABLE]
such that
[TABLE]
indicating that is a suitable score function that corrects for the error-prone variables. Therefore, the estimator of based on (13) is given by
[TABLE]
Based on (14) and the surrogate variables, define
[TABLE]
In fact, is an exact formulation of the residual and thus contains the covariate information in and is uncorrelated with . Therefore, implementing (7) with gives the active set based on .
Step 3: Update of the active set.
Update the active set by and continue Step 2 until no further covariates are included. The final model is .
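The effect of the error correction in Step 2 can be illustrated with the standard moment-corrected least squares estimator (see, e.g., Carroll et al. 2006), which subtracts the error covariance from the Gram matrix of the surrogates. This is a sketch of that general idea, not necessarily the exact form of (14), whose display is not reproduced here. It assumes centered covariates, a known error covariance, and measurement errors that are uncorrelated between the active and remaining covariates unless a cross-covariance is supplied. The function name `corrected_residual` and its arguments are hypothetical.

```python
import numpy as np

def corrected_residual(xs_active, xs_rest, sigma_aa, sigma_ar=None):
    """Regress surrogate 'rest' covariates on surrogate 'active' ones with error correction.

    sigma_aa: error covariance of the active block (assumed known or pre-estimated).
    sigma_ar: error cross-covariance between blocks; taken as zero if omitted.
    """
    n = xs_active.shape[0]
    if sigma_ar is None:
        sigma_ar = np.zeros((xs_active.shape[1], xs_rest.shape[1]))
    # Subtracting n * Sigma removes the inflation caused by measurement error.
    gram = xs_active.T @ xs_active - n * sigma_aa
    cross = xs_active.T @ xs_rest - n * sigma_ar
    b_hat = np.linalg.solve(gram, cross)
    return xs_rest - xs_active @ b_hat, b_hat

# Illustrative simulation with arbitrary example values.
rng = np.random.default_rng(1)
n = 20000
b_true = np.array([[1.0], [-0.5]])
x_a = rng.normal(size=(n, 2))                                  # true active covariates
x_r = x_a @ b_true + rng.normal(scale=0.5, size=(n, 1))        # true remaining covariate
sigma_aa = 0.5 * np.eye(2)
xs_a = x_a + rng.normal(scale=np.sqrt(0.5), size=(n, 2))       # surrogates
xs_r = x_r + rng.normal(scale=np.sqrt(0.5), size=(n, 1))

b_naive = np.linalg.lstsq(xs_a, xs_r, rcond=None)[0]
resid, b_corrected = corrected_residual(xs_a, xs_r, sigma_aa)
print(b_naive.ravel(), b_corrected.ravel())
```

In this setup the naive coefficients are attenuated toward zero by roughly the factor 1 / (1 + 0.5), while the corrected coefficients are close to the true values, so the residuals are formed with a coefficient matrix that targets the relationship among the true covariates.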
In practice, as suggested by Yan et al. (2017) and Chen et al. (2019), among others, we can specify the size of the active set to be , where stands for the floor function. In this sense, based on the iteration algorithm, we can first select variables with size in Step 1, and then determine the variables with size in Step 2.
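The two-stage selection can be sketched end to end in a simplified setting. The code below deliberately omits censoring and measurement error, so it corresponds to the iterated distance correlation idea of Zhong and Zhu (2015) rather than to the corrected procedure proposed here. The data follow the assumed form of the Section 2.4 example (coefficients 5, 5, 5 and -15·sqrt(rho), rho = 0.5), and the stage sizes `d1` and `d2` are small illustrative values, not the sizes recommended in the text. All function names are hypothetical.

```python
import numpy as np

def dcor(u, v):
    """Sample distance correlation between two one-dimensional samples."""
    def centered(z):
        d = np.abs(z[:, None] - z[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    a, b = centered(np.asarray(u, float)), centered(np.asarray(v, float))
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max((a * b).mean(), 0.0) / denom))

def top_k(x, y, k):
    """Indices of the k columns of x with the largest distance correlation with y."""
    scores = np.array([dcor(x[:, j], y) for j in range(x.shape[1])])
    return [int(j) for j in np.argsort(-scores)[:k]]

def iterated_screening(x, y, d1, d2):
    # Step 1: marginal screening on the original covariates.
    active = top_k(x, y, d1)
    rest = [j for j in range(x.shape[1]) if j not in active]
    # Step 2: remove the part of the remaining covariates explained by the
    # active ones, then screen the residuals against the response.
    xa = x[:, active]
    coef, *_ = np.linalg.lstsq(xa, x[:, rest], rcond=None)
    resid = x[:, rest] - xa @ coef
    added = [rest[j] for j in top_k(resid, y, d2)]
    return active, added

rng = np.random.default_rng(2024)
n, p, rho = 300, 50, 0.5
sigma = np.full((p, p), rho)
sigma[3, :] = sigma[:, 3] = np.sqrt(rho)
np.fill_diagonal(sigma, 1.0)
x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
y = (5 * x[:, 0] + 5 * x[:, 1] + 5 * x[:, 2]
     - 15 * np.sqrt(rho) * x[:, 3] + rng.normal(size=n))

first, second = iterated_screening(x, y, d1=5, d2=2)
print("step 1:", sorted(first), "step 2:", sorted(second))
```

After the first three covariates are projected out, the residual of the fourth covariate is strongly associated with the response, which is why the second stage can recover it. A full implementation of the proposed method would replace the plain least squares fit with the error-corrected estimator and the response with its imputed version.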
4 Simulation Studies
4.1 Simulation Setup
Let denote the sample size. Let with , or denote a -dimensional vector of covariates generated from the normal distribution with mean zero and covariance matrix whose diagonal elements are one and whose off-diagonal elements are the correlations of all with . Similar to the example in Section 2.4, we specify the correlations of all except to be , while has the correlation with all other variables. We consider or .
The failure time is generated by the following model:
[TABLE]
Specifying the distribution of the error term gives some commonly used survival models. In this paper, we consider the extreme value distribution for the proportional hazards (PH) model and the logistic distribution for the proportional odds (PO) model. The censoring time is generated from the uniform distribution where is a constant such that the censoring rate is approximately 50%. As a result, we have and . For , the survival data are .
For the measurement error model (1), let be generated from the normal distribution with mean zero and the diagonal covariance matrix with entries , 0.5, or 0.75. If is unknown, then the following two scenarios are considered as sources of additional information:
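The data-generating mechanism described above can be sketched as follows. Because the model equation is not reproduced here, the code assumes the log-linear form log T = -x'beta + error, an equicorrelated covariance matrix (the special correlation of one covariate is omitted), and a fixed upper bound `c` for the uniform censoring time. The function `simulate` and its arguments are hypothetical.

```python
import numpy as np

def simulate(n, p, rho, beta, model="PH", c=5.0, seed=0):
    """Generate right-censored survival data under an assumed log-linear model."""
    rng = np.random.default_rng(seed)
    cov = np.full((p, p), rho)
    np.fill_diagonal(cov, 1.0)
    x = rng.multivariate_normal(np.zeros(p), cov, size=n)
    if model == "PH":
        eps = -rng.gumbel(size=n)    # extreme value (minimum) error gives PH
    else:
        eps = rng.logistic(size=n)   # logistic error gives PO
    t = np.exp(-x @ beta + eps)      # assumed form: log T = -x'beta + eps
    cens = rng.uniform(0.0, c, size=n)
    y = np.minimum(t, cens)          # observed time
    delta = (t <= cens).astype(int)  # 1 = failure observed, 0 = censored
    return x, y, delta

beta = np.r_[1.0, np.zeros(9)]
x, y, delta = simulate(200, 10, 0.5, beta)
censoring_rate = 1.0 - delta.mean()
```

In practice `c` would be tuned, for example by a grid search over `censoring_rate`, until the target censoring proportion is reached.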
Scenario 1:
Repeated measurement
For and with , and are again generated from and , respectively, and is generated by
[TABLE]
for and . As a result, can be estimated by (2).
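As an illustration of Scenario 1, the sketch below uses the standard moment estimator of the error covariance from replicated surrogates (e.g., Carroll et al. 2006): the within-subject deviations are pooled and divided by n(r - 1). The estimator (2) itself is not reproduced here, so its exact form is an assumption. The function name and the balanced design with `r` replicates per subject are hypothetical.

```python
import numpy as np

def sigma_from_replicates(w):
    """Estimate the error covariance from w of shape (n, r, p)."""
    n, r, p = w.shape
    # Deviations of each replicate from its subject-specific mean.
    dev = (w - w.mean(axis=1, keepdims=True)).reshape(n * r, p)
    return dev.T @ dev / (n * (r - 1))

rng = np.random.default_rng(0)
n, r, p = 2000, 2, 3
x = rng.normal(size=(n, p))
w = x[:, None, :] + rng.normal(scale=np.sqrt(0.5), size=(n, r, p))
sigma_hat = sigma_from_replicates(w)
```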
Scenario 2:
Validation data
For with , and are again generated from and , respectively, and is generated by
[TABLE]
for . Therefore, can be estimated by (3).
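For Scenario 2, a minimal sketch is given below under the assumption that the validation sample records both the true covariates and their surrogates, so that the error covariance can be estimated by the sample covariance of their differences. The estimator (3) is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def sigma_from_validation(x_val, x_star_val):
    """Sample covariance of surrogate minus truth in the validation sample."""
    return np.cov(x_star_val - x_val, rowvar=False)

rng = np.random.default_rng(0)
x_val = rng.normal(size=(3000, 3))
x_star_val = x_val + rng.normal(scale=np.sqrt(0.5), size=(3000, 3))
sigma_hat_val = sigma_from_validation(x_val, x_star_val)
```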
Finally, we repeat the simulation 1000 times for each setting.
4.2 Simulation Results
To evaluate the finite-sample performance of the proposed method, we consider the proportion of the 1000 simulations in which each active covariate is selected, denoted by , and the proportion of the 1000 simulations in which all active covariates are selected, denoted by . In addition, for comparison, we also examine the naive estimator, which is obtained by directly using the observed covariates and iterating through (12). For two different survival models and several settings of , we compare the results obtained from applying the proposed method to the surrogate covariates with those obtained from fitting the data with the true covariate measurements.
The numerical results are reported in Tables 1-3. Because feature screening under the naive and proposed methods uses the same criterion (7), the screening results are identical. Furthermore, the results of feature screening based on the true covariates are similar to those based on the surrogate covariates regardless of the values of and , which is consistent with Theorem 3.1. However, although the feature screening method selects variables and with high probability, is selected with low proportion. This result is consistent with the example in Section 2.4. In contrast, Tables 1-3 show that the iterated feature screening method based on the corrected score function (13) identifies the variable with high proportion, which parallels the case in which the true covariates are used. On the other hand, even when the iterated feature screening method is implemented, cannot be identified if the measurement error effect is not corrected appropriately, as shown by the naive method based on (12).
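The two performance measures can be computed from the selected sets as in the following sketch, where `selected_sets` holds one list of selected indices per simulation run and `truth` holds the indices of the truly active covariates. Both names, and the toy input, are hypothetical.

```python
import numpy as np

def selection_rates(selected_sets, truth):
    """Per-covariate and joint selection proportions across simulation runs."""
    truth = list(truth)
    hits = np.array([[j in s for j in truth] for s in map(set, selected_sets)])
    return hits.mean(axis=0), hits.all(axis=1).mean()

runs = [[0, 1, 2, 7], [0, 1, 9], [0, 1, 2, 3]]   # toy output of three runs
p_each, p_all = selection_rates(runs, truth=[0, 1, 2])
```

Here covariates 0 and 1 are selected in every run, covariate 2 in two of three runs, and all three jointly in two of three runs.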
5 Data Analysis
5.1 Analysis of The Mantle Cell Lymphoma Microarray Data
We first illustrate the proposed methods with an application to the mantle cell lymphoma microarray dataset, available from http://llmpp.nih.gov/MCL/. The dataset contains the survival times of 92 patients and the gene expression measurements of 8810 genes for each patient. We consider only 6312 genes after removing 2498 genes with missing values. During the follow-up, 64 patients died of mantle cell lymphoma and the other 28 were censored, giving a censoring rate of about 30%. The aim of the study was to formulate a molecular predictor of survival after chemotherapy for the disease.
Since this dataset contains no information to characterize the degree of measurement error accompanying the gene expressions, we conduct sensitivity analyses to investigate the effects of measurement error on the analysis results. Specifically, let be the covariance matrix of the gene expressions. For the sensitivity analyses, we consider to be the covariance matrix for the measurement error model (1), where is the diagonal matrix with diagonal elements being a common value , which is specified as , or to represent minor, moderate or severe measurement error. Let , indicating that we aim to select 20 variables in the active set . In the iteration algorithm, we first select 8 gene expressions, and the remaining 12 gene expressions are then selected by either (12) or (13). For comparison, we examine the feature screening (FS) method in Section 3.1 and the iterated feature screening (IFS) method in Section 3.2. The selection results are summarized in Table 4.
From Table 4, we can see that the feature screening and iterated feature screening methods give the same first 8 gene expressions under both the proposed and naive methods. This indicates that the first 8 gene expressions are clearly dependent on the response and easily identified. For the remaining 12 gene expressions, on the other hand, the screening results differ. Specifically, the iterated feature screening method selects some gene expressions, such as 29897, 30620 and 32699, regardless of the degree of measurement error, and these gene expressions do not appear in the result of the feature screening method. This implies that the iterated feature screening method selects some potentially important variables that are not identified by the feature screening method. Furthermore, under the naive method, even when the iterated feature screening method is implemented, the selections among the remaining 12 gene expressions differ from those based on the correction for measurement error. The difference arises from whether the estimator of is solved from (12) or (13).
5.2 Analysis of NKI Breast Cancer Data
In this section, we apply the proposed method to the breast cancer data collected by the Netherlands Cancer Institute (NKI) (van de Vijver et al. 2002). Tumors from 295 women with breast cancer were collected from the fresh-frozen-tissue bank of the Netherlands Cancer Institute. The tumors of those patients were primarily invasive breast carcinomas that were about 5 cm in diameter. Patients were 52 years or younger at diagnosis, and the diagnoses were made from 1984 to 1995. Of these patients, 79 died before the study ended, yielding a censoring rate of approximately 73.2%. For each patient's tumor, approximately 25000 gene expressions were collected. Consistent with the analysis of gene expression data, we treat the log intensities as the covariates.
Since this dataset also contains no information to characterize the degree of measurement error accompanying the gene expressions, we conduct sensitivity analyses, as in Section 5.1, to investigate the effects of measurement error on the analysis results. That is, let be the covariance matrix of the gene expressions, and we consider to be the covariance matrix for the measurement error model (1), where is the diagonal matrix with diagonal elements being a common value , which is specified as , or to represent minor, moderate or severe measurement error. Let , indicating that we aim to select 18 variables in the active set . In the iteration algorithm, we first select 7 gene expressions, and the remaining 11 gene expressions are then selected by either (12) or (13). As in Section 5.1, we investigate the feature screening (FS) method in Section 3.1 and the iterated feature screening (IFS) method in Section 3.2. The selection results are summarized in Table 5.
From Table 5, the results for the NKI data parallel those in Section 5.1, in the sense that the feature screening and iterated feature screening methods give the same first 7 gene expressions under both the proposed and naive methods. This indicates that the first 7 gene expressions are clearly dependent on the response and easily identified. For the remaining 11 gene expressions, on the other hand, the screening results differ. For example, the iterated feature screening method selects some gene expressions, such as NM_020188, Contig25991 and NM_003882, regardless of the degree of measurement error, and these gene expressions do not appear in the result of the feature screening method. This implies that the iterated feature screening method selects some potentially important variables that are not identified by the feature screening method. Furthermore, under the naive method, even when the iterated feature screening method is implemented, the selections among the remaining 11 gene expressions differ from those based on the correction for measurement error. The difference arises from whether the estimator of is solved from (12) or (13).
6 Conclusion
Ultrahigh-dimensional data analysis has been an important topic in recent decades, and it arises frequently in many practical situations and research fields, such as biological and financial data. Many methods have been developed to deal with this problem. When censored data and covariate measurement error are present simultaneously, however, few methods are available. Furthermore, some truly important covariates may go undetected because of correlations among the covariates.
To overcome these challenges, we propose a valid feature screening method for ultrahigh-dimensional data that accommodates censoring and covariate measurement error simultaneously. Unlike other feature screening methods for censored data, the proposed method determines the same active predictors from the surrogate covariates as from the unobserved true covariates. To improve the accuracy of feature screening and identify potentially important variables, we further develop the iterated feature screening with correction for measurement error. The simulation studies and real data analyses indicate that the iterated feature screening method yields satisfactory results and outperforms the feature screening and naive methods.
There are several possible extensions and applications. First, even when the dimension of the variables is reduced below the sample size, it may still be high and some unimportant variables may remain in the dataset. In this case, variable selection techniques such as LASSO or SCAD can be applied to identify the most important variables and shrink the effects of the unimportant ones. Second, although we mainly consider continuous covariates and the classical measurement error model, the proposed method can be naturally extended to other types of variables, such as binary and count variables, and to other measurement error models, including the Berkson error model. Furthermore, mismeasurement of binary covariates, also called misclassification, is a crucial problem. Finally, in addition to right-censoring, more complex structures, such as left-truncation (e.g., Chen 2019), also appear in ultrahigh-dimensional survival data, and it is of interest to extend the proposed method to this setting. These topics are left for future work.
Appendix: Proof of Theorem 3.1
We first consider and . Note that the former formulation is based on the true covariates , while the latter formulation is based on the surrogate covariates .
Since the error term follows the normal distribution , its characteristic function is given by
[TABLE]
By the direct computation, we have
[TABLE]
where the second equality is due to the independence of and , and the last equality is due to (A.1).
In addition, we can also derive
[TABLE]
where the second equality is due to the independence of and , and the last equality again comes from (A.1). As a result, combining (A.2) and (A.3) with gives the same expression of .
The equivalence of and holds by similar derivations. Therefore, we conclude that and are equivalent in the sense that if and only if . Consequently, the same active features can be determined for and .
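The key step of this argument can be written out as a short derivation. The notation below is assumed for illustration, since the displayed equations are not reproduced here: $X^* = X + \epsilon$ denotes the surrogate, $\epsilon \sim N(0, \Sigma_\epsilon)$ is independent of $(X, Y)$, $Y$ stands for the response quantity used in the screening criterion, and $\varphi$ denotes a characteristic function.

```latex
% Assumed notation: X^* = X + \epsilon, \epsilon \sim N(0, \Sigma_\epsilon),
% \epsilon independent of (X, Y); \varphi denotes a characteristic function.
\begin{align*}
\varphi_{\epsilon}(t) &= \exp\left(-\tfrac{1}{2}\, t^\top \Sigma_\epsilon t\right), \\
\varphi_{X^*,Y}(t,s) &= E\left[e^{i t^\top X + i s Y}\right] E\left[e^{i t^\top \epsilon}\right]
  = \varphi_{X,Y}(t,s)\,\varphi_{\epsilon}(t), \\
\varphi_{X^*}(t)\,\varphi_{Y}(s) &= \varphi_{X}(t)\,\varphi_{\epsilon}(t)\,\varphi_{Y}(s), \\
\varphi_{X^*,Y}(t,s) - \varphi_{X^*}(t)\,\varphi_{Y}(s)
  &= \varphi_{\epsilon}(t)\left\{\varphi_{X,Y}(t,s) - \varphi_{X}(t)\,\varphi_{Y}(s)\right\}.
\end{align*}
```

Because $\varphi_{\epsilon}(t) > 0$ for every $t$, the left-hand side of the last line vanishes for all $(t, s)$ if and only if the term in braces does. Since the distance covariance is a weighted $L_2$ norm of this difference (Székely et al. 2007), it is zero for the surrogate if and only if it is zero for the true covariate.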
References
Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, eds. Petrov, B. N. and Csaki, F., 267-281. Akademiai Kiado, Budapest.
Buckley, J. and James, I. (1979) Linear regression with censored data. Biometrika, 66, 429-436.
Candes, E. and Tao, T. (2007) The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics, 35, 2313-2404.
Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006) Measurement Error in Nonlinear Models. CRC Press, New York.
Chen, L.-P. (2019) Pseudo likelihood estimation for the additive hazards model with data subject to left-truncation and right-censoring. Statistics and Its Interface, 12, 135-148.
Chen, X., Chen, X. and Wang, H. (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Computational Statistics and Data Analysis, 119, 118-138.
Chen, X., Zhang, Y., Chen, X. and Liu, Y. (2019) A simple model-free survival conditional feature screening. Statistics and Probability Letters, 146, 156-160.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32, 407-499.
Fan, J. and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.
Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society. Series B, 70, 849-911.
Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10, 1829-1853.
Fan, J. and Song, R. (2010) Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567-3604.
Fan, J., Feng, Y. and Wu, Y. (2010) Ultrahigh dimensional variable selection for Cox’s proportional hazards model. IMS Collections, 6, 70-86.
Hall, P. and Miller, H. (2009) Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18, 533-550.
Li, R., Zhong, W. and Zhu, L. (2012) Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129-1139.
Miller, R. G. (1981) Survival Analysis. Wiley, New York.
Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., and Staudt, L. M. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England Journal of Medicine, 346, 1937-1947.
Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.
Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35, 2769-2794.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58, 267-288.
van de Vijver, M. J., He, Y. D., van’t Veer, L. J., Dai, H., Hart, A. A. M., Voskuil, D. W., Schreiber, G. J., Peterse, J. L., Roberts, C., Marton, M. J., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E. T., Friend, S. H. and Bernards, R. (2002) A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, 347, 1999-2009.
Yan, X., Tang, N. and Zhao, X. (2017) The Spearman rank correlation screening for ultrahigh dimensional censored data. arXiv:1702.02708v1
Zhong, W. and Zhu, L. (2015) An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85, 2331-2345.
Zhu, L., Li, L., Li, R. and Zhu, L. (2011) Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464-1475.
Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B, 67, 301-320.
Zou, H. (2006) The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.
