An improved fully nonparametric estimator of the marginal survival function based on case-control clustered data
David M. Zucker, Malka Gorfine

TL;DR
This paper introduces a new nonparametric estimator for the marginal survival function in case-control family studies, accounting for within-family correlations and right-censored data, with proven asymptotic properties and demonstrated effectiveness through simulations and real data application.
Contribution
It proposes a novel local linear nonparametric estimator for the marginal survival function in family-based case-control studies, addressing within-family correlation and censoring.
Findings
Estimator performs well in simulations
Asymptotic properties are established
Method applied successfully to prostate cancer data
Abstract
A case-control family study is a study where individuals with a disease of interest (case probands) and individuals without the disease (control probands) are randomly sampled from a well-defined population. Possibly right-censored age at onset and disease status are observed for both probands and their relatives. Correlation among the outcomes within a family is induced by factors such as inherited genetic susceptibility, shared environment, and common behavior patterns. For this setting, we present a nonparametric estimator of the marginal survival function, based on local linear estimation of conditional survival functions. Asymptotic theory for the estimator is provided, and simulation results are presented showing that the method performs well. The method is illustrated on data from a prostate cancer study. Keywords: case-control; family study; multivariate survival;…
| t | The proposed Estimator | Gorfine et al. | Naive KM | SEER |
|---|---|---|---|---|
| 50 | 0.9997 (0.0007) | 0.9918 (0.0311) | 0.9991 (0.0003) | 0.9958 |
| 52 | 0.9997 (0.0010) | 0.9801 (0.0340) | 0.9986 (0.0005) | 0.9930 |
| 54 | 0.9993 (0.0012) | 0.9784 (0.0413) | 0.9981 (0.0006) | 0.9902 |
| 56 | 0.9990 (0.0023) | 0.9784 (0.0451) | 0.9963 (0.0008) | 0.9843 |
| 58 | 0.9910 (0.0037) | 0.9784 (0.0451) | 0.9945 (0.0010) | 0.9783 |
| 60 | 0.9934 (0.0054) | 0.9784 (0.0461) | 0.9881 (0.0015) | 0.9703 |
| 62 | 0.9908 (0.0063) | 0.9678 (0.0501) | 0.9848 (0.0017) | 0.9603 |
| 64 | 0.9895 (0.0085) | 0.9423 (0.0577) | 0.9813 (0.0019) | 0.9504 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStatistical Methods and Inference · Bayesian Methods and Mixture Models · Statistical Methods and Bayesian Inference
An improved fully nonparametric estimator of the marginal survival function based on case-control clustered data
**David M. Zucker
**Department of Statistics and Data Science,
The Hebrew University of Jerusalem
Mount Scopus, Jerusalem, Israel
email: [email protected]
**and
Malka Gorfine
**Department of Statistics and Operations Research
Tel Aviv University, Ramat Gan, Israel
email: [email protected]
Abstract
A case-control family study is a study where individuals with a disease of interest (case probands) and individuals without the disease (control probands) are randomly sampled from a well-defined population. Possibly right-censored age at onset and disease status are observed for both probands and their relatives. Correlation among the outcomes within a family is induced by factors such as inherited genetic susceptibility, shared environment, and common behavior patterns. For this setting, we present a nonparametric estimator of the marginal survival function, based on local linear estimation of conditional survival functions. Asymptotic theory for the estimator is provided, and simulation results are presented showing that the method performs well. The method is illustrated on data from a prostate cancer study.
Keywords: case-control; family study; multivariate survival; nonparametric estimator; local linear
1 Introduction
Many epidemiological and medical studies focus on disease events that are rare, so that a random sample from the population provides very few observations where failure has occurred during the monitoring time. In such situations, it is a common practice to consider a case-control strategy. Separate samples of individuals in whom the event has already occurred (case probands) and individuals in whom the event has not yet occurred (control probands) are obtained. Age at onset or age at censoring and disease status of each proband and of one or more of his/her relatives are recorded. For example, in a prostate cancer study, case probands are men diagnosed with prostate cancer, control probands are men without prostate cancer, and each proband is interviewed to obtain detailed disease history information of his relatives. The goals of such case-control family studies are to evaluate the effect of genetic and environmental factors on disease risk, and to estimate the distribution of the age at onset. The analysis of such data is complicated by the case-control selection scheme used to ascertain the families and by the within-family dependence. It is of interest to estimate the marginal survival function under dependent failure times of family members by a fully nonparametric estimator which avoids specific assumptions about the form of the distribution or the dependence structure among failure times with a family.
The estimator that first comes to mind is the Kaplan-Meier survival curve estimator based on the survival data on the relatives. This estimator, however, is biased because it does not take the case-control sampling and the within-family dependence into account. Gorfine et al. (2017) demonstrated the serious bias that can arise with this naive Kaplan-Meier estimator.
Accordingly, Gorfine et al. proposed a nonparametric estimator for this problem, based on a kernel smoothing approach. They presented a simulation study showing that their estimator performs well in terms of bias. The estimator of Gorfine et al. is based on the median of random variables with a complicated dependence structure. Consequently, the asymptotic properties of the estimator were not derived. Also, their use of the median causes some efficiency loss. In addition, their bandwidth selection procedure was not specifically targeted to the estimand of interest.
In the present paper we develop a new nonparametric estimator which performs well in terms of bias and much better in terms of variance than the estimator of Gorfine et al. In some scenarios the new estimator also outperforms Gorfine et al. in terms of bias. We provide asymptotic theory for the estimator. We also present a bandwidth selection procedure that is specifically targeted to the estimand of interest.
2 Preliminaries
The setup is as in Gorfine et al. (2017). We denote the maximum observation time among the probands by and the maximum observation time among the relatives by . We write . We let denote the number of relatives in family and we let denote the maximum number of relatives for a given proband. We view as a random variable. We will write some of the formulas as if each proband has exactly relatives, with the extra relatives taken to be censored at time 0. For family , , let denote the failure time of the proband and the failure times of the relatives. Let and denote the corresponding censoring times, and the corresponding observed times, and and the corresponding event indicators. We assume that the censoring is independent of the survival times. The sample includes case probands, with each case proband frequency matched with control probands, so that the total number of control probands is . Thus, the data consist of independent and identically distributed matched sets comprising one case family and control families, and the observed data on family consists of . We seek to estimate the marginal survival function
[TABLE]
which we are assuming is the same for probands and relatives, a common assumption in case-control family studies (Shih & Chatterjee, 2002; Chatterjee et al., 2006). We further assume that the bivariate survival function for the proband and a given relative is the same for all relatives. The marginal survival distribution is assumed to be continuous with density . In the case where and have different marginal distributions, our procedure yields a consistent estimate of the marginal distribution of .
Denote and . As in Gorfine et al., we have and .
We now develop an expression for in terms of and . Let and denote the hazard and cumulative hazard functions corresponding to , and define , . Also define and , . We can write . We then have the following:
[TABLE]
Now define
[TABLE]
Note that the bracketed integral is nonzero provided that for every there exists a set of values of positive measure for which . Integrating both sides of (1) and rearranging gives
[TABLE]
We use (4) to construct our estimator.
3 Estimation Procedure
In Gorfine et al., and were estimated using a generalized version of the kernel-smoothed Kaplan-Meier estimator proposed in an unpublished 1981 University of California at Berkeley technical report by R. Beran and examined in Dabrowska (1987), and the resulting estimators were used to construct an estimator of . Here, in light of (4), we work not only with and but also the derivative . Accordingly, we take a local linear estimation approach. Choose a symmetric kernel function and a bandwidth . Let and , and write . Let . Define (with )
[TABLE]
We can write
[TABLE]
with E[dM_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ }(v)/Y_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ }(v)]=0. A first-order Taylor approximation gives the local linear representation
[TABLE]
We now carry out weighted linear least squares with response variable dN_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ }(v)/Y_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ }(v), explanatory variable , and weights \chi(Q_{i},q)Y_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ }(v)K((X_{Pi}-s)/h). This leads to the local linear estimators
[TABLE]
that is, for a given ,
[TABLE]
with the second equation leading to . We can now substitute , , and into (3) and (4) to obtain estimators and for and . We then take . When we want to emphasize the dependence on the bandwidth , we will write .
4 Asymptotic Theory
We can write with
[TABLE]
We will provide a detailed asymptotic analysis of . Similar arguments can be used to show that converges in probability to zero at faster rate than , and is negligible in comparison with the other two terms.
In this section and in the appendix, we will write and assume that the indices have been arranged so that the control probands appear first in the list of probands, meaning that a sum over probands 1 to is a sum over the control probands.
Defining
[TABLE]
we have
[TABLE]
Now, the process is a martingale with respect to the filtration , . Accordingly, for any process that is predictable with respect to , the process
[TABLE]
is a mean-zero martingale. It follows, even though the process M_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ } does not have any martingale properties, that for any function we have
[TABLE]
We now write , where
[TABLE]
In the appendix it is shown, under the conditions listed in the appended supplementary document that in probability and that
[TABLE]
with
[TABLE]
where is the limiting value of . Define . We then obtain the following theorem.
Theorem 1**.**
For each , converges to a limit and converges in distribution to .
The proof of this theorem appears in the appendix. Details are given in the appended supplementary document.
5 Practical Implementation Details
In preliminary work, we found that the performance of the estimator of can be improved dramatically by introducing a time transformation that makes the proband observation times approximately uniformly distributed. Along the lines of Doksum et al. (2017), we propose transforming according to an estimate of the distribution function of the proband observation times, which leads to a modified form of nearest neighbor regression. In the appended supplementary document, we show that the consistency and asymptotic normality is maintained under the time transformation if a smooth estimate of the distribution function is used. We believe that this result holds as well when the empirical distribution function is used. In our numerical work, we used the empirical distribution function.
In the context of family survival data, it is usually reasonable to assume that for all and , i.e., for all and . This condition implies that
[TABLE]
If we let and denote the Kaplan-Meier survival curve estimator based on the case relatives’ survival data and the control relatives’ survival data, respectively, the foregoing inequalities motivate modifying the estimator to the estimator resulting from replacing with if and by if . We implemented this modification in our numerical work. The modification comes into play mainly when the event rate is extremely low or extremely high. Given the consistency of , if the inequalities in (8) are strict, then for large sample sizes the modification no longer comes into play. The inequalities in (8) are strict if the following mild condition holds: for each there exists a set such that and
[TABLE]
For bandwidth selection, we propose a bootstrap procedure. Let denote the Kaplan-Meier estimate of the survival function of the time to censoring among the relatives (which is the same for case relatives and control relatives). In each bootstrap replication , for each family we generate event times for proband ’s relatives according to the survival function and censoring times for relatives according to the survival function . We then run our estimation procedure for a given on the resulting data, obtaining the estimate . Denote
[TABLE]
We evaluate over a grid of values in the range and choose the values with the minimum .
To construct confidence intervals in finite samples with bandwidth selection, we use the percentile bootstrap method.
6 Simulation Study
We carried out a simulation study to evaluate the finite sample properties of the proposed estimator. Data were generated under frailty models in which the within-family dependence is expressed in terms of a shared frailty variate , conditional on which the failure times of the family members are independent with hazard function . We manipulated five design factors, as follows: (1) frailty distribution: gamma or positive stable, (2) cumulative end-of-study event rate: high (60%) or low (15%), (3) number of case probands: 500 or 1000 (with 1:1 matching of control probands to case probands), (4) number of relatives per family: 1 or 4, and (5) strength of within-family dependence: low (Kendall tau of 1/3 between the failure of times of two members of the same family) and moderate (Kendall tau of 1/2). We took with , , and chosen so as to obtain the desired cumulative end-of-study event rate. The end of study age was taken to be 110 years. The overall censoring rate, including both interim and end-of-study censoring, was about 60% in the high event rate case and 90% in the low event rate case. The number of case probands was taken to be 500, with 1:1 matching of control probands to case probands. The data generation was carried out in the same manner as in Gorfine et al. We carried out 1024 simulation replications for each of the 32 combinations of the design. For each replication, we carried out 30 inner replications for the bootstrap bandwidth selection procedure and 100 outer replications for the percentile bootstrap confidence interval procedure. The initial bandwidth was 0.5 and the bandwidth search was done in two stages. In the first stage, we searched over in steps of 0.1 and identified the value with the lowest . In the second stage, we searched over , and and chose the value with the lowest to be the final value. The kernel used was the triweight kernel .
The results for 500 case probands are summarized in Fig. 1 and 2. The left two columns of each figure show the true survival curve, along with Gorfine et al.’s estimator and the new estimator. The finite-sample bias of the new estimator tends to be smaller, and in some settings, such as the gamma frailty model with very low event rates, its finite-sample bias is dramatically smaller. The right two columns of each figure summarize the point-wise coverage rates of the percentile-bootstrap confidence interval of the proposed estimator, along with the standard errors of the Gorfine et al. estimator and the proposed estimators. Clearly, the proposed estimator substantially outperforms the old estimator in terms of efficiency. In general, the coverage rates are reasonably close to , except at very early ages with small number of observed events. Similar results were obtained with 1000 case probands.
7 Example
In this section we illustrate our method by re-analysing the data presented in Gorfine et al. (2017), population-based case-control family study of early onset prostate cancer (Stanford et al., 1999). Briefly, case participants were identified from the Seattle-Puget Sound Surveillance, Epidemiology, and End Results (SEER) cancer registry. Cases were those with age at diagnosis between 40 and 64 years. Controls were identified by use of random-digit dialing and they were frequency matched to case participants by age. The information collected on the relatives is the age at diagnosis for prostate cancer if the relative had prostate cancer or age at the last observation if the relative did not have prostate cancer. Here we use the information about age at onset or age at censoring and disease status that was observed for the probands and their fathers, brothers, and uncles. The following analysis is based on 730 prostate-cancer case probands, 693 control probands, and a total of 7316 relatives. Out of the 3793 case-probands’ relatives, 211 had prostate cancer, and out of the 3523 control-probands’ relatives, 102 had prostate cancer. The age range of the relatives with prostate cancer was 40–93. Given that frequency matching was used rather than exact age matching, and that the number of relatives per proband varied across the probands, we carried out the time transformation based on the empirical distribution of the proband observation times across all 7316 relatives in the data set. For bandwidth selection we used the same two-stage procedure as in the simulations.
Figure 3 and Table 1 present the estimates of prostate-cancer marginal survival function using the naive Kaplan-Meier estimator based on the relatives’ data, Gorfine et al.’s estimator with nearest-neighbor smoothing and the median operator, the SEER survival curve based on the SEER Cancer Statistics Review 1975–2012, and the proposed estimator. In this dataset, Gorfine et al.’s estimator is closer to the SEER survival curve, but with very large point-wise standard errors compared to the proposed estimator. The standard errors reported in Table 1 are much larger than those reported in Gorfine et al. due to an error in the bootstrap code applied back then.
Acknowledgements
The authors would like to thank Dr. Stanford for her generosity in sharing the prostate cancer dataset that was used for illustrating the method. Malka Gorfine gratefully acknowledges support from the U.S.-Israel Binational Science Foundation in carrying out this work.
Supplementary material
We append a document with details of the proof of the asymptotic properties of the estimator. R code used to carry out the simulations and R code for applying the method to a data set may be found at https://github.com/david-zucker/marginal-survival/.
Appendix: Asymptotic Theory
A. Preliminaries
We first present some definitions. Let denote the density of . Define y(s,v)=E[Y_{Ri\,\begin{picture}(-1.0,1.0)(-1.0,-2.0)\circle*{2.0}\end{picture}\ }(v)|X_{Pi}=s,\delta_{Pi}=0] and . In addition, define
[TABLE]
Note that, by symmetry of , for all odd .
We now present two lemmas, whose proofs appear in the online Supplemental Materials. The notations and , and similarly and , should be understood as being uniform in the relevant arguments.
Lemma 1**.**
For even we have
[TABLE]
and for odd we have
[TABLE]
Lemma 2**.**
For we have
[TABLE]
and for we have
[TABLE]
B. Analysis of
We can write as
[TABLE]
with
[TABLE]
By Taylor expansion, we can write
[TABLE]
in which , with , where and denote, respectively, the second and third partial derivatives of with respect to We then have
[TABLE]
The term in square brackets does not depend on . Since
[TABLE]
we get
[TABLE]
We can then write , where
[TABLE]
We have
[TABLE]
We now consider separately the case of and . For , the results of Lemmas 1 and 2 imply that and , so that and
[TABLE]
For , the results of Lemmas A1 and A2 imply that and , so that and
[TABLE]
(recalling that the length of is 2). We thus obtain , so that , since .
C. Analysis of
Let and define
[TABLE]
Define further
[TABLE]
We can then write , where
[TABLE]
Our claim is that is asymptotically mean-zero normal, and that and are both .
By (7), . Thus, is the sum of i.i.d. mean-zero terms. Accordingly, to show that is asymptotically mean-zero normal we need to show that converges to a limit and that satisfies the Lindeberg condition
[TABLE]
where
[TABLE]
The appended supplementary document provides a proof of the above two assertions, along with a proof that is . The proof of the corresponding result for is similar.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Chatterjee et al. (2006) Chatterjee, N. , Kalaylioglu, Z. , Shih, J. H. & Gail, M. (2006). Case–control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics 62 , 36–48.
- 2Dabrowska (1987) Dabrowska, D. M. (1987). Non-parametric regression with censored survival time data. Scandinavian Journal of Statistics 14 , 181–197.
- 3Doksum et al. (2017) Doksum, K. A. , Jiang, J. , Sun, B. & Wang, S. (2017). Nearest neighbor estimates of regression. Computational Statistics and Data Analysis 110 , 64–74.
- 4Gorfine et al. (2017) Gorfine, M. , Bordo, N. & Hsu, L. (2017). A fully nonparametric estimator of the marginal survival function based on case–control clustered age-at-onset data. Biostatistics 18 , 76–90.
- 5Shih & Chatterjee (2002) Shih, J. H. & Chatterjee, N. (2002). Analysis of survival data from case–control family studies. Biometrics 58 , 502–509.
- 6Stanford et al. (1999) Stanford, J. L. , Wicklund, K. G. , Mc Knight, B. , Daling, J. R. & Brawer, M. K. (1999). Vasectomy and risk of prostate cancer. Cancer Epidemiology Biomarkers and Prevention 8 , 881–886.
