Similarity-based Random Survival Forest
Yingying Xu, Joon Lee, Joel A. Dubin

TL;DR
This paper introduces a modified random survival forest method that incorporates similarity measures to improve the accuracy of predicting time-to-event outcomes in heterogeneous medical datasets, demonstrated on ICU data.
Contribution
The paper proposes a novel similarity-based modification to the random survival forest algorithm, enhancing prediction accuracy for survival analysis in complex, heterogeneous datasets.
Findings
Improved predictive accuracy over standard random survival forests.
Effective in ICU datasets like MIMIC-III.
Validated through comprehensive simulation studies.
A Similarity-based Approach to Random Survival Forests
1st Yingying Xu
Department of Statistics and Actuarial Science
University of Waterloo
Waterloo, Canada
2nd Joon Lee
Cumming School of Medicine
University of Calgary
Calgary, Canada
3rd Joel A. Dubin
Department of Statistics and Actuarial Science
University of Waterloo
Waterloo, Canada
Abstract
Predicting time-to-event outcomes in large databases can be a challenging but important task. One example of this is in predicting the time to a clinical outcome for patients in intensive care units (ICUs), which helps to support critical medical treatment decisions. In this context, the time to an event of interest could be, for example, survival time or time to recovery from a disease/ailment observed within the ICU. The massive health datasets generated from the uptake of Electronic Health Records (EHRs) are quite heterogeneous as patients can be quite dissimilar in their relationship between the feature vector and the outcome, adding more noise than information to prediction. In this paper, we propose a modified random forest method for survival data that identifies similar cases in an attempt to improve accuracy for predicting time-to-event outcomes; this methodology can be applied in various settings, including with ICU databases. We also introduce an adaptation of our methodology in the case of dependent censoring. Our proposed method is demonstrated in the Medical Information Mart for Intensive Care (MIMIC-III) database, and, in addition, we present properties of our methodology through a comprehensive simulation study. Introducing similarity to the random survival forest method indeed provides improved predictive accuracy compared to random survival forest alone across the various analyses we undertook.
Index Terms:
dependent censoring, intensive care unit data, MIMIC database, predictive accuracy, time-to-event response data
I Introduction
Electronic Health Records (EHRs) have generated health data sets that provide rich and diverse information for modeling and prediction. Survival analysis has been essential in clinical and epidemiological studies, and both parametric and semiparametric modeling have been utilized in the literature (e.g., [1]). Especially with big datasets, patients can be heterogeneous, which poses challenges to accurate prediction of outcomes of interest. Conditioning on a more relevant subset where the cases are more similar to the point of prediction might improve prediction accuracy. Similarity-based prediction has been studied for other types of responses, such as binary outcomes (e.g., [2]). The concept of similarity within the random forest context is seen in [3] for regression and classification. In [4], the author applied the case-specific random forests method of [3] to a dataset for a binary response from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-II) database ([5, 6]).
In survival analysis, one notion of similarity is seen in cure models. These models assume that while some cases will die from a disease or experimental stress, a sub-population will survive for a long time without experiencing the event. Although the term similarity is not specifically mentioned in this literature, the sub-population of long-term survivors can be considered as a group of similar cases. Early studies on such models include [7], [8], and [9]. In [10], the authors suggested a straightforward computational method to deal with grouped survival data based on the Cox proportional-hazards model. In both [11] and [12], the respective authors used a mixture model representation for the two populations, which models the probability of being a long-term survivor with a logistic regression and the time to event for those that would experience the event with survival models, respectively. Many variations of mixture cure models can be seen in literature. In [13], the authors provided an alternative to two-component mixture models in estimating cure rate by using bounded cumulative hazard function. These models focus on modeling rather than prediction.
We take a rather different approach to model and predict survival data when there are one or more sub-populations in the dataset, that is, when the relationship between the time-to-event outcome and the explanatory variables is homogeneous within groups and more heterogeneous between groups. This is a more general case than the cure model as there can be more than two groups in the population, and the number of groups is unknown, in general. Note that the similarity is not just based on the grouping of the survival time, or the closeness of the explanatory variables, but depends on the relationship between the two. Tree-based methods such as random forests [14] are a natural way of incorporating both outcome and covariate information, and can be utilized to characterize similarity, as cases in the same terminal node can be considered similar to each other. Random forest methods have been extended to survival data as well, as in [15], and our approach essentially combines the case-specific random forests model in [3] with the random survival forests model [15]. An approach for handling dependent right censoring will be proposed as well.
In Section 2, we will discuss our proposed similarity-based random survival forest algorithm with independent right censoring, and methods to adjust for dependent censoring. Time-varying area under the receiver operating characteristic curve (time-varying AUC; note AUC is sometimes written as AUROC in the literature) is used as our primary criterion for evaluating prediction performance. In Sections 3 and 4, respectively, we present applications of the algorithm in a simulation study, as well as to a real dataset from the MIMIC-III database [16], an update to MIMIC-II ([5, 6]). In Section 5, we will summarize our methodology and findings from the simulation study and real data analysis.
II Similarity-based Random Survival Forest
In this section, we will introduce the algorithm for our proposed similarity-based random survival forest (SB-RSF). The idea is to build a different random survival forest for prediction for each test case, giving greater weight to the training cases that are in closer proximity to the test case, and using less information from those that merely add more noise to prediction. We will discuss the methods under the assumption of independent censoring in Section 2.1 and then under the more flexible assumption of dependent censoring in Section 2.2. In Section 2.3, we will talk about using time-varying AUC for model comparison.
II-A With Independent Censoring
We will assume independent censoring for now. Methods to incorporate dependent censoring will be discussed in Section 2.2.
1. Construct a regular random survival forest model for a training dataset that has sample size $n$.
   (a) Draw $B$ bootstrap samples from the training data. Uniform sampling is used.
   (b) Grow a survival tree for each bootstrap sample under the constraint that each terminal node should have at least $d_0 > 0$ unique deaths.
2. For each point in the test dataset of size $m$, obtain a weight vector based on the random survival forest in the first step.
   (a) Pass a test data point down each tree in the random survival forest, and keep track of how many terminal nodes group a training data point with the test point.
   (b) Assign a weight vector of length $n$ to each test data point based on how many terminal nodes group a training data point with that test data point.
   (c) Iterate through each test data point, and obtain a weight matrix of size $m \times n$. Normalize each row of the weight matrix so that each row sums to 1.
3. Build a different similarity-based random survival forest for each test data point.
   (a) For a given test data point, build a random survival forest model with the weight vector as the sampling probability vector in the bootstrap.
   (b) Pass the test data point down each tree, and estimate the cumulative hazard function (CHF) of the terminal node to which the test data point belongs.
   (c) Average over all trees to get an ensemble CHF for that test data point.
   (d) Repeat (3.a)-(3.c) for each data point in the test dataset.
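Steps 2 and 3(a) above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the initial forest has already been fitted and that the terminal-node id reached by every case in every tree is available as a plain nested list. The function names `similarity_weights` and `weighted_bootstrap`, the toy leaf ids, and the uniform fallback for a test case that shares no terminal node with any training case are all assumptions made for illustration.

```python
import random


def similarity_weights(train_leaves, test_leaves):
    """Row-normalized co-occurrence weights (m x n).

    train_leaves[i][b] and test_leaves[j][b] hold the terminal-node id that
    training case i and test case j reach in tree b of the initial forest.
    """
    weights = []
    for test_row in test_leaves:
        # Count the trees in which each training case shares a leaf with this test case.
        counts = [
            sum(1 for a, b in zip(train_row, test_row) if a == b)
            for train_row in train_leaves
        ]
        total = sum(counts)
        if total == 0:
            # Assumption: fall back to uniform sampling when nothing is shared.
            weights.append([1.0 / len(counts)] * len(counts))
        else:
            weights.append([c / total for c in counts])
    return weights


def weighted_bootstrap(weight_row, rng):
    """Draw n training indices with replacement, using the weights as probabilities."""
    n = len(weight_row)
    return rng.choices(range(n), weights=weight_row, k=n)


# Toy input: 4 training cases, 2 test cases, 3 trees (leaf ids are made up).
train_leaves = [[1, 4, 2], [1, 5, 2], [3, 5, 6], [3, 4, 6]]
test_leaves = [[1, 4, 2], [3, 5, 7]]

W = similarity_weights(train_leaves, test_leaves)
sample = weighted_bootstrap(W[0], random.Random(0))
```

Each row of `W` is the sampling probability vector for one test case; a training case that never shares a terminal node with the test case gets probability 0 and is never drawn into that test case's bootstrap samples.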
II-B Adjusting for Dependent Censoring
Dependent censoring for right-censored data is common in follow-up studies. For right censoring, the event is only known to have occurred after a certain time point. Denoting the event time by $T$ and the censoring time by $C$, the observed time is the minimum of the event time and the censoring time, i.e., $Y = \min(T, C)$. Denote by $\delta$ the event indicator, which indicates whether the observed time corresponds to the true event time; then $\delta = I(T \le C)$, which is 1 if the event occurs before censoring, and 0 otherwise. For non-informative censoring, the censoring process does not directly depend on the event process, although it can depend on some covariates. With informative censoring, the censoring process directly relates to the expected time to event. Inverse probability-of-censoring weights (IPCW) have been shown to account for the bias that occurs when ignoring informative censoring ([17, 18]). In this setting, the algorithm is modified as follows:
- Use the standard Kaplan-Meier estimator with censoring time as the event time to estimate $\hat{G}(t)$, the probability of remaining uncensored beyond time $t$.
- Calculate the IPC weights for each training case $i$ as $w_i^{\mathrm{IPC}} = 1/\hat{G}(Y_i)$, i.e., the weights are equal to the inverse probability of not getting censored.
- Calculate the similarity weights for a training case $i$ and test case $j$ as $w_{ij}^{\mathrm{S}}$, as described in Section 2.1.
- The sampling weights under dependent censoring for use in the similarity-based random survival forest for $i$ and $j$ will be proportional to $w_{ij}^{\mathrm{S}} \cdot w_i^{\mathrm{IPC}}$.
The intuition behind the multiplication of the weights is that the SB-RSF algorithm now gives greater sampling weights to those data points that are more likely to be censored.
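The weight adjustment can be illustrated with a small, self-contained sketch. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the Kaplan-Meier estimate is unadjusted for covariates, and the censoring survival function is evaluated just before each observed time (the left limit), a common convention chosen here so that the inverse is always finite.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate of P(C > t), treating censoring as the event.

    times: observed times; events: 1 if the event was observed, 0 if censored.
    Returns (time, probability) pairs at the distinct censoring times.
    """
    steps = []
    g = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for y in times if y >= t)
        censored = sum(1 for y, d in zip(times, events) if y == t and d == 0)
        if censored > 0:
            g *= 1.0 - censored / at_risk
            steps.append((t, g))
    return steps


def ipc_weights(times, events):
    """Inverse probability of remaining uncensored just before each observed time."""
    steps = censoring_survival(times, events)
    weights = []
    for y in times:
        g = 1.0
        for t, g_t in steps:
            if t < y:  # left limit: only censoring strictly before y counts
                g = g_t
            else:
                break
        weights.append(1.0 / g)
    return weights


def combined_sampling_weights(similarity_row, ipc):
    """Multiply similarity and IPC weights, then renormalize to sum to 1."""
    raw = [s * w for s, w in zip(similarity_row, ipc)]
    total = sum(raw)
    if total == 0:
        raise ValueError("all combined weights are zero")
    return [r / total for r in raw]


# Toy data: four training cases, two of them censored.
times = [2, 3, 5, 7]
events = [1, 0, 1, 0]
ipc = ipc_weights(times, events)
probs = combined_sampling_weights([0.5, 0.25, 0.25, 0.0], ipc)
```

In the toy data the case censored at time 3 lowers the estimated probability of remaining uncensored to 2/3, so the two cases observed after it receive an IPC weight of 1.5 while the earlier cases keep weight 1.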
II-C Prediction Accuracy
We will be using time-varying area under the receiver operating characteristic curve (time-varying AUC; sometimes written as time-varying AUROC) for model comparison. For binary outcomes, the prediction accuracy can be characterized by the ROC curve, which plots the sensitivity against (1-specificity) for the range of possible decision-cutoff thresholds. The area under the ROC curve (AUC or AUROC) then represents a measure of prediction accuracy.
For time-to-event outcomes, there are a few proposals to generalize the concept of sensitivity and specificity (e.g., [19]). One way is to look at sensitivity and specificity at each time of interest $t$. The survival probability up to $t$ of a test case $j$, i.e., $S_j(t)$, can be derived from its cumulative hazard $H_j(t)$ as $S_j(t) = \exp\{-H_j(t)\}$. Then, $\mathrm{AUC}(t)$ can be estimated at each $t$; this is the time-varying AUC. In this paper, we will evaluate time-varying AUC over a dense grid of time points.
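As a rough illustration of how an AUC at a single time point can be computed from predicted cumulative hazards, the sketch below uses a simplified cumulative/dynamic definition. This is an assumption made for illustration and not necessarily the estimator of [19]: cases are subjects with an observed event by the evaluation time, controls are subjects still under observation after it, and subjects censored earlier are simply dropped rather than reweighted. The function names and toy numbers are hypothetical.

```python
import math


def survival_from_chf(chf_value):
    """Convert a cumulative hazard value into a survival probability."""
    return math.exp(-chf_value)


def time_varying_auc(times, events, risk_at_t, t):
    """Simplified cumulative/dynamic AUC at time t.

    Cases: event observed by t. Controls: observed time beyond t.
    Subjects censored before t have unknown status and are excluded
    (a simplification; no censoring weights are applied here).
    """
    cases = [r for y, d, r in zip(times, events, risk_at_t) if y <= t and d == 1]
    controls = [r for y, r in zip(times, risk_at_t) if y > t]
    if not cases or not controls:
        return float("nan")
    concordant = 0.0
    for risk_case in cases:
        for risk_control in controls:
            if risk_case > risk_control:
                concordant += 1.0
            elif risk_case == risk_control:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(controls))


# Toy test set with made-up ensemble cumulative hazards evaluated at t = 5.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
chf_at_5 = [1.2, 0.3, 0.5, 0.4, 0.1]
risk_at_5 = [1.0 - survival_from_chf(h) for h in chf_at_5]
auc_5 = time_varying_auc(times, events, risk_at_5, t=5)
```

Repeating the last two lines over a grid of time points, with the cumulative hazards re-evaluated at each one, traces out a time-varying AUC curve of the kind compared in the figures.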
III Simulations
We use two simulated examples to further explain what similarity means in the model and demonstrate the prediction performance of the algorithm.
III-A Example 1
In a simple example, each case has a 3-dimensional covariate $X = (X_1, X_2, X_3)$ that links directly to the survival outcome. Two of the covariates are linked to similarity as well. In this case, $T$ is a survival outcome that follows a Weibull distribution with shape = 2, and log(scale) mapped to a linear predictor $\eta$:
[TABLE]
Here, two of the covariates describe a binary tree structure that clusters cases into two subspaces. Within each subspace, the relationship between the survival outcome and the covariates is the same, but it differs between the subspaces. 1000 cases are generated, where the covariates are independently and uniformly generated from (-15, 15). Uniform right censoring (independent for now) is considered. Fig. 1 summarizes the comparison between the prediction performance of the case-specific (i.e., similarity-based) random survival forest and the regular random survival forest. The red dots represent the time-varying AUC for the case-specific random survival forest and the black dots are for the regular random survival forest. The AUCs are evaluated at each day from day 1 to day 20. At each day, the time-varying AUC of the case-specific method exceeds that of the regular random survival forest, more often than not by a sizable margin.
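A data-generating process of this general shape can be sketched as follows. The actual splitting rule and coefficients of the example are not recoverable from the text, so everything specific here is hypothetical: the rule that defines the two subspaces, the intercept and slopes of the linear predictor, and the upper bound of the censoring distribution are placeholders chosen only so that the code runs.

```python
import math
import random


def simulate_case(rng, censor_max=20.0):
    """Generate one (covariates, observed time, event indicator) triple."""
    x1, x2, x3 = (rng.uniform(-15, 15) for _ in range(3))
    # Hypothetical tree: two covariates split the cases into two subspaces.
    in_first_subspace = x1 > 0 and x2 > 0
    # Hypothetical coefficients: the covariate effect differs between subspaces.
    if in_first_subspace:
        eta = 2.0 + 0.10 * x1 + 0.05 * x3
    else:
        eta = 2.0 - 0.10 * x1 + 0.05 * x3
    # Weibull event time with shape 2 and log(scale) equal to the linear predictor.
    event_time = rng.weibullvariate(math.exp(eta), 2.0)
    # Independent uniform right censoring.
    censor_time = rng.uniform(0.0, censor_max)
    observed = min(event_time, censor_time)
    event = 1 if event_time <= censor_time else 0
    return (x1, x2, x3), observed, event


rng = random.Random(1)
data = [simulate_case(rng) for _ in range(1000)]
censoring_rate = 1 - sum(d for _, _, d in data) / len(data)
```

Because the sign of the first coefficient flips between the two subspaces, a single global model sees a weak average effect, which is the kind of heterogeneity the similarity-based weighting is meant to exploit.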
III-B Example 2
In the second model, each case has a 5-dimensional covariate $X$, where three of the covariates explain similarity. Again, we will use a binary tree structure to define subspaces. In this case, we will prune the tree until there are four terminal nodes, i.e., four subspaces. Again, within each subspace, the relationship between $T$ and the covariates is the same, but it differs between subspaces.
The result in Fig. 2 is similar to the first simulation result in Fig. 1. Giving more weight in the sampling to similar cases, based on our SB-RSF method, yields better predictive accuracy in the random survival forest framework.
IV Application to an ICU dataset
IV-A MIMIC-III
MIMIC-III (Medical Information Mart for Intensive Care III) is a freely accessible critical care database covering 53,423 distinct hospital admissions for adult patients (aged 16 and above). The data include vital signs, medications, diagnostic codes, survival data, and high-resolution data including lab results and bedside monitoring data [16].
This large dataset provides rich information for modeling and prediction, but the diversity of the patients also poses challenges to accurate prediction of the outcome of interest. To illustrate, the goal is to predict ICU patient survival with their age, gender, ICU type, admission type, and severity of disease classification score, SAPS II [20], as predictors. ICU type includes CCU (Coronary Care Unit), CSRU (Cardiac Surgery Recovery Unit), MICU (Medical Intensive Care Unit), SICU (Surgical Intensive Care Unit) and TSICU (Trauma Surgical Intensive Care Unit). Admission type includes Elective, Emergency, and Urgent. Only the first hospital admission of adult patients (older than 15 years of age) is included in our study. Excluding cases with missing data in one or more of the variables or outcome, the sample size is 38,604. In this dataset, 80% of the cases are right-censored at 90 days after hospital discharge, for the purpose of de-identification.
IV-B Result
Fig. 3(a) compares the time-varying AUC for the algorithm in Section 2.1 with that of the random survival forests method. The time-varying AUC from our proposed SB-RSF method exceeds that of the regular random survival forest at the beginning of the prediction window and after day 20, and the gap between the two curves increases as we predict further into the future.
Fig. 3(b) shows the result when considering possible dependency in the censoring. The result is similar to that in Fig. 3(a). It is possible that for this dataset there is not much dependency in the censoring, and thus the calculation of the IPC weights did not have a big impact on the result.
V Discussion
In this paper, we proposed to improve random survival forests by incorporating the similarity structure between a test data point and the training data points. Instead of building a single global random survival forest for all test cases, we construct a similarity-based random survival forest for each one of them, by giving more weight to the training cases that are in closer proximity to the test case. Proximity is measured using a regular random survival forest model. We also developed an algorithm to account for dependent censoring, which is common in survival data.
Both simulations and a real data example show promising results that, in general, indicate that the similarity-based prediction improves predictive performance of random survival forests in terms of time-varying AUC. This result is also consistent with other findings using similarity structure for binary response data (e.g., [2]).
Our proposed SB-RSF method requires building a random survival forest for every test data point and specification of a few tuning parameters. Specifically, the tuning parameters are the depth of the tree (represented by the number of unique deaths in the terminal nodes), the number of candidate predictors to consider for splitting at each node, and the number of trees in the forest. This leads to a computationally intensive algorithm, especially when the test dataset is large. Future work to investigate ways of alleviating some of this computational burden would be helpful. One way of reducing computation time is to use a hard threshold for sampling, that is, giving 0 weight to cases that are too far away from the test case. The tuning parameters for the simulations are selected based on the entire training dataset. However, if they are determined from a smaller subset of the training data, the computational time might be greatly reduced.
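The hard-threshold idea mentioned above could look roughly like the following. This is a sketch of one possible reading, with a hypothetical function name and a hypothetical cutoff value; the paper does not specify how the threshold would be chosen.

```python
def threshold_weights(weight_row, min_weight):
    """Zero out sampling weights below min_weight and renormalize the rest."""
    kept = [w if w >= min_weight else 0.0 for w in weight_row]
    total = sum(kept)
    if total == 0:
        raise ValueError("threshold removes every training case")
    return [w / total for w in kept]


# Toy similarity weights for one test case; 0.1 is an arbitrary cutoff.
row = [0.5, 0.3, 0.15, 0.05]
sparse_row = threshold_weights(row, min_weight=0.1)
active_cases = [i for i, w in enumerate(sparse_row) if w > 0]
```

Only the training cases listed in `active_cases` can ever be drawn into the bootstrap samples for this test case, so each case-specific forest would be grown on a smaller effective training set.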
For future work, methods other than random forests may be utilized for similarity-based prediction for survival outcomes. One possible extension is the joint modeling of longitudinal covariates and a time-to-event outcome (e.g., [21]). One might be able to identify similar cases based on longitudinal covariates as well as time-fixed covariates. In addition, an approach within this framework that handles missing values in the dataset should be pursued as well.
In spite of some areas that require future study, we have shown the proposed SB-RSF approach to hold promise for the prediction of survival outcomes. Our investigation shows that our similarity-based algorithm can improve the predictive accuracy of a popular and useful prediction tool, i.e., random survival forest ([15, 22]), for time-to-event data.
References
- [1] Klein, John P and Moeschberger, Melvin L, “Survival analysis: techniques for censored and truncated data (2nd ed.),” Springer Science, 2006.
- [2] Lee, Joon and Maslove, David M and Dubin, Joel A, “Personalized mortality prediction driven by electronic medical data and a patient similarity metric,” PLoS ONE, vol. 10(1), e0127428, 2015.
- [3] Xu, Ruo and Nettleton, Dan and Nordman, Daniel J, “Case-specific random forests,” Journal of Computational and Graphical Statistics, vol. 25(1), pp. 49–65, 2016.
- [4] Lee, Joon, “Patient-specific predictive modeling using random forests: An observational study for the critically ill,” JMIR Medical Informatics, vol. 5(1), e3, 2017.
- [5] Saeed, Mohammed and Villarroel, Mauricio and Reisner, Andrew T and Clifford, Gari and Lehman, Li-Wei and Moody, George and Heldt, Thomas and Kyaw, Tin H and Moody, Benjamin and Mark, Roger G, “Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database,” Critical Care Medicine, vol. 39(5), pp. 952–960, 2011.
- [6] Lee, Joon and Scott, Daniel J and Villarroel, Mauricio and Clifford, Gari D and Saeed, Mohammed and Mark, Roger G, “Open-access MIMIC-II database for intensive care research,” 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 8315–8318, 2011.
- [7] Boag, John W, “Maximum likelihood estimates of the proportion of patients cured by cancer therapy,” Journal of the Royal Statistical Society, Series B, vol. 11(1), pp. 15–53, 1949.
- [8] Berkson, Joseph and Gage, Robert P, “Survival curve for cancer patients following treatment,” Journal of the American Statistical Association, vol. 47, pp. 501–515, 1952.
