Dynamic Prediction of Competing Risk Events using Landmark Sub-distribution Hazard Model with Multiple Longitudinal Biomarker

Cai Wu; Liang Li; Ruosha Li

arXiv:1906.05647·q-bio.QM·June 14, 2019

TL;DR

This paper introduces a dynamic prediction framework for competing risk events using landmark sub-distribution hazard models with longitudinal biomarkers, enabling real-time risk assessment in chronic disease studies.

Contribution

It extends landmark survival models to competing risks with irregular biomarker measurements, providing a flexible, interpretable, and computationally efficient prediction method.

Findings

1. Accurate dynamic prediction of end-stage renal disease risk.
2. Effective handling of irregular biomarker measurement times.
3. Validated through simulations and real data application.

Abstract

The cause-specific cumulative incidence function (CIF) quantifies the subject-specific disease risk with competing risk outcome. With longitudinally collected biomarker data, it is of interest to dynamically update the predicted CIF by incorporating the most recent biomarker as well as the cumulating longitudinal history. Motivated by a longitudinal cohort study of chronic kidney disease, we propose a framework for dynamic prediction of end stage renal disease using multivariate longitudinal biomarkers, accounting for the competing risk of death. The proposed framework extends the landmark survival modeling to competing risks data, and implies that a distinct sub-distribution hazard regression model is defined at each landmark time. The model parameters, prediction horizon, longitudinal history and at-risk population are allowed to vary over the landmark time. When the measurement times…

Tables

Table 1. Web Table 1: The predicted CIF at different landmark times $s$ and biomarker values $m$. The true conditional risk (True) was obtained empirically using the method described in Section 5. The average estimated CIF (EST), percent bias (%Bias), empirical standard deviation (ESD), and mean squared error ($\times 1{,}000$) (MSE) are reported. The prediction horizon is $\tau_{1}=3$. The results are based on 500 Monte Carlo repetitions.

| $s$ | $m$ | True | EST | %Bias | ESD | MSE |
|---|---|---|---|---|---|---|
| 1 | 0 | 0.167 | 0.168 | 0.703 | 0.029 | 0.836 |
| 1 | 2 | 0.357 | 0.350 | -2.030 | 0.025 | 0.670 |
| 1 | 4 | 0.610 | 0.639 | 4.734 | 0.052 | 3.583 |
| 3 | 0 | 0.278 | 0.295 | 6.262 | 0.052 | 3.001 |
| 3 | 2 | 0.516 | 0.503 | -2.589 | 0.033 | 1.299 |
| 3 | 4 | 0.729 | 0.755 | 3.463 | 0.053 | 3.419 |
| 5 | 0 | 0.312 | 0.327 | 4.762 | 0.089 | 8.068 |
| 5 | 2 | 0.505 | 0.491 | -2.795 | 0.062 | 4.104 |
| 5 | 4 | 0.681 | 0.691 | 1.407 | 0.064 | 4.160 |
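As a rough illustration of how the summary columns in this table relate to a set of Monte Carlo estimates, the sketch below computes EST, %Bias, ESD, and MSE ($\times 1{,}000$) from an array of estimated CIF values. The function name `summarize_estimates`, the use of the sample standard deviation (`ddof=1`), and the toy input are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def summarize_estimates(estimates, true_value):
    """Summarize Monte Carlo CIF estimates against a known true value."""
    estimates = np.asarray(estimates, dtype=float)
    est = estimates.mean()                                # EST: average estimate
    pct_bias = 100.0 * (est - true_value) / true_value    # %Bias
    esd = estimates.std(ddof=1)                           # ESD (ddof=1 is an assumption)
    mse_x1000 = 1000.0 * np.mean((estimates - true_value) ** 2)  # MSE x 1,000
    return {"EST": est, "pct_bias": pct_bias, "ESD": esd, "MSE_x1000": mse_x1000}

# Toy input (hypothetical): 500 noisy estimates around a true CIF of 0.167.
rng = np.random.default_rng(0)
toy_estimates = rng.normal(loc=0.168, scale=0.029, size=500)
summary = summarize_estimates(toy_estimates, true_value=0.167)
print(summary)
```

Under these definitions the MSE is approximately the squared bias plus the squared ESD, which is why rows with similar ESD but larger bias show a larger MSE.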

Equations

$$Y_{i1}(t_{ij})$$

$$Y_{i2}(t_{ij})$$

$$\textrm{logit}\{P(Y_{i3}(t_{ij})=1)\}=m_{i3}(t_{ij})=b_{i03}+b_{i13}\cdot t_{ij},$$

$$\lambda_{k}(t)=\lambda_{k0}(t)\exp\Big\{\gamma_{k}X_{i}+\sum_{q=1}^{3}\beta_{kq}m_{iq}(t)+v_{k}u_{i}\Big\}.$$

$$\boldsymbol{R}=\begin{pmatrix}
1 & & & & & \\
0.26 & 1 & & & & \\
-0.5 & -0.65 & 1 & & & \\
-0.3 & -0.3 & 0.35 & 1 & & \\
-0.5 & -0.5 & 0.5 & 0.5 & 1 & \\
-0.3 & -0.3 & 0.3 & 0.3 & 0.3 & 1
\end{pmatrix}.$$

Taxonomy

Topics: Statistical Methods and Inference · Insurance, Mortality, Demography, Risk Management · Statistical Methods and Bayesian Inference

Full text

**Web-based Supplementary Materials for “Dynamic Prediction of Competing Risk Events using Landmark Sub-distribution Hazard Model with Multiple Longitudinal Biomarkers” by Cai Wu, Liang Li, and Ruosha Li**

Web Appendix A: Data Generation Procedure for Simulation

The simulation results are presented in the main text of the paper. This section details the data generation procedure and parameter settings. The longitudinal processes are generated from equation (1) below. For each simulation run, we simulated a total of $n$ subjects with independent and identically distributed data.

[TABLE]

Under both the non-informative biomarker scenario (S1) and the informative biomarker scenario (S2), the data were simulated according to the joint frailty model of longitudinal biomarkers and competing risk event times (Elashoff et al., 2008). The model includes the longitudinal sub-model and the following survival sub-model ( $k=1,2$ ):

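$$\lambda_{k}(t)=\lambda_{k0}(t)\exp\Big\{\gamma_{k}X_{i}+\sum_{q=1}^{3}\beta_{kq}m_{iq}(t)+v_{k}u_{i}\Big\}.$$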

The baseline hazard for each time-to-event outcome follows a Weibull distribution, with scale and shape parameters of (0.02, 2.3) for event 1 and (0.01, 2.4) for event 2. The longitudinal sub-model includes three longitudinal biomarkers. The first, $Y_{i1}(.)$ , is a continuous biomarker with a linear mean trajectory $m_{i1}(.)$ . The second, $Y_{i2}(.)$ , has a nonlinear subject-specific mean trajectory. The third is binary with a logit-linear mean trajectory. For the first two biomarkers, $\epsilon_{i1}(.)$ and $\epsilon_{i2}(.)$ are random noise terms with an $N(0,0.5^{2})$ distribution. Each biomarker’s longitudinal trajectory is characterized by two random effects, denoted by $\boldsymbol{b}_{ip}=(b_{i0p},b_{i1p})^{T}$ ( $p=1,2,3$ ). In the case of a linear trajectory, such as that of the first biomarker, they represent the subject-specific random intercept and slope. We let $\boldsymbol{b}_{i}=(\boldsymbol{b}_{i1}^{T},\boldsymbol{b}_{i2}^{T},\boldsymbol{b}_{i3}^{T})^{T}\sim MVN(\boldsymbol{\Omega},\boldsymbol{D})$ , where $\boldsymbol{\Omega}=(2.8,-0.14,2.1,0.01,-1,0.3)$ denotes the population mean. The covariance matrix $\boldsymbol{D}$ can be decomposed into $\boldsymbol{D}=diag(\boldsymbol{\sigma}_{q})\times\boldsymbol{R}\times diag(\boldsymbol{\sigma}_{q})$ , where the diagonal matrix $diag(\boldsymbol{\sigma}_{q})$ has elements $\boldsymbol{\sigma}_{q}=(\sigma_{01},\sigma_{11},\sigma_{02},\sigma_{12},\sigma_{03},\sigma_{13})=(0.9,0.1,0.9,0.005,0.9,0.1)$ and $\boldsymbol{R}$ is the correlation matrix.

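$$\boldsymbol{R}=\begin{pmatrix}
1 & & & & & \\
0.26 & 1 & & & & \\
-0.5 & -0.65 & 1 & & & \\
-0.3 & -0.3 & 0.35 & 1 & & \\
-0.5 & -0.5 & 0.5 & 0.5 & 1 & \\
-0.3 & -0.3 & 0.3 & 0.3 & 0.3 & 1
\end{pmatrix}.$$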
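The decomposition $\boldsymbol{D}=diag(\boldsymbol{\sigma}_{q})\times\boldsymbol{R}\times diag(\boldsymbol{\sigma}_{q})$ can be sketched numerically as follows. This is a minimal illustration, assuming the correlation entries listed for $\boldsymbol{R}$ are the lower triangle of a symmetric matrix ordered as $(b_{i01},b_{i11},b_{i02},b_{i12},b_{i03},b_{i13})$; the helper name `draw_random_effects` and the seed are hypothetical.

```python
import numpy as np

# Lower triangle of R as listed in the supplement (assumed ordering:
# b_i01, b_i11, b_i02, b_i12, b_i03, b_i13).
R_lower = [
    [1.0],
    [0.26, 1.0],
    [-0.5, -0.65, 1.0],
    [-0.3, -0.3, 0.35, 1.0],
    [-0.5, -0.5, 0.5, 0.5, 1.0],
    [-0.3, -0.3, 0.3, 0.3, 0.3, 1.0],
]
R = np.zeros((6, 6))
for i, row in enumerate(R_lower):
    R[i, : len(row)] = row
R = R + R.T - np.diag(np.diag(R))  # mirror the lower triangle, keep unit diagonal

sigma = np.array([0.9, 0.1, 0.9, 0.005, 0.9, 0.1])    # standard deviations sigma_q
Omega = np.array([2.8, -0.14, 2.1, 0.01, -1.0, 0.3])  # population mean
D = np.diag(sigma) @ R @ np.diag(sigma)               # covariance of the random effects

def draw_random_effects(n, seed=0):
    """Draw n vectors b_i ~ MVN(Omega, D); the seed is an arbitrary choice."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(Omega, D, size=n)

b = draw_random_effects(500)
print(b.shape, np.round(np.diag(D), 6))
```

Each row of `b` holds the intercept and slope pairs for the three biomarkers of one subject.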

In the survival sub-model, $u_{i}$ is the frailty term accounting for the correlation between the two competing events, and the parameter $v_{1}$ is set to 1 to ensure identifiability. We let $u_{i}\sim N(0,\sigma_{u}^{2})$ , where $\sigma_{u}=0.5$ . For S1, $\{\beta_{1q}\}$ and $\{\beta_{2q}\}$ are all set to zero. For S2, we set $\{\beta_{1q};q=1,2,3\}=(-1.2,0.3,1.5)$ and $\{\beta_{2q};q=1,2,3\}=(-0.2,0.05,0.6)$ . For both S1 and S2, the sub-model includes one baseline covariate $X_{i}\sim N(0.5,0.5^{2})$ with regression coefficients $\gamma_{1}=-1.5$ and $\gamma_{2}=-1$ . The censoring times are generated from a mixture of uniform distributions, $\eta_{1}\textrm{Unif}(0,3)+\eta_{2}\textrm{Unif}(3,6)+\eta_{3}\textrm{Unif}(6,9)+\eta_{4}\textrm{Unif}(9,12)$ , where the mixing probabilities $\eta_{1}$ to $\eta_{4}$ ( $\sum_{i=1}^{4}{\eta_{i}}=1$ ) are chosen to keep the censoring rate at approximately 25%. For example, they equal $(0.1,0.1,0.2,0.6)$ for the simulation with informative biomarkers and $(0.1,0.1,0.1,0.7)$ for the simulation with non-informative biomarkers. See the description of these two simulation scenarios below.
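The mixture-of-uniforms censoring mechanism described above can be sketched as a two-step draw: first pick one of the four intervals with probabilities $\eta_{1},\dots,\eta_{4}$, then draw uniformly within it. The function name `draw_censoring_times` and the seed are hypothetical. Note that this only generates the censoring times; the realized censoring rate also depends on the simulated event times, which are not generated here.

```python
import numpy as np

def draw_censoring_times(n, eta, seed=0):
    """Draw n censoring times from a four-component mixture of uniforms on (0, 12)."""
    eta = np.asarray(eta, dtype=float)
    if eta.shape != (4,) or not np.isclose(eta.sum(), 1.0):
        raise ValueError("eta must contain four probabilities that sum to 1")
    rng = np.random.default_rng(seed)
    lower = np.array([0.0, 3.0, 6.0, 9.0])      # left endpoints of the four intervals
    component = rng.choice(4, size=n, p=eta)    # which interval each subject falls in
    return lower[component] + rng.uniform(0.0, 3.0, size=n)

# Mixing probabilities quoted in the text for the two scenarios.
c_informative = draw_censoring_times(500, (0.1, 0.1, 0.2, 0.6))
c_noninformative = draw_censoring_times(500, (0.1, 0.1, 0.1, 0.7))
print(c_informative[:5])
```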

The random intercept and random slope (time effect) are assumed to be positively correlated for each biomarker. We allow $\boldsymbol{Y}_{i1}$ and $\boldsymbol{Y}_{i2}$ to have a mild negative correlation, and $\boldsymbol{Y}_{i1}$ and $\boldsymbol{Y}_{i3}$ a mild positive correlation. The measurement times $t_{ij}$ are irregularly spaced and unsynchronized across subjects. They were generated from $t_{ij}=\tilde{t}_{j}+e_{ij}$ , where $\{\tilde{t}_{j}\}$ are the scheduled measurement times from 0 to 12 years in increments of 0.5 and $e_{ij}\sim Unif(-0.17,0.17)$ . This setup corresponds to the practical situation in which each subject has a clinical visit within a two-month window around the scheduled visit time. For each simulation scenario, we used $500$ Monte Carlo repetitions with a sample size of $n=500$ .
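The irregular measurement schedule $t_{ij}=\tilde{t}_{j}+e_{ij}$ can be sketched as a scheduled grid plus independent uniform jitter. The function name `draw_measurement_times` and the seed are hypothetical. The sketch leaves the jittered first visit as drawn (it can fall slightly below 0), since the text does not say how that case was handled.

```python
import numpy as np

def draw_measurement_times(n, seed=0):
    """Return an (n, 25) array of jittered visit times and the scheduled grid."""
    rng = np.random.default_rng(seed)
    scheduled = np.linspace(0.0, 12.0, 25)      # 0, 0.5, ..., 12 years
    jitter = rng.uniform(-0.17, 0.17, size=(n, scheduled.size))  # e_ij
    return scheduled + jitter, scheduled

t, scheduled = draw_measurement_times(500)
print(t.shape, np.round(t[0, :4], 3))
```

Because the jitter half-width (0.17) is less than half the spacing between scheduled visits (0.25), the visit times of each subject remain in increasing order.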

Web Appendix B: Simulation on Local Linear Estimation

As explained in the Simulation section, the proposed landmark SDH model is a working model and it is therefore difficult to simulate data so that the model holds at all landmark times. This is a common feature of the landmark (or partly conditional) modeling approaches in general. In light of this difficulty, we resort to a simple albeit approximate approach to evaluating the quality of the proposed local linear estimation, at any landmark time $s$ , as described below.

We simulated a cross-sectional time-to-event data set at a given landmark $s$ , e.g., $s=3$ , which was treated as baseline for the purpose of this simulation. Scattered individual measurement times $\{t_{ij}\}$ and the associated biomarker values $\boldsymbol{Y}_{i}(t_{ij})$ were simulated within a small neighborhood of $s$ . The proposed landmark SDH model was used to generate independent competing risks data starting from each $t_{ij}$ , following the simulation algorithm in Fine and Gray (1999). The log-SDH ratio $\boldsymbol{\beta}(s)$ is assumed to be a quadratic function of $s$ (Web Figure 5). Note that this is not really a landmark dataset because each subject has only one $t_{ij}$ . Nonetheless, this dataset exactly satisfies the landmark SDH model, so we can use it to study the numerical performance of the proposed local linear estimation in a small neighborhood of $s$ . Specifically, we evaluate the bias in estimating $\boldsymbol{\beta}(s)$ and the baseline CIF (Web Figure 6), $\pi_{0}(t^{*};s)=1-\textrm{exp}\Big(-\int_{0}^{t^{*}}\lambda_{10}(t,s)dt\Big)$ , as well as the selection of the bandwidth.
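The baseline CIF formula above maps a baseline sub-distribution hazard to a cumulative incidence by integrating the hazard and exponentiating. A minimal numerical sketch follows, using a hand-written trapezoidal rule. The hazard `toy_hazard` and its parameters are purely hypothetical placeholders chosen so that a closed form is available for comparison; they are not the baseline hazard used in the paper.

```python
import numpy as np

def baseline_cif(hazard, t_star, n_grid=2001):
    """Compute 1 - exp(-integral of hazard over [0, t_star]) by the trapezoidal rule."""
    grid = np.linspace(0.0, t_star, n_grid)
    values = hazard(grid)
    cumulative = np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))
    return 1.0 - np.exp(-cumulative)

def toy_hazard(t, a=0.05, k=2.0):
    """Hypothetical baseline sub-distribution hazard a * k * t^(k-1)."""
    return a * k * t ** (k - 1.0)

pi0 = baseline_cif(toy_hazard, t_star=3.0)
closed_form = 1.0 - np.exp(-0.05 * 3.0 ** 2)   # integral of the toy hazard is a * t^k
print(pi0, closed_form)
```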

The results are presented in Web Figure 7. The three columns from left to right plot the estimated log-SDH ratio, bias percentage, and mean squared error (MSE) against different bandwidths. The rows from top to bottom correspond to three increasing sample sizes. In the plot of the log-SDH ratio (column 1), the mean estimated $\boldsymbol{\beta}(s)$ at $s=3$ over the Monte Carlo repetitions is close to the true value (red horizontal line) at small bandwidths (e.g., 0.3 and 0.5). As the bandwidth increases, the estimator shows increasing downward bias. This is because the true $\boldsymbol{\beta}(s)$ function is concave (Web Figure 5), and the local linear fit underestimates it at the peak as the bandwidth increases. The empirical standard errors, shown in Web Figure 7 as vertical whiskers, shrink as the bandwidth increases because more data points are included in the kernel estimation. From top to bottom, the empirical standard errors decrease as the sample size increases. Column 2 shows that the bias percentage generally increases with the bandwidth, except when the bandwidth is very small, in which case larger finite-sample bias may result because very few data points are available in the neighborhood defined by the bandwidth. In column 3, the U-shaped MSE curve demonstrates the typical bias-variance trade-off in kernel estimation. Overall, the percentage of absolute bias for the log-SDH ratio is very small, within 2% for mid-range bandwidths (the horizontal dashed line in column 2). The results from this simulation suggest that the proposed local linear estimation behaves as expected of typical local polynomial estimators.
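The downward bias at the peak of a concave function can be reproduced with a generic toy example. The sketch below applies ordinary kernel-weighted least squares (with an Epanechnikov kernel, an assumed choice) to a noise-free concave quadratic and evaluates the fitted line at the peak. This is not the paper's estimator, which is fitted to competing risks data; the function `toy_beta` and all numerical values are hypothetical and only illustrate the bandwidth effect.

```python
import numpy as np

def local_linear_fit(x, y, x0, bandwidth):
    """Kernel-weighted least squares line around x0; returns the fitted value at x0."""
    u = (x - x0) / bandwidth
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)  # Epanechnikov weights
    X = np.column_stack([np.ones_like(x), x - x0])              # intercept and slope
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0]                                              # intercept = fit at x0

def toy_beta(s):
    """Hypothetical concave quadratic with its peak at s = 3."""
    return 1.0 - 0.1 * (s - 3.0) ** 2

x = np.linspace(1.0, 5.0, 401)
y = toy_beta(x)                                                 # noise-free values
bandwidths = (0.3, 0.5, 1.0, 1.5)
bias = {h: local_linear_fit(x, y, 3.0, h) - toy_beta(3.0) for h in bandwidths}
for h in bandwidths:
    print(h, bias[h])
```

In this toy setting every bias value is negative and grows in magnitude with the bandwidth, because a wider window averages in more points that lie below the peak.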


Web Appendix C: Table and Figures