Probabilistic Clustering Using Multivariate Growth Mixture Model in Clinical Settings—A Scleroderma Example

Ji Soo Kim; Yizhen Xu; Rachel S. Wallwork; Laura K. Hummers; Ami A. Shah; Scott L. Zeger

PMC · DOI:10.1002/sim.70450·February 13, 2026

Probabilistic Clustering Using Multivariate Growth Mixture Model in Clinical Settings—A Scleroderma Example

Ji Soo Kim, Yizhen Xu, Rachel S. Wallwork, Laura K. Hummers, Ami A. Shah, Scott L. Zeger

PDF

Open Access

TL;DR

This paper introduces a method to classify scleroderma patients into subgroups based on lung function progression, helping guide clinical decisions in real time.

Contribution

A Bayesian multivariate growth mixture model is proposed for real-time classification of scleroderma patients into clinically meaningful subgroups.

Findings

01

A 'stable' group and a 'progressor' group of scleroderma patients were identified based on lung function trajectories.

02

The method calculates and updates the probability of a patient being a fast progressor as more data becomes available.

03

This approach can improve clinical decisions by integrating baseline and longitudinal patient data.

Abstract

Scleroderma (systemic sclerosis; SSc) is a chronic autoimmune disease known for wide heterogeneity in patients' disease progression in multiple organ systems. Our goal is to guide clinical care by real‐time classification of patients into clinically interpretable subpopulations based on their baseline characteristics and the temporal patterns of their disease progression. A Bayesian multivariate growth mixture model was fit to identify subgroups of patients from the Johns Hopkins Scleroderma Center Research Registry who share similar lung function trajectories. We jointly modeled forced vital capacity (FVC) and diffusing capacity for carbon monoxide (DLCO) as pulmonary outcomes for 289 patients with SSc and anti‐topoisomerase 1 antibodies and developed a framework to sequentially update class membership probabilities for any given patient based on her accumulating data. We identified…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Chemicals1

carbon

Diseases4

scleroderma systemic sclerosis SSc autoimmune disease

Figures6

Click any figure to enlarge with its caption.

Directed acyclic graph under a simplified assumption of time‐invariant cluster membership. The arrows represent conditional dependencies. For example, the distribution of Yij(ℓ) depends on μij(ℓ) and R(ℓ). Therefore, there are two arrows pointing from the latter variables to the former. Rectangles represent observed variables; ovals represent unobserved parameters or latent variables.

Estimated average trajectories of pFVC and pDLCO in a standardized scale for each identified cluster among anti‐topo positive patients in the reference subgroup: Non‐diffuse, non‐Black, female patients with early onset. The values are quantile‐normalized so that observations within each measure follow the standard normal distribution. The solid line represents the posterior mean, while the confidence band represents the 95% credible interval of the population‐level temporal patterns. The fast progressor group (left) consists of patients with faster progressing trajectories of pFVC and pDLCO, and the stable group (right) consists of patients with stable trajectories for both measures. The coefficient estimates used to generate these trajectories are shown in Figure 3.

Posterior mean and 95% credible interval of the estimated coefficients of the cluster‐specific time trends that describe the progression of pFVC and pDLCO trajectories. Spline 1 and Spline 2 are obtained from natural cubic splines, with a knot at 5 years and bounded at 0 and 10 years.

Posterior mean and 95% credible interval of the estimated coefficients of the shared factors and their interactions with time trends (Spline 1 and Spline 2, obtained from natural cubic splines, with a knot at 5 years and bounded at 0 and 10 years) that describe the progression of pFVC and pDLCO trajectories.

Flowchart demonstrating changes in the probability of being in the fast progressor cluster of 289 anti‐topo positive patients. NA (white space) indicates no probability estimated due to pFVC or pDLCO not observed. The estimated probabilities at baseline (0 years since first visit) are more evenly distributed across the 5 probability groups, indicating there is low certainty of group membership for most patients. As more pFVC and pDLCO data are observed, estimated probabilities are closer to 0 or 1 for most patients, demonstrating an increase in certainty.

Observed and estimated pFVC and pDLCO trajectories in a standardized scale by group membership and estimated probabilities of belonging to the fast progressor group. A randomly selected patient's normalized observed pFVC (top) and pDLCO (middle) are shown in black, and her sequentially updated estimated trends (solid lines) and their 95% credible intervals (shades) are shown in blue and pink, given that the person is in the fast progressor or stable group, respectively. The estimated probabilities of belonging to the fast progressor group given available observed data are also sequentially updated after each visit (bottom). The first point shown in the bottom plot (at 0 year since onset) is the probability of being a fast progressor given only baseline information (without any pFVC or pDLCO measurement), that is exp(α01+WiαW1)/{∑ℓ=12α0ℓ+WiαWℓ} where ℓ=1 corresponds to the fast progressor group. The estimated probability of being in the fast progressor group is close to 1 after observing pFVC and pDLCO around 1 year since onset.

Tables1

TABLE 1. Data summary stratified by the identified trajectory clusters: The fast progressor and stable groups.

	All (N = 289)	Fast progressor (N = 139)	Stable (N = 150)
	Mean (SD) or # (%)	Mean (SD) or # (%)	Mean (SD) or # (%)
Male	47 (16.3)	19 (13.7)	28 (18.7)
Self‐identified Black race	65 (22.5)	46 (33.1)	19 (12.7)
Diffuse	165 (57.1)	98 (70.5)	67 (44.7)
Age of onset (non‐RP)	46.3 (13.6)	51.1 (13.1)	41.8 (12.5)
Baseline pFVC	80.8 (18.4)	79 (17.5)	82.6 (19)
Baseline pDLCO	67.1 (22.6)	63.6 (22.2)	70.4 (22.5)

Funding10

—National Institute of Arthritis and Musculoskeletal and Skin Diseases10.13039/100000069
—Jerome L. Greene Foundation10.13039/100015629
—Rheumatology Research Foundation10.13039/100006260
—Johns Hopkins inHealth initiative
—Scleroderma Research Foundation10.13039/100001845
—Nancy and Joachim Bechtle Precision Medicine Fund for Scleroderma
—Manugian Family Scholar
—Donald B. and Dorothy L. Stabler Foundation10.13039/100029250
—Chresanthe Staurulakis Memorial Fund
—Sara and Alex Othon Research Fund

Keywords

Bayesian hierarchical modelsmultivariate growth mixture modelingsclerodermasequentially‐updating algorithmtrend‐based cluster membership

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSystemic Sclerosis and Related Diseases · Multiple Sclerosis Research Studies · Inflammatory Myopathies and Dermatomyositis

Full text

Introduction

1

Scleroderma (systemic sclerosis; SSc) is a multisystem autoimmune disease characterized by vasculopathy, dysregulation of the immune system, and exaggerated fibrosis [1]. It is well established that SSc patients demonstrate wide heterogeneity in clinical manifestations, speed and direction of disease progression, response to treatment, and survival [2, 3]. Severe organ complications, including interstitial lung disease (ILD), pulmonary arterial hypertension (PAH), cardiac involvement, and digital ulcers, occur in about 15% of patients [4]. Autoantibodies are known to associate with distinct phenotypes and risks of complications at a population level [5, 6], but significant variability in disease trajectory and the risk of major events is observed even within the same autoantibody strata at the patient level. By identifying unique temporal patterns in the progression of multiple biomarkers, we can gain a better understanding of the etiology of disease progression and make more informed treatment decisions.

Over the past decade, there has been a growing interest in developing methods to risk‐stratify patients into subgroups with similar progression patterns based on observed biomarkers, as these approaches may facilitate translational research studies focused on disease pathogenesis and inform targeted therapeutic approaches. Prior research on identifying clusters of SSc patients typically used a single outcome. When the outcome measure is continuous and roughly follows a Gaussian distribution, linear mixed effects models (LMM) [7] can flexibly describe patients' health trajectories over time. Latent class mixture models (LCMM), or growth mixture models (GMM) [8, 9], assume that the population comprises a small number of distinct classes across which the relationship of the outcome with fixed and random effects can vary. The index of the component indicates the membership of a patient in a specific cluster. LCMM assumes that there are a fixed number of latent clusters consisting of individuals with similar trajectories, where the unobserved individual‐level heterogeneity within each cluster is captured by cluster‐specific random effects. Group‐based trajectory modeling (GBTM) or Latent Class Growth Analysis (LCGA) [10] is a simplified model of LCMM that assumes all observations are independent of one another within a latent cluster and does not allow individual‐level random effects.

The LCGA model has gained popularity for identifying subgroups of biomarker trajectories in SSc patients due to its simplicity and low computational requirements. Using LCGA models, Barbacki et al. identified three trajectory groups using Scleroderma Clinical Trials Consortium Damage Index [11], a global irreversible damage score that combines multiple biomarkers, including respiratory measures. Man et al. [12] identified seven distinct trajectory patterns of percent predicted forced vital capacity (pFVC), and Kida et al. [13] identified five trajectory patterns of systolic pulmonary arterial pressure in a multicenter cohort study of SSc patients. After identifying the clusters, post‐hoc analyses were conducted to examine the relationship between patient characteristics and the odds of being in each trajectory cluster. The limitation of this strategy is that the cluster membership estimates can only be acquired for individuals in the dataset on which the model was trained. Additionally, the cluster discovery process does not account for known factors that influence the progression of trajectories. Schulam et al. [14] proposed hierarchical latent variable models to identify homogeneous SSc patient subgroups and separately modeled 4 continuous biomarkers (pFVC, total skin score, and percent predicted diffusing capacity of carbon monoxide (pDLCO), right ventricular systolic pressure). This approach enables flexible modeling of population, subgroup, and individual‐level longitudinal trajectories. However, when fitting the model separately for each measure, the number of subgroups identified across biomarkers varies. This variability makes it challenging to translate the results for clinical use, as patients' cluster memberships are then determined by combinations of cluster memberships specific to each measure.

Multivariate latent class mixture models (MLCMM) [15, 16] are better suited for identifying longitudinal clusters using multiple SSc biomarkers. Compared to LCMM, MLCMM enables the simultaneous modeling of multiple outcomes and addresses the between‐measure correlations using correlated random effects and random error terms. We introduce an MLCMM within a Bayesian framework. This method incorporates two key features: Firstly, it jointly analyzes the trajectories of pulmonary measures FVC and DLCO for subgroup discovery. Secondly, it allows for sequential updates of cluster membership estimates and trajectory predictions based on the dynamic accumulation of patient information. It is crucial to model both FVC and DLCO simultaneously since both biomarkers are strongly correlated. Failing to consider this association would lead to a decrease in the statistical efficiency of model estimation. Additionally, the dynamic forecast of patient condition using the MLCMM is crucial for the utility of such models in clinical settings.

A key premise of the analysis is that SSc patients can be separated into subgroups based upon both mechanistic and empirical evidence. We believe that disease progression can be predicted from known patient characteristics, such as autoantibody status, and from unobserved (latent) variables that are manifest in the patient's biomarker trajectories to date. If true, predictions for individuals can be improved by identifying subsets of people whose trajectories have similar temporal patterns not only in terms of baseline characteristics but also latent traits. The model we propose enables the efficient identification of underlying traits that simultaneously and flexibly characterize the temporal patterns of trajectory progression across multiple biomarker measures. The method not only enables the identification of patient subgroups based on their baseline status but also allows it to be updated with accumulating patient information.

The structure of this paper is as follows. We introduce the goal of the analysis and study cohort in Section 2.1 and define the multivariate pulmonary outcomes and their preprocessing in Section 2.2. The multivariate latent class mixture model framework and notations are in Section 2.3. Section 2.4 provides an application and details of the model specified to meet clinical aims using the Johns Hopkins Scleroderma Center Cohort data. The details of generating dynamic predictions utilizing the multivariate mixture model are in Section 2.5. We describe the data characteristics in Section 3.1, the selection process for determining the number of subgroups in Section 3.2, present the convergence diagnostics of model estimation in Section 3.3, and describe results of model fitting in Section 3.4. Section 4 includes a discussion on the advantages and potential clinical applications of the model, limitations of our method, and future research directions. In Section 5, we briefly highlight the clinical significance and summarize the main contributions of the approach.

Methods

2

Modeling Multivariate Longitudinal Outcome Measures

2.1

Our aim is to identify patient subgroups that represent distinct temporal patterns of pulmonary trajectories using Johns Hopkins Scleroderma Center Cohort patient data. For longitudinal outcome measures, we use pFVC and pDLCO among patients with systemic sclerosis. We jointly model the two measures using an MLCMM, where the correlation between them is addressed both on the patient level and in the stochastic errors. Our method not only enables the identification of patient groups based on their initial characteristics but also allows it to be updated with accumulating biomarker information. For this study, we limit our sample to 289 patients with anti‐topoisomerase 1 (topo) autoantibody, which is associated with a high risk of ILD. However, even within this subgroup, we observe heterogeneity in pulmonary function trajectory, which motivates our clinical aim.

Preprocessing Longitudinal Data

2.2

To ensure that the outcomes follow roughly Gaussian distributions prior to model fitting, we quantile‐normalized each measure. For each measure Yk, where k=1,2 denotes pFVC and pDLCO, we obtain the quantile normalized values Φ−1∘Ĝk(Yk) where Gk^ is the estimated marginal distribution of Yk and Φ−1 is the inverse standard normal distribution. The quantile‐normalized values are used in model fitting, as described in the following sections.

Multivariate Latent Class Mixture Model

2.3

Assume the ith subject has Ji measurement times and K outcomes Yij={Yij1,…,YijK} at each time j∈{1,…,Ji}. Let Uijk and Vijk be sets of covariates that are associated with disease progression; they may or may not be time‐varying. Specifically, we assume that Uijk affects population‐level trajectory and is shared across all latent subgroups, and Vijk is associated with systematic, unobserved heterogeneity in the latent disease progression. In other words, we hypothesize that the impact of Vijk on the overall progression of biomarkers in the population may differ systematically among unobserved subgroups. Without loss of generality, we assume no missingness in the data and that there are L mixture components. For subject i who belongs to the ℓth mixture component, we let μij(ℓ)=(μij1(ℓ),…,μijK(ℓ)) be the latent disease progression at time j. The MLCMM assumes that with the additional presence of measurement error R(ℓ), there are K‐dimensional multivariate longitudinal outcomes that reflect the latent disease progression for each of the ℓth mixture component, i.e., Yij(ℓ)={Yij1(ℓ),…,YijK(ℓ)}. Note that the positive definite matrix R(ℓ)∈ℝK×K characterizes the correlation between measurement errors in Yij(ℓ). For generality, we allow both the population average trends and correlation structures to differ between clusters. Define Ci as the cluster membership of person i. In other words, if we know the true cluster membership Ci (which cannot be directly observed), the model assumes that the observed outcome is

[eqn]

We assume that {μijk(ℓ);k=1,…,K}, the latent disease trajectory under Ci=ℓ, follows the linear mixed effects model:

[eqn]

[eqn]

For the kth biomarker, βVk(ℓ) quantifies the effect of Vijk on the variation in the population‐level dynamic pattern of disease procedure, which is specific to the ℓth mixture component. In addition, βUk represents the association between Uijk and the latent disease progression μijk(ℓ), shared across all mixture components. Note that the distinct mixture components are largely characterized by systematically different patterns of the disease procedure in relation to Vijk. The random effects Zijkbik(ℓ) represent the individual‐level cluster‐specific deviation from the population average trajectories of the ℓth mixture components, UijkβUk+VijkβVk(ℓ), where Zijk∈ℝp are the subset of covariates that are associated with this heterogeneity. We assume that the random effect coefficients are distributed as bi(ℓ)=(bi1(ℓ),…,biK(ℓ))∼MVN(0,G(ℓ)).

Denote Wi as subject traits that provide information about the systematic unobserved heterogeneity characterized by cluster membership Ci. The cluster membership of person i, Ci, is modeled via logistic regression as follows, with cluster L as the reference level,

[eqn]

where α0ℓ be the intercept and αWℓ is the coefficients for Wi. We define α0=(α01,…,α0,L−1), αW=(αW1,…,αW,L−1), and α=(α0,αW).

For the ℓth mixture component, R(ℓ) and G(ℓ) quantify the within and across‐biomarker covariance of the random errors and random effects, respectively. Conditional on {μijk(ℓ);k=1,…,K}, covariance matrices R(ℓ) and G(ℓ), and the cluster membership Ci, the observed outcomes Yij are assumed to be independent across time and among patients. The distribution of biomarkers can then be written as

[eqn]

where Yi=(Yi1,…,YiJi), μi={μij(ℓ);j=1,…,Ji,ℓ=1,…,L}, G={G(ℓ);ℓ=1,…,L}, and R={R(ℓ);ℓ=1,…,L}.

Model Specification for the Johns Hopkins Scleroderma Center Cohort

2.4

The objective of developing our model using Johns Hopkins Scleroderma Center Cohort data is to meet the three clinical aims: (1) To identify distinct patterns of disease progression over time after adjusting for patients' baseline characteristics; (2) to generate an estimate of cluster membership for a new patient based on her baseline characteristics; and (3) to continuously update the estimated cluster membership and predict future biomarker trends using dynamically evolving and accumulating patient information. To efficiently address these three clinical aims, we jointly modeled the biomarkers to account for between‐measure correlations.

Figure 1 visualizes the dependency relationship among all the aforementioned parameters and variables. To simplify presentation, the figure assumes two mixture components and time‐invariant cluster membership. Furthermore, we define Vijk to be a smooth function of time such that the identified clusters will be characterized by unique longitudinal patterns of disease progression. The specification of P(Ci|Wi) in the MLCMM naturally serves as a baseline estimate of the cluster membership distribution, fulfilling the second clinical aim. Lastly, as illustrated in Section 2.5, the third clinical aim is addressed by using a Bayesian framework and employing posterior updating techniques to obtain real‐time calculations of cluster membership and cluster‐specific trajectory predictions.

Directed acyclic graph under a simplified assumption of time‐invariant cluster membership. The arrows represent conditional dependencies. For example, the distribution of Yij(ℓ) depends on μij(ℓ) and R(ℓ). Therefore, there are two arrows pointing from the latter variables to the former. Rectangles represent observed variables; ovals represent unobserved parameters or latent variables.

After preprocessing pFVC and pDLCO, we model them jointly as a multivariate Gaussian distribution,

[eqn]

where we assumed the random errors in the observed outcomes have a shared covariance matrix R across different clusters. This assumption ensures that the clusters are defined by differences in the population‐level trajectory patterns and patient‐level deviations around them.

We specify the latent disease progression as follows:

[eqn]

Disease onset (t=0) is defined by the first non‐Raynaud's phenomenon (non‐RP) symptom. We denote sij as the spline representation of the years since disease onset, tij, for the jth measurement of person i. To flexibly capture disease trajectories that vary over time, we used natural splines with 2 degrees of freedom with a knot at 5 years and boundary knots at 0 and 10 years (See Figure S1 in Supporting Information). These knot locations were chosen based on clinical knowledge of pulmonary disease progression in scleroderma. Variables that are known to influence lung trajectories from prior population studies are included as covariates (Xi) in the model. Xi includes sex, self‐identified Black race, diffuse cutaneous subtype, and an indicator of later disease onset (≥50 years; median age of onset among 289 anti‐topo positive patients is 47). The same set of covariates is used for Wi, which explains subject cluster membership. We assume there is a shared trend across clusters that vary flexibly by Xi, defined by β2k and β3k. Cluster‐specific deviation from this shared mean trend is characterized by β0k(ℓ), β1k(ℓ), and individual‐level deviation from the cluster‐specific trajectory is captured by a random intercept bik(ℓ). Details of prior specifications are in Appendix A. For model estimation, we use 2000 burn‐in samples before Npost=3000 posterior draws and run the estimating procedure using the package rstan [17] in the R software. We determined the number of latent subgroups by using Bayesian information criterion (BIC) [18] and checked model convergence by using the potential scale reduction factor (PSRF) [19].

Dynamic Predictions

2.5

To address the last clinical aim, we developed algorithms to (1) continuously update the probability of belonging to the fast progressor group (pit) for patient i and at time t given her past measures of both pFVC and pDLCO; and (2) generate a dynamic projection of an individual's future trajectories based on their history. In our model, cluster membership Ci is conceptualized as a time‐invariant, unobserved latent trait specific to each individual. While Ci itself does not vary over time, our Bayesian framework enables dynamic updating of its posterior distribution as more longitudinal data become available. Specifically, we compute the posterior probability of cluster membership at time t using the observed data up to that point via:

[eqn]

which provides a time‐varying estimate of the fixed latent cluster assignment. This reflects an evolving belief about Ci, rather than a change in the cluster membership itself. We propose a computational scheme for the dynamic calculation of pit in Appendix B. To separately project the person's disease progression in both the fast progressor and stable groups, we sample from the posterior predictive distribution of Yi,t+1(1) for the fast progressor group and Yi,t+1(2) for the stable group, following the computational scheme described in Appendix C. For example, the posterior predictive distribution of Yi,t+1(ℓ) would be represented by Npost posterior samples {Yi,t+11,…,Yi,t+1Npost}, with ℓ being 1 for the fast progressor group and 2 for the stable group.

Results

3

Data Characteristics

3.1

A total of 289 anti‐topo positive SSc patients were selected for this analysis by the following inclusion criteria: pFVC and pDLCO values collected within 10 years since the onset of the first non‐RP symptom, patients with at least 3 pFVC and pDLCO values in this 10‐year window, and patients whose first pFVC or pDLCO is observed within the first 5 years since non‐RP onset. We use complete data with no missingness in both the covariates and longitudinal outcomes. The dataset consists of 2329 observations, and the number of observed outcomes per patient ranges from 3 to 28, with a median of 7 observations. The upper part of Table 1 displays a summary of the variables.

Estimation of Number of Trajectory Subgroups

3.2

The model with two trajectory subgroups had the lowest BIC of 4124.8, compared to other models with one and three clusters for which BICs were 4478.4 and 4136.6, respectively. Therefore, we consider the two‐subgroup model as the optimal choice under the assumed framework. We identified a “stable” group of 150 patients for whom both biomarkers changed little from the date of disease onset over the next 10 years (Figure 2 right panel), and a “progressor” group of 139 patients, which on average experienced a steep decline in both measures starting soon after disease onset (Figure 2 left panel). Our method allows us to visualize the expected subgroup‐specific lung trajectories for any combination of the baseline factors under consideration.

Estimated average trajectories of pFVC and pDLCO in a standardized scale for each identified cluster among anti‐topo positive patients in the reference subgroup: Non‐diffuse, non‐Black, female patients with early onset. The values are quantile‐normalized so that observations within each measure follow the standard normal distribution. The solid line represents the posterior mean, while the confidence band represents the 95% credible interval of the population‐level temporal patterns. The fast progressor group (left) consists of patients with faster progressing trajectories of pFVC and pDLCO, and the stable group (right) consists of patients with stable trajectories for both measures. The coefficient estimates used to generate these trajectories are shown in Figure 3.

Convergence Diagnostics

3.3

We use the PSRF to demonstrate that the posterior estimation of model parameters under our assumed two‐subgroup model achieved sufficient Markov chain Monte Carlo (MCMC) convergence. The PSRF is a consistent estimator of Monte Carlo variance and is a Gelman–Rubin convergence diagnostic for MCMC. The estimate of the multivariate PSRF is 1.00016 for cluster‐specific and shared fixed effects coefficients (βU,βV(1),βV(2)), 1.00113 for the cluster‐specific covariance matrices of random effects (G(1),G(2)), 0.99995 for the shared covariance matrix of the random errors, R, and 1.00148 for the coefficients in the baseline probability of fast progressors (α0,αW). Given that the multivariate PSRF is close to 1 when the samples have converged to the target distribution, we conclude that the posterior samples from the estimated model demonstrate sufficient convergence.

Model Estimation Results

3.4

In the upper part of Table 1, we summarized the variables used in the model stratified by the two identified groups. For example, 47 patients in the dataset are male, constituting 16.3% of the data population, and the estimation procedure classified 19 of them into the fast progressor group and the remaining 28 males into the stable group. The age of disease onset defined by the first non‐RP symptom has an overall mean of 46.3, while the mean age of onset is 51.1 and 41.8 in the fast progressor and stable groups, respectively. In the lower part of Table 1, we further studied the relationship between the identified subgroups and other variables that may have scientific relevance but were not included in the model. We observe that within the topo autoantibody subset, most autoantibodies do not demonstrate a large discrepancy between the identified clusters, except for anti‐U1RNP, where there are 23 and 17 patients in the fast progressor and stable groups, respectively.

Figure 3 summarizes (β0k(1),β1k(1)) and (β0k(2),β1k(2)), and Figure 4 summarizes (β2k,β3k), where k=1 and 2 corresponds to pFVC and pDLCO, respectively. The estimated values and the corresponding 95% credible intervals are presented in Table S1 of the Supporting Information. Specifically, Figure 2 visualizes the population‐level temporal trends in the two identified subgroups, which is equivalent to setting (β2k,β3k) to zero. This describes the estimated trends for female, non‐Black, and non‐diffuse patients with disease onset before they are 50 years old. The fast progressor group has a rapid deterioration of pulmonary measures compared to the stable group, which has a significantly slower progression. We present the estimated posterior predictive distribution for the cluster‐specific random‐effects covariances (G(1),G(2)), the shared covariance matrix of the random errors, R, and the coefficients for the variables in the baseline probability of being in the fast progressor group, (α0,αW) in Table S2 of Supporting Information. On average, patients whose baseline characteristics are self‐identified Black race, diffuse cutaneous subtype, and later disease onset tend to have a significantly higher chance of being fast progressors, i.e., a faster deterioration in pulmonary measures.

Posterior mean and 95% credible interval of the estimated coefficients of the cluster‐specific time trends that describe the progression of pFVC and pDLCO trajectories. Spline 1 and Spline 2 are obtained from natural cubic splines, with a knot at 5 years and bounded at 0 and 10 years.

Posterior mean and 95% credible interval of the estimated coefficients of the shared factors and their interactions with time trends (Spline 1 and Spline 2, obtained from natural cubic splines, with a knot at 5 years and bounded at 0 and 10 years) that describe the progression of pFVC and pDLCO trajectories.

At time 0, when no pFVC and pDLCO values are observed, the probability of being in the fast progressor group, pi0, is calculated based on only baseline clinical and demographic characteristics. We obtain more precise estimations of individualized health trajectories and pit as more longitudinal observations become available. Figure 5 displays the dynamic change in the posterior probability of belonging to the fast progressor group over the years since disease onset across all patients in the observed dataset. The diagram presents the posterior probability in 5 intervals, from between 0 and 0.2 to between 0.8 and 1, representing an increasing likelihood of rapid pulmonary progression. As the follow‐up time increases in the flowchart, the proportion of individuals with a risk between 0.2 and 0.8 decreases. This is because our confidence in distinguishing between progressors and stable patients increases as we accumulate more observations per patient. By the eighth year, the algorithm had reached a high level of certainty regarding the classification of the patients.

Flowchart demonstrating changes in the probability of being in the fast progressor cluster of 289 anti‐topo positive patients. NA (white space) indicates no probability estimated due to pFVC or pDLCO not observed. The estimated probabilities at baseline (0 years since first visit) are more evenly distributed across the 5 probability groups, indicating there is low certainty of group membership for most patients. As more pFVC and pDLCO data are observed, estimated probabilities are closer to 0 or 1 for most patients, demonstrating an increase in certainty.

In Figure 6, we present a real‐time projection of pFVC and pDLCO trajectories and their 95% credible intervals for a randomly selected individual. This figure is an illustration of our proposed dynamic prediction schemes described in Section 2.5. The selected patient is female, non‐Black, has diffuse cutaneous disease, and was 42 years old when she developed SSc. At baseline (t=0), when no observations of pFVC or pDLCO were available, the estimated probability of belonging to the fast progressor group is 0.34, an estimate entirely based on her baseline covariates. After the first three observations of pFVC and pDLCO, the probability increases to 0.84 and stays close to 1 for the rest of the visits. We also see that her projected pFVC and pDLCO trajectories as a fast progressor represent the pattern in her observed data better than those of the stable group.

Observed and estimated pFVC and pDLCO trajectories in a standardized scale by group membership and estimated probabilities of belonging to the fast progressor group. A randomly selected patient's normalized observed pFVC (top) and pDLCO (middle) are shown in black, and her sequentially updated estimated trends (solid lines) and their 95% credible intervals (shades) are shown in blue and pink, given that the person is in the fast progressor or stable group, respectively. The estimated probabilities of belonging to the fast progressor group given available observed data are also sequentially updated after each visit (bottom). The first point shown in the bottom plot (at 0 year since onset) is the probability of being a fast progressor given only baseline information (without any pFVC or pDLCO measurement), that is exp(α01+WiαW1)/{∑ℓ=12α0ℓ+WiαWℓ} where ℓ=1 corresponds to the fast progressor group. The estimated probability of being in the fast progressor group is close to 1 after observing pFVC and pDLCO around 1 year since onset.

Discussion

4

There is substantial clinically relevant variability in disease progression among SSc patients as manifested in both biomarkers and risk of major clinical events. Autoantibodies serve as useful clinical tools for risk stratification, yet even within an autoantibody subgroup, significant heterogeneity remains. There is still a huge unmet need to discover homogeneous subgroups of patients that share similar disease features, underlying mechanisms, complication risks, and therapeutic responses. In our work, we developed a modeling framework to identify latent subgroup(s) of patients who share similar progression patterns of multiple biomarkers.

Our model is designed to quantify risk factors of progression in pulmonary measures and to provide real‐time predictions for individual patients of cluster membership and future trajectories at the point of care. The model assumes that baseline characteristics, including age of onset, sex, race, and cutaneous subtype, affect both the cluster membership and the disease trajectories. Beyond these known factors, the model also assumes that there are unmeasured factors that make some patients stable and others progress more rapidly. The goal of the model is to identify subgroup(s) that correspond to different temporal patterns in the biomarkers and to use the data observed for a given patient for classifying her into the more likely progression pattern.

Among anti‐topo patients, we used a multivariate latent class mixture model to segregate progressors from those who are likely to remain stable. The subgroups are identified by the joint trajectories of the two measures, pFVC and pDLCO, while allowing for cluster‐specific correlation structures between them. This is crucial in our setting, as the two biomarkers are known to be biologically correlated, and accounting for their joint evolution while allowing the correlation structure to vary across latent subgroups enables more accurate characterization of unmeasured patient heterogeneity. To our knowledge, fitting models with this level of flexibility is not feasible in existing likelihood‐based implementations of multivariate growth mixture models. Rather than providing an estimated cluster membership for each measure per patient, we assume that there is one underlying individual‐specific latent class reflected jointly by multiple biomarker trajectories. This assumption leads to more clinically useful and relevant results, particularly because pFVC and pDLCO have a strong correlation. Additionally, employing such a joint modeling strategy results in more statistically efficient inferences. Reference Kim et al. [20] quantified the inefficiency of modeling biomarkers separately as compared to jointly modeling them in a combined model. In the motivating clinical case study of scleroderma trajectories with five biomarkers, including pFVC and pDLCO, they demonstrated that greater efficiency gains in estimating random effects are expected in the case of high across‐measure correlation.

Further, an important feature of our approach is the sequential updating of class membership probabilities and individual‐level predictions as new data accumulate over time. This dynamic, personalized updating is essential for precision medicine applications, where treatment or monitoring decisions depend not only on a patient's baseline characteristics, but also on evolving risk or trajectory profiles. Implementing this coherently within a Bayesian framework allows us to incorporate uncertainty and borrow strength across patients and time points. We observed that the certainty of belonging to the fast progressor or stable group increases as the observations of pFVC and pDLCO accumulate over time. This approach enables us to utilize past information to enhance prediction beyond the estimated model parameters, even for patients whose cluster membership was largely uncertain at baseline. This type of updating is not readily achievable under standard likelihood‐based estimation approaches, which typically assume a fixed dataset and do not naturally support real‐time re‐estimation without retraining the full model.

The future direction of this research is to discover novel biomarkers that can effectively distinguish individuals with rapid disease progression from those with slow or moderate progression. Biomarkers that can differentiate these patient subgroups will provide valuable insight into the etiology of the disease and serve as useful tools for risk assessment. Identifying these biomarkers can yield crucial information that helps medical decision‐making, including determining which patients may potentially receive greater benefits from intensive monitoring and therapeutic strategies, such as combined immunosuppressive treatments. For future research, we plan to investigate the association of rapid progression in pulmonary function with molecular and immunological biomarkers for SSc patients in the early stages of the disease.

Our analysis has several limitations. Because we were interested in studying heterogeneity within a high‐risk subgroup, we focused the study population on individuals who tested positive for anti‐topo antibody. The analysis only included observations of pFVC and pDLCO when both biomarkers were collected simultaneously. Our future work involves expanding the analysis to include all patients in the Johns Hopkins Scleroderma Center Research Registry. Additionally, we will refine the model likelihood to handle missing data more efficiently. Further research can be done to broaden the model's scope by integrating additional types of biomarkers, such as genetic, molecular, and immunological biomarkers, and by developing appropriate variable selection methodology. Lastly, our model does not account for the influence of treatment on the progression of biomarkers over time, which could potentially affect the identification of subgroups. Our model assumes that cluster membership remains constant over time, representing latent patient traits. The probability of a person belonging to each cluster is updated depending on observed longitudinal measurements. In future work, we will address this limitation by defining time‐invariant cluster membership as a function of immunosuppressive medications to account for the distinct temporal patterns in how patients respond to treatment. Additionally, we can continue to adopt the subgroup definition in this proposal and incorporate the treatment effect by assuming that cluster membership is time‐varying.

Conclusion

5

We developed a practical tool that satisfies the clinical need of identifying latent subgroups of SSc patients defined based on the temporal patterns of their pulmonary measures. We proposed a Bayesian framework utilizing MLCMM that can provide a probabilistic estimate of the initial cluster membership for new patients. Our method also allows time‐evolving predictions of future biomarker trajectories and cluster membership based on patients' accumulating longitudinal observations. We implemented our method on data from anti‐topo positive patients in the Johns Hopkins Scleroderma Center Cohort and discovered two subgroups that correspond to the commonly observed patient types known as “fast progressors” and “stable” as referred to by clinicians. Our tool can be flexibly adapted to find distinct progression patterns based on other clinically important longitudinal biomarkers, from which we can discover novel biomarkers that associate with the identified disease progression patterns.

Funding

This work was supported in part by the NIH/NIAMS under Grant Nos. P30AR070254, R01AR073208, and K24AR080217, and in part by the Jerome L. Greene Foundation, Rheumatology Research Foundation, Johns Hopkins inHealth initiative, the Scleroderma Research Foundation, the Nancy and Joachim Bechtle Precision Medicine Fund for Scleroderma, the Manugian Family Scholar, the Donald B. and Dorothy L. Stabler Foundation, the Chresanthe Staurulakis Memorial Fund, and the Sara and Alex Othon Research Fund.

Ethics Statement

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1: sim70450‐sup‐0001‐Supinfo.pdf.

Bibliography20

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1D. Pattanaik , M. Brown , and A. E. Postlethwaite , “Vascular Involvement in Systemic Sclerosis (Scleroderma),” Journal of Inflammation Research 4 (2011): 105–125.22096374 10.2147/JIR.S 18145 PMC 3218751 · doi ↗ · pubmed ↗
2Y. Allanore , R. Simms , O. Distler , et al., “Systemic Sclerosis,” Nature Reviews Disease Primers 1, no. 1 (2015): 1–21.10.1038/nrdp.2015.227189141 · doi ↗ · pubmed ↗
3V. D. Steen , “The Many Faces of Scleroderma,” Rheumatic Disease Clinics of North America 34, no. 1 (2008): 1–15.18329529 10.1016/j.rdc.2007.12.001 · doi ↗ · pubmed ↗
4C. Muangchan , M. Baron , J. Pope , C. S. R. Group , et al., “The 15% Rule in Scleroderma: The Frequency of Severe Organ Complications in Systemic Sclerosis. A Systematic Review,” Journal of Rheumatology 40, no. 9 (2013): 1545–1556.23858045 10.3899/jrheum.121380 · doi ↗ · pubmed ↗
5C. S. Santos , C. M. Morales , C. Á. Castro , and E. D. Álvarez , “Clinical Phenotype in Scleroderma Patients Based on Autoantibodies,” Rheumatology Advances in Practice 7, no. Supplement_1 (2023): i 26–i 33.36968636 10.1093/rap/rkad 010PMC 10036993 · doi ↗ · pubmed ↗
6V. D. Steen , “Autoantibodies in Systemic Sclerosis,” in Seminars in Arthritis and Rheumatism, vol. 35 (Elsevier, 2005), 35–42.16084222 10.1016/j.semarthrit.2005.03.005 · doi ↗ · pubmed ↗
7N. M. Laird and J. H. Ware , “Random‐Effects Models for Longitudinal Data,” Biometrics 38, no. 4 (1982): 963–974.7168798 · pubmed ↗
8B. Muthén and L. K. Muthén , “Integrating Person‐Centered and Variable‐Centered Analyses: Growth Mixture Modeling With Latent Trajectory Classes,” Alcoholism, Clinical and Experimental Research 24, no. 6 (2000): 882–891.10888079 · pubmed ↗