Dynamic Factor Analysis for Sparse and Irregular Longitudinal Data: An Application to Metabolite Measurements in a COVID‐19 Study

Jiachen Cai; Robert J. B. Goudie; Brian D. M. Tom

PMC · DOI:10.1002/sim.70499·March 16, 2026

Dynamic Factor Analysis for Sparse and Irregular Longitudinal Data: An Application to Metabolite Measurements in a COVID‐19 Study

Jiachen Cai, Robert J. B. Goudie, Brian D. M. Tom

PDF

Open Access

TL;DR

This paper introduces a new dynamic factor analysis method to study metabolite data from a COVID-19 study, identifying key biological pathways like the kynurenine pathway and biomarkers such as taurine.

Contribution

The novel contribution is a dynamic factor analysis model with a multi-output Gaussian process prior and a scalable StEM algorithm for sparse and irregular longitudinal data.

Findings

01

The proposed model identifies a kynurenine pathway linked to clinical severity in COVID-19 patients.

02

The biomarker taurine is uncovered as significant in the study.

03

The StEM algorithm is 20 times faster and more accurate than previous methods in simulations.

Abstract

Factor analysis (FA) can be used to identify key biomarkers in biological processes by assuming that latent biological pathways (statistically, “latent factors”) drive the activity of measurable biomarkers (“observed variables”). However, biological pathways often interact, meaning that the classical FA assumption of independence between factors is questionable. Motivated by sparsely and irregularly collected longitudinal measurements of metabolites in a COVID‐19 study, we propose a dynamic factor analysis model that accounts for cross‐correlations between pathways via a multi‐output Gaussian processes (MOGP) prior on the factor trajectories. To mitigate against overfitting caused by sparsity of longitudinal measurements, we introduce a roughness penalty upon MOGP hyperparameters and allow for non‐zero mean functions. We also propose a scalable stochastic expectation maximization (StEM)…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species2

Severe acute respiratory syndrome coronavirus 2(no rank)Homo sapiens(human · species)

Chemicals7

taurine

quinolinic acid kynurenine 3-hydroxykynurenine oxygen MCEM Tryptophan

Diseases7

COVID-19 cancer viral infection dementia death infection FA

Figures10

Click any figure to enlarge with its caption.

Illustration of the kernel convolution framework for MOGP. The star (∗) denotes a convolutional operation and directed arrows indicate direct dependence. t denotes time; a,b are indexes of factor trajectories; τa(t),τ0(t),τb(t) are independent Gaussian white noise processes; ha1(t),ha0(t),hb0(t),hb1(t) are Gaussian kernel functions; ϵa(t),ϵb(t) are residuals; ya(t),yb(t) are the ath and bth factor trajectories, respectively.

Comparison between StEM and MCEM in terms of computational time (Panels A and B) and estimation accuracy (Panel C) under different sample sizes in the simulation study. For each sample size, 100 datasets were simulated. Whiskers of boxplots extend to the smallest and largest values.

Subject‐specific observation times. Each line corresponds to a single subject, and connects subject‐specific observed times.

Estimated subject‐specific posterior factor trajectories (truncated at y=3 for clarity; see Supporting Information: Figure S2 for the full version), using the exact method with q=50. The number of latent factors is pre‐specified as k=2. For people who did not survive for the full 7‐week follow‐up, we plotted their trajectories only before the death date; otherwise trajectories were plotted within a 7‐week window after the onset of symptoms.

Posterior predictive trajectory distributions of the 9th biomarker for 20 subjects (randomly selected), under penalized likelihood. Solid and dotted lines represent point estimate and 95% credible intervals, respectively. Red points denote observed measurements.

Estimated subject‐specific posterior factor trajectories using the comparator method [10] with q=8 (truncated at y=3 for clarity). The number of latent factors is pre‐specified as k=2. For people who did not survive for the full 49‐day follow‐up, we plotted their trajectories only before the death date; otherwise trajectories were plotted within a 7‐week window after the onset of symptoms.

Posterior predictive trajectory distribution of the 9th biomarker for 20 subjects (randomly selected), by applying the model in Cai et al. [10] to the motivating example. Solid and dotted lines represent point estimate and 95% credible intervals, respectively. Red points denote observed measurements.

Estimated subject‐specific posterior factor trajectories obtained using the MEFISO model, with q=8 time points at the reference grid tref={0,7,14,…,49}. Each curve represents an individual's factor trajectory, obtained by connecting their estimated factor values across the reference grid.

Top 10 metabolites on Factor 2 based on absolute factor loadings. The sign of each loading is indicated, with “–” representing a negative loading and “+” representing a positive loading.

Pathway analysis results by the online bioinformatics platform MetaboAnalyst 5.0. The top left panel shows an overview of pathway analysis, with y‐axis denoting p value of pathway enrichment analysis and x‐axis representing impact of the selected metabolites on the pathway. The top right panel shows details of the enriched pathway: colored (i.e., not grey) elements represent metabolites in this pathway that are uploaded as “background”, with red and blue elements denoting “selected” and “unselected” metabolites by our method, respectively. The bottom panel supplies numerical details for the top figures.

Funding1

—United Kingdom Medical Research Council

Keywords

COVID‐19dynamic factor analysislongitudinal high‐dimensional datamulti‐output Gaussian processsparse datastochastic expectation maximization

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMetabolomics and Mass Spectrometry Studies · Statistical Methods and Inference

Full text

Introduction

1

A single complex disease or multiple‐related diseases often represent a dense spectrum of innumerable, overlapping phenotypes resulting from the contribution of a (modest) number of important pathological mechanisms. In the field of precision medicine, the increasing availability of high‐dimensional and longitudinal molecular profiling provides opportunities for improving understanding of these mechanisms underlying diseases, which in turn will facilitate more accurate diagnosis and prognosis, and more targeted treatment.

Specifically, this work is motivated by longitudinal measurements of p=35 metabolites (the molecular biomarkers) in a COVID‐19 study [1]. In this study, blood samples of n=101 SARS‐CoV‐2 polymerase chain reaction (PCR) positive patients with clinical symptoms were collected at least twice during the 7‐week window after symptom onset. These measurements are sparse for each patient, with the subject‐specific number of time points ranging between 2 and 5, and approximately 50% of patients having only 2 observations. Moreover, the observation times are irregularly distributed, with 50 unique time points in total. We aim to identify the key metabolites involved in the biological response to virus infection.

To address this question, Ruffieux et al. [1] adopted univariate linear mixed models that take each metabolite as the dependent variable separately, and identified a metabolic signature of COVID‐19 patients, characterized by increased expression of several intermediates (3‐hydroxykynurenine, kynurenine, quinolinic acid) from the kynurenine pathway and decreased expression of the upstream amino acid tryptophan. However, biomarkers often function collaboratively during biological processes underlying complex diseases such as COVID‐19. Therefore, instead of characterizing the disease at the individual biomarker level, it may be more informative to characterize the disease at the level of pathways [2], a functional group of biomarkers that work together. Given the high dimensionality of these metabolic data, it is essential to adopt a method that can uncover the underlying biological structure in a compact and interpretable form. Factor analysis is particularly well suited for this purpose, as it identifies latent factors that capture shared variation across biomarkers, which can correspond to biological pathways. Factor analysis approach is also consistent with biological expectations that there is a modest number of underlying pathological drivers of disease, and that each pathway involves only a small subset of metabolites. The examples of the use of FA to discover latent biological structure abound in biomedical research, including in chronic diseases such as cancer, dementia, and acute viral infections [3, 4, 5].

In the context of longitudinal data, dynamic factor analysis approaches have been developed to capture time‐varying pathway‐level responses. For example, Chen et al. [3] used spline functions to model latent factor trajectories. Closely related, Nyamundanda et al. [6] developed a dynamic probabilistic principal components model. However, both approaches were designed for densely sampled data and their performance under sparse measurement settings remains unclear. To address sparse and irregular longitudinal data, Yao et al. [7] extended classical functional principal component analysis (FPCA) and found that their method performed better than traditional FPCA even in dense and regular designs. More recently, Velten et al. [8] modeled each factor trajectory using a Gaussian process, illustrating its utility in sparse longitudinal microbiome data. Nevertheless, both approaches assume independence among latent constructs: Yao et al. [7] assume orthogonality of eigenfunctions within the FPCA framework, and Velten et al. [8] model each factor trajectory separately without considering potential cross‐correlations. This assumption of independence would translate biologically to assuming that different biological pathways work independently of each other without any interactions. This may not accurately reflect existing biological knowledge, as pathways often work collectively to achieve a complex biological function [9], suggesting the need to allow for potential cross‐correlations between factor trajectories. Recently, Cai et al. [10] developed a method to relax this assumption. They demonstrated that, by the incorporation of cross‐correlations between factors, performance gains were realized in recovering the shape of the underlying pathways' trajectories, uncovering and better interpreting relationships between biomarkers and pathways, and predicting the future trajectories of biomarkers. However, their method is not entirely applicable to our motivating example here. First, the longitudinal biomarkers were measured frequently in the application considered in Cai et al. [10], with the majority of the individuals having measurements at 16 regular time points. In contrast, in many biomedical applications, including the motivating example here, measurements/visits are sparse and irregular across subjects. It is unclear how Cai et al.'s method performs in this setting. Second, Cai et al.'s application has a small sample size, with n=17; whereas in many applications a larger sample size is anticipated and so an approach that scales to larger sample sizes is necessary.

To bridge the gap, we develop a dynamic factor analysis model that not only permits correlations between factors but also can handle irregular and sparse longitudinal biomarkers in larger sample size studies. Specifically, the proposed model maps the observed biomarker expression into the latent pathway‐level expression via a Bayesian sparse factor analysis (BSFA) and models the potentially correlated pathway trajectories using multi‐output Gaussian processes (MOGP). There are two major methodological developments in this work compared with Cai et al. [10] First, to mitigate against the issue of over‐fitting due to sparsely collected data, we allow for non‐zero mean functions on the MOGP and penalize the hyperparameters that control smoothness. Second, for estimating MOGP hyperparameters, we develop a stochastic expectation maximization (StEM) [11] algorithm that we will show scales better than the Monte Carlo expectation maximization (MCEM) algorithm used in the previous work. StEM also naturally handles the irregularity of longitudinal measurements across subjects, as it is based on sampling methods. We have developed an R package DFA4SIL (dynamic factor analysis for sparse and irregular longitudinal data) that implements the proposed method.

The remaining paper is structured as follows. Section 2 introduces our proposed model that targets sparse and irregular longitudinal measurements of biomarkers. Section 3 discusses inference, which comprises of a StEM algorithm for estimating MOGP hyperparameters and a Gibbs sampler for other variables in the model. In Section 4, we describe our simulation study that assesses the performance of StEM under different sample sizes, with comparison to the MCEM. Section 5 applies our proposal to the motivating COVID‐19 data, and discusses biological implications of results.

Model

2

Uncovering the Sparse Latent Structure via BSFA

2.1

FA connects observed data to latent underlying factors via factor loadings. Mathematically, let xijg denote the gth biomarker expression of the ith subject at the jth time point, yija denote the (latent) ath pathway expression of the ith subject at the jth time point, and lga represent the loading of the gth biomarker on the ath pathway. Then the FA model corresponds to

[eqn]

where μig is the intercept term for the gth biomarker of the ith individual (hereafter the “subject‐biomarker mean”), eijg is the residual error assumed to follow a normal distribution N(0,ϕg2); i=1,…,n; j=1,…,qi; g=1,…,p; and a=1,…,k, where n, qi, p, k denote the number of subjects, subject‐specific number of observed time points, number of biomarkers and latent factors, respectively. Note that our method allows for subject‐specific observed times ti=∪j=1qitij for the ith subject, which is a key feature of the motivating example that our method aims to accommodate. Throughout this work, the number of factors k is assumed to be pre‐specified based on previous knowledge. We recommend, in practice, comparing results across several ks to see if different insights would emerge and then choosing the most appropriate k.

The factor loading lga characterizes the relationship between the gth biomarker and the ath pathway. A larger absolute value suggests the biomarker is more important in relation to the pathway, whereas lga=0 indicates no association at all. We assume only a small proportion of biomarkers are involved in each pathway, an assumption that is often supported by previous biological knowledge. This assumption implies that for any pathway, say the ath, most of the lga,g=1,…,p are zeroes (i.e., mathematical sparsity). Under the Bayesian framework, we desire a prior that promotes shrinkage of lga to exactly 0, so that we do not need to artificially set a threshold for determining whether to include the gth biomarker as a member of the ath pathway. Out of this consideration, we choose the point‐mass mixture prior [4] for lga, which can be expressed as follows. First, we decompose lga as a product of a binary variable Zga and a continuous variable Aga, then assign a Bernoulli‐Beta prior for Zga and a Normal‐Inverse‐Gamma prior for Aga:

[eqn]

where g=1,…,p, a=1,…,k; and c0,d0,c1,d1 are pre‐specified positive constants. Zga indicates the inclusion status of the gth biomarker on the ath pathway: Zga=1 means the gth biomarker relates to the ath pathway, Zga=0 means it does not. Prior belief about sparsity can be represented via the hyperparameters c0 and d0, which control the proportion of biomarkers πa associated with the ath pathway. Finally, we complete the model specification by assigning a Normal‐Inverse‐Gamma before subject‐biomarker means μig and variances ϕg in Equation (1):

[eqn]

where μg is fixed to the mean of the gth biomarker expression across all time points of all subjects, and c2,d2,c3,d3 are pre‐specified positive constants.

Modelling Correlated Factor Trajectories Using MOGP

2.2

We model the latent pathway trajectories using Gaussian processes (GP), which have been widely used for analyzing functional data [12]. More specifically, to reflect our prior belief that multiple pathway (factor) trajectories may be correlated, we adopt multi‐output Gaussian processes (MOGP). MOGP provide improved prediction relative to assuming independent GP across outputs [13, 14, 15, 16, 17] when the true data generating process induces correlation between outputs. A major challenge when constructing MOGP is to appropriately define cross‐covariance functions that imply a positive definite covariance matrix. We adopt the commonly used framework of convolution processes (CP) [18]. We introduce the CP framework in our context in Section 2.2.1 and propose in Section 2.2.2 methods to avoid overfitting with sparse measurements under CP.

Convolution Process for Constructing the MOGP Prior on Factor Trajectories

2.2.1

In the CP framework, correlated processes are generated by introducing common “base processes”. To introduce the idea, consider the special case of two stochastic processes ya(t) and yb(t) as an example. In this case, the CP construction, as illustrated in Figure 1, is,

[eqn]

where ϵa(t),ϵb(t) are residual errors drawn from a N(0,ψ2) distribution, and ηa(t),ηb(t),ξa(t),ξb(t) are processes constructed as follows.

Illustration of the kernel convolution framework for MOGP. The star (∗) denotes a convolutional operation and directed arrows indicate direct dependence. t denotes time; a,b are indexes of factor trajectories; τa(t),τ0(t),τb(t) are independent Gaussian white noise processes; ha1(t),ha0(t),hb0(t),hb1(t) are Gaussian kernel functions; ϵa(t),ϵb(t) are residuals; ya(t),yb(t) are the ath and bth factor trajectories, respectively.

First, three independent, zero‐mean base processes τ0(t), τa(t) and τb(t) are introduced, which are all Gaussian white noise processes. The first process τ0(t) is shared by both ya(t) and yb(t), thereby inducing dependence between them; whereas τa(t) and τb(t) are specific to ya(t) and yb(t), respectively, and are responsible for capturing the unique aspects of each process. Second, Gaussian kernel functions ha0(t),ha1(t),hb0(t),hb1(t) are applied to convolve the base processes: with h−0(t) applied to the shared process τ0(t) and h−1(t) to the output‐specific processes τa(t) and τb(t):

[eqn]

where the convolution operator ∗ is defined as h(t)∗τ(t)=∫−∞∞h(t−s)τ(s)ds. All kernel functions h(t) considered here take the form h(t)=vexp−12Bt2, where v and B are parameters that are specific to each kernel function; B>0. Note the model in Equation (3) can be easily extended to more than two processes. A simple way is as follows [12],

[eqn]

where k is the number of processes; ηa(t), ξa(t), and ϵa(t) are defined in a similar way as above.

In our context, we will associate each pathway trajectory with a component of the MOGP. That is, we assume a MOGP prior for the vector of k pathway trajectories of the ith subject,

[eqn]

where yia(t) denotes the ath factor trajectory of the ith subject. The mean function of MOGP is temporarily assumed to be 0 following common practice, though we will relax this assumption later in Section 2.2.2 as one of the measures to handle the sparse measurements. The kernel function g(Θ) is dependent upon a set of MOGP hyperparameters Θ={v,B}, the specific form of which will be discussed in Section 2.2.2. Note that we have assumed a common prior shared by all subjects, as we do not wish to differentiate subjects a priori. Nevertheless, the posterior pathway trajectories (after observing subject‐specific biomarker expression) are subject‐specific, which we anticipate will reveal the distinction between subjects with different clinical outcomes.

The prior in Equation (4) induces a multivariate normal (MVN) prior distribution for the k latent factors at a finite set of observed time points. Specifically, for the ith subject with qi measurements, the kqi‐dimensional vector has the distribution

[eqn]

where Yi=(yi1,…,yik)T∈ℝk×qi is the matrix of pathway expression of the ith individual, with yia=(yi1a,…,yiqia)T denoting the ath factor's expression across observation times ti. Below, we will use the notation vec(YiT) to denote the column vector (yi11,…,yiqi1,……,yi1k,…,yiqik)T that is a vectorized form of the matrix YiT. The matrix ∑Yi(Θ,ti)∈ℝkqi×kqi is the covariance matrix of vec(YiT), which is dependent on the kernel function g(Θ) and observation times ti.

Handling sparse measurements using penalized MOGP with non‐zero mean functions

2.2.2

While GP work well with a moderate or large amount of data, in our setting with sparse data, GP (Equation 4) are prone to overfitting [19, 20], resulting in unreasonably “wiggly” fitted curves. To address this issue, regularization is necessary: specifically, we will constrain the hyperparameters that control smoothness and avoid assuming a zero‐mean function prior.

Under the CP framework introduced in Section 2.2.1, the auto covariance function between two time points dt distance apart within a single process a, denoted as CaaY(t+dt,t), takes the following form:

[eqn]

where δt=I(dt=0), and Ba0 and Ba1 are positive parameters. This expression for the auto covariance function indicates that va0 and va1 control the amplitudes of MOGP, while Ba0 and Ba1 determine smoothness, as they regulate how quickly the correlation decreases as the absolute time difference |dt| increases. Supporting Information: Figure S8 illustrates how the smoothness changes with these hyperparameters. Since va0 and va1 have similar behavior, and Ba0 and Ba1 have similar behavior, we present results under varying va0 and Ba0 only. It is clear that larger values of Ba0 and Ba1 correspond to rougher curves. This observation motivates the roughness penalty that we will introduce in Section 3.1.

In addition to the aforementioned constraints on the covariance function, we also adopt non‐zero mean functions for MOGP to handle the sparse measurements. GP with zero mean functions can approximate arbitrary continuous functions, if given enough data, and so zero‐means are often used in practice. However, in our setting with few observations, GP with zero mean functions predict zeroes in regions with little data, which is problematic [21]. The importance of the prior mean function when data are sparse has been highlighted in previous research as potentially affecting both the prediction performance [22] and the estimation of hyperparameters of the covariance function [23]. Specifically, we assume a constant mean function ca for the ath pathway, a=1,…,k, and will estimate ca from the data, along with the hyperparameters Θ that determine the covariance function.

Proposed Model

2.3

Finally, we present the proposed model in its matrix form. Let Xi=(xi1,…,xip)T∈ℝp×qi be the matrix of biomarker expressions at qi observation times for the ith individual, with xig=(xi1g,…,xiqig)T denoting the gth biomarker's longitudinal measurements. Correspondingly, let Mi=(μi1,…,μip)T∈ℝp×qi be the matrix of subject‐biomarker means, with μig=μig1i, where 1i is a qi‐dimensional column vector consisting of the scalar 1; Ci=(c1iT,…,ckiT)T∈ℝkqi is the mean vector for vec(YiT), with cai=ca1i. Furthermore, let L={lga}g=1,…,p,a=1,…,k∈ℝp×k be the matrix of factor loadings, A={Aga}g=1,…,p;a=1,…,k∈ℝp×k be the matrix of regression coefficients and Z={Zga}g=1,…,p;a=1,…,k∈ℝp×k be the matrix of inclusion indicators,

[eqn]

where Ei is the residual matrix, and ∘ denotes element‐wise matrix multiplication.

Inference

3

To achieve a balance between the computational cost and statistical performance, we use an empirical Bayes (EB) approach to obtain point estimates of MOGP hyperparameters Θ and C, where C={ca}a=1,…,k. Then we implement a Gibbs sampler for other parameters under the fixed estimate of Θ and C in order to quantify the uncertainty of estimating quantities of major interest, such as latent factor loadings L and factor trajectories Yi.

Specifically, we develop a StEM algorithm [11] in Section 3.1 to calculate the maximum penalized likelihood estimate (MPLE) of Θ and C (denoted as Θ^MPLE and C^MPLE, respectively), which maximizes a penalized likelihood that penalizes MOGP roughness. Note that this added methodological development extends the method of Cai et al. [10], which used an MCEM algorithm to maximize the likelihood function (without roughness penalty). In Section 3.2, we describe the Gibbs sampler for all parameters in the model except for Θ and C, represented as Ω={M,Y,A,Z,ρ,π,σ,ϕ}, where M={Mi}i=1,…,n, Y={Yi}i=1,…,n, ρ={ρa2}a=1,…,k, π={πa}a=1,…,k, σ={σg2}g=1,…,p, ϕ={ϕg2}g=1,…,p. This sampler serves two purposes. First, within the StEM algorithm, it simulates samples of Ω in the stochastic step (S‐step) to update the estimates of Θ and C in the maximization step (M‐step). Second, after the StEM algorithm, it generates samples of Ω for posterior inference.

Estimating MOGP Hyperparameters by Maximizing the Penalized Likelihood

3.1

To calculate Θ^MPLE and C^MPLE, we maximize the following objective function,

[eqn]

where f(X|Θ,C) is the likelihood of observing all biomarker expressions X={Xi}i=1,…,n, λ is the parameter that tunes the extent to which roughness is penalized. The likelihood f(X|Θ,C) involves high‐dimensional integration taking the following form,

[eqn]

which cannot be evaluated analytically.

Stochastic Expectation Maximization (StEM)

3.1.1

To deal with the integration in Equation (9), we consider expectation maximization (EM)‐type algorithms; by viewing Ω as hidden variables and then iteratively constructing a sequence of Θ^(l) and C^(l) that converges to the truth, l=1,2,3,…. As a closed‐form of the expectation of the complete data log‐likelihood is unavailable due to the complexity of the model, we resort to a variant of EM based on simulations [24]. The StEM algorithm is particularly suitable for our use here, as it only requires a single random simulation/sample at each iteration [25] and therefore is scalable to the large sample size of the motivating example. StEM has also been found to be less sensitive to initial values than alternative algorithms [26, 27]. Specifically, the risk of getting stuck in local optima is reduced.

The sequence of Θ^(l) generated from StEM constitutes a homogeneous Markov chain (the same applies to the sequence of C^(l)), which means that its convergence is with regard to the average of the last m iterated values of the parameter [28]. Therefore, it can be treated like Markov chain Monte Carlo (MCMC) samples [29]. For instance, the burn‐in size is essential, and techniques related to assessing MCMC convergence can also be used. Specifically, StEM comprises two steps. In the S‐step of the lth iteration, one random sample of Ω (denoted as Ω(l−1)) is generated from the conditional distribution f(Ω|X,Θ^(l−1),C^(l−1)) to constitute the complete data (X,Ω(l−1)). Then in the M‐step, the update Θ^(l) and C^(l) can be found by maximizing the penalized log‐likelihood of observing the complete data,

[eqn]

Applying the factorization in Equations (9) and (10) reveals that the only term in lnf(X,Ω(l−1)|Θ,C) that depends on Θ is f(Y(l−1)|Θ,C), which immediately reduces the objective function to be maximized in Equation (10) to

[eqn]

where f(Y(l−1)|Θ,C) is the product of MVN distributions as a result of the MOGP prior on factor trajectories of all subjects (Equation 7). Expanding this term leads to

[eqn]

which is then maximized in the M‐step.

Adaptations to Facilitate StEM Implementation via an Existing R Package

3.1.2

To find the estimate of Θ and C that can maximize Equation (11), we use the R package “GPFDA” (short for Gaussian processes for functional data analysis) [30], which return the MLEs of hyperparameters of the MOGP modelled via the CP framework. GPFDA uses the efficient optimization algorithm L‐BFGS, an iterative method based on second‐order derivatives. When calculating the MLEs for the longitudinal vec(YiT)(l−1) of a single individual, the computational complexity of GPFDA is O(k2qi2). This complexity is drastically reduced compared with the naive approach of directly inverting the covariance matrix ∑Yi(Θ,ti) via a Cholesky decomposition, where the computational complexity would be O(k3qi3). In addition, GPFDA allows for the specification of different types of mean functions, such as the output‐specific mean functions that we assume in our model.

However, there are two major limitations that prevent us from directly using the GPFDA package. First, the objective function used by GPFDA is the observed log‐likelihood function and it does not allow specification of a penalty term. Second, GPFDA assumes all input processes are measured at common time points (rather than subject‐specific observed times ti), which is neither the case in our motivating example nor in many other practical settings.

To overcome these limitations, we first modified the GPFDA package so that users can specify the penalized objective function in Equation (11). Second, we introduce auxiliary variables to augment the parameter space to enable the use of GPFDA in the setting of irregular longitudinal data. Let t=∪i=1nti denote the vector of all unique observation times across all subjects and denote its length as q, and let Yi,aug=(yi1,…,yik)T∈ℝk×q denote an augmented version of the matrix Yi∈ℝk×qi, where yia=(yi1a,…,yiqa)T denotes the ath factor's expression across all observation times t. In this augmented space, the unknown parameters are Ωaug={M,Yaug,A,Z,ρ,π,σ,ϕ}, where Yaug={Yi,aug}i=1,…,n. By replacing Ω with Ωaug in Equations ((8), (9), (10)), the objective function in Equation (11) changes accordingly to,

[eqn]

where Yi,aug(l−1)∈ℝkq denotes a sample of the augmented matrix Yi,aug, generated by sampling from the appropriate marginal of f(Ωaug|X,Θ^(l−1)). Such samples can be conveniently obtained using the Gibbs sampler that we will introduce in Section 3.2. The quantities Caug and ∑Yaug(Θ,t) are, respectively, the corresponding mean vector and the covariance matrix that are common to all subjects. The modified objective function in Equation (12) after data augmentation enables the estimation of the MLE using GPFDA despite the package permitting only processes measured at common time points. Finally, we point out that the computational complexity of finding the maximizer for Equation (12) is O(nk2q2), constant across all iterations.

Choice of the Tuning Parameter λ

3.1.3

To choose the value for the tuning parameter λ in the penalized log‐likelihood function, we use a cross‐validation approach to find the value with the smallest averaged mean absolute error (MAE) in predicting biomarker expressions. Specifically, all subjects are first divided randomly into l approximately equal groups. For each candidate value of λ, the following steps are iterated over each of the l groups,

Use the group corresponding to the current iteration as the test data, and the remaining (l−1) groups as the training data;
Fit the model with the penalized log‐likelihood in Equation (8) to the training data to calculate Θ^MPLE;
Under Θ^MPLE predict for biomarker expressions in the test data, and compare predictions with the ground truth to calculate MAE;

Finally, the average MAE across all l groups is calculated for the candidate value under consideration.

A Gibbs Sampler for Other Variables in the Model

3.2

Under a fixed estimate of MOGP hyperparameters (Θ^ and C^ denoting estimates of hyperparameters in the covariance function and mean function, respectively), the conditional distribution f(Ωaug|X,Θ^,C^) does not have a closed‐form. To acquire samples from this distribution, we developed an efficient Gibbs sampler, which simulates a Markov chain whose stationary distribution is f(Ωaug|X,Θ^,C^). Note that to obtain one sample from f(Ωaug|X,Θ^(l−1),C^(l−1))) required by the M‐step of StEM for maximization, we run a long chain (the number of iterations S needs to be pre‐specified), discard samples in the burn‐in period, then take one random sample from the remaining iterations and input it to GPFDA for acquiring Θ^(l) and C^(l). It is worthwhile to run a long chain to generate the single sample required in the M‐step because the Gibbs sampler run in the S‐step is not time‐consuming compared with the M‐step where Θ^(l) and C^(l) are calculated. Therefore, running a long chain does not notably affect the computation time.

Full conditionals for all variables in Ωaug are available analytically, and the details are available in Supporting Information: Section A.1. Here, we highlight the updating of the augmented variable Yaug. When updating vec(Yi,augT), the vectorized form of the ith individual's factor scores at common times t, we adopt a block Gibbs sampler. Specifically, we factorize f(vec(Yi,augT)|X,Θ^,C^,Ωaug∖vec(Yi,augT)), according to the partition vec(Yi,augT)=(vec(YiT),vec(Yi,addT)), where Ωaug∖vec(Yi,augT) denotes the remaining parameters excluding vec(Yi,augT), and vec(Yi,addT) represents the newly added variables. The first factor f(vec(YiT)|X,Θ^,C^,Ωaug∖vec(Yi,augT)) follows a MVN distribution dependent on observed biomarker expression, as can be seen from the model in Equation (7). The second factor f(vec(Yi,addT)|vec(YiT),Θ^,C^) also follows a MVN distribution according to standard properties of the MOGP model. Our sampler samples from these two factors in turn.

Handling Non‐Identifiability

3.3

A common challenge related to factor analysis is non‐identifiability. In the specific model we propose here, there are two types of non‐indentifiability. First, the covariance matrix for latent factors is not identifiable or indeterminate (see Supporting Information: Section A.2 for a detailed discussion). To address this non‐identifiability, we constrain the main diagonal elements of the covariance matrix to be 1 (i.e., the variance of each factor at each time point is set to 1). Second, under the above constraint, factor loadings L and factor scores vec(YiT) are still not identifiable due to potential sign changes or permutations (we will refer to them together as signed‐permutation hereafter). “Sign change” means that the likelihood remains the same if the sign of a factor changes (i.e., from positive to negative, and vice versa). “Permutation” (also known as “label switching” in the literature) means that if we shuffle the indices of the factors, the likelihood will not change. To address the signed‐permutation associated with L and vec(YiT), we post‐align samples using the R package “factor.switch” [31].

Simulation Study

4

To investigate the performance of the proposed StEM algorithm under different sample sizes, we conducted a simulation study. The data generation mechanism is described in Section 4.1. We compare the results under StEM with those under an alternative algorithm, Monte Carlo expectation algorithm (MCEM), which was used by Cai et al. [10]. Implementation details and results are described in Section 4.2.

Data Generation Mechanism

4.1

We followed one data generation mechanism considered in Cai et al. [10], which mimics longitudinal gene expression data obtained from a human challenge study. We varied the sample size, taking values ranging from n=10 to n=250. For each sample size, we simulated 100 datasets according to the setting described below.

The number of observed time points was fixed at qi=8 for all individuals, the number of true latent factors was k=4 and the number of biomarkers was p=100. The mean value for each factor score yija was fixed to 0, and the covariance matrix ∑Y encodes non‐zero cross‐correlations between factors (i.e., factors are truly correlated). The exact form of the cross‐correlation matrix is found in Supporting Information: Figure S8. Each factor was expected to regulate 10% of all biomarkers: we set hyperparameters c0=0.1·p and d0=0.9·p, leading to 𝔼πa=0.1 (according to Equation (2)). If a biomarker was regulated by an underlying factor, then the corresponding factor loading was generated from a normal distribution N(4,12); otherwise the factor loading was set to 0. Each subject‐biomarker mean μig was generated from N(μg,σg2), where μg ranged between 4 and 16 and σg=0.5. Finally, observed biomarkers were generated according to Equation (1), where ϕg=0.5.

Implementation Details and Results

4.2

When implementing the StEM algorithm, we ran 200 iterations (we monitored the sequence of iterated estimates via traceplots and there were no apparent non‐convergence concerns). For the existing MCEM algorithm to which we compare our StEM algorithm, we used the R package DGP4LCF [32]. MCEM stops once the number of increasing the size of Markov chain Monte Carlo samples (as implemented in MCEM by Cai et al. [10]) has reached the pre‐defined limit w, and the default value in the package is w=5; to fully explore its performance, we perform separate runs of MCEM with w set to 5, 8 and 10. To assess the statistical accuracy in estimating MOGP hyperparameters Θ, we calculate the mean absolute difference (MAD) between estimated cross‐correlations constructed by Θ^MPLE and the ground truth. Note that the (concurrent or lag 0) cross‐correlations between factors are constant across time, under the CP framework introduced in Section 2.2.1. Specifically, the cross‐correlation coefficient between factor a and factor b (denoted as ρab) is ρab=CabYCaaYCbbY, where CaaY and CbbY are variances of factor a and b, respectively. Specific forms can be found in Equation (6). The covariance between factors a and b is also constant across time, taking the form of CabY=va0vb0(2π)12Ba0+Bb0.

Under Θ^MPLE, we ran the Gibbs sampler three times with different starting values. We ran each chain for 100,000 iterations, with a 20% burn‐in proportion and retained only every 200th iteration. Due to the identifiability issue caused by sign‐permutation (discussed in Section 3.3), we post‐aligned samples within and across chains using R package “factor.switch” [31]. To enable the calculation of MAD, the final estimate also needs to be aligned with the ground truth. To do this, we developed an automatic algorithm to determine the order and sign of final estimated factors. Specifically, the total number of possible signed permutations is 2kk! given k factors; we evaluated the difference between the final estimate and the truth under each possible combination, and ultimately chose the combination that had the smallest MAD (in order to maximize the similarity of all estimates to the ground truth). In addition, we recorded the computational time for each run. We limited computation to a maximum of 36 h for each simulation.

Figure 2 displays results. StEM is much faster than MCEM, regardless of the value of w, under all sample sizes (Panels A and B). On average (across sample sizes), StEM is 20 times faster than MCEM, and this computational gain is more obvious as the sample size increases. With w=10, MCEM never returned results within 36 h when n≥50; with w=8, MCEM never returned results within 36 h when n≥150. Therefore, corresponding results were not displayed, and when comparing the statistical performance of MCEM with StEM, we focus on the setting w=5 for MCEM (Panel C). Overall, estimation performance of StEM generally improves as the sample size n increases, and the median MAD is always better than that of MCEM regardless of n. Across all sample sizes, the median MAD under StEM is at least 30% lower than MCEM. In addition, the variability of MAD (across 100 simulated data) using MCEM is often higher compared with StEM (i.e., StEM has more stable results), which may be due to StEM being less likely to get stuck in local optima, as discussed in Section 3.1.1. These results confirm the added value of the StEM development when applying the proposed dynamic factor analysis to studies with larger sample sizes, such as the motivating COVID‐19 example.

Comparison between StEM and MCEM in terms of computational time (Panels A and B) and estimation accuracy (Panel C) under different sample sizes in the simulation study. For each sample size, 100 datasets were simulated. Whiskers of boxplots extend to the smallest and largest values.

Data Application

5

Study Description

5.1

SARS‐CoV‐2 PCR‐positive patients were recruited at Cambridge University Hospitals between 31st March 2020 and 7th August 2020, and blood samples were collected at and following enrollment to quantify biomarker expressions. Day 0 is defined as the date of symptom onset, and our analysis focused on the 7‐week window after symptom onset. Figure 3 visualizes subject‐specific observed times when the metabolite samples were collected during this period. The unique times across individuals are t={0,1,2,…,49}, and the total number of unique times is q=50. A tabular summary of the number of observations versus the number of subjects is provided in Supporting Information: Table S4.

Subject‐specific observation times. Each line corresponds to a single subject, and connects subject‐specific observed times.

In total, there are n=101 patients in the study. Each patient was assigned a clinical label (1–4) denoting their peak severity, with a higher number corresponding to a more severe state. Meanings of labels are defined in Ruffieux et al. [1] as follows (with corresponding sample size in parentheses): 1, mild symptomatic (31); 2, hospitalized without supplemental oxygen (12); 3, hospitalized with supplemental oxygen (19); and 4, hospitalized with assisted ventilation (39). Among those labelled severity level 4, 4 patients did not survive for the full 7‐week window; the exact dates of death are days 16, 20, 44, and 47 after symptom onset, respectively.

We apply our proposed dynamic factor analysis approach to the longitudinal (but sparse and irregular) measurements of the 35 metabolites recorded in the study. As very few patients (4%) died during the study period, we chose to not explicitly model the survival aspect of these data here. We preprocessed the metabolites in the same way as Ruffieux et al. [1]. Our aim is to discover biological mechanisms and their possible interplay underlying the biological heterogeneity in patient's response to COVID‐19 viral infection. Our proposed analysis strategy is described in the next section.

Analysis Strategy

5.2

Ageing and metabolism are inextricably linked, and many changes in metabolites may be related to the ageing process. For example, Singh et al. [33] claim that the abundance of the metabolite taurine (measured in this study) decreases during ageing. To characterize the biological heterogeneity in metabolite dynamics not due to exogenous age but to the response to COVID‐19 infection, we regress out age. More precisely, we perform metabolite‐specific linear regression, x˜ijg=β0g+β1gai+ϵijg, where x˜ijg corresponds to the subject and time‐specific measurement of the gth metabolite, ai denotes subject‐specific age, and ϵijg is the residual error assumed to follow a normal distribution N(0,σg2); i=1,…,n and j=1,…,qi. We use the measurements from these age‐regressed out metabolites (i.e., the residuals) as the observed variables, xijg, in our dynamic factor analysis model; that is, xijg=x˜ijg−x^ijg, where x^ijg is the fitted value of the linear regression.

Our primary analysis corresponds to the fitting of our proposed dynamic factor model for sparse and irregular data with q=50 (corresponding to the set of unique times t={0,1,2,…,49}) and for several different k, the number of latent factors, from 2 to 5. The value for the tuning parameter, λ was obtained through cross‐validation. We consider values of lnλ between −4 and 4, with step‐length 0.5; resulting in 17 candidate values of λ range from 0.02 to 54.60.

We compare the results from our primary analysis with the results from two comparator methods. The first comparator method is the model of Cai et al. [10], which assumes zero‐mean functions for MOGP, without any penalization of roughness and implements MCEM for inference on the MOGP hyperparameters. However, for q=50, the MCEM implementation failed to return results within 36 h. Therefore to allow some level of comparison, we coarsened the set of 50 unique times t into a reference/canonical grid, tref={0,7,14,…,49}, of q=8 unique time points that could be used within the MCEM algorithm. This grid choice was informed by the actual subject‐specific observation times, as displayed in Figure 3. Moreover, it was motivated by the COVID‐19 study design, where in‐patients were to be sampled at enrollment then approximately weekly up to 4 weeks from enrollment and thereafter every 2 weeks up to 12 weeks. Out‐patients were sampled at enrollment and then approximately 2 and 4 weeks after enrollment. The mapping of the observed times to the times in the reference grid tref, where adjacent reference times are 1‐week apart, resulted in an approximately constant patient‐specific shift across patient's observation times (as shown in Supporting Information: Figure S1). As the GP we adopt are stationary, their covariance functions will only depend on the time difference rather than specific times. Therefore given this knowledge and that the mapping results in approximately constant patient‐specific shifts, there will be minimal impact of the mapping on the estimation of the GP hyperparameters. The specific details of how the mapping to the reference grid was accomplished are found in Supporting Information: Section B.1. Additionally, we compared the metabolites selected by our approach with those identified in Ruffieux et al. [1]. The second comparator method is the model of Velten et al. [8], which provides a package named MEFISTO for implementation.

Statistical Results

5.3

Primary Results Under Our Proposed Method

5.3.1

The primary results using our proposed method with q=50 are presented below. Regardless of the choice of k, we consistently identify one factor that can differentiate patients with different clinical severities. Therefore, we chose the parsimonious model with k=2 as our final model and display results under k=2 hereafter. Supporting Information: Figures S6 and S7 display results under k=3 as an example of other specifications.

Supporting Information: Figure S4 indicates that the optimum of λ (denoted as λopt) is around 3, as this value leads to the smallest averaged MAE under cross‐validation. Under λopt, the magnitude (i.e., absolute value) of the estimated cross‐correlation between factors 1 and 2 is 0.003, almost negligible. Figure 4 displays estimated subject‐specific factor trajectories. Visually, it appears that the second factor trajectory is able to distinguish patients with the mildest symptoms (i.e., severity level 1) from those with the severest symptoms (i.e., severity level 4) most clearly (see Supporting Information: Figure S5 for trajectories of patients from severity levels 1 and 4 only). Whereas patients with severity levels 2 and 3 (i.e., intermediate‐level severity) could not be differentiated clearly from patients in the two extreme classes.

Estimated subject‐specific posterior factor trajectories (truncated at y=3 for clarity; see Supporting Information: Figure S2 for the full version), using the exact method with q=50. The number of latent factors is pre‐specified as k=2. For people who did not survive for the full 7‐week follow‐up, we plotted their trajectories only before the death date; otherwise trajectories were plotted within a 7‐week window after the onset of symptoms.

Moreover, our method is able to estimate the “complete” trajectories of metabolites (i.e., over 7 weeks) via the posterior predictive distribution (described in Supporting Information: Section A.1). Figure 5 shows the posterior predictive distribution for the 9th metabolite (quinolinic acid), which has the largest absolute factor loading on the second factor. We plotted estimated curves of this biomarker for 20 subjects (randomly selected).

Posterior predictive trajectory distributions of the 9th biomarker for 20 subjects (randomly selected), under penalized likelihood. Solid and dotted lines represent point estimate and 95% credible intervals, respectively. Red points denote observed measurements.

Results Under Comparator Methods

5.3.2

When k=2, DGP4LCF could not return estimates within 36 h for the q=50 unique observation times. To speed up its computation, we implemented the reference grid mapping strategy with q=8. Figure 6 displays estimated pathway trajectories, which are implausibly spiky. This reflects little correlation between adjacent time points, and it is implausible biologically. In contrast, their counterparts in Figure 4 (under our model) are much smoother. Moreover, the poor performance in recovering latent factor trajectories results in poor performance in the estimation of biomarker trajectories (Figure 7). These curves are almost flat, and only change towards measured values at observed time points, reflecting severe overfitting. In contrast, the estimated biomarker trajectories under our method display a more plausible temporal trend (Figure 5). Sparse observations may be the cause of the comparator model failing to return meaningful temporal trends. The model utilized GPs to describe the trajectories of pathways, and the flexibility of GPs tends to lead to overfitting when the number of measurements is small if no regularization is imposed. On the contrary, our model can handle sparse data well, due to the constraints we introduced in Section 2.2.2.

Estimated subject‐specific posterior factor trajectories using the comparator method [10] with q=8 (truncated at y=3 for clarity). The number of latent factors is pre‐specified as k=2. For people who did not survive for the full 49‐day follow‐up, we plotted their trajectories only before the death date; otherwise trajectories were plotted within a 7‐week window after the onset of symptoms.

Posterior predictive trajectory distribution of the 9th biomarker for 20 subjects (randomly selected), by applying the model in Cai et al. [10] to the motivating example. Solid and dotted lines represent point estimate and 95% credible intervals, respectively. Red points denote observed measurements.

For q=50, the MEFISTO model estimated factor values to be zeroes for nearly all time points in t for all subjects, producing non‐zero values only on days when metabolite measurements are available. For example, a subject with measurements only on days 35 and 42 has non‐zero factor estimates only on those days. In this setting, the MEFISTO model failed to produce meaningful results. To allow some level of comparison, we also fitted the MEFISTO model with q=8, corresponding to the reference grid tref={0,7,14,…,49}. The estimated factor trajectories are shown in Figure 8. Consistent with the results from our proposed model, the second factor trajectory appears to distinguish patients with the mildest symptoms (i.e., severity level 1) from those with the severest symptoms (i.e., severity level 4) most clearly. However, the noticeable wiggling of the curves suggests that MEFISTO may be overfitting to local fluctuations.

Estimated subject‐specific posterior factor trajectories obtained using the MEFISO model, with q=8 time points at the reference grid tref={0,7,14,…,49}. Each curve represents an individual's factor trajectory, obtained by connecting their estimated factor values across the reference grid.

Estimated loadings on factor 2 (Figure 9) highlighted 3‐hydroxykynurenine (index 2), kynurenine (index 6), quinolinic acid (index 9) as important metabolites, which is consistent with the findings reported in Ruffieux et al. [1]. However, the MEFISTO model does not identify taurine, a novel metabolite detected by our model (see Section 5.4). Furthermore, the MEFISTO model does not provide uncertainty quantification for the estimated factor loadings, whereas our method provides 95% credible intervals, as shown in Section 5.4.

Top 10 metabolites on Factor 2 based on absolute factor loadings. The sign of each loading is indicated, with “–” representing a negative loading and “+” representing a positive loading.

Biological Interpretation

5.4

To interpret the factor loading results biologically, we used the online bioinformatics platform MetaboAnalyst 5.0 to perform pathway enrichment analysis for the top 10 metabolites with the highest absolute loading on the second factor. All 35 metabolites measured in the COVID‐19 study were inputted as the background/reference.

Figure 10 displays the result. The enrichment analysis returns a p value of 0.018 for the tryptophan metabolism pathway (see column “p” in the bottom panel or through the y‐axis corresponding to “−log(p)” in the top left panel), suggesting selected metabolites are significantly enriched in this pathway. Its specific mechanism is provided in the top right panel. The importance of this pathway in COVID‐19 has been previously established. Tryptophan metabolism via the kynurenine pathway has been found to be the top pathway influenced by COVID‐19 infection, and it has been argued that this should be the focus of COVID‐19 immunotherapy [34]. Furthermore, the pathway topology analysis, which evaluates the impact of the chosen metabolites on the pathway, returns a value of 0.30 (the column “impact” in the bottom panel) for tryptophan metabolism pathway. This number indicates the selected metabolites are in key positions of this pathway.

Pathway analysis results by the online bioinformatics platform MetaboAnalyst 5.0. The top left panel shows an overview of pathway analysis, with y‐axis denoting p value of pathway enrichment analysis and x‐axis representing impact of the selected metabolites on the pathway. The top right panel shows details of the enriched pathway: colored (i.e., not grey) elements represent metabolites in this pathway that are uploaded as “background”, with red and blue elements denoting “selected” and “unselected” metabolites by our method, respectively. The bottom panel supplies numerical details for the top figures.

In addition, all metabolites identified as signatures of COVID‐19 patients in Ruffieux et al. [1] (introduced in Section 1) have been selected by our method. Furthermore, we identified a novel target “taurine” which has been receiving increased attention clinically [35, 36, 37] yet was overlooked in the previous analysis [1]. Our analysis suggests that a higher level of taurine is associated with milder symptoms, which is consistent with existing knowledge; for example, van Eijk et al. claims that “taurine should be regarded as a promising supplementary therapeutic option in COVID‐19” [38]. In addition, the association between taurine and the kynurenine pathway, characterized by the corresponding factor loading estimate, would be a lot weaker if age was not regressed out. The median estimate of the factor loading after adjusting for age is −0.25 (95% credible interval (−0.33,−0.17)), whereas it is −0.09 (95% credible interval (−0.21,0)) without age adjustment. Adjusting for age (e.g., known to affect taurine level from previous biological knowledge [33]) has resulted in a refinement of the factor loading structure, leading to increased specificity in the identification of metabolites instrumental in the underlying mechanisms of COVID‐19 disease dynamics.

Discussion

6

In this paper, we propose a dynamic factor analysis approach that can uncover latent biological structure and estimate the “complete” trajectories of biomarkers, from sparsely and irregularly collected longitudinal biomarker data in studies with relatively large sample size. Specifically, we introduce a penalized MOGP with non‐zero mean functions to model the latent factor trajectories, which not only accounts for the cross‐correlations between factors but also mitigates against overfitting caused by sparse observations. For inference on MOGP hyperparameters, we develop an StEM algorithm that scales well with “large” sample size. Our method is implemented in our R package DFA4SIL. It can be viewed as an extension to the model proposed in Cai et al. [10], which works well in small studies with frequent measurements. We anticipate this extension will facilitate the use of dynamic factor analysis with potential interactions between factors in a wider range of biomedical data applications. Application to the metabolite data in a COVID‐19 study demonstrates the ability of our model to identify and characterize essential biological pathways (and corresponding instrumental metabolites) during disease progression and to estimate/predict the trajectories of metabolites over time after infection.

Although the cross‐correlation in the COVID‐19 metabolite data application is estimated to be close to zero (indicating the independence of the two latent factors), non‐zero estimates are possible with our method if factors are truly correlated, which has been shown in the simulation study. Theoretically, it is preferable to apply a model that allows for dependence among factors (as is proposed here), rather than a model that assumes independence a priori. The data would then allow us to determine whether or not independence amongst factors is appropriate. Moreover, our method is applicable to multiple biomarker types and able to capture the potential cross‐correlations among different types. As multiple data types can capture a wider range of biological mechanisms that are in play after virus infection, further insight may arise in this setting.

An alternative approach to control the roughness of GP is to directly limit the support of key hyperparameters. For instance, a bounded length‐scale has been implemented for squared exponential and Matérn covariance functions, based on the spectral density [19]. However, it is unclear how to extend this approach to complex covariance functions, such as the covariance function induced by the convolution process framework we introduced in Section 2.2.1.

In the future, there are several research directions worth pursuing. First, the computational complexity of the StEM algorithm used to determine the MLE of MOGP hyperparameters scales quadratically with kq. When kq is large (e.g., when observed times are highly irregular across individuals or when the pre‐specified number of latent factors is large), approximation methods may be used to further speed up the computation if necessary. For example, a reference grid approach could be implemented along the lines as was done in the real‐data application when applying the original method of Cai et al. [10] to the COVID‐19 data. However, some careful consideration should be given to what is an appropriate reference grid to choose which balances computational speed gains with the resulting loss of information from coarsening and allows the coarsening mechanism to be ignored in the Bayesian analysis. Alternatively, approximation methods based on variational inducing kernels, as proposed by Alvarez et al. [39], or alternative algorithms such as variational inference [40, 41, 42] may be propitious.

Moreover, the current model in Equation (7) does not take the clinical outcome information into account. Instead, we first fit an unsupervised model that does not use clinical labels, then check if there is a difference in the estimated factor trajectories of people with different levels of clinical severity by visual inspection. A model that jointly models the underlying factor structure and the relationship between factors and outcomes would be of future interest, as previous researchers have found that an outcome‐guided model can improve the estimation of the underlying factor pattern (e.g., better estimation of the factor loading matrix) [43, 44] and lead to the discovery of more clinically relevant structure.

In addition, FA assumes a linear relationship between observed biomarkers and latent biological pathways. A potential future direction is to extend our framework to accommodate non‐linear interactions, which are common in biological systems. Accounting for survival outcomes would also be a valuable extension, as it could help link pathways not only to clinical symptoms but also to patient prognosis.

Funding

This work was supported through the United Kingdom Medical Research Council programme grants MC_UU_00002/2, MC_UU_00002/20, MC_UU_00040/02, and MC_UU_00040/04.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1. Supporting information.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1H. Ruffieux , A. L. Hanson , S. Lodge , et al., “A Patient‐Centric Modeling Framework Captures Recovery From SARS‐Co V‐2 Infection,” Nature Immunology 24, no. 2 (2023): 349–358.36717723 10.1038/s 41590-022-01380-2PMC 9892000 · doi ↗ · pubmed ↗
2S. Richardson , G. C. Tseng , and W. Sun , “Statistical Methods in Integrative Genomics,” Annual Review of Statistics and Its Application 3 (2016): 181–209.10.1146/annurev-statistics-041715-033506 PMC 496303627482531 · doi ↗ · pubmed ↗
3M. Chen , A. Zaas , C. Woods , et al., “Predicting Viral Infection From High‐Dimensional Biomarker Trajectories,” Journal of the American Statistical Association 106, no. 496 (2011): 1259–1279.23704802 10.1198/jasa.2011.ap 10611 PMC 3659825 · doi ↗ · pubmed ↗
4C. M. Carvalho , J. Chang , J. E. Lucas , J. R. Nevins , Q. Wang , and M. West , “High‐Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics,” Journal of the American Statistical Association 103, no. 484 (2008): 1438–1456.21218139 10.1198/016214508000000869 PMC 3017385 · doi ↗ · pubmed ↗
5B. O. Taddé , H. Jacqmin‐Gadda , J. F. Dartigues , D. Commenges , and C. Proust‐Lima , “Dynamic Modeling of Multivariate Dimensions and Their Temporal Relationships Using Latent Processes: Application to Alzheimer's Disease,” Biometrics 76, no. 3 (2020): 886–899.31647111 10.1111/biom.13168 · doi ↗ · pubmed ↗
6G. Nyamundanda , I. C. Gormley , and L. Brennan , “A Dynamic Probabilistic Principal Components Model for the Analysis of Longitudinal Metabolomics Data,” Journal of the Royal Statistical Society. Series C, Applied Statistics 63, no. 5 (2014): 763–782.
7F. Yao , H. G. Müller , and J. L. Wang , “Functional Data Analysis for Sparse Longitudinal Data,” Journal of the American Statistical Association 100, no. 470 (2005): 577–590.
8B. Velten , J. M. Braunger , R. Argelaguet , et al., “Identifying Temporal and Spatial Patterns of Variation From Multimodal Data Using MEFISTO,” Nature Methods 19, no. 2 (2022): 179–186.35027765 10.1038/s 41592-021-01343-9PMC 8828471 · doi ↗ · pubmed ↗