Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies

Zhongzhe Ouyang; Lu Wang

PMC · DOI:10.3390/math12070951·February 7, 2025

Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies

Zhongzhe Ouyang, Lu Wang

PDF

Open Access

TL;DR

This paper introduces a new method for selecting important variables when combining longitudinal data with block-wise missing values, improving accuracy in studies like Alzheimer's.

Contribution

A novel imputation-based variable selection method for longitudinal data with block-wise missingness.

Findings

01

The proposed method outperforms existing approaches in numerical experiments.

02

It successfully identifies early-stage Alzheimer's biomarkers in real-world data.

03

The method is theoretically grounded with established asymptotic properties.

Abstract

When integrating data from multiple sources, a common challenge is block-wise missing. Most existing methods address this issue only in cross-sectional studies. In this paper, we propose a method for variable selection when combining datasets from multiple sources in longitudinal studies. To account for block-wise missing in covariates, we impute the missing values multiple times based on combinations of samples from different missing pattern and predictors from different data sources. We then use these imputed data to construct estimating equations, and aggregate the information across subjects and sources with the generalized method of moments. We employ the smoothly clipped absolute deviation penalty in variable selection and use the extended Bayesian Information Criterion criteria for tuning parameter selection. We establish the asymptotic properties of the proposed estimator, and…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases2

Alzheimer’s Disease Alzheimer's Disease

Figures2

Click any figure to enlarge with its caption.

Keywords

multiple imputationcorrelated datadata integration62H99

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsStatistical Methods and Inference · Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models

Full text

Introduction

Multi-sources data are now attracting more attention in scientific research. A practical problem with multi-source data is block-wise missing. Our work is motivated by the existence of block-wise missingness in Alzheimer’s Disease Neuroimaging Initiative (ADNI) data when investigating the biomarkers that are associated with Alzheimer’s Disease (AD). In the ADNI study, healthy elderly subjects, as well as subjects with normal cognition (NC), mild cognitive impairment (MCI), or AD, were recruited to identify neuroimaging measures, cognitive measures and biomarkers that can effectively and timely detect cognitive and functional changes [1]. The ADNI data exhibit a block-wise missing structure along with the long duration of the study, and the high cost of certain measurements, etc. Besides the ADNI data, datasets with block-wise missing structure also exist across many other fields including environmental science, sociology, and economics. For example, a block-wise missing structure appears in human mortality data integrated from Italy and Switzerland [2] and in credit data collected from various institutions (Lan and Jiang [3]; Li et al. [4]).

Statistical analysis with missing covariates has been widely studied due to the prevalence of missing values in many datasets. Common methods for dealing with missing data include complete case analysis, maximum likelihood, inverse probability weighting, and imputation. While complete case analysis is the easiest approach to implement, it has several drawbacks, such as potential bias in certain cases and a significant loss of information when the proportion of missingness is high. The maximum likelihood approach (e.g., Sabbe et al. [5]; Bondarenko and Raghunathan [6]; Audigier et al. [7]; von Hippel and Bartlett [8]) requires a specification on the distribution of variables, though this is unknown and unverifiable in practice. Inverse probability weighting (e.g., Chen et al. [9]; Creemers et al. [10]; Zubizarreta [11]; Hughes et al. [12]) heavily relies on the information from complete cases, which can be problematic when the fraction of completely observed subject is small.

Two big challenges with the above ADNI data are the high proportion of missingness and the large number of covariates, which make the complete case analysis and maximum likelihood approach inefficient. In addition to these two challenges, weighted methods cannot handle the problem in presence of multiple missing patterns. Compared to these methods with notable limitations, imputation methods are more appropriate for the ADNI data. Recently, multi-source data with block-wise missing, exemplified by the ADNI data, have drawn extensively attention in statistically research. Ref. [13] developed a classification framework, which was accomplished by three steps: feature selection, sample selection, and matrix completion. Ref. [2] proposed a dimension reduction method called generalized integrative principal component analysis (GIPCA). Under the assumption of identical type of distribution in the exponential family within each data source, GIPCA decomposed the overall effect into joint and individual effect across data sources. Ref. [14] imputed the missing data using a factor structure model, which considered the correlation between predictors and does not depend on missing mechanism. Ref. [15] developed a multiple block-wise imputation (MBI) approach by constructing estimating functions based on both complete and incomplete observation. Other related literature include those of [4,16,17].

However, these methods are not applicable to longitudinal studies. Using these methods on the ADNI data, they only select baseline measurement for each patient and simply delete the following measurements. Thus, these methods are inefficient for the ADNI data since they fail to take account of with-subject correlations. In this paper, we aim to develop a method for variable selection when integrating longitudinal data from multiple sources in the existing block-wise missing structure. We impute the block-wise missing data multiple times by using the information from both subjects with complete observation and subjects with missing values. We construct estimating equations based on imputed data and incorporate working correlation matrices to account for within-cluster correlation. With the generalized method of moment, we are capable of integrating data from multiple sources and identifying the relevant variables by introducing a penalty term.

This paper is organized as follows. Section 2 describes the setup and formalize the proposed method. In Section 3, we study the asymptotic properties of the proposed estimator. In Section 4, we develop an algorithm to implement the proposed method, followed by a simulation study conducted in Section 5 to evaluate the performance of the proposed method. In Section 6, we apply the proposed method to the ADNI study. Section 7 provides further discussions.

Methods

Setup

2.1.

Suppose the dataset consists of $[eqn]$ independent and identically distributed (i.i.d.) samples drawn from independent sources with disjoint covariates. Without loss of generality, we assume that the data are already sorted by missing patterns, and the total number of missing patterns is $[eqn]$ with $[eqn]$ samples in each pattern, where $[eqn]$ and $[eqn]$ . Within each missing pattern, all subjects have the same missing structure and the covariates from any specific source are either fully observed or fully missing. Let $[eqn]$ be the response vector for the ith subject in the kth pattern with $[eqn]$ measurements. For ease of presentation, we assume that each sample has the same number of measurements $[eqn]$ . Furthermore, let $[eqn]$ be the corresponding covariate matrix for the ith subject in the kth pattern across all the measurements, where $[eqn]$ is the number of covariates. We assume the underlying population-level model is as follows:

[eqn]

where $[eqn]$ is a known monotonic link function and $[eqn]$ is a $[eqn]$ -dimentional vector in the parameter space. Let $[eqn]$ and $[eqn]$ denote the index of observed covariates and missing covariates in the kth pattern, respectively. Define $[eqn]$ if $[eqn]$ is fully observed, otherwise 0. We assume the missing mechanism of $[eqn]$ is missing completely at random [18].

Figure 1 is an example illustrated what block-wise missing data look like, which consist of three sources with three missing patterns. Note that covariates in source 1 are completely observed in all three patterns, while covariates in source 2 are only observed in pattern 1 and 2, and covariates in source 3 are only observed in pattern 1 and 3. A similar structure also exists in the ADNI data. For example, variables in cerebrospinal fluid (CSF) are only measured in a subsample since CSF collection were mainly performed in phase II. Although complete cases analysis is feasible for ADNI data, it is inefficient especially the number of subjects with complete observation is limited. Thus, it is essential to leverage information from incomplete observation.

Proposed Method

2.2.

One approach to utilizing incomplete data is by imputing missing values and performing statistical analysis based on the imputed dataset. Traditional methods impute missing values using information solely from complete cases. However, in scenarios involving block-wise missing data, the proportion of complete cases can be relatively small, resulting in unstable imputed values. To further illustrate how to incorporate information from subjects with partially observed values when imputing missing values, we continue to use the example given in Figure 1. Let $[eqn]$ be the rth imputed covariate vector for the ith subject from pattern $[eqn]$ . For instance, the missing values of $[eqn]$ , i.e., the covariates of source 3 in pattern 2, can be imputed using the information of all sources in pattern 1, which we denoted as $[eqn]$ . Additionally, these can also be imputed based on the covariates in source 1 and source 3 for subjects from pattern 1 and pattern 3, which we denoted as $[eqn]$ . Figure 2 illustrate how the above procedures work. When all the covariates are observed, $[eqn]$ . Similarly, we can define $[eqn]$ as the corresponding imputed conditional mean.

The intuition behind the proposed method stems from generalized estimating equations (GEE) and quadratic inference functions (QIF). Suppose $[eqn]$ is the unknown true covariance matrix of $[eqn]$ . Ref. [19] proposed that $[eqn]$ can be estimated by $[eqn]$ , where $[eqn]$ is the diagonal matrix of the conditional variance of $[eqn]$ and $[eqn]$ is a working correlation matrix that fully specified by a vector of parameter $[eqn]$ . Ref. [20] proposed the QIF using the fact that the inverse of the correlation matrix $[eqn]$ can be approximated by $[eqn]$ , where $[eqn]$ are some basis matrices. For example, if we assume the within-cluster correlation structure is exchangeable, $[eqn]$ can be approximated by $[eqn]$ , where $[eqn]$ is the identity matrix and $[eqn]$ is a matrix with elemtents in the diagonal to be 0 and elements in the off-diagonal to be 1. The estimation of inverse of correlation matrix using linear combination has been intensively studied by [21]. The advantage of this linear approximation is that the parameter $[eqn]$ can be treated as a nuisance parameter, leading to some improvement in computational efficiency. Then, the estimating function for the subject $[eqn]$ in the kth pattern with the rth imputation is defined as:

[eqn]

Here, we only take derivative with respect to $[eqn]$ to enhance numerical stability. Recall that $[eqn]$ is a linear coefficient that used to approximate the inverse of correlation matrix, and thus, it is the nuisance parameter. To avoid estimating these nuisance parameters, we define the extended score vector:

[eqn]

Similarly, we obtain extended score vectors for all imputed covariate vectors and subjects. To integrate all score vectors, we aggregate the information by stacking them into a long vector:

[eqn]

where $[eqn]$ . Note that this might be an overdetermined system because the number of equations can exceed the number of parameters. To overcome this difficulty, we adopt generalized method of moment [22] and add a penalty term. Therefore, the objective function becomes:

[eqn]

where:

[eqn]

is a block-diagnoal matrix under the assumption of independence among samples from different missing patterns and $[eqn]$ is an arbitrary, investigator’s chosen, penalty function with a tuning parameter $[eqn]$ . Among many optional penalty functions, we adopt the non-convex smoothly clipped absolute deviation (SCAD) penalty [23]:

[eqn]

for some $[eqn]$ , which possess desirable oracle property.

Asymptotic Properties

In this section, we investigate the asymptotic properties of the proposed estimator. In Section 3.1, we assume the sample size $[eqn]$ is increasing while the number of parameters $[eqn]$ is fixed, and demonstrate that the proposed estimator is $[eqn]$ -consistent and asymptotically normal. As sample size goes to infinity, the proposed method is capable of selecting out the relevant variables with probability goes to 1. We also show that the proposed estimator is asymptotically more efficient than single imputation method via incorporating information of samples with missing values. In Section 3.2, we suppose both the sample size $[eqn]$ and the number of parameters $[eqn]$ are increasing but $[eqn]$ increases faster than $[eqn]$ . We show that the consistency and sparsity still hold with diverging $[eqn]$ . Without loss of generality, we assume $[eqn]$ can be partitioned into two parts, i.e., $[eqn]$ , where $[eqn]$ corresponds to relevant variables with a non-zero true value, while $[eqn]$ consists of coefficients of irrelevant variables with a zero true value. For any function $[eqn]$ , we use $[eqn]$ to denote the first derivative of $[eqn]$ evaluated at $[eqn]$ . We use similar notation for its other order derivatives.

Fixed Number of Parameters

3.1.

To establish the asymptotic properties of the proposed estimator in the setting of increasing sample sizes and fixed number of parameter, we require the following regularity conditions:

C.1 $[eqn]$ and $[eqn]$ , for any $[eqn]$ $[eqn]$ , and $[eqn]$ , where the inner expectation is with respect to the imputed values.

C. 2 All the variance matrix $[eqn]$ and $[eqn]$ , for any $[eqn]$ and $[eqn]$ .

C. 3 Let $[eqn]$ . For any $[eqn]$ and $[eqn]$ , $[eqn]$ and the fourth moment of $[eqn]$ exists.

C.4 $[eqn]$ , for any $[eqn]$ and $[eqn]$ .

C.5 The penalty function satisfied:

$[eqn]$ ; $[eqn]$ ; $[eqn]$ .

C.6 $[eqn]$ , where $[eqn]$ and $[eqn]$ , with $[eqn]$ and $[eqn]$ to be a diagonal matrix with $[eqn]$ dimension and each element equals to $[eqn]$ .

C.1–C.3 are conditions that require the existence of the moment, which are easily satisfied. C.4 requires the imputed conditional mean converges to the true conditional mean in probability, which is satisfied as long as the imputed model is correctly specified and the missing mechanism is either missing completely at random. C.5 is a standard condition for SCAD penalty which is commonly used in variable selection method (Gao et al. [24]; Cho and Qu [25]; Tian et al. [26]). More specifically, (a) ensures the property of sparsity is satisfied, (b) and (c) ensure the property of consistency is satisfied, and (c) also guarantees that the objective function (1) is dominated by the first term. C.6 is used to establish the asymptotic normality of the proposed estimator.

Theorem 1.

Under C.1–C.5, there exists a local minimizer $[eqn]$ of $[eqn]$ such that $[eqn]$ .

Theorem 1 states the existence of a minimizer of the objective function and the minimizer will converge to the true coefficients at a rate of $[eqn]$ as the sample size increases. Next, we demonstrate that this estimator possesses the sparsity property and the estimator for the non-zero coefficient is asymptotically normal, as outlined in the following theorem.

Theorem 2.

Under C.1–C.5, if $[eqn]$ and there exist a sequence such that $[eqn]$ as $[eqn]$ , where $[eqn]$ , then the proposed estimator $[eqn]$ satisfies the following properties:

(Sparsity) $[eqn]$ ;(Asymptotic Normality) Let $[eqn]$ and $[eqn]$ and if C. 6 holds, then $[eqn]$ .

The sparsity of the proposed estimator guarantees that the probability of selecting the true model approaches 1. We also obtained in Theorem 2 the asymptotic normality of $[eqn]$ , the estimator of coefficients for the relevant variables, which allows us to estimate its variance if $[eqn]$ and $[eqn]$ are known. However, in practice, these are unknown to us. We can obtain the empirical variance covariance matrix of $[eqn]$ by replacing $[eqn]$ with $[eqn]$ and replacing $[eqn]$ with $[eqn]$ , i.e., $[eqn]$ . Next, we compare the empirical variance of the proposed estimator with the empirical variance of the single imputation approach.

Theorem 3.

If a single imputation is used based on complete cases and denotes the asymptotic covariance matrix of $[eqn]$ as $[eqn]$ , then under the conditions of Theorem 2 $[eqn]$ is positive semi-definite.

Theorem 3 claims that the proposed estimator is asymptotically more efficient than the single imputation approach, as it incorporates information from incomplete cases during imputation. The result of this Theorem is intuitive because the proposed method incorporates more samples into the imputation process.

Diverging Number of Parameters

3.2.

In this subsection, we consider the setting where sample size $[eqn]$ and number of coefficients $[eqn]$ increase simultaneously. For certain properties to remain true, we require that $[eqn]$ increases faster than $[eqn]$ . We replace the notation $[eqn]$ by $[eqn]$ to indicate that the number of parameters also increases. We make the following assumptions:

D.1 For any $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ and:

[eqn]

D.2 There exist an open ball of $[eqn]$ and there exist a constant $[eqn]$ such that each entries of $[eqn]$ is bounded by $[eqn]$ , for any $[eqn]$ in this open ball.

D.3 The penalty function satisfied:

$[eqn]$ ; $[eqn]$ ; $[eqn]$ .

D.1 and D.2 are analogous to C.1–C.4. D.3 is the modification of C.5 for diverging number of parameters.

Theorem 4.

Under D.1–D.3, if $[eqn]$ , there exists a local minimizer $[eqn]$ of $[eqn]$ such that $[eqn]$ .

From the result of Theorem 4, we find that the consistency still holds for the proposed estimator, even with a diverging number of parameters. Not surprisingly, the convergence rate is no longer $[eqn]$ , but $[eqn]$ . We also require that $[eqn]$ does not increase faster than $[eqn]$ to ensure the model remains sparse. To be specific, the majority of the coefficients is zero.

Theorem 5.

Under D.1–D.3, if $[eqn]$ , and $[eqn]$ as $[eqn]$ , then with probability tending to 1, the estimator $[eqn]$ satisfies $[eqn]$ .

Theorem 5 states the sparsity of the proposed estimator with a diverging number of parameters. This property guarantees that the proposed method can still select the true model with a probability approaching 1, even when the number of parameters is diverging.

Implementation

Since directly minimizing the objective function is difficult, we incorporate an iterative procedure inspired by the implementation in [27], where they combined the minorization–maximization algorithm [28] with the Newton–Raphson algorithm. Given the current estimate of $[eqn]$ and tuning parameter $[eqn]$ , the objective function $[eqn]$ can be locally approximated by (except a constant term):

[eqn]

where:

[eqn]

and $[eqn]$ is a sufficiently small number (e.g., $[eqn]$ ). Thus, the search for estimator minimizing the objective function is equivalent to find an estimator that minimize (2). Notice that both $[eqn]$ and $[eqn]$ are unknown. Fortunately, from Lemma S2 in Supplementary Materials, $[eqn]$ can be approximated by:

[eqn]

and $[eqn]$ can be approximated by:

[eqn]

Plugging (3) and (4) into (2) and applying the Newton–Raphson algorithm, we obtain the following formula to update $[eqn]$ :

[eqn]

We repeat the above procedure until $[eqn]$ is smaller than a pre-specified threshold or reach a pre-specified maximum number of iteration.

It is known that the sampling covariance matrix $[eqn]$ may be singular in some cases [29]. To overcome the difficulty in computing the inverse of $[eqn]$ , we adopt the Moore–Penrose generalized inverse, which exists and is unique for any matrix.

In the implementation of the proposed method, we select tuning parameter $[eqn]$ with extended Bayesian Information Criterion (EBIC) criteria proposed by [30]:

[eqn]

where $[eqn]$ is the number of parameters of the model with tuning parameter $[eqn]$ and $[eqn]$ is the residual sum of square of all the missing pattern with:

[eqn]

Simulation

In this section, we implement a simulation study to compare the performance of the proposed method in variable selection against complete case analysis (CC) with SCAD penalty, single imputation (SI) with SCAD penalty, and the penalized generalized estimating Equation (PGEE). We use the same data structure as shown in Figure 1, where we have three missing patterns and three sources. The number of measurement is set to be three throughout this section. We replicate the simulation 100 times and use false positive rate (FPR) and false negative rate (FNR) to evaluate the performance of each method, which reflect the proportion of covariates that are irrelevant but falsely selected and the proportion of covariates that are relevant but fail to be selected, respectively. In the tuning parameter selection procedure, the parameter $[eqn]$ was set to 0.5 in EBIC. At the end of the iterative algorithm in Section 4, the estimated coefficient is considered as zero, if its absolute value is smaller than 0.01.

In the first setting, we simulate a dataset with a small proportion of complete cases, where $[eqn]$ , $[eqn]$ , $[eqn]$ , and the missing rate is around 87%. The data with continuous outcome are generated from the model:

[eqn]

where $[eqn]$ , $[eqn]$ is a vector consisting of 30 covariates, and $[eqn]$ . Here, each source consists of 10 covariates with the first two covariates having non-zero coefficients. $[eqn]$ is a time-fixed covariate and we generate it from the standard normal distribution, whereas other covariates are time-varying covariates and follow multivariate normal distribution with mean zero and exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.5. We generate random error $[eqn]$ from the multivariate normal distribution with mean 0 and exchangeable covariance matrix with marginal variance 1 and correlation coefficient $[eqn]$ . We always assume the true within-cluster correlation structure is known and considered $[eqn]$ to be 0.3, 0.5, and 0.7 in each setting, which corresponded to mild, moderate, and strong within-cluster correlation. Let $[eqn]$ . Then, $[eqn]$ , $[eqn]$ , and $[eqn]$ samples were sequentially drawn with probability proportional to the $[eqn]$ and assigned to the pattern 1, pattern 2, and pattern 3, respectively. Obviously, subjects with higher covariates value from source 1 at the baseline are more likely to be assigned to pattern 1, followed by pattern 2 and then pattern 3. This data generating process implies a MAR mechanism for the missing covariates. The results of Table 1 summarize the performance of each method for three different $[eqn]$ . All of these methods effectively control the FNR. However, FPR of the proposed method is lower than the other three methods. In other words, the proposed method is able to select most of relevant variables while controlling the error of selecting irrelevant variables. In addition, we notice that the proposed method is more capable of utilizing within-cluster correlation compared with PGEE since the proposed method performs better as the within-cluster correlation becomes stronger. This result demonstrates the superiority of the proposed method when the percentage of complete cases is small in the block-missing data.

In the second setting, we continue to investigate the proposed method’s performance with a continuous outcome, but we proportionally increase the sample size in each missing pattern to demonstrate the proposed method’s effectiveness in larger samples, where $[eqn]$ , $[eqn]$ , $[eqn]$ . The results are described in Table 2. Unsurprisingly, the FPR and FNR of all the methods decreased compared with the first setting. We observe that the performance of the PGEE is very close to that of the single imputation method while the proposed method has a much lower FPR. In the meanwhile, complete cases analysis is still the worst option since the improvement is minor as the sample size increase, and even negligible when the within-cluster correlation is strong. Therefore, the proposed method is still able to maintain an appealing performance in the large sample size. The results from this setting further verify the efficiency gain of the proposed method in incorporating more information from the missing data compared to the single imputation.

In the third setting, we consider a correlated binary outcome with $[eqn]$ , $[eqn]$ , and $[eqn]$ . The data are generated from the model:

[eqn]

where $[eqn]$ , $[eqn]$ is a vector consisting of 15 covariates, and $[eqn]$ . Here, each source consists of five covariates, with the first covariate in each having non-zero coefficients. $[eqn]$ is a time-fixed covariate and we generate it from the standard normal distribution, whereas other covariates are time-varying covariates and follow multivariate normal distribution with mean zero and exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.5. We generate random error $[eqn]$ from the multivariate normal distribution with mean 0 and exchangeable covariance matrix with marginal variance 1 and correlation coefficient 0.3. In this setting, $[eqn]$ . The results are summarized in Table 3. Although the PGEE outperforms other methods in terms of FPR, its performance in FNR is poor. In contrast, the proposed method possesses a better balance between FPR and FNR. We still observed a better performance of the proposed method.

Application

We apply our proposed method to the ADNI study. This study was launched in 2003 and has undertaken three different phases so far: ADNI 1, ADNI GO/2, and ADNI 3, which is designed to develop the effective treatment that can slow or stop the progression of AD. Our goal is to identify sensitive biomarkers of AD in the early stage from three data sources: magnetic resonance imaging (MRI), positron emission tomography (PET), and cerebrospinal fluid (CSF). We choose the mini-mental state examination (MMSE) [31] score as response variable, which has been widely used in the early diagnosis of AD [32]. The MRI data were analyzed by UCSF, who performed cortical reconstruction and volumetric segmentation with FreeSurfer. The processed MRI data primarly summarized average cortical thickness, standard deviation in cortical thickness, the volumes of cortical parcellations, the volumes of specific white matter parcellations, and the total surface area of the cortex [33]. The PET data were processed by UCB and quantities variables were obtained by standard uptake value ratio (SUVR) in amyloid florbetapir. The CSF data were acquired by ADNI Biomarker Core and Laboratory Medicine and Center for Neurodegenerative Diseases Research at UPENN. The block-wise missing emerged in this data. Less than half of patients lacked MRI measurements, few patients missed PET measurements, and only a small proportion of patients had CSF measurements. One of the reasons for the block-wise missing data is that obtaining CSF measurements requires more invasive procedures (such as lumbar puncture), which are refused by the majority of patients. The goal of this analysis is to identify biomarkers that are highly predictive of MMSE.

We only use the ADNI GO/2 dataset and consider measurements at baseline, month 24, and month 48, since the majority of patients have records at these time points. We also notice that there exist some low-quality data, such as those missed baseline measurement or belonged to a missing pattern with few patients. For simplicity of analysis, we discard these low-quality data, which leads us to a study cohort of 669 patients. Among them, 280 patients missed the measurement at month 24 and 487 patients missed the measurements at month 48. There are 340 features in MRI data, 229 features in PET data, and 3 features in CSF data. These three datasets and MMSE data are joined by a unique identifier “visit code” provided by the ADNI study. In total, we have three missing patterns. Table 4 describes the missing pattern of this dataset. The number of patients with fully observed variables is 63, with a missing rate around 90.6%. From this extremely high proportion of missing data, we will see how the proposed method can substantially improve the prediction ability by incorporating the information of related samples with missing values. To assess the predictive performance of the proposed method, data are randomly split into a test data with a sample size 30 (roughly 5%) and the remaining data as training data, where the test data are drawn from the data with fully observed variables (missing pattern 1). This random split process was replicated 30 times. A variable is marked as a potential predictor of AD if its absolute coefficient value is greater than 0.01.

Table 5 summarizes the average number of biomarkers selected by each method, along with the most frequently selected biomarkers. We also report the post-model-selection p-value. Our method successfully identifies biomarkers that align with findings reported in existing Alzheimer’s Disease research literature. In comparison to PGEE, the other three methods consistently select amyloid-β as a biomarker of AD, whose accumulation in cells is an early event of AD [34]. Phosphorylated tau, another widely accepted biomarker, has been validated by multiple large-scale, multi-center studies [35]. Studies found that neurons in AD patients are more likely to loss the superior temporal sulcus [36]. Two distinct normalization methods of summary measures for the standardized uptake value ratio (SUVR) of the florbetapir tracer, in the composite reference region and the whole cerebellum reference region, may potentially serving as AD biomarkers [37]. Besides these biomarkers, the proposed method additionally identifies several well-established and potential biomarkers. The size of the region of interest (ROI) in the left and right hemisphere precuneus area of the cortex, as well as cortical volume of left precuneus, summarize the health status of precuneus, which may be atrophy in the early stage of AD. The size and volume of the ROI in the left and right inferior lateral ventricle reflect disease progression (Bartos et al. [38]; Song et al. [39]). White matter changes in cerebral or subcortical areas can appear in other neurological conditions and normal aging, their connections with AD potentially make them useful biomarkers for distinguishing AD from normality, especially when considered along with other biomarkers in future investigations. While the surface area of the left caudal middle frontal and the cortical volume of the right caudal anterior cingulate are both associated with AD, more research is required to further explore these associations.

Discussion

It is well known that variable selection is a challenge for model robustness, estimator stableness and efficiency, as well as precise predictability. However, another non-negligible issue when integrating longitudinal studies is missingness in the covariate, especially in block-wise missing data. Specifically, with block-wise missing data, the percentage of complete observations is relatively small while traditional statistical methods heavily rely on information of complete cases. In this paper, we develop new methods to extend the MBI approach in a longitudinal study under the setting of block-wise missing data. Under certain regularity conditions, the desirable properties, consistency, sparsity, and asymptotic normality still hold. In addition, the proposed method demonstrates superior efficiency compared to the single imputation approach. It is worth noting that dropout missing data are also very common in longitudinal studies, which typically cause bias in many cases. In future work, it will be of great interest to develop methods to handle dropout missingness and incorporate inverse probability weighting in the proposed method.

One limitation of this paper is that we assume a homogeneous missing pattern across measurements within a single patient. Although this assumption may be restrictive in real data analysis, it is not hard to fulfill in multi-source data.

Supplementary Material

supplement

Bibliography39

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Mueller SG; Weiner MW; Thal LJ; Petersen RC; Jack C; Jagust W; Trojanowski JQ; Toga AW; Beckett L The Alzheimer’s disease neuroimaging initiative. Neuroimaging Clin. 2005, 15, 869–877.10.1016/j.nic.2005.09.008PMC 237674716443497 · doi ↗ · pubmed ↗
2Zhu H; Li G; Lock EF Generalized integrative principal component analysis for multi-type data with block-wise missing structure. Biostatistics 2020, 21, 302–318.30247540 10.1093/biostatistics/kxy 052PMC 7868054 · doi ↗ · pubmed ↗
3Lan Q; Jiang S A method of credit evaluation modeling based on block-wise missing data. Appl. Intell. 2021, 51, 6859–6880.
4Li J; Zhang Q; Chen S; Fang K Weighted multiple blockwise imputation method for high-dimensional regression with blockwise missing data. J. Stat. Comput. Simul. 2023, 93, 459–474.
5Sabbe N; Thas O; Ottoy JP EM Lasso: Logistic lasso with missing data. Stat. Med. 2013, 32, 3143–3157.23440969 10.1002/sim.5760 · doi ↗ · pubmed ↗
6Bondarenko I; Raghunathan T Graphical and numerical diagnostic tools to assess suitability of multiple imputations and imputation models. Stat. Med. 2016, 35, 3007–3020.26952693 10.1002/sim.6926 · doi ↗ · pubmed ↗
7Audigier V; Husson F; Josse J Multiple imputation for continuous variables using a Bayesian principal component analysis. J. Stat. Comput. Simul. 2016, 86, 2140–2156.
8von Hippel PT; Bartlett JW Maximum likelihood multiple imputation: Faster imputations and consistent standard errors without posterior draws. Stat. Sci. 2021, 36, 400–420.