Optional subsampling for generalized estimating equations in growing-dimensional longitudinal Data
Chunjing Li, Jiahui Zhang, Xiaohui Yuan

TL;DR
This paper introduces an optimal Poisson subsampling method for generalized estimating equations to efficiently analyze large-scale longitudinal data with high-dimensional covariates, addressing computational challenges.
Contribution
It develops a novel subsampling algorithm with proven asymptotic properties and practical two-step probability construction for large-scale longitudinal data analysis.
Findings
Method remains effective under misspecified correlation matrices.
Achieves computational efficiency in large datasets.
Demonstrated successful application on real CHFS data.
Table 1: Computation time under Case 1 with AR(1) as the true correlation matrix. Column headers give the working correlation matrix; the two groups of three columns correspond to the two settings compared in Section 4.
| Subsample size | Method | EX | AR(1) | MA(1) | EX | AR(1) | MA(1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | pUnif | 0.206 | 0.222 | 0.529 | 0.250 | 0.273 | 0.775 |
| | pMV | 0.713 | 0.699 | 1.088 | 0.814 | 0.827 | 1.277 |
| | pMVc | 0.657 | 0.642 | 0.984 | 0.706 | 0.715 | 1.151 |
| 200 | pUnif | 0.354 | 0.345 | 0.790 | 0.402 | 0.450 | 1.001 |
| | pMV | 0.884 | 0.864 | 1.370 | 0.980 | 0.976 | 1.640 |
| | pMVc | 0.804 | 0.781 | 1.281 | 0.844 | 0.862 | 1.469 |
| 400 | pUnif | 0.686 | 0.683 | 1.432 | 0.713 | 0.710 | 1.604 |
| | pMV | 1.268 | 1.234 | 2.058 | 1.420 | 1.384 | 2.504 |
| | pMVc | 1.176 | 1.159 | 1.946 | 1.280 | 1.232 | 2.241 |
| 600 | pUnif | 1.073 | 1.050 | 2.127 | 1.102 | 1.051 | 2.229 |
| | pMV | 1.681 | 1.686 | 2.629 | 1.936 | 1.861 | 3.488 |
| | pMVc | 1.588 | 1.555 | 2.494 | 1.772 | 1.676 | 3.134 |
| 800 | pUnif | 1.470 | 1.384 | 2.688 | 1.518 | 1.492 | 3.088 |
| | pMV | 2.148 | 2.151 | 3.218 | 2.469 | 2.399 | 4.450 |
| | pMVc | 2.005 | 1.991 | 3.038 | 2.249 | 2.195 | 4.007 |
| 1000 | pUnif | 1.890 | 1.827 | 3.298 | 1.985 | 1.868 | 4.069 |
| | pMV | 2.636 | 2.583 | 3.923 | 3.019 | 2.886 | 5.253 |
| | pMVc | 2.482 | 2.489 | 3.731 | 2.787 | 2.633 | 4.845 |
| Full data | | 15.427 | 15.917 | 22.448 | 15.593 | 16.109 | 22.629 |
**Chunjing Li**, School of Mathematics and Statistics, Changchun University of Technology, China
**Jiahui Zhang**, School of Mathematics and Statistics, Changchun University of Technology, China
**Xiaohui Yuan**∗, School of Mathematics and Statistics, Changchun University of Technology, China
This version: August 28, 2025
∗ Corresponding author; † equal author contribution.
Abstract
As a powerful tool for longitudinal data analysis, the generalized estimating equations have been widely studied in the academic community. However, in large-scale settings, this approach faces pronounced computational and storage challenges. In this paper, we propose an optimal Poisson subsampling algorithm for generalized estimating equations in large-scale longitudinal data with diverging covariate dimension, and establish the asymptotic properties of the resulting estimator. We further derive the optimal Poisson subsampling probability based on A- and L-optimality criteria. An approximate optimal Poisson subsampling algorithm is proposed, which adopts a two-step procedure to construct these probabilities. Simulation studies are conducted to evaluate the performance of the proposed method under three different working correlation matrices. The results show that the method remains effective even when the working correlation matrices are misspecified. Finally, we apply the proposed method to the CHFS dataset to illustrate its empirical performance.
Keywords: longitudinal data; generalized estimating equations; growing dimension; massive data; Poisson subsampling
1 Introduction
Longitudinal data are commonly encountered in medical research, economics, and the social sciences, and have garnered significant attention in statistical research. Liang & Zeger (1986) developed generalized estimating equations (GEE) for the analysis of longitudinal data, extending quasi-likelihood approaches by incorporating a working correlation matrix to account for within-subject dependence. The resulting estimators remain consistent even when the working correlation matrix is misspecified. Chaganty (1997) demonstrated that the parameter estimates are consistent and asymptotically normal. Li (1997) studied the asymptotic properties of GEE estimates using the maxmin method. Xie & Yang (2003) analyzed the asymptotic properties of GEE in the case of a single covariate, as the number of individuals, the number of observations per individual, or both grow to infinity. Balan & Schiopu-Kratina (2005) employed pseudo-likelihood equations to demonstrate the existence, weak consistency, and asymptotic normality of GEE estimators when the covariate dimension is fixed. For the analysis of high-dimensional longitudinal data, Wang (2011) extended the asymptotic properties of GEE estimators with a binary response variable to the setting in which the number of covariates grows to infinity. Wang et al. (2012) considered penalized GEE for analyzing longitudinal data with high-dimensional covariates.
With the rapid advancement of economic development and information technology, data collection capabilities have significantly improved, leading to a dramatic increase in the scale of longitudinal data. As a consequence, we are facing not only the difficulties associated with high dimensionality but also the computational and storage challenges brought about by the explosive growth in data volume. Take the China Household Finance Survey (CHFS) as an example: since its launch in 2011, it has covered over 40,000 households and conducted follow-up surveys every two years, resulting in millions of observations across thousands of economic and social variables, such as assets, liabilities, and consumption, that capture household financial behavior over time. Similarly, the U.S. National Health and Nutrition Examination Survey (NHANES) has accumulated tens of thousands of longitudinal records with hundreds of health indicators, further illustrating the pressure of rapidly increasing sample sizes alongside high dimensionality. These characteristics render conventional storage and analytical approaches computationally infeasible, highlighting the urgent need for more efficient algorithms, scalable computing frameworks, and distributed processing techniques to effectively handle modern longitudinal data.
To efficiently process such large-scale datasets, several methods have been proposed, including divide-and-conquer strategies (Lin & Xi 2011; Xu et al. 2020), online updating techniques for streaming data (Luo et al. 2023; Schifano et al. 2016), and subsampling approaches (Fithian & Hastie 2014; Ma et al. 2015; Wang et al. 2018). Among them, subsampling methods have received significant attention because they reduce resource consumption while preserving data representativeness, and they have produced substantial theoretical and practical results. For cross-sectional data, Fithian & Hastie (2014) introduced a Poisson sampling method in the context of logistic regression. Ma et al. (2015) conducted a statistical analysis of leverage-based subsampling. Wang et al. (2018) derived optimal subsampling probabilities for logistic regression based on the A-optimality criterion, and subsequently proposed a two-step adaptive algorithm aimed at approximating this optimal subsampling scheme. Yu et al. (2022) constructed optimal Poisson subsampling probabilities for pseudo-likelihood estimation, guided by A- and L-optimality criteria, and developed a distributed framework to handle data partitioned across multiple blocks or locations. Yao & Wang (2019) and Yao et al. (2023) applied optimal subsampling and Poisson-based subsampling methods to softmax regression models. Ai et al. (2021) applied optimal subsampling methods to generalized linear models. Yuan et al. (2024) incorporated subsampling strategies into distributed composite quantile regression frameworks. For longitudinal data, Wang et al. (2023) developed a new subsampling strategy that incorporates leverage and gradient information. Han & Fu (2023) developed optimal subsampling algorithms for the marginal model.
Recently, for high-dimensional data, Gao et al. (2024) and Shan & Wang (2024) investigated subsampling strategies based on decorrelated score approaches for generalized linear models. Li et al. (2024) investigated a Poisson-based subsampling method for expectile regression in large-scale data. To the best of our knowledge, no prior work has explored optimal subsampling algorithms in growing-dimensional longitudinal settings. We aim to develop an optimal Poisson subsampling algorithm for GEE with high-dimensional covariates. This study makes the following main contributions: (i) We propose a Poisson subsampling algorithm for GEE in growing-dimensional longitudinal data and establish the consistency and asymptotic normality of the resulting estimator. (ii) We further develop a two-step algorithm that approximates the optimal Poisson sampling probabilities, thereby extracting more informative subsamples for estimation.
The remainder of this paper is organized as follows. In Section 2, we establish the asymptotic properties of the general Poisson subsampling estimator. Section 3 presents the optimal Poisson subsampling probabilities, which are determined according to the A- and L-optimality criteria. Simulation studies in Section 4 demonstrate the effectiveness of the proposed method. In Section 5, we apply the proposed method to the CHFS dataset. The proofs can be found in the Appendix.
2 Poisson subsampling method based on generalized estimating equations
2.1 Generalized estimating equations
For and , let denote the response variable and represent a covariate vector with diverging dimension . Define and . Without loss of generality, we assume that . The conditional expectation of is given by , where Here, is a known link function, , and is the regression parameter vector. Observations within the same individual are assumed to be correlated, whereas those from different individuals are independent. Let be the marginal mean vector for the -th individual, and the covariance of the response variable is given by:
[TABLE]
where is a diagonal matrix, is the true correlation matrix of the response variable , and is the dispersion parameter, which may be known or unknown.
Liang & Zeger (1986) introduced the generalized estimating equation, which takes the following form:
[TABLE]
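For orientation, the estimating equation of Liang & Zeger (1986) is usually written as follows. The symbols of the original display did not survive in this copy, so the notation here is the standard one from the GEE literature and should be read as an assumption rather than as the paper's own: $D_i(\beta) = \partial \mu_i(\beta) / \partial \beta^{\top}$, $A_i(\beta)$ is the diagonal matrix of marginal variances, $R(\alpha)$ is the working correlation matrix, and $\phi$ is the dispersion parameter.

```latex
\sum_{i=1}^{n} D_i(\beta)^{\top} V_i(\beta)^{-1} \{ Y_i - \mu_i(\beta) \} = 0,
\qquad
V_i(\beta) = \phi \, A_i(\beta)^{1/2} R(\alpha) A_i(\beta)^{1/2}.
```

The equation depends on the data only through the first two marginal moments, which is why the estimator can remain consistent when $R(\alpha)$ differs from the true correlation matrix.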
We use to denote the estimated working correlation matrix, and define the GEE estimator by:
[TABLE]
where , and represent the values obtained during the -th update. Given and , the estimator can be computed by:
[TABLE]
We repeat the iterative procedure until the norm of the difference between successive estimates of is less than . The resulting estimate corresponds to the previously defined .
To obtain consistent parameter estimates, the computational complexity is at least , where denotes the number of iterations. As the number of individuals increases, the computational burden also increases. Typically, subsampling algorithms help to reduce computational costs. The Poisson subsampling method can avoid memory overflow issues while maintaining efficient parameter estimation. Therefore, the next subsection will introduce the Poisson subsampling method for parameter estimation.
2.2 Poisson subsampling algorithm
Let represent the subsample dataset, where denotes the sampling probability for individual . Let and represent the estimates based on the subsample; and represent the values obtained during the -th update. Given the subsample , the weighted generalized estimating equation takes the following form:
[TABLE]
Under the assumption of a working independence correlation matrix, the initial estimate of can be directly obtained. is estimated using the Gaussian pseudo-likelihood method. With and , the value of can be estimated as:
[TABLE]
Equation (2.3) is iteratively applied until . The resulting estimate corresponds to the previously defined . We illustrate the steps of a general Poisson subsampling algorithm in Algorithm 1.
Algorithm 1: General Poisson Subsampling Algorithm.
- Step 1: Initialize the set . For each , generate an independent Bernoulli random variable . If , include the triplet in the set .
- Step 2: Based on the subsample , we use equation (2.3) to solve the weighted generalized estimating equation in (2.2) and obtain the regression parameter estimate .
Whether the observation of individual is included in the subsample depends solely on its own probability , without considering the sampling probabilities of other individuals. In Algorithm 1, a random variable is generated through a Bernoulli trial to decide whether the observation of individual is included in the subsample. Therefore, for massive data sets, the Poisson subsampling method alleviates memory constraint issues.
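The sampling step of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `poisson_subsample` is hypothetical, and attaching the inverse-probability weight `1 / probs[i]` to each selected individual is an assumption about how the weighted estimating equation uses the sampling probabilities.

```python
import random


def poisson_subsample(probs, seed=None):
    """Draw one Poisson subsample.

    probs[i] is the inclusion probability of individual i. Returns a list of
    (index, weight) pairs, where weight = 1 / probs[i] (assumed weighting).
    """
    rng = random.Random(seed)
    selected = []
    for i, p in enumerate(probs):
        # Each individual is decided by its own Bernoulli(p) draw, so the data
        # can be scanned once without holding all individuals in memory.
        if rng.random() < p:
            selected.append((i, 1.0 / p))
    return selected
```

Because every decision is independent, the realized subsample size is random; only its expectation, the sum of the probabilities, is fixed in advance.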
The size of the drawn subsample, denoted by , satisfies . Let denote the expected size of the drawn subsample. Furthermore, assume that , which is a common assumption in big data scenarios. The following regularity conditions are imposed to ensure consistency and asymptotic normality.
- (C1) .
- (C2) The parameter vector is assumed to lie within a compact set , and the true parameter vector is also contained in .
- (C3) There are positive constants such that
[TABLE]
and similarly, two other constants satisfy
[TABLE]
where and indicate the smallest and largest eigenvalues of a matrix, respectively.
- (C4) The true correlation matrix is assumed to have eigenvalues bounded away from 0 and ∞. The estimated working correlation matrix satisfies , where is a positive definite matrix with eigenvalues also bounded away from 0 and ∞. We do not require to be the true correlation matrix .
- (C5) There are positive constants and such that
[TABLE]
- (C6) Let be a constant satisfying , and , where are the first-, second-, and third-order derivatives of , respectively.
- (C7) .
Assumption (C1) is a common condition for diverging-dimension M-estimators, which aligns with the setting in Portnoy (1985). Assumption (C2) is a necessary condition for the consistency of the estimator and has been widely adopted in many studies, including Newey & McFadden (1994). Assumption (C3) is frequently used in high-dimensional regression literature, with a similar formulation appearing in Wang (2011). Assumption (C4) extends the framework from fixed to high-dimensional . Assumption (C5) imposes moment conditions on the model. Assumption (C6) ensures the consistency of parameter estimation. Assumption (C7) restricts the weights in the weighted generalized estimating equations, primarily to prevent individuals with extremely small subsampling probabilities from unduly influencing the results.
Theorem 2.1
Under assumptions (C1)-(C7), if and is a solution to , then
[TABLE]
Theorem 2.2
Under assumptions (C1)-(C7), if , then for any with , we have
[TABLE]
where
[TABLE]
Theorem 2.3
Under assumptions (C1)-(C7), if , then
[TABLE]
where is a -dimensional vector satisfying ; ; , is similar to , where is used in place of .
Theorem 2.2 indicates that the estimation error of follows an asymptotic normal distribution, and its asymptotic distribution is related to the sampling probability . Regardless of the correctness of the working correlation matrix specification, increasing the sample size or the number of individual observations will lead to better estimation results for the regression parameter estimator.
3 Optimal Poisson Sampling Algorithm
3.1 Optimal Poisson Sampling Strategy
To obtain the regression parameter estimator using (2.2), the Poisson sampling probabilities need to be specified. The optimal subsampling probabilities can be determined by minimizing , that is, by using the A-optimality criterion.
Theorem 3.4
Define , , and let be the order statistics of . If the subsampling probability is
[TABLE]
the value of is minimized, where,
[TABLE]
and
[TABLE]
The computation of in (3.1) is required only for those individuals satisfying the condition . In this case, is the number of individuals for which . If all individuals satisfy , then we can directly set . To reduce computational complexity, the next theorem establishes the L-optimal subsampling strategy.
Theorem 3.5
Define , , and let be the order statistics of . If the subsampling probability is
[TABLE]
then is minimized, where,
[TABLE]
and
[TABLE]
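The displayed formulas of Theorems 3.4 and 3.5 are missing from this copy, but the surrounding text describes probabilities built from order statistics of per-individual scores together with a threshold. The sketch below assumes the truncated form that is common in the optimal Poisson subsampling literature: each probability is proportional to the score, capped at a threshold `H` chosen so that the probabilities sum to the expected subsample size `r`. The function name and the generic `scores` argument are hypothetical; the A- and L-optimal criteria would differ only in how the scores are computed.

```python
def truncated_poisson_probs(scores, r):
    """Map nonnegative scores g_i to inclusion probabilities that sum to r.

    Assumed form: pi_i = min(1, g_i / H), where H is the threshold that makes
    the expected subsample size equal to r.
    """
    n = len(scores)
    if not 0 < r < n:
        raise ValueError("expected subsample size r must satisfy 0 < r < n")
    ordered = sorted(scores)          # g_(1) <= ... <= g_(n)
    prefix = [0.0]                    # prefix[s] = sum of the s smallest scores
    for g in ordered:
        prefix.append(prefix[-1] + g)
    # k counts the individuals whose probability is capped at 1: the smallest
    # k with (r - k) * g_(n-k) < g_(1) + ... + g_(n-k).
    k = 0
    while k < r and (r - k) * ordered[n - k - 1] >= prefix[n - k]:
        k += 1
    if k >= r or prefix[n - k] <= 0:
        raise ValueError("scores are too concentrated for this r")
    H = prefix[n - k] / (r - k)
    return [min(1.0, g / H) for g in scores]
```

For example, with nine scores equal to 1, one score equal to 91, and `r = 2`, the large score is capped at probability 1 and the remaining expected size of 1 is spread evenly over the other nine individuals. When `r` is much smaller than `n`, no score is usually capped and `k` stays at 0.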
3.2 Two-Step Algorithm
To simplify notation, let and denote or , and or , respectively, as defined in Theorems 3.4 and 3.5. Since the computation of is related to the true parameter , the optimal Poisson subsampling probability is
[TABLE]
which cannot be directly computed, where . To implement this procedure, a two-step algorithm is proposed. In the first step, a pilot subsample of expected size is drawn through uniform Poisson sampling, denoted as . Based on , and assuming an independent working correlation matrix, the resulting estimate is used as an initial approximation of . The working correlation matrix is accordingly replaced by , computed via the Gaussian pseudo-likelihood method. Next, the values of and can be computed. The purpose of is to control , and since is a common case, this implies that the situation where is very rare. Therefore, with sufficiently small subsampling rates, directly setting performs quite well. For the estimation of , it can be calculated as , where . When , this corresponds to the A-optimality criterion, when , this corresponds to the L-optimality criterion. Therefore, the optimal subsampling probability can be approximated by , where , , , , and are used in place of , , , , and .
To enhance the robustness of the estimator, we employ the shrinkage-based subsampling method studied by Ma et al. (2015), which combines the optimal subsampling probability with the uniform probability:
[TABLE]
where , and denotes the expected subsample size in the second step. In practice, it is possible that exceeds 1 due to the shrinkage adjustment. Therefore, the final subsampling probability is given by , and the final regression parameter estimator is denoted as . Algorithm 2 provides a full description of the two-step algorithm. The asymptotic properties of the regression parameter obtained from Algorithm 2 are presented in Theorem 3.6.
Algorithm 2: Two-Step Algorithm.
- Step 1: Use the uniform Poisson subsampling probability to draw the pilot sample and obtain , , , , and . Compute the optimal Poisson subsampling probability based on (3.4).
- Step 2: With the approximate optimal subsampling probabilities obtained from Step 1, draw the sample , and perform regression parameter estimation.
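The two steps can be illustrated on a deliberately simplified model. The sketch below is not the authors' implementation: it assumes a single scalar coefficient, an identity link, and an independence working correlation, so the weighted estimating equation has a closed-form solution. The score used for the second-step probabilities (the absolute value of each individual's estimating function at the pilot estimate) is an assumed stand-in for the L-optimal score, and all function names, the shrinkage weight `rho = 0.2`, and the data layout are hypothetical.

```python
import random


def weighted_beta(sample):
    """Solve sum_i w_i * sum_j x_ij * (y_ij - x_ij * b) = 0 for scalar b."""
    num = sum(w * sum(x * y for x, y in zip(xs, ys)) for xs, ys, w in sample)
    den = sum(w * sum(x * x for x in xs) for xs, ys, w in sample)
    return num / den


def two_step(data, r0, r, rho=0.2, seed=0):
    """data: list of (xs, ys) per individual; r0, r: expected sample sizes."""
    rng = random.Random(seed)
    n = len(data)

    # Step 1: uniform Poisson pilot sample with inclusion probability r0 / n.
    pilot = [(xs, ys, n / r0) for xs, ys in data if rng.random() < r0 / n]
    beta_pilot = weighted_beta(pilot)

    def score(xs, ys):
        # Assumed score: size of the individual's estimating function.
        return abs(sum(x * (y - x * beta_pilot) for x, y in zip(xs, ys)))

    # Average score, estimated from the pilot sample only.
    psi_hat = sum(score(xs, ys) for xs, ys, _ in pilot) / len(pilot)

    # Step 2: shrink towards uniform sampling, cap at 1, then draw and weight.
    sample = []
    for xs, ys in data:
        p = (1.0 - rho) * r * score(xs, ys) / (n * psi_hat) + rho * r / n
        p = min(p, 1.0)
        if rng.random() < p:
            sample.append((xs, ys, 1.0 / p))
    return weighted_beta(sample)
```

The uniform component `rho * r / n` keeps every probability bounded away from zero, so no single inverse-probability weight can dominate the estimating equation.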
Theorem 3.6
Under assumptions (C1)-(C7), if and the condition , then for any with , we have
[TABLE]
where
[TABLE]
and
[TABLE]
According to Theorem 3.6, the covariance matrix of can be estimated as , where and
4 Numerical Simulation
We assess the effectiveness of the optimal Poisson subsampling algorithm through simulation studies, considering a linear regression model in the context of high-dimensional longitudinal data:
[TABLE]
The true value . We consider three settings for the dimensionality: = 30, 50, and 70. The covariates are generated from two different distributions:
- (1) Case 1: follows a multivariate t-distribution with degrees of freedom, i.e., , where .
- (2) Case 2: follows a log-normal distribution .
The error term is generated from a multivariate normal distribution , where the correlation parameter is set to . We consider three different working correlation matrices : exchangeable (EX), AR(1), and MA(1).
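The three working correlation structures have standard definitions, sketched below for reference. The function name is hypothetical, and the correlation parameter passed in is whatever value the user chooses; the value used in the paper is not legible in this copy.

```python
def working_correlation(kind, m, rho):
    """Return an m x m correlation matrix as a list of lists.

    EX  (exchangeable): every off-diagonal entry equals rho.
    AR1 (autoregressive): entry (j, k) equals rho ** |j - k|.
    MA1 (moving average): rho at lag 1, zero at larger lags.
    """
    if kind not in ("EX", "AR1", "MA1"):
        raise ValueError("kind must be 'EX', 'AR1', or 'MA1'")
    R = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for k in range(m):
            lag = abs(j - k)
            if lag == 0:
                R[j][k] = 1.0
            elif kind == "EX":
                R[j][k] = rho
            elif kind == "AR1":
                R[j][k] = rho ** lag
            else:  # MA1
                R[j][k] = rho if lag == 1 else 0.0
    return R
```

Fitting with a structure different from the one that generated the errors, for example `working_correlation("EX", 5, 0.5)` against AR(1) errors, reproduces the misspecified settings studied in the simulations.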
The number of individuals is set to 10000, with each individual having 5 observations, . The pilot subsample size is set to , and the second-stage subsample size is chosen from 100, 200, 400, 600, 800, and 1000. The mean squared error (MSE) is calculated across 1000 simulation replications to assess the performance of the subsampling methods. Here, denotes the estimated parameter obtained from the -th subsample,
[TABLE]
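The MSE display is missing from this copy. A common definition, assumed here, is the squared Euclidean distance between each replicate estimate and the true parameter, averaged over replications. The function name is hypothetical.

```python
def empirical_mse(estimates, beta_true):
    """Average squared Euclidean error of replicate estimates.

    estimates: list of coefficient vectors, one per simulation replication.
    beta_true: the true coefficient vector.
    """
    total = 0.0
    for est in estimates:
        total += sum((e - b) ** 2 for e, b in zip(est, beta_true))
    return total / len(estimates)
```

For instance, `empirical_mse([[1, 2], [3, 4]], [1, 2])` averages squared errors of 0 and 8 and gives 4.0.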
We present the experimental results under Case 1, with AR(1) and EX as the true correlation structures, in Figures 1-2. For example, AR(1)-EX indicates that the left side shows the true correlation matrix, while the right side denotes the working correlation matrix. The uniform Poisson subsampling method is denoted as pUnif, the A-optimal Poisson subsampling as pMV, and the L-optimal Poisson subsampling as pMVc.
The results indicate that as the subsample size increases, the log(MSE) of all three methods decreases gradually. Both pMV and pMVc outperform pUnif significantly, with pMV consistently performing slightly better than pMVc. Even when the working correlation matrix is misspecified, the proposed methods still demonstrate robust performance. We display the experimental results under Case 2, with AR(1) and EX as the true correlation matrices, in Figures 3–4. Similar to the results in Case 1, both pMV and pMVc significantly outperform the uniform Poisson subsampling method.
We summarize the computation time under Case 1 for true correlation matrix AR(1) with and in Table 1. The results show that the computation times for the three Poisson subsampling methods increase with the sample size, but all remain smaller than the computation time using the full dataset. Among them, pMVc exhibits a lower computational cost compared to pMV.
5 Real Data Analysis
The empirical analysis in this paper is based on data from the China Household Finance Survey (CHFS), which is administered by Southwest University of Finance and Economics (https://chfs.swufe.edu.cn). We perform the analysis at the household level, using total household income (Total_income) as the response variable. The independent variables include the household head’s residence area (Rural), age (Age), gender (Gender), marital status (Marry), education level (Edu), health status (Unhealth), pension insurance coverage (Endowment_insurance) and medical insurance coverage (Medinsurance); as well as household-level variables such as total number of family members (Familynum), number of unhealthy family members (Unhealthnum), utility expenditures (Expenditure1), consumption expenditures (Expenditure2), and financial assets (Finanasset). The dataset comprises 9,753 households, with three repeated observations per household.
The corresponding model is as follows:
[TABLE]
Given that the true values of the model parameters are unobservable in real-world datasets, we use the parameter estimates from the full dataset in place of the unknown true values. Figure 5 presents the estimation results of pUnif, pMV, and pMVc under different working correlation matrices. The results indicate that pMV and pMVc consistently outperform pUnif, and that pMV and pMVc perform similarly. Optimal Poisson subsampling is therefore the more effective choice.
Acknowledgements
Chunjing Li was partly supported by the National Social Science Fund of China (24BTJ061) and Scientific Research Project of Jilin Provincial Department of Education (JJKH20250702KJ). Xiaohui Yuan was partly supported by the National Social Science Fund of China (22BTJ019) and Scientific Research Project of Jilin Provincial Department of Science and Technology (20250102029JC).
Appendix
In the context of the subsample, the weighted generalized estimating equation (2.2) can be reformulated as:
[TABLE]
To prove these theorems, we first establish several lemmas.
Lemma A.1
Under assumptions (C1)-(C3), if as , then
[TABLE]
The proof of Lemma 1. We first prove that
[TABLE]
where
[TABLE]
According to the Central Limit Theorem, it holds that . Combining with (A.1), we can derive that . Note that . We have
[TABLE]
where . Thus,
[TABLE]
By the triangle inequality, we observe
[TABLE]
then, applying this result yields
[TABLE]
Thus, we obtain the following result:
[TABLE]
By the Cauchy-Schwarz inequality,
[TABLE]
Since . Therefore,
[TABLE]
Let denote a generic positive constant. We use this notation consistently throughout the paper. Hence . Similarly, note that
[TABLE]
Therefore, we have . Similarly, . We next analyze . By applying the Cauchy–Schwarz inequality, we have
[TABLE]
Let , we have, , then
[TABLE]
Therefore,
[TABLE]
where . Hence, under Assumption (C3) and Lemma 3, , that is, (A.1) is proved. Since . Therefore, we have .
Lemma A.2
Under assumptions (C1)-(C7), if , then
The proof of Lemma 2. Let denote . Then
[TABLE]
where . Since
[TABLE]
Thus, we have Combining with assumptions (C4) yields which completes the proof.
Lemma A.3
If , where is the solution to , then
The proof of Lemma 3. It suffices to show that for any , there exists such that for sufficiently large , we have
[TABLE]
Since
[TABLE]
where We now analyse the term Applying the Cauchy–Schwarz inequality yields Note that
[TABLE]
Therefore, Next, we consider
[TABLE]
where We have
[TABLE]
and
[TABLE]
where When is sufficiently large, is dominated by , and for sufficiently large , this value can be made negative.
Lemma A.4
Let , we have
[TABLE]
where
[TABLE]
with
[TABLE]
The proof of Lemma 4. See Xie & Yang (2003).
Lemma A.5
Under assumptions (C1)-(C7), if , then for any and , we have
[TABLE]
The matrix is symmetric, and these results follow directly,
[TABLE]
The proof of Lemma 5. Let ,, be defined analogously to , , , but with replacing . The proof can be completed by establishing the following three asymptotic results:
[TABLE]
For (A.2), we observe that
[TABLE]
By assumptions (C2) and (C4), (A.2) is thus proved. Similarly, (A.3) and (A.4) can be verified.
Lemma A.6
Under assumptions (C1)-(C7), if , then for any and , we have
[TABLE]
The proof of Lemma 6. The proof follows directly from Lemma 3.4 in Wang (2011).
Lemma A.7
Under assumptions (C1)-(C7), if , then for any with , we have
[TABLE]
The proof of Lemma 7. Let , where
[TABLE]
Note that We have
[TABLE]
and
[TABLE]
Thus, we obtain Let where
[TABLE]
Then, Note that
[TABLE]
Let
[TABLE]
Since
[TABLE]
Thus, We now verify the Lyapunov condition.
[TABLE]
Observe that Thus,
[TABLE]
Lemma A.8
Under assumptions (C1)-(C7), if , we have
The proof of Lemma 8. Let Then
[TABLE]
Note that
[TABLE]
and
[TABLE]
Applying Chebyshev’s inequality, Lemma 8 is thus proved.
Appendix A Proofs of the Theorems
**The proof of Theorem 2.1.** We now show that, for any , there exists a constant such that when is sufficiently large,
[TABLE]
Note
[TABLE]
where . Next, we have
[TABLE]
By the Cauchy-Schwarz inequality, we have . Furthermore,
[TABLE]
Therefore, . Consequently, . By Lemma 2, we can derive that . Next, we discuss ,
[TABLE]
Note that by Lemma 5, we have
[TABLE]
For we have
[TABLE]
By Lemmas 5 and 6, Finally, under assumptions (C3) and (C7), we analyze
[TABLE]
Therefore, for on the set , it is dominated by and . When is sufficiently large, it can be negative. This finishes the proof of Theorem 2.1.
**The proof of Theorem 2.2.** First, we prove
[TABLE]
We have
[TABLE]
In the second equality, since , Taylor expansion yields
[TABLE]
By Lemma 7, Thus, to prove (A.6), it suffices to show that for any ,
[TABLE]
and
[TABLE]
First, we prove (A.8). By Lemma 2 and (A.5), observe that
[TABLE]
Next, we prove (A.7),
[TABLE]
By the Cauchy-Schwarz inequality and Lemma 5, we have
[TABLE]
By Lemmas 5 and 6, we can also derive . Thus, (A.6) is proved. By Lemma 8, we have . Combining this with Slutsky’s theorem yields This finishes the proof of Theorem 2.2.
**The proof of Theorem 2.3.** It suffices to show that for any ,
[TABLE]
In our proof, we employ Theorem 2.1. Let , where
[TABLE]
Therefore, (A.9) can be derived from . Furthermore, we have
[TABLE]
To analyze the eigenvalues of ,
[TABLE]
Note that
[TABLE]
It can be seen that
[TABLE]
We have and
[TABLE]
and
[TABLE]
Therefore, we have
[TABLE]
Similarly, and . Hence,
[TABLE]
Similarly, we have . Finally, observe that
[TABLE]
Therefore, we have . Using
[TABLE]
we also obtain This finishes the proof of Theorem 2.3.
**The proof of Theorems 3.4 and 3.5.** If some elements of equal zero, their associated subsampling probabilities are set to zero, and the subsampling probabilities of the remaining individuals are considered. Thus, we assume all , which does not affect generality.
To minimize , which corresponds to the asymptotic mean squared error, the following optimization problem needs to be solved:
[TABLE]
For simplicity, define as We assume an ordered sequence , which does not restrict generality. Applying the Cauchy-Schwarz inequality,
[TABLE]
Equality holds if and only if . Thus, when the condition is satisfied, and provides the optimal solution.
Otherwise, if , then set . Thus, the optimization problem above can be reformulated as an optimization problem for :
[TABLE]
This problem follows an iterative structure, where the optimal solution minimizes the objective function for some such that
[TABLE]
Assume that exists such that
[TABLE]
and that . It follows that .
Substituting into the objective of the original optimization problem, we obtain
[TABLE]
Thus, the original objective attains its minimum when is used.
Next, we show that there exists a value such that . Observe that satisfies
[TABLE]
Setting gives
[TABLE]
This leads to Similarly, setting yields Since the function is continuous in given , the existence of is guaranteed.
On the other hand, for any , it follows that From this, it can be derived that Thus, given , the function is non-increasing. Therefore,
[TABLE]
which confirms that .
Since the proof of Theorem 3.5 follows similar arguments, it is omitted here.
**The proof of Theorem 3.6.** The condition ensures that . The consistency of the estimator is guaranteed by Theorem 2.2. Since , it suffices to focus on the subsample drawn in the second step. The primary difference between and lies in the replacement terms, namely , and . Under Assumptions (C1) to (C7), the consistency of follows, and , , and are also consistent estimators of
[TABLE]
Thus, by Theorem 2.2 and the continuous mapping theorem, asymptotic normality is established.
References
- [1] Ai M, Yu J, Zhang H, Wang H. Optimal subsampling algorithms for big data regressions. Statistica Sinica, 2021, 31(2): 749-772.
- [2] Balan R M, Schiopu-Kratina I. Asymptotic results with generalized estimating equations for longitudinal data. The Annals of Statistics, 2005, 33(2): 522-541.
- [3] Chaganty N R. An alternative approach to the analysis of longitudinal data via generalized estimating equations. Journal of Statistical Planning and Inference, 1997, 63(1): 39-54.
- [4] Fithian W, Hastie T. Local case-control sampling: Efficient subsampling in imbalanced data sets. The Annals of Statistics, 2014, 42(5): 1693-1724.
- [5] Han H, Fu L. Optimal subsampling algorithm for the marginal model with large longitudinal data. arXiv preprint, 2023, arXiv:2311.08812.
- [6] Gao J, Wang L, Lian H. Optimal decorrelated score subsampling for generalized linear models with massive data. Science China Mathematics, 2024, 67: 405-430.
- [7] Liang K Y, Zeger S L. Longitudinal data analysis using generalized linear models. Biometrika, 1986, 73(1): 13-22.
- [8] Li B. On the consistency of generalized estimating equations. Lecture Notes-Monograph Series, 1997, 32: 115-136.
