Optional subsampling for generalized estimating equations in growing-dimensional longitudinal Data

Chunjing Li; Jiahui Zhang; Xiaohui Yuan

arXiv:2508.20803·stat.CO·August 29, 2025

TL;DR

This paper introduces an optimal Poisson subsampling method for generalized estimating equations to efficiently analyze large-scale longitudinal data with high-dimensional covariates, addressing computational challenges.

Contribution

It develops a novel subsampling algorithm with proven asymptotic properties and practical two-step probability construction for large-scale longitudinal data analysis.

Findings

01

Method remains effective under misspecified correlation matrices.

02

Achieves computational efficiency in large datasets.

03

Demonstrated successful application on real CHFS data.

Tables

Table 1: Computation time under case 1 with true correlation matrix AR(1) at different dimensions.

$r_{2}$	Method	$p_{n} = 30$			$p_{n} = 50$
$r_{2}$	Method	EX	AR(1)	MA(1)	EX	AR(1)	MA(1)
100	pUnif	0.206	0.222	0.529	0.250	0.273	0.775
	pMV	0.713	0.699	1.088	0.814	0.827	1.277
	pMVc	0.657	0.642	0.984	0.706	0.715	1.151
200	pUnif	0.354	0.345	0.790	0.402	0.450	1.001
	pMV	0.884	0.864	1.370	0.980	0.976	1.640
	pMVc	0.804	0.781	1.281	0.844	0.862	1.469
400	pUnif	0.686	0.683	1.432	0.713	0.710	1.604
	pMV	1.268	1.234	2.058	1.420	1.384	2.504
	pMVc	1.176	1.159	1.946	1.280	1.232	2.241
600	pUnif	1.073	1.050	2.127	1.102	1.051	2.229
	pMV	1.681	1.686	2.629	1.936	1.861	3.488
	pMVc	1.588	1.555	2.494	1.772	1.676	3.134
800	pUnif	1.470	1.384	2.688	1.518	1.492	3.088
	pMV	2.148	2.151	3.218	2.469	2.399	4.450
	pMVc	2.005	1.991	3.038	2.249	2.195	4.007
1000	pUnif	1.890	1.827	3.298	1.985	1.868	4.069
	pMV	2.636	2.583	3.923	3.019	2.886	5.253
	pMVc	2.482	2.489	3.731	2.787	2.633	4.845
full time		15.427	15.917	22.448	15.593	16.109	22.629

Equations

$$\operatorname{cov}(\mathbf{Y}_{i})=\mathbf{V}_{i}(\boldsymbol{\beta})=\phi\,\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta})\,\mathbf{R}\,\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}),$$

$$\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}(\boldsymbol{\beta})\mathbf{V}_{i}^{-1}(\boldsymbol{\beta})\left(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta})\right)=0.$$

$$\mathbf{S}_{n}(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta})\hat{\mathbf{R}}^{-1}\mathbf{A}_{i}^{-1/2}(\boldsymbol{\beta})\left(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta})\right)=0,$$

$$\hat{\boldsymbol{\beta}}^{(k)}=\Bigg(\sum_{i=1}^{n}$$

$$\times\Bigg(\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\hat{\boldsymbol{\beta}}^{(k-1)})\left(\hat{\mathbf{R}}^{(k)}\right)^{-1}\boldsymbol{\varepsilon}_{i}(\hat{\boldsymbol{\beta}}^{(k-1)})\Bigg).$$

$$\mathbf{S}_{r}(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=1}^{r^{*}}\frac{1}{\pi_{i}^{*}}\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\boldsymbol{\beta})\tilde{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}^{*}(\boldsymbol{\beta})=0.$$

$$\tilde{\boldsymbol{\beta}}^{(k)}=\Bigg(\sum_{i=1}^{r^{*}}$$

$$\times\Bigg(\sum_{i=1}^{r^{*}}\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\tilde{\boldsymbol{\beta}}^{(k-1)})\left(\tilde{\mathbf{R}}^{(k)}\right)^{-1}\boldsymbol{\varepsilon}_{i}^{*}(\tilde{\boldsymbol{\beta}}^{(k-1)})\Bigg).$$

$$b_{1}\leq\lambda_{\min}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq b_{2},$$

$$b_{3}\leq\lambda_{\min}\left(\frac{r}{n}\sum_{i=1}^{n}\frac{1}{n\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq\lambda_{\max}\left(\frac{r}{n}\sum_{i=1}^{n}\frac{1}{n\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq b_{4},$$

$$E\left(\left\|\mathbf{A}_{i}^{-1/2}(\boldsymbol{\beta})\left(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta})\right)\right\|^{2+\delta}\right)\leq M_{1}.$$

$$\left\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\right\|=O_{p}\left(\sqrt{p_{n}/r}\right).$$

$$\mathbf{c}_{n}^{T}\bar{\mathbf{M}}_{r}^{-1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{H}}_{n}(\boldsymbol{\beta}_{0})(\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0})\xrightarrow{d}N(0,1),$$

$$\bar{\mathbf{M}}_{r}(\boldsymbol{\beta}_{0})=\frac{1}{n^{2}}\sum_{i=1}^{n}\frac{1}{\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}(\boldsymbol{\beta}_{0})\boldsymbol{\varepsilon}_{i}^{T}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\mathbf{X}_{i},$$

$$\bar{\mathbf{H}}_{n}(\boldsymbol{\beta}_{0})=\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\mathbf{X}_{i}.$$

$$\mathbf{c}_{n}\tilde{\boldsymbol{\Sigma}}\mathbf{c}_{n}^{T}-\mathbf{c}_{n}\boldsymbol{\Sigma}\mathbf{c}_{n}^{T}=o_{p}(1),$$

$$\pi_{i}^{MV}$$

$$T=\sum_{i=1}^{n-w}h_{(i)}^{MV}/(r-w),$$

$$w=\min\left\{s\mid 0\leq s\leq r,\ h_{(n-s)}^{MV}<\sum_{i=1}^{n-s}h_{(i)}^{MV}/(r-s)\right\}.$$

$$\pi_{i}^{MVc}$$

$$T=\sum_{i=1}^{n-w}h_{(i)}^{MVc}/(r-w),$$

$$w=\min\left\{s\mid 0\leq s\leq r,\ h_{(n-s)}^{MVc}<\sum_{i=1}^{n-s}h_{(i)}^{MVc}/(r-s)\right\}.$$

$$\pi_{i}^{os}$$

$$\hat{\pi}_{i}^{sos}=(1-\rho)\frac{r_{2}\left\|\mathbf{L}\mathbf{H}_{r_{1}}^{-1}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\boldsymbol{\varepsilon}_{i}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\right\|}{n\hat{\Psi}}+\rho\frac{r_{2}}{n},$$

$$\mathbf{c}_{n}^{T}\left(\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})\bar{\mathbf{M}}_{r}^{L}(\boldsymbol{\beta}_{0})\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})\right)^{-1/2}(\breve{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0})\xrightarrow{d}N(0,1),$$

$$\bar{\mathbf{M}}_{r}^{L}(\boldsymbol{\beta}_{0})=\frac{1}{n^{2}}\sum_{i=1}^{n}\frac{\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}(\boldsymbol{\beta}_{0})\boldsymbol{\varepsilon}_{i}^{T}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\mathbf{X}_{i}}{\pi_{i}^{sos}\wedge 1},$$

$$\pi_{i}^{sos}=(1-\rho)\frac{r_{2}\left\|\mathbf{L}\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}(\boldsymbol{\beta}_{0})\right\|}{\sum_{j=1}^{n}\left\|\mathbf{L}\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})\mathbf{X}_{j}^{T}\mathbf{A}_{j}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{j}(\boldsymbol{\beta}_{0})\right\|}+\rho\frac{r_{2}}{n}.$$

$$\mathbf{Y}_{i}=\mathbf{X}_{i}\boldsymbol{\beta}+\boldsymbol{\varepsilon}_{i},\quad i=1,2,\dots,n.$$

$$\mathrm{MSE}=\frac{1}{1000}\sum_{s=1}^{1000}\left\|\boldsymbol{\beta}_{0}-\tilde{\boldsymbol{\beta}}^{(s)}\right\|^{2}.$$

$$\text{Total\_income}_{ij}$$

$$+\beta_{6}\text{Unhealth}_{ij}+\beta_{7}\text{Familynum}_{ij}+\beta_{8}\text{Unhealthnum}_{ij}$$

$$+\beta_{9}\text{Endowment\_insurance}_{ij}+\beta_{10}\text{Medinsurance}_{ij}+\beta_{11}\text{Expenditure1}_{ij}$$

$$+\beta_{12}\text{Expenditure2}_{ij}+\beta_{13}\text{Finanasset}_{ij},$$

Full text

Optional subsampling for generalized estimating equations in growing-dimensional longitudinal Data

**Chunjing Li**

School of Mathematics and Statistics, Changchun University of Technology, China

[email protected]

**Jiahui Zhang**

School of Mathematics and Statistics, Changchun University of Technology, China

[email protected]

**Xiaohui Yuan∗**

School of Mathematics and Statistics, Changchun University of Technology, China

[email protected]

This version: August 28, 2025

∗Corresponding author, † equal authors contribution.

Abstract

As a powerful tool for longitudinal data analysis, the generalized estimating equations have been widely studied in the academic community. However, in large-scale settings, this approach faces pronounced computational and storage challenges. In this paper, we propose an optimal Poisson subsampling algorithm for generalized estimating equations in large-scale longitudinal data with diverging covariate dimension, and establish the asymptotic properties of the resulting estimator. We further derive the optimal Poisson subsampling probability based on A- and L-optimality criteria. An approximate optimal Poisson subsampling algorithm is proposed, which adopts a two-step procedure to construct these probabilities. Simulation studies are conducted to evaluate the performance of the proposed method under three different working correlation matrices. The results show that the method remains effective even when the working correlation matrices are misspecified. Finally, we apply the proposed method to the CHFS dataset to illustrate its empirical performance.

Keywords:

longitudinal data; generalized estimating equations; growing dimension; massive data; Poisson subsampling

1 Introduction

Longitudinal data are commonly encountered in medical research, economics, and the social sciences, and have garnered significant attention in statistical research. Liang & Zeger (1986) developed generalized estimating equations (GEE) for the analysis of longitudinal data, extending quasi-likelihood approaches by incorporating a working correlation matrix to account for within-subject dependence. The resulting estimators remain consistent despite potential misspecification of the working correlation matrix. Chaganty (1997) demonstrated that the parameter estimates are consistent and asymptotically normal. Li (1997) studied the asymptotic properties of GEE estimates using the maxmin method. Xie & Yang (2003) analyzed the asymptotic properties of GEE in the case of a single covariate, as the number of individuals, the number of observations per individual, or both grow to infinity. Balan & Schiopu-Kratina (2005) employed pseudo-likelihood equations to demonstrate the existence, weak consistency, and asymptotic normality of GEE estimators when the covariate dimension is fixed. For the analysis of high-dimensional longitudinal data, Wang (2011) extended the asymptotic properties of GEE estimators with a binary response variable to the case where the number of covariates grows to infinity. Wang et al. (2012) considered penalized GEE for analyzing longitudinal data with high-dimensional covariates.

With the rapid advancement of economic development and information technology, data collection capabilities have significantly improved, leading to a dramatic increase in the scale of longitudinal data. As a consequence, we are facing not only the difficulties associated with high dimensionality but also the computational and storage challenges brought about by the explosive growth in data volume. Take the China Household Finance Survey (CHFS) as an example: since its launch in 2011, it has covered over 40,000 households and conducted follow-up surveys every two years, resulting in millions of observations across thousands of economic and social variables, such as assets, liabilities, and consumption, that capture household financial behavior over time. Similarly, the U.S. National Health and Nutrition Examination Survey (NHANES) has accumulated tens of thousands of longitudinal records with hundreds of health indicators, further illustrating the pressure of rapidly increasing sample sizes alongside high dimensionality. These characteristics render conventional storage and analytical approaches computationally infeasible, highlighting the urgent need for more efficient algorithms, scalable computing frameworks, and distributed processing techniques to effectively handle modern longitudinal data.

To efficiently process such large-scale datasets, several methods have been proposed, including divide-and-conquer strategies (Lin & Xi 2011; Xu et al. 2020), online updating techniques for streaming data (Luo et al. 2023; Schifano et al. 2016), and subsampling approaches (Fithian & Hastie 2014; Ma et al. 2015; Wang et al. 2018). Among them, subsampling methods have received significant attention for their effectiveness in reducing resource consumption and preserving data representativeness, and have resulted in substantial theoretical and practical achievements. For cross-sectional data, Fithian & Hastie (2014) introduced a Poisson sampling method in the context of logistic regression. Ma et al. (2015) conducted a statistical analysis of leverage-based subsampling. Wang et al. (2018) derived optimal subsampling probabilities for logistic regression based on the A-optimality criterion, and subsequently proposed a two-step adaptive algorithm aimed at approximating this optimal subsampling scheme. Yu et al. (2022) constructed optimal Poisson subsampling probabilities for pseudo-likelihood estimation, guided by A- and L-optimality criteria, and developed a distributed framework to handle data partitioned across multiple blocks or locations. Yao & Wang (2019) and Yao et al. (2023) applied optimal subsampling and Poisson-based subsampling methods to softmax regression models. Ai et al. (2021) applied optimal subsampling methods to generalized linear models. Yuan et al. (2024) incorporated subsampling strategies into distributed composite quantile regression frameworks. For longitudinal data, Wang et al. (2023) developed a new subsampling strategy that incorporates leverage and gradient information. Han & Fu (2023) developed optimal subsampling algorithms for marginal models.

Recently, for high-dimensional data, Gao et al. (2024) and Shan & Wang (2024) investigated subsampling strategies based on decorrelated score approaches for generalized linear models. Li et al. (2024) investigated a Poisson-based subsampling method for expectile regression in large-scale data. To the best of our knowledge, no prior work has explored optimal subsampling algorithms in growing-dimensional longitudinal settings. We aim to develop an optimal Poisson subsampling algorithm for GEE with high-dimensional covariates. This study makes the following main contributions: (i) We propose a Poisson subsampling algorithm for GEE in growing-dimensional longitudinal data and establish the consistency and asymptotic normality of the resulting estimator. (ii) We further develop a two-step algorithm aimed at approximating the optimal Poisson sampling probabilities, thereby extracting more informative subsamples for estimation.

The remainder of this paper is organized as follows. In Section 2, we establish the asymptotic properties of the general Poisson subsampling estimator. Section 3 presents the optimal Poisson subsampling probabilities, which are determined according to the A- and L-optimality criteria. Simulation studies in Section 4 demonstrate the effectiveness of the proposed method. In Section 5, we apply the proposed method to the CHFS dataset. The proofs can be found in the Appendix.

2 Poisson subsampling method based on generalized estimating equations

2.1 Generalized estimating equations

For $i=1,\cdots,n$ and $j=1,\cdots,m_{i}$ , let $y_{ij}$ denote the response variable and $\mathbf{x}_{ij}$ represent a covariate vector with diverging dimension $p_{n}$ . Define $\mathbf{Y}_{i}=\left(y_{i1},\cdots,y_{im_{i}}\right)^{T}$ and $\mathbf{X}_{i}=\left(\mathbf{x}_{i1},\cdots,\mathbf{x}_{im_{i}}\right)^{T}$ . Without loss of generality, we assume that $m_{1}=\cdots=m_{n}=m$ . The conditional expectation of $y_{ij}$ is given by $E\left(y_{ij}\mid\mathbf{x}_{ij}\right)=\mu_{ij}$ , where $\mu_{ij}=g\left(\eta_{ij}\right).$ Here, $g\left(\cdot\right)$ is a known link function, $\eta_{ij}=\mathbf{x}_{ij}^{T}\boldsymbol{\beta}$ , and $\boldsymbol{\beta}\in\mathcal{B}\subset R^{p_{n}}$ is the regression parameter vector. Observations within the same individual are assumed to be correlated, whereas those from different individuals are independent. Let $\boldsymbol{\mu}_{i}=\left(\mu_{i1},\cdots,\mu_{im}\right)^{T}$ be the marginal mean vector for the $i$ -th individual, and the covariance of the response variable $\mathbf{Y}_{i}$ is given by:

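$$\operatorname{cov}(\mathbf{Y}_{i})=\mathbf{V}_{i}(\boldsymbol{\beta})=\phi\,\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta})\,\mathbf{R}\,\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}),$$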

where $\mathbf{A}_{i}=\mathrm{diag}\left(\mathrm{Var}\left(y_{i1}\right),\cdots,\mathrm{Var}\left(y_{im}\right)\right)$ is a diagonal matrix, $\mathbf{R}$ is the true correlation matrix of the response variable $\mathbf{Y}_{i}$ , and $\phi$ is the dispersion parameter, which may be known or unknown.
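As a concrete illustration of this covariance structure, the sketch below builds three common correlation patterns and the resulting covariance in the standard GEE form $\mathbf{V}_{i}=\phi\mathbf{A}_{i}^{1/2}\mathbf{R}\mathbf{A}_{i}^{1/2}$ . The function names, the single parameter `alpha`, and the reading of EX, AR(1), and MA(1) as exchangeable, first-order autoregressive, and first-order moving-average structures are assumptions made for illustration, not definitions taken from the paper.

```python
import numpy as np

def working_correlation(kind: str, m: int, alpha: float) -> np.ndarray:
    """Return an m x m correlation matrix of the named structure (hypothetical helper)."""
    idx = np.arange(m)
    lag = np.abs(idx[:, None] - idx[None, :])      # |j - k| for every pair of time points
    if kind == "EX":                               # exchangeable: constant off-diagonal alpha
        R = np.where(lag == 0, 1.0, alpha)
    elif kind == "AR1":                            # AR(1): alpha ** |j - k|
        R = alpha ** lag
    elif kind == "MA1":                            # MA(1): alpha on the first off-diagonals only
        R = np.where(lag == 0, 1.0, np.where(lag == 1, alpha, 0.0))
    else:
        raise ValueError(f"unknown structure: {kind}")
    return R.astype(float)

def covariance(var_diag, R, phi=1.0):
    """V_i = phi * A_i^{1/2} R A_i^{1/2} with A_i = diag(var_diag)."""
    a_half = np.sqrt(np.asarray(var_diag, dtype=float))
    return phi * (a_half[:, None] * R * a_half[None, :])
```

For example, `covariance([4.0, 1.0, 9.0], working_correlation("AR1", 3, 0.5), phi=2.0)` scales each correlation entry by the two marginal standard deviations and by the dispersion.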

Liang & Zeger (1986) introduced the generalized estimating equation, which takes the following form:

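$$\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}(\boldsymbol{\beta})\mathbf{V}_{i}^{-1}(\boldsymbol{\beta})\left(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta})\right)=0.$$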

We use $\hat{\mathbf{R}}$ to denote the estimated working correlation matrix, and define the GEE estimator $\hat{\boldsymbol{\beta}}$ by:

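$$\mathbf{S}_{n}(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta})\hat{\mathbf{R}}^{-1}\mathbf{A}_{i}^{-1/2}(\boldsymbol{\beta})\left(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta})\right)=0,$$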

where $\boldsymbol{\varepsilon}_{i}\left(\boldsymbol{\beta}\right)=\mathbf{A}_{i}^{-1/2}\left(\boldsymbol{\beta}\right)(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}\left(\boldsymbol{\beta}\right))$ , $\hat{\mathbf{R}}^{(k)}$ and $\hat{\boldsymbol{\beta}}^{(k)}$ represent the values obtained during the $k$ -th update. Given $\hat{\mathbf{R}}^{(k)}$ and $\hat{\boldsymbol{\beta}}^{(k-1)}$ , the estimator $\hat{\boldsymbol{\beta}}^{(k)}$ can be computed by:

[TABLE]

We repeat the iterative procedure until the norm of the difference between successive estimates of $\boldsymbol{\beta}$ is less than $10^{-4}$ . The resulting estimate corresponds to the previously defined $\hat{\boldsymbol{\beta}}$ .
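The iteration can be sketched for the simplest case. The code below is a minimal illustration, assuming an identity link with $\mathbf{A}_{i}=\mathbf{I}$ , a standard Fisher-scoring-type step, and an unstructured moment estimate of the correlation matrix as a stand-in for the estimator used in the paper. The names `gee_linear` and `moment_correlation` are hypothetical.

```python
import numpy as np

def moment_correlation(resid):
    """Unstructured moment estimate of the m x m correlation of residual rows (stand-in)."""
    S = resid.T @ resid / resid.shape[0]
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

def gee_linear(X, Y, tol=1e-4, max_iter=50):
    """X: (n, m, p) covariates, Y: (n, m) responses; identity link, A_i = I (assumed)."""
    n, m, p = X.shape
    # working-independence start: ordinary least squares on the stacked data
    beta = np.linalg.lstsq(X.reshape(n * m, p), Y.reshape(n * m), rcond=None)[0]
    for _ in range(max_iter):
        resid = Y - X @ beta                          # eps_i for every individual, (n, m)
        R_inv = np.linalg.inv(moment_correlation(resid))
        XtRi = np.einsum("imp,mk->ipk", X, R_inv)     # X_i^T R^{-1}, shape (n, p, m)
        H = np.einsum("ipk,ikq->pq", XtRi, X)         # sum_i X_i^T R^{-1} X_i
        s = np.einsum("ipk,ik->p", XtRi, resid)       # sum_i X_i^T R^{-1} eps_i
        step = np.linalg.solve(H, s)
        beta = beta + step
        if np.linalg.norm(step) < tol:                # stop when successive estimates agree
            break
    return beta
```

Each pass costs on the order of $nmp_{n}^{2}$ operations for the matrix `H`, which is where the dependence on $n$ discussed next comes from.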

To obtain consistent parameter estimates, the computational complexity is at least $O(c\cdot nmp_{n}^{2})$ , where $c$ denotes the number of iterations. As the number of individuals $n$ increases, the computational burden also increases. Typically, subsampling algorithms help to reduce computational costs. The Poisson subsampling method can avoid memory overflow issues while maintaining efficient parameter estimation. Therefore, the next subsection will introduce the Poisson subsampling method for parameter estimation.

2.2 Poisson subsampling algorithm

Let $D=\left\{\left(\mathbf{X}_{i}^{*},\mathbf{Y}_{i}^{*},\pi_{i}^{*}\right)\right\}_{i=1}^{r^{*}}$ represent the subsample dataset, where $\pi_{i}^{*}$ denotes the sampling probability for individual $i$ . Let $\tilde{\mathbf{R}}$ and $\tilde{\boldsymbol{\beta}}$ represent the estimates based on the subsample; $\tilde{\mathbf{R}}^{(k)}$ and $\tilde{\boldsymbol{\beta}}^{(k)}$ represent the values obtained during the $k$ -th update. Given the subsample $D$ , the weighted generalized estimating equation takes the following form:

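$$\mathbf{S}_{r}(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=1}^{r^{*}}\frac{1}{\pi_{i}^{*}}\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\boldsymbol{\beta})\tilde{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}^{*}(\boldsymbol{\beta})=0.\tag{2.2}$$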

Under the assumption of a working independence correlation matrix, the initial estimate of $\boldsymbol{\beta}$ can be directly obtained. $\tilde{\mathbf{R}}^{(k)}$ is estimated using the Gaussian pseudo-likelihood method. With $\tilde{\mathbf{R}}^{(k)}$ and $\tilde{\boldsymbol{\beta}}^{(k-1)}$ , the value of $\tilde{\boldsymbol{\beta}}^{(k)}$ can be estimated as:

[TABLE]

Equation (2.3) is iteratively applied until $\|\tilde{\boldsymbol{\beta}}^{(k+1)}-\tilde{\boldsymbol{\beta}}^{(k)}\|<10^{-4}$ . The resulting estimate corresponds to the previously defined $\tilde{\boldsymbol{\beta}}$ . We illustrate the steps of a general Poisson subsampling algorithm in Algorithm 1.

Algorithm 1 General Poisson Subsampling Algorithm.

• Step 1: Initialize the set $\mathcal{D}=\emptyset$ . For each $i=1,\dots,n$ , generate an independent Bernoulli random variable $\delta_{i}\sim\text{Bernoulli}(\pi_{i})$ . If $\delta_{i}=1$ , include the triplet $(\mathbf{X}_{i},\mathbf{Y}_{i},\pi_{i})$ in the set $\mathcal{D}$ .

• Step 2: Based on the subsample $\mathcal{D}$ , we use equation (2.3) to solve the weighted generalized estimating equation in (2.2) and obtain the regression parameter estimate $\tilde{\boldsymbol{\beta}}$ .

Whether the observation $(\mathbf{X}_{i}^{*},\mathbf{Y}_{i}^{*})$ of individual $i$ is included in the subsample depends solely on its own probability $\pi_{i}$ , without considering the sampling probabilities of other individuals. In Algorithm 1, a random variable is generated through a Bernoulli trial to decide whether the observation $(\mathbf{X}_{i}^{*},\mathbf{Y}_{i}^{*})$ of individual $i$ is included in the subsample. Therefore, for massive data sets, the Poisson subsampling method alleviates memory constraint issues.
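The two steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under simplifying assumptions: an identity link with $\mathbf{A}_{i}=\mathbf{I}$ and a fixed, user-supplied $\mathbf{R}^{-1}$ , so that the weighted estimating equation has a closed-form solution. The function names are hypothetical.

```python
import numpy as np

def poisson_subsample(pi, rng):
    """Indices i with delta_i = 1, where delta_i ~ Bernoulli(pi_i) independently."""
    pi = np.asarray(pi, dtype=float)
    return np.flatnonzero(rng.random(pi.shape[0]) < pi)

def weighted_gee_linear(X, Y, pi, R_inv):
    """Solve sum_i (1/pi_i) X_i^T R^{-1} (Y_i - X_i beta) = 0 for a fixed R^{-1}."""
    XtRi = np.einsum("imp,mk->ipk", X, R_inv) / pi[:, None, None]  # (1/pi_i) X_i^T R^{-1}
    H = np.einsum("ipk,ikq->pq", XtRi, X)
    b = np.einsum("ipk,ik->p", XtRi, Y)
    return np.linalg.solve(H, b)
```

A typical call draws `idx = poisson_subsample(pi, rng)` and then fits `weighted_gee_linear(X[idx], Y[idx], pi[idx], R_inv)`. Because each inclusion decision uses only $\pi_{i}$ , the Bernoulli draws could equally be made block by block while streaming through the data.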

The size of the drawn subsample, denoted by $r^{*}$ , satisfies $E(r^{*})=\sum_{i=1}^{n}\pi_{i}$ . Let $r=\sum_{i=1}^{n}\pi_{i}$ denote the expected size of the drawn subsample. Furthermore, assume that $r\ll n$ , which is a common assumption in big data scenarios. The following regularity conditions are imposed to ensure consistency and asymptotic normality.

(C1)

$\sup_{i,j}\left\|\mathbf{X}_{ij}\right\|=O\left(\sqrt{p_{n}}\right)$ .

(C2)

The parameter vector $\boldsymbol{\beta}$ is assumed to lie within a compact set $\mathcal{B}\subseteq\mathbb{R}^{p_{n}}$ , and the true parameter vector $\boldsymbol{\beta}_{0}$ is also contained in $\mathcal{B}$ .

(C3)

There are positive constants $b_{1},b_{2}>0$ such that

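$$b_{1}\leq\lambda_{\min}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq b_{2},$$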

and similarly, two other constants $b_{3},b_{4}>0$ satisfy

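$$b_{3}\leq\lambda_{\min}\left(\frac{r}{n}\sum_{i=1}^{n}\frac{1}{n\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq\lambda_{\max}\left(\frac{r}{n}\sum_{i=1}^{n}\frac{1}{n\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)\leq b_{4},$$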

where $\lambda_{\min}$ and $\lambda_{\max}$ indicate the smallest and largest eigenvalues of a matrix, respectively.

(C4)

The true correlation matrix $\mathbf{R}_{0}$ is assumed to have eigenvalues bounded away from 0 and $+\infty$ . The estimated working correlation matrix $\tilde{\mathbf{R}}$ satisfies $\|\tilde{\mathbf{R}}^{-1}-\bar{\mathbf{R}}^{-1}\|=O_{p}(\sqrt{{p_{n}}/{r}})$ , where $\bar{\mathbf{R}}$ is a positive definite matrix with eigenvalues also bounded away from 0 and $+\infty$ . We do not require $\bar{\mathbf{R}}$ to be the true correlation matrix $\mathbf{R}_{0}$ .

(C5)

There are positive constants $M_{1}$ and $\delta$ such that

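$$E\left(\left\|\mathbf{A}_{i}^{-1/2}(\boldsymbol{\beta})\left(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta})\right)\right\|^{2+\delta}\right)\leq M_{1}.$$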

(C6)

Let $M_{2}>0$ be a constant satisfying $0\leq\dot{\boldsymbol{\mu}}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta})\leq\infty$ , and $0\leq\ddot{\boldsymbol{\mu}}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta}),\dddot{\boldsymbol{\mu}}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta})\leq M_{2}$ , where $\dot{\boldsymbol{\mu}}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta}),\ddot{\boldsymbol{\mu}}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta}),\dddot{\boldsymbol{\mu}}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta})$ are the first-, second-, and third-order derivatives of $\boldsymbol{\mu}(\mathbf{X}_{ij}^{T}\boldsymbol{\beta})$ , respectively.

(C7)

$\max_{1\leq i\leq n}\frac{1}{n\pi_{i}}=O_{p}\left(\frac{1}{r}\right)$ .

Assumption (C1) is a common condition for diverging-dimension M-estimators, which aligns with the setting in Portnoy (1985). Assumption (C2) is a necessary condition for the consistency of the estimator and has been widely adopted in many studies, including Newey & McFadden (1994). Assumption (C3) is frequently used in the high-dimensional regression literature, with a similar formulation appearing in Wang (2011). Assumption (C4) extends the framework from fixed $p_{n}$ to high-dimensional $p_{n}$ . Assumption (C5) imposes moment conditions on the model. Assumption (C6) ensures the consistency of parameter estimation. Assumption (C7) restricts the weights in the weighted generalized estimating equations, primarily to prevent individuals with extremely small subsampling probabilities from unduly influencing the results.

Theorem 2.1

Under assumptions (C1)-(C7), if ${p_{n}^{2}}/{r}=o(1)$ and $\tilde{\boldsymbol{\beta}}$ is a solution to $\mathbf{S}_{r}(\boldsymbol{\beta})=0$ , then

$$\left\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\right\|=O_{p}\left(\sqrt{p_{n}/r}\right).$$

Theorem 2.2

Under assumptions (C1)-(C7), if ${p_{n}^{3}}/{r}=o(1)$ , then for any $\mathbf{c}_{n}\in\mathbf{R}^{p_{n}}$ with $\|\mathbf{c}_{n}\|=1$ , we have

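$$\mathbf{c}_{n}^{T}\bar{\mathbf{M}}_{r}^{-1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{H}}_{n}(\boldsymbol{\beta}_{0})(\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0})\xrightarrow{d}N(0,1),$$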

where

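$$\bar{\mathbf{M}}_{r}(\boldsymbol{\beta}_{0})=\frac{1}{n^{2}}\sum_{i=1}^{n}\frac{1}{\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}(\boldsymbol{\beta}_{0})\boldsymbol{\varepsilon}_{i}^{T}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\mathbf{X}_{i},$$

$$\bar{\mathbf{H}}_{n}(\boldsymbol{\beta}_{0})=\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}(\boldsymbol{\beta}_{0})\mathbf{X}_{i}.$$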

Theorem 2.3

Under assumptions (C1)-(C7), if ${p_{n}^{3}}/{r}=o(1)$ , then

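$$\mathbf{c}_{n}\tilde{\boldsymbol{\Sigma}}\mathbf{c}_{n}^{T}-\mathbf{c}_{n}\boldsymbol{\Sigma}\mathbf{c}_{n}^{T}=o_{p}(1),$$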

where $\mathbf{c}_{n}$ is a $p_{n}$ -dimensional vector satisfying $\mathbf{c}_{n}\mathbf{c}_{n}^{T}=1$ ; $\tilde{\boldsymbol{\Sigma}}=\mathbf{H}_{n}^{-1}(\tilde{\boldsymbol{\beta}})\bar{\mathbf{M}}_{r}(\tilde{\boldsymbol{\beta}})\mathbf{H}_{n}^{-1}(\tilde{\boldsymbol{\beta}})$ ; $\boldsymbol{\Sigma}=\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})\bar{\mathbf{M}}_{r}(\boldsymbol{\beta}_{0})\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})$ , $\mathbf{H}_{n}(\boldsymbol{\beta})$ is similar to $\bar{\mathbf{H}}_{n}(\boldsymbol{\beta})$ , where $\tilde{\mathbf{R}}$ is used in place of $\bar{\mathbf{R}}$ .

Theorem 2.2 indicates that the estimation error $\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}$ is asymptotically normal, with an asymptotic distribution that depends on the sampling probabilities $\pi=\left\{\pi_{i}\right\}_{i=1}^{n}$ . Regardless of whether the working correlation matrix is correctly specified, increasing the sample size or the number of observations per individual leads to better estimates of the regression parameters.
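To make the dependence on the sampling probabilities concrete, the sketch below evaluates the trace of the sandwich matrix $\mathbf{H}^{-1}\mathbf{M}\mathbf{H}^{-1}$ for a given vector of probabilities. It is an illustration only, assuming an identity link with $\mathbf{A}_{i}=\mathbf{I}$ and residuals supplied by the caller; `sandwich_trace` is a hypothetical name.

```python
import numpy as np

def sandwich_trace(X, resid, pi, R_inv):
    """tr(H^{-1} M H^{-1}) for identity link (A_i = I), given residuals eps_i."""
    n = X.shape[0]
    XtRi = np.einsum("imp,mk->ipk", X, R_inv)               # X_i^T R^{-1}
    H = np.einsum("ipk,ikq->pq", XtRi, X) / n               # (1/n) sum_i X_i^T R^{-1} X_i
    g = np.einsum("ipk,ik->ip", XtRi, resid)                # g_i = X_i^T R^{-1} eps_i
    M = np.einsum("ip,iq->pq", g / pi[:, None], g) / n**2   # (1/n^2) sum_i g_i g_i^T / pi_i
    H_inv = np.linalg.inv(H)
    return float(np.trace(H_inv @ M @ H_inv))
```

Since $\pi_{i}$ enters only through $1/\pi_{i}$ , doubling every probability halves the trace, while reallocating a fixed budget $\sum_{i}\pi_{i}=r$ across individuals changes it in a data-dependent way.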

3 Optimal Poisson Sampling Algorithm

3.1 Optimal Poisson Sampling Strategy

To obtain the regression parameter estimator $\tilde{\boldsymbol{\beta}}$ using (2.2), the Poisson sampling probabilities $\pi=\left\{\pi_{i}\right\}_{i=1}^{n}$ need to be specified. The optimal subsampling probabilities can be determined by minimizing $tr(\boldsymbol{\Sigma})$ , that is, by using the A-optimality criterion.

Theorem 3.4

Define $h_{i}^{MV}=\left\|\bar{\mathbf{H}}_{n}^{-1}\left(\boldsymbol{\beta}_{0}\right)\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}\left(\boldsymbol{\beta}_{0}\right)\right\|$ , $i=1,2,\cdots,n$ , and let $h_{(1)}^{MV}\leq h_{(2)}^{MV}\leq\cdots\leq h_{(n)}^{MV}$ be the order statistics of $\left\{h_{i}^{MV}\right\}_{i=1}^{n}$ . If the subsampling probability is

$$\pi_{i}^{MV}=\frac{r\left(h_{i}^{MV}\wedge T\right)}{\sum_{j=1}^{n}\left(h_{j}^{MV}\wedge T\right)},$$

the value of $tr(\boldsymbol{\Sigma})$ is minimized, where,

$$T=\sum_{i=1}^{n-w}h_{(i)}^{MV}/(r-w),$$

and

$$w=\min\left\{s\mid 0\leq s\leq r,\ h_{(n-s)}^{MV}<\sum_{i=1}^{n-s}h_{(i)}^{MV}/(r-s)\right\}.$$

The computation of $T$ in (3.1) is required only for those individuals $i$ satisfying the condition $rh_{(i)}^{MV}/\sum_{j=1}^{n}h_{j}^{MV}>1$ . In this case, $w$ is the number of individuals for which $\pi_{i}^{MV}=1$ . If all individuals satisfy $rh_{(i)}^{MV}/\sum_{j=1}^{n}h_{j}^{MV}\leq 1$ , then we can directly set $\pi_{i}^{MV}=rh_{i}^{MV}/\sum_{j=1}^{n}h_{j}^{MV}$ . To reduce computational complexity, the next theorem establishes the L-optimal subsampling strategy.
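The search for $w$ and $T$ can be sketched directly from their definitions. The code below is a minimal illustration, assuming the probabilities take the truncated form $\pi_{i}=r(h_{i}\wedge T)/\sum_{j}(h_{j}\wedge T)$ , which is consistent with the definition of $\Psi$ in Section 3.2. Under that assumption $\sum_{j}(h_{j}\wedge T)=(r-w)T+wT=rT$ , so the probabilities reduce to $\min(h_{i}/T,1)$ and sum to $r$ . The function name and the input checks are hypothetical.

```python
import numpy as np

def optimal_poisson_probs(h, r):
    """Return (pi, T, w) with pi_i = r * min(h_i, T) / sum_j min(h_j, T)."""
    h = np.asarray(h, dtype=float)
    n = h.shape[0]
    if not (0 < r <= n) or np.any(h <= 0):
        raise ValueError("need 0 < r <= n and strictly positive h")
    h_sorted = np.sort(h)                  # h_(1) <= ... <= h_(n)
    csum = np.cumsum(h_sorted)             # csum[k-1] = h_(1) + ... + h_(k)
    w, T = None, None
    for s in range(int(np.ceil(r))):       # s = 0, 1, ... while r - s > 0
        k = n - s                          # number of terms left untruncated
        cand = csum[k - 1] / (r - s)       # sum_{i <= n-s} h_(i) / (r - s)
        if h_sorted[k - 1] < cand:         # h_(n-s) falls below the threshold
            w, T = s, cand
            break
    if w is None:
        raise ValueError("no admissible w found")
    pi = np.minimum(h, T) / T              # equals r * min(h, T) / sum_j min(h_j, T)
    return pi, T, w
```

For instance, with `h = [1, 1, 1, 1, 6]` and `r = 2`, the largest score exceeds the untruncated threshold, so one individual is taken with probability one and the remaining budget is spread over the rest.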

Theorem 3.5

Define $h_{i}^{MVc}=\left\|\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}\left(\boldsymbol{\beta}_{0}\right)\right\|$ , $i=1,2,\cdots,n$ , and let $h_{(1)}^{MVc}\leq h_{(2)}^{MVc}\leq\cdots\leq h_{(n)}^{MVc}$ be the order statistics of $\left\{h_{i}^{MVc}\right\}_{i=1}^{n}$ . If the subsampling probability is

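$$\pi_{i}^{MVc}=\frac{r\left(h_{i}^{MVc}\wedge T\right)}{\sum_{j=1}^{n}\left(h_{j}^{MVc}\wedge T\right)},$$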

then $tr(\bar{\mathbf{M}}_{r}\left(\boldsymbol{\beta}_{0}\right))$ is minimized, where,

$$T=\sum_{i=1}^{n-w}h_{(i)}^{MVc}/(r-w),$$

and

$$w=\min\left\{s\mid 0\leq s\leq r,\ h_{(n-s)}^{MVc}<\sum_{i=1}^{n-s}h_{(i)}^{MVc}/(r-s)\right\}.$$

3.2 Two-Step Algorithm

To simplify notation, let $\pi_{i}^{os}$ and $h_{i}^{os}$ denote $\pi_{i}^{MV}$ or $\pi_{i}^{MVc}$ , and $h_{i}^{MV}$ or $h_{i}^{MVc}$ , respectively, as defined in Theorems 3.4 and 3.5. Since the computation of $h_{i}^{os}$ is related to the true parameter $\boldsymbol{\beta}_{0}$ , the optimal Poisson subsampling probability is

$$\pi_{i}^{os}=\frac{r\left(h_{i}^{os}\wedge T\right)}{n\Psi},$$

which cannot be directly computed, where $\Psi=n^{-1}\sum_{i=1}^{n}\left(h_{i}^{os}\wedge T\right)$ . To implement this procedure, a two-step algorithm is proposed. In the first step, a pilot subsample of expected size $r_{1}$ is drawn through uniform Poisson sampling, denoted as $D_{r_{1}^{*}}=\left\{\left(\mathbf{X}_{i}^{*},\mathbf{Y}_{i}^{*},r_{1}/n\right)\right\}_{i=1}^{r_{1}^{*}}$ . Based on $D_{r_{1}^{*}}$ , and assuming an independent working correlation matrix, the resulting estimate $\tilde{\boldsymbol{\beta}}_{r_{1}^{*}}$ is used as an initial approximation of $\boldsymbol{\beta}_{0}$ . The working correlation matrix $\bar{\mathbf{R}}$ is accordingly replaced by $\tilde{\mathbf{R}}_{r_{1}^{*}}$ , computed via the Gaussian pseudo-likelihood method. Next, the values of $T$ and $\Psi$ can be computed. The purpose of $T$ is to ensure that $\max_{1\leq i\leq n}\pi_{i}^{os}\leq 1$ , and since $r\ll n$ is the common case, the situation where $h_{i}^{os}>T$ is very rare. Therefore, with sufficiently small subsampling rates, directly setting $T=\infty$ performs quite well. $\Psi$ can be estimated by $\hat{\Psi}=(r_{1}^{*})^{-1}\sum_{i\in D_{r_{1}^{*}}}\left\|\mathbf{L}\mathbf{H}_{r_{1}}^{-1}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\boldsymbol{\varepsilon}_{i}^{*}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\right\|$ , where $\mathbf{H}_{r_{1}}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})=(r_{1}^{*})^{-1}\sum_{i\in D_{r_{1}^{*}}}\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\mathbf{A}_{i}^{*1/2}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\mathbf{X}_{i}^{*}$ . When $\mathbf{L}=\mathbf{I}$ , this corresponds to the A-optimality criterion; when $\mathbf{L}=\mathbf{H}_{r_{1}}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})$ , it corresponds to the L-optimality criterion. Therefore, the optimal subsampling probability can be approximated by $\hat{\pi}_{i}^{os}$ , where $\tilde{\boldsymbol{\beta}}_{r_{1}^{*}}$ , $\tilde{\mathbf{R}}_{r_{1}^{*}}$ , $T=\infty$ , $\mathbf{H}_{r_{1}}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})$ , and $\hat{\Psi}$ are used in place of $\boldsymbol{\beta}_{0}$ , $\bar{\mathbf{R}}$ , $T$ , $\bar{\mathbf{H}}_{n}\left(\boldsymbol{\beta}_{0}\right)$ , and $\Psi$ .

To enhance the robustness of the estimator, we employ the shrinkage-based subsampling method studied by Ma et al. (2015), which combines the optimal subsampling probability $\hat{\pi}_{i}^{os}$ with the uniform probability:

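$$\hat{\pi}_{i}^{sos}=(1-\rho)\frac{r_{2}\left\|\mathbf{L}\mathbf{H}_{r_{1}}^{-1}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\boldsymbol{\varepsilon}_{i}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})\right\|}{n\hat{\Psi}}+\rho\frac{r_{2}}{n},\tag{3.4}$$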

where $\rho\in(0,1)$ , and $r_{2}$ denotes the expected subsample size in the second step. In practice, it is possible that $\hat{\pi}_{i}^{sos}$ exceeds 1 due to the shrinkage adjustment. Therefore, the final subsampling probability is given by $\left(\hat{\pi}_{i}^{sos}\wedge 1\right)$ , and the final regression parameter estimator is denoted as $\breve{\boldsymbol{\beta}}$ . Algorithm 2 provides a full description of the two-step algorithm. The asymptotic properties of the regression parameter $\breve{\boldsymbol{\beta}}$ obtained from Algorithm 2 are presented in Theorem 3.6.
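One consequence of mixing with the uniform probability can be written out as a short worked bound. Here $\hat{h}_{i}\geq 0$ is shorthand, introduced only for this illustration, for the norm term in the numerator of $\hat{\pi}_{i}^{sos}$ , and we assume $\hat{\Psi}>0$ and $\rho r_{2}\leq n$ :

$$
\hat{\pi}_{i}^{sos}=(1-\rho)\frac{r_{2}\hat{h}_{i}}{n\hat{\Psi}}+\rho\frac{r_{2}}{n}\;\geq\;\rho\frac{r_{2}}{n}
\quad\Longrightarrow\quad
\frac{1}{n\left(\hat{\pi}_{i}^{sos}\wedge 1\right)}\leq\frac{1}{\rho r_{2}}\quad\text{for all } i .
$$

Under these assumptions the inverse-probability weights in the second step are bounded by $1/(\rho r_{2})$ , which is the kind of control that assumption (C7) asks for with $r=r_{2}$ .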

Algorithm 2

Two-Step Algorithm.

• Step 1: Use the uniform Poisson subsampling probabilities $\{\pi_{i}=r_{1}/n\}_{i=1}^{n}$ to draw the pilot sample $D_{r_{1}^{*}}$, obtain $\tilde{\boldsymbol{\beta}}_{r_{1}^{*}}$, $\tilde{\mathbf{R}}_{r_{1}^{*}}$, $\mathbf{H}_{r_{1}}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})$, and $\hat{\Psi}$, and set $T=\infty$. Compute the approximate optimal Poisson subsampling probabilities $\left(\hat{\pi}_{i}^{sos}\wedge 1\right)$ based on (3.4).

• Step 2: With the approximate optimal subsampling probabilities obtained from Step 1, draw the sample $D_{r_{2}^{*}}$ and estimate the regression parameter.
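As an illustration only, the two steps can be sketched in Python for the special case of a linear marginal model, where $\mathbf{A}_{i}$ is the identity and the weighted estimating equation has a closed-form solution. Several ingredients are assumptions of this sketch rather than part of the paper: the function names, the array layout (`X` of shape `(n, m, p)`, `Y` of shape `(n, m)`), the simple moment estimate of the working correlation (used here in place of the Gaussian pseudo-likelihood estimate), and the mixture form $(1-\rho)\hat{\pi}_{i}^{os}+\rho r_{2}/n$ assumed for the shrinkage probability.

```python
import numpy as np

def weighted_gee_linear(X, Y, w, R_inv):
    """Solve sum_i w_i X_i^T R^{-1} (Y_i - X_i beta) = 0 for a linear mean model."""
    lhs = np.einsum("i,imp,mk,ikq->pq", w, X, R_inv, X, optimize=True)
    rhs = np.einsum("i,imp,mk,ik->p", w, X, R_inv, Y, optimize=True)
    return np.linalg.solve(lhs, rhs)

def two_step_estimate(X, Y, r1, r2, rho=0.2, criterion="A", seed=0):
    """Hypothetical helper: two-step Poisson subsampling with T = infinity."""
    rng = np.random.default_rng(seed)
    n, m, _ = X.shape

    # Step 1: uniform Poisson pilot sample, independence working correlation.
    pilot = rng.random(n) < r1 / n
    Xp, Yp = X[pilot], Y[pilot]
    beta_pilot = weighted_gee_linear(Xp, Yp, np.ones(len(Xp)), np.eye(m))

    # Assumption: moment estimate of R from pilot residuals
    # (the paper uses a Gaussian pseudo-likelihood estimate instead).
    R_inv = np.linalg.inv(np.corrcoef(Yp - Xp @ beta_pilot, rowvar=False))
    H = np.einsum("imp,mk,ikq->pq", Xp, R_inv, Xp, optimize=True) / len(Xp)

    # Scores X_i^T R^{-1} eps_i for every individual; A-optimality applies H^{-1},
    # L-optimality (L = H) leaves the score unchanged.
    eps = Y - X @ beta_pilot
    score = np.einsum("imp,mk,ik->ip", X, R_inv, eps, optimize=True)
    if criterion == "A":
        score = score @ np.linalg.inv(H)
    h = np.linalg.norm(score, axis=1)

    # Psi is estimated from the pilot sample only.
    psi_hat = h[pilot].mean()
    pi_os = r2 * h / (n * psi_hat)

    # Assumed shrinkage form: mixture with the uniform probability, capped at 1.
    pi = np.minimum((1 - rho) * pi_os + rho * r2 / n, 1.0)

    # Step 2: Poisson subsample and inverse-probability-weighted estimate.
    keep = rng.random(n) < pi
    beta = weighted_gee_linear(X[keep], Y[keep], 1.0 / pi[keep], R_inv)
    return beta, pi
```

The scores $h_{i}$ are computed for all $n$ individuals in a single pass, while $\hat{\Psi}$ and $\mathbf{H}_{r_{1}}$ use the pilot sample only, which is what keeps the first step cheap.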

Theorem 3.6

Under assumptions (C1)-(C7), if ${p_{n}^{3}}/{r}=o(1)$ and $r_{1}r_{2}^{-1/2}=o(1)$, then for any $\mathbf{c}_{n}\in\mathbf{R}^{p_{n}}$ with $\|\mathbf{c}_{n}\|=1$, we have

[TABLE]

where

[TABLE]

and

[TABLE]

According to Theorem 3.6, the covariance matrix of $\breve{\boldsymbol{\beta}}$ can be estimated as $\mathbf{H}_{r_{1}}^{-1}(\breve{\boldsymbol{\beta}})\breve{\mathbf{M}}_{r}(\breve{\boldsymbol{\beta}})\mathbf{H}_{r_{1}}^{-1}(\breve{\boldsymbol{\beta}})$, where $\mathbf{H}_{r_{1}}(\breve{\boldsymbol{\beta}})=n^{-1}\sum_{i\in D_{r_{2}^{*}}}\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\breve{\boldsymbol{\beta}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\mathbf{A}_{i}^{*1/2}(\breve{\boldsymbol{\beta}})\mathbf{X}_{i}^{*}$ and $\breve{\mathbf{M}}_{r}(\breve{\boldsymbol{\beta}})=n^{-2}\sum_{i\in D_{r_{2}^{*}}}\left(\hat{\pi}_{i}^{*sos}\wedge 1\right)^{-1}\mathbf{X}_{i}^{*T}\mathbf{A}_{i}^{*1/2}(\breve{\boldsymbol{\beta}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\boldsymbol{\varepsilon}_{i}^{*}(\breve{\boldsymbol{\beta}})\boldsymbol{\varepsilon}_{i}^{*T}(\breve{\boldsymbol{\beta}})\tilde{\mathbf{R}}_{r_{1}^{*}}^{-1}\mathbf{A}_{i}^{*1/2}(\breve{\boldsymbol{\beta}})\mathbf{X}_{i}^{*}.$

4 Numerical Simulation

We assess the effectiveness of the optimal Poisson subsampling algorithm through simulation studies, considering a linear regression model in the context of high-dimensional longitudinal data:

[TABLE]

The true value is $\boldsymbol{\beta}_{0}=(1,1.5,1,1.5,\cdots,1,1.5)^{T}$. We consider three settings for the dimensionality: $p_{n}=30$, $50$, and $70$. The covariates $\mathbf{X}_{i}$ are generated from two different distributions:

(1) Case 1: $\mathbf{X}_{i}$ follows a multivariate $t$-distribution with $3$ degrees of freedom, i.e., $t_{3}(\mathbf{0},\boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}=(0.5^{|i-j|})$.

(2) Case 2: $\mathbf{X}_{i}$ follows a log-normal distribution $LN(\mathbf{0},\,1.8\boldsymbol{\Sigma})$.

The error term $\boldsymbol{\varepsilon}_{i}$ is generated from a multivariate normal distribution $N\left(\mathbf{0},\mathbf{R}(\boldsymbol{\alpha})\right)$, where the correlation parameter is set to $\alpha=0.5$. We consider three different working correlation matrices: EX, AR(1), and MA(1).
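A minimal sketch of the Case 1 data-generating mechanism is given below. It assumes the linear model $\mathbf{Y}_{i}=\mathbf{X}_{i}\boldsymbol{\beta}_{0}+\boldsymbol{\varepsilon}_{i}$, that the $m$ rows of $\mathbf{X}_{i}$ are drawn independently from $t_{3}(\mathbf{0},\boldsymbol{\Sigma})$, and that EX denotes the exchangeable structure; the function names and the array layout are hypothetical. The defaults `n=10000` and `m=5` follow the simulation settings of this section.

```python
import numpy as np

def corr_matrix(kind, m, alpha):
    """Correlation matrix R(alpha) for the EX, AR(1), or MA(1) structure."""
    idx = np.arange(m)
    lag = np.abs(idx[:, None] - idx[None, :])
    if kind == "AR1":
        return alpha ** lag.astype(float)
    if kind == "EX":
        return np.where(lag == 0, 1.0, alpha)
    if kind == "MA1":
        return np.where(lag == 0, 1.0, np.where(lag == 1, alpha, 0.0))
    raise ValueError(f"unknown correlation structure: {kind}")

def simulate_case1(n=10000, m=5, p=30, alpha=0.5, kind="AR1", seed=0):
    """Generate (X, Y, beta0) with multivariate t_3 covariates (p must be even)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :]).astype(float)

    # Multivariate t_3: a normal vector divided by sqrt(chi-square_3 / 3).
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=(n, m))
    W = rng.chisquare(3, size=(n, m, 1))
    X = Z / np.sqrt(W / 3.0)

    # Within-individual errors with correlation R(alpha).
    eps = rng.multivariate_normal(np.zeros(m), corr_matrix(kind, m, alpha), size=n)

    beta0 = np.tile([1.0, 1.5], p // 2)
    Y = X @ beta0 + eps
    return X, Y, beta0
```

Passing a different `kind` to `corr_matrix` when fitting than when simulating reproduces the misspecified settings such as AR(1)-EX.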

The number of individuals is set to $n=10000$, with each individual having $m=5$ observations. The pilot subsample size is set to $r_{1}=200$, and the second-stage subsample size $r_{2}$ is chosen from 100, 200, 400, 600, 800, and 1000. The mean squared error (MSE) is calculated across 1000 simulation replications to assess the performance of the subsampling methods. Here, $\tilde{\boldsymbol{\beta}}^{(s)}$ denotes the estimated parameter obtained from the $s$-th subsample,

[TABLE]

We present the experimental results under Case 1, with AR(1) and EX as the true correlation structures, in Figures 1-2. In a label such as AR(1)-EX, the first term is the true correlation matrix and the second is the working correlation matrix. The uniform Poisson subsampling method is denoted as pUnif, the A-optimal Poisson subsampling as pMV, and the L-optimal Poisson subsampling as pMVc.

The results indicate that as the subsample size increases, the log(MSE) of all three methods decreases gradually. Both pMV and pMVc outperform pUnif significantly, with pMV consistently performing slightly better than pMVc. Even when the working correlation matrix is misspecified, the proposed methods still demonstrate robust performance. We display the experimental results under Case 2, with AR(1) and EX as the true correlation matrices, in Figures 3–4. Similar to the results in Case 1, both pMV and pMVc significantly outperform the uniform Poisson subsampling method.

We summarize the computation time under Case 1 for true correlation matrix AR(1) with $p_{n}=30$ and $p_{n}=50$ in Table 1. The results show that the computation times of the three Poisson subsampling methods increase with the subsample size, but all remain smaller than the computation time using the full dataset. Among them, pMVc has a lower computational cost than pMV.

5 Real Data Analysis

The empirical analysis in this paper is based on data from the China Household Finance Survey (CHFS), which is administered by Southwest University of Finance and Economics (https://chfs.swufe.edu.cn). We perform the analysis at the household level, using total household income (Total_income) as the response variable. The independent variables include the household head’s residence area (Rural), age (Age), gender (Gender), marital status (Marry), education level (Edu), health status (Unhealth), pension insurance coverage (Endowment_insurance) and medical insurance coverage (Medinsurance); as well as household-level variables such as total number of family members (Familynum), number of unhealthy family members (Unhealthnum), utility expenditures (Expenditure1), consumption expenditures (Expenditure2), and financial assets (Finanasset). The dataset comprises 9,753 households, with three repeated observations per household.

The corresponding model is as follows:

[TABLE]

Because the true parameter values are unobservable in real-world datasets, we use the estimates obtained from the full dataset in place of the unknown true values. Figure 5 presents the estimation results of pUnif, pMV, and pMVc under different working correlation matrices. The results indicate that pMV and pMVc consistently outperform pUnif, and that pMV and pMVc perform similarly. Therefore, optimal Poisson subsampling is the more effective choice.

Acknowledgements

Chunjing Li was partly supported by the National Social Science Fund of China (24BTJ061) and Scientific Research Project of Jilin Provincial Department of Education (JJKH20250702KJ). Xiaohui Yuan was partly supported by the National Social Science Fund of China (22BTJ019) and Scientific Research Project of Jilin Provincial Department of Science and Technology (20250102029JC).

Appendix

In the context of the subsample, the weighted generalized estimating equation (2.2) can be reformulated as:

[TABLE]

To prove these theorems, we first establish several lemmas.

Lemma A. 1

Under assumptions (C1)-(C3), if ${p_{n}^{2}}/{r}=o(1)$ as $r\rightarrow\infty$ , then

[TABLE]

The proof of Lemma 1. We first prove that

[TABLE]

where

[TABLE]

According to the Central Limit Theorem, it holds that $\|\mathbf{R}^{*}-\mathbf{R}_{0}\|=O_{p}(\sqrt{1/r})$ . Combining with (A.1), we can derive that $\|\tilde{\mathbf{R}}-\mathbf{R}_{0}\|=O_{p}(\sqrt{p_{n}/r})$ . Note that $\|\tilde{\mathbf{R}}-\mathbf{R}^{*}\|^{2}=\sum_{k=1}^{m}\sum_{j=1}^{m}[\tilde{\mathbf{R}}_{kj}-\mathbf{R}_{kj}^{*}]^{2}$ . We have

[TABLE]

where $\eta_{ijk}=\frac{\sqrt{\mathbf{A}_{ik}(\boldsymbol{\beta}_{0})\mathbf{A}_{ij}\left(\boldsymbol{\beta}_{0}\right)}}{\sqrt{\mathbf{A}_{ik}(\tilde{\boldsymbol{\beta}})\mathbf{A}_{ij}(\tilde{\boldsymbol{\beta}})}}-1$ . Thus,

[TABLE]

By the triangle inequality, we observe

[TABLE]

then, applying this result yields

[TABLE]

Thus, we obtain the following result:

[TABLE]

By the Cauchy-Schwarz inequality,

[TABLE]

By the mean value theorem, $\mu_{ik}(\boldsymbol{\beta}_{0})-\mu_{ik}(\tilde{\boldsymbol{\beta}})=\mathbf{A}_{ik}(\boldsymbol{\beta}^{*})\mathbf{X}_{ik}(\boldsymbol{\beta}_{0}-\tilde{\boldsymbol{\beta}})$ for some $\boldsymbol{\beta}^{*}$ with $\left\|\boldsymbol{\beta}_{0}-\boldsymbol{\beta}^{*}\right\|\leq\|\boldsymbol{\beta}_{0}-\tilde{\boldsymbol{\beta}}\|$. Therefore,

[TABLE]

Let $C$ denote a generic positive constant. We use this notation consistently throughout the paper. Hence $I_{n11}=O_{p}(\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|^{4})$ . Similarly, note that

[TABLE]

Therefore, we have $I_{n12}=O_{p}(\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|^{2})$ . Similarly, $I_{n13}=O_{p}(\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|^{2})$ . We next analyze $I_{kj,2}$ . By applying the Cauchy–Schwarz inequality, we have

[TABLE]

Let $g(\boldsymbol{\beta})=\frac{\sqrt{\mathbf{A}_{ik}(\boldsymbol{\beta}_{0})\mathbf{A}_{ij}(\boldsymbol{\beta}_{0})}}{\sqrt{\mathbf{A}_{ik}(\boldsymbol{\beta})\mathbf{A}_{ij}(\boldsymbol{\beta})}}$. Since $g(\boldsymbol{\beta}_{0})=1$, we have $\eta_{ijk}=g(\tilde{\boldsymbol{\beta}})-g\left(\boldsymbol{\beta}_{0}\right)$, and then

[TABLE]

Therefore,

[TABLE]

where $\left\|\boldsymbol{\beta}_{0}-\boldsymbol{\beta}^{*}\right\|\leq\|\boldsymbol{\beta}_{0}-\tilde{\boldsymbol{\beta}}\|$. Hence, under Assumption (C3) and Lemma 3, $\|\tilde{\mathbf{R}}-\mathbf{R}^{*}\|^{2}\leq I_{n1}+I_{n2}=O_{p}(\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|^{2})=O_{p}(p_{n}/r)$, that is, (A.1) is proved. Moreover, $\|\tilde{\mathbf{R}}-\mathbf{R}_{0}\|\leq\|\tilde{\mathbf{R}}-\mathbf{R}^{*}\|+\|\mathbf{R}^{*}-\mathbf{R}_{0}\|=O_{p}(\sqrt{p_{n}/r})$. Therefore, we have $\|\tilde{\mathbf{R}}^{-1}-\mathbf{R}_{0}^{-1}\|=\|\tilde{\mathbf{R}}^{-1}(\tilde{\mathbf{R}}-\mathbf{R}_{0})\mathbf{R}_{0}^{-1}\|=O_{p}(\sqrt{p_{n}/r})$.

Lemma A. 2

Under assumptions (C1)-(C7), if $p_{n}^{2}/r=o(1)$ , then $\|\mathbf{S}_{r}(\boldsymbol{\beta}_{0})-\bar{\mathbf{S}}_{r}(\boldsymbol{\beta}_{0})\|=O_{p}\left(p_{n}/r\right).$

The proof of Lemma 2. Let $\mathbf{Q}=\left\{q_{j_{1}j_{2}}\right\}_{1\leq j_{1},j_{2}\leq m}$ denote $\tilde{\mathbf{R}}^{-1}-\bar{\mathbf{R}}^{-1}$ . Then

[TABLE]

where $\varepsilon_{ij_{2}}(\boldsymbol{\beta}_{0})=\mathbf{A}_{ij_{2}}^{-1/2}(\boldsymbol{\beta}_{0})(Y_{ij_{2}}-\mu_{ij_{2}}(\boldsymbol{\beta}_{0}))$. Note that

[TABLE]

Thus, we have $\|\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\mathbf{A}_{ij_{1}}^{1/2}(\boldsymbol{\beta}_{0})\varepsilon_{ij_{2}}(\boldsymbol{\beta}_{0})\mathbf{X}_{ij_{1}}\|=O_{p}(\sqrt{p_{n}/r}),\forall 1\leq j_{1},j_{2}\leq m.$ Combining this with assumption (C4) yields $q_{j_{1}j_{2}}=O_{p}(\sqrt{p_{n}/r}),\forall 1\leq j_{1},j_{2}\leq m,$ which completes the proof.

Lemma A. 3

Let $\tilde{\mathbf{S}}_{r}(\boldsymbol{\beta})=\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\mathbf{X}_{i}^{T}(\mathbf{Y}_{i}-\boldsymbol{\mu}_{i}(\boldsymbol{\beta}))$, and let $\tilde{\boldsymbol{\beta}}$ be the solution to $\tilde{\mathbf{S}}_{r}(\boldsymbol{\beta})=0$. If ${p_{n}^{2}}/r=o(1)$, then $\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|=O_{p}(\sqrt{p_{n}/r}).$

The proof of Lemma 3. It suffices to show that for any $\epsilon>0$ , there exists $\Delta>0$ such that for sufficiently large $r$ , we have

[TABLE]

Note that

[TABLE]

where $\|\boldsymbol{\beta}^{*}-\boldsymbol{\beta}_{0}\|\leq\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|.$ We now analyse the term $I_{n1}.$ Applying the Cauchy–Schwarz inequality yields $|I_{n1}|\leq\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|\|\tilde{\mathbf{S}}_{r}(\boldsymbol{\beta}_{0})\|=\Delta\sqrt{{p_{n}}/{r}}\|\tilde{\mathbf{S}}_{r}(\boldsymbol{\beta}_{0})\|.$ Note that

[TABLE]

Therefore, $\left|I_{n1}\right|\leq\Delta\sqrt{{p_{n}}/{r}}O_{p}(\sqrt{{p_{n}}/{r}})=\Delta O_{p}\left({p_{n}}/{r}\right).$ Next, we consider $I_{n2},$

[TABLE]

where $\frac{\partial\tilde{\mathbf{S}}\left(\boldsymbol{\beta}^{*}\right)}{\partial\boldsymbol{\beta}}=-\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{A}_{i}\left(\boldsymbol{\beta}^{*}\right)\mathbf{X}_{i}.$ We have

[TABLE]

and

[TABLE]

where $\mathbf{A}_{i}\left(\boldsymbol{\beta}^{*}\right)-\mathbf{A}_{i}\left(\boldsymbol{\beta}_{0}\right)=\mathbf{A}_{i}\left(\boldsymbol{\beta}^{**}\right)\mathbf{X}_{i}\left(\boldsymbol{\beta}^{*}-\boldsymbol{\beta}_{0}\right),\left\|\boldsymbol{\beta}^{**}-\boldsymbol{\beta}_{0}\right\|\leq\left\|\boldsymbol{\beta}^{*}-\boldsymbol{\beta}_{0}\right\|.$ When $\Delta$ is sufficiently large, $\left(\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\right)^{T}\tilde{\mathbf{S}}_{r}\left(\boldsymbol{\beta}\right)$ is dominated by $I_{n21}$ , and for sufficiently large $r$ , this value can be made negative.

Lemma A. 4

Let $\bar{\mathbf{D}}_{r}\left(\boldsymbol{\beta}\right)=-\frac{\partial\bar{\mathbf{S}}_{r}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^{T}}$. Then

[TABLE]

where

[TABLE]

with

[TABLE]

The proof of Lemma 4. See Xie & Wang (2003).

Lemma A. 5

Under assumptions (C1)-(C7), if ${p_{n}^{2}}/r=o(1)$ , then for any $\Delta>0$ and $\mathbf{b}_{n}\in\mathbf{R}^{p_{n}}$ , we have

[TABLE]

The matrix $\mathbf{D}_{r}\left(\boldsymbol{\beta}\right)-\bar{\mathbf{D}}_{r}\left(\boldsymbol{\beta}\right)$ is symmetric, and these results follow directly,

[TABLE]

The proof of Lemma 5. Let $\mathbf{H}_{r}(\boldsymbol{\beta})$ , $\mathbf{E}_{r}(\boldsymbol{\beta})$ , $\mathbf{G}_{r}(\boldsymbol{\beta})$ be defined analogously to $\bar{\mathbf{H}}_{r}\left(\boldsymbol{\beta}\right)$ , $\bar{\mathbf{E}}_{r}\left(\boldsymbol{\beta}\right)$ , $\bar{\mathbf{G}}_{r}\left(\boldsymbol{\beta}\right)$ , but with $\tilde{\mathbf{R}}$ replacing $\bar{\mathbf{R}}$ . The proof can be completed by establishing the following three asymptotic results:

[TABLE]

For (A.2), we observe that

[TABLE]

By assumptions (C2) and (C4), (A.2) is thus proved. Similarly, (A.3) and (A.4) can be verified.

Lemma A. 6

Under assumptions (C1)-(C7), if ${p_{n}^{2}}/r=o(1)$ , then for any $\Delta>0$ and $\mathbf{b}_{n}\in\mathbf{R}^{p_{n}}$ , we have

[TABLE]

The proof of Lemma 6. The proof follows directly from Lemma 3.4 in Wang (2011).

Lemma A. 7

Under assumptions (C1)-(C7), if ${p_{n}^{3}}/r=o(1)$ , then for any $\mathbf{b}_{n}\in\mathbf{R}^{p_{n}}$ with $\left\|\mathbf{b}_{n}\right\|=1$ , we have

[TABLE]

The proof of Lemma 7. Let $Var(\bar{\mathbf{S}}_{r}\left(\boldsymbol{\beta}_{0}\right))=\frac{1}{n^{2}}\sum_{i=1}^{n}Var\left(\mathbf{T}_{i}\right)$ , where

[TABLE]

Note that $Var\left(\mathbf{T}_{i}\right)=Var\left(E\left(\mathbf{T}_{i}\mid\mathbf{Y}_{i}\right)\right)+E\left(Var\left(\mathbf{T}_{i}\mid\mathbf{Y}_{i}\right)\right).$ We have

[TABLE]

and

[TABLE]

Thus, we obtain $\bar{\mathbf{M}}_{r}\left(\boldsymbol{\beta}_{0}\right)=\frac{1}{n^{2}}\sum_{i=1}^{n}\frac{1}{\pi_{i}}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}\left(\boldsymbol{\beta}_{0}\right)\boldsymbol{\varepsilon}_{i}^{T}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\mathbf{X}_{i}.$ Let $\mathbf{b}_{n}^{T}\bar{\mathbf{M}}_{r}^{-1/2}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{S}}_{r}\left(\boldsymbol{\beta}_{0}\right)=\sum_{i=1}^{n}\mathbf{Z}_{i},$ where

[TABLE]

Then, $E\left(\mathbf{Z}_{i}\right)=0,Var\left(\sum_{i=1}^{n}\mathbf{Z}_{i}\right)=\mathbf{b}_{n}^{T}\mathbf{b}_{n}=1.$ Note that

[TABLE]

Let

[TABLE]

Note that

[TABLE]

Thus, $\max_{1\leq i\leq n}\xi_{i}=\frac{O\left(p_{n}\right)}{O(1/{r})}=O_{p}\left({rp_{n}}\right).$ We now verify the Lyapunov condition.

[TABLE]

Observe that $\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\leq\frac{\lambda_{\max}\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{X}_{i}\right)}{\lambda_{\min}\left(\bar{\mathbf{M}}\left(\boldsymbol{\beta}_{0}\right)\right)}=O_{p}\left({r}\right).$ Thus,

[TABLE]

Lemma A. 8

Under assumptions (C1)-(C7), if ${p_{n}^{2}}/{r}=o(1)$ , we have $\bar{\mathbf{H}}_{r}\left(\boldsymbol{\beta}_{0}\right)-\bar{\mathbf{H}}_{n}\left(\boldsymbol{\beta}_{0}\right)=o_{p}\left(1\right).$

The proof of Lemma 8. Let $\bar{\mathbf{H}}_{n}\left(\boldsymbol{\beta}_{0}\right)=\frac{1}{n}\sum_{i=1}^{n}\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{R}}^{-1}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\mathbf{X}_{i}.$ Then

[TABLE]

Note that

[TABLE]

and

[TABLE]

Lemma 8 then follows from Chebyshev’s inequality.

Appendix A Proofs of the Theorems

**The proof of Theorem 2.1** We now show that, for any $\epsilon>0$, there exists a constant $\Delta>0$ such that when $r$ is sufficiently large,

[TABLE]

Note that

[TABLE]

where $\mathbf{D}_{r}\left(\boldsymbol{\beta}^{*}\right)=-\frac{\partial\mathbf{S}_{r}\left(\boldsymbol{\beta}^{*}\right)}{\partial\boldsymbol{\beta}},\|\boldsymbol{\beta}^{*}-\boldsymbol{\beta}\|\leq\|\boldsymbol{\beta}_{0}-\boldsymbol{\beta}\|$ . Next, we have

[TABLE]

By the Cauchy-Schwarz inequality, we have $\left|I_{n11}\right|\leq\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|\|\bar{\mathbf{S}}_{r}\left(\boldsymbol{\beta}_{0}\right)\|=\Delta\sqrt{{p_{n}}/{r}}\left\|\bar{\mathbf{S}}_{r}\left(\boldsymbol{\beta}_{0}\right)\right\|$ . Furthermore,

[TABLE]

Therefore, $\left\|\bar{\mathbf{S}}_{r}\left(\boldsymbol{\beta}_{0}\right)\right\|=O_{p}(\sqrt{{p_{n}}/{r}})$. Consequently, $\left|I_{n11}\right|=\Delta O_{p}\left({p_{n}}/{r}\right)$. By Lemma 2, we can derive that $I_{n12}\leq\left\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\right\|\|\mathbf{S}_{r}\left(\boldsymbol{\beta}_{0}\right)-\bar{\mathbf{S}}_{r}\left(\boldsymbol{\beta}_{0}\right)\|=\Delta\sqrt{{p_{n}}/{r}}O_{p}\left({p_{n}}/{r}\right)=\Delta o_{p}\left({p_{n}}/{r}\right)$. Next, we discuss $I_{n2}$,

[TABLE]

Note that by Lemma 5, we have

[TABLE]

For $I_{n2},$ we have

[TABLE]

By Lemmas 5 and 6, $I_{n21}^{b}=\Delta^{2}{p_{n}}/{r}O_{p}\left(\sqrt{{p_{n}}/{r}}\right)=\Delta^{2}o_{p}\left({p_{n}}/{r}\right),I_{n21}^{c}=I_{n21}^{d}=\Delta^{2}{p_{n}}/{r}O_{p}\left({p_{n}}/{\sqrt{r}}\right)=\Delta^{2}o_{p}\left({p_{n}}/{r}\right).$ Finally, under assumptions (C3) and (C7), we analyze $I_{n21}^{a},$

[TABLE]

Therefore, on the set $\mathcal{B}=\left\{\boldsymbol{\beta}{:}\left\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\right\|\leq\Delta\sqrt{{p_{n}}/{r}}\right\}$, the quantity $\left(\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\right)^{T}\mathbf{S}_{r}\left(\boldsymbol{\beta}\right)$ is dominated by $I_{n11}$ and $I_{n21}^{a}$, and it can be made negative when $\Delta>0$ is sufficiently large. This finishes the proof of Theorem 2.1.

**The proof of Theorem 2.2** First, we prove

[TABLE]

We have

[TABLE]

In the second equality, since $\mathbf{S}_{r}(\tilde{\boldsymbol{\beta}})=0$ , Taylor expansion yields

[TABLE]

By Lemma 7, $\mathbf{c}_{n}^{T}\bar{\mathbf{M}}_{r}^{-1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{S}}_{r}(\boldsymbol{\beta}_{0})\overset{d}{\longrightarrow}N(0,1).$ Thus, to prove (A.6), it suffices to show that for any $\Delta>0$,

[TABLE]

and

[TABLE]

First, we prove (A.8). By Lemma 2 and (A.5), observe that

[TABLE]

Next, we prove (A.7),

[TABLE]

By the Cauchy-Schwarz inequality and Lemma 5, we have

[TABLE]

By Lemmas 5 and 6, we can also derive $I_{n2}=I_{n3}=o_{p}(1)$ . Thus, (A.6) is proved. By Lemma 8, we have $\mathbf{c}_{n}^{T}\bar{\mathbf{M}}_{r}^{-1/2}\left(\boldsymbol{\beta}_{0}\right)\left[\bar{\mathbf{H}}_{r}\left(\boldsymbol{\beta}_{0}\right)-\bar{\mathbf{H}}_{n}\left(\boldsymbol{\beta}_{0}\right)\right](\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0})=o_{p}\left(1\right)$ . Combining this with Slutsky’s theorem yields $\mathbf{c}_{n}^{T}\bar{\mathbf{M}}_{r}^{-1/2}(\boldsymbol{\beta}_{0})\bar{\mathbf{H}}_{n}(\boldsymbol{\beta}_{0})(\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0})\overset{d}{\longrightarrow}N(0,1).$ This finishes the proof of Theorem 2.2.

**The proof of Theorem 2.3** It suffices to show that for any $\mathbf{c}_{n}\in\mathbf{R}^{p_{n}}$,

[TABLE]

In our proof, we employ Theorem 2.1. Let $\tilde{\boldsymbol{\Sigma}}-\boldsymbol{\Sigma}=I_{n1}+I_{n2}+I_{n3}$ , where

[TABLE]

Therefore, (A.9) can be derived from $\sup_{\|\mathbf{c}_{n}\|=1}\left|\mathbf{c}_{n}^{T}\mathbf{I}_{ni}\mathbf{c}_{n}\right|=o_{p}\left(1\right)$ . Furthermore, we have

[TABLE]

To analyze the eigenvalues of $\tilde{\mathbf{M}}_{r}(\tilde{\boldsymbol{\beta}})-\bar{\mathbf{M}}_{r}\left(\boldsymbol{\beta}_{0}\right)$ ,

[TABLE]

Note that

[TABLE]

It can be seen that

[TABLE]

We have $\|\mathbf{A}_{i}^{1/2}(\tilde{\boldsymbol{\beta}})\mathbf{X}_{i}\mathbf{c}_{n}\|\leq\left\|\mathbf{X}_{i}\mathbf{c}_{n}\right\|$ and

[TABLE]

and

[TABLE]

Therefore, we have

[TABLE]

Similarly, $\sup_{\|\mathbf{c}_{n}\|=1}J_{n2}=o_{p}(1)$ and $\sup_{\|\mathbf{c}_{n}\|=1}J_{n3}=o_{p}(1)$. Hence,

[TABLE]

Similarly, we have $\sup_{\|\mathbf{c}_{n}\|=1}|\mathbf{c}_{n}^{T}[\tilde{\mathbf{M}}_{r}\left(\boldsymbol{\beta}_{0}\right)-\bar{\mathbf{M}}_{r}\left(\boldsymbol{\beta}_{0}\right)]\mathbf{c}_{n}|=o_{p}\left(1\right)$ . Finally, observe that

[TABLE]

Therefore, we have $\sup_{\|\mathbf{b}_{n}\|=1}\left|\mathbf{b}_{n}^{T}\mathbf{I}_{n1}\mathbf{b}_{n}\right|=o_{p}\left(1\right)$ . Using

[TABLE]

we also obtain $\sup_{\|\mathbf{b}_{n}\|=1}\left|\mathbf{b}_{n}^{T}\mathbf{I}_{ni}\mathbf{b}_{n}\right|=o_{p}\left(1\right),i=2,3.$ This finishes the proof of Theorem 2.3.

**The proof of Theorems 3.4 and 3.5** If some elements of $\left\{h_{i}\right\}_{i=1}^{n}$ equal zero, their associated subsampling probabilities are set to zero, and only the subsampling probabilities of the remaining individuals are considered. Thus, without loss of generality, we assume that all $h_{i}>0$.

To minimize $tr(\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0})\bar{\mathbf{M}}_{r}(\boldsymbol{\beta}_{0})\bar{\mathbf{H}}_{n}^{-1}(\boldsymbol{\beta}_{0}))$ , which corresponds to the asymptotic mean squared error, the following optimization problem needs to be solved:

[TABLE]

For simplicity, write $h_{i}$ for $h_{i}^{MV}=\|\bar{\mathbf{H}}_{n}^{-1}\left(\boldsymbol{\beta}_{0}\right)\mathbf{X}_{i}^{T}\mathbf{A}_{i}^{1/2}\left(\boldsymbol{\beta}_{0}\right)\bar{\mathbf{R}}^{-1}\boldsymbol{\varepsilon}_{i}\left(\boldsymbol{\beta}_{0}\right)\|,i=1,\dots,n.$ Without loss of generality, we assume that $h_{1}\leq h_{2}\leq\dots\leq h_{n}$. Applying the Cauchy-Schwarz inequality,

[TABLE]

Equality holds if and only if $\pi_{i}\propto h_{i}$. Thus, if $\pi_{i}=rh_{i}/(\sum_{j=1}^{n}h_{j})\leq 1$ for all $i=1,\dots,n$, then these $\pi_{i}$ give the optimal solution.

Otherwise, if $rh_{n}/(\sum_{j=1}^{n}h_{j})>1$, then set $\pi_{n}=1$. The original optimization problem can then be reformulated as an optimization problem for $\pi_{1},\dots,\pi_{n-1}$:

[TABLE]

This problem follows an iterative structure, where the optimal solution minimizes the objective function $\tilde{H}=\sum_{i=n-k+1}^{n}h_{i}^{2}+\left(r-k\right)^{-1}\left(\sum_{i=1}^{n-k}h_{i}\right)^{2},$ for some $k$ such that

[TABLE]

Assume that $T$ exists such that

[TABLE]

and that $h_{n-k}<T\leq h_{n-k+1}$ . It follows that $\sum_{i=1}^{n-k}h_{i}=(r-k)T$ .

Substituting $\pi_{i}^{MV}=(\sum_{j=1}^{n}h_{j}\wedge T)^{-1}r\left(h_{i}\wedge T\right)$ into the original optimization problem, we obtain

[TABLE]

Thus, the original optimization problem attains its minimum when $\pi_{i}^{MV}$ is used.

Next, we show that there exists a value $T$ such that $h_{n-k}<T\leq h_{n-k+1}$. Observe that $k$ satisfies

[TABLE]

Setting $T=h_{n-k+1}$ gives

[TABLE]

This leads to $(h_{n}\wedge T)/(\sum_{j=1}^{n}h_{j}\wedge T)\geq 1/r.$ Similarly, setting $T=h_{n-k}$ yields $(h_{n}\wedge T)/(\sum_{j=1}^{n}h_{j}\wedge T)<1/r.$ Since the function $\max_{1\leq i\leq n}(h_{i}\wedge T)/(\sum_{j=1}^{n}h_{j}\wedge T)$ is continuous in $T$ given $h_{1},\dots,h_{n}$ , the existence of $T$ is guaranteed.

On the other hand, for any $h_{n}\geq T^{\prime}>T$, it follows that $T^{\prime}\wedge h_{n}\geq T\wedge h_{n}.$ From this, it can be derived that $T^{\prime}/(\sum_{i=1}^{n}(h_{i}\wedge T^{\prime}))\geq T/(\sum_{i=1}^{n}(h_{i}\wedge T)).$ Thus, for $T\in(h_{1},h_{n})$, the function $(h_{n}\wedge T)/(\sum_{i=1}^{n}(h_{i}\wedge T))$ is non-decreasing in $T$. Therefore,

[TABLE]

which confirms that $h_{n-k}<T\leq h_{n-k+1}$ .
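The construction of $k$ and $T$ above can be checked numerically. The following Python sketch, in which the function name and interface are hypothetical, increases $k$ until the largest untruncated score no longer exceeds $T=(r-k)^{-1}\sum_{i=1}^{n-k}h_{i}$ and then returns $\pi_{i}=r(h_{i}\wedge T)/\sum_{j=1}^{n}(h_{j}\wedge T)$, which simplifies to $(h_{i}\wedge T)/T$. It assumes $h_{i}>0$ for all $i$ and $r<n$.

```python
import numpy as np

def truncated_probabilities(h, r):
    """Return (pi, T) with pi_i = r * min(h_i, T) / sum_j min(h_j, T) <= 1."""
    h = np.asarray(h, dtype=float)
    n = h.size
    if not 0 < r < n:
        raise ValueError("this sketch assumes 0 < r < n")
    if np.any(h <= 0):
        raise ValueError("this sketch assumes all h_i > 0")

    ordered = np.sort(h)            # h_(1) <= ... <= h_(n)
    csum = np.cumsum(ordered)       # csum[j] = h_(1) + ... + h_(j+1)

    # Increase k while the largest remaining score still exceeds the
    # candidate threshold sum_{i <= n-k} h_(i) / (r - k).
    k = 0
    while (r - k) * ordered[n - k - 1] > csum[n - k - 1]:
        k += 1

    T = csum[n - k - 1] / (r - k)
    # sum_j min(h_j, T) equals r * T, so the probabilities reduce to min(h_i, T) / T.
    pi = np.minimum(h, T) / T
    return pi, T
```

For example, with scores $(1,1,1,1,10)$ and $r=2$ the untruncated probability of the last individual would be $20/14>1$; the sketch sets $k=1$ and $T=4$, giving probabilities $(0.25,0.25,0.25,0.25,1)$ that sum to $r$.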

Since the proof of Theorem 3.5 follows similar arguments, it is omitted here.

**The proof of Theorem 3.6** The condition $\hat{\pi}_{i}^{sos}\geq\rho r/n$ ensures that $\max_{1\leq i\leq n}(n\hat{\pi}_{i}^{sos})^{-1}=O_{P}(r^{-1})$. The consistency of the estimator is guaranteed by Theorem 2.2. Since $r_{1}r_{2}^{-1/2}\to 0$, it suffices to focus on the subsample drawn in the second step. The primary difference between $\pi_{i}^{os}$ and $\hat{\pi}_{i}^{sos}$ lies in the replacement terms, namely $\tilde{\boldsymbol{\beta}}_{r_{1}^{*}},\tilde{\mathbf{R}}_{r_{1}^{*}},\mathbf{H}_{r_{1}}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})$, and $\hat{\Psi}$. Under Assumptions (C1) to (C7), the consistency of $\tilde{\boldsymbol{\beta}}_{r_{1}^{*}}$ follows, and $\tilde{\mathbf{R}}_{r_{1}^{*}}$, $\mathbf{H}_{r_{1}}(\tilde{\boldsymbol{\beta}}_{r_{1}^{*}})$, and $\hat{\Psi}$ are also consistent estimators of

[TABLE]

Thus, by Theorem 2.2 and the continuous mapping theorem, asymptotic normality is established.

Bibliography

[1] Ai M, Yu J, Zhang H, Wang H. Optimal subsampling algorithms for big data regressions. Statistica Sinica, 2021, 31(2): 749-772.
[2] Balan R M, Schiopu-Kratina I. Asymptotic results with generalized estimating equations for longitudinal data. The Annals of Statistics, 2005, 33(2): 522-541.
[3] Chaganty N R. An alternative approach to the analysis of longitudinal data via generalized estimating equations. Journal of Statistical Planning and Inference, 1997, 63(1): 39-54.
[4] Fithian W, Hastie T. Local case-control sampling: Efficient subsampling in imbalanced data sets. The Annals of Statistics, 2014, 42(5): 1693-1724.
[5] Han H, Fu L. Optimal subsampling algorithm for the marginal model with large longitudinal data. arXiv preprint, 2023, arXiv:2311.08812.
[6] Gao J, Wang L, Lian H. Optimal decorrelated score subsampling for generalized linear models with massive data. Science China Mathematics, 2024, 67: 405-430.
[7] Liang K Y, Zeger S L. Longitudinal data analysis using generalized linear models. Biometrika, 1986, 73(1): 13-22.
[8] Li B. On the consistency of generalized estimating equations. Lecture Notes-Monograph Series, 1997, 32: 115-136.