This paper introduces a novel penalized empirical likelihood method tailored for high-dimensional sparse models, effectively reducing the number of estimating equations while maintaining estimator validity and consistency.
Contribution
It proposes a new penalized EL approach that regularizes both model parameters and Lagrange multipliers, enabling dimension reduction and variable selection in high-dimensional settings.
Findings
01
Achieves effective dimension reduction of estimating equations.
02
Maintains estimator consistency and asymptotic normality.
03
Demonstrates promising results in simulations and real data.
Abstract
Statistical methods with empirical likelihood (EL) are appealing and effective especially in conjunction with estimating equations through which useful data information can be adaptively and flexibly incorporated. It is also known in the literature that EL approaches encounter difficulties when dealing with problems having high-dimensional model parameters and estimating equations. To overcome the challenges, we begin our study with a careful investigation on high-dimensional EL from a new scope targeting at estimating a high-dimensional sparse model parameters. We show that the new scope provides an opportunity for relaxing the stringent requirement on the dimensionality of the model parameter. Motivated by the new scope, we then propose a new penalized EL by applying two penalty functions respectively regularizing the model parameters and the associated Lagrange multipliers in the…
Tables3
Table 1. Table 1: Simulation results for mean of a normal distribution based on 100 random samples. Here 𝜽 nonzero subscript 𝜽 nonzero \boldsymbol{\theta}_{\rm nonzero} is the average number of selected nonzero components, 𝜽 true subscript 𝜽 true \boldsymbol{\theta}_{\rm true} is the average number of true nonzero components that are selected, ME reports the model error, and No.EE’s reports the number of estimating equations selected.
Method
ME
No. EE’s
MLE-Oracle
3 (0)
NA
0.062 (0.009)
NA
MLE
100 (0)
3 (0)
2.096 (0.287)
NA
PEL-BIC
24.06 (4.13)
0.72 (0.12)
33.276 (1.507)
100 (0)
PEL-BICC
23.15 (4.08)
0.69 (0.12)
33.635 (1.483)
100 (0)
PEL-EBIC
23.15 (4.08)
0.69 (0.12)
33.635 (1.483)
100 (0)
PEL2-BIC
3.41 (0.17)
2.81 (0.04)
0.332 (0.041)
5.11 (0.34)
PEL2-BICC
3.29 (0.15)
2.80 (0.04)
0.302 (0.041)
6.13 (0.33)
PEL2-EBIC
3.15 (0.13)
2.76 (0.05)
0.341 (0.052)
8.20 (0.21)
MLE-Oracle
3 (0)
NA
0.024 (0.003)
NA
MLE
200 (0)
3 (0)
1.743 (0.179)
NA
PEL-BIC
22.02 (6.02)
0.33 (0.09)
38.078 (1.073)
199.98 (0.02)
PEL-BICC
22.02 (6.02)
0.33 (0.09)
38.078 (1.073)
199.98 (0.02)
PEL-EBIC
22.02 (6.02)
0.33 (0.09)
38.078 (1.073)
199.98 (0.02)
PEL2-BIC
6.41 (1.84)
2.84 (0.04)
0.333 (0.091)
6.67 (0.23)
PEL2-BICC
6.18 (1.84)
2.82 (0.04)
0.352 (0.092)
6.64 (0.23)
PEL2-EBIC
5.82 (1.86)
2.80 (0.04)
0.372 (0.094)
6.69 (0.24)
MLE-Oracle
3 (0)
NA
0.031 (0.005)
NA
MLE
NA
NA
NA
NA
PEL-BIC
85.71 (22.69)
0.51 (0.14)
37.585 (1.193)
500 (0)
PEL-BICC
0 (0)
0 (0)
42 (0)
500 (0)
PEL-EBIC
0 (0)
0 (0)
42 (0)
500 (0)
PEL2-BIC
2.88 (0.11)
2.70 (0.06)
0.356 (0.057)
6.40 (0.36)
PEL2-BICC
2.82 (0.09)
2.70 (0.06)
0.376 (0.058)
6.53 (0.35)
PEL2-EBIC
2.83 (0.09)
2.71 (0.06)
0.369 (0.058)
6.97 (0.32)
Table 2. Table 2: Simulation results for linear regression based on 100 replicates. Here 𝜽 nonzero subscript 𝜽 nonzero \boldsymbol{\theta}_{\rm nonzero} is the average number of selected nonzero components, 𝜽 true subscript 𝜽 true \boldsymbol{\theta}_{\rm true} is the average number of true nonzero components that are selected, ME reports the model error, and No.EE’s reports the number of estimating equations selected.
Method
ME
No. EE’s
MLE-Oracle
3 (0)
NA
0.069 (0.005)
NA
LASSO
15.21 (0.88)
3 (0)
0.439 (0.034)
NA
PEL-BIC
0 (0)
0 (0)
28.75 (0)
100 (0)
PEL-BICC
0 (0)
0 (0)
28.75 (0)
100 (0)
PEL-EBIC
0 (0)
0 (0)
28.75 (0)
100 (0)
PEL2-BIC
6.39 (0.52)
2.98 (0.02)
0.497 (0.069)
10.46 (0.46)
PEL2-BICC
6.33 (0.52)
2.98 (0.02)
0.498 (0.069)
10.49 (0.46)
PEL2-EBIC
6.06 (0.52)
2.97 (0.02)
0.531 (0.07)
10.43 (0.47)
MLE-Oracle
3 (0)
NA
0.047 (0.005)
NA
LASSO
17.79 (0.87)
3 (0)
0.374 (0.019)
NA
PEL-BIC
0 (0)
0 (0)
28.75 (0)
200 (0)
PEL-BICC
0 (0)
0 (0)
28.75 (0)
200 (0)
PEL-EBIC
0 (0)
0 (0)
28.75 (0)
200 (0)
PEL2-BIC
9.22 (1.27)
3 (0)
0.647 (0.118)
5.38 (0.17)
PEL2-BICC
9.28 (1.28)
3 (0)
0.651 (0.119)
5.39 (0.17)
PEL2-EBIC
8.38 (1.03)
3 (0)
0.632 (0.119)
5.34 (0.17)
MLE-Oracle
3 (0)
NA
0.039 (0.003)
NA
LASSO
23.79 (1.23)
3 (0)
0.507 (0.028)
NA
PEL-BIC
0 (0)
0 (0)
28.75 (0)
500 (0)
PEL-BICC
0 (0)
0 (0)
28.75 (0)
500 (0)
PEL-EBIC
0 (0)
0 (0)
28.75 (0)
500 (0)
PEL2-BIC
6.28 (1.31)
3 (0)
0.601 (0.083)
5.48 (0.16)
PEL2-BICC
5.96 (1.31)
3 (0)
0.593 (0.085)
5.38 (0.17)
PEL2-EBIC
6.04 (1.32)
3 (0)
0.602 (0.086)
5.41 (0.16)
Table 3. Table 3: Simulation results for regression model for longitudinal data with repeated measures based on 100 replicates. Here 𝜽 nonzero subscript 𝜽 nonzero \boldsymbol{\theta}_{\rm nonzero} is the average number of selected nonzero components, 𝜽 true subscript 𝜽 true \boldsymbol{\theta}_{\rm true} is the average number of true nonzero components that are selected, ME reports the model error, and No.EE’s reports the number of estimating equations selected.
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Full text
A New Scope of Penalized Empirical Likelihood with High-Dimensional Estimating Equations
Jinyuan Chang
Southwestern University of
Finance and Economics
Cheng Yong Tang
Temple University
Tong Tong Wu
University of Rochester
Abstract
Statistical methods with empirical likelihood (EL) are appealing and effective especially in conjunction with estimating equations through which useful data information can be adaptively and flexibly incorporated. It is also known in the literature that EL approaches encounter difficulties when dealing with problems having high-dimensional model parameters and estimating equations. To overcome the challenges,
we begin our study with a careful investigation on high-dimensional EL from a new scope targeting at estimating a high-dimensional sparse model parameters.
We show that the new scope provides an opportunity for relaxing the stringent requirement on the dimensionality of the model parameter.
Motivated by the new scope, we then propose a new penalized EL by applying two penalty functions respectively regularizing the model parameters and the associated Lagrange multipliers in the optimizations of EL. By penalizing the Lagrange multiplier to encourage its sparsity, we show that drastic dimension reduction in the number of estimating equations can be effectively achieved without compromising the validity and consistency of the resulting estimators. Most attractively, such a reduction in dimensionality of estimating equations is actually equivalent to a selection among those high-dimensional estimating equations, resulting in a highly parsimonious and effective device for high-dimensional sparse model parameters. Allowing both the dimensionalities of model parameters and estimating equations growing exponentially with the sample size, our theory demonstrates that the estimator from our new penalized EL is sparse and consistent with asymptotically normally distributed nonzero components. Numerical simulations and a real data analysis show that the proposed penalized EL works promisingly.
Statistical approaches using estimating equations are widely applicable to solve a broad class of practical problems. The most influential special cases of estimating equations include the fundamental maximum likelihood score equations and those from the popular generalized methods of moments (Hansen, 1982). The approaches of using estimating equations are particularly appealing in practice with merits from requiring less stringent distributional assumptions on the data model, yet being adaptable to flexibly incorporate suitable information and conditions extracted from practical features in various scenarios of interests.
Empirical likelihood (EL, hereinafter) (Owen, 2001) coupled with estimating equations has been demonstrated successful since the seminal work of Qin and Lawless (1994). It is particularly appealing that the maximum EL estimator asymptotically achieves the semiparametric efficiency bound (Qin and Lawless, 1994). The properties of EL are also desirable through some higher order analyses (Newey and Smith, 2004; Chen and Cui,, 2006, 2007). Moreover, the Wilks’ theorems (Owen, 2001; Qin and Lawless, 1994) for EL ensure that EL ratio is asymptotically central chi-square distributed when evaluated at the truth. Hence, EL provides an analogous device to the conventional fully parametric likelihood for statistical inferences, but without requiring a fully parametric likelihood built upon more stringent distributional assumptions.
In recent years, high data dimensionality in practice has attracted increasing research attention and brought unprecedented challenges to approaches based on estimating equations and EL. On one hand, studies in Chen, Peng and Qin (2009), Hjort, McKeague and Van Keilegom (2009), Tang and Leng (2010), Leng and Tang (2012), and Chang, Chen and Chen (2015) reveal that conventional asymptotic schemes and results for EL are expected to work only when both the dimensionality of the parameter p and the number of the estimating equations r are growing at some rate slower than the sample size n. On the other hand, however, challenges due to high-dimensionality require a capacity to deal with cases where both p and r can be much larger than n. Tang and Leng (2010), Leng and Tang (2012), and Chang, Chen and Chen (2015) attempt to utilize sparsity of the model parameters by applying penalty functions on those parameters. Their results show that sparse estimators with good properties are achievable. However, the restriction from the data dimensionality is not alleviated by using penalized EL in their works.
The challenges for EL from high data dimensionality are well documented in the literature, and there are recent investigations on the remedies.
Tsao (2004) found that for fixed n with moderately large fixed p, the probability that the truth is contained in the EL based confidence region can be substantially smaller than the nominal level, resulting in the under-coverage problem. As remedies, Tsao and Wu (2013, 2014) propose extended EL to address the under-coverage problems due to the constraints on the parameter space. With a modification avoiding equality constraints, Bartolucci (2007) propose a penalized EL method via optimizing products of probability weights penalized by a loss function depending on the model parameter. Lahiri and Mukhopadhyay (2012) propose a different type of loss from that in Bartolucci (2007) and study its properties with high-dimensional model parameter and dependent data. To our best knowledge, no estimation problems have been investigated with the EL formulations of Bartolucci (2007) and Lahiri and Mukhopadhyay (2012).
In this paper, from a new scope on investigating high-dimensional sparse model parameters, we study the properties of EL by carefully examining the impacts from the data dimensionally, and exploring the opportunity from targeting at the sparse model parameter. We find that consistently estimating high-dimensional sparse model parameter by a penalized EL is feasible with fewer number of estimating functions than the model parameter.
Such an observation motivates us to propose a new penalized EL approach to tackle high-dimensional statistical problems where both the numbers of model parameters and estimating equations, p and r respectively, can grow at an exponential rate of the sample size n. We solve the problem by employing two penalty functions when constructing the EL with high-dimensional estimating equations. Specifically, the first penalty function is on the magnitude of the model parameters with the goal to encourage sparsity in the resulting estimator. Additionally, a second penalty function is imposed on the Lagrange multiplier to encourage its sparsity when optimizing the EL evaluated at given values of the parameters. We also observe that obtaining a sparse Lagrange multiplier in EL is equivalent to reducing the dimensionality r via an effective selection among those estimating equations, which itself is an interesting problem and a new scope; see our discussions in Sections 2 and 3.
Here we note that the effect of the sparsity encouraging penalty on the Lagrange multiplier relates to the methods for selecting moments in the GMM methods, a problem that has been extensively studied in the econometrics literature; see, among others, Cheng and Liao (2015) and reference therein.
Recently, Cheng and Liao (2015) and Shi (2016) study the problem with many moment conditions for estimating a fixed dimensional model parameter. Cheng and Liao (2015) propose to treat the sample averages of the moment conditions as additional parameters to be optimized, and to apply the L1 penalty on them to encourage sparsity so that effective moment selection can be achieved. The role of the L1 penalty in their approach is seen similar to ours on the Lagrange multiplier for the purpose of moment selection. In light of the Dantzig selector approach of Candes and Tao (2007), Shi (2016) propose a new EL formulation by relaxing the equality constraints to inequality ones involving some regularization parameter, so that effective selection of the moment conditions is also achieved.
Nevertheless, none of Cheng and Liao (2015) and Shi (2016) investigates the impacts from diverging number of model parameters that potentially can be sparse.
Our investigation contributes to the area of EL with high-dimensional statistical problems from a new scope. Our approach successfully extends the EL approach with estimating functions to scenarios allowing both p and r growing exponentially with the sample size n. As shown in Sections 2 and 3, new results for high-dimensional penalized EL are established, and many of them are interesting in both areas of EL and estimating equations. Our analysis first reveals a result of its own interests that substantially broadens the understanding of the relationship between the number of estimating equations r and the number of model parameters p with penalized EL. Surprisingly, we find that with an appropriate penalization, a consistent and sparse estimator of the model parameter actually does not require r≥p, thanks to the new scope from estimating a sparse model parameter. In particular, we show that a sparse estimator with s nonzero components for the p-dimensional parameter technically may only require that the number of estimating equations r to be no less than s. Such a result crucially supports the motivation in our new penalized EL approach for the second penalty function imposed on the Lagrange multiplier to reduce the effective number of estimating equations actually involved in the high-dimensional penalized EL. That is, the resulting sparse Lagrange multiplier from the penalization is equivalent to a selection among available estimating equations for the model parameters. Our theory shows that the penalized EL estimator is consistent and can estimate the zero components of the model parameters as zero with probability tending to one. Additionally, the nonzero components of the penalized EL estimator is asymptotically normally distributed.
The rest of this paper is organized as follows. The new scope with high-dimensional sparse model parameter on EL and penalized EL is investigated in Section 2. The new penalized EL with an additional penalty function on the Lagrange multiplier and its properties for estimating high-dimensional sparse model parameters are given in Section 3. An algorithm using coordinate descent for solving the penalized EL is presented in Section 4. Numerical examples with simulated and real data are shown in Section 5. Some discussions are given in Section 6. All technical details are provided in Section 7. The Supplementary Material contains more technical proofs of the theoretical results.
2 Empirical likelihood and penalized empirical likelihood
2.1 An overview of empirical likelihood with diverging dimensionality
Let us define some notations first. For a q-dimensional vector a=(a1,…,aq)T, let ∣a∣∞=max1≤k≤q∣ak∣, ∣a∣1=∑k=1q∣ak∣ and ∣a∣2=(∑k=1qak2)1/2 be its L∞-norm, L1-norm, and L2-norm, respectively. For a q×q matrix M=(mij)q×q, let ∥M∥∞=max1≤i≤q∑j=1q∣mij∣,
∥M∥2=λmax1/2(MTM) and ∥M∥F=(∑i,j=1qmij2)1/2 be the L∞-norm, L2-norm and Frobenius-norm of M, respectively.
Let X1,…,Xn be d-dimensional independent and identically distributed generic observations and θ=(θ1,…,θp)T be a p-dimensional parameter with support Θ. For an r-dimensional estimating function
g(X;θ)={g1(X;θ),…,gr(X;θ)}T,
the information for the model parameter θ is collected by the unbiased moment condition
[TABLE]
where θ0∈Θ is the unknown truth. When the sample size n grows, following Hjort, McKeague and Van Keilegom (2009) and Chang, Chen and Chen (2015), the observations {g(Xi;θ)}i=1n can be viewed as a triangular array where r, p, d, Xi, θ and g(X;θ) may all depend on the sample size n.
Following the idea of EL (Owen, 1988, 1990), Qin and Lawless (1994) investigate an EL with estimating equations:
[TABLE]
By maximizing L(θ) with respect to θ, one obtains the so-called maximum EL estimator θ=argmaxθ∈ΘL(θ). Maximizing (2.2) can be carried out equivalently by solving the corresponding dual problem, implying
[TABLE]
where Λn(θ)={λ∈Rr:λTg(Xi;θ)∈V,i=1,…,n} for θ∈Θ and V is an open interval containing zero.
In a conventional setting when p and r are fixed as n→∞, r≥p is required to ensure that all components of θ are identifiable. In high-dimensional cases, however, it is documented in the literature that accommodating a diverging r is a key difficulty for EL; see, among others, Hjort, McKeague and Van Keilegom (2009), Chen, Peng and Qin (2009), Leng and Tang (2012), and Chang, Chen and Chen (2015). The reason is that the Lagrange multiplier λ∈Rr in (2.3) is of the same high dimensionality r. Since ∣λ∣2 is required to be op(1) in theoretical analyses of EL, high-dimensional r is clearly cumbersome. A direct consequence is that dimensionality p and r for EL in (2.2) can only be accommodated at some polynomial rate of the sample size n .
To explore EL with high-dimensional statistical problems, let us begin with elucidating their impacts on the EL estimator synthetically from the sample size n, the number of estimating functions r, and the dimensionality of the model parameter p.
We first present a general result for the maximum EL estimator θ with r estimating equations.
Proposition 1**.**
Assume that there exist uniform constants C1>0, C2>1 and γ>2 such that
[TABLE]
and
[TABLE]
If r=o(n1/2−1/γ), then θ defined in (2.3) satisfies ∣gˉ(θ)∣2=Op(r1/2n−1/2) where gˉ(θ)=n−1∑i=1ng(Xi;θ).
Conditions for Proposition 1 are conventional ones and are mild. The requirement (2.4) ensures that some moments with order larger than 2 exist for the estimating functions, and (2.5) says that the sample covariance matrices of the estimating functions should behave reasonably well. Consistent with the finding in Hjort, McKeague and Van Keilegom (2009) and Chen, Peng and Qin (2009), the higher the order of the moment γ is, the more estimating functions can be accommodated. When the estimating functions are bounded, γ=∞, r is allowed to be o(n1/2).
The key implication of Proposition 1 is that the sample mean of the estimating functions is well behaving, regardless the dimensionality of the model parameter p is. That is, with r unbiased estimating functions, the optimum ∣gˉ(θ)∣2 is Op(r1/2n−1/2). Hence the impact on the behavior of the estimating function is the dimensionality r, which cannot grow faster than n1/2 as n→∞.
Clearly, the impact from p on the maximum EL estimator is on the identifiability of the model parameter. That is, θ in (2.3) is not uniquely defined when r<p with no further constraints, rendering ambiguity and inapplicability of the EL methods for estimating high-dimensional model parameters.
An example of the situation is that the identifiability issue happens in the classical linear models when the model matrix is not of full rank, so that the minimum of the least squares criterion function well exists but the ordinary least squares estimator is not uniquely defined in that case.
To solve the problem, our next objective is to illustrate that identifying a sparse p-dimensional model parameter is still feasible.
2.2 High-dimensional sparse model parameter
The intuition here is that if one concerns instead a high-dimensional sparse model parameter θ such that most of its components are zeros, then identification and estimation of such a model parameter are feasible with fewer estimating functions by EL with appropriate penalization.
Specifically, we write θ0=(θ10,…,θp0)T and let S={1≤k≤p:θk0=0} with s=∣S∣.
Here S is an unknown set, and the number of nonzero components s is much smaller than p. Without loss of generality, we let θ0=(θ0,ST,θ0,ScT)T where θ0,S∈Rs being the nonzero components and θ0,Sc=0∈Rp−s.
For identification of the sparse model parameter, we impose the following condition.
Condition 1**.**
Assume that
[TABLE]
for any ε>0, where Δ(⋅) is a positive function satisfying liminfε→0+ε−βΔ(ε)≥K1 for some uniform constants K1>0 and β>0.
The identification condition (2.6) can be viewed as a dedicated one for estimating sparse model parameters. Condition 1 is not stringent, and it ensures identifying the nonzero components of θ locally. Studying local optimums in high-dimensional statistical problems is common in the literature with reasonable technical conditions; see, for example, Lv and Fan (2009) and Zhang (2010).
Condition 1 means that the mean values of the estimating functions at the truth adequately differ from those outside a small neighborhood of the sparse support of θ0.
Here β is some generic constant related to the consistency result in Proposition 2. For estimating a high-dimensional mean parameter with g(X;θ)=X−θ, we can choose Δ(ε)=ε and β=1 in Condition 1. For linear regression model, g(X;θ)=Z(Y−ZTθ) with Z and Y being the covariates and response variable respectively, and X=(Y,ZT)T, we can select Δ(ε)=ε∥ΣZ,S−1∥∞−1 in Condition 1, where ΣZ,S=E(ZSZST). More generally, if there is a subset E⊂{1,…,r} with ∣E∣=s and [E{∇θSgE(Xi;θ)}]−1 exists where gE(⋅) collects the set of estimating functions indexed by E, then we can select Δ(ε)=εinfθ∈{θ=(θST,θScT)T:θSc=0}∥[E{∇θSgE(Xi;θ)}]−1∥∞−1 in Condition 1.
Intuitively, Condition 1 ensures the identifiability of the s nonzero components of θ0 so that a consistent sparse estimator is possible as n→∞, provided r≥s, r1/2n−1/2→0, and conditions in Proposition 2.
As a special case when Sc is empty,
Condition 1 for identification becomes a global one for a dense model parameter θ:
[TABLE]
where Δ(⋅) is a positive function satisfying liminfε→0+ε−βΔ(ε)≥K1 for some uniform constants K1>0 and β>0. Similar global identification conditions can be found in Chen (2007) and Chen and Pouzo (2012) for some other models.
To estimate a sparse model parameter with unknown zero components, we consider a penalized EL estimator as
[TABLE]
where θ=(θ1,…,θp)T, and P1,π(⋅) is a penalty function with tuning parameter π. For any penalty function Pτ(⋅) with tuning parameter τ, let ρ(t;τ)=τ−1Pτ(t) for any t∈[0,∞) and τ∈(0,∞).
We assume the penalty function P1,π(⋅) belongs to the following class as considered in Lv and Fan (2009):
[TABLE]
The class of penalty function by (2.9) is broad and general.
The commonly used L1 penalty, SCAD penalty (Fan and Li, 2001) and MCP penalty (Zhang, 2010) all belong to the class P.
For establishing the consistency of θn, we also assume the following condition.
Condition 2**.**
The function gj(X;θ) is continuously differentiable with respect to θ∈Θ for any X and j=1,…,r satisfying the conditions
[TABLE]
for some uniform constant K2>0, and
[TABLE]
holds for some φn>0, which may diverge with n.
Condition 2 is on the continuity of the estimating function with respect to θ. Typically, smooth estimating functions can be assumed to have bounded derivatives so that Condition 2 is easily satisfied. At the sample level, considering the high-dimensionality of the problem, we can accommodate diverging φn in (2.11) so that our results hold in broad situations.
If there exist envelop functions Bn,jk(⋅) such that ∣∂gj(X;θ)/∂θk∣≤Bn,jk(X) for any θ∈Θ, j=1,…,r and k∈/S, and ∣E{Bn,jkm(Xi)}∣≤Km!Hm−2 for any m≥2 and j=1,…,r and k∈/S, where K and H are two uniform positive constants independent of j and k. Then by Theorem 2.8 of Petrov (1995), we know sup1≤j≤rsupk∈/Sn−1∑i=1nBn,jk(Xi)=Op(1) provided that max{logr,logp}=o(n). Therefore, (2.11) holds with φn=1, accommodating exponentially growing dimensionality r and p. Since the identifiability condition (2.6) only provides a lower bound for the difference between ∣E{g(Xi;θ)}∣∞ and [math] when θ=(θST,θScT)T satisfying ∣θS−θ0,S∣∞>ε and θSc=0, we make use of (2.10) to derive a lower bound for ∣E{g(Xi;θ)}∣∞ when θ=(θST,θScT)T satisfies θSc=0 but ∣θSc∣1 is small, and then θ0 is a local minimizer for ∣E{g(Xi;θ)}∣∞.
For special case with linear models, Condition (2.10) becomes one similar to the well known crucial irrepresentable condition (Zhao and Yu,, 2007) for sparse linear regression at the population level.
We have the following proposition on the properties of the penalized EL estimator (2.8).
Proposition 2**.**
*Let P1,π(⋅)∈P for P defined in (2.9). Define an=∑k=1pP1,π(∣θk0∣) and bn=max{rn−1,an}. Assume that (2.4), (2.5), Conditions 1 and 2 hold.
Suppose that
*
[TABLE]
for some χn→0 and cn→0 with bn1/(2β)cn−1→0. If r=o(n1/2−1/γ), max{bn,rsχnbn1/(2β)}=o(n−2/γ) and r1/2φnmax{r1/2n−1/2,s1/2χn1/2bn1/(4β)}=o(π), then there exists a local minimizer θn∈Θ for (2.8) such that ∣θn,S−θ0,S∣∞=Op{bn1/(2β)} and P(θn,Sc=0)→1 as n→∞.
In Proposition 2, an depends on the truth of the model parameter and the tuning parameter π in the penalty function. For a typical penalty function belonging to (2.9) and a model parameter with s nonzero components, it is the case that an=O(sπ)→0 as n→∞.
Requirements on the first derivative of the penalty function via χn is to control the bias introduced by the penalty function P1,π(⋅) on θn. See (7.3) in Section 7.2 for details. If we propose the condition bn=o(mink∈S∣θk0∣2β) on the magnitudes of the nonzero components of θ0, (2.12) can be replaced by
[TABLE]
for some constant c∈(0,1).
For those asymptotically unbiased penalty functions like SCAD and MCP, χn is exactly 0 in (2.13) for n sufficiently large provided that the nonzero components of θ0 are not too small in the sense that the signal strength does not diminish to zero too fast, i.e. bn=o(mink∈S∣θk0∣2β); see also Fan and Li (2001). Hence, if β=1 in Condition 1, ∣θn,S−θ0,S∣∞=Op(bn1/2)→0 as n→∞.
Further, if π is chosen as O{(n−1logp)1/2}, a common one in the literature, then ∣θn,S−θ0,S∣∞=Op{s1/2(n−1logp)1/4}, providing a conservative convergence rate of the estimator θn,S.
Let Fn(θ)=maxλ∈Λn(θ)n−1∑i=1nlog{1+λTg(Xi;θ)}+∑k=1pP1,π(∣θk∣). The rationale of Proposition 2 is that for θ=(θST,θScT)T in a small neighborhood of θ0 such that ∣θS−θ0,S∣∞≥εn takes value departing from θ0, i.e., Δ(εn) decays
to zero at some slow enough rate, Fn(θ) takes a value larger than ξnFn(θ0) for some diverging ξn with probability tending to 1; see also Chang, Tang and Wu (2013, 2016) for such a phenomenon
of EL. Then with the penalty function encouraging sparsity of θn, we are able to establish the consistency of the penalized EL estimator for a sparse model parameter.
Our Proposition 2 shows that the penalized EL can consistently estimate a high-dimensional model parameter with p growing exponentially with n provided bn→0, though the requirement on r remains in a way such that r=o(n1/2).
The development of Proposition 2 is fundamentally facilitated by our motivation: to estimate a high-dimensional sparse model parameter. With the new identification condition (2.6), sparse and consistent estimator can be obtained by using penalized EL.
The intuition of our results is clear: to identify s nonzero components of a sparse p-dimensional model parameter, one essentially requires r(r≥s) informative estimating functions for those s components. The practical interpretation is also clear: given fewer estimating functions than the model parameters, a reasonable direction is to identify and estimate a sparse model parameter.
Such an observation is consistent with the ones found in Gautier and Tsybakov (2014) for high-dimensional instrumental variables regression with endogenity where the number of instrumental variables may be less than the model parameters in the regression problems.
3 A new penalized empirical likelihood
With the penalized EL estimator θn in (2.8) capable of handling high-dimensional model parameter with fewer number of estimating functions, our next goal is to accommodate a more general situation: allowing both r and p to grow exponentially with n. For such a purpose, we propose to update the penalized EL estimator with an extra penalty encouraging sparsity in λ:
[TABLE]
where θ=(θ1,…,θp)T, λ=(λ1,…,λr)T, and P1,π(⋅) and P2,ν(⋅) are two penalty functions with tuning parameters π and ν, respectively. Our motivation is that with appropriately chosen penalty function P2,ν(⋅) and tuning parameter ν, the estimator θn is associated with a sparse Lagrange multiplier λ. Since sparse λ effectively uses a subset of the estimating functions g(⋅;⋅), r itself can be allowed to be large as long as the number of nonzero components in λ is small, essentially satisfying the requirement in Proposition 2. Hence, one expects analogous properties of (3.1) to those in Proposition 2, but now being capable of accommodating high-dimensional p and r simultaneously.
Not surprisingly, involving the penalty P2,ν(⋅) makes the technical analysis much more challenging, especially when we are handling exponentially diverging p and r with n→∞.
For θ∈Θ and λ∈Λn(θ), we define
[TABLE]
Here f(λ;θ) is a function of λ upon given θ. Let λ(θ)=argmaxλ∈Λn(θ)f(λ;θ) be the Lagrange multiplier defined at θ∈Θ. For any subset A⊂{1,…,r}, we denote by gA(Xi;θ) the subvector of g(Xi;θ) with components indexed by A. We write gˉA(θ)=n−1∑i=1ngA(Xi;θ), VA(θ)=n−1∑i=1ngA(Xi;θ)gA(Xi;θ)T and VA(θ)=E{gA(Xi;θ)gA(Xi;θ)T}.
For any θ∈Θ and j=1,…,r, define gˉj(θ)=n−1∑i=1ngj(Xi;θ). We first characterize the properties of λ(θ) for θ near the truth θ0. To do this, we assume the following condition for the existence of higher order moments, a similar one to the common technical conditions on the tail probability in high-dimensional statistical analysis.
Condition 3**.**
There exist some K3>0 and γ>4 such that
[TABLE]
Let ρ2(t;ν)=ν−1P2,ν(t). We also take P2,ν(⋅)∈P for P defined in (2.9), so that ρ2′(0+;ν) is independent of ν. We write it as ρ2′(0+) for simplicity and define Mθ={1≤j≤r:∣gˉj(θ)∣≥νρ2′(0+)} for any θ∈Θ. Proposition 3 below shows that for any θ near the truth θ0, the support of the Lagrange multiplier λ(θ) is a subset of Mθ with probability approaching one.
Proposition 3**.**
Let {θn} be a sequence in Θ and P2,ν(⋅)∈P be a convex function for P defined in (2.9). For some C∈(0,1), define Mθn∗={1≤j≤r:∣gˉj(θn)∣≥Cνρ2′(0+)}. Assume Condition 3 hold. Further, for the sequence {θn}, we assume that the eigenvalues of VMθn(θn) are uniformly bounded away from zero and infinity with probability approaching one, and ∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2=Op(un) for some un→0. Let max1≤j≤rn−1∑i=1n∣gj(Xi;θn)∣2=Op(ςn) for some ςn>0 that may diverge with n. If mn1/2unςn=o(ν) and mn1/2unn1/γ=o(1) where mn=∣Mθn∗∣, then with probability approaching one there exists a sparse local maximizer λ(θn)=(λn,1,…,λn,r)T for f(λ;θn) satisfying the three results: (i)∣λ(θn)∣2=Op(un), (ii)supp{λ(θn)}⊂Mθn, and (iii)\mboxsgn(λn,j)=\mboxsgn{gˉj(θn)} for any j∈Mθn with λn,j=0.
Conditions in Proposition 3 play roles from a few aspects. First, the sequence {θn} can be taken as one that approaches the truth θ0 as n→∞.
Then gˉMθn(θn) will be small when n is large.
As shown in the proof, νρ2′(0+)\mboxsgn{gˉMθn(θn)} is the asymptotically leading term of gˉMθn(θn).
The reason is that the tuning parameter ν typically diminishes to [math] at some slower rate than n−1/2, so that νρ2′(0+)\mboxsgn{gˉMθn(θn)} leads to a non-negligible contribution in the limiting distribution of θn, and our analysis shows that it leads to a correctable bias term in θn.
Upon removing the leading order term, we assume that ∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2=Op(un) with un→0, which is a condition that can be easily satisfied. Requirement on the eigenvalues of VMθn(θn) is natural so that we can characterize the limiting behavior of the estimator θn.
Furthermore, mn is taken to be an upper bound of the size of Mθn,
the generic description such as mn1/2unςn=o(ν) and mn1/2unn1/γ=o(1) can be viewed as characterizing the capacity of the penalized EL under which it is reliable for consistent estimators, depending on the behavior of the estimating function g(⋅;⋅) on its continuity and tail probabilistic properties.
Proposition 3 implies that when θ is approaching θ0, the sparse λ in (3.1) effectively conducts a moments selection by choosing the estimating functions in a way that gˉj(θ) has large absolute deviation from [math].
Let μj(θ)=E{gj(Xi;θ)}, then we know that μj(θ0)=0 and gˉj(θ)→μj(θ) in probability as n→∞. If we take θ to be in the neighborhood of θ0, then the first order Taylor expansion gives that μj(θ)=μj(θ)−μj(θ0)={∇θμj(θ∗)}T(θ−θ0) for some θ∗ between θ and θ0. Hence, those components of the estimating functions with large magnitude in the derivative of their expected value with respect to θ will be selected. Since larger derivative indicates a steeper direction towards the truth θ0, making it easier and more informative to find the optimum. Therefore, selecting components in Mθ is seen sensible. However, we note that without further strong and likely to be unrealistic conditions on the shape of the estimating functions, Mθ cannot be controlled as a fixed set even at the limiting case when n→∞, so that it will depend on the value of the parameter θ. Instead of requiring that Mθ to be fixed, we show in the following that for any choice of its subset satisfying some reasonable conditions, the resulting penalized EL estimator is consistent and asymptotically normally distributed.
Let
[TABLE]
for some cn→0 satisfying bn1/(2β)cn−1→0 where bn is more clearly specified in Condition 6 below. Based on Proposition 3, we know the support of Lagrange multiplier λ(θ) is a subset of Mθ with probability approaching one when θ is in a small neighborhood of θ0.
Here ℓn is a technical device controlling the maximum number of effective estimating functions when applying the new penalized EL, and it can be viewed as a cap of the r in Proposition 2.
Though ℓn is a technical device, we remark that, practically, one can always achieve the control of the nonzero components of λ by appropriately choosing the tuning parameter ν.
To establish the consistency of the penalized EL estimator θn defined in (3.1), we need the following extra regularity conditions on the continuity and probabilistic behavior of the estimating functions.
Condition 4**.**
There exist uniform constants 0<K4<K5 such that K4<λmin{VF(θ0)}≤λmax{VF(θ0)}<K5 for any F⊂{1,…,r} with ∣F∣≤ℓn, where ℓn is defined in (3.2).
Condition 5**.**
Assume that
[TABLE]
for some ξn>0, ωn>0 and ϱn>0 that may diverge with n.
Condition 6**.**
Let bn=max{an,ν2} with an=∑k=1pP1,π(∣θk0∣). There exist χn→0 and cn→0 with bn1/(2β)cn−1→0 for β defined in Condition 1 such that maxk∈Ssup0<t<∣θk0∣+cnP1,π′(t)=O(χn).
Here Condition 4 is actually a weaker one than that in (2.5) in the sense that it only requires the population covariance matrices of subsets of estimating functions need to well behave at the truth θ0.
The first two bounds in Condition 5 are used to characterize the behavior of the eigenvalues of VF(θ) when θ in a small neighborhood of θ0; see Lemma 1 in Section 7.4.
We do not impose explicit rate on ξn, ωn, and ϱn, so that the conditions are generally not restrictive. Similar to our earlier discussion for φn in (2.11) in Condition 2, we can actually choose ξn=ωn=ϱn=1 under some additional mild conditions provided that max{logr,logp}=o(n). Condition 6 is similar to (2.12) in Proposition 2 with a differently defined bn.
Similar to that in Proposition 2, Condition 6 can be replaced by (2.13) if the minimal signal strength condition is satisfied for appropriately chosen tuning parameter π. Then χn=0 when n is large for those asymptotically unbiased penalty functions like SCAD and MCP.
We now present the following theorem for the consistency of θn.
Theorem 1**.**
Let P1,π(⋅),P2,ν(⋅)∈P for P defined in (2.9), and P2,ν(⋅) be a convex function with bounded second derivative around [math]. Assume Conditions 1–6* hold. Let bn=max{an,ν2} with an=∑k=1pP1,π(∣θk0∣), and κn=max{ℓn1/2n−1/2,s1/2χn1/2bn1/(4β)}. If logr=o(n1/3), ϱn=o(n2), s2ℓnωnbn1/β=o(1), ℓn2n−1ϱnlogr=o(1), max{bn,ℓnκn2}=o(n−2/γ), ℓn1/2ϱn1/2κn=o(ν) and ℓn1/2ξn1/2max{ℓnν,s1/2χn1/2bn1/(4β)}=o(π), then there exists a local minimizer θn∈Θ for (3.1) such that ∣θn,S−θ0,S∣∞=Op{bn1/(2β)} and P(θn,Sc=0)→1 as n→∞.*
Theorem 1 establishes the consistency of θn in the sense that ∣θn−θ0∣∞p0. The convergence rate Op{bn1/(2β)} is a conservative one before we establish the asymptotic normality of the penalized EL estimator θn,S later. Under additional regularity conditions, such a rate can be improved as Op(ν).
Results in Theorem 1 holds for broad situations accommodating various cases of the estimating functions.
In reasonable cases that we discussed earlier, χn=0 and ξn=ωn=ϱn=1. Theorem 1 holds provided that logr=o(n1/3), ℓn=o(min{n1/2(logr)−1/2,n1/2−1/γ}), an=o(min{s−2βℓn−β,n−2/γ}), and the tuning parameters ν and π satisfy ℓnn−1/2=o(ν), ν=o(min{s−βℓn−β/2,n−1/γ}) and ℓn3/2ν=o(π).
Noticing that an≲sπ, by choosing π=o(min{s−2β−1ℓn−β,s−1n−2/γ}) can ensure the consistency result.
Additionally, we note that s≤ℓn. Thus by letting
logr≍nτ and ℓn≍nδ for some τ∈[0,31) and δ∈[0,min{7γγ−4,6β+71}), θn satisfies Theorem 1 if ν≍n−ϕ1 and π≍n−ϕ2 with ϕ1∈(max{23βδ,γ1},21−δ) and ϕ2∈(max{(3β+1)δ,γ2+δ},ϕ1−23δ), which are reasonable choices for the tuning parameters.
To further establishing the limiting distribution of θn,S, we need the following two additional conditions.
Condition 7**.**
For each j=1,…,p, gj(X;θ) is twice continuously differentiable with respect to θ in Θ for any X, and
[TABLE]
for some ϖn≥0 that may diverge with n.
Condition 8**.**
Let QF=[E{∇θSgF(Xi;θ0)}]T[E{∇θSgF(Xi;θ0)}] for any F⊂{1,…,r}. There exist uniform constants 0<K6<K7 such that K6<λmin(QF)≤λmax(QF)<K7 for any F with s≤∣F∣≤ℓn.
Following similar discussion for Condition 5, ϖn=1 in Condition 7 for reasonable models in practice. Let Rn=supp{λ(θn)} and define
[TABLE]
We have the following limiting distribution for θn,S.
Theorem 2**.**
Let P1,π(⋅),P2,ν(⋅)∈P for P defined in (2.9), and P2,ν(⋅) be a convex function with bounded second derivative around [math]. Assume Conditions 1–8* hold. Let bn=max{an,ν2} with an=∑k=1pP1,π(∣θk0∣), and κn=max{ℓn1/2n−1/2,s1/2χn1/2bn1/(4β)}. If logr=o(n1/3), ϱn=o(n2), bn=o(n−2/γ), nsχn2=o(1), ℓn2ϱn1/2(logr)max{s2(ωn+sϖn)bn1/β,n−1(sωn+ℓnϱn)logr}=o(1), nℓnκn4max{sωn,n2/γ}=o(1), nℓns2ϖnmax{ℓn2ν4,s2χn2bn1/β}=o(1), ℓn1/2ϱn1/2κn=o(ν) and ℓn1/2ξn1/2max{ℓnν,s1/2χn1/2bn1/(4β)}=o(π), then local minimizer θn∈Θ for (3.1) specified in Theorem 1 satisfies*
[TABLE]
as n→∞, where JRn and ψRn are defined in (3.3).
Theorem 2 shows that
subject to a bias correction, the penalized EL estimator for nonzero components is asymptotically normal in the sense of (3.4).
The bias term ψRn in (3.4) is due to the penalty function P2,ν(⋅) used in (3.1); see also our discussion after the Proposition 3. Write λ(θn)=(λ1,…,λr)T. Furthermore, as shown in (7.10) in Section 7, the correctable bias term is ψRn=JRn−1{∇θSgˉRn(θn)}TVRn−1(θn)ηRn where η=(η1,…,ηr)T with ηj=νρ2′(∣λj∣;ν)sgn(λj) for λj=0 and ηj∈[−νρ2′(0+),νρ2′(0+)] for λj=0.
Similar to that in Theorem 1, with reasonable cases χn=0 and ξn=ωn=ϱn=ϖn=1, descriptions on the dimensionality in Theorem 2 can be simplified.
If ℓn≍s, Theorem 2 holds provided that logr=o(n1/3), s=o(min{n1/3(logr)−2/3,n1/(10β+7)(logr)−2β/(10β+7),n(γ−4)/(7γ)}), and ν and π satisfying sn−1/2=o(ν),
ν=o(min{n−1/γ,s−5β/2(logr)−β/2,n−1/4s−5/4}), s3/2ν=o(π) and
π=o(min{n−2/γs−1,s−5β−1(logr)−β}).
Generally speaking, conditions in Theorem 2 is stronger than those in Theorem 1, which can be viewed as the expense for the stronger asymptotic normality results. In summary, we have established that the sparse penalized EL estimator (3.1) has desirable properties including consistency in estimating nonzero components and identifying zero components of θ0, and asymptotic normality for the estimator of the nonzero components of θ0.
4 Algorithms for implementations
For ease and stability in implementations, we calculate the penalized EL estimator θn by minimizing the following slightly modified objective function:
[TABLE]
where log⋆(z) is a twice differentiable pseudo-logarithm function with bounded support adopted from Owen (2001):
[TABLE]
where ϵ is chosen as 1/n in our implementations.
Here P1,π(⋅) and P2,ν(⋅) are two penalty functions with tuning parameters π and ν, respectively. In the optimization, we apply the quadratic approximation (Fan and Li, 2001) to the penalty functions P1,π(⋅) and P2,ν(⋅). More specifically, for a penalty function Pτ(⋅), the quadratic approximation states
[TABLE]
for t being in a small neighborhood of t0. The first and second derivatives are approximated by
[TABLE]
The computation of EL is a challenging aspect, especially with high-dimensional p and r. To compute the penalized EL estimator θn, we propose to apply a modified two-layer coordinate decent algorithm extending the one in Tang and Wu (2014). The inner layer of the algorithm solves for λ with given θ by maximizing f(λ;θ) as given in Section 3. This layer only involves maximizing a concave function, and hence is stable. The outer layer of the algorithm searches for the optimizer θn. Both layers can be solved using coordinate descent by cycling through and updating each of the coordinates; see Tang and Wu (2014).
In the inner layer, λ is solved at a given θ, which can be done by optimizing (4.1) with respect to λ using coordinate descent. Suppose that λ starts at an initial value λ(0). With the other coordinates fixed, the (m+1)th Newton’s update for λj(j=1,…,r), the jth component of λ, is given by
[TABLE]
where ti(m)=1+g(Xi;θ)Tλ(m) with λ(m)=(λ1(m),…,λr(m))T. The procedure cycles through all the r components of λ and is repeated until convergence. During this process, the objective function needs to be checked to ensure it gets optimized in each step. If not, the step size continues to be halved until the objective function gets driven in the right direction. The iterative updating procedure (4.4) can be viewed as sequential univariate optimizations. The convergence rate and stability are studies in the optimization literature; see for example Friedman et al. (2007) and Wu and Lange (2008).
The outer layer of the algorithm is to optimize (4.1) with respect to the parameter θ, the main interest of the penalized EL, using the coordinate descent algorithm. At a given λ, the algorithm updates θk(k=1,…,p), by minimizing Sn(θ) defined in Section 3 with respect to θk with other θl(l=k) fixed. Suppose that θ starts at an initial value θ(0). The (m+1)th update for θk is given by
[TABLE]
where si(m)=1+λTg(Xi;θ(m)), wik(m)=λT∂g(Xi;θ(m))/∂θk and zik(m)=λT∂2g(Xi;θ(m))/∂θk2 with θ(m)=(θ1(m),…,θp(m))T. Since quadratic approximations are applied in the algorithms, we follow Fan and Li (2001) and set a component λj(m) or θk(m) as zero when it is less than a threshold level say 10−3 in an iteration.
We summarize the computation procedure for θ and λ in the following pseudo-code. Suppose ξ is a pre-defined small number, say, ξ=10−4.
Set the iteration counter m=0, and initialize θ(0) and λ(0);
(Inner layer) For j=1,…,r, update λj(m) as λj(m+1) defined in (4.4);
If max1≤k≤p∣θk(m+1)−θk(m)∣<ξ, then stop;
Otherwise repeat steps 3 through 4.
5 Numerical examples
5.1 Estimating high-dimensional mean parameter
The first simulation study is to calculate the mean of a multivariate normal distribution in Rp. Let X=(X1,…,Xp)T∼N(θ0,Σ). Suppose only three elements, X1,X2, and X5, have nonzero means and the rest p−3 elements have zero means, i.e., θ0=(5,4,0,0,1,0,…,0)T. The covariance matrix Σ=(σkl)p×p is set as σkk=1 for each k=1,…,p and σkl=0.9 for any k=l. The estimating function is simply g(X;θ)=X−θ. In this case, the number of parameters p is equal to the number of estimating equations r. We consider the underdetermined case where p=r>n. We generate 100 random samples. The SCAD penalty (Fan and Li, 2001) is used for both the penalty functions P1,π(⋅) and P2,ν(⋅) in (3.1) for all the numerical experiments in this paper. Since local quadratic approximation is applied in the algorithms, the convexity requirements of the results in Sections 2 and 3 are met.
Table 1 summarizes the results for (n,p)=(50,100), (100,200), and (100,500). The proposed penalized EL with two penalties (namely, PEL2) is compared to the single penalty approach (PEL) discussed in Tang and Leng (2010). Three information criteria for choosing the tuning parameters π and ν in the penalty functions – BIC (Schwarz,, 1978), BICC (Wang, Li and Leng, 2009), and EBIC (Chen and Chen,, 2008) – are used.
In general, all the three BIC-type criteria work similarly, with the latter two yield slightly fewer nonzero parameters. The results from MLE for all p variables and the three true variables (i.e., MLE-Oracle) are also reported. The column of θnonzero reports the average number of selected nonzero components. The column of θtrue reports the average number of true nonzero components that are selected. The difference is the average number of false predictors that get selected. The next column reports the model error (ME), which is defined by ME=(θ−θ)T(θ−θ)
for a given estimator θ.
A smaller ME means a better estimation and prediction. The last column reports the number of selected estimating equations. Obviously, in the single penalty approach, all equating equations are used since no selection is performed. In each cell, standard error appears in the parentheses.
It is clear from the table that the double-penalty approach outperforms the single-penalty approach, as expected. A much smaller subset of variables get selected with almost all the three true predictors identified by the double-penalty method. That says, the double-penalty approach yields lower false positives and higher true positives. While in the single-penalty approach, fewer true predictors are chosen in the larger set of selected variables or nothing can be picked out if p≫n. What is the most interesting is that a small number (on average 5-8) of estimating equations are selected in the double-penalty approach. As a result, the double-penalty method yields a much smaller ME than the single-penalty method.
5.2 Linear regression
In this simulation study, we consider a linear regression model
Yi=ZiTθ0+εi,
where θ0=(3,1.5,0,0,2,0,…,0)T, Zi∈Rp are generated from N(0,Σ) with σkk=1 for any k=1,…,p and σkl=0.5 for any k=l, where Σ=(σkl)p×p, and εi is a standard normal distributed random variable. Write Xi=(Yi,ZiT)T. The estimating function is g(X;θ)=Z(Y−ZTθ) with p=r.
The model error (ME) in the regression setting is defined by ME=(θ−θ)TΣ(θ−θ)
for a given estimator θ.
Table 2 reports the results for (n,p,r)=(50,100,100),(100,200,200), and (100,500,500) with the columns defined in the same way as those in Table 1.
Similar to the previous example, the single-penalty approach (PEL) of Tang and Leng (2010) is compared with the double-penalty approach (PEL2) together with the three BIC criteria for selecting the tuning parameter(s).
We also compare our method with the LASSO method with L1 penalty.
Since the number of parameters p doubles the number of subjects n, the MLE method does not work in this example. We only report the results from MLE-Oracle (i.e., the MLE method using the true predictors), which gives the smallest model error. In all the three settings, the single-penalty method fails to select any predictor when using all r estimating equations. The double-penalty method identifies all true predictors from a handful of selected ones in most cases by using only a few estimating equations.
With the default tuning parameter selection method in the LASSO, we clearly see that the number of false inclusion of the predictors is high. Hence, compared with LASSO method, we observe that our method has better performance in recovering a sparse model.
5.3 Regression model with repeated measures
This is an example with more estimating equations than the number of parameters, i.e., r>p.
Now we consider a repeated measures model such that
yij=zijTθ0+ϵij(i=1,…,n;j=1,2),
where θ0=(3,1.5,0,0,2,0,…,0)T∈Rp, zij are generated from N(0,Σ) with σkl=0.5∣k−l∣, where Σ=(σkl)p×p. The random errors (ϵi1,ϵi2)T are generated from a two-dimensional normal distribution with mean zero and unit marginal compound symmetry covariance matrix with ρ=0.7.
Let Yi=(yi1,yi2)T and Zi=(zi1T,zi2T)T respectively collect the response and predictor variables, and write Xi=(YiT,ZiT)T. To incorporate the dependence among the repeated measures from the same subject when estimating θ0, we use the quadratic estimating equations proposed by Qu, Lindsay and Li (2000):
[TABLE]
where vi is a diagonal matrix of the conditional variances of subject i, and Mj(j=1,…,m) are working correlation matrices. Note that when m=1, i.e., using only one working correlation matrix M1, the model becomes the one in Liang and Zeger (1986) and we have r=p. Here we choose two sets of basis matrices with M1 being the identity matrix of size ni and M2 being the compound symmetry with the diagonal elements of 1 and off-diagonal elements of ρ. In our setting, ni=2 and therefore r=2p estimating equations to estimate p parameters. For each simulation, we repeat the experiment 100 times.
We obtain the same quantities as those in the example of Section 5.2, and report them in Table 3.
In comparison of the single-penalty method, we can conclude from Table 3, with the columns defined in the same way as those in Table 2, that the proposed double-penalty method has much better performance. This confirms the efficacy and efficiency of adding the additional penalty on the Lagrange multiplier λ, which performs the selection of estimating equations by reducing the number of estimating equations to less than 10.
5.4 Trial of activity for adolescent girls 2 (TAAG2)
We apply the penalized EL with two penalties to examine the individual-, social-, and neighborhood-level factors associated with adolescent girls’ physical activity over time in the Trial of Activity for Adolescent Girls 2 (TAAG2) (Young et al.,, 2014; Grant, Young and Wu, 2015). The 589 girls in the Maryland site from TAAG2 were collected data at 8th grade (2009) and 11th (2011) grade. The response variable, moderate to vigorous physical activity (MVPA) minutes, were assessed from accelerometers. Forty-two variables to be considered include: (1) demographic and psychosocial information (individual- and social-level variables) that were obtained from questionnaires; (2) height, weight, and triceps skinfold to assess body composition; and (3) geographical information systems and self-report for neighborhood-level variables. There are 554 girls have complete information for all 42 variables and are used in this analysis.
A two-time point longitudinal linear mixed effects model is used to identify factors that are most relevant to MVPA. A similar model as in Section 5.3 is used with two working correlation structure matrices. Our double-penalty EL method identifies four variables are related to MVPA: Self-management strategies, Self-efficacy, Perceived barriers, and Social support. In particular, higher Self-management strategies, Self-efficacy, Social support and lower Perceived barriers are associated with higher MVPA. Our finding confirms the previous results in Young et al., (2014); Grant, Young and Wu (2015).
6 Discussion
We study a new penalized EL approach with two penalties, with one encouraging sparsity of the estimator and the other encouraging sparsity of the Lagrange multiplier in the optimizations associated with the EL. Such an approach utilizes sparsity in the target parameters and effectively achieves a moment selection procedure for estimating the sparse parameter. Both theory and numerical examples confirm the merits of the new penalized EL.
One interesting extension of the approach is to explore inferences with estimating equations after the variable selection procedure. Such a direction is a suitable stage for EL method with estimating equations who takes advantage of adaptivity to various moment conditions with less stringent distributional assumptions. The other interesting and challenging problem is to explore the optimality of the sparse estimator using estimating equations with high data dimensionality. Semiparametric efficiency of EL with estimating equations is shown in Qin and Lawless (1994). However, when the paradigm shifts to high-dimensional statistical problems, the efficiency of the sparse estimator respecting its nonzero components remains open for further investigations. We plan to address the problems in future works.
Acknowledgments
We are grateful to the Co-Editor, the Associate Editor and three
referees for very constructive comments and suggestions that have greatly improved our
paper. Chang was supported in part by a grant from the Australian Research Council.
Tang acknowledges supports from NSF Grants IIS-1546087 and SES-1533956.
Wu’s research was partially supported by NIH grants R01HL094572 and R01HL119058.
7 Proofs
In the sequel, we use the abbreviations “w.p.a.1” and “w.r.t” to denote, respectively, “with probability approaching one” and “with respect to”, and C denotes a generic positive finite constant that may be different in different uses. For simplicity and when no confusion arises, we use notation hi(θ) as equivalent to h(Xi;θ) for a generic q-dimensional multivariate function h(⋅;⋅) and denote by hi,k(θ) the kth component of hi(θ). Let hˉ(θ)=n−1∑i=1nhi(θ), and hˉk(θ)=n−1∑i=1nhi,k(θ) be the kth component of hˉ(θ). For a given set L⊂{1,…,q}, we denote by hL(⋅;⋅) the subvector of h(⋅;⋅) collecting the components indexed by L. Analogously, we let hi,L(θ)=hL(Xi;θ) and hˉL(θ)=n−1∑i=1nhi,L(θ). For an s1×s2 matrix B=(bij), let ∣B∣∞=max1≤i≤s1,1≤j≤s2∣bij∣, ∥B∥1=max1≤j≤s2∑i=1s1∣bij∣, ∥B∥∞=max1≤i≤s1∑j=1s2∣bij∣ and ∥B∥2=λmax1/2(BBT) where λmax(BBT) denotes the largest eigenvalue of BBT. Specifically, if s2=1, we use ∣B∣1=∑i=1s1∣bi1∣ and ∣B∣2=(∑i=1s1bi12)1/2 to denote the L1-norm and L2-norm of the s1-dimensional vector B, respectively.
Define An(θ,λ)=n−1∑i=1nlog{1+λTgi(θ)} for any θ∈Θ and λ∈Λn(θ). We first prove that maxλ∈Λn(θ0)An(θ0,λ)=Op(rn−1). Let λ=argmaxλ∈Λn(θ0)An(θ0,λ). Pick δn=o(r−1/2n−1/γ) and r1/2n−1/2=o(δn), which is guaranteed by r2n2/γ−1=o(1). Let λˉ=argmaxλ∈ΛnAn(θ0,λ) where Λn={λ∈Rr:∣λ∣2≤δn}. It follows from Markov inequality that max1≤i≤n∣gi(θ0)∣2=Op(r1/2n1/γ). Then max1≤i≤n,λ∈Λn∣λTgi(θ0)∣=op(1). By Taylor expansion, it holds w.p.a.1 that
[TABLE]
for some ∣c∣<1. Notice that ∣gˉ(θ0)∣2=Op(r1/2n−1/2), (7.1) yields that ∣λˉ∣2=Op(r1/2n−1/2)=op(δn). Therefore, λˉ∈int(Λn) w.p.a.1. Since Λn⊂Λn(θ0) w.p.a.1, λ=λˉ w.p.a.1 by the concavity of An(θ0,λ) and Λn(θ0). Hence, by (7.1), we have maxλ∈Λn(θ0)An(θ0,λ)=Op(rn−1).
We then show ∣gˉ(θ)∣2=Op(r1/2n−1/2). For δn specified above, let λ∗=δngˉ(θ)/∣gˉ(θ)∣2, then λ∗∈Λn. By Taylor expansion, it holds w.p.a.1 that
[TABLE]
for some ∣c∣<1. Notice that An(θ,λ∗)≤maxλ∈Λn(θ)An(θ,λ)≤maxλ∈Λn(θ0)An(θ0,λ)=Op(rn−1), thus ∣gˉ(θ)∣2=Op(δn). Consider any ϵn→0 and let λ∗∗=ϵngˉ(θ), then ∣λ∗∗∣2=op(δn). Using the same arguments above, we can obtain ϵn∣gˉ(θ)∣22−Cϵn2∣gˉ(θ)∣22{1+op(1)}=Op(rn−1). Then ϵn∣gˉ(θ)∣22=Op(rn−1). Notice that we can select arbitrary slow ϵn→0, following a standard result from probability theory, we have ∣gˉ(θ)∣22=Op(rn−1). Hence, we complete the proof. \hfill□
Define Fn(θ)=maxλ∈Λn(θ)An(θ,λ)+∑k=1pP1,π(∣θk∣) where θ=(θ1,…,θp)T and An(θ,λ)=n−1∑i=1nlog{1+λTgi(θ)}. Recall an=∑k=1pP1,π(∣θk0∣) and bn=max{rn−1,an}. As shown in the proof of Proposition 1, maxλ∈Λn(θ0)An(θ0,λ)=Op(rn−1) which implies Fn(θ0)=Op(rn−1)+an. Define Θ∗={θ=(θST,θScT)T:∣θS−θ0,S∣∞≤ε,∣θSc∣1≤n−1/2φn−1} for some fixed ε>0. Let θn=argminθ∈Θ∗Fn(θ). As Fn(θn)≤Fn(θ0), we have Fn(θn)≤Op(rn−1)+an=Op(bn).
We will first show that θn∈int(Θ∗) w.p.a.1. To do this, our proof includes two steps: (i) to show that for any ϵn→∞ satisfying bnϵn2βn2/γ=o(1), there exists a uniform constant K>0 independent of θ such that P{Fn(θ)>Kbnϵn2β}→1 as n→∞ for any
θ=(θST,θScT)T∈Θ∗ satisfying ∣θS−θ0,S∣∞>ϵnbn1/(2β). Thus ∣θn,S−θ0,S∣∞=Op{ϵnbn1/(2β)}. Notice that we can select arbitrary slow diverging ϵn, following a standard result from probability theory, we have ∣θn,S−θ0,S∣∞=Op{bn1/(2β)}, (ii) to show that ∣θn,Sc∣1<n−1/2φn−1.
For (i), we will use the technique developed for the proof of Theorem 1 in Chang, Tang and Wu (2013). For any θ=(θST,θScT)T∈Θ∗ satisfying ∣θS−θ0,S∣∞>ϵnbn1/(2β), define θ∗=(θST,0T)T and let j0=argmax1≤j≤r∣E{gi,j(θ∗)}∣. Define μj0=E{gi,j0(θ)}, μj0∗=E{gi,j0(θ∗)}, and λ=δbn1/2ϵnβej0 where δ>0 is a constant to be determined later, and ej0 is an r-dimensional vector with the j0-th component being 1 and other components being [math]. Without lose of generality, we assume μj0∗>0. (2.4) and Markov inequality yield that max1≤i≤n∣gi,j0(θ)∣=Op(n1/γ), which implies max1≤i≤n∣λTgi(θ)∣=Op(bn1/2ϵnβn1/γ)=op(1). Then λ∈Λn(θ) w.p.a.1. Write θ=(θ1,…,θp)T and λ=(λ1,…,λr)T. By the definition of Fn(θ), it
holds w.p.a.1 that
[TABLE]
for some ∣c∣<1 and λj0=δbn1/2ϵnβ. Therefore, it holds that
[TABLE]
From (2.4) and Markov inequality, there exists a uniform positive constant L independent of θ such that P{n−1∑i=1ngi,j02(θ)>L}→0. Thus, with δ=(K/L)1/2, we have
[TABLE]
From (2.6) and (2.10), we know that μj0∗≥Δ(ϵnbn1/(2β))≥K1ϵnβbn1/2/2 with K1 specified in (2.6) for sufficiently large n, and
[TABLE]
for K2 specified in (2.10). Therefore, μj0≥K1ϵnβbn1/2/3 for sufficiently large n.
For sufficiently small K independent of θ, we have
2bn1/2ϵnβ(KL)1/2−μj0≤−cμj0
for some 0<c<1, which implies that
n1/2{2bn1/2ϵnβ(KL)1/2−μj0}≤−cn1/2μj0≲−ϵnβbn1/2n1/2→−∞.
As n−1/2∑i=1n{gi,j0(θ)−μj0}dN(0,σ2) for some σ>0, it holds that P{Fn(θ)≤Kbnϵn2β}→0. Hence, we complete the proof for (i).
For (ii), if ∣θn,Sc∣1=n−1/2φn−1, we define θn∗=(θn,ST,τθn,ScT)T for some τ∈(0,1) and will show Fn(θn∗)<Fn(θn) w.p.a.1. Notice that θn=argminθ∈Θ∗Fn(θ). This will be a contradiction. Therefore, ∣θn,(2)∣1<n−1/2φn−1. Write θn=(θn,1,…,θn,p)T and θn∗=(θn,1∗,…,θn,p∗)T. By the definition of Fn(θ) and the inequality Fn(θn)≤Fn(θ0), it holds that
[TABLE]
On the other hand, it holds that
[TABLE]
for some ck∈(0,1).
As we have shown in Section 7.1, maxλ∈Λn(θ0)An(θ0,λ)=Op(rn−1). Therefore, maxλ∈Λn(θn)An(θn,λ)=Op(rn−1)+Op{sχnbn1/(2β)}. Pick δn satisfying δn=o(r−1/2n−1/γ) and max{rn−1,sχnbn1/(2β)}=o(δn2), which can be guaranteed by r2n2/γ−1=o(1) and rsχnbn1/(2β)n2/γ=o(1). Same as (7.2), we have
[TABLE]
which implies ∣gˉ(θn)∣2=Op(δn). Following the same arguments in Section 7.1 below (7.2), we have ∣gˉ(θn)∣2=Op(r1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}. Notice that
∣gˉ(θn∗)∣2≤∣gˉ(θn)∣2+∣{∇θgˉ(θˉ)}(θn∗−θn)∣2 for some θˉ lying on the jointing line between θn and θn∗. Since θn,S=θn,S∗, by (2.11), it holds that ∣{∇θgˉ(θˉ)}(θn∗−θn)∣2=Op(r1/2n−1/2). Hence, ∣gˉ(θn∗)∣2=Op(r1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}. Write λ∗=argmaxλ∈Λn(θn∗)An(θn∗,λ). Following the same arguments for (7.1), it holds that ∣λ∗∣2=Op(r1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}. Since θn∗=(θn,ST,τθn,ScT)T and Fn(θn)≥An(θn,λ∗)+∑k=1pP1,π(∣θn,k∣), then
[TABLE]
for some θˇ lying on the jointing line between θn and θn∗. Notice that max1≤i≤n∣λ∗,Tgi(θˇ)∣=op(1), then
[TABLE]
On the other hand,
[TABLE]
for some ck∈(0,1). If r1/2φnmax{r1/2n−1/2,s1/2χn1/2bn1/(4β)}=o(π), (7.4) implies Fn(θn∗)<Fn(θn) w.p.a.1. Hence, we complete the proof of (ii).
Nextly, we will show P(θn,Sc=0)→1. Define
[TABLE]
for θ=(θ1,…,θp)T. Then θn and its Lagrange multiplier λ satisfy the score equation ∇λGn(θn,λ)=0. By the implicit theorem [Theorem 9.28 of Rudin (1976)], for all θ in a ∣⋅∣2-neighborhood of θn, there is a λ(θ) such that ∇λGn{θ,λ(θ)}=0 and λ(θ) is continuously differentiable in θ. By the concavity of Gn(θ,λ) w.r.t λ, Gn{θ,λ(θ)}=maxλ∈Λn(θ)Gn(θ,λ). Write λ=(λ1,…,λr)T. From the envelope theorem,
[TABLE]
Write h=(h1,…,hp)T=∇θGn{θ,λ(θ)}∣θ=θn. Let ρ1(t;π)=π−1P1,π(t). Since P1,π(⋅)∈P, ρ1′(0+;π) is independent of π. We write it as ρ1′(0+) for simplicity. Therefore, for each k=1,…,p,
[TABLE]
where κk=πρ1′(∣θk∣;π)sgn(θk) for θk=0 and κk∈[−πρ1′(0+),πρ1′(0+)] otherwise. From Triangle inequality, it holds that
[TABLE]
As r1/2φnmax{r1/2n−1/2,s1/2χn1/2bn1/(4β)}=o(π), if θk=0 for some k∈/S, then πρ1′(∣θk∣;π)sgn(θk) will dominates the sign of hk. According to the arguments for the proof of Lemma 1 in Fan and Li (2001), we know θn,Sc=0 w.p.a.1. Hence, we complete the proof of Proposition 2. \hfill□
Recall Mθn={1≤j≤r:∣gˉj(θn)∣≥νρ2′(0+)} and Mθn∗={1≤j≤r:∣gˉj(θn)∣≥Cνρ2′(0+)} for some C∈(0,1). Clearly, Mθn⊂Mθn∗. Recall mn=∣Mθn∗∣. Given Mθn, we select δn satisfying δn=o(mn−1/2n−1/γ) and un=o(δn). Let λˉn=argmaxλ∈Λnf(λ;θn) where Λn={λ=(λMθnT,λMθncT)T∈Rr:∣λMθn∣2≤δnandλMθnc=0}. For given Mθn, Condition 3 and Markov inequality imply that max1≤i≤n∣gi,Mθn(θn)∣2=Op(mn1/2n1/γ), which leads to max1≤i≤n∣λˉnTgi(θn)∣=op(1). Write λˉn=(λˉn,1,…,λˉn,r)T. By the definition of λˉn and Taylor expansion, noting P2,ν(t)=νρ2(t;ν) and ρ2′(t;ν)≥ρ2′(0+) for any t>0, we have
[TABLE]
Notice that ∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2=Op(un) and P[λmin{VMθn(θn)}≥C]→1, then ∣λˉn,Mθn∣2=Op(un)=op(δn). Write λˉn,Mθn=(λˉ1,…,λˉ∣Mθn∣)T. We have w.p.a.1 that
[TABLE]
where η=(η1,…,η∣Mθn∣)T with ηj=νρ2′(∣λˉj∣;ν)sgn(λˉj) for λˉj=0 and ηj∈[−νρ2′(0+),νρ2′(0+)] for λˉj=0. (7.5) implies that
η=gˉMθn(θn)+R with ∣R∣∞=Op(ςn1/2un). Since ςn1/2un=o(ν),
then w.p.a.1 sgn(λˉj)=sgn{gˉj(θn)} for any λˉj=0.
We will show that λˉn is a local maximizer for f(λ;θn) w.p.a.1. We first show that λˉn=argmaxλ∈Λn∗(θn)f(λ;θn) w.p.a.1, where Λn∗(θn)={λ=(λMθn∗T,λMθn∗,cT)T∈Rr:∣λMθn∗∣2≤ϵ,λMθn∗,c=0} for some ϵ>0. Notice that f(λ;θn) is concave w.r.t λ. To do this, it suffices to show that w=λˉn,Mθn∗T=:(w1,…,wmn)T∈Rmn satisfies the equation
[TABLE]
w.p.a.1, where η∗=(η1∗,…,ηmn∗)T with ηj∗=νρ2′(∣wj∣;ν)sgn(wj) for wj=0 and ηj∗∈[−νρ2′(0+),νρ2′(0+)] for wj=0.
Based on (7.5), we know
0=n−1∑i=1ngi,j(θn)/{1+wTgi,Mθn∗(θn)}−ηj∗
holds for any j∈Mθn. For each j∈Mθn∗\Mθn, it holds that
n−1∑i=1ngi,j(θn)/{1+wTgi,Mθn∗(θn)}=gˉj(θn)+Op(ςn1/2un)
where Op(ςn1/2un) is uniform for any j∈Mθn∗\Mθn. Since Cνρ2′(0+)≤∣gˉj(θn)∣<νρ2′(0+) for j∈Mθn∗\Mθn, if ςn1/2un=o(ν), then ∣n−1∑i=1ngi,j(θn)/{1+wTgi,Mθn∗(θn)}∣<νρ2′(0+) w.p.a.1 for any j∈Mθn∗\Mθn.
This implies that there exists ηj∗ such that 0=n−1∑i=1ngi,j(θn)/{1+wTgi,Mθn∗(θn)}−ηj∗ holds for any j∈Mθn∗\Mθn.
Secondly, we prove λˉn
is a local maximizer for f(λ;θn) over λ∈Λn(θn) w.p.a.1, where Λn(θn)={λ=(λMθn∗T,λMθn∗,cT)T∈Rr:∣λMθn∗−λˉn,Mθn∗∣2≤o(un),∣λMθn∗,c∣1=o(r−1/γn−1/γ)}. Notice that
max1≤i≤n,λ∈Λn(θn)∣λTgi(θn)∣=op(1). For any λ∈Λn(θn), we write λ=(λMθn∗T,λMθn∗,cT)T and denote by λ=(λMθn∗T,0T)T the projection of λ onto the subspace Λn∗(θn). We only need to show
[TABLE]
By Taylor expansion, it holds that
[TABLE]
for some λ∗ lying on the jointing line between λ and λ. We have that
[TABLE]
where the term Op(mn1/2unςn) is uniformly for any λ∈Λn(θn). On the other hand, we have
[TABLE]
Hence,
[TABLE]
Notice that mn1/2unςn/ν→0, then −(1−C)νρ2′(0+)+Op(mn1/2unςn)≤0 w.p.a.1 which implies (7.6) holds. Hence, λˉn w.p.a.1 is a local maximizer of f(λ;θn). We complete the proof of Proposition 3. \hfill□
where Λn†(θ0)={η∈Rm0:ηTgi,G0(θ0)∈V,i=1,…,n} for some open interval V containing zero. Given G0, since ∣G0∣≤ℓn, following the proof of Proposition 1, we have maxη∈Λn†(θ0)n−1∑i=1nlog{1+ηTgi,G0(θ0)}=Op(ℓnn−1) which implies
maxλ∈Λn(θ0)f(λ;θ0)=Op(ℓnn−1).
Recall an=∑k=1pP1,π(∣θk0∣), bn=max{ℓnn−1,an,ν2} and Sn(θ)=maxλ∈Λn(θ)f(λ;θ)+∑k=1pP1,π(∣θk∣) for any θ=(θ1,…,θp)T. Define Θ∗={θ=(θST,θScT)T:∣θS−θ0,S∣∞≤ε,∣θSc∣1≤ℵn} for some fixed ε>0 and ℵn=min{sωn1/2bn1/(2β)ξn−1/2,o(bn1/2),o(νϱn−1/2ℓn−3/2ξn−1/2)}. Let θn=argminθ∈Θ∗Sn(θ). As we have shown above, P{Sn(θ0)≤an+Op(ℓnn−1)}→1 as n→∞. As Sn(θn)≤Sn(θ0), we have P{Sn(θn)≤an+Op(ℓnn−1)}→1 as n→∞. We will show that θn∈int(Θ∗) w.p.a.1. Same as the proof of Proposition 2 stated in Section 7.2, our proof includes two steps: (i) to show that for any ϵn→∞ satisfying bnϵn2βn2/γ=o(1), there exists a uniform constant K>0 independent of θ such that P{Sn(θ)>Kbnϵn2β}→1 as n→∞ for any
θ=(θST,θScT)T∈Θ∗ satisfying ∣θS−θ0,S∣∞>ϵnbn1/(2β), which leads to ∣θn,S−θ0,S∣∞=Op{bn1/(2β)}. (ii) to show that ∣θn,Sc∣1<ℵn. The proof of (i) is the same as that stated in Section 7.2, thus we omit its proof and only show (ii) here. We need the following lemma whose proof is given in the supplementary material.
Lemma 1**.**
Let F={F⊂{1,…,r}:∣F∣≤ℓn} and Θn={θ=(θST,θScT)T:∣θS−θ0,S∣∞=Op{bn1/(2β)},∣θSc∣1≤ℵn}. Assume that Conditions 4 and 5, then
[TABLE]
provided that logr=o(n1/3), s2ℓnωnbn1/β=o(1) and ℓn2n−1ϱnlogr=o(1).
We begin to prove (ii) now. If ∣θn,Sc∣1=ℵn, we define θn∗=(θn,ST,τθn,ScT)T for some τ∈(0,1) and will show Sn(θn∗)<Sn(θn) w.p.a.1. Notice that θn=argminθ∈Θ∗Sn(θ). This will be a contradiction. Therefore, ∣θn,Sc∣1<ℵn. Write θn=(θn,1,…,θn,p)T. Notice that
[TABLE]
by (7.3), we have maxλ∈Λn(θn)f(λ;θn)=Op(ℓnn−1)+Op{sχnbn1/(2β)}. Pick δn satisfying δn=o(ℓn−1/2n−1/γ) and max{ℓnn−1,sχnbn1/(2β)}=o(δn2), which can be guaranteed by ℓnsχnbn1/(2β)n2/γ=o(1) and ℓn2n2/γ−1=o(1). Select λ∗ such that λMθn∗=δn[gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}]/∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2 and λMθnc∗=0. Write λ∗=(λ1∗,…,λr∗)T. Then
[TABLE]
for some c,cj,cj∗∈(0,1). Recall Mθn={1≤j≤r:∣gˉj(θn)∣≥νρ2′(0+)}, then \mboxsgn(λMθn∗)=\mboxsgn{gˉMθn(θn)}. Thus ∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2=Op(δn). Using the technique developed in Section 7.1, we have ∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2=Op(ℓn1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}.
By Lemma 1 and Condition 4, we know λmin{VMθn(θn)}≥C w.p.a.1. Therefore Proposition 3 leads to ∣λ(θn)∣2=Op(ℓn1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}. Based on this property of the Lagrange multiplier λ(θn), we can follow the same arguments stated in Section 7.2 to construct (ii). Specifically, write λ(θn) and λ(θn∗) as λ=(λ1,…,λr)T and λ∗=(λ1∗,…,λr∗)T, respectively. In the sequel, we use θˇ to denote a generic vector lying on the jointing line between θn and θn∗ that may be different in different uses. Write θn∗=(θn,1∗,…,θn,p∗)T. By Taylor expansion, it holds that
[TABLE]
We will show I+II+III<0 w.p.a.1 as follows.
For I, we will first specify the convergence rate of ∣λ∗−λ∣1. Define
[TABLE]
for any θ=(θ1,…,θp)T and λ=(λ1,…,λr)T. Then θn and its Lagrange multiplier λ satisfy the score equation ∇λHn(θn,λ)=0, i.e.
[TABLE]
where η=(η1,…,ηr)T with ηj=νρ2′(∣λj∣;ν)sgn(λj) for λj=0 and ηj∈[−νρ2′(0+),νρ2′(0+)] for λj=0. Let Rn=supp{λ(θn)}.
Restricted on Rn, for any θ∈Rp and ζ=(ζ1,…,ζ∣Rn∣)T∈R∣Rn∣ with each ζj=0, define
[TABLE]
where w=(w1,…,w∣Rn∣)T with wj=νρ2′(∣ζj∣;ν)sgn(ζj). Then, λRn and θn satisfy m(λRn,θn)=0. By the implicit theorem [Theorem 9.28 of Rudin (1976)], for all θ in a ∣⋅∣2-neighborhood of θn, there is a ζ(θ) such that m{ζ(θ),θ}=0 and ζ(θ) is continuously differentiable in θ. Since θn,S∗=θn,S, we have
[TABLE]
Notice that
[TABLE]
Since max1≤i≤n∣ζ(θˇ)Tgi,Rn(θˇ)∣=op(1), from Lemma 1, we know ∥A(θˇ)∥1≤∣Rn∣1/2∥A(θˇ)∥2=Op(ℓn1/2). Meanwhile, we have ∣B(θˇ)∣∞=Op(ξn1/2) which implies ∥B(θˇ)∥1=Op(ξn1/2ℓn). Therefore, it holds that ∥∇θScζ(θ)∣θ=θˇ∥1≤∥A(θˇ)∥1∥B(θˇ)∥1=Op(ℓn3/2ξn1/2), which implies ∣ζ(θn∗)−λRn∣1=Op(ℓn3/2ξn1/2)∣θn,Sc∣1. Let λ satisfy λRn=ζ(θn∗) and λRnc=0. For any j∈Rnc, we have
[TABLE]
where the term op(ν) holds uniformly for any j∈Rnc. Write λ=(λ1,…,λr)T.
Recall that ζ(θn∗) and θn∗ satisfy m{ζ(θn∗),θn∗}=0, and (7.9) holds, then it holds w.p.a.1 that
[TABLE]
for η∗=(η1∗,…,ηr∗)T with ηj∗=νρ2′(∣λj∣;ν)sgn(λj) for λj=0 and ηj∗∈[−νρ2′(0+),νρ2′(0+)] for λj=0. By the concavity of f(λ;θ)=n−1∑i=1nlog{1+λTgi(θ)}−∑j=1rP2,ν(∣λj∣), we know λ∗=λ w.p.a.1. Hence, ∣λ∗−λ∣1=Op(ℓn3/2ξn1/2)∣θn,Sc∣1. This implies I=Op(ℓn3/2ξn1/2ν)∣θn,Sc∣1.
Let J∗=supp(λ∗). Notice that max1≤i≤n∣λ∗,Tgi(θˇ)∣=op(1), then
[TABLE]
which implies II=max{ℓn1/2n−1/2,s1/2χn1/2bn1/(4β)}∣θn,Sc∣1Op(ℓn1/2ξn1/2). On the other hand, by Taylor expansion, we have
[TABLE]
for some ck∈(0,1). Since max{ℓn3/2ξn1/2ν,ℓnξn1/2n−1/2,ℓn1/2ξn1/2s1/2χn1/2bn1/(4β)}=o(π), (7.7) implies Sn(θn∗)<Sn(θn) w.p.a.1. Hence, we complete the proof of (ii). Together with (i), we know such defined θn is a local minimizer of Sn(θ). Following the same arguments stated in Section 7.2, we can prove P(θn,Sc=0)→1. We complete the proof of Theorem 1. \hfill□
Recall Rn=supp{λ(θn)}. We still write λ=λ(θn)=(λ1,…,λr)T. For Hn(θ,λ) defined in (7.8), we have ∇λHn(θn,λ)=0, i.e.
[TABLE]
where η=(η1,…,ηr)T with ηj=νρ2′(∣λj∣;ν)sgn(λj) for λj=0 and ηj∈[−νρ2′(0+),νρ2′(0+)] for λj=0. By Taylor expansion, we have
[TABLE]
for some ∣c∣<1, which implies
[TABLE]
On the other hand, together with
[TABLE]
it holds that
[TABLE]
where κS={∑k=1p∇θSP1,π(∣θk∣)}∣θS=θn,S. From Condition 6, it holds that
∣κS∣∞=Op(χn).
We will use (7.11) to derive the limiting distribution of θn,S. Before this, we need the following lemmas.
Assume the conditions of Theorem 1 and Condition 7 hold. Then
[TABLE]
holds uniformly for any z∈Rs, where F is defined in Lemma 1.
Lemma 4**.**
Let JF={∇θSgˉF(θn)}TVF−1(θn){∇θSgˉF(θn)} for any F∈F, where F is defined in Lemma 1. Assume the conditions for Lemma 3 and Condition 8 hold. If s2ℓn2bn1/βϱn1/2max{ωn,sϖn}logr=o(1), n−1ℓn2sωnϱn1/2(logr)2=o(1) and n−1ℓn3ϱn3/2(logr)2=o(1), we have
[TABLE]
for any u∈R and α∈Rs, where Φ(⋅) is the cumulative distribution function of the standard normal distribution.
Now we begin the proof of Theorem 2. Recall JRn={∇θSgˉRn(θn)}TVRn−1(θn){∇θSgˉRn(θn)}. For any α∈Rs with unit L2-norm, let δ=JRn−1/2α, then
[TABLE]
where U=VRn−1/2(θn){∇θSgˉRn(θn)}. Thus, by Lemma 1, ∣{∇θSgˉRn(θn)}δ∣2=Op(1). Meanwhile, notice that
∣δ∣2=Op(1). Lemma 2 yields that
[TABLE]
As shown in Section 7.4, ∣gˉMθn(θn)−νρ2′(0+)\mboxsgn{gˉMθn(θn)}∣2=Op(ℓn1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}. From Proposition 3, we have ∣gˉRn(θn)−ηRn∣2=Op(ℓn1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)}.
Following Lemmas 2 and 3, (7.11) leads to
[TABLE]
Expanding gˉRn(θn) around θ=θ0, it holds w.p.a.1 that
[TABLE]
where θ is on the line joining θ0 and θn. Notice that ∣gˉRn(θn)−gˉRn(θ0)∣2≤∣gˉRn(θn)∣2+∣gˉRn(θ0)∣2=Op(ℓn1/2ν)+Op{s1/2χn1/2bn1/(4β)}. By Taylor expansion, ∣gˉRn(θn)−gˉRn(θ0)∣2≥λmin([∇θSgˉRn(θ˙)]T[∇θSgˉRn(θ˙)])∣θn,S−θ0,S∣2 for some θ˙ lying on the line jointing θ0 and θn. Same as Lemma 3, λmin([∇θSgˉRn(θ˙)]T[∇θSgˉRn(θ˙)]) is bounded away from zero w.p.a.1, which implies ∣θn,S−θ0,S∣2=Op(ℓn1/2ν)+Op{s1/2χn1/2bn1/(4β)}. Together with Condition 7, it holds that
∣{∇θSgˉRn(θ)−∇θSgˉRn(θn)}(θn,S−θ0,S)∣2=Op(ℓn3/2sϖn1/2ν2)+Op{ℓn1/2s2ϖn1/2χnbn1/(2β)}.
Therefore, (7.12) leads to
[TABLE]
Lemma 4 leads to n1/2αTJRn−1/2{∇θSgˉRn(θn)}TVRn−1(θn)gˉRn(θ0)→dN(0,1) as n→∞. We complete the proof of Theorem 2. \hfill□
Notice that ∥VF(θ)−VF(θ0)∥2≤∥VF(θ)−VF(θ0)∥2+∥VF(θ0)−VF(θ0)∥2 for any F∈F and θ∈Θn. Following the moderate deviation of self-normalized sums (Jing, Shao and Wang, 2003) and Condition 5, it holds that max1≤j1,j2≤r∣n−1∑i=1ngi,j1(θ0)gi,j2(θ0)−E{gi,j1(θ0)gi,j2(θ0)}∣=Op{(n−1ϱnlogr)1/2}, which implies supF∈F∥VF(θ0)−VF(θ0)∥2=Op{ℓn(n−1ϱnlogr)1/2} provided that logr=o(n1/3). For any z∈R∣F∣ with unit L2-norm, we have
[TABLE]
which implies
[TABLE]
Write θ=(θST,θScT)T with θS∈Rs. By Taylor expansion and Cauchy-Schwarz inequality, we have
[TABLE]
for some θ lying on the jointing line between θ0 and θ. By Condition 5,
[TABLE]
Similarly, we have
[TABLE]
Therefore,
[TABLE]
holds uniformly for θ∈Θn. Meanwhile, by Condition 4, it holds that supF∈Fλmax{VF(θ0)}≤C w.p.a.1. Then supθ∈ΘnsupF∈F∥VF(θ)−VF(θ0)∥2=Op{s(ℓnωnbn1/β)1/2}. Thus we complete the proof of Lemma 1. \hfill□
As shown in Section 7.4, ∣λ∣2=Op(ℓn1/2n−1/2)+Op{s1/2χn1/2bn1/(4β)} and max1≤i≤n∣λRnTgi,Rn(θn)∣=Op(ℓnn−1/2+1/γ)+Op{ℓn1/2s1/2χn1/2bn1/(4β)n1/γ}=op(1). Notice that ∣(1+x)−2−1∣≤5∣x∣ for any ∣x∣<1/2, by Lemma 1, it holds that w.p.a.1
[TABLE]
For the second result, by Taylor expansion and Cauchy-Schwarz inequality, it holds that w.p.a.1
[TABLE]
for some ∣c∣<1. By Lemma 1, it holds that λRnTVRn(θn)λRn≤λmax{VRn(θn)}∣λRn∣22=Op(ℓnn−1)+Op{sχnbn1/(2β)}. Meanwhile, write z=(z1,…,zs)T, by Cauchy-Schwarz inequality and Condition 5,
For any F∈F, let JF=[E{∇θSgi,F(θ0)}]TVF−1(θ0)[E{∇θSgi,F(θ0)}]. Given F, by Lindeberg-Feller Central Limit Theorem, we have
[TABLE]
Let Zi,F=αTJF−1/2[E{∇θSgi,F(θ0)}]TVF−1(θ0)gi,F(θ0). Applying Berry-Esseen inequality, we have
[TABLE]
where C is a uniform positive constant independent of F. By Cauchy-Schwarz inequality,
[TABLE]
which implies
[TABLE]
for a uniform positive constant C independent of F. Therefore, if ℓn=o(n1/3), we have
[TABLE]
Write ΨF=αTJF−1/2[E{∇θSgi,F(θ0)}]TVF−1(θ0)gˉF(θ0) and ΨF=αTJF−1/2{∇θSgˉF(θn)}TVF−1(θn)gˉF(θ0). By Lemmas 2 and 3, noting supF∈F∣gˉF(θ0)∣2=n−1/2ℓn1/2ϱn1/4(logr)1/2, we have
[TABLE]
Hence, for any u∈R, (7.17) leads to the result. \hfill□
References
Jing, Shao and Wang (2003)
Jing, B.-Y., Shao, Q.-M. and Wang, Q. (2003). Self-normalized cramer-type large deviations for independent random variables. The Annals of Probability, 31, 2167–2215.
Bibliography43
The reference list from the paper itself. Each links out to its DOI / PubMed record.
1Bartolucci (2007) Bartolucci, F. (2007). A penalized version of the empirical likelihood ratio for the population mean. Statistics and Probability Letters , 77 , 104–110.
2Candes and Tao (2007) Candes, E. and T. Tao (2007). The Dantzig selector: Statistical estimation when p 𝑝 p is much larger than n 𝑛 n . The Annals of Statistics , 35 , 2313–2351.
3Chang, Chen and Chen (2015) Chang, J., Chen, S. X. and Chen, X. (2015). High dimensional generalized empirical likelihood for moment restrictions with dependent data. Journal of Econometrics , 185 , 283–304.
4Chang, Tang and Wu (2013) Chang, J., Tang, C. Y. and Wu, Y. (2013). Marginal empirical likelihood and sure independence feature screening. The Annals of Statistics , 41 , 2123–2148.
5Chang, Tang and Wu (2016) Chang, J., Tang, C. Y. and Wu, Y. (2016). Local independence feature screening for nonparametric and semiparametric models by marginal empirical likelihood. The Annals of Statistics , 44 , 515–539.
6Chen and Chen, (2008) Chen, J. and Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika , 95 , 759–771.
7Chen and Cui, (2006) Chen, S. X. and Cui, H. (2006). On Bartlett correction of empirical likelihood in the presence of nuisance parameters. Biometrika , 93 , 215–220.
8Chen and Cui, (2007) Chen, S. X. and Cui, H. (2007). On the second properties of empirical likelihood with moment restrictions. Journal of Econometrics , 141 , 492–516.