Nearly Semiparametric Efficient Estimation of Quantile Regression
Kani Chen, Yuanyuan Lin, Zhanfeng Wang, Zhiliang Ying

TL;DR
This paper develops a nearly semiparametric efficient estimator for multiple quantile regression models, improving efficiency by pooling information across quantiles with a feasible, easy-to-implement method that outperforms traditional estimators.
Contribution
It introduces a one-step nearly semiparametric efficient estimator for multiple quantile levels, leveraging the least favorable submodel technique for improved efficiency.
Findings
The proposed estimator achieves the semiparametric efficiency lower bound.
Numerical studies show higher efficiency than the Koenker-Bassett estimator.
The method is computationally feasible and easy to implement.
Abstract
As a competitive alternative to least squares regression, quantile regression is popular in analyzing heterogeneous data. For a quantile regression model specified for one single quantile level $\tau$, the major difficulties of semiparametric efficient estimation are the unavailability of a parametric efficient score and the conditional density estimation. In this paper, with the help of the least favorable submodel technique, we first derive the semiparametric efficient scores for linear quantile regression models that are assumed for a single quantile level, multiple quantile levels and all the quantile levels in $(0,1)$, respectively. Our main discovery is a one-step (nearly) semiparametric efficient estimation for the regression coefficients of the quantile regression models assumed for multiple quantile levels, which has several advantages: it could be regarded as an optimal way to pool…
| Model | n | Method | | | | |
|---|---|---|---|---|---|---|
| M1 | | True | 2 | 1 | 2 | 1.5244 |
| | 1000 | TQE | 2.0007(0.0512) | 0.9974(0.0899) | 2.0031(0.0547) | 1.5195(0.0961) |
| | | SEF | 2.0009(0.0238) | 0.9968(0.0547) | 2.0050(0.0265) | 1.5149(0.0560) |
| | | EFF | 2.0015(0.0227) | 0.9959(0.0533) | 2.0009(0.0247) | 1.5200(0.0529) |
| | 2000 | TQE | 1.9992(0.0365) | 1.0010(0.0652) | 2.0023(0.0370) | 1.5213(0.0653) |
| | | SEF | 2.0002(0.0159) | 0.9993(0.0361) | 2.0034(0.0174) | 1.5190(0.0376) |
| | | EFF | 2.0002(0.0145) | 0.9992(0.0352) | 2.0006(0.0150) | 1.5224(0.0365) |
| M2 | | True | 2 | 2 | 2.5244 | 2.5244 |
| | 1000 | TQE | 1.9976(0.1192) | 1.9987(0.1155) | 2.5240(0.1244) | 2.5206(0.1229) |
| | | SEF | 1.9989(0.0896) | 1.9981(0.0875) | 2.5228(0.0891) | 2.5209(0.0903) |
| | | EFF | 1.9985(0.0881) | 1.9982(0.0870) | 2.5239(0.0883) | 2.5205(0.0881) |
| | 2000 | TQE | 1.9980(0.0834) | 2.0022(0.0844) | 2.5230(0.0877) | 2.5225(0.0833) |
| | | SEF | 1.9990(0.0617) | 2.0003(0.0614) | 2.5232(0.0631) | 2.5236(0.0605) |
| | | EFF | 1.9988(0.0608) | 2.0002(0.0608) | 2.5240(0.0624) | 2.5228(0.0602) |
| M3 | | True | 2 | 1 | 2 | 1.8473 |
| | 1000 | TQE | 2.0011(0.0822) | 0.9958(0.1437) | 2.0055(0.0907) | 1.8397(0.1592) |
| | | SEF | 2.0014(0.0381) | 0.9949(0.0874) | 2.0094(0.0445) | 1.8305(0.0929) |
| | | EFF | 2.0021(0.0365) | 0.9938(0.0852) | 2.0019(0.0420) | 1.8400(0.0875) |
| | 2000 | TQE | 1.9987(0.0585) | 1.0017(0.1042) | 2.0040(0.0615) | 1.8424(0.1082) |
| | | SEF | 2.0003(0.0256) | 0.9990(0.0575) | 2.0061(0.0290) | 1.8378(0.0622) |
| | | EFF | 2.0002(0.0230) | 0.9990(0.0561) | 2.0012(0.0250) | 1.8436(0.0607) |
| M4 | | True | 2 | 1 | 2 | 1.7265 |
| | 1000 | TQE | 2.0009(0.0669) | 0.9966(0.1144) | 2.0083(0.0930) | 1.7221(0.1621) |
| | | SEF | 2.0014(0.0316) | 0.9955(0.0699) | 2.0166(0.0491) | 1.7015(0.0952) |
| | | EFF | 2.0023(0.0287) | 0.9945(0.0677) | 2.0041(0.0480) | 1.7172(0.0925) |
| | 2000 | TQE | 1.9990(0.0469) | 1.0013(0.0824) | 2.0057(0.0629) | 1.7228(0.1097) |
| | | SEF | 2.0002(0.0207) | 0.9993(0.0461) | 2.0103(0.0327) | 1.7118(0.0646) |
| | | EFF | 2.0005(0.0188) | 0.9988(0.0449) | 2.0016(0.0289) | 1.7227(0.0628) |
| M5 | | True | 1 | 2 | 1.8473 | 2.7265 |
| | 1000 | TQE | 0.9964(0.1797) | 1.9982(0.1555) | 1.8467(0.2073) | 2.7277(0.2072) |
| | | SEF | 0.9979(0.1344) | 1.9972(0.1179) | 1.8440(0.1488) | 2.7214(0.1510) |
| | | EFF | 0.9971(0.1315) | 1.9984(0.1173) | 1.8449(0.1465) | 2.7250(0.1474) |
| | 2000 | TQE | 0.9973(0.1258) | 2.003(0.1139) | 1.8449(0.1459) | 2.7268(0.1396) |
| | | SEF | 0.9987(0.0921) | 2.0006(0.0831) | 1.8448(0.1052) | 2.7260(0.1011) |
| | | EFF | 0.9982(0.0911) | 2.0004(0.0817) | 1.8462(0.1039) | 2.7264(0.1004) |

∗ Standard deviations are in parentheses.
ABSTRACT: As a competitive alternative to least squares regression, quantile regression is popular in analyzing heterogeneous data. For a quantile regression model specified for one single quantile level $\tau$, the major difficulties of semiparametric efficient estimation are the unavailability of a parametric efficient score and the conditional density estimation. In this paper, with the help of the least favorable submodel technique, we first derive the semiparametric efficient scores for linear quantile regression models that are assumed for a single quantile level, multiple quantile levels and all the quantile levels in $(0,1)$, respectively. Our main discovery is a one-step (nearly) semiparametric efficient estimation for the regression coefficients of the quantile regression models assumed for multiple quantile levels, which has several advantages: it could be regarded as an optimal way to pool information across multiple/other quantiles for efficiency gain; it is computationally feasible and easy to implement, as the initial estimator is easily available; and, due to the nature of the quantile regression models under investigation, the conditional density estimation is straightforward by plugging in an initial estimator. The resulting estimator is proved to achieve the corresponding semiparametric efficiency lower bound under regularity conditions. Numerical studies, including simulations and an example of birth weight of children, confirm that the proposed estimator leads to higher efficiency compared with the Koenker-Bassett quantile regression estimator for all quantiles of interest.
KEY WORDS: Quantile regression; Semiparametric efficient score; Least favorable submodel; One-step estimation.
1. INTRODUCTION
Quantile regression is a statistical methodology for the modeling and inference of conditional quantile functions. Following Koenker and Bassett (1978), we model the $\tau$th conditional quantile function of $Y$ given $X$ as

$$Q_\tau(Y \mid X) = X^\top \beta(\tau) \qquad (1)$$

for a certain specific $\tau \in (0,1)$, where $X$ is a $p$-vector usually including an intercept. Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be independent and identically distributed copies of $(X, Y)$. For the $\tau$th quantile, the classical Koenker-Bassett estimate of $\beta(\tau)$, denoted as $\hat\beta(\tau)$, is obtained by minimizing the following objective function

$$\sum_{i=1}^{n} \rho_\tau\bigl(Y_i - X_i^\top b\bigr) \qquad (2)$$

over $b \in \mathbb{R}^p$, where $\rho_\tau(u) = u\{\tau - I(u < 0)\}$. The computation of $\hat\beta(\tau)$ is straightforward with the help of linear programming. There is a vast literature on the estimation and inference for one or several percentile levels for model (1); see Yu and Jones (1998), He (1997), Koenker and Geling (2001), Koenker and Xiao (2002), He and Zhu (2003), Koenker (2005), Peng and Huang (2008), Peng and Fine (2009), Bondell, Reich and Wang (2010), Wang, Wu and Li (2012), Jiang, Wang and Bondell (2013), He, Wang and Hong (2013), Kato (2011, 2012), Zheng, Peng and He (2015), among many others. When there is commonality of quantile coefficients across multiple quantiles, composite quantile regression (CQR) has been proposed to combine information shared across a number of quantiles to improve estimation efficiency; see Zou and Yuan (2008), Wang and Wang (2009), Kai et al. (2001), Wang, Li and He (2012), Wang and Li (2013). However, CQR hinges on the key assumption that there exist common covariate effects across multiple quantile levels. Recently, important findings in Bayesian inference for quantile regression were reported in Yang and He (2011), Kim and Yang (2011) and Feng, Chen and He (2015).
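The linear-programming computation of the Koenker-Bassett estimate can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `check_loss` and `koenker_bassett`, the split of each residual into a positive and a negative part, and the choice of SciPy's `linprog` with the HiGHS solver are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog


def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))


def koenker_bassett(X, y, tau):
    """Minimize sum_i rho_tau(y_i - x_i' b) over b by linear programming.

    Decision variables are (b, u_plus, u_minus) with
    y - X b = u_plus - u_minus and u_plus, u_minus >= 0, so the
    objective tau * sum(u_plus) + (1 - tau) * sum(u_minus) equals
    the check-loss objective at the optimum.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    cost = np.concatenate([np.zeros(p), tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)  # b is unrestricted
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]
```

With an intercept-only design the fit reduces to a sample quantile, which gives a quick sanity check. The dense identity blocks keep the sketch short but use memory quadratic in the sample size, so a sparse constraint matrix would be preferable for large samples.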
Typically, model (1) can be expressed as the following linear regression model

$$Y = X^\top \beta(\tau) + \epsilon \qquad (3)$$

where the $\tau$th percentile of $\epsilon$ is assumed to be 0. For a specific $\tau$, under the independence assumption of $\epsilon$ and $X$, it can be shown that $\hat\beta(\tau)$ is semiparametric efficient by a straightforward argument to be discussed in section 2. As a special case, when $\tau = 0.5$, the least absolute deviation (LAD) estimate is semiparametric efficient for model (3) with the independence assumption of $\epsilon$ and $X$ (Zhou and Portnoy, 1998). However, we point out that, without assuming independence of $\epsilon$ and $X$, $\hat\beta(\tau)$ is not semiparametric efficient, and the semiparametric efficient estimation of model (1) or model (3) is indeed a sophisticated issue. The most difficult part is the estimation of the density of $\epsilon$ given $X$ in the semiparametric score function (Kato, 2014), which suffers from the curse of dimensionality.
When model (1) is specified for each $\tau \in (0,1)$, following Portnoy (2003), we consider the quantile regression model

$$Q_\tau(Y \mid X) = X^\top \beta(\tau), \quad \tau \in (0,1), \qquad (4)$$

where $Y$ and $X$ are the same as in model (1), and the regression parameter $\beta(\tau)$ is a function of $\tau$. With the linearity assumption for all quantiles, the true unknown function $\beta(\cdot)$ suffices to describe the entire conditional distribution of $Y$ given $X$. Important results on the estimation of the quantile process with survival data can be found in Portnoy (2003) and Peng and Huang (2008). Recently, there have been some breakthroughs on Bayesian nonparametric regression models for all quantiles; see Müller & Quintana (2004), Dunson & Taylor (2005), Chung & Dunson (2009), Reich et al. (2011), Qu & Yoon (2015), etc. To summarize, there are two main approaches for the estimation of the quantile process: linear interpolation and basis representation. The linear interpolation approach consists of two steps: the first step is to estimate the quantile regression coefficients separately at a certain proper grid of $\tau$-values, and the second step is to interpolate linearly between grid values or apply rearrangement. For the basis representation method, the quantile function is represented by basis functions or some specific functions after transformation. Nevertheless, both methods reviewed above are in the Bayesian framework and their theoretical properties remain unclear.
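The second step of the linear interpolation approach can be sketched as follows, assuming the first step has already produced coefficient estimates on a grid of quantile levels. The function name `interpolate_quantile_process` and the array layout are hypothetical; the rearrangement alternative mentioned in the text is not shown.

```python
import numpy as np


def interpolate_quantile_process(tau_grid, beta_grid, tau):
    """Linearly interpolate grid estimates of the coefficient function.

    tau_grid : increasing quantile levels, shape (m,)
    beta_grid: coefficient estimates at those levels, shape (m, p)
    tau      : level(s) at which the coefficient function is wanted
    Returns an array of shape (len(tau), p). Levels outside the grid
    are clamped to the nearest grid value by np.interp.
    """
    tau_grid = np.asarray(tau_grid, dtype=float)
    beta_grid = np.asarray(beta_grid, dtype=float)
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    columns = [
        np.interp(tau, tau_grid, beta_grid[:, j])
        for j in range(beta_grid.shape[1])
    ]
    return np.column_stack(columns)
```

Interpolating each coefficient separately keeps the sketch simple; it does not by itself guarantee that the implied conditional quantiles are monotone in the quantile level.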
To the best of our knowledge, there is no specific construction of a semiparametric efficient estimate of $\beta(\cdot)$ in model (4) in the literature. We point out that for model (4), the likelihood function is $\prod_{i=1}^{n} \{X_i^\top \dot\beta(\tau_i)\}^{-1}$, where $\tau_i$ satisfies $Y_i = X_i^\top \beta(\tau_i)$ and $\dot\beta(\cdot)$ is the derivative of $\beta(\cdot)$. However, the maximum likelihood method as in Zeng and Lin (2006, 2007) involves enormous technical and numerical difficulty. In our view, one of the main reasons lies in the nature of model (4): the quantile process and the nuisance parameter are not separable. The numerical maximization of the estimated likelihood subject to constraints is rather unstable. The numerical difficulties here are in the same spirit as those in numerically searching for the maximum likelihood estimate (MLE) of $\theta$ for the Uniform$(0, \theta)$ distribution, where the solutions would often go to the boundary. Moreover, due to data sparsity, the estimated $\beta(\tau)$ or $\dot\beta(\tau)$ would be unstable when $\tau$ is close to 0 or 1.
In view of the technical and numerical complications involved in the semiparametric efficient estimation of $\beta(\cdot)$ in model (4), we take one step back and consider the following quantile regression model

$$Q_{\tau_k}(Y \mid X) = X^\top \beta(\tau_k), \quad k = 1, \ldots, K, \qquad (5)$$

where $0 < \tau_1 < \cdots < \tau_K < 1$. Model (5) is intermediate between model (1) and model (4). With the explicit expression of the semiparametric efficient score function of $\beta(\tau_k)$, $k = 1, \ldots, K$, derived by the least favorable submodel technique in section 2, we propose a one-step estimation with the estimated score function that leads to the semiparametric efficient estimation of $\beta(\tau_k)$. The proposed procedure is numerically feasible and stable. Most importantly, one can show that when the maximum spacing of $\tau_1, \ldots, \tau_K$ tends to 0, the semiparametric efficient score of model (5) approaches that of model (4). As the impetus for this work was to pursue semiparametric efficient estimation of $\beta(\tau)$ in model (4), theoretically, one can use the efficient estimator of $\beta(\tau_k)$ under model (5) to approximate that of model (4). Hence, we refer to the proposed procedure as nearly semiparametric efficient estimation for quantile regression.
The rest of the paper is organized as follows. Section 2 introduces the model and the proposed estimation with detailed discussions. Extensive simulation studies with supportive evidence are presented in section 3. In section 4, the proposed method is illustrated using real data on the birth weight of children from the National Center for Health Statistics. All technical derivations and proofs are in the Appendix.
2. METHODOLOGIES AND MAIN RESULTS
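First, consider model (5). By the definition of a quantile,

$$F\bigl(X^\top \beta(\tau_k) \mid X\bigr) = \tau_k, \quad k = 1, \ldots, K, \qquad (6)$$

where $F(\cdot \mid X)$ is the cumulative distribution function of $Y$ given $X$. Let $f(\cdot \mid X)$ be the density function of $Y$ conditional on $X$. Let $\beta_0(\tau_k)$ be the true value of $\beta(\tau_k)$. By the nature of the quantile regression model, $X^\top \beta_0(\tau_k)$ is the $\tau_k$-quantile of $Y$ given $X$. Without loss of generality, we assume that .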
2.1. Semiparametric efficient scores.
In quantile regression, estimation of the quantile regression coefficient or the quantile process is inseparably linked to the nuisance parameter, the conditional density function. In such a case, the least favorable submodel method (Kato, 2014) can be used to derive the semiparametric efficient score function of $\beta(\tau_k)$, $k = 1, \ldots, K$, in model (5) and the corresponding variance lower bound. The least favorable submodel technique reduces a high dimensional problem to a problem involving a finite-dimensional "least favorable submodel"; see Begun et al. (1983), Bickel et al. (1993), among others. Following section 25.4 in van der Vaart (1998), we begin with the construction of a parametric submodel of model (5) based on the cumulative distribution function with parameter in a neighborhood of 0,
[TABLE]
where is a function of satisfying certain conditions. Differentiating (7) we get
[TABLE]
where , and are derivatives of , and respectively. To guarantee is a density function for all , the first restriction of is that
[TABLE]
Moreover, under model (5), let be the quantile of and , for . Hence, we have the identity . By a Taylor expansion of the right hand side of this identity as a function of in a neighborhood of 0, we obtain the second restriction that
[TABLE]
for , where is the derivative of at . Clearly, the derivative of log-likelihood of based on the density function at is , denoted as . By the information theory in Bickel et al.(1993), we are able to approximate the least favorable submodel by searching for the lower bound of , which as a result would lead to the semiparametric efficient score. We defer the details to Appendix I. The resulting semiparametric efficient score of can be regarded as an optimal way to combine information from all the quantile levels .
Let and be a diagonal matrix with diagonal elements being the reciprocal of diagonal of matrix , where and are defined in (A.18) and (A.24) in Appendix I. Set , where is a vector with length . The following proposition presents the semiparametric efficient score of , and their variance lower bound.
Proposition 1. For model (5), the semiparametric efficient score of , , is
[TABLE]
Moreover, for the estimate of the th component of , its variance has a lower bound
[TABLE]
*where is matrix, , , ; ; , ; and . *
Remark 1. When , model (5) reduces to model (1) for one single quantile point . By Proposition 1, the semiparametric efficient score of in model (1) is
[TABLE]
By the definitions of and , is a constant matrix not depending on random variable . For the corresponding linear model (3), under the assumption that the $\tau$-quantile of $\epsilon$ is 0 and the error term $\epsilon$ is independent of the covariate $X$, is also not relevant to $X$. In this case, the efficient score in (10) is exactly the efficient score in the classical quantile regression model specified at a single quantile level, such as the least absolute deviation (LAD) estimate for median regression; see Zhou and Portnoy (1998) and Kato (2014). However, without the crucial independence assumption of $\epsilon$ and $X$, as conventional quantile regression models allow heterogeneity, the distribution of $\epsilon$ depends on $X$, implying also depends on $X$. As a result, the Koenker-Bassett estimate is not semiparametric efficient.
Remark 2. When and the maximum space of tends to 0, model (5) approaches model (4). Next, we intend to show that the semiparametric efficient score (9) of approaches that of model (4) as . In fact, for the -th component of , a similar calculation as that of (9) reveals that semiparametric efficient score of in model (4) is
[TABLE]
where is a minimizer of
[TABLE]
subject to . We defer the detailed derivations of this finding in Appendix II. We point out that, it is infeasible to pursue the semiparametric efficient estimation of in model (4) based on (11), as the numerical minimization of (12) is intractable. Fortunately, the semiparametric efficient score of in (9) can be rewritten as,
[TABLE]
where is a minimizer of the quadratic form subject to . It is straightforward to check that
[TABLE]
as . This finding motivates us to use the efficient score in (9) to approximate the efficient score in (11), which leads to a nearly semiparametric efficient estimator of in model (4).
Remark 3. The key idea of this work is to borrow information across quantiles and search for the most efficient estimation. This remark provides more insights in this idea. Intuitively, for certain quantile level , the estimation of in traditional quantile regression does not depend on the information on at other quantiles , especially those quantiles far away from . The intuition is true when the number of covariates (including an intercept term) is 1, that is . For this special case, one can rewrite (13) as
[TABLE]
from which one can see that is not relevant to the model information at other quantiles . Appendix III contains the proofs of . In other words, for model (5) with , the semiparametric efficiency for the estimation of can be achieved using only the information at . However, besides an intercept, there is generally at least one covariate in the model, namely . Hence, the efficient estimator of generally depends on the information at other quantiles. In view of this fact, borrowing information across other quantiles via the efficient score is able to improve the estimation efficiency of when . In addition, Proposition 1 tells that the variance of estimates of have a lower bound .
For illustration, we consider a toy example for model (5) with . To estimate , if we use only the model information at single quantile and ignore the information at , then
[TABLE]
On the other hand, by incorporating the model information at for the estimation of , we have shown in Appendix IV that
[TABLE]
Most importantly, we have shown which leads to . In summary, our theoretical analysis validates that combining information across quantiles can generally reduce the variance of the estimate of .
2.2. The nearly semiparametric efficient estimation.
In this subsection, we introduce the proposed nearly semiparametric efficient estimation procedure for the regression coefficients of model (4). As discussed earlier, we make use of the score (9) in the construction of the proposed estimator. Since (9) involves the density function of $Y$ given $X$, we need to find an appropriate estimate of , . Recall that
[TABLE]
Hence, instead of estimating the conditional density function directly, we estimate . A natural estimate of is , where is the Koenker-Bassett estimate of by minimizing (2) and is the bandwidth. Thus, the density function can be estimated by for . Next, we define the proposed one-step estimator of , denoted by , as
[TABLE]
where is the -th component of the estimated score by plugging and , , into (9), is the estimated variance lower bound by plugging and , , into in Proposition 1. Under regularity conditions given in Appendix V, the resulting estimate of can be proved to achieve the semiparametric efficiency lower bound. The following theorem presents the main results.
Theorem 1. Assume model (4) and conditions in Appendix V hold. Then, for and ,
[TABLE]
*in distribution as , where is the -th component of . Moreover, the asymptotic variance of achieves the semiparametric efficiency bound . *
The implementation of the one-step estimation is as follows: for each , ,
Step 1. For each , compute the initial estimator ;
Step 2. For each , calculate and the conditional density function is ;
Step 3. Compute and by plugging the initial estimator in step 1 and the estimated density in step 2 into and ;
Step 4. Obtain according to (18).
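Steps 1 and 2 can be sketched as follows. This is a minimal sketch under assumptions that are not taken from the text: the derivative of the coefficient function is approximated by a symmetric difference quotient of Koenker-Bassett fits at levels `tau + h` and `tau - h`, and the conditional density at the fitted quantile is taken as the reciprocal of the covariate vector times that derivative (the identity obtained by differentiating the quantile definition in the quantile level). The names `kb_fit` and `density_at_quantile` are hypothetical. Steps 3 and 4 are omitted because they require the explicit score (9) and the bound in Proposition 1.

```python
import numpy as np
from scipy.optimize import linprog


def kb_fit(X, y, tau):
    """Step 1: Koenker-Bassett fit at level tau via linear programming."""
    n, p = X.shape
    cost = np.concatenate([np.zeros(p), tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p]


def density_at_quantile(X, y, tau, h):
    """Step 2: estimate the conditional density at the tau-th quantile.

    Returns one value per row of X; rows where the estimated quantile
    slope is not positive (crossing fits) are reported as NaN.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if not (0.0 < tau - h and tau + h < 1.0):
        raise ValueError("tau - h and tau + h must lie in (0, 1)")
    # Assumed estimator: symmetric difference quotient with bandwidth h.
    beta_dot = (kb_fit(X, y, tau + h) - kb_fit(X, y, tau - h)) / (2.0 * h)
    slope = X @ beta_dot
    density = np.full(slope.shape, np.nan)
    positive = slope > 0
    density[positive] = 1.0 / slope[positive]
    return density
```

The bandwidth `h` trades bias against variance, and its choice is left open here; the regularity conditions in Appendix V govern the admissible rates.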
Remark 4. Actually, in the above one-step estimation, we only need to estimate the conditional density function at quantile levels . In this regard, we only need to assume the linear quantile regression model is specified in a neighborhood of each , , and do not need to assume a linear quantile regression model for all .
3. SIMULATION STUDIES
Simulations are conducted to evaluate the performance of our proposed method. In the simulation, for a quantile level of interest, we consider three methods for the estimation of : the Koenker-Bassett quantile estimate , denoted by TQE; the proposed one-step estimate based on the semiparametric efficient score of , referred to as EFF; and the one-step estimate based on the score function (10) ignoring the model information at other quantiles, referred to as SEF. The simulated data are generated from the following quantile regression model with two covariates,
[TABLE]
where and takes each of the following 5 forms:
and ;
and ;
and ;
and ;
and .
The covariate is constant for , and , and it follows log-normal distribution for and . Another covariate follows log-normal distribution for all cases. In particular, model (20) with cases and are equivalent to
[TABLE]
and
[TABLE]
respectively, where follows the standard normal distribution. The sample size and 2000. All simulations are repeated 1000 times.
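As a worked check on the design, the non-integer entries in the True rows of the simulation table reproduced near the top of this document can be recovered from error quantiles. The sketch below assumes, as an inference from those entries rather than a statement in the text, that the second quantile level is 0.7 and that the affected coefficients equal a base value plus the 0.7-quantile of a standard normal, standard logistic, or standard Cauchy error.

```python
import math
from statistics import NormalDist

TAU = 0.7  # assumed quantile level, inferred from the table entries


def normal_quantile(tau):
    """Quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(tau)


def logistic_quantile(tau):
    """Quantile of the standard logistic distribution."""
    return math.log(tau / (1.0 - tau))


def cauchy_quantile(tau):
    """Quantile of the standard Cauchy distribution."""
    return math.tan(math.pi * (tau - 0.5))


# Base value 1 plus the error quantile, compared with the tabulated entries.
candidates = {
    "normal (compare 1.5244)": 1.0 + normal_quantile(TAU),
    "logistic (compare 1.8473)": 1.0 + logistic_quantile(TAU),
    "Cauchy (compare 1.7265)": 1.0 + cauchy_quantile(TAU),
}

if __name__ == "__main__":
    for label, value in candidates.items():
        print(f"{label}: {value:.4f}")
```

The agreement to four decimals is consistent with this reading of the design, but it does not establish which error distribution belongs to which of the five cases.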
We first consider the two quantiles and . The simulation results are summarized in Table 1. One can see that the parameter estimates are generally unbiased. In all configurations, EFF has the smallest standard deviation (SD) compared with TQE and SEF, and SEF has a much smaller SD than TQE. For example, for case M3 and , the ratio of the standard deviations of TQE and EFF ranges from to , and the ratio of the standard deviations of SEF and EFF ranges from to . In other words, EFF improves the efficiency of TQE by at least and it improves the efficiency of SEF by around to 12%, which confirms our theoretical findings.
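The standard-deviation ratios discussed above can be recomputed directly from the M3 rows of the simulation table reproduced near the top of this document. The sketch below is plain arithmetic on those tabulated values; the dictionary layout and the function name `sd_ratios` are illustrative, and since the sample size behind the quoted ranges is not recoverable from the text, both sample sizes are shown.

```python
# Standard deviations for case M3, copied from the simulation table
# (four coefficient columns, in table order).
M3_SD = {
    1000: {
        "TQE": [0.0822, 0.1437, 0.0907, 0.1592],
        "SEF": [0.0381, 0.0874, 0.0445, 0.0929],
        "EFF": [0.0365, 0.0852, 0.0420, 0.0875],
    },
    2000: {
        "TQE": [0.0585, 0.1042, 0.0615, 0.1082],
        "SEF": [0.0256, 0.0575, 0.0290, 0.0622],
        "EFF": [0.0230, 0.0561, 0.0250, 0.0607],
    },
}


def sd_ratios(numerator, denominator):
    """Element-wise ratio of two lists of standard deviations."""
    return [a / b for a, b in zip(numerator, denominator)]


if __name__ == "__main__":
    for n, sds in M3_SD.items():
        for method in ("TQE", "SEF"):
            ratios = sd_ratios(sds[method], sds["EFF"])
            print(f"n={n} {method}/EFF: min={min(ratios):.3f} max={max(ratios):.3f}")
```

A ratio above 1 means EFF has the smaller standard deviation for that coefficient; squaring the ratio gives the corresponding variance ratio.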
In addition, we compare the numerical performance of the three methods with quantiles and , a higher quantile. Table 2 reports the estimation results for the 5 cases, from which conclusions similar to those for and can be drawn. Specifically, EFF has the smallest standard errors and SEF is more efficient than TQE. This confirms the theory that, if a higher quantile is of particular interest, it is beneficial to combine the model information across other quantile levels, for example, some moderate quantile , for more efficient and stable estimation.
4. APPLICATION
We apply the proposed method to analyze a birth dataset (birth) released annually by the National Center for Health Statistics. The data include information on nearly all live births in the United States. The education of the mother of each birth is recorded in 5 classes based on years of education. For illustration, we only consider the births that occurred in June 1997 to mothers who smoked cigarettes and were in education class 2 (7 to 11 years of education). There are 9832 births, consisting of 4861 females and 4971 males. In this paper, our interest is to study the relationship between the birth weight of the child (in grams) and the covariates: the age of the mother (Mage), the age of the father (Fage) and the total number of prenatal care visits (Nprevist). All variables are log-transformed before analysis. We apply model (5) with for analyzing the dataset. Tables 3-4 present the estimation results of the regression coefficients by TQE, SEF and EFF, which are defined as in section 3. In Tables 3-4, Est represents the parameter estimate, Esd is the variance estimate of Est by the bootstrap resampling method and the $p$-value is computed by where $\Phi$ is the cumulative distribution function of the standard normal distribution.
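The bootstrap and $p$-value computations described above can be sketched as follows. The exact $p$-value formula is not recoverable from the text, so the sketch assumes the usual two-sided Wald form based on the standard normal distribution function and treats Esd as a standard error; the function names `bootstrap_sd` and `wald_p_value`, the number of resamples and the seed are all hypothetical.

```python
import numpy as np
from statistics import NormalDist


def bootstrap_sd(estimator, X, y, n_boot=200, seed=0):
    """Nonparametric bootstrap standard deviation of an estimator.

    estimator: callable (X, y) -> 1-D array of coefficient estimates.
    Rows of (X, y) are resampled with replacement n_boot times.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws.append(np.asarray(estimator(X[idx], y[idx]), dtype=float))
    return np.std(np.vstack(draws), axis=0, ddof=1)


def wald_p_value(est, esd):
    """Assumed two-sided p-value 2 * (1 - Phi(|est| / esd))."""
    if esd <= 0:
        raise ValueError("esd must be positive")
    z = abs(est) / esd
    return 2.0 * (1.0 - NormalDist().cdf(z))
```

Any of the three estimators compared in the text could be passed as `estimator`; a coefficient would then be flagged at the 0.05 level when `wald_p_value` falls below 0.05.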
It can be seen that at the nominal significance level 0.05, all three methods detect Nprevist for all quantiles and detect the ages of parents at and 0.5. At , the three methods identify the father's age in the female children data. However, one notable finding in the analysis is that at , Fage and Mage in the male children data do not have significantly nonzero coefficients, whereas for the female data, Mage is detected with a significantly nonzero coefficient only by EFF, while TQE and SEF do not detect it. Overall, Tables 3-4 report that Nprevist and the ages of parents have positive and negative coefficients, respectively, which suggests that the birth weights of children are heavier when their mothers are younger and have more prenatal care visits. In addition, the effects of the three covariates on the birth weights of children are more significant at the lower quantile () than at the higher quantile ().
REFERENCES
- Begun, J. M., Hall, W. J., Huang, W. M. and Wellner, J. A. (1983). Information and asymptotic efficiency in parametric-nonparametric models. Ann. Statist. 11, 432-452.
- Bickel, P. J., Klaassen, C. A., Ritov, Y. and Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press.
- Bondell, H. D., Reich, B. J. and Wang, H. (2010). Noncrossing quantile regression curve estimation. Biometrika 97, 825-838.
- Chung, Y. and Dunson, D. B. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. J. Amer. Statist. Assoc. 104, 1646-1660.
