Rate-optimal nonparametric estimation for random coefficient regression models
Hajo Holzmann, Alexander Meister

TL;DR
This paper establishes the optimal convergence rate for nonparametric density estimation in linear random coefficient models, highlighting the influence of tail behavior and proposing an estimator that does not require density division.
Contribution
It derives the first optimal pointwise convergence rate for density estimation in these models, accounting for tail behavior, and introduces an adaptive estimator without dividing by a density estimate.
Findings
Achieves the optimal convergence rate for density estimation.
Shows the tail behavior of the design density affects the rate.
Proposes an estimator that does not require dividing by a nonparametric density estimate.
Abstract
Random coefficient regression models are a popular tool for analyzing unobserved heterogeneity, and have seen renewed interest in the recent econometric literature. In this paper we obtain the optimal pointwise convergence rate for estimating the density in the linear random coefficient model over Hölder smoothness classes, and in particular show how the tail behavior of the design density impacts this rate. In contrast to previous suggestions, the estimator that we propose and that achieves the optimal convergence rate does not require dividing by a nonparametric density estimate. The optimal choice of the tuning parameters in the estimator depends on the tail parameter of the design density and on the smoothness level of the Hölder class, and we also study adaptive estimation with respect to both parameters.
Hajo Holzmann
Fachbereich Mathematik und Informatik,
Philipps-Universität Marburg,
35037 Marburg, Germany.
Alexander Meister
Institut für Mathematik,
Universität Rostock,
18051 Rostock, Germany.
MSC subject classifications: 62G07, 62G20, 62G30.
Keywords: adaptive estimation, ill-posed inverse problem, minimax risk, nonparametric estimation.
arXiv:1902.05261
1 Introduction
In this paper we consider the linear random coefficient regression model, in which i.i.d. (independent and identically distributed) data $(X_j, Y_j)$, $j=1,\ldots,n$, are observed according to
$$Y_j = A_{0,j} + A_{1,j} X_j. \tag{1.1}$$
Therein $A_j = (A_{0,j}, A_{1,j})$ are unobserved i.i.d. random variables with the bivariate Lebesgue density $f_A$, while $A_j$ and $X_j$ are independent. Note that (1.1) represents a randomized extension of the standard linear regression model. We shall derive the optimal convergence rates for estimating $f_A$ over Hölder smoothness classes in the case where the $X_j$ have a Lebesgue density $f_X$ with polynomial tail behaviour, as specified in Assumption 1 below.
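To make the sampling scheme concrete, the following minimal Python sketch simulates data from a linear random coefficient model, that is, a linear regression whose intercept and slope are drawn afresh for every observation, independently of the regressor. The bivariate normal law for the coefficients, the Student-t design (chosen only because its density has polynomial tails) and all names such as `simulate_rc_model` are illustrative assumptions, not choices made in the paper.

```python
import numpy as np


def simulate_rc_model(n, df=3.0, seed=0):
    """Draw n observations from y = a0 + a1 * x with random (a0, a1).

    Returns the regressors x, the responses y and the (normally unobserved)
    coefficient pairs a of shape (n, 2).
    """
    rng = np.random.default_rng(seed)
    # Hypothetical coefficient law: correlated bivariate normal.
    mean = np.array([1.0, 2.0])
    cov = np.array([[1.0, 0.3], [0.3, 0.5]])
    a = rng.multivariate_normal(mean, cov, size=n)
    # Hypothetical design: Student-t regressors, independent of the coefficients.
    x = rng.standard_t(df, size=n)
    y = a[:, 0] + a[:, 1] * x
    return x, y, a
```

Only the pairs `(x, y)` would be available to the statistician; the coefficients are returned solely so that a simulation study can compare an estimate with the truth.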
From a parametric point of view with focus on means and variances of the random coefficients, a multivariate version of model (1.1) is studied by [11]. They assume the coefficients to be mutually independent. The nonparametric analysis of model (1.1) has been initiated by [3] and [4]. [2] use Fourier methods to construct an estimator of . They do not derive the optimal convergence rate, though. Furthermore, their estimator is rather involved as it requires a nonparametric estimator of a conditional characteristic function, which is then plugged into a regularized Fourier inversion.
Extensions of model (1.1) have seen renewed interest in the econometrics literature in recent years. [13] suggest a nonparametric estimator in a multivariate version of model (1.1). They only obtain its convergence rate for very heavy tailed regressors. Moreover, their estimator requires dividing by a nonparametric density estimator for a transformed version of the regressors. This involves an additional smoothing step, and potentially renders the estimator unstable. [5] propose a specification test for model (1.1) against a general nonseparable model as the alternative, while [6] suggest multiscale tests for qualitative hypotheses on . Extensions and modifications of model (1.1) are studied in [9], [17], [1], [8], [10], [18], [19] and [12]. Methods of analytic continuation of the coefficients density outside the support of the covariates are considered under more restrictive conditions in [12] and in the recent work of [7].
In this paper, we consider the basic model (1.1) under the following condition.
Assumption 1 (Design density).
For some constants and , the density satisfies
[TABLE]
We analyze precisely how the tail parameter of influences the optimal rate of convergence of at a given point in a minimax sense in case . Note that the heavy tailed setting which is studied in [13] corresponds to in Assumption 1. To the best of our knowledge, a rigorous study of the minimax convergence rate in the more realistic case of has been missing so far. We fill this gap and derive optimal rates, which are fundamentally new and not known from any other nonparametric estimation problem.
The estimator which we propose is inspired by [12]. It achieves the optimal convergence rate and does not require dividing by a nonparametric density estimator. Instead we exploit the order statistic of the transformed design variables in a Priestley-Chao manner. The optimal choice of the tuning parameters depends both on the two parameters and on the smoothness parameter of the Hölder class, which is reminiscent of the estimation problem in [14] and in contrast to usual adaptation problems in nonparametric curve estimation, in which the smoothing parameters shall adapt only to an unknown smoothness level. Here we show how to make the estimator adaptive with respect to both of these parameters.
The paper is organized as follows. In Section 2 we introduce our estimation procedure. Section 3 is devoted to upper and lower risk bounds, which yield minimax rate optimality for the pointwise risk. We also derive an upper bound for the uniform risk; here an additional logarithmic factor occurs. In Section 4 we deal with adaptivity. The proofs and technical lemmas are deferred to Section 5.
Let us fix some notation: denotes the characteristic function of the , while is the conditional characteristic function of the random variable given the random variable . Throughout stands for the Euclidean norm of a real or complex vector, and denotes the indicator function of the event . For positive sequences and we write if , for constants .
2 The estimator
In order to construct an estimator for in model (1.1), we transform the data into via
[TABLE]
so that almost surely (a.s.), and are independent, and
[TABLE]
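A transformation of this kind can be sketched in Python. The sketch assumes the common normalization that maps the vector formed by 1 and the regressor to the unit circle, i.e. an angle equal to the arctangent of the regressor and a response rescaled by the length of that vector; this specific form, and the function name `to_polar_design`, are assumptions made for illustration. Under it the transformed response is the projection of the random coefficient vector onto the direction given by the angle, which the final check confirms numerically.

```python
import numpy as np


def to_polar_design(x, y):
    """Map (x, y) to (z, theta) with theta = arctan(x) (assumed normalization)."""
    theta = np.arctan(x)
    z = y / np.sqrt(1.0 + x**2)
    return z, theta


# Toy data from a linear model with random intercept a0 and random slope a1.
rng = np.random.default_rng(1)
x = rng.standard_t(3.0, size=1000)
a0 = rng.normal(size=1000)
a1 = rng.normal(size=1000)
y = a0 + a1 * x

z, theta = to_polar_design(x, y)
# cos(arctan x) = 1/sqrt(1+x^2) and sin(arctan x) = x/sqrt(1+x^2), hence
# z = a0 * cos(theta) + a1 * sin(theta).
projection_ok = np.allclose(z, a0 * np.cos(theta) + a1 * np.sin(theta))
```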
Then the conditional characteristic function of given equals
[TABLE]
By Fourier inversion, integral substitution into polar coordinates (with signed radius) and (2.2) we deduce that
[TABLE]
The equation (2) motivates us to estimate by an empirical version of the conditional characteristic function which is directly accessible from the data . For that purpose choose a function which satisfies the following assumption.
Assumption 2 (Kernel).
For a number the function is even, supported on , -fold continuously differentiable on the whole real line, satisfies as well as for all , and is bounded by .
Assumption 2 could be relaxed somewhat. In particular, we may assume compact support instead of imposing the support of to be a subset of and we may remove the condition that is bounded by . Simple boundedness is sufficient, which follows from the other conditions.
Now we consider the regularized version of by kernel smoothing as follows
[TABLE]
where
[TABLE]
Inspired by (2.4) we introduce a Priestley-Chao type estimator of the density ,
[TABLE]
where , , denotes the sample , , sorted such that , and where is a classical bandwidth parameter and is a threshold parameter both of which remain to be selected. By the parameter we cut off that subset of the interval in which the are sparse.
In the following we shall use the symbol
[TABLE]
to denote the sum over the random set of indices for which . Thus, we may write the estimator in (2.6) as
[TABLE]
In this paper we consider one-dimensional covariates only. From a methodological point of view, the estimator (2.6) could be extended to the multivariate setting by using Voronoi cells instead of the order statistics. A similar technique is proposed in eq. (36) in [12]. On the other hand, the asymptotic properties of such an estimator might be completely different from the univariate case.
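The Priestley-Chao idea behind (2.6) is easiest to see in ordinary nonparametric regression: each response is weighted by the spacing between neighbouring sorted design points, so no estimate of the design density appears in a denominator. The sketch below implements this classical regression estimator, not the estimator (2.6) itself; the Epanechnikov kernel and the name `priestley_chao` are assumptions for illustration.

```python
import numpy as np


def priestley_chao(x0, x, y, h):
    """Classical Priestley-Chao estimate of the regression function at x0."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # Spacings x_(j) - x_(j-1) of the ordered design points, j = 2, ..., n.
    spacings = np.diff(xs)
    u = (x0 - xs[1:]) / h
    # Epanechnikov kernel, supported on [-1, 1].
    kern = 0.75 * np.maximum(1.0 - u**2, 0.0)
    # Riemann-type sum: the spacings play the role of 1 / (n * design density).
    return np.sum(spacings * kern * ys[1:]) / h
```

On a dense, noise-free design such as `x = np.linspace(0, 1, 2001)` with `y = 2 * x + 1`, the call `priestley_chao(0.5, x, y, 0.1)` returns a value close to 2, the true regression value at 0.5.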
3 Upper and lower risk bounds
We consider the following Hölder smoothness class of densities.
Definition.
For a point , a smoothness index and constants define the class of densities as follows: is Hölder-smooth of the degree in the neighborhood , that is, is -times continuously differentiable in and its partial derivatives satisfy
[TABLE]
for all and . Furthermore, assume that the Fourier transform of is weakly differentiable and its weak derivative satisfies
[TABLE]
and that for all .
For the proof of the first theorem, the global partial tail and smoothness condition (3.2) of the order is required in addition to the local smoothness assumption (3.1) of the order . The theorem provides an upper bound on the convergence rate for the estimator in (2.6).
Theorem 3.1.
Consider model (1.1) and assume that satisfies (1.2) for some . If satisfies Assumption 2 for , and if and are chosen such that
[TABLE]
then the estimator (2.6) attains the following asymptotic risk upper bound over the function class ,
[TABLE]
The following theorem yields that the convergence rates which our estimator (2.6) achieves according to Theorem 3.1 are optimal for the pointwise risk in the minimax sense.
Theorem 3.2.
Fix and the constants , sufficiently large for any and . Let be an arbitrary sequence of estimators of , where is based on the data , , for each . Assume that satisfies (1.2). Then
[TABLE]
The convergence rates from Theorems 3.1 and 3.2 differ significantly from standard rates in nonparametric estimation. While they become faster as increases, they become slower as gets larger. It is remarkable that they do not approach the (squared) parametric rate but the slower rate for large .
The case . An analysis of the proof of Theorem 3.1 shows that in case , choosing and gives the rate
[TABLE]
in case , an additional logarithmic factor occurs. The upper bound no longer depends on in this regime. For , [13] obtain the faster rate $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+3}}\big)$; their rate is in but could be transferred to a pointwise rate. However, they additionally impose the assumption that the density is uniformly bounded with a bounded support. This implies that is also uniformly bounded. Under this additional assumption, instead of (5.4) in our analysis, we have the sharper bound
[TABLE]
since $\int_{\mathbb{R}} K^{2}(u;h)\,\mathrm{d}u \leq \text{const.}\cdot h^{-3}$. Then one can show that our estimator also achieves the rate $\mathcal{O}\big(n^{-\frac{2\alpha}{2\alpha+3}}\big)$ for , even with the choice .
Finally, we consider the uniform rate of convergence, again in the case .
Theorem 3.3.
Consider model (1.1) and assume that satisfies (1.2) for some . Suppose that satisfies Assumption 2 for , and that and are chosen such that
[TABLE]
For a compact rectangle let denote the class of densities on such that for each . Then the estimator (2.6) attains the following uniform asymptotic risk upper bound over the function class ,
[TABLE]
4 Adaptation
4.1 Adaptation with respect to for given smoothness
Assume that (1.2) holds with unknown . If there are at least two observations in the interval so that is not the sum over the empty set, we set
[TABLE]
otherwise we put and . To define a selection rule for , define the function
[TABLE]
which is continuous except at the sites , and for . Now choose in the interval such that
[TABLE]
The next proposition shows that the convergence rate from Theorem 3.1 does not deteriorate if only is unknown but is known.
Proposition 4.1.
Consider model (1.1) and assume that satisfies (1.2) for some unknown . Choose satisfying the Assumption 2 for for given . If is chosen in (4.2) and
[TABLE]
then for the estimator $\hat{f}_{A}\big(a;\hat{h}_{n},\hat{\delta}_{n}\big)$ we have that
[TABLE]
where .
4.2 Adaptation by the Lepski method
Finally we consider adaptivity with respect to both parameters and based on a combination of Lepski’s method, see [15] and [16], and the choice (4.2). Consider the grid of bandwidths
[TABLE]
where , and is defined in (4.2). Fix and denote
[TABLE]
For sufficiently large to be chosen we let
[TABLE]
where
[TABLE]
Theorem 4.1.
Consider model (1.1) and assume that satisfies (1.2) for some unknown . Choose according to Assumption 2 for some . Then for sufficiently large (e.g. suffices), we have, for every with , that
[TABLE]
where .
Thus for adaptivity an additional logarithmic factor occurs in the pointwise rate under Hölder smoothness constraints.
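Lepski's method in its generic form compares estimates across a bandwidth grid and keeps the largest bandwidth whose estimate is still consistent, up to a multiple of the noise level, with all estimates at smaller bandwidths. The sketch below shows only this generic rule; the grid, the noise levels and the threshold constant used in Theorem 4.1 are not reproduced, and the names `lepski_select`, `noise_levels` and `c_lep` are hypothetical.

```python
from typing import Sequence


def lepski_select(estimates: Sequence[float],
                  noise_levels: Sequence[float],
                  c_lep: float) -> int:
    """Return the index of the bandwidth selected by a generic Lepski rule.

    Inputs are ordered from the smallest bandwidth (index 0) to the largest.
    Index k is admissible if
        |estimates[k] - estimates[l]| <= c_lep * noise_levels[l]
    for every smaller bandwidth l < k; the largest admissible index is returned.
    """
    selected = 0  # index 0 is always admissible: there is nothing to compare
    for k in range(1, len(estimates)):
        admissible = all(
            abs(estimates[k] - estimates[l]) <= c_lep * noise_levels[l]
            for l in range(k)
        )
        if admissible:
            selected = k
    return selected
```

The design choice is the usual one: small bandwidths have small bias but large noise, so a discrepancy exceeding the noise level at a smaller bandwidth signals that the bias of the larger bandwidth has become dominant.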
5 Proofs
In the proofs we drop in and in from the notation.
5.1 Proofs for Section 3
Proof of Theorem 3.1.
By passing to Cartesian coordinates in (2.4) we can write
[TABLE]
Assumption 2 guarantees that is a kernel of order . Then, using Taylor approximation as usual in kernel regularization, see p. 37–38 in [20] for the argument in case of non-compactly supported kernels, the following asymptotic rate of the regularization bias term occurs
[TABLE]
where the constant factor only depends on , , and .
Now let denote the -field generated by , and consider the conditional bias-variance decomposition
[TABLE]
Since are independent given , observing from (2.5) that , we may bound
[TABLE]
where the constant factor only depends on . Therein we use the notation (2.7). For the conditional expectation, we obtain that
[TABLE]
where we set
[TABLE]
We deduce that
[TABLE]
where
[TABLE]
where and are defined in (4.1). If there are no two consecutive in the interval , then (indeed ). In this case, by our convention we have and so that and is the integral from to , as required for the estimate (5.5) to remain true in this case.
First, consider the term . Using the Cauchy-Schwarz inequality, it holds that
[TABLE]
Analogously we establish that
[TABLE]
Finally, consider the term . In case when there are two consecutive in the interval so that the sum in (2.7) is not empty, it holds that
[TABLE]
Now, for , we get that
[TABLE]
according to (2.2). Hence we may bound
[TABLE]
Applying the Cauchy-Schwarz inequality gives for
[TABLE]
By interchanging sum and integrals we obtain
[TABLE]
Using the Cauchy-Schwarz inequality twice yields
[TABLE]
Hence, the term obeys the upper bound
[TABLE]
Finally, if there are no two consecutive in the interval , we simply have $I_{1}\leq\big|\tilde{f}_{A}(a;h)\big|^{2}\leq f_{A}(a)^{2}+\text{const.}\cdot h^{2\alpha}\leq\text{const.}$ Collecting the terms that bound (5.5) and using (5.4), from (5.3) we obtain that
[TABLE]
Here, the last term takes care of the event in which the sum is empty and the estimator actually is zero. In order to bound the terms in (5.7) involving the order statistics, we note that since ,
[TABLE]
From (5.2) and (5.7) and Lemma 5.1 we obtain for that
[TABLE]
Upon inserting the rates for and we obtain the result.
∎
Proof of Theorem 3.2.
We introduce the functions
[TABLE]
for , some constant and some sequences and which remain to be selected; moreover we specify
[TABLE]
and
[TABLE]
where
[TABLE]
We verify that is a probability density as and are probability densities. The Fourier transform of equals
[TABLE]
so that
[TABLE]
since is supported on the interval . Choosing the constant sufficiently small we can guarantee that is a non-negative function and satisfies the inequality
[TABLE]
for some constant . Thus, is a probability density as well. Furthermore we verify that for both under the constraint
[TABLE]
as and may be viewed as sufficiently large. Therein note that (3.2) is satisfied as can be written as the sum of two functions where , are bounded, weakly differentiable, integrable functions whose weak derivatives are essentially bounded and integrable as well.
The squared pointwise distance between and at the point $a$ equals
[TABLE]
Using (5.8), the conditional density of given under the parameter equals
[TABLE]
for all . Moreover we have that
[TABLE]
where the Fourier transform equals
[TABLE]
Therefore the -distance between the competing observation densities is bounded from above as follows,
[TABLE]
where
[TABLE]
Moreover, this choice also guarantees that integrates to and, hence, is a probability density. Then the integrals in (5.11) range over a subset of
[TABLE]
as and its (weak) derivative are supported on . Also these functions are uniformly bounded by . Thus the integrals vanish whenever . It follows that
[TABLE]
if ; and otherwise. According to standard arguments from decision theory, (5.10) represents a lower bound on the attainable rate if the Hellinger distance between the competing data distributions (for and , respectively) obeys an upper bound which is smaller than – uniformly with respect to , see e.g. [21]. Writing for the Hellinger distance, it holds that
[TABLE]
as the distribution of the is identical for and . Then, the term (5.12) is bounded from above by
[TABLE]
as . We choose so that the -distance between the joint densities of the observations under and in (5.13) is bounded from above as tends to infinity. By elementary decision theoretic arguments and by (5.10), a lower bound on the attainable convergence rate is given by
[TABLE]
which completes the proof of the theorem. ∎
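Two standard facts are typically used in reductions of this kind and may help in following the argument. With the convention $H^2(P,Q)=\int(\sqrt{p}-\sqrt{q})^2$, which is assumed here and may differ from the paper's normalization by a constant factor, the squared Hellinger distance of product measures grows at most linearly in the sample size and is dominated by the $\chi^2$-distance:

```latex
\begin{align*}
H^2\big(P^{\otimes n}, Q^{\otimes n}\big)
  &= 2\Big(1 - \big(1 - \tfrac{1}{2} H^2(P,Q)\big)^{n}\Big)
  \;\le\; n\, H^2(P,Q), \\
H^2(P,Q)
  &= \int \frac{(p-q)^2}{(\sqrt{p}+\sqrt{q})^2}
  \;\le\; \int \frac{(p-q)^2}{q}
  \;=\; \chi^2(P,Q).
\end{align*}
```

Consequently, a $\chi^2$-distance of order $1/n$ between the two single-observation distributions, with a sufficiently small constant, keeps the Hellinger distance between the joint distributions of the sample below any prescribed level.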
Proof of Theorem 3.3.
We estimate
[TABLE]
where is defined in (5.1). The second term, the regularization bias, is bounded in (5.2), and that bound is uniform in by the assumptions on the function class . For the first term we have, similarly to (5.3), that
[TABLE]
The second term in (5.14) is bounded by
[TABLE]
where are defined as in (5.5), and the dependence on is stressed in the notation. The bounds on the derived after (5.5) are uniform in over a bounded set . Thus, it remains to bound the first term in (5.14).
Given let be a subset of for which the -balls with centers at points in cover . It is possible to choose such a set with a cardinality of order , where depends on but not on . Then
[TABLE]
Since , see the formula (2.5) for and the Assumption 2 in , by Lipschitz-continuity the second term is . From the Hoeffding inequality, since we obtain for that
[TABLE]
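For reference, Hoeffding's inequality in its standard form reads as follows; the symbols $\xi_j$, $a_j$, $b_j$ and $t$ are generic and not tied to the notation of the proof.

```latex
% \xi_1, \dots, \xi_n independent with a_j \le \xi_j \le b_j almost surely, and t > 0:
\[
\mathbb{P}\Big( \Big| \sum_{j=1}^{n} \big(\xi_j - \mathbb{E}\,\xi_j\big) \Big| \ge t \Big)
\;\le\; 2 \exp\Big( - \frac{2 t^2}{\sum_{j=1}^{n} (b_j - a_j)^2} \Big).
\]
```

In a conditional version, the same bound holds given a sigma-field under which the summands are conditionally independent and the bounds $a_j$, $b_j$ are measurable.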
Set
[TABLE]
Then, for we estimate
[TABLE]
Choose and . Then if we obtain from Lemma 5.1 that
[TABLE]
and overall
[TABLE]
Plugging in the choices of and gives the result. ∎
5.2 Proofs for Section 4
Proof of Proposition 4.1.
From (5.7) and (5.2) we estimate
[TABLE]
Observe that from the term in the definition of ,
[TABLE]
Since , and since contains the term , from (5.16) and the choice of we obtain the bound
[TABLE]
By definition of ,
[TABLE]
for the deterministic choice , which is contained in for sufficiently large since . Further, by Jensen’s inequality, Lemma 5.1 and the choice of ,
[TABLE]
Substituting these estimates into (5.17), and using (5.28) finally gives
[TABLE]
∎
Proof of Theorem 4.1.
Fix with and , and set
[TABLE]
see the bound for the regularization bias in (5.2). We shall abbreviate .
On the event
[TABLE]
where , we may estimate
[TABLE]
since . In the following, suppose that there are two design points in the interval . Since for each , as in the proof of Proposition 4.1 the term involving in (5.16) is negligible as compared to that with the factor . Hence using (5.7) and (5.2) we estimate
[TABLE]
Define the ‘oracle index’ by
[TABLE]
Note that since , while since from the definition of . Further, since by the choice of we have that we estimate
[TABLE]
since by the choice of . Finally,
[TABLE]
since since from the definition of and since .
Since increase by factors in , and decrease by factors in , it follows from the above estimates that and , and that there are constants such that . Rearranging yields
[TABLE]
for constants . We obtain from (5.18) that
[TABLE]
Now, for we estimate
[TABLE]
For the second term, we have that
[TABLE]
The second term in (5.22) is bounded by (5.20) after a trivial estimate of the indicator. Further, from the definition of and (5.19) we have the bound
[TABLE]
which also holds in conditional expectation given .
For the first term in (5.21) we estimate
[TABLE]
Then
[TABLE]
Now let
[TABLE]
By choice of , for we have that
[TABLE]
Hence, setting we may estimate
[TABLE]
Therefore, for ,
[TABLE]
Since , , it suffices to bound
[TABLE]
By choice of the grid , , therefore
[TABLE]
for sufficiently large. Hence
[TABLE]
where $\tilde{C}=\big(C_{\text{Lep}}^{1/2}/4-1\big)$. Using the bound , see the formula (2.5) for and the Assumption 2 in , we use the conditional Hoeffding inequality in order to estimate
[TABLE]
see (5.4), where
[TABLE]
for the choice . Note that in this step, the logarithmic factor is essential.
Hence
[TABLE]
and in (5.23) we obtain the bound
[TABLE]
The crude bound
[TABLE]
now suffices to conclude that for sufficiently large choice of the constant ,
[TABLE]
The remainder of the proof is the same as that of Proposition 4.1. ∎
5.3 Spacings
As the density of equals
[TABLE]
so that (1.2) implies
[TABLE]
for some constants .
Lemma 5.1.
If satisfies (1.2) and hence fulfills (5.25), then for we have that
[TABLE]
Furthermore,
[TABLE]
and for that
[TABLE]
Proof of Lemma 5.1.
Setting
[TABLE]
we deduce under (5.25) that
[TABLE]
that is, (5.26). Moreover we write and so that
[TABLE]
as . The term $\mathbb{E}\big[\big(R_{n}(\delta)-\pi/2\big)^{2}\big]$ can be bounded analogously.
Concerning (5.28), we bound the probability that there is at most one observation in for by
[TABLE]
which implies the result. ∎
Acknowledgements
The authors are grateful to the editors and a referee for their thorough review and very helpful and constructive comments. H. Holzmann gratefully acknowledges financial support of the DFG, grant Ho 3260/5-1.
[1] Arellano, M. and Bonhomme, S. (2011). Identifying distributional characteristics in random coefficients panel data models. Rev. Econ. Stud. 79, 987–1020.
[2] Beran, R., Feuerverger, A. and Hall, P. (1996). On nonparametric estimation of intercept and slope distributions in random coefficient regression. Ann. Statist. 24, 2569–2592.
[3] Beran, R. and Hall, P. (1992). Estimating coefficient distributions in random coefficient regressions. Ann. Statist. 20, 1970–1984.
[4] Beran, R. and Millar, P.W. (1994). Minimum distance estimation in random coefficient regression models. Ann. Statist. 22, 1976–1992.
[5] Breunig, C. and Hoderlein, S. (2018). Specification testing in random coefficient models. Quant. Econ. 9, 1371–1417.
[6] Dunker, F., Eckle, K., Proksch, K. and Schmidt-Hieber, J. (2019). Tests for qualitative features in the random coefficients model. Elect. J. Statist. 13, 2257–2306.
[7] Gaillac, C. and Gautier, E. (2019). Adaptive estimation in the linear random coefficients model when regressors have limited variation. arXiv:1905.06584.
[8] Gautier, E. and Hoderlein, S. (2011). A triangular treatment effect model with random coefficients in the selection equation. arXiv:1109.0362.
