Adaptively Transformed Mixed Model Prediction of General Finite Population Parameters
Shonosuke Sugasawa, Tatsuya Kubokawa

TL;DR
This paper introduces an adaptive transformation approach for mixed model predictions of finite population parameters, improving estimation accuracy when response data deviate from normality.
Contribution
It proposes a data-driven method to select suitable transformations using a parametric family and profile likelihood, enhancing prediction robustness.
Findings
The method performs well in simulations, showing improved accuracy over traditional transformations.
Application to income data demonstrates the method's practical utility and identifies limitations of common transformations.
Constructs empirical Bayes confidence intervals for population parameters.
Abstract
For estimating area-specific parameters (quantities) in a finite population, a mixed model prediction approach is attractive. However, this approach strongly depends on the normality assumption of the response values although we often encounter a non-normal case in practice. In such a case, transforming observations to make them suitable for normality assumption is a useful tool, but the problem of selecting suitable transformation still remains open. To overcome the difficulty, we here propose a new empirical best predicting method by using a parametric family of transformations to estimate a suitable transformation based on the data. We suggest a simple estimating method for transformation parameters based on the profile likelihood function, which achieves consistency under some conditions on transformation functions. For measuring variability of point prediction, we construct an…
| Scenario | s1 | s2 | s3 | s4 | s5 | s6 | s7 |
| Model | A | A | A | B | B | C | D |
| - | - |
| Scenario | ATP | LTP | MQ | DE | ATP | LTP | MQ | DE | |||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 5.04 | 5.04 | 15.08 | 21.44 | 5.00 | 5.02 | 16.55 | 17.51 | |||
| 20 | 3.33 | 3.32 | 12.48 | 9.14 | 3.34 | 3.33 | 14.09 | 12.01 | |||
| s1 | 30 | 2.60 | 2.60 | 11.65 | 7.08 | 2.34 | 2.33 | 14.32 | 8.14 | ||
| 40 | 1.76 | 1.76 | 9.76 | 4.62 | 1.80 | 1.79 | 11.44 | 5.05 | |||
| 50 | 1.51 | 1.51 | 10.41 | 3.40 | 1.37 | 1.37 | 12.39 | 3.28 | |||
| 10 | 5.45 | 5.78 | 13.90 | 21.28 | 5.10 | 5.37 | 14.87 | 17.49 | |||
| 20 | 3.09 | 3.35 | 10.62 | 8.32 | 3.19 | 3.43 | 11.31 | 11.70 | |||
| s2 | 30 | 2.35 | 2.58 | 9.75 | 6.30 | 2.34 | 2.53 | 11.74 | 8.12 | ||
| 40 | 1.73 | 1.87 | 8.64 | 4.35 | 1.79 | 2.00 | 9.84 | 5.02 | |||
| 50 | 1.42 | 1.62 | 8.52 | 3.37 | 1.43 | 1.66 | 10.68 | 3.31 | |||
| 10 | 4.62 | 5.95 | 11.37 | 18.12 | 4.43 | 5.83 | 12.27 | 15.71 | |||
| 20 | 2.90 | 3.68 | 9.85 | 7.84 | 2.91 | 4.05 | 9.92 | 10.77 | |||
| s3 | 30 | 2.13 | 3.02 | 8.10 | 5.88 | 2.01 | 3.02 | 10.03 | 6.85 | ||
| 40 | 1.56 | 2.19 | 7.05 | 4.13 | 1.57 | 2.48 | 8.08 | 4.38 | |||
| 50 | 1.31 | 2.18 | 7.00 | 3.07 | 1.30 | 2.27 | 8.43 | 3.09 | |||
| 10 | 6.03 | 6.47 | 13.72 | 21.22 | 5.28 | 5.61 | 13.92 | 16.41 | |||
| 20 | 3.39 | 3.65 | 10.83 | 8.26 | 3.45 | 3.77 | 10.74 | 10.98 | |||
| s4 | 30 | 2.47 | 2.69 | 10.03 | 6.09 | 2.43 | 2.71 | 11.04 | 8.02 | ||
| 40 | 1.78 | 2.00 | 8.15 | 4.49 | 1.79 | 2.07 | 9.72 | 4.57 | |||
| 50 | 1.53 | 1.74 | 8.76 | 3.30 | 1.53 | 1.84 | 9.97 | 3.25 | |||
| 10 | 5.16 | 6.82 | 11.39 | 18.59 | 4.73 | 6.63 | 10.48 | 14.41 | |||
| 20 | 3.17 | 4.42 | 8.59 | 7.66 | 3.10 | 4.60 | 8.94 | 10.25 | |||
| s5 | 30 | 2.11 | 3.26 | 7.39 | 5.54 | 2.14 | 3.49 | 8.72 | 6.76 | ||
| 40 | 1.52 | 2.50 | 6.21 | 3.93 | 1.76 | 2.94 | 7.47 | 4.52 | |||
| 50 | 1.41 | 2.47 | 6.31 | 3.06 | 1.37 | 2.61 | 7.78 | 2.90 | |||
| 10 | 6.20 | 6.69 | 17.21 | 21.35 | 6.23 | 6.83 | 18.06 | 18.38 | |||
| 20 | 3.88 | 4.28 | 13.73 | 9.16 | 3.86 | 4.29 | 14.48 | 12.47 | |||
| s6 | 30 | 2.76 | 3.11 | 11.96 | 6.90 | 2.92 | 3.38 | 14.75 | 8.93 | ||
| 40 | 2.07 | 2.33 | 10.42 | 4.62 | 2.10 | 2.42 | 12.16 | 5.11 | |||
| 50 | 1.68 | 1.96 | 10.56 | 3.52 | 1.69 | 2.03 | 13.02 | 3.52 | |||
| 10 | 5.74 | 8.45 | 13.91 | 19.94 | 6.03 | 8.99 | 14.63 | 17.19 | |||
| 20 | 3.83 | 5.82 | 11.71 | 8.80 | 3.89 | 6.48 | 12.19 | 11.83 | |||
| s7 | 30 | 2.82 | 4.82 | 9.55 | 6.46 | 2.93 | 5.47 | 11.00 | 8.38 | ||
| 40 | 2.19 | 3.95 | 8.23 | 4.48 | 2.30 | 4.53 | 9.27 | 4.99 | |||
| 50 | 2.06 | 4.02 | 8.77 | 3.55 | 1.98 | 4.27 | 10.46 | 3.55 | |||
| Scenario | ATP | LTP | MQ | DE | ATP | LTP | MQ | DE | |||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1.86 | 1.86 | 4.16 | 6.52 | 1.83 | 1.83 | 4.36 | 4.76 | |||
| 20 | 1.17 | 1.17 | 3.25 | 2.27 | 1.22 | 1.21 | 3.59 | 3.63 | |||
| s1 | 30 | 0.92 | 0.92 | 3.14 | 2.14 | 0.83 | 0.83 | 3.84 | 2.39 | ||
| 40 | 0.62 | 0.62 | 2.36 | 1.43 | 0.66 | 0.65 | 2.96 | 1.56 | |||
| 50 | 0.52 | 0.52 | 2.59 | 1.06 | 0.50 | 0.50 | 3.12 | 1.05 | |||
| 10 | 1.76 | 1.80 | 4.05 | 5.58 | 1.73 | 1.77 | 4.12 | 4.41 | |||
| 20 | 0.94 | 0.96 | 2.81 | 2.04 | 1.02 | 1.03 | 3.11 | 3.02 | |||
| s2 | 30 | 0.77 | 0.80 | 2.79 | 1.79 | 0.74 | 0.76 | 3.28 | 2.10 | ||
| 40 | 0.55 | 0.56 | 2.28 | 1.26 | 0.57 | 0.58 | 2.57 | 1.39 | |||
| 50 | 0.44 | 0.45 | 2.19 | 0.98 | 0.45 | 0.46 | 2.80 | 0.91 | |||
| 10 | 1.20 | 1.27 | 3.26 | 3.95 | 1.20 | 1.31 | 3.39 | 3.36 | |||
| 20 | 0.73 | 0.80 | 2.64 | 1.53 | 0.77 | 0.83 | 2.90 | 2.33 | |||
| s3 | 30 | 0.55 | 0.59 | 2.47 | 1.26 | 0.51 | 0.56 | 2.80 | 1.47 | ||
| 40 | 0.40 | 0.44 | 1.98 | 0.98 | 0.43 | 0.48 | 2.45 | 1.01 | |||
| 50 | 0.34 | 0.37 | 2.00 | 0.78 | 0.33 | 0.36 | 2.46 | 0.68 | |||
| 10 | 1.92 | 1.96 | 3.49 | 5.67 | 1.67 | 1.70 | 3.49 | 4.05 | |||
| 20 | 0.97 | 0.97 | 2.45 | 1.93 | 1.03 | 1.02 | 2.42 | 2.82 | |||
| s4 | 30 | 0.76 | 0.76 | 2.41 | 1.67 | 0.73 | 0.73 | 2.63 | 1.93 | ||
| 40 | 0.55 | 0.54 | 1.81 | 1.23 | 0.59 | 0.59 | 2.25 | 1.31 | |||
| 50 | 0.46 | 0.46 | 1.95 | 0.92 | 0.46 | 0.46 | 2.33 | 0.90 | |||
| 10 | 1.28 | 1.39 | 2.97 | 4.08 | 1.19 | 1.26 | 2.67 | 2.98 | |||
| 20 | 0.74 | 0.76 | 2.13 | 1.50 | 0.77 | 0.82 | 2.40 | 2.35 | |||
| s5 | 30 | 0.51 | 0.54 | 1.93 | 1.21 | 0.54 | 0.54 | 2.30 | 1.45 | ||
| 40 | 0.39 | 0.40 | 1.54 | 0.90 | 0.45 | 0.47 | 2.06 | 1.03 | |||
| 50 | 0.36 | 0.38 | 1.62 | 0.75 | 0.35 | 0.37 | 2.13 | 0.68 | |||
| 10 | 2.61 | 2.64 | 6.77 | 7.63 | 2.65 | 2.73 | 6.81 | 6.66 | |||
| 20 | 1.57 | 1.58 | 5.11 | 3.30 | 1.59 | 1.59 | 5.37 | 4.44 | |||
| s6 | 30 | 1.10 | 1.12 | 4.78 | 2.52 | 1.24 | 1.26 | 5.67 | 3.07 | ||
| 40 | 0.84 | 0.84 | 3.98 | 1.68 | 0.95 | 0.94 | 4.70 | 1.89 | |||
| 50 | 0.69 | 0.70 | 3.93 | 1.35 | 0.74 | 0.74 | 5.03 | 1.32 | |||
| 10 | 2.45 | 2.67 | 5.47 | 7.32 | 2.57 | 2.82 | 5.55 | 5.99 | |||
| 20 | 1.54 | 1.67 | 4.20 | 2.86 | 1.66 | 1.86 | 4.71 | 4.42 | |||
| s7 | 30 | 1.12 | 1.26 | 3.82 | 2.37 | 1.19 | 1.36 | 4.30 | 3.02 | ||
| 40 | 0.88 | 0.97 | 2.99 | 1.74 | 0.97 | 1.09 | 3.48 | 1.87 | |||
| 50 | 0.77 | 0.90 | 3.06 | 1.37 | 0.78 | 0.92 | 3.89 | 1.20 | |||
| BCI | NCI | BCI | NCI | BCI | NCI | BCI | NCI | ||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.958 | 0.944 | 0.951 | 0.936 | 0.949 | 0.938 | 0.951 | 0.935 | |||
| 20 | 0.946 | 0.930 | 0.944 | 0.933 | 0.952 | 0.932 | 0.946 | 0.925 | |||
| CP | 30 | 0.951 | 0.935 | 0.952 | 0.936 | 0.951 | 0.941 | 0.952 | 0.936 | ||
| 40 | 0.947 | 0.933 | 0.956 | 0.936 | 0.953 | 0.942 | 0.954 | 0.939 | |||
| 50 | 0.951 | 0.934 | 0.954 | 0.936 | 0.955 | 0.942 | 0.957 | 0.942 | |||
| 10 | 0.281 | 0.264 | 0.155 | 0.145 | 0.272 | 0.256 | 0.150 | 0.140 | |||
| 20 | 0.218 | 0.205 | 0.118 | 0.111 | 0.221 | 0.208 | 0.123 | 0.115 | |||
| AL | 30 | 0.187 | 0.175 | 0.101 | 0.095 | 0.185 | 0.174 | 0.102 | 0.096 | ||
| 40 | 0.166 | 0.155 | 0.088 | 0.083 | 0.164 | 0.154 | 0.090 | 0.084 | |||
| 50 | 0.151 | 0.142 | 0.081 | 0.076 | 0.145 | 0.136 | 0.079 | 0.074 | |||
| SDP | SDP-s | SS | SL | ||
|---|---|---|---|---|---|
| AIC | 347778.2 | 347840.0 | 348123.8 | 348870.7 | |
| BIC | 347886.7 | 347940.8 | 348232.4 | 348963.8 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEconomic and Environmental Valuation · demographic modeling and climate adaptation · Income, Poverty, and Inequality
Adaptively Transformed Mixed Model Prediction of General Finite Population Parameters
SHONOSUKE SUGASAWA
*Center for Spatial Information Science, The University of Tokyo
*TATSUYA KUBOKAWA
*Faculty of Economics, The University of Tokyo
Abstract. For estimating area-specific parameters (quantities) in a finite population, a mixed model prediction approach is attractive. However, this approach strongly depends on the normality assumption of the response values although we often encounter a non-normal case in practice. In such a case, transforming observations to make them suitable for normality assumption is a useful tool, but the problem of selecting suitable transformation still remains open. To overcome the difficulty, we here propose a new empirical best predicting method by using a parametric family of transformations to estimate a suitable transformation based on the data. We suggest a simple estimating method for transformation parameters based on the profile likelihood function, which achieves consistency under some conditions on transformation functions. For measuring variability of point prediction, we construct an empirical Bayes confidence interval of the population parameter of interest. Through simulation studies, we investigate numerical performance of the proposed methods. Finally, we apply the proposed method to synthetic income data in Spanish provinces in which the resulting estimates indicate that the commonly used log-transformation would not be appropriate.
Key words: Confidence interval; Empirical Bayes; Finite population; Mean squared error; Random effect; Small area estimation.
Introduction
The mixed model prediction based on random effect models has been widely used in small area estimation (Rao and Molina, 2015). The random effect models used in small area estimation are mainly divided into two models: the Fay-Herriot model (Fay and Herriot, 1979) and the nested error regression model (Battese et al., 1988). Especially, the nested error regression model has been used for estimating population parameters in a finite population. Here we consider a finite population consisting of areas and each area has units for . Let be a characteristic of the th unit in the th area, the main purpose is to estimate the area-specific parameter defined as
[TABLE]
where is a known (user-specified) function. The simplest choice is , in which corresponds to the finite population mean, and this case has been studied in the literature; Chambers et al. (2014), Jiang and Lahiri (2006), Lahiri and Mukherjee (2007) and Schmit et al. (2016). On the other hand, as noted by Molina and Rao (2010), more complex forms of are often used in practice. For example, in poverty mapping, we often adopt the FGT poverty measure (Foster et al., 1984), where is a suitable poverty threshold.
If all the units in the th area were observed, we could calculate the true value of . However, only a part of the units are available in practice. Let be the number of sampled units and be the sampled data. It is known that the direct estimator of using the observed units has high variability, especially in the case that is much smaller than . In real application, some covariates associated with are available not only for sampled but also for non-sampled units, which are denoted by {\text{\boldmathx}}_{ij} with and . Hence, one aims to estimate based on the sampled data and information on covariates. To this end, a typical strategy is to assume that all the units follow the nested error regression model (Battese et al., 1988):
[TABLE]
where {\text{\boldmathx}}_{ij} and are -dimensional vectors of covariates and regression coefficients, is the area-specific effect which follows and is a sampling error distributed as . Note that the model (2) leads to normality assumption of . Then, the conditional distribution of the non-sampled data given all the sampled data is given by
[TABLE]
which follows from the normality of under the model (2). Then the best predictor of under squared error loss is the conditional expectation , which has the form
[TABLE]
Here, the expectation could be computed via the Monte Carlo integration by generating a large number of random samples from the conditional distribution (3). Moreover, the best predictor depends on the unknown model parameters {\text{\boldmath\beta}},\tau^{2} and in the model (2), so that these parameters should be replaced with their estimated counterparts. To this end, the model parameters in the model (2) would be estimated based on the sampled data by using, for example, the maximum likelihood or restricted maximum likelihood methods.
It is observed that the key assumption in deriving the best predictor (4) is the normality of in the model (2), which enables us to obtain the simple expression of the conditional distribution (3). However, we often encounter the case where the normality assumption is not plausible for . For instance, in poverty mapping, would be a welfare variable (e.g. income), so that the distribution of would be right skewed and would not be normal. In this case, several methods have been considered so far. For instance, Chambers and Tzavidis (2006) proposed the M-quantile models and Sinha and Rao (2009) and Chambers et al. (2014) among others proposed robust methods against outliers under normality assumptions. However, these methods basically aims to estimate only the mean parameters, corresponding in (1), and do not take account of the data characteristics like skewness. On the other hand, Molina and Rao (2010) considered the nested error regression model (2) for the transformed variables instead of , which would be able to take account of data characteristics. For instance, the log-transformation, , would be the most standard approach in small area estimation (e.g. Slud and Haiti, 2006; Molina and Martin, 2017) when is right skewed as often appeared in welfare variables. However, we can not know how skew the true data distribution is whereas the use of log-transformation forces the amount of skewness in the data distribution. This problem can be regarded as the misspecification of the transformation, under which the predictor of might be biased or inefficient. In such a case, it would be more preferable to consider a parametric family of transformations and we propose to estimate the transformation based on the data for solving the misspecification as much as possible. This approach would enable us to get better prediction although the problem of violating normality assumption are not necessarily solved due to limitations arising from the use of a parametric family of transformations. We derive a form of the best predictor of and provide a simple estimating method for transformation parameters based on profile likelihood function, which produces a consistent estimator under some regularity conditions. We also construct an empirical Bayes confidence interval of for measuring the variability of the point prediction. The proposed intervals are shown to have coverage error, and we also suggest the parametric bootstrap calibration for confidence intervals with further accuracy.
This paper is organized as follows: In Section 2, we describe the proposed prediction method as well as parameter estimation of the model parameters. In Section 3, we construct an empirical Bayes confidence interval of . In Section 4, we present the results from simulation studies and a data application. In Section 5, we give conclusions and some discussions. The technical proofs are given in Appendix.
Adaptively Transformed Mixed Model Prediction
Transformed best predictor
Let be a family of transformations with parameter . The transformation parameter might be multidimensional, but we treat as a scalar parameter for notational simplicity. The assumptions and specific choices of will be discussed in the subsequent section. We assume that the transformed variable follows the nested error regression model:
[TABLE]
where {\text{\boldmathx}}_{ij} and are -dimensional vectors of fixed covariates and regression coefficients, and are an area-specific effect and a sampling error, respectively. Here we assume that and are mutually independent and distributed as and with unknown two variance parameters and . It is worth noting that, owing to the area effect , the units in the same area are mutually correlated while the units in the different area are independent. Specifically, from (5), it holds , thereby the units in the same area are mutually correlated and the degree of correlation is determined by the ratio . The normality assumptions of and leads to the assumption that H_{{\lambda}}(Y_{ij})\sim N({\text{\boldmathx}}_{ij}^{t}{\text{\boldmath\beta}},\tau^{2}+{\sigma}^{2}). Although, there would not necessarily exist the suitable transformation parameter to make the transformed response hold the normality assumption exactly, our theory will be developed under the normality assumption.
In model (5), the response variable is assumed to follow a class of distributions defined by transforming the normal distribution N({\text{\boldmathx}}_{ij}^{t}{\text{\boldmath\beta}},\tau^{2}+{\sigma}^{2}) by , the inverse function of . For example, the use of log-transformation leads to the assumption that follows a log-normal distribution. Moreover, the use of a transformation does not change the correlation structure that the units in the same areas are correlated while different areas are kept independent. On the other hand, the mean and variance could be a complicate function of the mean {\text{\boldmathx}}_{ij}^{t}{\text{\boldmath\beta}} and variance in the transformed scale, so that it could induce some relationships between mean and variance as observed in a log-normal distribution, and the correlations among ’s in the same areas would not be necessarily constant.
Let be the sampled data. From the model (5), we have , , where
[TABLE]
Hence, the best predictor of given in (1) can be obtained as
[TABLE]
where the expectation is taken with respect to , and is the composite function of and , the inverse function of . Although the expectation does not have a closed form in general, it can be easily computed via the Monte Carlo integration. We call the best predictor (7) adaptively transformed best predictor (ATP).
Estimation of structural parameters
Let {\text{\boldmath\phi}}=({\text{\boldmath\beta}}^{t},\tau^{2},{\sigma}^{2},{\lambda})^{t} be a vector of unknown model parameters in (5). We here propose estimating based on the following log-marginal likelihood function:
[TABLE]
where ({\text{\boldmath{\Sigma}}}_{i})_{k\ell}=\tau^{2}+{\sigma}^{2}I(k=\ell), , {\text{\boldmathX}}_{i}=({\text{\boldmathx}}_{i1}^{t},\ldots,{\text{\boldmathx}}_{in_{i}}^{t})^{t}, and denotes the derivative of . The maximum likelihood estimator of can be defined as the maximizer of L({\text{\boldmath\phi}}).
For maximizing the likelihood function L({\text{\boldmath\phi}}), we first note that the profile log-likelihood function of can be expressed as
[TABLE]
where is the maximum log-likelihood of the nested error regression model with response values and covariate vectors {\text{\boldmathx}}_{ij}. Since the computation of can be readily carried out via well-developed numerical methods, e.g. sae package in R (Molina and Marhuenda, 2015), the point evaluation of the profile likelihood (9) is quite easy. Hence, we may obtain the maximizer of by using, for example, the golden section method (Brent et al., 1973). Once we obtain the estimator , we get the estimators of other parameters by applying the nested error regression model to the data set \{H_{{\widehat{\lambda}}}(y_{ij}),{\text{\boldmathx}}_{ij}\}.
For estimating the two variance parameters and , the restricted maximum likelihood (REML) method (Jiang, 1996) might be more attractive than the maximum likelihood method in terms of estimating variance components since the REML method is known to produce estimates with smaller bias than the maximum likelihood method. To implement the REML estimation, the first three terms in (8) are changed to those of the REML method, and the resulting function could be maximized by profiling as used in the maximum likelihood method. We note that the estimator of transformation parameters would be changed by adapting the REML estimation, and it would be hard to investigate theoretical differences in terms of estimating transformation parameters. Therefore, we consider only the maximum likelihood estimator for simplicity.
Class of transformations
When applying the aforementioned prediction method, the concrete choice of would be important in practice. For the choice of suitable transformations, we need to take account of the data characteristics. For instance, income data is often right skewed, so that a family of transformations that includes the log-transformation would be suitable.
Before considering the concrete family of transformations, we introduce the following class of transformations for theoretical guarantee of the proposed method.
Assumption 1**.**
(Class of transformations)
* is a differentiable and monotone function, and the range of is for all .*
- 2.
For fixed , as the function of is differentiable.
- 3.
The function , and with are bounded from the upper by with some constants .
The first condition is crucial in this context. If the range of is not , but some subset , the inverse function cannot be defined on , which causes problems in computing the predictor (7). Although the Box-Cox transformation (Box and Cox, 1964) is widely used for positive valued data and has been adopted in small area estimation (Li and Lahiri, 2007), it does not belong to the class of transformations due to the limited range of the Box-Cox transformation.
When we focus on estimating poverty indicators based on welfare variables like income, the following two properties of parametric transformations would be preferable: one is simplicity of the inverse function used in computing the predictor (7), and the other is eliminating skewness of data distribution which would be often the case in income distribution. There area several parametric transformations related to the Box-Cox transformation, e.g. John and Draper (1980) and Yeo and Johnson (2000). However, the transformation by John and Draper (1980) cannot necessarily eliminate the skewness since the transformation is an odd function. On the other hand, the transformation by Yeo and Johnson (2000) has a relatively complicated form and so does the inverse transformation. As transformations that belongs to the class and satisfies two desirable properties, we consider two parametric transformations, the dual power (DP) transformation (Yang, 2006) and the sinh-arcsinh (SS) transformation (Jones and Pewsey, 2009).
The DP transformation is described as
[TABLE]
where . In the context of small area estimation, the Fay-Herriot model has been extended by Sugasawa and Kubokawa (2017) with use of the DP transformation. It is easy to confirm that the range of DP transformation is . Note that controls the skewness of the transformed random variable . The inverse function and the Jacobian appeared in the predictor (7) and the profile likelihood (9), respectively, are given by
[TABLE]
The original DP transformation (10) by Yang (2006) can be applied only for a positive valued response. However, as demonstrated in Section 4.3, welfare variables can take negative values depending on their definition. In this case, we propose the shifted-DP (SDP) transformation of the form , where with some small .
The SS transformation has the following form:
[TABLE]
where is the hyperbolic sine function, , and two transformation parameter and control skewness and tail heaviness of the transformed random variable , respectively. The inverse transformation and the Jacobian are obtained as
[TABLE]
Since the domain of the SS transformation is the whole real line, it could be used not only for positive valued data but also real valued data.
As mentioned at the beginning of this section, which transformation should be used would depend on the data characteristics. Meanwhile, we may select the suitable parametric transformation in a more objective way by using information criteria of the form: , where ML is the maximum log-likelihood, namely the maximum value of (8), is the number of transformation parameters and is the total number of sampled units. Setting and correspond to AIC-like and BIC-like criterion, respectively.
Large sample properties
We here consider the large sample properties of the estimator of structural parameters. To this end, we assume the following condition:
Assumption 2**.**
(Assumptions under large )
The true parameter vector {\text{\boldmath\phi}}_{0} is an interior point of the parameter space .
- 2.
.
- 3.
The elements of {\text{\boldmathX}}_{i} are uniformly bounded and {\text{\boldmathX}}_{i}^{t}{\text{\boldmathX}}_{i} is positive definite.
- 4.
m^{-1}\sum_{i=1}^{m}{\text{\boldmathX}}_{i}^{t}{\text{\boldmath{\Sigma}}}_{i}^{-1}{\text{\boldmathX}}_{i}* converges to a positive definite matrix as .*
The first condition implies that there exists the true transformation parameter that the transformed random variable achieves normality while it might not necessarily hold in practice. Since the asymptotic variance and covariance matrix of the maximum likelihood estimator can be derived from the Fisher information matrix, we first provide the Fisher information matrix in the following theorem, where the proof is given in Appendix.
Theorem 1**.**
When the Fisher information is denoted by I_{\phi_{k}\phi_{j}}=-{\rm E}[\partial^{2}L({\text{\boldmath\phi}})/\partial\phi_{k}\partial\phi_{j}], we have
[TABLE]
where for , {\text{\boldmathz}}_{i}=H_{{\lambda}}(y_{i})-{\text{\boldmathX}}_{i}{\text{\boldmath\beta}}, and denotes the expectation with respect to ’s following the model (5). Then, under Assumptions 1 and 2, the maximum likelihood estimator {\widehat{\text{\boldmath\phi}}} is asymptotically distributed as {\widehat{\text{\boldmath\phi}}}\sim N({\text{\boldmath\phi}},{\text{\boldmathI}}_{{\text{\boldmath\phi}}}^{-1}).
From Theorem 1, it is observed that the information matrix of ({\text{\boldmath\beta}}^{t},\tau^{2},{\sigma}^{2}) does not depend on the transformation parameter , and their expressions are the same as those of the traditional nested error regression models. While the two variance parameters and are orthogonal to in the sense that {\text{\boldmathI}}_{{\text{\boldmath\beta}}\tau^{2}}={\text{\boldmathI}}_{{\text{\boldmath\beta}}{\sigma}^{2}}=0, the transformation parameter is not orthogonal to the others. The expectations appeared in the Fisher matrix is not analytically tractable, but it can be easily estimated by replacing the expectation with its sample counterpart. In the case that is multidimensional, the extension of Theorem 1 is straightforward. The expressions of and could be analytically complicated and require tedious algebraic calculations. In such a case, the numerical derivative would be useful since we need to compute only the point values of the derivatives.
Empirical Bayes Confidence Intervals
Asymptotically valid confidence intervals
Measuring the variability of the transformed empirical best predictor is an important issue in practice. Traditionally, the mean squared error (MSE) of has been used, and several methods ranging from analytical method (Prasad and Rao, 1990) to numerical methods (Hall and Maiti, 2006) have been considered. On the other hand, an empirical Bayes confidence interval of is more preferable since it can provide distributional information than MSE although construction of the confidence interval is generally difficult. Here, we derive an asymptotically valid empirical Bayes confidence interval of .
The key to derivation of the confidence interval is the conditional distribution of given . Noting that for , it follows that
[TABLE]
namely, the each component has the expression
[TABLE]
where and are mutually independent standard normal random variables, and and are defined in (6). Then the posterior distribution of can be expressed as
[TABLE]
which is a complex function of standard normal random variables and . However, random samples from the conditional distribution (12) can be easily simulated.
We define Q_{a}(y_{i},{\text{\boldmath\phi}}) as the lower quantile point of the posterior distribution of with the true , which satisfies {\rm P}(\mu_{i}\leq Q_{a}(y_{i},{\text{\boldmath\phi}})|y_{i})=a. Hence, the Bayes confidence interval of with nominal level is obtained as I_{\alpha}=(Q_{\alpha/2}(y_{i},{\text{\boldmath\phi}}),Q_{1-\alpha/2}(y_{i},{\text{\boldmath\phi}})), which holds that . However, the interval depends on the unknown parameter , so that the feasible version of is obtained by replacing with its estimator {\widehat{\text{\boldmath\phi}}}, namely
[TABLE]
which we call naive empirical Bayes confidence interval of . The two quantiles appeared in (13) can be computed by generating a large number of random samples from the conditional distribution (12). Owing to the asymptotic properties of {\widehat{\text{\boldmath\phi}}}, the coverage probability of the naive interval (13) converges to the nominal level as the number of areas tends to infinity as shown in the following theorem proved in Appendix.
Theorem 2**.**
Under Assumptions 1 and 2, it holds .
Bootstrap calibrated intervals
As shown in Theorem 2, the coverage error of the naive interval (13) is of order , which is not necessarily negligible when is not sufficiently large. Since the number of is usually moderate in practice, the calibrated intervals with higher accuracy would be valuable. Following Chatterjee, et al. (2008), Hall and Maiti (2006), we construct a second order corrected empirical Bayes confidence interval satisfying .
To begin with, we define the bootstrap estimator of the coverage probability of the naive interval. Let be the parametric bootstrap samples generated from the estimated model (5) with {\text{\boldmath\phi}}={\widehat{\text{\boldmath\phi}}}, and . Moreover, let be the bootstrap version of based on ’s. Since the coverage probability is {\rm P}(Q_{a/2}(y_{i},{\widehat{\text{\boldmath\phi}}})\leq\mu_{i}\leq Q_{1-a/2}(y_{i},{\widehat{\text{\boldmath\phi}}})), its parametric bootstrap estimator can be defined as
[TABLE]
where the expectation is taken with respect to the bootstrap samples ’s. Based on the coverage probability, we define the calibrated nominal level as the solution of the equation , which can be solved by the bisectional method (Brent, 1973). Then, the calibrated interval is given by
[TABLE]
which has second order accuracy as shown in the following theorem proved in Appendix.
Theorem 3**.**
Under Assumptions 1 and 2, it holds .
Numerical Studies
Evaluation of prediction errors
We first evaluate the prediction errors of the proposed predictors together with some existing methods. To this end, we considered the following data generating processes:
[TABLE]
where , , , , , , and and were generated from Bernoulli distributions with probabilities and , respectively. Based on the above models, we considered seven scenarios of data generating processes as summarized in Table 1. In this study, we focus on estimating the following parameters:
[TABLE]
where is defined as times median of ’s. Note that is known as FGT poverty measures (Foster et al., 1984). We considered two cases of , (poverty rate) and (poverty gap). We set and divided areas into five groups with equal number of areas, and we set the same numbers of sampled units within the same groups. The group pattern of was . For the size of units , we considered two cases, and , to check sensitivity of ratios .
Among the generated , we used first observations as the sampled data. Then, based on the sampled data ’s and covariates , we predict based on the proposed adaptively transformed empirical best prediction (ATP) method with DP transformation (10) and the transformed empirical best prediction (TP) (Molina and Rao, 2010) with log-transformation. We used 1000 Monte Carlo samples in applying these two methods. As a competitor from other model-based methods, we employed the M-quantile method (Chambers and Tzavidis, 2006). We fitted the M-quantile model to the sampled data in the same way as in Chambers and Tzavidis (2006), and modified the distribution function estimator given in equation (5) in Chambers and Tzavidis (2006) to an estimator of by replacing the indicator function with . Moreover, we computed the following direct estimator (DE): . Note that the proposed model is correctly specified in Scenarios (s1)(s3), namely, there exists the true transformation parameter such that the transformed variable achieves normality. On the other hand, in Scenarios (s4)(s7), the proposed model is misspecified in the sense that there is no true transformation parameter that achieves normality
To compare the performances of the four methods, we computed mean squared error (MSE) defined as
[TABLE]
with , where and are the estimated and true values of , respectively, in the th iteration. The obtained values of MSEs are averaged within the same groups and the results are reported in Tables 2 and 3. From these tables, we can observe that the proposed method provides better estimates than the three existing methods in almost all cases, and there are not much differences between the two cases of . In scenario (s1), the performance between ATP and TP are almost the same while the ATP method is overfitting while the log-transformed model is correctly specified. Meanwhile, in the other scenarios, the ATP method can improve the estimation accuracy of TP method as well as MQ and DE methods, by adaptively estimating the transformation parameter from the data even when the model assumption in the TP method is violated. In Supplementary Material, we provide additional simulation results (e.g. relative bias and coefficient of variations).
Performance of empirical Bayes confidence intervals
We next evaluate the finite sample performance of the empirical Bayes confidence intervals given in Section 3. To this end, we adopted the same settings as used in scenario (s2) in Section 4.1 and focused on the same population parameters given in (15) with and . For constructing confidence intervals of , we employed two methods: bootstrap calibrated confidence interval (14) as well as the naive confidence interval (13), which are denoted by BCI and NCI, respectively. We used 200 Monte Carlo samples and 100 bootstrap replications in applying these two methods. Note that theoretical coverage accuracy of BCI and NCI is and , where in this study.
To evaluate the performances of the two confidence intervals, based on simulation runs, we computed the following empirical coverage probability (CP) and average length of confidence interval (AL):
[TABLE]
where and are the true value and the confidence interval in the th iteration. We averaged CP’s and AL’s within the same groups, and reported the results in Table 4. Table 4 shows that NCI tends to produce shorter confidence intervals and the coverage probability is smaller than the nominal level for all areas. On the other hand, BCI produces more accurate confidence intervals than NCI, which would support the theoretical property given in Theorem 3. Since underestimation of risk estimates may yield serious problems in practice, BCI would be appealing when the number of areas is not large.
Example: poverty mapping in Spain
We applied the proposed method to estimation of poverty indicators in Spanish provinces, using the synthetic income data available in sae package (Molina and Marhuenda, 2015) in R language. The similar data set was used in Molina and Rao (2010) and Molina et al. (2014). Such data are available for areas and the sample sizes (the number of observed units) range from to , so that there are no out-of-sample areas. The total number of samples units is . The welfare variable for the individuals is the equivalized annual net income denoted by , noting that the small portions of take negative values. The median of is about and area-wise medians of range about from to , so that the area-wise distributions would be quite different. As auxiliary variables, we considered the indicators of the four groupings of ages (16-24, 25-49, 50-64 and 65), the indicator of having Spanish nationality, the indicators of education levels (primary education and post-secondary education), and the indicators of two employment categories (employed, unemployed). An intercept term is also included in our model.
Let {\text{\boldmathx}}_{ij} be the vector of auxiliary variables including an intercept term. We consider the four models with different families of transformations, SDP transformation, SDP transformation with known shift (SDP-s), SS transformation and shifted log-transformation (SL), which are described as
[TABLE]
where , and . By maximizing the profile likelihood function of transformation parameters, we obtained the following estimates:
[TABLE]
where the values in the parentheses are the corresponding standard errors calculated from the Fisher information matrix given in Theorem 1. From the above result, it can be observed that the approximate confidence intervals of the transformation parameter in SDP and SDP-s are bounded from [math], which means that the log-transformed model would be inappropriate.
For comparing the four models, we calculated AIC and BIC given in the end of Section 2.3. We reported the values in Table 5, which shows that both AIC and BIC values of the SDP model are significantly smaller than those of the other models. Moreover, to see the adequacy of normality assumption of and in the models (16), we computed
[TABLE]
and their QQ-plots are shown in Figures 1 and 2. Although the normality of seems plausible in all the four models from Figure 1, the normality of in the SL model would not be appropriate from Figure 2.
Finally, we calculated the estimates of poverty indicators based on FGT poverty measures given in (15), where we set as the times the median of ’s. In particular, we estimated the poverty rate () and poverty gap (). Since the auxiliary variables of non-sampled units in five provinces are available, we computed the estimates of poverty indicators of the provinces based on the four models in (16) with 100 Monte Carlo samples as well as the direct estimator (DE). Moreover, we computed confidence intervals (CI) of the poverty indicators based on 100 bootstrap samples. The results are given in Table 6. It is observed that the four model-based estimates are very different from DE even in the provinces with relatively large number of sampled units (e.g. Sevilla). Moreover, the SDP and SDP-s models provide quite similar estimates while the SL model provide relatively different estimates from the others. However, based on AIC and BIC values and QQ-plot in Figure 2, the validity of SL method would be doubtful, so that the estimates given in Table 6 would not be reliable.
Conclusions and Discussion
We have introduced the use of the parametric family of transformations for estimating (predicting) general area-specific parameters based on the mixed effects models. We have provided the best predictor of the parameter as well as the maximum likelihood method for estimating model parameters. Moreover, for measuring variability of the predictor, we constructed the mpirical Bayes confidence interval of the area parameter. The simulation and empirical studies have revealed that the use of parametric transformations would improve the prediction accuracy of the existing method using specified transformations.
As demonstrated in the simulation studies, the proposed method using the parametric family of transformations outperformed the prediction method using specified transformations when the specified transformation is not true. Hence, the proposed method would be promising and recommended as an alternative tool for prediction methods with specified transformations. However, when the estimated transformation is close to a well-known one, we may not necessarily employ the proposed method and it would be recommended to simply use the well-known transformation. For example, when the estimate of in the DP transformation is close to [math], it would be better to simply use the log-transformation.
In this paper, we developed the methodology under the situation where the random effects ’s are mutually independent, following Molina and Rao (2010). However, ’s might be spatially correlated in some cases and several works have been focused on introducing spatial correlations in small area estimation (e.g. Pratesi and Salvati, 2009; Schmit et al., 2016). The detailed discussion introducing spatial correlation in this context is left to a future work.
Although we considered an empirical Bayes approach in this paper, the hierarchical Bayes approach as considered in Molina et al. (2014), by assigning some prior distributions for model parameters, would be useful as well. Moreover, it would be interesting to consider the use of the penalized spline method for modeling the regression part (e.g. Opsomer et al., 2008) or a semiparametric transformation approach (e.g. Nesser, et al., 1996) rather than the full parametric approach, which would achieve more flexible modeling whereas both would be computationally burdensome. The detailed investigation are left to valuable future studies.
Acknowledgement The authors are supported by Grant-in-Aid for Scientific Research (18K12757, 15H01943 and 26330036) from Japan Society for the Promotion of Science.
Appendix
A1. Proof of Theorem 1. From the likelihood function (8), the derivatives are given by
[TABLE]
where {\text{\boldmathz}}_{i}=H_{{\lambda}}(y_{i})-{\text{\boldmathX}}_{i}{\text{\boldmath\beta}}. Since {\rm E}[{\text{\boldmathz}}_{i}]={\text{\boldmath0}}, it follows that {\rm E}[\partial^{2}L/\partial{\text{\boldmath\beta}}\partial\tau^{2}]={\rm E}[\partial^{2}L/\partial{\text{\boldmath\beta}}\partial{\sigma}^{2}]={\text{\boldmath0}}. The other elements of the Fisher information can be obtained by a straightforward calculation. Moreover, under Assumptions 1 and 2, each element of the Fisher information matrix is finite, so that the asymptotic normality of {\widehat{\text{\boldmath\phi}}} follows.
A2. Proof of Theorem 2. Let {\text{\boldmath\phi}}_{0} be the true values of parameters. It suffices to show that P(\mu_{i}\leq Q_{a}(y_{i},{\widehat{\text{\boldmath\phi}}}))=a+O(m^{-1}) for . We first note that
[TABLE]
where F(\cdot;y_{i},{\text{\boldmath\phi}}_{0}) is a distribution function of given . Let G(y_{i},{\widehat{\text{\boldmath\phi}}},{\text{\boldmath\phi}}_{0})=F(Q_{a}(y_{i},{\widehat{\text{\boldmath\phi}}});y_{i},{\text{\boldmath\phi}}_{0}), noting that 0\leq G(y_{i},{\widehat{\text{\boldmath\phi}}},{\text{\boldmath\phi}}_{0})\leq 1 and G(y_{i},{\text{\boldmath\phi}}_{0},{\text{\boldmath\phi}}_{0})=a. The Taylor expansion of G(y_{i},{\widehat{\text{\boldmath\phi}}},{\text{\boldmath\phi}}_{0}) shows that
[TABLE]
where {\text{\boldmath\phi}}^{\ast} is on the line connecting {\widehat{\text{\boldmath\phi}}} and {\text{\boldmath\phi}}_{0}. Then, it follows that
[TABLE]
where
[TABLE]
Using the Cauchy-Schwarz inequality, we have
[TABLE]
From the asymptotic normality of {\widehat{\text{\boldmath\phi}}} given in Theorem 1, it holds that . Moreover, since 0\leq G(y_{i},{\text{\boldmath\phi}},{\text{\boldmath\phi}}_{0})\leq 1 and {\text{\boldmath\phi}}_{0} is an interior point, it holds |G(y_{i},{\text{\boldmath\phi}}_{1},{\text{\boldmath\phi}}_{0})-G(y_{i},{\text{\boldmath\phi}}_{2},{\text{\boldmath\phi}}_{0})|\leq 2 for all {\text{\boldmath\phi}}_{1},{\text{\boldmath\phi}}_{2}\in N_{{\text{\boldmath\phi}}_{0}} with N_{{\text{\boldmath\phi}}_{0}}=\{{\text{\boldmath\phi}};\|{\text{\boldmath\phi}}-{\text{\boldmath\phi}}_{0}\|\leq{\varepsilon}\}, thereby the partial derivatives of G(y_{i},{\text{\boldmath\phi}},{\text{\boldmath\phi}}_{0}) at {\text{\boldmath\phi}}={\text{\boldmath\phi}}_{0} are bounded. Then, we obtain . Using a similar evaluation, we can show that . Regarding , it is noted that
[TABLE]
From Lohr and Rao (2009), it holds {\rm E}[{\widehat{\text{\boldmath\phi}}}-{\text{\boldmath\phi}}_{0}|y_{i}]=m^{-1}{\text{\boldmathb}}_{{\text{\boldmath\phi}}}-{\text{\boldmathI}}_{{\text{\boldmath\phi}}}^{-1}\partial L_{i}(y_{i},{\text{\boldmath\phi}}_{0})/\partial{\text{\boldmath\phi}}+o_{p}(m^{-1}), where \sum_{i=1}^{m}L_{i}(y_{i},{\text{\boldmath\phi}}_{0})\equiv L({\text{\boldmath\phi}}) and b_{{\text{\boldmath\phi}}}=\lim_{m\to\infty}m{\rm E}[{\widehat{\text{\boldmath\phi}}}-{\text{\boldmath\phi}}_{0}] is the asymptotic bias of {\widehat{\text{\boldmath\phi}}}. Hence, we have
[TABLE]
which is . Therefore, the proof is completed.
A3. Proof of Theorem 3. From the proof of Theorem 2, we have
[TABLE]
where c(a,{\text{\boldmath\phi}}) is a smooth function of and . Take and so that they satisfy F_{a^{\ast}}({\text{\boldmath\phi}}_{0})=a and F_{\widehat{a}^{\ast}}({\widehat{\text{\boldmath\phi}}})=a, respectively. Then, it holds since {\widehat{\text{\boldmath\phi}}}-{\text{\boldmath\phi}}=o_{p}(1). From the above expansion, we have
[TABLE]
so that . Hence, it follows that
[TABLE]
which completes the proof.
A4. Checking assumptions of transformations. We here check the assumption 3 in Assumption 1 for the dual power (DP) transformation (10) and sinh-arcsinh (SS) transformation (11).
(DP transformation) We first note that as . By putting for , we have
[TABLE]
as . A straightforward calculation shows that
[TABLE]
thereby, it follows that
[TABLE]
as . Moreover, since
[TABLE]
a similar evaluation leads to \big{|}\partial^{2}H_{{\lambda}}(w)/\partial{\lambda}^{2}\big{|}=O(|x|(\log|x|)^{2}) as . Regarding , it holds that
[TABLE]
as , so that the DP transformation satisfies the assumption. When the location parameter is used, namely, , it is noted that , so that the quite similar evaluation shows that the shifted-DP transformation also satisfies the assumption.
(SS transformation) It follows that
[TABLE]
Note that as , so that . Then, we have
[TABLE]
as . Moreover, it holds that
[TABLE]
thereby a similar evaluation shows that , and as . On the other hand, a straightforward calculation shows that
[TABLE]
which are bounded by the function and , respectively. It is not difficult to show that the second partial derivatives of are bounded by polynomial functions of the second partial derivatives of and , thereby the assumption is satisfied.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1]
- 2[2] Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association , 83 , 28-36.
- 3[3]
- 4[4] Brent, R. (1973). Algorithms for Minimization without Derivatives. Englewood Cliffs N.J. Prentice-Hall.
- 5[5]
- 6[6] Box, G.E.P. and Cox, D.R. (1964). An analysis of transformation (with discussion). Journal of the Royal Statistical Society: Series B , 26 , 211-252.
- 7[7]
- 8[8] Chambers, R. and Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika , 93 , 255-268.
