Adaptively Transformed Mixed Model Prediction of General Finite   Population Parameters

Shonosuke Sugasawa; Tatsuya Kubokawa

arXiv:1705.04136·stat.ME·June 12, 2018

Adaptively Transformed Mixed Model Prediction of General Finite Population Parameters

Shonosuke Sugasawa, Tatsuya Kubokawa

PDF

Open Access 1 Repo

TL;DR

This paper introduces an adaptive transformation approach for mixed model predictions of finite population parameters, improving estimation accuracy when response data deviate from normality.

Contribution

It proposes a data-driven method to select suitable transformations using a parametric family and profile likelihood, enhancing prediction robustness.

Findings

01

The method performs well in simulations, showing improved accuracy over traditional transformations.

02

Application to income data demonstrates the method's practical utility and identifies limitations of common transformations.

03

Constructs empirical Bayes confidence intervals for population parameters.

Abstract

For estimating area-specific parameters (quantities) in a finite population, a mixed model prediction approach is attractive. However, this approach strongly depends on the normality assumption of the response values although we often encounter a non-normal case in practice. In such a case, transforming observations to make them suitable for normality assumption is a useful tool, but the problem of selecting suitable transformation still remains open. To overcome the difficulty, we here propose a new empirical best predicting method by using a parametric family of transformations to estimate a suitable transformation based on the data. We suggest a simple estimating method for transformation parameters based on the profile likelihood function, which achieves consistency under some conditions on transformation functions. For measuring variability of point prediction, we construct an…

Tables5

Table 1. Table 1: 8 Scenarios of simulation studies.

Scenario	s1	s2	s3	s4	s5	s6	s7
Model	A	A	A	B	B	C	D
$λ$	$0$	$0.25$	$0.50$	$0.25$	$0.50$	-	-

Table 2. Table 2: The group-wise averaged values of mean squared errors (MSE) of poverty rate ( α = 0 ) 𝛼 0 (\alpha=0) for four methods (ATP, LTP, DE and MQ) in 7 scenarios. All the values in the table are multiplied by 1000 1000 1000 .

		$N = 200$				$N = 400$
Scenario	$n$	ATP	LTP	MQ	DE	ATP	LTP	MQ	DE
	10	5.04	5.04	15.08	21.44	5.00	5.02	16.55	17.51
	20	3.33	3.32	12.48	9.14	3.34	3.33	14.09	12.01
s1	30	2.60	2.60	11.65	7.08	2.34	2.33	14.32	8.14
	40	1.76	1.76	9.76	4.62	1.80	1.79	11.44	5.05
	50	1.51	1.51	10.41	3.40	1.37	1.37	12.39	3.28
	10	5.45	5.78	13.90	21.28	5.10	5.37	14.87	17.49
	20	3.09	3.35	10.62	8.32	3.19	3.43	11.31	11.70
s2	30	2.35	2.58	9.75	6.30	2.34	2.53	11.74	8.12
	40	1.73	1.87	8.64	4.35	1.79	2.00	9.84	5.02
	50	1.42	1.62	8.52	3.37	1.43	1.66	10.68	3.31
	10	4.62	5.95	11.37	18.12	4.43	5.83	12.27	15.71
	20	2.90	3.68	9.85	7.84	2.91	4.05	9.92	10.77
s3	30	2.13	3.02	8.10	5.88	2.01	3.02	10.03	6.85
	40	1.56	2.19	7.05	4.13	1.57	2.48	8.08	4.38
	50	1.31	2.18	7.00	3.07	1.30	2.27	8.43	3.09
	10	6.03	6.47	13.72	21.22	5.28	5.61	13.92	16.41
	20	3.39	3.65	10.83	8.26	3.45	3.77	10.74	10.98
s4	30	2.47	2.69	10.03	6.09	2.43	2.71	11.04	8.02
	40	1.78	2.00	8.15	4.49	1.79	2.07	9.72	4.57
	50	1.53	1.74	8.76	3.30	1.53	1.84	9.97	3.25
	10	5.16	6.82	11.39	18.59	4.73	6.63	10.48	14.41
	20	3.17	4.42	8.59	7.66	3.10	4.60	8.94	10.25
s5	30	2.11	3.26	7.39	5.54	2.14	3.49	8.72	6.76
	40	1.52	2.50	6.21	3.93	1.76	2.94	7.47	4.52
	50	1.41	2.47	6.31	3.06	1.37	2.61	7.78	2.90
	10	6.20	6.69	17.21	21.35	6.23	6.83	18.06	18.38
	20	3.88	4.28	13.73	9.16	3.86	4.29	14.48	12.47
s6	30	2.76	3.11	11.96	6.90	2.92	3.38	14.75	8.93
	40	2.07	2.33	10.42	4.62	2.10	2.42	12.16	5.11
	50	1.68	1.96	10.56	3.52	1.69	2.03	13.02	3.52
	10	5.74	8.45	13.91	19.94	6.03	8.99	14.63	17.19
	20	3.83	5.82	11.71	8.80	3.89	6.48	12.19	11.83
s7	30	2.82	4.82	9.55	6.46	2.93	5.47	11.00	8.38
	40	2.19	3.95	8.23	4.48	2.30	4.53	9.27	4.99
	50	2.06	4.02	8.77	3.55	1.98	4.27	10.46	3.55

Table 3. Table 3: The group-wise averaged values of mean squared errors (MSE) of poverty rate ( α = 1 ) 𝛼 1 (\alpha=1) for four methods (ATP, LTP, DE and MQ) in 7 scenarios. All the values in the table are multiplied by 1000 1000 1000 .

		$N = 200$				$N = 400$
Scenario	$n$	ATP	LTP	MQ	DE	ATP	LTP	MQ	DE
	10	1.86	1.86	4.16	6.52	1.83	1.83	4.36	4.76
	20	1.17	1.17	3.25	2.27	1.22	1.21	3.59	3.63
s1	30	0.92	0.92	3.14	2.14	0.83	0.83	3.84	2.39
	40	0.62	0.62	2.36	1.43	0.66	0.65	2.96	1.56
	50	0.52	0.52	2.59	1.06	0.50	0.50	3.12	1.05
	10	1.76	1.80	4.05	5.58	1.73	1.77	4.12	4.41
	20	0.94	0.96	2.81	2.04	1.02	1.03	3.11	3.02
s2	30	0.77	0.80	2.79	1.79	0.74	0.76	3.28	2.10
	40	0.55	0.56	2.28	1.26	0.57	0.58	2.57	1.39
	50	0.44	0.45	2.19	0.98	0.45	0.46	2.80	0.91
	10	1.20	1.27	3.26	3.95	1.20	1.31	3.39	3.36
	20	0.73	0.80	2.64	1.53	0.77	0.83	2.90	2.33
s3	30	0.55	0.59	2.47	1.26	0.51	0.56	2.80	1.47
	40	0.40	0.44	1.98	0.98	0.43	0.48	2.45	1.01
	50	0.34	0.37	2.00	0.78	0.33	0.36	2.46	0.68
	10	1.92	1.96	3.49	5.67	1.67	1.70	3.49	4.05
	20	0.97	0.97	2.45	1.93	1.03	1.02	2.42	2.82
s4	30	0.76	0.76	2.41	1.67	0.73	0.73	2.63	1.93
	40	0.55	0.54	1.81	1.23	0.59	0.59	2.25	1.31
	50	0.46	0.46	1.95	0.92	0.46	0.46	2.33	0.90
	10	1.28	1.39	2.97	4.08	1.19	1.26	2.67	2.98
	20	0.74	0.76	2.13	1.50	0.77	0.82	2.40	2.35
s5	30	0.51	0.54	1.93	1.21	0.54	0.54	2.30	1.45
	40	0.39	0.40	1.54	0.90	0.45	0.47	2.06	1.03
	50	0.36	0.38	1.62	0.75	0.35	0.37	2.13	0.68
	10	2.61	2.64	6.77	7.63	2.65	2.73	6.81	6.66
	20	1.57	1.58	5.11	3.30	1.59	1.59	5.37	4.44
s6	30	1.10	1.12	4.78	2.52	1.24	1.26	5.67	3.07
	40	0.84	0.84	3.98	1.68	0.95	0.94	4.70	1.89
	50	0.69	0.70	3.93	1.35	0.74	0.74	5.03	1.32
	10	2.45	2.67	5.47	7.32	2.57	2.82	5.55	5.99
	20	1.54	1.67	4.20	2.86	1.66	1.86	4.71	4.42
s7	30	1.12	1.26	3.82	2.37	1.19	1.36	4.30	3.02
	40	0.88	0.97	2.99	1.74	0.97	1.09	3.48	1.87
	50	0.77	0.90	3.06	1.37	0.78	0.92	3.89	1.20

Table 4. Table 4: The group-wise averaged values of coverage probability (CP) and average length (AL) of 95 % percent 95 95\% confidence intervals based on the naive method (NCI) and the bootstrap method (BCI).

	$(N, α)$	$(200, 0)$		$(200, 1)$		$(400, 0)$		$(400, 1)$
	$n$	BCI	NCI	BCI	NCI	BCI	NCI	BCI	NCI
	10	0.958	0.944	0.951	0.936	0.949	0.938	0.951	0.935
	20	0.946	0.930	0.944	0.933	0.952	0.932	0.946	0.925
CP	30	0.951	0.935	0.952	0.936	0.951	0.941	0.952	0.936
	40	0.947	0.933	0.956	0.936	0.953	0.942	0.954	0.939
	50	0.951	0.934	0.954	0.936	0.955	0.942	0.957	0.942
	10	0.281	0.264	0.155	0.145	0.272	0.256	0.150	0.140
	20	0.218	0.205	0.118	0.111	0.221	0.208	0.123	0.115
AL	30	0.187	0.175	0.101	0.095	0.185	0.174	0.102	0.096
	40	0.166	0.155	0.088	0.083	0.164	0.154	0.090	0.084
	50	0.151	0.142	0.081	0.076	0.145	0.136	0.079	0.074

Table 5. Table 5: AIC and BIC of the four models.

		SDP	SDP-s	SS	SL
AIC		347778.2	347840.0	348123.8	348870.7
BIC		347886.7	347940.8	348232.4	348963.8

Equations109

μ_{i} = \frac{1}{N _{i}} j = 1 \sum N_{i} T (Y_{ij}),

μ_{i} = \frac{1}{N _{i}} j = 1 \sum N_{i} T (Y_{ij}),

Y_{ij} = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, j = 1, \dots, N_{i}, i = 1, \dots, m,

Y_{ij} = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, j = 1, \dots, N_{i}, i = 1, \dots, m,

Y_{ij}|y_{s}\sim N\bigg{(}{\text{\boldmath$x$}}_{ij}^{t}{\text{\boldmath$\beta$}}+\frac{n_{i}\tau^{2}}{{\sigma}^{2}+n_{i}\tau^{2}}({\bar{y}}_{i}-{\bar{{\text{\boldmath$x$}}}}_{i}^{t}{\text{\boldmath$\beta$}}),\ \frac{{\sigma}^{2}\tau^{2}}{{\sigma}^{2}+n_{i}\tau^{2}}\bigg{)},\ \ \ j=n_{i}+1,\ldots,N_{i},

Y_{ij}|y_{s}\sim N\bigg{(}{\text{\boldmath$x$}}_{ij}^{t}{\text{\boldmath$\beta$}}+\frac{n_{i}\tau^{2}}{{\sigma}^{2}+n_{i}\tau^{2}}({\bar{y}}_{i}-{\bar{{\text{\boldmath$x$}}}}_{i}^{t}{\text{\boldmath$\beta$}}),\ \frac{{\sigma}^{2}\tau^{2}}{{\sigma}^{2}+n_{i}\tau^{2}}\bigg{)},\ \ \ j=n_{i}+1,\ldots,N_{i},

{\widetilde{\mu}}_{i}\equiv{\rm E}[\mu_{i}|y_{s}]=\frac{1}{N_{i}}\bigg{\{}\sum_{j=1}^{n_{i}}T(y_{ij})+\sum_{j=n_{i}+1}^{N_{i}}{\rm E}[T(Y_{ij})|y_{s}]\bigg{\}}.

{\widetilde{\mu}}_{i}\equiv{\rm E}[\mu_{i}|y_{s}]=\frac{1}{N_{i}}\bigg{\{}\sum_{j=1}^{n_{i}}T(y_{ij})+\sum_{j=n_{i}+1}^{N_{i}}{\rm E}[T(Y_{ij})|y_{s}]\bigg{\}}.

H_{λ} (Y_{ij}) = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, j = 1, \dots, N_{i}, i = 1, \dots, m,

H_{λ} (Y_{ij}) = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, j = 1, \dots, N_{i}, i = 1, \dots, m,

θ_{ij} = \boldmath x_{ij}^{t} \boldmath β + \frac{τ ^{2}}{σ ^{2} + n _{i} τ ^{2}} j = 1 \sum n_{i} (H_{λ} (y_{ij}) - \boldmath x_{ij}^{t} \boldmath β), s_{i} = \frac{σ ^{2} τ ^{2}}{σ ^{2} + n _{i} τ ^{2}} .

θ_{ij} = \boldmath x_{ij}^{t} \boldmath β + \frac{τ ^{2}}{σ ^{2} + n _{i} τ ^{2}} j = 1 \sum n_{i} (H_{λ} (y_{ij}) - \boldmath x_{ij}^{t} \boldmath β), s_{i} = \frac{σ ^{2} τ ^{2}}{σ ^{2} + n _{i} τ ^{2}} .

μ_{i} (y_{s}) \equiv E [μ_{i} ∣ y_{s}] = \frac{1}{N _{i}} {j = 1 \sum n_{i} T (y_{ij}) + j = n_{i} + 1 \sum N_{i} E [T \circ H_{λ}^{- 1} (u_{ij})]},

μ_{i} (y_{s}) \equiv E [μ_{i} ∣ y_{s}] = \frac{1}{N _{i}} {j = 1 \sum n_{i} T (y_{ij}) + j = n_{i} + 1 \sum N_{i} E [T \circ H_{λ}^{- 1} (u_{ij})]},

L (\boldmath ϕ) = - \frac{1}{2} i = 1 \sum m lo g ∣ \boldmath Σ_{i} ∣ - \frac{1}{2} i = 1 \sum m {H_{λ} (y_{i}) - \boldmath X_{i} \boldmath β}^{t} \boldmath Σ_{i}^{- 1} {H_{λ} (y_{i}) - \boldmath X_{i} \boldmath β} - \frac{1}{2} i = 1 \sum m n_{i} lo g 2 π + i = 1 \sum m j = 1 \sum n_{i} lo g H_{λ}^{'} (y_{ij}),

L (\boldmath ϕ) = - \frac{1}{2} i = 1 \sum m lo g ∣ \boldmath Σ_{i} ∣ - \frac{1}{2} i = 1 \sum m {H_{λ} (y_{i}) - \boldmath X_{i} \boldmath β}^{t} \boldmath Σ_{i}^{- 1} {H_{λ} (y_{i}) - \boldmath X_{i} \boldmath β} - \frac{1}{2} i = 1 \sum m n_{i} lo g 2 π + i = 1 \sum m j = 1 \sum n_{i} lo g H_{λ}^{'} (y_{ij}),

PL (λ) = ML (λ) + i = 1 \sum m j = 1 \sum n_{i} lo g H_{λ}^{'} (y_{ij}),

PL (λ) = ML (λ) + i = 1 \sum m j = 1 \sum n_{i} lo g H_{λ}^{'} (y_{ij}),

H_{λ}^{DP} (x) = \frac{x ^{λ} - x ^{- λ}}{2 λ}, x > 0, λ > 0,

H_{λ}^{DP} (x) = \frac{x ^{λ} - x ^{- λ}}{2 λ}, x > 0, λ > 0,

H_{{\lambda}}^{\text{DP}(-1)}(x)=\Big{(}{\lambda}x+\sqrt{1+{\lambda}^{2}x^{2}}\Big{)}^{1/{\lambda}}\ \ \ \ \text{and}\ \ \ \ \frac{dH_{{\lambda}}^{\rm DP}(x)}{dx}=\frac{1}{2}(x^{{\lambda}-1}+x^{-{\lambda}-1}).

H_{{\lambda}}^{\text{DP}(-1)}(x)=\Big{(}{\lambda}x+\sqrt{1+{\lambda}^{2}x^{2}}\Big{)}^{1/{\lambda}}\ \ \ \ \text{and}\ \ \ \ \frac{dH_{{\lambda}}^{\rm DP}(x)}{dx}=\frac{1}{2}(x^{{\lambda}-1}+x^{-{\lambda}-1}).

H_{a, b}^{SS} (x) = sinh (b sinh^{- 1} (x) - a), x \in (- \infty, \infty), a \in (- \infty, \infty), b \in (0, \infty),

H_{a, b}^{SS} (x) = sinh (b sinh^{- 1} (x) - a), x \in (- \infty, \infty), a \in (- \infty, \infty), b \in (0, \infty),

H_{a, b}^{SS (- 1)} (x) = sinh (b^{- 1} sinh^{- 1} (x) + a), and \frac{d H _{a, b}^{SS} ( x )}{d x} = b \frac{1 + H _{a, b}^{SS} ( x ) ^{2}}{1 + x ^{2}} .

H_{a, b}^{SS (- 1)} (x) = sinh (b^{- 1} sinh^{- 1} (x) + a), and \frac{d H _{a, b}^{SS} ( x )}{d x} = b \frac{1 + H _{a, b}^{SS} ( x ) ^{2}}{1 + x ^{2}} .

I_{τ^{2} τ^{2}} = \frac{1}{2} i = 1 \sum m (\boldmath 1_{n_{i}}^{t} \boldmath Σ_{i}^{- 1} \boldmath 1_{n_{i}})^{2}, I_{τ^{2} σ^{2}} = \frac{1}{2} i = 1 \sum m \boldmath 1_{n_{i}}^{t} \boldmath Σ_{i}^{- 2} \boldmath 1_{n_{i}}, I_{σ^{2} σ^{2}} = \frac{1}{2} i = 1 \sum m tr (\boldmath Σ_{i}^{- 2}),

I_{τ^{2} τ^{2}} = \frac{1}{2} i = 1 \sum m (\boldmath 1_{n_{i}}^{t} \boldmath Σ_{i}^{- 1} \boldmath 1_{n_{i}})^{2}, I_{τ^{2} σ^{2}} = \frac{1}{2} i = 1 \sum m \boldmath 1_{n_{i}}^{t} \boldmath Σ_{i}^{- 2} \boldmath 1_{n_{i}}, I_{σ^{2} σ^{2}} = \frac{1}{2} i = 1 \sum m tr (\boldmath Σ_{i}^{- 2}),

\boldmath I_{\boldmath β \boldmath β} = i = 1 \sum m \boldmath X_{i}^{t} \boldmath Σ_{i}^{- 1} \boldmath X_{i}, \boldmath I_{\boldmath β τ^{2}} = \boldmath I_{\boldmath β σ^{2}} = 0, I_{λ σ^{2}} = - i = 1 \sum m E [\boldmath z_{i}^{t} \boldmath Σ_{i}^{- 2} H_{λ}^{(1)} (y_{i})],

\boldmath I_{λ \boldmath β} = - i = 1 \sum m \boldmath X_{i}^{t} \boldmath Σ_{i}^{- 1} E [H_{λ}^{(1)} (y_{i})], I_{λ τ^{2}} = - i = 1 \sum m E [\boldmath z_{i}^{t} \boldmath Σ_{i}^{- 1} \boldmath 1_{n_{i}} \boldmath 1_{n_{i}}^{t} \boldmath Σ_{i}^{- 1} H_{λ}^{(1)} (y_{i})],

I_{λ λ} = i = 1 \sum m E [H_{λ}^{(1)} (y_{i})^{t} \boldmath Σ_{i}^{- 1} H_{λ}^{(1)} (y_{i})] + i = 1 \sum m E [\boldmath z_{i}^{t} \boldmath Σ_{i}^{- 1} H_{λ}^{(2)} (y_{i})] - i = 1 \sum m j = 1 \sum n_{i} E [\frac{\partial ^{2}}{\partial λ ^{2}} lo g H_{λ}^{'} (y_{ij})],

(H_{λ} (Y_{i, n_{i} + 1}), \dots, H_{λ} (Y_{i N_{i}}))^{t} ∣ y_{i} \sim N ((θ_{i, n_{i} + 1}, \dots, θ_{i N_{i}})^{t}, s_{i}^{2} \boldmath 1_{N_{i} - n_{i}} \boldmath 1_{N_{i} - n_{i}}^{t} + σ^{2} \boldmath I_{N_{i} - n_{i}}),

(H_{λ} (Y_{i, n_{i} + 1}), \dots, H_{λ} (Y_{i N_{i}}))^{t} ∣ y_{i} \sim N ((θ_{i, n_{i} + 1}, \dots, θ_{i N_{i}})^{t}, s_{i}^{2} \boldmath 1_{N_{i} - n_{i}} \boldmath 1_{N_{i} - n_{i}}^{t} + σ^{2} \boldmath I_{N_{i} - n_{i}}),

H_{λ} (Y_{ij}) ∣ y_{i} = θ_{ij} + s_{i} z_{i} + σ w_{ij}, j = n_{i} + 1, \dots, N_{i},

H_{λ} (Y_{ij}) ∣ y_{i} = θ_{ij} + s_{i} z_{i} + σ w_{ij}, j = n_{i} + 1, \dots, N_{i},

μ_{i} ∣ y_{i} = d \frac{1}{N _{i}} {j = 1 \sum n_{i} T (y_{ij}) + j = n_{i} + 1 \sum N_{i} T \circ H_{λ}^{- 1} (θ_{ij} + s_{i} z_{i} + σ w_{ij})},

μ_{i} ∣ y_{i} = d \frac{1}{N _{i}} {j = 1 \sum n_{i} T (y_{ij}) + j = n_{i} + 1 \sum N_{i} T \circ H_{λ}^{- 1} (θ_{ij} + s_{i} z_{i} + σ w_{ij})},

I_{α}^{N} = (Q_{α /2} (y_{i}, \boldmath ϕ), Q_{1 - α /2} (y_{i}, \boldmath ϕ)),

I_{α}^{N} = (Q_{α /2} (y_{i}, \boldmath ϕ), Q_{1 - α /2} (y_{i}, \boldmath ϕ)),

CP (a) = E^{*} [I {Q_{a /2} (y_{i}^{*}, \boldmath ϕ^{*}) \leq μ_{i}^{*} \leq Q_{1 - a /2} (y_{i}^{*}, \boldmath ϕ^{*})}],

CP (a) = E^{*} [I {Q_{a /2} (y_{i}^{*}, \boldmath ϕ^{*}) \leq μ_{i}^{*} \leq Q_{1 - a /2} (y_{i}^{*}, \boldmath ϕ^{*})}],

I_{α}^{C} = (Q_{a^{*} /2} (y_{i}, \boldmath ϕ), Q_{1 - a^{*} /2} (y_{i}, \boldmath ϕ)),

I_{α}^{C} = (Q_{a^{*} /2} (y_{i}, \boldmath ϕ), Q_{1 - a^{*} /2} (y_{i}, \boldmath ϕ)),

(A) (2 λ)^{- 1} (Y_{ij}^{λ} - Y_{ij}^{- λ}) = μ_{ij} + v_{i} + ε_{ij}, v_{i} \sim N (0, τ^{2}), ε_{ij} \sim N (0, σ^{2}) (B) (2 λ)^{- 1} (Y_{ij}^{λ} - Y_{ij}^{- λ}) = μ_{ij} + v_{i} + ε_{ij}, v_{i} \sim t_{5} (0, τ^{2}), ε_{ij} \sim t_{5} (0, σ^{2}) (C) Y_{ij} = exp (μ_{ij}) v_{i} ε_{ij}, v_{i} \sim Γ (1/ τ^{2}, 1/ τ^{2}), ε_{ij} \sim Γ (1/ σ^{2}, 1/ σ^{2}) (D) Y_{ij} = 0.2 exp (U_{ij}) + 0.8 U_{ij}^{2}, U_{ij} = μ_{ij} + v_{i} + ε_{ij}, v_{i} \sim N (0, τ^{2}), ε_{ij} \sim N (0, σ^{2}),

(A) (2 λ)^{- 1} (Y_{ij}^{λ} - Y_{ij}^{- λ}) = μ_{ij} + v_{i} + ε_{ij}, v_{i} \sim N (0, τ^{2}), ε_{ij} \sim N (0, σ^{2}) (B) (2 λ)^{- 1} (Y_{ij}^{λ} - Y_{ij}^{- λ}) = μ_{ij} + v_{i} + ε_{ij}, v_{i} \sim t_{5} (0, τ^{2}), ε_{ij} \sim t_{5} (0, σ^{2}) (C) Y_{ij} = exp (μ_{ij}) v_{i} ε_{ij}, v_{i} \sim Γ (1/ τ^{2}, 1/ τ^{2}), ε_{ij} \sim Γ (1/ σ^{2}, 1/ σ^{2}) (D) Y_{ij} = 0.2 exp (U_{ij}) + 0.8 U_{ij}^{2}, U_{ij} = μ_{ij} + v_{i} + ε_{ij}, v_{i} \sim N (0, τ^{2}), ε_{ij} \sim N (0, σ^{2}),

μ_{i} = \frac{1}{N} j = 1 \sum N T_{α} (Y_{ij}), T_{α} (x) = (\frac{z - x}{z})^{α} I (x < z),

μ_{i} = \frac{1}{N} j = 1 \sum N T_{α} (Y_{ij}), T_{α} (x) = (\frac{z - x}{z})^{α} I (x < z),

\text{MSE}_{i}=\frac{1}{R}\sum_{r=1}^{R}\Big{(}{\widehat{\mu}}_{i}^{(r)}-\mu_{i}^{(r)}\Big{)}^{2},

\text{MSE}_{i}=\frac{1}{R}\sum_{r=1}^{R}\Big{(}{\widehat{\mu}}_{i}^{(r)}-\mu_{i}^{(r)}\Big{)}^{2},

CP_{i} = \frac{1}{R} r = 1 \sum R I (μ_{i}^{(r)} \in CI_{i}^{(r)}) and AL_{i} = \frac{1}{R} r = 1 \sum R ∣ CI_{i}^{(r)} ∣,

CP_{i} = \frac{1}{R} r = 1 \sum R I (μ_{i}^{(r)} \in CI_{i}^{(r)}) and AL_{i} = \frac{1}{R} r = 1 \sum R ∣ CI_{i}^{(r)} ∣,

SDP: SDP-s: SS: SL: (2 λ)^{- 1} {(E_{ij} + c)^{λ} - (E_{ij} + c)^{- λ}} = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, (2 λ)^{- 1} {(E_{ij} + c^{*})^{λ} - (E_{ij} + c^{*})^{- λ}} = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, sinh (b sinh^{- 1} (E_{ij}) - a) = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, lo g (E_{ij} + c^{*}) = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij},

SDP: SDP-s: SS: SL: (2 λ)^{- 1} {(E_{ij} + c)^{λ} - (E_{ij} + c)^{- λ}} = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, (2 λ)^{- 1} {(E_{ij} + c^{*})^{λ} - (E_{ij} + c^{*})^{- λ}} = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, sinh (b sinh^{- 1} (E_{ij}) - a) = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij}, lo g (E_{ij} + c^{*}) = \boldmath x_{ij}^{t} \boldmath β + v_{i} + ε_{ij},

(SDP) λ = 0.090 (1.99 \times 1 0^{- 3}), c = 4319 (170.69)

(SDP) λ = 0.090 (1.99 \times 1 0^{- 3}), c = 4319 (170.69)

(SDP-s) λ = 0.290 (8.18 \times 1 0^{- 4})

(SS) a = - 0.584 (6.82 \times 1 0^{- 2}), b = 0.463 (5.62 \times 1 0^{- 3}),

v_{i}^{*} = \frac{τ}{σ ^{2} + n _{i} τ ^{2}} j = 1 \sum n_{i} (H (y_{ij}) - \boldmath x_{ij}^{t} \boldmath β), and ε_{ij}^{*} = σ^{- 1} {H (y_{ij}) - \boldmath x_{ij}^{t} \boldmath β - τ v_{i}^{*}},

v_{i}^{*} = \frac{τ}{σ ^{2} + n _{i} τ ^{2}} j = 1 \sum n_{i} (H (y_{ij}) - \boldmath x_{ij}^{t} \boldmath β), and ε_{ij}^{*} = σ^{- 1} {H (y_{ij}) - \boldmath x_{ij}^{t} \boldmath β - τ v_{i}^{*}},

\frac{\partial L}{\partial \boldmath β}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

sshonosuke/TMMP
none

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsEconomic and Environmental Valuation · demographic modeling and climate adaptation · Income, Poverty, and Inequality

Full text

Adaptively Transformed Mixed Model Prediction of General Finite Population Parameters

SHONOSUKE SUGASAWA

*Center for Spatial Information Science, The University of Tokyo

*TATSUYA KUBOKAWA

*Faculty of Economics, The University of Tokyo

Abstract. For estimating area-specific parameters (quantities) in a finite population, a mixed model prediction approach is attractive. However, this approach strongly depends on the normality assumption of the response values although we often encounter a non-normal case in practice. In such a case, transforming observations to make them suitable for normality assumption is a useful tool, but the problem of selecting suitable transformation still remains open. To overcome the difficulty, we here propose a new empirical best predicting method by using a parametric family of transformations to estimate a suitable transformation based on the data. We suggest a simple estimating method for transformation parameters based on the profile likelihood function, which achieves consistency under some conditions on transformation functions. For measuring variability of point prediction, we construct an empirical Bayes confidence interval of the population parameter of interest. Through simulation studies, we investigate numerical performance of the proposed methods. Finally, we apply the proposed method to synthetic income data in Spanish provinces in which the resulting estimates indicate that the commonly used log-transformation would not be appropriate.

Key words: Confidence interval; Empirical Bayes; Finite population; Mean squared error; Random effect; Small area estimation.

Introduction

The mixed model prediction based on random effect models has been widely used in small area estimation (Rao and Molina, 2015). The random effect models used in small area estimation are mainly divided into two models: the Fay-Herriot model (Fay and Herriot, 1979) and the nested error regression model (Battese et al., 1988). Especially, the nested error regression model has been used for estimating population parameters in a finite population. Here we consider a finite population consisting of $m$ areas and each area has $N_{i}$ units for $i=1,\ldots,m$ . Let $Y_{ij}$ be a characteristic of the $j$ th unit in the $i$ th area, the main purpose is to estimate the area-specific parameter defined as

[TABLE]

where $T(\cdot)$ is a known (user-specified) function. The simplest choice is $T(x)=x$ , in which $\mu_{i}$ corresponds to the finite population mean, and this case has been studied in the literature; Chambers et al. (2014), Jiang and Lahiri (2006), Lahiri and Mukherjee (2007) and Schmit et al. (2016). On the other hand, as noted by Molina and Rao (2010), more complex forms of $T(\cdot)$ are often used in practice. For example, in poverty mapping, we often adopt the FGT poverty measure $T(x)=\{(z-x)/z\}^{\alpha}I(x<z)$ (Foster et al., 1984), where $z$ is a suitable poverty threshold.

If all the units $Y_{ij}$ in the $i$ th area were observed, we could calculate the true value of $\mu_{i}$ . However, only a part of the units are available in practice. Let $n_{i}(<N_{i})$ be the number of sampled units and $y_{s}=\{y_{ij},\ j=1,\dots,n_{i},\ i=1,\ldots,m\}$ be the sampled data. It is known that the direct estimator of $\mu_{i}$ using the observed units has high variability, especially in the case that $n_{i}$ is much smaller than $N_{i}$ . In real application, some covariates associated with $Y_{ij}$ are available not only for sampled but also for non-sampled units, which are denoted by ${\text{\boldmath$ x $}}_{ij}$ with $j=1,\ldots,N_{i}$ and $i=1,\ldots,m$ . Hence, one aims to estimate $\mu_{i}$ based on the sampled data and information on covariates. To this end, a typical strategy is to assume that all the units follow the nested error regression model (Battese et al., 1988):

[TABLE]

where ${\text{\boldmath$ x $}}_{ij}$ and $\beta$ are $p$ -dimensional vectors of covariates and regression coefficients, $v_{i}$ is the area-specific effect which follows $N(0,\tau^{2})$ and ${\varepsilon}_{ij}$ is a sampling error distributed as $N(0,{\sigma}^{2})$ . Note that the model (2) leads to normality assumption of $Y_{ij}$ . Then, the conditional distribution of the non-sampled data $Y_{ij}$ given all the sampled data $y_{s}$ is given by

[TABLE]

which follows from the normality of $Y_{ij}$ under the model (2). Then the best predictor of $\mu_{i}$ under squared error loss is the conditional expectation ${\rm E}[\mu_{i}|y_{s}]$ , which has the form

[TABLE]

Here, the expectation ${\rm E}[T(Y_{ij})|y_{s}]$ could be computed via the Monte Carlo integration by generating a large number of random samples from the conditional distribution (3). Moreover, the best predictor ${\widetilde{\mu}}_{i}$ depends on the unknown model parameters ${\text{\boldmath$ \beta $}},\tau^{2}$ and ${\sigma}^{2}$ in the model (2), so that these parameters should be replaced with their estimated counterparts. To this end, the model parameters in the model (2) would be estimated based on the sampled data $y_{s}$ by using, for example, the maximum likelihood or restricted maximum likelihood methods.

It is observed that the key assumption in deriving the best predictor (4) is the normality of $Y_{ij}$ in the model (2), which enables us to obtain the simple expression of the conditional distribution (3). However, we often encounter the case where the normality assumption is not plausible for $Y_{ij}$ . For instance, in poverty mapping, $Y_{ij}$ would be a welfare variable (e.g. income), so that the distribution of $Y_{ij}$ would be right skewed and would not be normal. In this case, several methods have been considered so far. For instance, Chambers and Tzavidis (2006) proposed the M-quantile models and Sinha and Rao (2009) and Chambers et al. (2014) among others proposed robust methods against outliers under normality assumptions. However, these methods basically aims to estimate only the mean parameters, corresponding $T(x)=x$ in (1), and do not take account of the data characteristics like skewness. On the other hand, Molina and Rao (2010) considered the nested error regression model (2) for the transformed variables $H(Y_{ij})$ instead of $Y_{ij}$ , which would be able to take account of data characteristics. For instance, the log-transformation, $H(x)=\log x$ , would be the most standard approach in small area estimation (e.g. Slud and Haiti, 2006; Molina and Martin, 2017) when $Y_{ij}$ is right skewed as often appeared in welfare variables. However, we can not know how skew the true data distribution is whereas the use of log-transformation forces the amount of skewness in the data distribution. This problem can be regarded as the misspecification of the transformation, under which the predictor of $\mu_{i}$ might be biased or inefficient. In such a case, it would be more preferable to consider a parametric family of transformations and we propose to estimate the transformation based on the data for solving the misspecification as much as possible. This approach would enable us to get better prediction although the problem of violating normality assumption are not necessarily solved due to limitations arising from the use of a parametric family of transformations. We derive a form of the best predictor of $\mu_{i}$ and provide a simple estimating method for transformation parameters based on profile likelihood function, which produces a consistent estimator under some regularity conditions. We also construct an empirical Bayes confidence interval of $\mu_{i}$ for measuring the variability of the point prediction. The proposed intervals are shown to have $O(m^{-1})$ coverage error, and we also suggest the parametric bootstrap calibration for confidence intervals with further accuracy.

This paper is organized as follows: In Section 2, we describe the proposed prediction method as well as parameter estimation of the model parameters. In Section 3, we construct an empirical Bayes confidence interval of $\mu_{i}$ . In Section 4, we present the results from simulation studies and a data application. In Section 5, we give conclusions and some discussions. The technical proofs are given in Appendix.

Adaptively Transformed Mixed Model Prediction

Transformed best predictor

Let $H_{{\lambda}}(\cdot)$ be a family of transformations with parameter ${\lambda}$ . The transformation parameter ${\lambda}$ might be multidimensional, but we treat ${\lambda}$ as a scalar parameter for notational simplicity. The assumptions and specific choices of $H_{{\lambda}}(\cdot)$ will be discussed in the subsequent section. We assume that the transformed variable $H_{{\lambda}}(y_{ij})$ follows the nested error regression model:

[TABLE]

where ${\text{\boldmath$ x $}}_{ij}$ and $\beta$ are $p$ -dimensional vectors of fixed covariates and regression coefficients, $v_{i}$ and ${\varepsilon}_{ij}$ are an area-specific effect and a sampling error, respectively. Here we assume that $v_{i}$ and ${\varepsilon}_{ij}$ are mutually independent and distributed as $v_{i}\sim N(0,\tau^{2})$ and ${\varepsilon}_{ij}\sim N(0,{\sigma}^{2})$ with unknown two variance parameters $\tau^{2}$ and ${\sigma}^{2}$ . It is worth noting that, owing to the area effect $v_{i}$ , the units in the same area are mutually correlated while the units in the different area are independent. Specifically, from (5), it holds ${\rm Cor}(H_{{\lambda}}(Y_{ij}),H_{{\lambda}}(Y_{ik}))=(\tau^{2}+{\sigma}^{2})^{-1}\tau^{2},\ j\neq k$ , thereby the units in the same area are mutually correlated and the degree of correlation is determined by the ratio $\tau^{2}/{\sigma}^{2}$ . The normality assumptions of $v_{i}$ and ${\varepsilon}_{ij}$ leads to the assumption that $H_{{\lambda}}(Y_{ij})\sim N({\text{\boldmath$ x $}}_{ij}^{t}{\text{\boldmath$ \beta $}},\tau^{2}+{\sigma}^{2})$ . Although, there would not necessarily exist the suitable transformation parameter to make the transformed response hold the normality assumption exactly, our theory will be developed under the normality assumption.

In model (5), the response variable $Y_{ij}$ is assumed to follow a class of distributions defined by transforming the normal distribution $N({\text{\boldmath$ x $}}_{ij}^{t}{\text{\boldmath$ \beta $}},\tau^{2}+{\sigma}^{2})$ by $H_{{\lambda}}^{-1}(\cdot)$ , the inverse function of $H_{{\lambda}}(\cdot)$ . For example, the use of log-transformation leads to the assumption that $Y_{ij}$ follows a log-normal distribution. Moreover, the use of a transformation does not change the correlation structure that the units in the same areas are correlated while different areas are kept independent. On the other hand, the mean and variance $Y_{ij}$ could be a complicate function of the mean ${\text{\boldmath$ x $}}_{ij}^{t}{\text{\boldmath$ \beta $}}$ and variance $\tau^{2}+{\sigma}^{2}$ in the transformed scale, so that it could induce some relationships between mean and variance as observed in a log-normal distribution, and the correlations among $Y_{ij}$ ’s in the same areas would not be necessarily constant.

Let $y_{s}=\{y_{ij},\ j=1,\dots,n_{i},\ i=1,\ldots,m\}$ be the sampled data. From the model (5), we have $H_{{\lambda}}(Y_{ij})|y_{s}\sim N({\theta}_{ij},s_{i}^{2}+{\sigma}^{2})$ , $j=n_{i}+1,\ldots,N_{i}$ , where

[TABLE]

Hence, the best predictor of $\mu_{i}$ given in (1) can be obtained as

[TABLE]

where the expectation is taken with respect to $u_{ij}\sim N({\theta}_{ij},s_{i}^{2}+{\sigma}^{2})$ , and $T\circ H_{{\lambda}}^{-1}(\cdot)$ is the composite function of $T(\cdot)$ and $H_{{\lambda}}^{-1}$ , the inverse function of $H_{{\lambda}}(\cdot)$ . Although the expectation ${\rm E}[T\circ H_{{\lambda}}^{-1}(u_{ij})]$ does not have a closed form in general, it can be easily computed via the Monte Carlo integration. We call the best predictor (7) adaptively transformed best predictor (ATP).

Estimation of structural parameters

Let ${\text{\boldmath$ \phi $}}=({\text{\boldmath$ \beta $}}^{t},\tau^{2},{\sigma}^{2},{\lambda})^{t}$ be a vector of unknown model parameters in (5). We here propose estimating $\phi$ based on the following log-marginal likelihood function:

[TABLE]

where $({\text{\boldmath$ {\Sigma} $}}_{i})_{k\ell}=\tau^{2}+{\sigma}^{2}I(k=\ell)$ , $H_{{\lambda}}(y_{i})=(H_{{\lambda}}(y_{i1}),\ldots,H_{{\lambda}}(y_{in_{i}}))^{t}$ , ${\text{\boldmath$ X $}}_{i}=({\text{\boldmath$ x $}}_{i1}^{t},\ldots,{\text{\boldmath$ x $}}_{in_{i}}^{t})^{t}$ , and $H^{\prime}_{{\lambda}}(\cdot)$ denotes the derivative of $H_{{\lambda}}(\cdot)$ . The maximum likelihood estimator of $\phi$ can be defined as the maximizer of $L({\text{\boldmath$ \phi $}})$ .

For maximizing the likelihood function $L({\text{\boldmath$ \phi $}})$ , we first note that the profile log-likelihood function of ${\lambda}$ can be expressed as

[TABLE]

where $\text{ML}({\lambda})$ is the maximum log-likelihood of the nested error regression model with response values $H_{{\lambda}}(y_{ij})$ and covariate vectors ${\text{\boldmath$ x $}}_{ij}$ . Since the computation of $\text{ML}({\lambda})$ can be readily carried out via well-developed numerical methods, e.g. sae package in R (Molina and Marhuenda, 2015), the point evaluation of the profile likelihood (9) is quite easy. Hence, we may obtain the maximizer of $\text{PL}({\lambda})$ by using, for example, the golden section method (Brent et al., 1973). Once we obtain the estimator ${\widehat{\lambda}}$ , we get the estimators of other parameters by applying the nested error regression model to the data set $\{H_{{\widehat{\lambda}}}(y_{ij}),{\text{\boldmath$ x $}}_{ij}\}$ .

For estimating the two variance parameters $\tau^{2}$ and ${\sigma}^{2}$ , the restricted maximum likelihood (REML) method (Jiang, 1996) might be more attractive than the maximum likelihood method in terms of estimating variance components since the REML method is known to produce estimates with smaller bias than the maximum likelihood method. To implement the REML estimation, the first three terms in (8) are changed to those of the REML method, and the resulting function could be maximized by profiling as used in the maximum likelihood method. We note that the estimator of transformation parameters would be changed by adapting the REML estimation, and it would be hard to investigate theoretical differences in terms of estimating transformation parameters. Therefore, we consider only the maximum likelihood estimator for simplicity.

Class of transformations

When applying the aforementioned prediction method, the concrete choice of $H_{{\lambda}}(\cdot)$ would be important in practice. For the choice of suitable transformations, we need to take account of the data characteristics. For instance, income data is often right skewed, so that a family of transformations that includes the log-transformation would be suitable.

Before considering the concrete family of transformations, we introduce the following class of transformations for theoretical guarantee of the proposed method.

Assumption 1.

(Class of transformations)

$H_{{\lambda}}$ * is a differentiable and monotone function, and the range of $H_{{\lambda}}$ is $\mathbb{R}$ for all ${\lambda}$ .*

2.

For fixed $x$ , $H_{{\lambda}}(x)$ as the function of ${\lambda}$ is differentiable.

3.

The function $|\partial H_{{\lambda}}(w)/\partial{\lambda}|$ , $|\partial^{2}H_{{\lambda}}(w)/\partial{\lambda}^{2}|$ and $|\partial^{2}\log H^{\prime}_{{\lambda}}(w)/\partial{\lambda}^{2}|$ with $w=H_{{\lambda}}^{-1}(x)$ are bounded from the upper by $C_{1}\{\exp(C_{2}x)+\exp(-C_{2}x)\}$ with some constants $C_{1},C_{2}>0$ .

The first condition is crucial in this context. If the range of $H_{{\lambda}}$ is not $\mathbb{R}$ , but some subset $A\subset\mathbb{R}$ , the inverse function $H_{{\lambda}}^{-1}$ cannot be defined on $\mathbb{R}\setminus A$ , which causes problems in computing the predictor (7). Although the Box-Cox transformation (Box and Cox, 1964) is widely used for positive valued data and has been adopted in small area estimation (Li and Lahiri, 2007), it does not belong to the class of transformations due to the limited range of the Box-Cox transformation.

When we focus on estimating poverty indicators based on welfare variables like income, the following two properties of parametric transformations would be preferable: one is simplicity of the inverse function $H^{-1}_{{\lambda}}(x)$ used in computing the predictor (7), and the other is eliminating skewness of data distribution which would be often the case in income distribution. There area several parametric transformations related to the Box-Cox transformation, e.g. John and Draper (1980) and Yeo and Johnson (2000). However, the transformation by John and Draper (1980) cannot necessarily eliminate the skewness since the transformation is an odd function. On the other hand, the transformation by Yeo and Johnson (2000) has a relatively complicated form and so does the inverse transformation. As transformations that belongs to the class and satisfies two desirable properties, we consider two parametric transformations, the dual power (DP) transformation (Yang, 2006) and the sinh-arcsinh (SS) transformation (Jones and Pewsey, 2009).

The DP transformation is described as

[TABLE]

where $\lim_{{\lambda}\to 0}H_{{\lambda}}^{\rm DP}(x)=\log x$ . In the context of small area estimation, the Fay-Herriot model has been extended by Sugasawa and Kubokawa (2017) with use of the DP transformation. It is easy to confirm that the range of DP transformation is $\mathbb{R}$ . Note that $\lambda$ controls the skewness of the transformed random variable $H_{{\lambda}}^{\rm DP}(y_{ij})$ . The inverse function and the Jacobian appeared in the predictor (7) and the profile likelihood (9), respectively, are given by

[TABLE]

The original DP transformation (10) by Yang (2006) can be applied only for a positive valued response. However, as demonstrated in Section 4.3, welfare variables can take negative values depending on their definition. In this case, we propose the shifted-DP (SDP) transformation of the form $H_{{\lambda},c}(x)=\{(x+c)^{{\lambda}}-(x+c)^{-{\lambda}}\}/2{\lambda}$ , where $c\in(\min(y_{ij})+{\varepsilon},\infty)$ with some small ${\varepsilon}>0$ .

The SS transformation has the following form:

[TABLE]

where $\sinh(x)=(e^{x}-e^{-x})/2$ is the hyperbolic sine function, $\sinh^{-1}(x)=\log(x+\sqrt{x^{2}+1})$ , and two transformation parameter $a$ and $b$ control skewness and tail heaviness of the transformed random variable $H_{a,b}^{\rm SS}(y_{ij})$ , respectively. The inverse transformation and the Jacobian are obtained as

[TABLE]

Since the domain of the SS transformation is the whole real line, it could be used not only for positive valued data but also real valued data.

As mentioned at the beginning of this section, which transformation should be used would depend on the data characteristics. Meanwhile, we may select the suitable parametric transformation in a more objective way by using information criteria of the form: $-2\text{ML}+a(N)(p+q+2)$ , where ML is the maximum log-likelihood, namely the maximum value of (8), $q$ is the number of transformation parameters and $N=\sum_{i=1}^{m}n_{i}$ is the total number of sampled units. Setting $a(N)=2$ and $a(N)=\log N$ correspond to AIC-like and BIC-like criterion, respectively.

Large sample properties

We here consider the large sample properties of the estimator of structural parameters. To this end, we assume the following condition:

Assumption 2.

(Assumptions under large $m$ )

The true parameter vector ${\text{\boldmath$ \phi $}}_{0}$ is an interior point of the parameter space $\Phi$ .

2.

$0<\min_{i=1,\ldots,m}N_{i}\leq\max_{i=1,\ldots,m}N_{i}<\infty$ .

3.

The elements of ${\text{\boldmath$ X $}}_{i}$ are uniformly bounded and ${\text{\boldmath$ X $}}_{i}^{t}{\text{\boldmath$ X $}}_{i}$ is positive definite.

4.

$m^{-1}\sum_{i=1}^{m}{\text{\boldmath$ X $}}_{i}^{t}{\text{\boldmath$ {\Sigma} $}}_{i}^{-1}{\text{\boldmath$ X $}}_{i}$ * converges to a positive definite matrix as $m\to\infty$ .*

The first condition implies that there exists the true transformation parameter that the transformed random variable $H_{{\lambda}}(Y_{ij})$ achieves normality while it might not necessarily hold in practice. Since the asymptotic variance and covariance matrix of the maximum likelihood estimator can be derived from the Fisher information matrix, we first provide the Fisher information matrix in the following theorem, where the proof is given in Appendix.

Theorem 1.

When the Fisher information is denoted by $I_{\phi_{k}\phi_{j}}=-{\rm E}[\partial^{2}L({\text{\boldmath$ \phi $}})/\partial\phi_{k}\partial\phi_{j}]$ , we have

[TABLE]

where $H_{{\lambda}}^{(k)}(y_{i})=\partial^{k}H_{{\lambda}}(y_{i})/\partial{\lambda}^{k}$ for $k=1,2$ , ${\text{\boldmath$ z $}}_{i}=H_{{\lambda}}(y_{i})-{\text{\boldmath$ X $}}_{i}{\text{\boldmath$ \beta $}}$ , and ${\rm E}[\cdot]$ denotes the expectation with respect to $y_{ij}$ ’s following the model (5). Then, under Assumptions 1 and 2, the maximum likelihood estimator ${\widehat{\text{\boldmath$ \phi $}}}$ is asymptotically distributed as ${\widehat{\text{\boldmath$ \phi $}}}\sim N({\text{\boldmath$ \phi $}},{\text{\boldmath$ I $}}_{{\text{\boldmath$ \phi $}}}^{-1})$ .

From Theorem 1, it is observed that the information matrix of $({\text{\boldmath$ \beta $}}^{t},\tau^{2},{\sigma}^{2})$ does not depend on the transformation parameter ${\lambda}$ , and their expressions are the same as those of the traditional nested error regression models. While the two variance parameters $\tau^{2}$ and ${\sigma}^{2}$ are orthogonal to $\beta$ in the sense that ${\text{\boldmath$ I $}}_{{\text{\boldmath$ \beta $}}\tau^{2}}={\text{\boldmath$ I $}}_{{\text{\boldmath$ \beta $}}{\sigma}^{2}}=0$ , the transformation parameter ${\lambda}$ is not orthogonal to the others. The expectations appeared in the Fisher matrix is not analytically tractable, but it can be easily estimated by replacing the expectation with its sample counterpart. In the case that ${\lambda}$ is multidimensional, the extension of Theorem 1 is straightforward. The expressions of $H_{{\lambda}}^{(k)}(y_{i})$ and $\partial^{2}\log H_{{\lambda}}^{\prime}(y_{ij})/\partial{\lambda}^{2}$ could be analytically complicated and require tedious algebraic calculations. In such a case, the numerical derivative would be useful since we need to compute only the point values of the derivatives.

Empirical Bayes Confidence Intervals

Asymptotically valid confidence intervals

Measuring the variability of the transformed empirical best predictor ${\widehat{\mu}}_{i}$ is an important issue in practice. Traditionally, the mean squared error (MSE) of ${\widehat{\mu}}_{i}$ has been used, and several methods ranging from analytical method (Prasad and Rao, 1990) to numerical methods (Hall and Maiti, 2006) have been considered. On the other hand, an empirical Bayes confidence interval of $\mu_{i}$ is more preferable since it can provide distributional information than MSE although construction of the confidence interval is generally difficult. Here, we derive an asymptotically valid empirical Bayes confidence interval of $\mu_{i}$ .

The key to derivation of the confidence interval is the conditional distribution of $\mu_{i}$ given $y_{i}$ . Noting that ${\rm Cov}(H_{{\lambda}}(Y_{ij}),H_{{\lambda}}(Y_{ik})|y_{i})=\text{Var}(v_{i}|y_{i})=s_{i}^{2}$ for $j\neq k$ , it follows that

[TABLE]

namely, the each component has the expression

[TABLE]

where $z_{i}$ and $w_{ij}$ are mutually independent standard normal random variables, and ${\theta}_{ij}$ and $s_{i}$ are defined in (6). Then the posterior distribution of $\mu_{i}$ can be expressed as

[TABLE]

which is a complex function of standard normal random variables $z_{i}$ and $w_{ij}$ . However, random samples from the conditional distribution (12) can be easily simulated.

We define $Q_{a}(y_{i},{\text{\boldmath$ \phi $}})$ as the lower $100a\%$ quantile point of the posterior distribution of $\mu_{i}$ with the true $\phi$ , which satisfies ${\rm P}(\mu_{i}\leq Q_{a}(y_{i},{\text{\boldmath$ \phi $}})|y_{i})=a$ . Hence, the Bayes confidence interval of $\mu_{i}$ with nominal level $1-\alpha$ is obtained as $I_{\alpha}=(Q_{\alpha/2}(y_{i},{\text{\boldmath$ \phi $}}),Q_{1-\alpha/2}(y_{i},{\text{\boldmath$ \phi $}}))$ , which holds that ${\rm P}(\mu_{i}\in I_{\alpha})=1-\alpha$ . However, the interval $I_{\alpha}$ depends on the unknown parameter $\phi$ , so that the feasible version of $I_{\alpha}$ is obtained by replacing $\phi$ with its estimator ${\widehat{\text{\boldmath$ \phi $}}}$ , namely

[TABLE]

which we call naive empirical Bayes confidence interval of $\mu_{i}$ . The two quantiles appeared in (13) can be computed by generating a large number of random samples from the conditional distribution (12). Owing to the asymptotic properties of ${\widehat{\text{\boldmath$ \phi $}}}$ , the coverage probability of the naive interval (13) converges to the nominal level as the number of areas $m$ tends to infinity as shown in the following theorem proved in Appendix.

Theorem 2.

Under Assumptions 1 and 2, it holds ${\rm P}(\mu_{i}\in I_{\alpha}^{N})=1-\alpha+O(m^{-1})$ .

Bootstrap calibrated intervals

As shown in Theorem 2, the coverage error of the naive interval (13) is of order $m^{-1}$ , which is not necessarily negligible when $m$ is not sufficiently large. Since the number of $m$ is usually moderate in practice, the calibrated intervals with higher accuracy would be valuable. Following Chatterjee, et al. (2008), Hall and Maiti (2006), we construct a second order corrected empirical Bayes confidence interval $I_{\alpha}^{C}$ satisfying $P(\mu_{i}\in I_{\alpha}^{C})=1-\alpha+o(m^{-1})$ .

To begin with, we define the bootstrap estimator of the coverage probability of the naive interval. Let $Y_{ij}^{\ast}$ be the parametric bootstrap samples generated from the estimated model (5) with ${\text{\boldmath$ \phi $}}={\widehat{\text{\boldmath$ \phi $}}}$ , and $y_{i}^{\ast}=\{Y_{ij}^{\ast},\ j=1,\ldots,n_{i}\}$ . Moreover, let $\mu_{i}^{\ast}$ be the bootstrap version of $\mu_{i}$ based on $Y_{ij}^{\ast}$ ’s. Since the coverage probability is ${\rm P}(Q_{a/2}(y_{i},{\widehat{\text{\boldmath$ \phi $}}})\leq\mu_{i}\leq Q_{1-a/2}(y_{i},{\widehat{\text{\boldmath$ \phi $}}}))$ , its parametric bootstrap estimator can be defined as

[TABLE]

where the expectation is taken with respect to the bootstrap samples $Y_{ij}^{\ast}$ ’s. Based on the coverage probability, we define the calibrated nominal level $a^{\ast}$ as the solution of the equation $\text{CP}(a^{\ast})=1-\alpha$ , which can be solved by the bisectional method (Brent, 1973). Then, the calibrated interval is given by

[TABLE]

which has second order accuracy as shown in the following theorem proved in Appendix.

Theorem 3.

Under Assumptions 1 and 2, it holds ${\rm P}(\mu_{i}\in I_{\alpha}^{C})=1-\alpha+o(m^{-1})$ .

Numerical Studies

Evaluation of prediction errors

We first evaluate the prediction errors of the proposed predictors together with some existing methods. To this end, we considered the following data generating processes:

[TABLE]

where $i=1,\ldots,m$ , $j=1,\ldots,N$ , $\mu_{ij}=\beta_{0}+\beta_{1}x_{1ij}+\beta_{2}x_{2ij}+\beta_{3}x_{3ij}$ , $(\beta_{0},\beta_{1},\beta_{2},\beta_{3})=(2,1,-0.5,1)$ , $\tau=0.5$ , ${\sigma}=0.8$ , and $x_{1ij},x_{2ij}$ and $x_{3ij}$ were generated from Bernoulli distributions with probabilities $0.3,0.5$ and $0.5$ , respectively. Based on the above models, we considered seven scenarios of data generating processes as summarized in Table 1. In this study, we focus on estimating the following parameters:

[TABLE]

where $z$ is defined as $0.6$ times median of $Y_{ij}$ ’s. Note that $T_{\alpha}(x)$ is known as FGT poverty measures (Foster et al., 1984). We considered two cases of $\alpha$ , $\alpha=0$ (poverty rate) and $\alpha=1$ (poverty gap). We set $m=30$ and divided $m$ areas into five groups with equal number of areas, and we set the same numbers of sampled units $n_{i}$ within the same groups. The group pattern of $n_{i}$ was $(10,20,30,40,50)$ . For the size of units $N$ , we considered two cases, $N=200$ and $N=400$ , to check sensitivity of ratios $n_{i}/N$ .

Among the generated $Y_{i1},\ldots,Y_{iN}$ , we used first $n_{i}$ observations $y_{i1}(=Y_{i1}),\ldots,y_{in_{i}}(=Y_{in_{i}})$ as the sampled data. Then, based on the sampled data $y_{ij}$ ’s and covariates $X_{ij}=(x_{1ij},x_{2ij},x_{3ij})^{t}$ , we predict $\mu_{i}$ based on the proposed adaptively transformed empirical best prediction (ATP) method with DP transformation (10) and the transformed empirical best prediction (TP) (Molina and Rao, 2010) with log-transformation. We used 1000 Monte Carlo samples in applying these two methods. As a competitor from other model-based methods, we employed the M-quantile method (Chambers and Tzavidis, 2006). We fitted the M-quantile model to the sampled data in the same way as in Chambers and Tzavidis (2006), and modified the distribution function estimator given in equation (5) in Chambers and Tzavidis (2006) to an estimator of $\mu_{i}$ by replacing the indicator function with $T_{\alpha}(\cdot)$ . Moreover, we computed the following direct estimator (DE): ${\widehat{\mu}}_{i}^{D}=n_{i}^{-1}\sum_{j=1}^{n_{i}}T_{\alpha}(y_{ij})$ . Note that the proposed model is correctly specified in Scenarios (s1) $\sim$ (s3), namely, there exists the true transformation parameter such that the transformed variable achieves normality. On the other hand, in Scenarios (s4) $\sim$ (s7), the proposed model is misspecified in the sense that there is no true transformation parameter that achieves normality

To compare the performances of the four methods, we computed mean squared error (MSE) defined as

[TABLE]

with $R=500$ , where ${\widehat{\mu}}_{i}^{(r)}$ and $\mu_{i}^{(r)}$ are the estimated and true values of $\mu_{i}$ , respectively, in the $r$ th iteration. The obtained values of MSEs are averaged within the same groups and the results are reported in Tables 2 and 3. From these tables, we can observe that the proposed method provides better estimates than the three existing methods in almost all cases, and there are not much differences between the two cases of $N$ . In scenario (s1), the performance between ATP and TP are almost the same while the ATP method is overfitting while the log-transformed model is correctly specified. Meanwhile, in the other scenarios, the ATP method can improve the estimation accuracy of TP method as well as MQ and DE methods, by adaptively estimating the transformation parameter from the data even when the model assumption in the TP method is violated. In Supplementary Material, we provide additional simulation results (e.g. relative bias and coefficient of variations).

Performance of empirical Bayes confidence intervals

We next evaluate the finite sample performance of the empirical Bayes confidence intervals given in Section 3. To this end, we adopted the same settings as used in scenario (s2) in Section 4.1 and focused on the same population parameters $\mu_{i}$ given in (15) with $\alpha=0$ and $1$ . For constructing confidence intervals of $\mu_{i}$ , we employed two methods: bootstrap calibrated confidence interval (14) as well as the naive confidence interval (13), which are denoted by BCI and NCI, respectively. We used 200 Monte Carlo samples and 100 bootstrap replications in applying these two methods. Note that theoretical coverage accuracy of BCI and NCI is $o(m^{-1})$ and $O(m^{-1})$ , where $m=30$ in this study.

To evaluate the performances of the two confidence intervals, based on $R=500$ simulation runs, we computed the following empirical coverage probability (CP) and average length of confidence interval (AL):

[TABLE]

where $\mu_{i}^{(r)}$ and $\text{CI}_{i}^{(r)}$ are the true value and the confidence interval in the $r$ th iteration. We averaged CP’s and AL’s within the same groups, and reported the results in Table 4. Table 4 shows that NCI tends to produce shorter confidence intervals and the coverage probability is smaller than the nominal level for all areas. On the other hand, BCI produces more accurate confidence intervals than NCI, which would support the theoretical property given in Theorem 3. Since underestimation of risk estimates may yield serious problems in practice, BCI would be appealing when the number of areas is not large.

Example: poverty mapping in Spain

We applied the proposed method to estimation of poverty indicators in Spanish provinces, using the synthetic income data available in sae package (Molina and Marhuenda, 2015) in R language. The similar data set was used in Molina and Rao (2010) and Molina et al. (2014). Such data are available for $m=52$ areas and the sample sizes (the number of observed units) range from $20$ to $1420$ , so that there are no out-of-sample areas. The total number of samples units is $17199$ . The welfare variable for the individuals is the equivalized annual net income denoted by $E_{ij}$ , noting that the small portions of $E_{ij}$ take negative values. The median of $E_{ij}$ is about $10800$ and area-wise medians of $E_{ij}$ range about from $7500$ to $13700$ , so that the area-wise distributions would be quite different. As auxiliary variables, we considered the indicators of the four groupings of ages (16-24, 25-49, 50-64 and $\geq$ 65), the indicator of having Spanish nationality, the indicators of education levels (primary education and post-secondary education), and the indicators of two employment categories (employed, unemployed). An intercept term is also included in our model.

Let ${\text{\boldmath$ x $}}_{ij}$ be the vector of auxiliary variables including an intercept term. We consider the four models with different families of transformations, SDP transformation, SDP transformation with known shift (SDP-s), SS transformation and shifted log-transformation (SL), which are described as

[TABLE]

where $v_{i}\sim N(0,\tau^{2})$ , ${\varepsilon}_{ij}\sim N(0,{\sigma}^{2})$ and $c^{\ast}=|\min(E_{ij})|+1$ . By maximizing the profile likelihood function of transformation parameters, we obtained the following estimates:

[TABLE]

where the values in the parentheses are the corresponding standard errors calculated from the Fisher information matrix given in Theorem 1. From the above result, it can be observed that the approximate $95\%$ confidence intervals of the transformation parameter ${\lambda}$ in SDP and SDP-s are bounded from [math], which means that the log-transformed model would be inappropriate.

For comparing the four models, we calculated AIC and BIC given in the end of Section 2.3. We reported the values in Table 5, which shows that both AIC and BIC values of the SDP model are significantly smaller than those of the other models. Moreover, to see the adequacy of normality assumption of $v_{i}$ and ${\varepsilon}_{ij}$ in the models (16), we computed

[TABLE]

and their QQ-plots are shown in Figures 1 and 2. Although the normality of $v_{i}$ seems plausible in all the four models from Figure 1, the normality of ${\varepsilon}_{ij}$ in the SL model would not be appropriate from Figure 2.

Finally, we calculated the estimates of poverty indicators based on FGT poverty measures given in (15), where we set $z$ as the $0.6$ times the median of $E_{ij}$ ’s. In particular, we estimated the poverty rate ( $\alpha=0$ ) and poverty gap ( $\alpha=1$ ). Since the auxiliary variables of non-sampled units in five provinces are available, we computed the estimates of poverty indicators of the provinces based on the four models in (16) with 100 Monte Carlo samples as well as the direct estimator (DE). Moreover, we computed $95\%$ confidence intervals (CI) of the poverty indicators based on 100 bootstrap samples. The results are given in Table 6. It is observed that the four model-based estimates are very different from DE even in the provinces with relatively large number of sampled units (e.g. Sevilla). Moreover, the SDP and SDP-s models provide quite similar estimates while the SL model provide relatively different estimates from the others. However, based on AIC and BIC values and QQ-plot in Figure 2, the validity of SL method would be doubtful, so that the estimates given in Table 6 would not be reliable.

Conclusions and Discussion

We have introduced the use of the parametric family of transformations for estimating (predicting) general area-specific parameters based on the mixed effects models. We have provided the best predictor of the parameter as well as the maximum likelihood method for estimating model parameters. Moreover, for measuring variability of the predictor, we constructed the mpirical Bayes confidence interval of the area parameter. The simulation and empirical studies have revealed that the use of parametric transformations would improve the prediction accuracy of the existing method using specified transformations.

As demonstrated in the simulation studies, the proposed method using the parametric family of transformations outperformed the prediction method using specified transformations when the specified transformation is not true. Hence, the proposed method would be promising and recommended as an alternative tool for prediction methods with specified transformations. However, when the estimated transformation is close to a well-known one, we may not necessarily employ the proposed method and it would be recommended to simply use the well-known transformation. For example, when the estimate of $\lambda$ in the DP transformation is close to [math], it would be better to simply use the log-transformation.

In this paper, we developed the methodology under the situation where the random effects $v_{i}$ ’s are mutually independent, following Molina and Rao (2010). However, $v_{i}$ ’s might be spatially correlated in some cases and several works have been focused on introducing spatial correlations in small area estimation (e.g. Pratesi and Salvati, 2009; Schmit et al., 2016). The detailed discussion introducing spatial correlation in this context is left to a future work.

Although we considered an empirical Bayes approach in this paper, the hierarchical Bayes approach as considered in Molina et al. (2014), by assigning some prior distributions for model parameters, would be useful as well. Moreover, it would be interesting to consider the use of the penalized spline method for modeling the regression part (e.g. Opsomer et al., 2008) or a semiparametric transformation approach (e.g. Nesser, et al., 1996) rather than the full parametric approach, which would achieve more flexible modeling whereas both would be computationally burdensome. The detailed investigation are left to valuable future studies.

Acknowledgement The authors are supported by Grant-in-Aid for Scientific Research (18K12757, 15H01943 and 26330036) from Japan Society for the Promotion of Science.

Appendix

A1. Proof of Theorem 1. From the likelihood function (8), the derivatives are given by

[TABLE]

where ${\text{\boldmath$ z $}}_{i}=H_{{\lambda}}(y_{i})-{\text{\boldmath$ X $}}_{i}{\text{\boldmath$ \beta $}}$ . Since ${\rm E}[{\text{\boldmath$ z $}}_{i}]={\text{\boldmath$ 0 $}}$ , it follows that ${\rm E}[\partial^{2}L/\partial{\text{\boldmath$ \beta $}}\partial\tau^{2}]={\rm E}[\partial^{2}L/\partial{\text{\boldmath$ \beta $}}\partial{\sigma}^{2}]={\text{\boldmath$ 0 $}}$ . The other elements of the Fisher information can be obtained by a straightforward calculation. Moreover, under Assumptions 1 and 2, each element of the Fisher information matrix is finite, so that the asymptotic normality of ${\widehat{\text{\boldmath$ \phi $}}}$ follows.

A2. Proof of Theorem 2. Let ${\text{\boldmath$ \phi $}}_{0}$ be the true values of parameters. It suffices to show that $P(\mu_{i}\leq Q_{a}(y_{i},{\widehat{\text{\boldmath$ \phi $}}}))=a+O(m^{-1})$ for $a\in(0,1)$ . We first note that

[TABLE]

where $F(\cdot;y_{i},{\text{\boldmath$ \phi $}}_{0})$ is a distribution function of $\mu_{i}$ given $y_{i}$ . Let $G(y_{i},{\widehat{\text{\boldmath$ \phi $}}},{\text{\boldmath$ \phi $}}_{0})=F(Q_{a}(y_{i},{\widehat{\text{\boldmath$ \phi $}}});y_{i},{\text{\boldmath$ \phi $}}_{0})$ , noting that $0\leq G(y_{i},{\widehat{\text{\boldmath$ \phi $}}},{\text{\boldmath$ \phi $}}_{0})\leq 1$ and $G(y_{i},{\text{\boldmath$ \phi $}}_{0},{\text{\boldmath$ \phi $}}_{0})=a$ . The Taylor expansion of $G(y_{i},{\widehat{\text{\boldmath$ \phi $}}},{\text{\boldmath$ \phi $}}_{0})$ shows that

[TABLE]

where ${\text{\boldmath$ \phi $}}^{\ast}$ is on the line connecting ${\widehat{\text{\boldmath$ \phi $}}}$ and ${\text{\boldmath$ \phi $}}_{0}$ . Then, it follows that

[TABLE]

where

[TABLE]

Using the Cauchy-Schwarz inequality, we have

[TABLE]

From the asymptotic normality of ${\widehat{\text{\boldmath$ \phi $}}}$ given in Theorem 1, it holds that ${\rm E}[|{\widehat{\phi}}_{k}-\phi_{k}|^{r}]=O(m^{-r/2})$ . Moreover, since $0\leq G(y_{i},{\text{\boldmath$ \phi $}},{\text{\boldmath$ \phi $}}_{0})\leq 1$ and ${\text{\boldmath$ \phi $}}_{0}$ is an interior point, it holds $|G(y_{i},{\text{\boldmath$ \phi $}}_{1},{\text{\boldmath$ \phi $}}_{0})-G(y_{i},{\text{\boldmath$ \phi $}}_{2},{\text{\boldmath$ \phi $}}_{0})|\leq 2$ for all ${\text{\boldmath$ \phi $}}_{1},{\text{\boldmath$ \phi $}}_{2}\in N_{{\text{\boldmath$ \phi $}}_{0}}$ with $N_{{\text{\boldmath$ \phi $}}_{0}}=\{{\text{\boldmath$ \phi $}};\|{\text{\boldmath$ \phi $}}-{\text{\boldmath$ \phi $}}_{0}\|\leq{\varepsilon}\}$ , thereby the partial derivatives of $G(y_{i},{\text{\boldmath$ \phi $}},{\text{\boldmath$ \phi $}}_{0})$ at ${\text{\boldmath$ \phi $}}={\text{\boldmath$ \phi $}}_{0}$ are bounded. Then, we obtain $R_{2}=O(m^{-1})$ . Using a similar evaluation, we can show that $R_{3}=O(m^{-1})$ . Regarding $R_{1}$ , it is noted that

[TABLE]

From Lohr and Rao (2009), it holds ${\rm E}[{\widehat{\text{\boldmath$ \phi $}}}-{\text{\boldmath$ \phi $}}_{0}|y_{i}]=m^{-1}{\text{\boldmath$ b $}}_{{\text{\boldmath$ \phi $}}}-{\text{\boldmath$ I $}}_{{\text{\boldmath$ \phi $}}}^{-1}\partial L_{i}(y_{i},{\text{\boldmath$ \phi $}}_{0})/\partial{\text{\boldmath$ \phi $}}+o_{p}(m^{-1})$ , where $\sum_{i=1}^{m}L_{i}(y_{i},{\text{\boldmath$ \phi $}}_{0})\equiv L({\text{\boldmath$ \phi $}})$ and $b_{{\text{\boldmath$ \phi $}}}=\lim_{m\to\infty}m{\rm E}[{\widehat{\text{\boldmath$ \phi $}}}-{\text{\boldmath$ \phi $}}_{0}]$ is the asymptotic bias of ${\widehat{\text{\boldmath$ \phi $}}}$ . Hence, we have

[TABLE]

which is $O(m^{-1})$ . Therefore, the proof is completed.

A3. Proof of Theorem 3. From the proof of Theorem 2, we have

[TABLE]

where $c(a,{\text{\boldmath$ \phi $}})$ is a smooth function of $a$ and $\phi$ . Take $a^{\ast}$ and $\widehat{a}^{\ast}$ so that they satisfy $F_{a^{\ast}}({\text{\boldmath$ \phi $}}_{0})=a$ and $F_{\widehat{a}^{\ast}}({\widehat{\text{\boldmath$ \phi $}}})=a$ , respectively. Then, it holds $\widehat{a}^{\ast}-a^{\ast}=o_{p}(1)$ since ${\widehat{\text{\boldmath$ \phi $}}}-{\text{\boldmath$ \phi $}}=o_{p}(1)$ . From the above expansion, we have

[TABLE]

so that $\widehat{a}^{\ast}-a^{\ast}=o_{p}(m^{-1})$ . Hence, it follows that

[TABLE]

which completes the proof.

A4. Checking assumptions of transformations. We here check the assumption 3 in Assumption 1 for the dual power (DP) transformation (10) and sinh-arcsinh (SS) transformation (11).

(DP transformation) We first note that $H_{{\lambda}}^{-1}(x)=O(x^{1/{\lambda}})$ as $x\to\infty$ . By putting $x=-t$ for $t>0$ , we have

[TABLE]

as $t\to\infty$ . A straightforward calculation shows that

[TABLE]

thereby, it follows that

[TABLE]

as $|x|\to\infty$ . Moreover, since

[TABLE]

a similar evaluation leads to $\big{|}\partial^{2}H_{{\lambda}}(w)/\partial{\lambda}^{2}\big{|}=O(|x|(\log|x|)^{2})$ as $|x|\to\infty$ . Regarding $\partial^{2}\log H_{{\lambda}}^{\prime}(x)/\partial{\lambda}^{2}$ , it holds that

[TABLE]

as $|x|\to\infty$ , so that the DP transformation satisfies the assumption. When the location parameter is used, namely, $H_{{\lambda},c}(x)=\{(x+c)^{{\lambda}}-(x+c)^{-{\lambda}}\}/2{\lambda}$ , it is noted that $\partial^{k}H_{{\lambda},c}(x)/\partial c^{k}=\partial^{k}H_{{\lambda},c}(x)/\partial x^{k}$ , so that the quite similar evaluation shows that the shifted-DP transformation also satisfies the assumption.

(SS transformation) It follows that

[TABLE]

Note that $\sinh^{-1}(x)=O(\log|x|)$ as $|x|\to\infty$ , so that $H_{a,b}^{-1}(x)=O(\exp(b^{-1}\log|x|))=O(|x|^{1/b})$ . Then, we have

[TABLE]

as $|x|\to\infty$ . Moreover, it holds that

[TABLE]

thereby a similar evaluation shows that $\partial^{2}H_{a,b}(x)/\partial^{2}a=O(|x|)$ , $\partial^{2}H_{a,b}(x)/\partial^{2}b=O(|x|(\log|x|)^{2})$ and $\partial^{2}H_{a,b}(x)/\partial a\partial b=O(|x|\log|x|)$ as $|x|\to\infty$ . On the other hand, a straightforward calculation shows that

[TABLE]

which are bounded by the function $\partial H_{a,b}(x)/\partial a$ and $\partial H_{a,b}(x)/\partial b$ , respectively. It is not difficult to show that the second partial derivatives of $\log H_{a,b}^{\prime}(x)$ are bounded by polynomial functions of the second partial derivatives of $H_{a,b}(x)$ and $H_{a,b}(x)$ , thereby the assumption is satisfied.

Bibliography65

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1]
2[2] Battese, G.E., Harter, R.M. and Fuller, W.A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association , 83 , 28-36.
3[3]
4[4] Brent, R. (1973). Algorithms for Minimization without Derivatives. Englewood Cliffs N.J. Prentice-Hall.
5[5]
6[6] Box, G.E.P. and Cox, D.R. (1964). An analysis of transformation (with discussion). Journal of the Royal Statistical Society: Series B , 26 , 211-252.
7[7]
8[8] Chambers, R. and Tzavidis, N. (2006). M-quantile models for small area estimation. Biometrika , 93 , 255-268.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Taxonomy

Adaptively Transformed Mixed Model Prediction of General Finite Population Parameters

Introduction

Adaptively Transformed Mixed Model Prediction

Transformed best predictor

Estimation of structural parameters

Class of transformations

Assumption 1**.**

Large sample properties

Assumption 2**.**

Theorem 1**.**

Empirical Bayes Confidence Intervals

Asymptotically valid confidence intervals

Theorem 2**.**

Bootstrap calibrated intervals

Theorem 3**.**

Numerical Studies

Evaluation of prediction errors

Performance of empirical Bayes confidence intervals

Example: poverty mapping in Spain

Conclusions and Discussion

Assumption 1.

Assumption 2.

Theorem 1.

Theorem 2.

Theorem 3.