A simple recipe for making accurate parametric inference in finite sample

Stéphane Guerrier; Mucyo Karemera; Samuel Orso; Maria-Pia Victoria-Feser

arXiv:1901.06750·stat.ME·January 23, 2019

TL;DR

This paper proposes a new method for finite-sample parametric inference that offers accurate results without relying solely on asymptotic approximations, addressing a key challenge in statistical testing.

Contribution

It introduces a theoretical framework with general conditions ensuring accurate finite-sample inference, providing an alternative to traditional asymptotic methods like the bootstrap.

Findings

01. The method guarantees finite-sample accuracy under specified conditions.

02. Theoretical demonstration of the method's validity in finite samples.

03. Provides a practical approach for exact inference in finite-sample scenarios.

Abstract

Constructing tests or confidence regions that control the error rates in the long run is probably one of the most important problems in statistics. Yet, the theoretical justification for most methods in statistics is asymptotic. The bootstrap, for example, despite its simplicity and its widespread usage, is an asymptotic method. There is in general no claim about the exactness of inferential procedures in finite sample. In this paper, we propose an alternative to the parametric bootstrap. We set up general conditions to demonstrate theoretically that accurate inference can be claimed in finite sample.

Tables

Table 1. $\hat{c}$: estimated coverage probabilities, $\bar{I}$: median interval length, $\bar{s}$: average time in seconds to compute the intervals for one trial.

		SwiZs			parametric bootstrap			BCa bootstrap
$θ_{0}$	$α$	$\hat{c}$	$\bar{I}$	$\bar{s}$	$\hat{c}$	$\bar{I}$	$\bar{s}$	$\hat{c}$	$\bar{I}$	$\bar{s}$
1.5	50%	50.66%	0.5129	0.1622	49.13%	0.5794	0.0358	47.69%	0.4906	0.0333
	75%	75.39%	0.8839		73.27%	1.0504		71.64%	0.8607
	90%	90.15%	1.2861		87.03%	1.6734		86.64%	1.2815
	95%	94.68%	1.5540		91.42%	2.1935		91.82%	1.5800
	99%	98.84%	2.1052		96.05%	3.8820		97.13%	2.2714
3.5	50%	50.08%	1.7594	0.2010	47.65%	2.8832	0.0349	44.94%	1.8716	0.0322
	75%	74.62%	3.2780		70.36%	6.6243		68.80%	3.7372
	90%	90.39%	5.2129		84.50%	20.665		84.36%	6.5202
	95%	94.85%	6.8416		89.63%	240.11		90.62%	9.6584
	99%	98.73%	10.788		95.11%	3104.1		95.60%	29.011
6	50%	48.61%	4.2027	0.2093	46.54%	11.463	0.0342	44.29%	4.6886	0.0305
	75%	74.39%	8.3688		68.34%	245.75		69.99%	12.245
	90%	89.56%	16.087		80.83%	2586.4		87.45%	41.335
	95%	94.61%	26.250		85.06%	3376.8		93.05%	515.51
	99%	98.90%	361.28		95.55%	4827.0		95.94%	2261.8

Table 2. Empirical proportion of times the bias-adjusted maximum likelihood estimator is jointly out of the parameter space $\bm{\Theta}$.

$n = 35$	$n = 50$	$n = 100$	$n = 150$
38.78%	21.94%	3.02%	0.40%

Table 3. Average time in seconds to estimate a conditional distribution on $S = 10{,}000$ points and total time in hours for the $M = 10{,}000$ independent trials.

	SwiZs	indirect inference	parametric bootstrap
Average time [seconds]	0.97	134.18	197.15
Total time [hours]	2.7	372.5	547.4

Table 4. 95% coverage probabilities of confidence intervals from the SwiZs and asymptotic theory.

	SwiZs	asymptotic
$θ_{1}$	0.9442	0.9187
$θ_{2}$	0.9398	0.8115
$θ_{3}$	0.9382	0.8121
$θ_{4}$	0.9432	0.7688
$θ_{5}$	0.9450	0.7737
$θ_{6}$	0.9397	0.9233
$θ_{7}$	0.9357	0.9170
$θ_{8}$	0.9398	0.9237
$θ_{9}$	0.9391	0.9218
$θ_{10}$	0.9400	0.9208
$θ_{11}$	0.9424	0.9208
$θ_{12}$	0.9375	0.9214
$θ_{13}$	0.9368	0.9204
$θ_{14}$	0.9389	0.9210
$θ_{15}$	0.9400	0.9207
$θ_{16}$	0.9400	0.9183
$θ_{17}$	0.9361	0.9183
$θ_{18}$	0.9449	0.9241
$θ_{19}$	0.9412	0.9218
$θ_{20}$	0.9427	0.9240

Table 5. Average computational time in seconds to approximate a distribution on $S = 10{,}000$ points.

	SwiZs	Boot	AB	RSwiZs	RBoot
$n = 35$	0.1430	0.0222	0.0197	0.5613	0.0998
$n = 50$	0.2002	0.0293	0.0268	0.7889	0.1320
$n = 100$	0.3826	0.0526	0.0504	1.3520	0.2314
$n = 150$	0.5580	0.0753	0.0736	1.7792	0.3291
$n = 250$	0.8998	0.1228	0.1211	2.3141	0.5174
$n = 500$	1.7763	0.2364	0.2398	3.2132	0.9848

Table 6. Estimated coverage probabilities.

	SwiZs		Boot		BA		RSwiZs		RBoot
$α$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$
$n = 35$
50%	49.48	50.07	43.10	44.26	0.00	0.00	42.73	44.07	36.72	36.84
75%	74.49	75.14	65.82	65.39	0.00	0.00	65.84	66.59	55.00	55.06
90%	89.31	89.39	80.64	78.74	0.00	0.00	81.41	81.97	64.47	64.26
95%	94.27	94.34	86.71	84.28	0.03	0.00	87.58	87.41	67.33	67.13
99%	98.26	98.43	91.23	91.07	0.75	0.00	93.84	93.53	69.64	70.39
$n = 50$
50%	49.59	49.88	44.48	45.30	0.01	0.00	45.70	46.93	37.37	37.64
75%	74.73	76.67	68.43	67.84	0.08	0.00	67.40	68.21	57.44	56.73
90%	89.89	90.62	83.15	81.57	0.76	0.00	82.51	82.75	69.52	68.81
95%	94.67	94.94	89.26	87.11	1.92	0.00	88.47	88.35	73.01	72.49
99%	98.40	98.46	95.19	93.69	10.86	0.00	94.79	94.80	75.97	76.43
$n = 100$
50%	49.86	49.95	47.52	48.04	20.52	27.75	49.44	49.80	36.19	35.48
75%	75.37	75.88	72.00	71.59	44.13	57.82	73.07	74.32	57.01	55.61
90%	90.20	90.42	86.69	85.86	69.68	81.85	86.54	86.83	73.68	71.96
95%	95.41	95.67	92.06	90.96	81.89	91.13	91.69	91.52	80.75	79.17
99%	98.85	98.91	97.32	96.42	94.93	98.74	96.85	96.79	86.96	86.38
$n = 150$
50%	50.12	49.80	48.36	48.58	47.05	49.78	49.80	49.82	33.94	33.00
75%	74.85	75.32	72.41	72.63	70.68	72.58	74.44	74.69	55.12	53.45
90%	90.31	90.32	87.58	86.85	86.94	89.18	88.95	89.22	72.14	70.01
95%	95.08	95.35	93.03	92.11	93.26	94.89	93.60	93.74	80.17	78.15
99%	99.08	99.10	97.92	97.43	98.72	99.28	97.81	97.69	90.07	88.56
$n = 250$
50%	49.46	49.84	48.60	49.01	47.61	47.09	49.55	49.90	29.16	28.45
75%	75.02	74.49	73.59	72.75	72.09	72.63	74.83	74.80	49.94	47.56
90%	89.55	89.81	88.05	88.11	89.54	90.13	89.56	89.58	67.50	65.25
95%	94.77	94.79	93.56	93.34	94.79	95.68	94.50	94.70	76.90	74.39
99%	99.02	99.03	98.46	97.92	99.18	99.50	98.61	98.70	89.37	87.24
$n = 500$
50%	50.08	49.89	49.29	49.81	48.76	48.67	50.26	49.64	20.51	18.95
75%	74.73	74.36	73.90	73.64	73.68	73.85	74.55	74.68	37.76	34.96
90%	89.53	89.75	88.86	88.69	89.03	89.22	89.45	89.80	56.15	52.68
95%	94.92	94.86	94.11	94.22	94.33	94.77	94.92	94.80	66.89	63.51
99%	98.97	98.99	98.62	98.40	99.01	99.07	98.94	99.03	83.63	80.06

Table 7. Estimated coverage probabilities of the Gini index.

	SwiZs	Boot	BA	RSwiZs	RBoot
$α$	Gini index
$n = 35$
50%	50.22	44.26	0.02	44.27	36.84
75%	76.03	65.44	0.72	67.12	55.06
90%	91.07	78.96	68.11	83.07	64.36
95%	96.76	84.35	100.00	89.43	67.19
99%	98.84	91.10	100.00	93.88	70.41
$n = 50$
50%	49.89	45.30	0.00	46.94	37.64
75%	76.86	67.84	0.00	68.26	56.73
90%	90.83	81.58	41.20	82.68	68.82
95%	95.17	87.16	71.42	88.40	72.49
99%	98.92	93.76	99.82	95.14	76.45
$n = 100$
50%	49.95	48.04	32.96	49.80	35.48
75%	75.88	71.59	59.90	74.32	55.61
90%	90.42	85.86	82.63	86.83	71.96
95%	95.74	90.98	91.44	91.64	79.19
99%	98.85	96.46	98.73	96.83	86.43
$n = 150$
50%	49.80	48.58	46.30	49.82	33.00
75%	75.32	72.63	72.68	74.69	53.45
90%	90.32	86.85	89.18	89.22	70.01
95%	95.35	92.12	94.87	93.73	78.15
99%	99.06	97.47	99.27	97.71	88.60
$n = 250$
50%	49.84	49.01	46.99	49.90	28.45
75%	74.49	72.75	72.41	74.80	47.56
90%	89.81	88.11	88.95	89.58	65.25
95%	94.81	93.34	94.99	94.69	74.43
99%	99.04	97.93	99.48	98.68	87.34
$n = 500$
50%	49.89	49.81	48.67	49.64	18.95
75%	74.36	73.64	73.85	74.68	34.96
90%	89.75	88.69	89.22	89.80	52.68
95%	94.86	94.22	94.77	94.79	63.57
99%	98.98	98.41	99.03	99.02	80.28

Table 8. Estimated coverage probabilities of the value-at-risk at 95%.

	SwiZs	Boot	BA	RSwiZs	RBoot
$α$	95% value-at-risk
$n = 35$
50%	47.30	46.08	20.92	45.34	41.13
75%	73.76	67.53	55.77	70.38	61.00
90%	90.05	80.35	93.73	88.08	73.92
95%	95.67	85.36	98.92	94.80	79.41
99%	99.17	91.63	99.97	99.25	87.26
$n = 50$
50%	48.14	47.23	31.76	46.40	41.27
75%	73.39	69.40	63.30	70.22	61.47
90%	89.63	82.24	91.60	87.07	74.72
95%	94.89	87.41	97.72	93.60	80.20
99%	99.23	93.17	99.90	99.27	87.87
$n = 100$
50%	49.75	48.90	48.33	49.18	39.94
75%	74.68	72.61	75.68	72.93	61.39
90%	89.48	86.38	91.97	87.16	75.97
95%	95.07	91.17	96.79	94.17	82.45
99%	99.23	96.31	99.75	99.11	90.45
$n = 150$
50%	50.10	49.19	49.47	49.91	37.43
75%	74.13	73.17	75.42	73.57	59.31
90%	89.77	87.25	91.21	88.49	75.26
95%	94.76	92.57	96.18	93.31	81.76
99%	98.89	97.34	99.61	98.46	91.00
$n = 250$
50%	50.28	49.52	50.02	50.24	34.09
75%	75.29	74.25	74.87	74.75	55.55
90%	89.43	88.10	90.27	89.13	72.28
95%	94.66	93.26	95.15	94.14	80.35
99%	98.89	97.85	99.10	98.67	90.11
$n = 500$
50%	49.15	48.63	49.00	49.22	27.45
75%	74.88	74.01	74.63	74.53	45.61
90%	90.02	89.46	90.37	89.93	62.84
95%	94.97	94.45	95.18	94.85	72.65
99%	98.92	98.32	98.87	98.96	86.63

Table 9. Estimated coverage probabilities of the expected shortfall at 95%.

	SwiZs	Boot	BA	RSwiZs	RBoot
$α$	95% expected shortfall
$n = 35$
50%	50.33	48.55	0.02	50.08	47.38
75%	74.97	72.60	0.72	74.70	71.28
90%	89.61	87.63	68.11	89.24	86.35
95%	94.65	92.87	100.00	94.37	92.23
99%	98.80	97.97	100.00	98.72	97.48
$n = 50$
50%	49.48	48.24	0.00	49.28	47.06
75%	74.81	72.74	0.00	74.45	71.28
90%	89.76	88.07	41.20	89.25	86.85
95%	94.74	93.32	71.42	94.48	92.16
99%	98.89	97.92	99.82	98.62	97.48
$n = 100$
50%	49.94	49.16	32.96	49.64	47.22
75%	74.47	74.12	59.90	74.37	72.21
90%	90.13	89.15	82.63	89.99	87.57
95%	95.10	94.23	91.44	95.00	93.13
99%	98.98	98.55	98.73	98.91	98.10
$n = 150$
50%	49.91	49.49	46.30	49.81	48.13
75%	75.03	74.25	72.68	74.95	72.45
90%	89.82	89.31	89.18	89.74	87.76
95%	95.05	94.37	94.87	94.98	93.15
99%	98.91	98.62	99.27	98.86	98.14
$n = 250$
50%	50.53	50.64	46.99	50.44	47.94
75%	75.01	74.97	72.41	74.91	72.31
90%	89.96	89.72	88.95	89.98	87.75
95%	95.11	94.58	94.99	95.13	93.16
99%	99.04	98.70	99.48	99.06	98.14
$n = 500$
50%	49.25	49.34	48.67	49.48	46.61
75%	74.50	74.29	73.85	74.28	70.91
90%	90.02	89.56	89.22	89.99	86.47
95%	95.05	94.77	94.77	95.13	92.52
99%	99.01	99.01	99.03	99.04	98.23

Table 10. Estimated median interval length.

	SwiZs		Boot		BA		RSwiZs		RBoot
$α$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$
$n = 35$
50%	2.19	1.85	7.52	6.04	0.26	0.34	1.89	1.64	7.97	6.73
75%	4.79	3.92	216.08	179.86	0.46	0.54	3.84	3.37	27.20	23.62
90%	11.18	8.56	9710.48	8673.53	1.31	1.09	6.96	5.97	86.42	75.85
95%	24.30	18.00	2.55 $\times 10^{4}$	2.18 $\times 10^{4}$	8.99	8.89	9.89	7.92	161.13	142.10
99%	2488.08	1849.66	1.19 $\times 10^{5}$	1.05 $\times 10^{5}$	3.18 $\times 10^{9}$	3.30 $\times 10^{9}$	22.62	17.10	435.99	401.28
$n = 50$
50%	1.78	1.51	3.61	2.98	0.39	0.42	1.56	1.34	5.04	4.20
75%	3.60	2.97	10.55	8.78	0.66	0.68	3.11	2.65	14.89	12.37
90%	6.78	5.41	642.67	551.95	1.22	0.94	5.57	4.83	44.67	38.37
95%	10.78	8.38	7.40 $\times 10^{3}$	6.31 $\times 10^{3}$	6.13	5.27	7.80	6.70	84.42	73.24
99%	54.20	39.06	5.57 $\times 10^{4}$	4.82 $\times 10^{4}$	1.09 $\times 10^{7}$	1.04 $\times 10^{7}$	15.60	12.65	231.61	202.96
$n = 100$
50%	1.26	1.06	1.69	1.39	0.64	0.60	1.19	1.01	2.73	2.27
75%	2.32	1.92	3.32	2.74	1.08	1.02	2.23	1.87	6.01	5.01
90%	3.74	3.04	6.28	5.20	1.55	1.36	3.67	3.03	13.00	10.88
95%	4.92	3.94	10.30	8.58	1.93	1.54	4.89	4.00	22.10	18.69
99%	8.58	6.63	181.34	153.63	20.11	16.79	8.41	6.95	64.18	55.35
$n = 150$
50%	1.02	0.86	1.21	1.01	0.71	0.62	1.00	0.85	2.02	1.68
75%	1.82	1.52	2.24	1.88	1.23	1.08	1.80	1.52	4.00	3.35
90%	2.78	2.30	3.71	3.11	1.78	1.59	2.80	2.32	7.50	6.28
95%	3.52	2.89	5.05	4.26	2.12	1.90	3.58	2.95	11.12	9.34
99%	5.38	4.35	10.59	8.97	2.86	2.27	5.62	4.52	26.58	22.47
$n = 250$
50%	0.78	0.66	0.85	0.72	0.64	0.55	0.79	0.66	1.45	1.21
75%	1.36	1.15	1.52	1.29	1.13	0.96	1.38	1.16	2.68	2.24
90%	2.01	1.69	2.34	1.99	1.68	1.44	2.07	1.72	4.41	3.68
95%	2.48	2.08	2.97	2.52	2.07	1.78	2.56	2.12	5.94	4.97
99%	3.56	2.92	4.72	4.01	2.96	2.55	3.69	3.01	10.84	9.10
$n = 500$
50%	0.55	0.46	0.57	0.48	0.50	0.42	0.56	0.47	0.97	0.81
75%	0.94	0.80	0.99	0.84	0.87	0.74	0.96	0.81	1.71	1.43
90%	1.37	1.16	1.47	1.25	1.27	1.08	1.41	1.18	2.63	2.20
95%	1.66	1.40	1.80	1.53	1.54	1.32	1.71	1.43	3.31	2.78
99%	2.27	1.90	2.55	2.16	2.16	1.83	2.35	1.95	5.05	4.22

Table 11. Performance of point estimators.

	SwiZs: mean		SwiZs: median		MLE		AB		RSwiZs: mean		RSwiZs: median		WMLE
	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$	$θ_{1}$	$θ_{2}$
Mean bias
$n = 35$	2511.13	2226.09	2504.27	2230.19	2492.15	2241.82	-1.38 $\times 10^{12}$	-1.34 $\times 10^{12}$	13.33	11.50	13.38	11.53	13.78	12.10
$n = 50$	832.02	739.28	829.87	739.77	827.45	742.50	-1.54 $\times 10^{11}$	-1.55 $\times 10^{11}$	5.99	5.19	6.07	5.22	6.52	5.70
$n = 100$	45.96	37.47	45.71	37.28	45.81	37.48	-6.65 $\times 10^{8}$	-5.22 $\times 10^{8}$	1.20	1.03	1.26	1.05	1.72	1.47
$n = 150$	1.03	0.91	0.96	0.82	1.06	0.92	-1.60 $\times 10^{4}$	-1.48 $\times 10^{4}$	0.48	0.42	0.52	0.43	0.96	0.82
$n = 250$	0.17	0.15	0.15	0.12	0.21	0.18	-0.02	-0.02	0.20	0.18	0.21	0.17	0.62	0.53
$n = 500$	0.08	0.07	0.07	0.06	0.10	0.08	0.00	0.00	0.08	0.08	0.08	0.06	0.45	0.39
Median bias
$n = 35$	0.4583	0.4894	0.0538	0.0276	0.5885	0.4654	-1.5551	-1.2966	0.2523	0.3257	0.0561	0.0309	0.9571	0.7846
$n = 50$	0.2083	0.2374	0.0250	0.0197	0.3684	0.3008	-1.1319	-0.9168	0.1691	0.2039	0.0335	0.0213	0.7112	0.5986
$n = 100$	0.0801	0.0824	0.0191	0.0135	0.1770	0.1389	-0.4093	-0.3267	0.0813	0.0905	0.0228	0.0195	0.5025	0.4289
$n = 150$	0.0358	0.0434	0.0051	0.0021	0.1011	0.0851	-0.2259	-0.1848	0.0385	0.0470	0.0063	0.0041	0.4140	0.3623
$n = 250$	0.0151	0.0265	-0.0022	0.0028	0.0541	0.0521	-0.1255	-0.1011	0.0184	0.0268	-0.0017	0.0029	0.3686	0.3268
$n = 500$	0.0129	0.0150	0.0050	0.0046	0.0331	0.0275	-0.0560	-0.0473	0.0145	0.0163	0.0049	0.0034	0.3449	0.3056
Root mean squared error
$n = 35$	17263.26	15552.08	17223.34	15587.83	17137.54	15667.69	2.97 $\times 10^{13}$	2.95 $\times 10^{13}$	59.16	50.35	59.00	50.44	58.45	50.95
$n = 50$	7996.07	7382.94	7982.45	7395.00	7957.28	7418.68	5.15 $\times 10^{12}$	5.62 $\times 10^{12}$	27.55	24.08	27.52	24.13	27.32	24.35
$n = 100$	1331.57	1055.16	1330.24	1056.18	1328.51	1057.59	4.41 $\times 10^{10}$	3.36 $\times 10^{10}$	6.15	5.21	6.22	5.27	6.26	5.37
$n = 150$	36.30	32.42	36.27	32.44	36.24	32.48	1.11 $\times 10^{6}$	1.06 $\times 10^{6}$	2.46	2.13	2.56	2.20	2.70	2.34
$n = 250$	0.77	0.66	0.75	0.63	0.78	0.66	0.58	0.49	0.92	0.79	1.01	0.85	1.26	1.07
$n = 500$	0.46	0.39	0.46	0.38	0.47	0.40	0.42	0.35	0.49	0.41	0.50	0.42	0.77	0.66
Mean absolute deviation
$n = 35$	2.1893	2.0002	1.5119	1.2537	2.0914	1.7082	0.5845	0.3672	1.7446	1.4744	1.5891	1.2890	2.5445	2.0878
$n = 50$	1.5636	1.4044	1.2510	1.0720	1.5649	1.3200	0.4261	0.3293	1.3908	1.2241	1.2901	1.0831	1.9384	1.6231
$n = 100$	0.9693	0.8220	0.8979	0.7479	1.0042	0.8306	0.5443	0.4800	0.9576	0.8300	0.9091	0.7685	1.2615	1.0552
$n = 150$	0.7571	0.6546	0.7291	0.6191	0.7807	0.6627	0.5752	0.4942	0.7685	0.6633	0.7396	0.6308	0.9975	0.8454
$n = 250$	0.5871	0.4942	0.5737	0.4782	0.5991	0.4995	0.5058	0.4256	0.5959	0.4984	0.5810	0.4827	0.7737	0.6368
$n = 500$	0.4084	0.3440	0.4041	0.3390	0.4130	0.3456	0.3818	0.3200	0.4127	0.3516	0.4076	0.3452	0.5295	0.4502

Table 12. Average computational time in seconds to approximate a distribution on $S = 10{,}000$ points.

	SwiZs	Parametric bootstrap
$N = 25$	1.87	0.20
$N = 100$	6.49	0.73
$N = 400$	35.60	4.58
$N = 1,600$	245.59	37.80

Table 13. Estimated coverage probabilities.

	SwiZs					parametric bootstrap
$α$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$
$n = 5$ $m = 5$
50%	51.78	53.87	48.54	54.18	70.38	42.37	43.61	44.60	32.27	28.10
75%	76.89	78.87	73.58	81.67	89.09	64.17	66.19	66.20	48.35	41.80
90%	91.87	92.93	88.89	94.10	98.80	78.38	81.94	81.07	61.72	46.87
95%	96.45	97.04	94.32	97.83	99.98	84.58	88.45	86.61	68.68	47.30
99%	99.54	99.71	98.73	99.87	100.00	91.93	95.40	93.54	79.03	47.61
$n = 10$ $m = 10$
50%	50.10	51.20	50.70	50.65	62.48	46.25	45.37	50.05	40.01	39.84
75%	75.16	77.08	74.92	75.64	85.74	69.81	68.68	74.48	60.54	59.68
90%	90.38	92.03	90.20	90.61	95.49	84.81	84.32	88.65	75.01	73.29
95%	95.23	96.40	95.23	94.96	97.86	90.71	90.32	93.95	81.30	79.29
99%	99.16	99.54	99.25	99.09	99.64	96.45	96.76	98.41	89.37	84.71
$n = 20$ $m = 20$
50%	50.78	49.10	49.97	49.74	49.85	49.03	47.58	49.63	45.40	45.75
75%	75.28	74.45	75.24	74.89	75.88	73.08	71.87	75.06	67.66	66.98
90%	90.06	89.79	89.95	90.28	90.75	87.59	87.02	89.73	81.76	81.83
95%	95.05	94.83	94.79	95.06	95.97	93.10	92.69	94.59	87.48	87.52
99%	98.96	98.97	98.93	98.90	99.50	97.77	97.82	98.75	94.20	94.15
$n = 40$ $m = 40$
50%	49.52	48.48	49.80	52.42	53.19	49.41	48.92	49.94	47.47	47.95
75%	74.70	72.86	75.27	77.89	78.39	74.22	73.34	75.63	70.93	71.46
90%	90.07	88.10	89.69	91.81	92.46	89.30	87.99	89.70	85.62	86.34
95%	95.15	94.09	94.71	96.27	96.59	94.37	93.65	94.82	91.29	91.82
99%	99.01	98.62	98.99	99.37	99.43	98.56	98.39	98.90	96.80	96.67

Table 14. Estimated median interval length.

	SwiZs					parametric bootstrap
$α$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$
$n = 5 m = 5$
50%	0.3303	0.2243	0.4976	1.2050	0.1755	0.2712	0.1728	0.4453	1.5575	0.0005
75%	0.5940	0.3882	0.8552	2.0974	0.4491	0.4606	0.2947	0.7607	3.5624	0.0012
90%	0.9314	0.5682	1.2436	3.1286	1.1761	0.6577	0.4217	1.0909	12.9753	0.0024
95%	1.1956	0.6934	1.5222	3.9149	3.7094	0.7845	0.5031	1.3051	13.9626	0.0036
99%	1.8698	1.0031	2.3468	9.8944	8.6739	1.0290	0.6623	1.7335	15.3409	0.0070
$n = 10 m = 10$
50%	0.2230	0.1198	0.2136	0.7311	1.0080	0.2038	0.1069	0.2099	0.7676	1.6745
75%	0.3902	0.2068	0.3638	1.2540	1.8614	0.3471	0.1818	0.3594	1.3370	8.6134
90%	0.5817	0.3008	0.5210	1.8131	2.9290	0.4953	0.2601	0.5144	1.9844	11.7988
95%	0.7162	0.3658	0.6218	2.1764	3.9196	0.5887	0.3097	0.6140	2.4462	12.6107
99%	1.0284	0.5130	0.8177	2.8992	7.9667	0.7745	0.4075	0.8055	3.6688	13.8600
$n = 20 m = 20$
50%	0.1547	0.0699	0.1006	0.4750	0.5665	0.1482	0.0674	0.0998	0.4733	0.6557
75%	0.2672	0.1205	0.1718	0.8065	0.9934	0.2530	0.1149	0.1708	0.8102	1.1462
90%	0.3900	0.1752	0.2455	1.1499	1.4857	0.3622	0.1643	0.2447	1.1655	1.7189
95%	0.4718	0.2117	0.2926	1.3701	1.8096	0.4311	0.1957	0.2918	1.3964	2.1535
99%	0.6436	0.2894	0.3833	1.8121	2.4686	0.5645	0.2569	0.3825	1.8686	3.4277
$n = 40 m = 40$
50%	0.1056	0.0452	0.0490	0.2816	0.1124	0.1056	0.0451	0.0493	0.3194	0.3628
75%	0.1810	0.0772	0.0834	0.4466	0.3469	0.1804	0.0770	0.0839	0.5429	0.6249
90%	0.2596	0.1107	0.1191	0.6923	0.6031	0.2576	0.1102	0.1197	0.7759	0.9014
95%	0.3100	0.1323	0.1420	0.8523	0.7672	0.3070	0.1313	0.1423	0.9257	1.0804
99%	0.4094	0.1747	0.1870	1.1467	1.1309	0.4020	0.1724	0.1864	1.2163	1.4420

Table 15. Performance of point estimators.

	SwiZs: mean					SwiZs: median					Maximum likelihood
	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$
Mean bias $\times 100$
$N = 25$	-0.0647	-0.3827	-3.1193	1.8554	1.5502	-0.0761	-0.3732	-2.4630	6.4149	3.3175	-0.0708	-0.4203	-1.2985	-5.8224	-0.3807
$N = 100$	0.2843	-0.0320	-0.2911	2.4583	0.6119	1.6374	-0.1452	0.7182	-1.8475	1.8127	0.0685	0.0314	-0.0166	-2.8806	-0.6425
$N = 400$	0.0163	0.0374	0.0739	1.2927	0.0944	0.0149	0.0386	0.0514	0.9056	0.1565	0.0245	0.0417	0.0133	-1.3425	-0.2785
$N = 1,600$	0.0010	0.0385	0.0183	-0.9811	-0.2965	-0.0011	0.0394	0.0120	-1.1600	-0.2121	0.0130	0.0343	-0.0021	-0.6265	-0.1253
Median bias $\times 100$
$N = 25$	-0.0341	-0.2171	-3.8669	-6.5130	-0.0876	-0.0018	-0.2114	-3.3736	-0.8483	0.0121	0.0327	-0.2932	-2.1012	-10.1138	-3.9990
$N = 100$	0.4345	0.0289	-0.4759	0.1208	-0.0951	5.3959	-1.4459	0.5589	-0.7598	0.0354	0.1838	0.0069	-0.1815	-4.8730	-1.2975
$N = 400$	0.0020	-0.0378	0.0422	0.4196	-0.1116	0.0149	-0.0286	0.0211	-0.0405	-0.0068	-0.0140	-0.0261	-0.0220	-2.1176	-0.4517
$N = 1,600$	-0.0332	0.0500	0.0082	-1.0639	-0.1813	-0.0060	0.0543	0.0041	-0.0818	-0.0021	-0.0098	0.0480	-0.0098	-1.1378	-0.1833
Root mean squared error $\times 100$
$N = 25$	24.6914	16.0625	9.2357	27.0499	6.2389	24.7198	16.0766	8.6916	24.2014	8.3432	24.7291	16.0853	8.1605	18.5249	6.8108
$N = 100$	16.4663	8.8542	3.9374	14.7976	3.5251	14.3449	7.7017	4.1388	11.5080	3.3680	16.5630	8.7967	3.8703	12.0714	3.1774
$N = 400$	11.4174	5.2549	1.8779	9.1330	1.8623	11.4174	5.2550	1.8752	8.9859	1.7515	11.4182	5.2554	1.8689	8.2404	1.7092
$N = 1,600$	7.8721	3.4528	0.9119	4.7681	0.6698	7.9083	3.4524	0.9117	4.4706	0.5759	7.8981	3.4532	0.9110	5.7583	1.0216
Mean absolute deviation $\times 100$
$N = 25$	24.4139	15.8780	8.2892	23.3025	0.6468	24.4872	15.9113	8.1000	17.1094	0.2293	24.4752	15.9014	7.8528	15.1530	0.0015
$N = 100$	16.7958	8.9936	3.8427	13.3232	2.8386	13.0610	6.2264	2.9059	8.0151	1.4453	16.9915	8.9079	3.8351	10.8654	3.0194
$N = 400$	11.2283	5.3202	1.8651	8.8004	1.8018	11.2417	5.3225	1.8695	8.8299	1.4204	11.2634	5.3160	1.8653	7.8895	1.6541
$N = 1,600$	7.9220	3.4259	0.9115	4.3804	0.5033	7.9954	3.4277	0.9108	0.2978	0.0214	7.9745	3.4325	0.9082	5.7040	0.9952

Table 16. Asymptotic results.

	Coverage probability					Median interval length
$α$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$	$β_{0}$	$β_{1}$	$σ_{ϵ}^{2}$	$σ_{α}^{2}$	$σ_{γ}^{2}$
$n = 5$ $m = 5$
50%	43.16	44.94	48.66	40.42	36.49	0.2770	0.1791	0.1043	0.1868	0.0375
75%	67.51	69.17	73.83	64.17	70.73	0.4942	0.3180	0.1836	0.3625	0.0945
90%	83.68	86.75	88.79	81.88	96.33	0.7612	0.4897	0.2764	0.6358	0.2010
95%	90.37	93.23	93.83	88.93	98.88	0.9671	0.6226	0.3431	0.8982	0.3095
99%	97.04	98.93	98.54	96.95	99.75	1.4991	0.9746	0.4982	1.8138	0.7069
$n = 10$ $m = 10$
50%	46.38	45.98	50.75	45.86	44.91	0.2060	0.1082	0.0525	0.1422	0.0383
75%	70.85	71.03	75.36	70.65	68.84	0.3591	0.1888	0.0901	0.2583	0.0690
90%	87.23	87.08	90.04	86.58	85.82	0.5321	0.2806	0.1304	0.4088	0.1078
95%	93.20	93.27	95.12	92.37	93.09	0.6534	0.3449	0.1569	0.5299	0.1392
99%	98.41	98.53	98.95	98.02	99.59	0.9264	0.4903	0.2111	0.8593	0.2265
$n = 20$ $m = 20$
50%	49.20	47.62	49.92	48.00	47.31	0.1491	0.0677	0.0251	0.1048	0.0216
75%	73.66	72.54	75.09	72.49	72.86	0.2571	0.1168	0.0429	0.1845	0.0381
90%	88.70	88.34	89.97	88.33	88.10	0.3742	0.1700	0.0616	0.2774	0.0573
95%	94.09	94.02	94.81	93.80	93.72	0.4524	0.2055	0.0735	0.3445	0.0712
99%	98.56	98.61	98.94	98.40	98.59	0.6167	0.2801	0.0972	0.5019	0.1038
$n = 40$ $m = 40$
50%	49.46	49.32	49.79	48.67	49.01	0.1060	0.0452	0.0122	0.0748	0.0136
75%	74.46	73.78	75.28	73.52	74.77	0.1819	0.0776	0.0209	0.1295	0.0236
90%	89.88	88.76	89.70	88.89	89.83	0.2623	0.1119	0.0299	0.1899	0.0346
95%	94.95	94.28	94.85	94.22	94.71	0.3148	0.1343	0.0356	0.2310	0.0420
99%	98.98	98.86	98.99	98.77	98.82	0.4212	0.1797	0.0468	0.3194	0.0582

Table 17. Estimated coverage probabilities.

	SwiZs			Indirect inference			Parametric bootstrap
	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$
50%	46.92	38.68	56.73	40.59	9.95	54.59	18.31	10.23	20.96
75%	71.56	55.41	81.80	68.01	34.11	84.50	32.70	20.96	37.38
90%	87.55	67.77	94.47	87.62	57.13	96.04	48.62	35.24	53.71
95%	93.16	74.78	97.97	94.66	70.22	98.75	57.05	46.03	63.21
99%	98.17	90.06	99.90	98.84	94.89	99.94	71.99	65.43	77.64

Table 18. Estimated median interval length.

	SwiZs			Indirect inference			Parametric bootstrap
	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$
50%	0.0235	0.0805	0.1379	0.0382	0.0468	0.1368	0.0263	0.0420	0.1134
75%	0.0404	0.1467	0.2357	0.0911	0.0978	0.2389	0.0460	0.0757	0.2051
90%	0.0585	0.2207	0.3378	0.1563	0.1914	0.3835	0.0708	0.1185	0.3131
95%	0.0705	0.2733	0.4032	0.2225	0.2952	0.5432	0.0895	0.1533	0.3855
99%	0.0952	0.3934	0.5407	0.5331	0.7152	1.6084	0.1327	0.2514	0.5562

Table 19. Estimated coverage probabilities under different conditions than Table 17.

	SwiZs: starting value is $𝜽_{0}$			SwiZs: sample size is $n = 1,000$
	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$
50%	50.22	58.64	49.98	50.07	46.06	49.37
75%	75.24	91.25	74.24	75.24	71.82	74.77
90%	90.52	99.82	89.55	89.73	89.84	89.49
95%	95.37	100.00	94.87	94.81	95.41	94.69
99%	99.09	100.00	99.02	98.95	99.28	99.10

Table 20. Performance of point estimators.

	SwiZs: mean			SwiZs: median			Indirect inference			Indirect inference: mean			Indirect inference: median
	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$	$θ_{1}$	$θ_{2}$	$θ_{3}$
Mean bias	0.0037	-0.0149	0.0006	0.0057	-0.0096	0.0002	2 $\times 10^{90}$	3 $\times 10^{90}$	1.6107	0.0309	0.0254	3 $\times 10^{89}$	0.0157	0.0297	0.0201
Median bias	0.0026	-0.0219	-0.0044	0.0046	-0.0157	-0.0041	0.0135	0.0270	0.0181	0.0295	0.0235	0.0772	0.0150	0.0257	0.0200
RMSE	0.0197	0.0764	0.0890	0.0200	0.0762	0.0888	2 $\times 10^{92}$	3 $\times 10^{92}$	135.72	0.0451	0.0976	3 $\times 10^{91}$	0.0254	0.1041	0.0851
MAD	0.0192	0.0705	0.0884	0.0190	0.0718	0.0882	0.0307	0.1069	0.1405	0.0365	0.0918	0.1109	0.0182	0.0968	0.0823

Equations

x = g(θ, u).

\hat{π}_{n} \in Π_{n} = \operatorname*{argzero}_{π \in Π} \frac{1}{n} \sum_{i=1}^{n} ψ(g(θ_{0}, u_{0i}), π) = \operatorname*{argzero}_{π \in Π} Ψ_{n}(θ_{0}, u_{0}, π),

\hat{θ}_{n}^{(s)} \in Θ_{n}^{(s)} = \operatorname*{argzero}_{θ \in Θ} \frac{1}{n} \sum_{i=1}^{n} ψ(g(θ, u_{si}^{*}), \hat{π}_{n}) = \operatorname*{argzero}_{θ \in Θ} Ψ_{n}(θ, u_{s}^{*}, \hat{π}_{n}),

\hat{θ}_{EMM, n}^{(s)} \in Θ_{EMM, n}^{(s)} = \operatorname*{argzero}_{θ \in Θ} \frac{1}{H} \sum_{h=1}^{H} Ψ_{n}(θ, u_{sh}^{*}, \hat{π}_{n}),

\hat{π}_{II, n}^{(s)}(θ) \in Π_{II, n}^{(s)} = \operatorname*{argzero}_{π \in Π} Ψ_{n}(θ, u_{s}^{*}, π), \quad θ \in Θ,

\hat{θ}_{II, n}^{(s)} \in Θ_{II, n}^{(s)} = \operatorname*{argzero}_{θ \in Θ} d(\hat{π}_{n}, \hat{π}_{II, n}^{(s)}(θ)), \quad \hat{π}_{n} \in Π_{n}, \quad \hat{π}_{II, n}^{(s)} \in Π_{n}^{(s)},

\hat{θ}_{II, B, m}^{(s)} \in Θ_{II, B, m}^{(s)} = \operatorname*{argzero}_{θ \in Θ} d(\hat{π}_{n}, \frac{1}{B} \sum_{b=1}^{B} \hat{π}_{II, b, m}^{(s)}(θ)),

Θ_{n}^{(s)} = Θ_{II, n}^{(s)}.

\hat{θ}_{Boot, n}^{(s)} \in Θ_{Boot, n}^{(s)} = \operatorname*{argzero}_{θ \in Θ} Ψ_{n}(\hat{π}_{n}, u_{s}^{*}, θ), \quad s \in N_{S}^{+}.

Ψ_{n}(θ, u_{s}, π) = Ψ_{n}(π, u_{s}, θ) = 0.

Θ_{n}^{(s)} = Θ_{Boot, n}^{(s)}.

d(\hat{π}_{n}, \hat{π}_{II, n}^{(s)}(θ^{⋆})) \leq ε,

\lim_{ε \downarrow 0} \Pr(d(\hat{π}_{n}, \hat{π}_{II, n}^{(s)}(θ^{⋆})) \leq ε) = 1, \quad θ^{⋆} \sim P.

Θ_{n}^{(s)} = \lim_{ε \downarrow 0} Θ_{ABC, n}^{(s)}(ε).

\hat{θ}_{GFD, n}^{(s)} \in Θ_{GFD, n}^{(s)} = \operatorname*{argzero}_{θ \in Θ} d(x, g(θ, u_{s}^{*})).

\lim_{\varepsilon\downarrow 0}\left[\operatorname*{argmin}_{\bm{\theta}\in\bm{\Theta}}\left\lVert\mathbf{x}-\mathbf{g}\left(\bm{\theta},\mathbf{u}^{\ast}_{s}\right)\right\rVert\Big{|}\min_{\bm{\theta}}\left\lVert\mathbf{x}-\mathbf{g}\left(\bm{\theta},\mathbf{u}^{\ast}_{s}\right)\right\rVert\leq\varepsilon\right],

Θ_{II, n}^{(s)} = Θ_{GFD, n}^{(s)}.

Θ_{n}^{(s)} = Θ_{GFD, n}^{(s)}.

\Pr(\hat{θ}_{n} \in C_{\hat{π}_{n}} \mid \hat{π}_{n}) \geq 1 - α, \quad α \in (0, 1),

C_{\hat{π}_{n}} = Θ_{n} \setminus \{\underline{Q}_{α_{1}} \cup \overline{Q}_{α_{2}}\}, \quad α_{1} + α_{2} = α.

θ_{0} = \hat{θ}_{n} = \operatorname*{argzero}_{θ \in Θ} Ψ_{n}(θ, u_{0}, \hat{π}_{n}).

θ_{0} \in Θ_{n}.

x \overset{d}{=} g(θ, u) \overset{d}{=} \mathbcal{g} \circ (\mathrm{id}_{Θ} \times b)(θ, u) \overset{d}{=} \mathbcal{g}(θ, v),

g(θ, u, σ^{2}) = θ + σ \sqrt{-2 \ln(u_{1})} \cos(2 π u_{2}),

Ψ_{n}(θ, u^{*}, π) = φ_{p}(θ, w, π),

\hat{π}_{n}

\hat{θ}_{n}

\int_{Θ_{n}} f_{\hat{θ}_{n} \mid \hat{π}_{n}}(\hat{θ}_{n} \mid \hat{π}_{n}) \, dθ = \int_{W_{n}} f(a(w) \mid \hat{π}_{n}) \, |J(w \mid \hat{π}_{n})| \, dw,

J(w \mid \hat{π}_{n}) = \frac{\det(D_{θ} φ_{\hat{π}_{n}}(a(w), w))}{\det(D_{w} φ_{\hat{π}_{n}}(a(w), w))}.

\hat{θ}_{II, n}^{(s)} \in Θ_{II, n}^{(s)} = \operatorname*{argzero}_{θ \in Θ} d[h(x_{0}), g(θ, w_{s})].

Taxonomy

Topics: Statistical Methods and Inference · Advanced Statistical Methods and Models · Markov Chains and Monte Carlo Methods

Full text

A simple recipe for making accurate parametric inference in finite sample

Stéphane Guerrier

Department of Statistics

Pennsylvania State University

University Park, PA 16802, USA

[email protected]

Mucyo Karemera

Department of Statistics

Pennsylvania State University

University Park, PA 16802, USA

[email protected]

Samuel Orso

Geneva School of Economics and Management

University of Geneva

Geneva, Switzerland

[email protected]

Maria-Pia Victoria-Feser

Geneva School of Economics and Management

University of Geneva

Geneva, Switzerland

[email protected]

1 Introduction

The algorithmic principle of the bootstrap method is quite simple: reiterate the mechanism that produces an estimator on pseudo-samples. But when it comes to estimators that are numerically complicated to obtain, the bootstrap is less attractive to use due to the numerical burden. If one estimator is hard to find, reiterating compounds this issue. Paraphrasing Emile in the French comedy La Cité de la Peur: we can implement the bootstrap when the estimator is simple to obtain or we can compute a numerically complex point estimator, but it is too computationally cumbersome to do both.
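
The principle described above, re-running the estimator on pseudo-samples, can be sketched as follows. This is only an illustration under assumptions that are not in the paper: the model is a hypothetical exponential model with unknown rate, and the names `estimate_rate` and `parametric_bootstrap` are made up for the example. Here the estimator is a one-line formula; the computational issue raised above arises when each call to the estimator is itself an expensive numerical optimization.

```python
import random


def estimate_rate(x):
    """Point estimator: maximum likelihood estimate of an exponential rate (1 / sample mean)."""
    return len(x) / sum(x)


def parametric_bootstrap(x, B, rng):
    """Re-run the estimator on B pseudo-samples simulated from the fitted model."""
    n = len(x)
    theta_hat = estimate_rate(x)
    replicates = []
    for _ in range(B):
        # Simulate a pseudo-sample from the model fitted at theta_hat.
        pseudo_sample = [rng.expovariate(theta_hat) for _ in range(n)]
        # One full re-estimation per replicate: this is the costly step in general.
        replicates.append(estimate_rate(pseudo_sample))
    return theta_hat, replicates


rng = random.Random(0)
x = [rng.expovariate(2.0) for _ in range(30)]  # hypothetical observed sample
theta_hat, reps = parametric_bootstrap(x, B=1000, rng=rng)
print(theta_hat, min(reps), max(reps))
```

The total cost is roughly B times the cost of one estimation, which is why a numerically complex estimator makes the procedure cumbersome.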

Although this limitation is purely practical and tends to be reduced by the ever-increasing computational power at our disposal, everyone would agree that it is nonetheless attractive to have a method that frees the user from the computational burden, or at least provides an answer within a reasonable time. In this chapter, we explore a special case of the efficient method of moments ([1]) that encompasses both the computation of numerically complex estimators and of a “bootstrap distribution” at a reduced cost. The idea deviates from the algorithmic principle of the bootstrap: the proposed method no longer attempts to reproduce the sampling mechanism that led to an estimator, but instead tries to find every estimator that may have produced the observed sample, or more often, some statistics on the sample.

The idea is not new, however: several methods follow this pattern. The indirect inference method ([2, 3]) similarly attempts to find the point estimate that leads to statistics obtained from simulated samples as close as possible to the same statistics on the observed sample. Mostly used in econometric and financial contexts, indirect inference has been successfully applied to the estimation of stable distributions ([4]), stochastic volatility models ([5, 6]), financial contingent claims ([7]), dynamic panel models ([8]), dynamic stochastic equilibrium models ([9]), continuous time models ([10]), diffusion processes ([11]); but it has also been used in queueing theory ([12]), robust estimation of generalized linear latent variable models ([13]), robust income distribution ([14]), high dimensional generalized linear model and penalized regression ([15]). Often presented as the Bayesian counterpart of indirect inference, approximate Bayesian computation ([16, 17]) aims at finding the values that match the statistics computed on simulated samples and the statistics on the observed sample, up to a certain degree of approximation. The method has however grown in a different context of applications. For example, it has been successfully employed in population genetics ([18]), in ecology ([19]) and in evolutionary biology ([20, 21]). Less popular, R.A. Fisher’s fiducial inference (see for instance [22, 23, 24, 25, 26]) and related methods such as the generalized fiducial inference ([27, 28, 29]), D.A.S. Fraser’s structural inference ([30], see also [31]), Dempster-Shafer theory ([32, 33]) and inferential models ([34, 35, 36]) follow a similar pattern, the main idea being to find all possible values that allow one to generate simulated samples as close as possible to the observed sample, but without specifying any prior distribution.

Regardless of the differences in philosophy among the aforementioned methods, they have in common that they are usually very demanding in computational resources when implemented for non-trivial applications. This is a major difference with the approach we endorse in this chapter. By letting the statistics be the solution of an estimating function of the same dimension as the quantity of interest, we demonstrate that it is possible to bypass the computation of the same statistics on simulated samples by directly estimating the quantity of interest within the estimating function, thereby resulting in a potentially significant gain in computational time. In Section 3, we demonstrate in finite sample that, under some weak conditions, the estimators resulting from our approach are equivalent to the estimators one would have obtained using certain forms of indirect inference, approximate Bayesian computation or fiducial inference approaches, whereas they are different from parametric bootstrap estimators, except in the case of a location parameter. This section innovates on two aspects. First, it implies that our approach can be employed in practice to solve problems that relate to indirect inference, approximate Bayesian computation and fiducial inference in a computationally efficient manner. Second, it formally proves or disproves the link between the aforementioned methods, and this in the most general situation, as the results remain true for any sample size.

Constructing tests or confidence regions that control the error rates in the long run is probably one of the most important problems in statistics, at least since the famous article of Neyman and Pearson [37]. Yet, the theoretical justification for most methods in statistics is asymptotic. The bootstrap for example, despite its simplicity and its widespread usage, is an asymptotic method ([38]); for the other methods, see for example [39] for approximate Bayesian computation, [2] for indirect inference and [29] for generalized fiducial inference. There is in general no claim about the exactness of the inferential procedures in finite sample (see [36] for one of the exceptions). In Section 4, we study theoretically the frequentist error rates of confidence regions constructed on the distribution issued from our proposed approach. In particular, we demonstrate under some strong, but frequently encountered, conditions that the confidence regions have exact coverage probabilities in finite sample. Asymptotic justification is nonetheless provided in Section 5. In addition, we compare our results with the asymptotic properties of the indirect inference method and conclude that, surprisingly, both approaches reach the same conclusion but under distinct conditions. Some leads are suggested, but we are unable to elucidate the fundamental reason behind this discrepancy.

Although the proposed method is first and foremost computational, surprisingly in some situations explicit closed-form solutions may be found. We gather a non-exhaustive number of such examples, some important, in Section 6. The numerical study in Section 7 ends this chapter. We study via Monte Carlo simulations the coverage probabilities obtained from our approach and compare with others on a variety of problems. We conclude that in most situations, exact coverage probability computed within a reasonable computational time can be claimed with our method.

2 Setup

Let $\mathbb{N}$ ( $\mathbb{N}^{+}$ ) be the set of all non-negative integers including (excluding) 0. For any positive integer $n$ , let $\mathbb{N}_{n}$ be the set whose elements are the integers $0,1,2,\dots,n$ ; similarly $\mathbb{N}^{+}_{n}=\{1,2,\dots,n\}$ .

We consider a sequence of random variables $\{\mathbf{x}_{i}:i\in\mathbb{N}^{+}_{n}\}$ , possibly multivariate, assumed to follow a known distribution $F_{\bm{\theta}}$ , indexed by a vector of parameters $\bm{\theta}\in\bm{\Theta}\subset{\rm I\!R}^{p}$ . We suppose that it is easy to generate artificial samples $\mathbf{x}^{\ast}$ from $F_{\bm{\theta}}$ . Specifically, we generate the random variable $\mathbf{x}$ with a known algorithm that associates $\bm{\theta}$ and a random variable $\mathbf{u}$ . We denote the generating mechanism as follows:

$\mathbf{x}=\mathbf{g}(\bm{\theta},\mathbf{u}).$

The random variable $\mathbf{u}$ follows a known model $F_{\mathbf{u}}$ that does not depend on $\bm{\theta}$ . Using this notation, the observed sample is $\mathbf{x}_{0}=\mathbf{g}(\bm{\theta}_{0},\mathbf{u}_{0})$ and the artificial sample is $\mathbf{x}^{\ast}=\mathbf{g}(\bm{\theta},\mathbf{u}^{\ast})$ , where $\mathbf{u}_{0}$ and $\mathbf{u}^{\ast}$ are realizations of $\mathbf{u}$ .

Example 1 (Normal).

Suppose $\mathbf{x}\sim\mathcal{N}(\theta,1)$ , then four examples of possible generating mechanisms are:

1. $\mathbf{g}(\bm{\theta},\mathbf{u})=\bm{\theta}+\mathbf{u}$ where $\mathbf{u}\sim\mathcal{N}(0,1)$ ,

2. $\mathbf{g}(\bm{\theta},\mathbf{u})=\bm{\theta}+\sqrt{2}\operatorname*{erf}^{-1}(2\mathbf{u}-1)$ where $\mathbf{u}\sim\mathcal{U}(0,1)$ and $\operatorname*{erf}(z)=\frac{2}{\sqrt{\pi}}\int_{0}^{z}e^{-t^{2}}\mathop{}\!\mathrm{d}t$ is the error function,

3. $\mathbf{g}(\bm{\theta},\mathbf{u})=\bm{\theta}+\sqrt{-2\ln(\mathbf{u}_{1})}\cos(2\pi\mathbf{u}_{2})$ where $\mathbf{u}=(\mathbf{u}_{1},\mathbf{u}_{2})^{T}$ , $\mathbf{u}_{1}\sim\mathcal{U}(0,1)$ and $\mathbf{u}_{2}\sim\mathcal{U}(0,1)$ ,

4. $\mathbf{g}(\bm{\theta},\mathbf{u})=\bm{\theta}+\mathbf{u}_{2}\sqrt{\frac{-2\ln(\mathbf{u}_{3})}{\mathbf{u}_{3}}}$ where $\mathbf{u}=(\mathbf{u}_{1},\mathbf{u}_{2},\mathbf{u}_{3})$ , $\mathbf{u}_{3}=\mathbf{u}_{1}+\mathbf{u}_{2}$ , $\mathbf{u}_{1}\sim\mathcal{U}(0,1)$ , $\mathbf{u}_{2}\sim\mathcal{U}(0,1)$ .

A possible counter-example is the following: $\mathbf{g}(\bm{\theta},\mathbf{u})=\mathbf{u}-\bm{\theta}$ where $\mathbf{u}\sim\mathcal{N}(2\bm{\theta},1)$ . Clearly $\mathbf{x}=\mathbf{g}(\bm{\theta},\mathbf{u})$ , but this $\mathbf{g}$ is not adequate because the distribution of $\mathbf{u}$ depends on $\bm{\theta}$ .
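As a minimal sketch, the first three mechanisms of Example 1 can be written in Python and checked by simulation. The helper names (`open_unit`, `g_additive`, `g_inverse_cdf`, `g_box_muller`, `simulate`), the value $\theta=3$ , the seed and the number of draws are illustrative assumptions. Mechanism 2 relies on the standard-library normal quantile function, which equals $\sqrt{2}\operatorname*{erf}^{-1}(2u-1)$ .

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def open_unit(rng):
    """Draw u ~ U(0, 1), redrawing the (very unlikely) exact 0."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def g_additive(theta, u):
    """Mechanism 1: u ~ N(0, 1)."""
    return theta + u


def g_inverse_cdf(theta, u):
    """Mechanism 2: u ~ U(0, 1).

    sqrt(2) * erfinv(2u - 1) is the standard normal quantile function,
    available in the standard library as NormalDist().inv_cdf.
    """
    return theta + NormalDist().inv_cdf(u)


def g_box_muller(theta, u1, u2):
    """Mechanism 3: u1, u2 ~ U(0, 1), independent."""
    return theta + math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def simulate(theta, n=20000, seed=1):
    """Return n draws of x = g(theta, u) for each mechanism."""
    rng = random.Random(seed)
    return {
        "additive": [g_additive(theta, rng.gauss(0.0, 1.0)) for _ in range(n)],
        "inverse_cdf": [g_inverse_cdf(theta, open_unit(rng)) for _ in range(n)],
        "box_muller": [
            g_box_muller(theta, open_unit(rng), open_unit(rng)) for _ in range(n)
        ],
    }


draws = simulate(theta=3.0)
# Empirical mean and standard deviation of each simulated sample.
summary = {name: (fmean(x), pstdev(x)) for name, x in draws.items()}
```

In each case the distribution of $\mathbf{u}$ is free of $\bm{\theta}$ , so the same draws of $\mathbf{u}$ can be reused for any value of the parameter.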

We now define the estimators we wish to study.

Definition 2 (SwiZs).

We consider the following sequence of estimators:

[TABLE]

where $\bm{\psi}$ is an estimating function and $s\in\mathbb{N}^{+}_{S}$ . The estimators $\hat{\bm{\pi}}_{n}$ are referred to as the auxiliary estimators. Any sequence of estimators $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}^{+}_{S}\}$ is called a sequence of Switched Z-estimators, or in short, SwiZs. The collection of the solutions is $\bm{\Theta}_{n}=\cup_{s\in\mathbb{N}^{+}_{S}}\bm{\Theta}^{(s)}_{n}$ .

Remark 1.

The SwiZs in Definition 2 may arguably be viewed as a special case of the Efficient Method of Moments (EMM) estimator proposed by [1]. Indeed, to have an EMM estimator the only modification to Definition 2 is

[TABLE]

where $H\in\mathbb{N}^{+}$ . Ergo, the SwiZs and EMM coincide whenever $H=1$ . Note that in general the EMM is defined with $H$ large and $S=1$ .
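To make the switch concrete, here is a minimal sketch on a toy model that is not taken from this chapter: an exponential sample with rate $\theta$ , generated as $-\ln(u)/\theta$ , and the estimating function "sample mean minus $1/\pi$ ". It assumes, following the surrounding text, that the auxiliary estimator is the zero of $\bm{\Psi}_{n}(\bm{\theta}_{0},\mathbf{u}_{0},\bm{\pi})$ in $\bm{\pi}$ and that each SwiZs draw is the zero of $\bm{\Psi}_{n}(\bm{\theta},\mathbf{u}_{s},\hat{\bm{\pi}}_{n})$ in $\bm{\theta}$ . The names `g`, `Psi`, `solve` and `uniforms`, the bisection solver and all numerical settings are hypothetical.

```python
import math
import random


def g(theta, u):
    """Generating mechanism of an Exponential(rate=theta) sample, u in (0, 1]."""
    return [-math.log(ui) / theta for ui in u]


def Psi(theta, u, pi):
    """Toy estimating function: sample mean of g(theta, u) minus 1 / pi."""
    x = g(theta, u)
    return sum(x) / len(x) - 1.0 / pi


def solve(f, lo=1e-9, hi=1e9, iters=200):
    """Bisection root-finder; assumes a single sign change of f on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def uniforms(rng, n):
    """n draws in (0, 1], so that log() stays finite."""
    return [1.0 - rng.random() for _ in range(n)]


rng = random.Random(42)
n, S, theta0 = 50, 200, 2.0

# Observed sample x0 = g(theta0, u0); in practice only x0 is known.
u0 = uniforms(rng, n)
# Auxiliary estimator: zero of Psi in pi on the observed sample.
pi_hat = solve(lambda pi: Psi(theta0, u0, pi))

# SwiZs draws: same function, solved in theta with pi held at pi_hat.
u_draws = [uniforms(rng, n) for _ in range(S)]
swizs = [solve(lambda th, u=u: Psi(th, u, pi_hat)) for u in u_draws]
```

For this toy model each draw has the closed form $\hat{\pi}_{n}$ times the mean of $-\ln(u_{s})$ , which gives a direct check of the numerical solver.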

3 Equivalent methods

As already remarked, the SwiZs does not appear to be a new estimator. The SwiZs in fact offers a new point of view on different existing methods, as it brings several techniques under the same umbrella. In this Section, we show the equivalence or non-equivalence of the SwiZs to other existing methods, for any sample size $n$ , to conclude that the distribution obtained by the SwiZs is (approximately) a Bayesian posterior, and thereby that it is valid for the purpose of inference.

The EMM and the indirect inference estimator of [3, 2] are known to have the same asymptotic distribution when $\dim(\bm{\pi})=\dim(\bm{\theta})$ (see Proposition 4.1 in [40]). In the next result, we demonstrate that the SwiZs and a certain form of indirect inference estimator are equivalent for any $n$ .

Definition 3 (indirect inference estimators).

Let $\hat{\bm{\pi}}_{n}$ and $\{\mathbf{u}_{j}:j\in\mathbb{N}\}$ be defined as in the Definition 2. We consider the following sequence of estimators, for $s\in\mathbb{N}^{+}_{S}$ :

[TABLE]

where $d$ is a metric. We call $\{\hat{\bm{\theta}}_{\text{II},n}^{(s)}:s\in\mathbb{N}^{+}_{S}\}$ the indirect inference estimators. The collections of solutions are denoted $\bm{\Pi}_{\text{II},n}=\cup_{s\in\mathbb{N}^{+}_{S}}\bm{\Pi}_{\text{II},n}^{(s)}$ and $\bm{\Theta}_{\text{II},n}=\cup_{s\in\mathbb{N}^{+}_{S}}\bm{\Theta}_{\text{II},n}^{(s)}$ .

Remark 2.

In Definition 3, we are implicitly assuming that $\bm{\Theta}$ contains at least one, and possibly many, zeros of the distance between the auxiliary estimators on the sample and on the pseudo-sample. Therefore, the theory is the same for any measure of distance, which we denote generically by $d$ .

Remark 3.

The indirect inference estimators in Definition 3 are a special case of the more general form

[TABLE]

where $B\in\mathbb{N}^{+}$ and $m\geq n$ . In Definition 3 we fixed $B=1$ and $m=n$ . [2] considered two cases: first, $B$ large, $m=n$ and $S=1$ ; second, $B=1$ , $m$ large and $S=1$ . For both cases, the $\ell_{2}$ -norm was used as the measure of distance (see the preceding remark).

Assumption 4 (uniqueness).

For all $(\bm{\theta},s)\in\bm{\Theta}\times\mathbb{N}_{S}$ , $\operatorname*{argzero}_{\bm{\pi}\in\bm{\Pi}}\bm{\Psi}_{n}(\bm{\theta},\mathbf{u}_{s},\bm{\pi})$ has a unique solution.

Theorem 5 (Equivalence SwiZs/indirect inference).

If Assumption 4 is satisfied, then the following holds for any $s\in\mathbb{N}^{+}_{S}$ :

[TABLE]

Theorem 5 is striking because it concludes that a certain form of EMM, the SwiZs, and indirect inference estimators (as in Definition 3) are actually the very same estimators, not only asymptotically, but for any sample size, and under a very mild condition. Indeed, Assumption 4 requires the roots of the estimating function to be well separated so there exists a unique solution. This requirement is unrestrictive and it is typically satisfied. One may even wonder what would be the purpose of an estimating function for which Assumption 4 would not hold. In this spirit, Assumption 4 may be qualified as the “minimum criterion” for choosing an estimating function.

Even if the optimizer is perfect, Theorem 5 does not imply that the exact same values are found using the SwiZs or the indirect inference estimators, but that they belong to the same set of solutions, and thereby that they share the same statistical properties. Hence, Theorem 5 offers us two different ways of computing the same estimators. Simple calculations however show that the SwiZs is computationally more attractive. Indeed, if we let $k$ denote the cost of one evaluation of $\bm{\Psi}_{n}$ and $l$ the number of evaluations of $\bm{\Psi}_{n}$ needed to obtain an auxiliary estimator or the final estimator, then the SwiZs has a total cost of roughly $\mathcal{O}(2kl)$ whereas it is $\mathcal{O}(kl+kl^{2})$ for the indirect inference estimator, so a reduction of order $\mathcal{O}(kl^{2})$ . This computational efficiency of the SwiZs comes from the fact that it is not necessary to compute $\hat{\bm{\pi}}_{\text{II},n}$ , which avoids the numerical problem of the indirect inference estimator of having an optimization nested within an optimization. This discrepancy is also, quite surprisingly, reflected in the theory we develop in Section 4 for the finite sample properties and in Section 5 for the asymptotic properties.
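The nested structure and its cost can be illustrated with a minimal sketch on a toy exponential model (rate $\theta$ , sample generated as $-\ln(u)/\theta$ ), which is an assumption and not one of the examples of this chapter. The sketch takes the SwiZs to be the zero of $\bm{\Psi}_{n}(\bm{\theta},\mathbf{u}_{s},\hat{\bm{\pi}}_{n})$ in $\bm{\theta}$ and the indirect inference estimator to be the zero in $\bm{\theta}$ of the distance between $\hat{\bm{\pi}}_{n}$ and the zero of $\bm{\Psi}_{n}(\bm{\theta},\mathbf{u}_{s},\bm{\pi})$ in $\bm{\pi}$ . The names `Psi`, `solve`, `pi_ii` and the counter `calls` are hypothetical, and bisection stands in for whatever optimizer one would actually use.

```python
import math
import random

calls = {"Psi": 0}  # counts evaluations of the estimating function


def Psi(theta, u, pi):
    """Toy estimating function for an Exponential(rate=theta) model."""
    calls["Psi"] += 1
    mean_x = sum(-math.log(ui) / theta for ui in u) / len(u)
    return mean_x - 1.0 / pi


def solve(f, lo, hi, iters=100):
    """Bisection; assumes a single sign change of f on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(7)
n, theta0 = 50, 2.0
u0 = [1.0 - rng.random() for _ in range(n)]  # noise behind the observed sample
u1 = [1.0 - rng.random() for _ in range(n)]  # one simulated noise vector u_s

pi_hat = solve(lambda pi: Psi(theta0, u0, pi), 1e-6, 1e6)

# SwiZs: a single root-finding problem in theta.
calls["Psi"] = 0
theta_swizs = solve(lambda th: Psi(th, u1, pi_hat), 1e-3, 1e3)
cost_swizs = calls["Psi"]


def pi_ii(theta):
    """Inner problem: auxiliary estimator on the pseudo-sample g(theta, u1)."""
    return solve(lambda pi: Psi(theta, u1, pi), 1e-6, 1e6)


# Indirect inference: every trial value of theta triggers an inner solve.
calls["Psi"] = 0
theta_ii = solve(lambda th: pi_ii(th) - pi_hat, 1e-3, 1e3)
cost_ii = calls["Psi"]
```

Both routes return the same value up to solver precision, while the evaluation counts exclude the common cost of $\hat{\bm{\pi}}_{n}$ and differ by roughly a factor $l$ , in line with the $\mathcal{O}(kl)$ versus $\mathcal{O}(kl^{2})$ comparison for the second stage.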

At first glance, the SwiZs may appear similar to the parametric bootstrap (see Definition 6 below). If we strengthen our assumptions and think of the auxiliary estimator as an unbiased estimator of $\bm{\theta}$ , it is natural to think of the SwiZs and the parametric bootstrap as being equivalent. In any case, both methods use the exact same ingredients, so we may wonder whether they are actually the same. The next result demonstrates that in fact they will seldom be equivalent.

Definition 6 (parametric bootstrap).

Let $\hat{\bm{\pi}}_{n}$ and $\{\mathbf{u}_{j}:j\in\mathbb{N}\}$ be defined as in Definition 2. We consider the following sequence of estimators:

[TABLE]

The collection of the solutions is $\bm{\Theta}_{\text{Boot},n}=\cup_{s\in\mathbb{N}^{+}_{S}}\bm{\Theta}^{(s)}_{\text{Boot},n}$ .

Remark 4.

For the solutions $\bm{\Theta}^{(s)}_{\text{Boot},n}$ in Definition 6 to be nonempty, the parametric bootstrap requires that $\bm{\Pi}_{n}\subset\bm{\Theta}$ . The SwiZs has no such requirement.

Assumption 7.

The zeros of the estimating functions are symmetric on $(\bm{\theta},\bm{\pi})$ , that is

[TABLE]

Theorem 8 (equivalence SwiZs/parametric bootstrap).

The following holds if and only if Assumption 7 is satisfied:

[TABLE]

Assumption 7 is very restrictive, so Theorem 8 suggests that in general the SwiZs and the parametric bootstrap are not equivalent. This may come as a surprise, as only the arguments $\bm{\theta}$ and $\bm{\pi}$ are interchanged in the estimating function. Then, if they are different, the question of which one should be preferred naturally arises. We do not attempt to answer this question, but rather prefer to stimulate debate by giving motivations for using the SwiZs. Popularized by [41], the bootstrap has been a long-standing technique for (frequentist) statisticians; it is relatively straightforward to implement and has a well-established theory (see for instance [38]). On the other hand, although the idea of the SwiZs has arguably been around for decades (see the comparison with the fiducial inference at the end of this section), we lack evidence of its widespread usage, at least not under the form presented here. When facing situations where $\hat{\bm{\pi}}_{n}$ is an unbiased estimator of $\bm{\theta}_{0}$ , the SwiZs is, compared to the parametric bootstrap, more demanding to implement and generally less numerically efficient (see Section 7), suggesting that solving $\bm{\Psi}_{n}(\bm{\theta},\bm{\pi})$ in $\bm{\theta}$ is computationally more involved than in $\bm{\pi}$ . However, in all the other situations where for example $\hat{\bm{\pi}}_{n}$ may be an (asymptotically) biased estimator of $\bm{\theta}_{0}$ , a sample statistic or a consistent estimator of a different model, the parametric bootstrap cannot be invoked directly, at least not with the same form as in Definition 6. Indeed, the parametric bootstrap requires $\hat{\bm{\pi}}_{n}$ to be a consistent estimator of $\bm{\theta}_{0}$ . Therefore, when considering complex models for which a consistent estimator is not readily available at a reasonable cost, the SwiZs may be computationally more attractive.
The rest of this section aims at demonstrating that the distribution of the SwiZs is valid for the purpose of inference, whereas the following section theorizes the inferential properties of the SwiZs in finite sample, for which Sections 6 and 7 gather evidence. But before that, having emphasized their differences, we would like to share a rather common problem on which the parametric bootstrap and the SwiZs are equivalent.
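The effect of interchanging $\bm{\theta}$ and $\bm{\pi}$ can be seen on a minimal sketch with a toy exponential model (rate $\theta$ , sample generated as $-\ln(u)/\theta$ ), which is an illustrative assumption and not an example from this chapter. The sketch takes the SwiZs draw to be the zero of $\bm{\Psi}_{n}(\bm{\theta},\mathbf{u}_{s},\hat{\bm{\pi}}_{n})$ in $\bm{\theta}$ and the parametric bootstrap draw to be the zero of $\bm{\Psi}_{n}(\hat{\bm{\pi}}_{n},\mathbf{u}_{s},\bm{\pi})$ in $\bm{\pi}$ . The names `Psi`, `solve`, `uniforms` and all numerical settings are hypothetical.

```python
import math
import random


def Psi(theta, u, pi):
    """Toy estimating function: mean of -log(u)/theta minus 1/pi."""
    mean_x = sum(-math.log(ui) / theta for ui in u) / len(u)
    return mean_x - 1.0 / pi


def solve(f, lo=1e-9, hi=1e9, iters=200):
    """Bisection; assumes a single sign change of f on [lo, hi]."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def uniforms(rng, n):
    """n draws in (0, 1], so that log() stays finite."""
    return [1.0 - rng.random() for _ in range(n)]


rng = random.Random(3)
n, S, theta0 = 30, 500, 2.0
u0 = uniforms(rng, n)
pi_hat = solve(lambda pi: Psi(theta0, u0, pi))
u_draws = [uniforms(rng, n) for _ in range(S)]

# SwiZs: pi is held at pi_hat and the function is solved in theta.
swizs = [solve(lambda th, u=u: Psi(th, u, pi_hat)) for u in u_draws]
# Parametric bootstrap: theta is held at pi_hat and the function is solved in pi.
boot = [solve(lambda pi, u=u: Psi(pi_hat, u, pi)) for u in u_draws]

# Largest draw-by-draw difference between the two methods.
largest_gap = max(abs(a - b) for a, b in zip(swizs, boot))
```

For this scale-type toy model the SwiZs draw is $\hat{\pi}_{n}m_{s}$ and the bootstrap draw is $\hat{\pi}_{n}/m_{s}$ , with $m_{s}$ the mean of $-\ln(u_{s})$ , so the two collections differ even though they share the same ingredients.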

The condition under which the SwiZs and the parametric bootstrap are equivalent (Assumption 7) is very strong and generally not met. There is however one situation where this condition holds: when the inferential problem concerns the parameter of a location family, as formalized in Proposition 9.

Proposition 9 (equivalence SwiZs/parametric bootstrap in location family problems).

Suppose that $x$ is a univariate random variable identically and independently distributed according to a location family, that is $x\overset{d}{=}\theta+y$ , where $\theta\in{\rm I\!R}$ is the location parameter. If the auxiliary parameter is estimated by the sample average and $y$ is symmetric around 0, that is $y\overset{d}{=}-y$ , then

[TABLE]

The conditions of Proposition 9 are restrictive. Indeed, they are satisfied only for location families for which the centered random variable is symmetric. Proposition 9 holds for example with Gaussian, Student, Cauchy and Laplace random variables (variance and degrees of freedom known), but not, for example, for generalized extreme value, skewed Laplace and skewed $t$ random variables (even with non-location parameters being fixed). The proof uses an average as the auxiliary estimator, but it should extend easily to other estimators of location such as the trimmed mean. Proposition 9 is illustrated with a Cauchy random variable in Example 40 of Section 6.
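A minimal simulation sketch of the location case, assuming a Gaussian location model with unit variance, the sample average as auxiliary estimator, and the estimating function $\Psi_{n}(\theta,u,\pi)=\text{mean}(\theta+u)-\pi$ . Under these assumptions the zeros are explicit, so no numerical solver is needed. The names `noise_mean`, `u_bars` and `max_gap`, together with all numerical settings, are hypothetical.

```python
import random
from statistics import fmean, quantiles

rng = random.Random(11)
n, S, theta0 = 25, 5000, 1.0


def noise_mean(rng, n):
    """Average of n standard normal draws, i.e. mean(u)."""
    return fmean([rng.gauss(0.0, 1.0) for _ in range(n)])


# With Psi_n(theta, u, pi) = mean(theta + u) - pi:
#   auxiliary estimator : pi_hat = theta0 + mean(u0)  (sample average)
#   SwiZs draw          : zero in theta, pi_hat - mean(u_s)
#   bootstrap draw      : zero in pi,    pi_hat + mean(u_s)
pi_hat = theta0 + noise_mean(rng, n)
u_bars = [noise_mean(rng, n) for _ in range(S)]
swizs = [pi_hat - ub for ub in u_bars]
boot = [pi_hat + ub for ub in u_bars]

# Compare the empirical deciles of the two collections of draws.
deciles_swizs = quantiles(swizs, n=10)
deciles_boot = quantiles(boot, n=10)
max_gap = max(abs(a - b) for a, b in zip(deciles_swizs, deciles_boot))
```

In this sketch the two collections are reflections of one another around $\hat{\pi}_{n}$ , so they share the same distribution exactly when the distribution of the averaged noise is symmetric around 0; the small value of `max_gap` reflects Monte Carlo error only.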

Although the parametric bootstrap and the SwiZs will rarely lead to the same estimators, in spite of the similarity of their forms, the next result demonstrates that the distribution of the SwiZs corresponds in fact to (some sort of) a Bayesian posterior. Like indirect inference, the approximate Bayesian computation (ABC) techniques were proposed to respond to complex problems. The two techniques are often presented as being respectively the frequentist and the Bayesian approaches to the same problem, and have even sometimes been mixed (see [42]). We now show under what conditions the SwiZs and the ABC are equivalent, but first we need to be more precise about which type of ABC we consider. Often dated back to [43], the ABC has evolved and now covers a broad spectrum of techniques such as rejection sampling (see e.g. [16, 17]), Markov chain Monte Carlo (see e.g. [44, 45]) and sequential Monte Carlo sampling (see e.g. [46, 47, 48]), among others (see [49] for a review). The equivalence between the SwiZs and the ABC is demonstrated with the rejection sampling algorithm presented in the next definition. However, the note of [50] suggests that this result may be extended to Markov chain Monte Carlo and sequential Monte Carlo sampling algorithms. We leave such a rigorous demonstration for further research.

Definition 10 (Approximate Bayesian Computation (ABC) estimators).

Let $\hat{\bm{\pi}}_{n}$ and $\{\mathbf{u}_{j}:j\in\mathbb{N}\}$ be defined as in Definition 2. Let $\hat{\bm{\pi}}^{(s)}_{\text{II},n}(\bm{\theta})$ be defined as in Definition 3. We consider the following algorithm. For a given $\varepsilon\geq 0$ , for a given infinite sequence $\{\mathbf{u}_{s}:s\in\mathbb{N}^{+}_{S}\}$ , for a given infinite sequence of empty sets $\{\bm{\Theta}^{(s)}_{\text{ABC},n}(\varepsilon):s\in\mathbb{N}^{+}_{S}\}$ , for a given prior distribution $\mathcal{P}$ of $\bm{\theta}$ , repeat (indefinitely) the following steps:

1. Generate $\bm{\theta}^{\star}\sim\mathcal{P}$ .

2. Compute $\hat{\bm{\pi}}_{\text{II},n}^{(s)}\left(\bm{\theta}^{\star}\right)$ .

3. If the following criterion is satisfied

[TABLE]

add $\bm{\theta}^{\star}$ to the set $\bm{\Theta}^{(s)}_{\text{ABC},n}$ , i.e. $\bm{\Theta}^{(s)}_{\text{ABC},n}(\varepsilon)=\bm{\Theta}^{(s)}_{\text{ABC},n}(\varepsilon)\cup\{\bm{\theta}^{\star}\}$ .

For a given $s\in\mathbb{N}^{+}_{S}$ , we denote by $\hat{\bm{\theta}}^{(s)}_{\text{ABC},n}(\varepsilon)$ an element of $\bm{\Theta}^{(s)}_{\text{ABC},n}(\varepsilon)$ . The collection of the solutions is denoted $\bm{\Theta}_{\text{ABC},n}(\varepsilon)=\cup_{s\in\mathbb{N}^{+}}\bm{\Theta}^{(s)}_{\text{ABC},n}(\varepsilon)$ .
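The three steps of the algorithm can be sketched as follows for a single $s$ , on a toy exponential model (rate $\theta$ , sample generated as $-\ln(u)/\theta$ ) that is not taken from this chapter. The sketch assumes that the matching criterion is $d(\hat{\bm{\pi}}_{n},\hat{\bm{\pi}}^{(s)}_{\text{II},n}(\bm{\theta}^{\star}))\leq\varepsilon$ with $d$ the absolute difference, and uses a hypothetical uniform prior on $(0,10)$ . For the toy estimating function "sample mean minus $1/\pi$ " the auxiliary estimator is available in closed form, which keeps the example short. All names and numerical settings are illustrative.

```python
import math
import random

rng = random.Random(5)
n, theta0, eps = 50, 2.0, 0.01


def neg_log_mean(u):
    """mean(-log(u)): the sample mean of a unit-rate exponential sample."""
    return sum(-math.log(ui) for ui in u) / len(u)


u0 = [1.0 - rng.random() for _ in range(n)]   # noise behind the observed sample
u_s = [1.0 - rng.random() for _ in range(n)]  # fixed noise vector for this s
m0, m_s = neg_log_mean(u0), neg_log_mean(u_s)

# For Psi_n(theta, u, pi) = mean(-log(u)) / theta - 1 / pi, the zero in pi
# is explicit: pi = theta / mean(-log(u)).
pi_hat = theta0 / m0


def pi_ii(theta):
    """Auxiliary estimator on the pseudo-sample g(theta, u_s)."""
    return theta / m_s


accepted = []
for _ in range(50000):
    theta_star = rng.uniform(0.0, 10.0)   # step 1: draw from the (assumed) prior
    pi_star = pi_ii(theta_star)           # step 2: auxiliary estimator
    if abs(pi_hat - pi_star) <= eps:      # step 3: matching criterion
        accepted.append(theta_star)

# Value obtained by solving Psi_n(theta, u_s, pi_hat) = 0 in theta directly.
theta_swizs = pi_hat * m_s
```

Every accepted $\theta^{\star}$ lies within $\varepsilon m_{s}$ of the value obtained by solving the estimating function in $\theta$ directly, and only a small fraction of the prior draws is accepted, which illustrates why a single root-finding step can be cheaper than rejection sampling when $\varepsilon$ is small.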

Remark 5.

The ABC algorithm presented in Definition 10 is a specific version of the simple accept/reject algorithm proposed by [16, 17], where the auxiliary estimators are the solution of an estimating function and the dimensions of $\bm{\pi}$ and $\bm{\theta}$ are the same.

Definition 11 (posterior distribution).

The distribution of the infinite sequence $\{\hat{\bm{\theta}}^{(s)}_{\text{ABC},n}(\varepsilon):s\in\mathbb{N}^{+}_{S}\}$ issued from Definition 10 is referred to as the $(\varepsilon,\hat{\bm{\pi}}_{n})$ -approximate posterior distribution. If $\varepsilon=0$ , we have the $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution. If $\hat{\bm{\pi}}_{n}$ is a sufficient statistic, we have the $\varepsilon$ -approximate posterior distribution. If both $\varepsilon=0$ and $\hat{\bm{\pi}}_{n}$ is sufficient, then we simply refer to the posterior distribution.

Remark 6.

In Definition 11, we mention two sources of approximation to the posterior distribution, $\varepsilon$ and $\hat{\bm{\pi}}_{n}$ . There is actually a third source of approximation stemming from the number of simulations $S$ , if indeed $S<\infty$ . Since it is common to all the methods presented, it is left implicit.

Assumption 12 (existence of a prior).

For every $s\in\mathbb{N}^{+}_{S}$ and for all $n$ , there exists a prior distribution $\mathcal{P}$ such that

[TABLE]

Theorem 13 (Equivalence SwiZs/ABC).

If Assumptions 4 and 12 are satisfied, then the following holds:

[TABLE]

From Theorem 13 and Definition 11, we have clearly established that the distribution obtained by the SwiZs is a $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution. Yet, the conclusion reached by Theorem 13 is surprising at two different levels. First, Theorem 13 implies the possibility of obtaining a $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution without explicitly specifying a prior distribution by using the SwiZs. Second, whereas, for each $s\in\mathbb{N}^{+}_{S}$ , it would in general require a very large number of sampled $\bm{\theta}^{\star}$ for the ABC to approach a $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution ( $\varepsilon=0$ ), it is obtainable by the SwiZs at a much reduced cost. Indeed, for a given $s\in\mathbb{N}^{+}_{S}$ , it demands in general a considerable number of attempts to sample a $\bm{\theta}^{\star}$ that satisfies the matching criterion with an error of $\varepsilon\approx 0$ , whereas this search is replaced by one optimization for the SwiZs, so it may be more computationally efficient to use the SwiZs. Note also that in the situation where one has prior knowledge on $\bm{\theta}$ , the SwiZs may be modified, for example, by including an importance sampling weight, in the same fashion that the ABC would be modified when the prior distribution is improper (see e.g. [51]). However, for some problems, the optimizations to obtain the SwiZs distribution may be numerically cumbersome and the ABC may prove to be a convenient alternative (for example [52] argued in this direction for some of their examples when comparing the indirect inference and the ABC).

Switching between the SwiZs and the ABC algorithms for estimating a posterior poses the fundamental and practical question of which prior distribution to use. Assumption 12, stating that a prior distribution exists, is very reasonable and widely accepted (although a frequentist fundamentalist may argue differently), but the result of Theorem 13 raises at least three questions: which prior distribution satisfies both the SwiZs and the ABC at the same time, whether the prior distribution under which Theorem 13 holds is unique, and whether there is an “optimal” prior in the numerical sense (one that would produce $\bm{\theta}^{\star}$ “rapidly” satisfying the matching criterion defined at point 3 of Definition 10). We do not answer these questions because, firstly, the numerical problems we face in Section 7 are solved quite efficiently by the SwiZs and, secondly, they deserve much more attention than we are able to give them at present. Thus, we content ourselves with briefly mentioning studies made in this direction. In order to approach this topic, we first need to present one last technique.

The possibility of obtaining an (approximate) Bayesian posterior without explicitly specifying a prior distribution on the parameters of interest inescapably links the SwiZs to R.A. Fisher’s controversial fiducial inference (see for instance [22, 23, 24, 25, 26]). Here we keep the SwiZs neutral and do not aim at reviving any debate. It is delicate to give an unequivocal definition of the fiducial inference as it has changed on many occasions over time (see [53] for a comprehensive historical review), so we rather base the presentation on the generalized fiducial inference proposed by [27] (see also [28, 29]), which includes R.A. Fisher’s fiducial inference. Other efforts to generalize R.A. Fisher’s fiducial inference include Fraser’s structural inference ([30], see also [31]) and the Dempster-Shafer theory ([32, 33], see also [54]), refined later with the concept of inferential models ([34, 35]). As argued by [27], Fraser’s structural inference may be viewed as a special case of the generalized fiducial inference where the generating function $\mathbf{g}$ has a specific structure. The concept of inferential models is similar to the generalized fiducial inference in appearance but they differ in their respective theory. The departure point of the inferential models is to conduct inference with the conditional distribution of the pivotal quantity $\mathbf{u}$ given $\mathbf{x}_{0}$ after the sample has been observed. It is argued that keeping $\mathbf{u}\sim F_{\mathbf{u}}$ after the sample has been observed makes the whole procedure subjective ([35]), but the idea is essentially a gain in efficiency of the estimators. Although this idea is sound (see Lemma 22 in the next section), we do not see how it can be applied to the practical examples we use in Section 7, and more fundamentally, we do not understand how such a conditional distribution may be built without some form of prior (and arguably subjective) knowledge on $\mathbf{u}_{0}$ .
We therefore leave such considerations for further research and limit the equivalence to the generalized fiducial inference given in the next definition.

Definition 14 (Generalized fiducial inference).

The generalized fiducial distribution is given by

[TABLE]

Remark 7.

The generalized fiducial distribution in Definition 14 is slightly more specific than usually defined in the literature. In Definition 1 in [29], it is given by

[TABLE]

for any norm. Here, in addition, we assume that $\bm{\Theta}$ contains at least one, and possibly many, zeros.

If we let the sample size equal the dimension of the parameter of interest, $n=p$ , then it is obvious from their definitions that the generalized fiducial distribution and the indirect inference estimators are equivalent. We formalize this finding for the sake of the presentation.

Assumption 15.

The following hold:

i. $\hat{\bm{\pi}}_{n}=\mathbf{x}$ ;

ii. $\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})=\mathbf{g}(\bm{\theta},\mathbf{u})$ .

Proposition 16.

If Assumption 15 is satisfied, then the following holds:

[TABLE]

Although the link between the indirect inference and the generalized fiducial inference seems self-evident, it has, to the best of our knowledge, never been mentioned in the literature. This may be explained by the two different goals that these methods target, which may respectively be loosely summarized as finding a point estimate for a complex problem and making Bayesian inference without using a prior distribution. Having established this equivalence, the connection with the SwiZs follows directly from Theorem 5 and is formalized in the next proposition.

Proposition 17.

If Assumptions 4 and 15 are satisfied, then the following holds:

[TABLE]

In the light of Proposition 17, the SwiZs may appear equivalent to the generalized fiducial inference only under a very restrictive condition. Indeed, the only possibility for Assumption 15 to hold is that the sample size equals the dimension of the problem. But we would be willing to concede that this apparent rigidity is less severe than it seems, as one may propose to use sufficient statistics with minimal reduction on the sample, thereby leaving $n$ greater than $p$ , and Proposition 16 would still hold. Such a situation however is confined to problems dealing with exponential families, as demonstrated by the Pitman-Koopman-Darmois theorem, so in general, when $n$ is greater than $p$ and the problem at hand is outside of the exponential family, the SwiZs and the generalized fiducial inference are not equivalent.

Although the link between the generalized fiducial inference and the indirect inference has gone unnoticed, the connection between the former and the ABC has been emphasized much more. Indeed, the algorithms proposed to solve the generalized fiducial inference problems are mostly borrowed from the ABC literature (see [55]). Therefore, the discussion we conducted above on the numerical aspects of the SwiZs and the ABC still holds here: the SwiZs may be an efficient alternative to solve the generalized fiducial inference problem.

The generalized fiducial inference is also linked by [29] to what may be called “non-informative” prior approaches (see [56] for a broad discussion of this concept). More specifically, it appears that some distributions resulting from the generalized fiducial inference correspond to the posterior distribution obtained by [57] based on a data-dependent prior proportional to the likelihood function in the absence of information. This result enlarges the previous vision brought by [58], who concluded that R.A. Fisher’s fiducial inference is “Bayes inconsistent” (in the sense that Bayes’ theorem cannot be invoked) apart from problems on the Gaussian and the gamma distributions. The results of [58] relied on a narrower definition of fiducial inference than that brought by the generalized fiducial inference, so whether the generalized fiducial inference has become Bayes consistent for broader problems, or whether the approach of [57] with an uninformative prior is Bayes inconsistent, remains an open question. But most importantly, the strong link between the generalized fiducial inference and this non-informative prior approach reveals the common goal towards which these approaches tend, which might be stated as tackling the individual subjectivism in Bayesian inference that has been one of the major subjects of criticism ever since at least [22].

Last but not least, we complete the loop by the following Corollary which is a consequence of Theorems 5, 8 and 13, and Propositions 16 and 17.

Corollary 18.

We have the following:

i. If Assumptions 4 and 12 are satisfied, then $\bm{\Theta}_{\text{II},n}^{(s)}=\lim_{\varepsilon\downarrow 0}\bm{\Theta}_{\text{ABC},n}^{(s)}(\varepsilon)$ ;

ii. If Assumptions 4, 12 and 7 are satisfied, then $\bm{\Theta}_{\text{Boot},n}^{(s)}=\lim_{\varepsilon\downarrow 0}\bm{\Theta}_{\text{ABC},n}^{(s)}(\varepsilon)$ ;

iii. If Assumptions 4 and 7 are satisfied, then $\bm{\Theta}_{\text{II},n}^{(s)}=\lim_{\varepsilon\downarrow 0}\bm{\Theta}_{\text{Boot},n}^{(s)}(\varepsilon)$ ;

iv. If Assumptions 4, 7 and 15 are satisfied, then $\bm{\Theta}_{\text{Boot},n}^{(s)}=\lim_{\varepsilon\downarrow 0}\bm{\Theta}_{\text{GFD},n}^{(s)}(\varepsilon)$ ;

v. If Assumptions 4, 12 and 15 are satisfied, then $\bm{\Theta}_{\text{ABC},n}^{(s)}=\lim_{\varepsilon\downarrow 0}\bm{\Theta}_{\text{GFD},n}^{(s)}(\varepsilon)$ .

4 Exact frequentist inference in finite sample

Having demonstrated that the distribution of the SwiZs sequence, for a single experiment, is approximately a Bayesian posterior, we now turn to the long-run statistical properties of the SwiZs. Our point of view here is frequentist, that is, we suppose that we have an indefinite number of independent trials with fixed sample size $n$ and fixed $\bm{\theta}_{0}\in\bm{\Theta}$. For each experiment we compute an exact $\alpha$-credible set, as given in Definition 20 below, using the SwiZs independently: the knowledge acquired in one experiment is not used as a prior to compute the SwiZs in another experiment. The goal of this section is to demonstrate under what conditions the SwiZs leads to exact frequentist inference when the sample size is fixed.

Definition 19 (sets of quantiles).

Let $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}$ be a $\hat{\bm{\pi}}_{n}$ -approximate posterior cumulative distribution function. We define the following sets of quantiles:

1. Let $\underline{Q}_{\alpha}=\left\{\hat{\bm{\theta}}_{n}\in\bm{\Theta}_{n},\alpha\in(0,1):F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}(\hat{\bm{\theta}}_{n})\leq\alpha\right\}$ be the set of all $\hat{\bm{\theta}}_{n}$ for which $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}$ is below the threshold $\alpha$.

2. Let $\overline{Q}_{\alpha}=\left\{\hat{\bm{\theta}}_{n}\in\bm{\Theta}_{n},\alpha\in(0,1):F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}(\hat{\bm{\theta}}_{n})\geq 1-\alpha\right\}$ be the set of all $\hat{\bm{\theta}}_{n}$ for which $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}$ is above the threshold $1-\alpha$.

Definition 20 (credible set).

Let $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}$ be a $\hat{\bm{\pi}}_{n}$ -approximate posterior cumulative distribution function. A set $C_{\hat{\bm{\pi}}_{n}}$ is said to be an $\alpha$ -credible set if

[TABLE]

where

[TABLE]

If we replace “ $\geq$ ” by the equal sign in (1), we say that the coverage probability of $C_{\hat{\bm{\pi}}_{n}}$ is exact.

Definition 20 is standard in the Bayesian literature (see e.g. [59]). Note that an $\alpha$-credible set can have exact coverage only if the random variable is absolutely continuous. Such a credible set is referred to as an “exact $\alpha$-credible set”.

The next result gives a means to verify the exactness of the frequentist coverage of an exact $\alpha$-credible set.

Proposition 21 (Exact frequentist coverage).

If a $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution evaluated at $\bm{\theta}_{0}\in\bm{\Theta}_{n}$ is a realization from a standard uniform variate identically and independently distributed, $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}(\bm{\theta}_{0})=u$ , $u\sim\mathcal{U}(0,1)$ , then every exact $\alpha$ -credible set built from the quantiles of $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}$ leads to exact frequentist coverage probability in the sense that $\Pr\left(C_{\hat{\bm{\pi}}_{n}}\ni\bm{\theta}_{0}\right)=1-\alpha$ (unconditionally).

Proposition 21 states that if the cumulative distribution function (cdf) obtained from the SwiZs varies (across independent trials!) uniformly around $\bm{\theta}_{0}$ (fixed!), then so does any quantity computed from the percentiles of this cdf, leading to exact coverage in the long run. The proof relies on Borel’s strong law of large numbers. Although this result may be deemed unorthodox because it mixes Bayesian posterior and frequentist properties, it arises very naturally. Replacing the $\hat{\bm{\pi}}_{n}$-approximate posterior distribution by any conditional distribution on $\hat{\bm{\pi}}_{n}$ in Proposition 21 leads to the same result. This proposition is similar in form to the concept of confidence distribution formulated by [60] and later refined by [61, 62, 63]. The confidence distribution is however an entirely frequentist concept and could not be directly exploited here. General theoretical studies of finite sample frequentist properties are quite rare in the literature. We should nonetheless mention the study of [36]: although the theory developed there concerns inferential models and differs from ours, the author uses the same criterion of a uniformly distributed quantity to demonstrate the frequentist properties.
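The criterion of Proposition 21 can be illustrated numerically. The following is a minimal sketch, not the SwiZs itself: it assumes a toy Gaussian location problem with known variance and uses the conditional distribution $\mathcal{N}(\bar{x},\sigma^{2}/n)$ as a stand-in for the approximate posterior. The function name `coverage_from_uniform_cdf` and all numerical settings are hypothetical choices made for this illustration.

```python
import numpy as np
from statistics import NormalDist

def coverage_from_uniform_cdf(theta0=1.0, sigma=2.0, n=10, alpha=0.05,
                              trials=20000, seed=0):
    """Monte Carlo check: a conditional cdf that is U(0,1) at theta0 gives exact coverage."""
    rng = np.random.default_rng(seed)
    std_norm = NormalDist()
    se = sigma / np.sqrt(n)
    # One sample mean per independent trial (n and theta0 are held fixed).
    xbar = rng.normal(theta0, se, size=trials)
    # Toy conditional cdf N(xbar, sigma^2 / n), evaluated at the fixed theta0.
    u = np.array([std_norm.cdf((theta0 - m) / se) for m in xbar])
    # theta0 lies outside both tail sets iff alpha/2 < u < 1 - alpha/2.
    covered = (u > alpha / 2) & (u < 1 - alpha / 2)
    return u, covered.mean()

u, cover = coverage_from_uniform_cdf()
print(f"mean of F(theta0): {u.mean():.3f}, coverage: {cover:.3f}")
```

Across trials, the values of the cdf at $\theta_{0}$ should look standard uniform, and the proportion of trials in which $\theta_{0}$ falls between the $\alpha/2$ and $1-\alpha/2$ percentiles should be close to $1-\alpha$, up to Monte Carlo error.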

Remark 8.

In Proposition 21, we use a standard uniform variable as a means to verify the frequentist properties. With the current statement of the proposition, other distributions with support in $[0,1]$ may be candidates to verify the exactness of the frequentist coverage. However, if we restrict the frequentist exactness to be $\Pr(C_{\hat{\bm{\pi}}_{n}}\ni\bm{\theta}_{0})=1-\alpha$, $\Pr(\overline{Q}_{\alpha_{2}}\ni\bm{\theta}_{0})=\alpha_{2}$ and $\Pr(\underline{Q}_{\alpha_{1}}\ni\bm{\theta}_{0})=\alpha_{1}$, for $\alpha=\alpha_{1}+\alpha_{2}$, then the uniform distribution would be the only candidate.

In the light of Proposition 21, we now give the conditions under which the distribution of the sequence $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}^{+}\}$ , $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}$ , leads to exact frequentist coverage probabilities. We begin with a lemma which is essential in the construction of our argument.

Lemma 22.

If the mapping $\bm{\pi}\mapsto\bm{\Psi}_{n}$ has a unique zero in $\bm{\Pi}$ and the mapping $\bm{\theta}\mapsto\bm{\Psi}_{n}$ has a unique zero in $\bm{\Theta}$, then the following holds

[TABLE]

The idea behind Lemma 22 is that if one knew the true pivotal quantity $\mathbf{u}_{0}$ that generated the data, then one could directly recover the true quantity of interest $\bm{\theta}_{0}$ from the sample. Of course, both $\mathbf{u}_{0}$ and $\bm{\theta}_{0}$ are unknown (otherwise statisticians would be an extinct species!), but here we are exploiting the idea that, for a sufficiently large number of simulations $S$ , at some point we will generate $\mathbf{u}_{s}$ “close enough” to $\mathbf{u}_{0}$ . This idea is reflected in the following assumption.

Assumption 23.

Let $\bm{\Theta}_{n}\subseteq\bm{\Theta}$ be the set of solutions of the SwiZs in Definition 2. We have the following:

[TABLE]

The following functions are essential for convenient data reduction.

Assumption 24 (data reduction).

We have:

i. There exists a Borel measurable surjection $\mathbf{b}$ such that $\mathbf{b}(\mathbf{u})$ has the same dimension as $\mathbf{x}$.

ii. There exists a Borel measurable surjection $\mathbf{h}$ such that $\mathbf{h}\circ\mathbf{b}(\mathbf{u})$ has the same dimension as $\bm{\theta}$.

Remark 9.

The function $\mathbf{b}$ allows us to work with a random variable of the same dimension as the observed variable. Indeed we have

[TABLE]

where $\mathbf{v}=\mathbf{b}(\mathbf{u})$ has the same dimension as $\mathbf{x}$ and $\operatorname{id}_{\bm{\Theta}}$ is the identity function on the set $\bm{\Theta}$ . On the other hand, the function $\mathbf{h}$ allows us to deal with random variables of the same dimension as $\bm{\theta}$ , and thus $\bm{\pi}$ .

Remark 10.

In Assumption 24, by saying that the functions $\mathbf{h}$ and $\mathbf{b}$ are Borel measurable, we want to emphasize that after applying these functions we still work with random variables, which is essential here.

To fix ideas, we consider the following example:

Example 25 (Explicit form for $\mathbf{h}$ and $\mathbf{b}$ ).

As in Example 1, suppose that $\mathbf{x}=x_{1},\cdots,x_{n}$ is identically and independently distributed according to $\mathcal{N}(\theta,\sigma^{2})$ , where $\sigma^{2}$ is known, and consider the generating function $\mathbf{g}\in\mathcal{G}$

[TABLE]

where $u_{1i},u_{2i}$, $i=1,\cdots,n$, are identically and independently distributed according to $\mathcal{U}(0,1)$. Letting $\mathbf{v}\equiv\mathbf{b}(\mathbf{u})=\sqrt{-2\ln(u_{1})}\cos(2\pi u_{2})$, we clearly have that $\mathbf{v}\sim\mathcal{N}(0,\mathbf{I}_{n})$ is a random variable of the same dimension as $\mathbf{x}$. Now, if we consider $\mathbf{h}$ as the function that averages its argument, we have $w\equiv\mathbf{h}\circ\mathbf{b}(\mathbf{u})=\nicefrac{{1}}{{n}}\sum_{i=1}^{n}v_{i}$, so by properties of Gaussian random variables $w$ has a Gaussian distribution with mean 0 and variance $\nicefrac{{1}}{{n}}$. Since $w$ is a scalar, it has the same dimension as $\theta$.
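The two maps of Example 25 can be written down directly. The sketch below is only an illustration: the Python function names `b` and `h` mirror the symbols of the example, while the sample size, the number of replications and the seed are arbitrary choices.

```python
import numpy as np

def b(u1, u2):
    """Box-Muller map: two U(0,1) vectors of length n -> one N(0, I_n) vector."""
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

def h(v):
    """Average the n-vector down to a scalar, the dimension of theta."""
    return v.mean(axis=-1)

rng = np.random.default_rng(1)
n, reps = 20, 50000
u1 = 1.0 - rng.random((reps, n))   # shift [0, 1) to (0, 1] so log(u1) is finite
u2 = rng.random((reps, n))
w = h(b(u1, u2))                   # one scalar w per replication
print(f"mean(w) = {w.mean():.4f}, n * var(w) = {n * w.var():.4f}")
```

The empirical mean of $w$ should be close to 0 and $n$ times its empirical variance close to 1, in line with $w\sim\mathcal{N}(0,\nicefrac{1}{n})$.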

Example 25 shows explicit forms for the functions in Assumption 24. An explicit form is however not required, as we will see. Indeed, under Assumption 24, we can construct the following estimating function:

[TABLE]

where $\mathbf{w}=\mathbf{h}\circ\mathbf{b}(\mathbf{u}^{\ast})$ is a $p$-dimensional random variable. The index $p$ in the estimating function $\bm{\varphi}_{p}$ aims at emphasizing that $\mathbf{w}$ has the same dimension as $\bm{\theta}$ and $\bm{\pi}$, which is essential in our argument. Since the sample size $n$ and dimension $p$ are fixed here, it is disturbing. For some fixed $\bm{\theta}_{1}\in\bm{\Theta}$ and $\bm{\pi}_{1}\in\bm{\Pi}$, it clearly holds that:

[TABLE]

Assumption 26 (characterization of $\bm{\varphi}_{p}$ ).

Let $\bm{\Theta}_{n}\subseteq\bm{\Theta}$ and $W_{n}$ be open subsets of ${\rm I\!R}^{p}$. Let $\hat{\bm{\pi}}_{n}$ be the unique solution of $\bm{\Psi}_{n}(\bm{\theta}_{0},\mathbf{u}_{0},\bm{\pi})$. Let $\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\bm{\theta},\mathbf{w})\equiv\bm{\varphi}_{p}(\bm{\theta},\mathbf{w},\hat{\bm{\pi}}_{n})$ be the map where $\hat{\bm{\pi}}_{n}$ is fixed. We have the following:

i. $\bm{\varphi}_{\hat{\bm{\pi}}_{n}}\in\mathcal{C}^{1}\left(\bm{\Theta}_{n}\times W_{n},{\rm I\!R}^{p}\right)$ is once continuously differentiable on $\left(\bm{\Theta}_{n}\times W_{n}\right)\setminus K_{n}$, where $K_{n}\subset\bm{\Theta}_{n}\times W_{n}$ is at most countable,

ii. $\det\left(D_{\bm{\theta}}\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\bm{\theta},\mathbf{w})\right)\neq 0$, $\det\left(D_{\mathbf{w}}\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\bm{\theta},\mathbf{w})\right)\neq 0$ for every $(\bm{\theta},\mathbf{w})\in\left(\bm{\Theta}_{n}\times W_{n}\right)\setminus K_{n}$,

iii. $\lim_{\lVert(\bm{\theta},\mathbf{w})\rVert\to\infty}\left\lVert\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\bm{\theta},\mathbf{w})\right\rVert=\infty$.

Assumption 27 (characterization of $\bm{\varphi}_{p}$ II).

Let $\bm{\Theta}_{n}\subseteq\bm{\Theta}$, $W_{n}$ and $\bm{\Pi}_{n}\subseteq\bm{\Pi}$ be open subsets of ${\rm I\!R}^{p}$. Let $\bm{\varphi}_{\bm{\theta}_{1}}(\mathbf{w},\bm{\pi})\equiv\bm{\varphi}_{p}(\bm{\theta}_{1},\mathbf{w},\bm{\pi})$ be the map where $\bm{\theta}_{1}\in\bm{\Theta}$ is fixed. Let $\bm{\varphi}_{\mathbf{w}_{1}}(\bm{\theta},\bm{\pi})\equiv\bm{\varphi}_{p}(\bm{\theta},\mathbf{w}_{1},\bm{\pi})$ be the map where $\mathbf{w}_{1}\in W_{n}$ is fixed. We have the following:

i. $\bm{\varphi}_{\bm{\theta}_{1}}\in\mathcal{C}^{1}\left(W_{n}\times\bm{\Pi}_{n},{\rm I\!R}^{p}\right)$ is once continuously differentiable on $\left(W_{n}\times\bm{\Pi}_{n}\right)\setminus K_{1n}$, where $K_{1n}\subset W_{n}\times\bm{\Pi}_{n}$ is at most countable,

ii. $\bm{\varphi}_{\mathbf{w}_{1}}\in\mathcal{C}^{1}\left(\bm{\Theta}_{n}\times\bm{\Pi}_{n},{\rm I\!R}^{p}\right)$ is once continuously differentiable on $\left(\bm{\Theta}_{n}\times\bm{\Pi}_{n}\right)\setminus K_{2n}$, where $K_{2n}\subset\bm{\Theta}_{n}\times\bm{\Pi}_{n}$ is at most countable,

iii. $\det\left(D_{\mathbf{w}}\bm{\varphi}_{\bm{\theta}_{1}}(\mathbf{w},\bm{\pi})\right)\neq 0$, $\det\left(D_{\bm{\pi}}\bm{\varphi}_{\bm{\theta}_{1}}(\mathbf{w},\bm{\pi})\right)\neq 0$ for every $(\mathbf{w},\bm{\pi})\in\left(W_{n}\times\bm{\Pi}_{n}\right)\setminus K_{1n}$,

iv. $\det\left(D_{\bm{\theta}}\bm{\varphi}_{\mathbf{w}_{1}}(\bm{\theta},\bm{\pi})\right)\neq 0$, $\det\left(D_{\bm{\pi}}\bm{\varphi}_{\mathbf{w}_{1}}(\bm{\theta},\bm{\pi})\right)\neq 0$ for every $(\bm{\theta},\bm{\pi})\in\left(\bm{\Theta}_{n}\times\bm{\Pi}_{n}\right)\setminus K_{2n}$,

v. $\lim_{\lVert(\mathbf{w},\bm{\pi})\rVert\to\infty}\left\lVert\bm{\varphi}_{\bm{\theta}_{1}}(\mathbf{w},\bm{\pi})\right\rVert=\infty$,

vi. $\lim_{\lVert(\bm{\theta},\bm{\pi})\rVert\to\infty}\left\lVert\bm{\varphi}_{\mathbf{w}_{1}}(\bm{\theta},\bm{\pi})\right\rVert=\infty$.

Theorem 28.

If Assumptions 24 and 23 and one of Assumptions 26 or 27 are satisfied, then the following hold:

1. There is a $\mathcal{C}^{1}$-diffeomorphism map $\mathbf{a}:W_{n}\to\bm{\Theta}_{n}$ such that the distribution function of $\hat{\bm{\theta}}_{n}$ given $\hat{\bm{\pi}}_{n}$ is

[TABLE]

where

[TABLE]

2. For all $\alpha\in(0,1)$, every exact $\alpha$-credible set built from the percentiles of the distribution function has exact frequentist coverage probabilities.

Theorem 28 is very powerful as it concludes that the SwiZs (Assumptions 24, 23 and 26) and the indirect inference estimators (Assumptions 24, 23 and 27) have exact frequentist coverage probabilities in finite sample. Our argument is based on the possibility of changing variables from $\hat{\bm{\theta}}_{n}$ to $\mathbf{w}$, but also from $\mathbf{w}$ to $\hat{\bm{\theta}}_{n}$ (hence the diffeomorphism). This argument may appear tautological, but it is precisely because we are able to make this change of variable in both directions that the conclusion of Theorem 28 is possible (see the parametric bootstrap in Examples 41 and 43 for counter-examples). The result is very general because we do not suppose that we know explicitly the estimators $\hat{\bm{\theta}}_{n}$ and $\hat{\bm{\pi}}_{n}$, nor the random variable $\mathbf{w}$. Because of their unknown form, we employ a global implicit function theorem for our proof, which allows us to characterize the derivative of these estimators through their estimating function. One of the conclusions of the global implicit function theorem is the existence of a unique and globally invertible function $\mathbf{a}$. It does not seem possible to reach the conclusion of Theorem 28 with a local implicit function theorem (the version usually encountered in textbooks), but this may be of interest for further research as some conditions may accordingly be relaxed.

Although powerful, the conditions of Theorem 28 are restrictive or difficult to inspect, but not hard to believe, as we now explain. First, the existence of the random variable $\mathbf{w}$ depends on the possibility of data reduction as expressed in Assumption 24. We do not need to know $\mathbf{w}$ explicitly and $\mathbf{w}$ does not need to be unique, so essentially Assumption 24 holds for every problem for which a maximum likelihood estimator exists (see e.g. [64], Theorem 2 in Chapter 7); see also [65, 35] for the construction of $\mathbf{w}$ by conditioning. Yet, it remains unclear whether this condition holds in situations where the likelihood function does not exist. The indirect inference and ABC literatures are overflowing with examples where the likelihood is not tractable, but one should keep in mind that such a situation does not exclude the existence of a maximum likelihood estimator; it is simply impractical to obtain one. Second, Assumption 23 states that the true value $\bm{\theta}_{0}$ belongs to the set of solutions. This condition can typically only be verified in simulations where all the parameters of the experiment are controlled, although it is not a stretch to believe that such a condition holds when making a very large number of simulations $S$. We interpret the inclusion of the set of solutions in $\bm{\Theta}$ as follows: once $\bm{\theta}_{0}\in\bm{\Theta}$ is fixed, it is not necessary to explore the whole set $\bm{\Theta}$ (that would require $S$ to be extremely large), but only an area of $\bm{\Theta}$ sufficiently large to include $\bm{\theta}_{0}$. Third, Assumptions 26 and 27 are more technical and concern the finite sample behavior of the estimating functions of, respectively, the SwiZs and the indirect inference estimators. Although we cannot conclude that Assumption 26 is weaker than Assumption 27, the former seems easier to deal with.

Assumption 26 (i) requires the estimating function to be once continuously differentiable in $\bm{\theta}$ and $\mathbf{w}$ almost everywhere. The estimators $\hat{\bm{\theta}}_{n}$ and $\hat{\bm{\pi}}_{n}$ are not known in an explicit form, but they can be characterized by their derivatives using an implicit function theorem argument. Since $\bm{\theta}$ and $\mathbf{w}$ appear in the generating function $\mathbcal{g}$, this assumption may typically be verified for the example at hand using a chain rule argument: the estimating function must be once continuously differentiable in the observations represented by $\mathbcal{g}$, and $\mathbcal{g}$ must be once continuously differentiable in both its arguments. Discrete random variables are automatically ruled out by this last requirement, but this should not come as a surprise, as exactness of the coverage cannot be claimed in general for discrete distributions (see e.g. [66]). The smoothness requirement on the estimating function excludes, for example, estimators based on order statistics. In general, relying on a non-smooth estimating function leads to less efficient estimators and less stable numerical solutions, but such a function may be easier to choose in situations where it is not clear which one to select. Although non-smooth estimating functions and discrete random variables are dismissed, the condition may nearly be satisfied when considering $n$ large enough. Assumption 27 (i, ii) requires in addition the estimating function to be once continuously differentiable in $\bm{\pi}$.

Assumption 26 (ii), as well as Assumption 27 (iii, iv), essentially requires the estimating function to be “not too flat” globally. It is one of the weakest conditions for the invertibility of the Jacobian matrices. Usually only one of the Jacobians is subject to such a requirement in an implicit function theorem, but since we are targeting a $\mathcal{C}^{1}$-diffeomorphism, we strengthen the assumption to both Jacobians. Once the first derivative of the estimating function has been verified as explained in the preceding paragraph, the non-nullity of the determinants may be assessed; it typically depends on the model and the chosen estimating function. This condition is not globally satisfied, for example, by robust estimators, whose estimating function is constant on an uncountable set beyond some threshold. This raises the question of whether the condition may be relaxed to hold only locally, a condition which robust estimators would satisfy, but Example 50 with the robust Lomax distribution in Section 7 seems to point in the opposite direction.

Assumption 26 (iii), as well as Assumption 27 (v, vi), is a necessary and sufficient condition to invoke Palais’ global inversion theorem ([67]), which is a key component of the global implicit function theorem of [68] that we use. It can be verified in two steps by, first, letting $\mathbcal{g}$ diverge in the estimating function, and then letting $\bm{\theta}$ and $\mathbf{w}$ diverge in $\mathbcal{g}$. Once again, robust estimators do not fulfill this requirement as their estimating functions do not diverge with $\mathbcal{g}$ but rather stay constant.

Theorem 28 is derived under sufficient conditions. In its current form, although very general, it excludes some specific estimating functions and non-absolutely continuous random variables. It is of both practical and theoretical interest to develop results for a wider range of situations. Such considerations are left for further research.

We finish this section by considering a special, though maybe common, case where the auxiliary estimator is known in an explicit form. Suppose $\hat{\bm{\pi}}_{n}=\mathbf{h}(\mathbf{x}_{0})$ where $\mathbf{h}$ is a known (surjective) function of the observations (see Assumption 24). We can define a (new) indirect inference estimator as follows:

[TABLE]

Remark 11.

The estimator defined in Equation 2 is a special case of the indirect inference estimators as expressed in Definition 3, and thus of the SwiZs by Theorem 5, where the auxiliary estimators $\hat{\bm{\pi}}_{n}$ and $\hat{\bm{\pi}}_{\text{II},n}$ are known in an explicit form.

Assumption 29 (characterization of $\mathbcal{g}$ ).

Let $\bm{\Theta}_{n}\subseteq\bm{\Theta}$, $W_{n}$ be subsets of ${\rm I\!R}^{p}$ and $K_{n}\subset\bm{\Theta}_{n}\times W_{n}$ be at most countable. The following hold:

i. $\mathbcal{g}\in\mathcal{C}^{1}\left(\bm{\Theta}_{n}\times W_{n},{\rm I\!R}^{p}\right)$ is once continuously differentiable on $(\bm{\Theta}_{n}\times W_{n})\setminus K_{n}$,

ii. $\det(D_{\bm{\theta}}\mathbcal{g}(\bm{\theta},\mathbf{w}))\neq 0$ and $\det(D_{\mathbf{w}}\mathbcal{g}(\bm{\theta},\mathbf{w}))\neq 0$ for every $(\bm{\theta},\mathbf{w})\in(\bm{\Theta}_{n}\times W_{n})\setminus K_{n}$,

iii. $\lim_{\lVert(\bm{\theta},\mathbf{w})\rVert\to\infty}\lVert\mathbcal{g}(\bm{\theta},\mathbf{w})\rVert=\infty$.

Proposition 30.

If Assumptions 24, 23 and 29 are satisfied, then the conclusions (1) and (2) of Theorem 28 hold. In particular, the distribution function is:

[TABLE]

where

[TABLE]

The message of Proposition 30 is fascinating: once the auxiliary estimator is known in an explicit form, the conditions to reach the conclusion of Theorem 28 simplify accounting for the fact that the implicit function theorem is no longer necessary. The discussion we have after Theorem 28 still holds, but the verification process of the conditions is reduced to inspecting the generating function.
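To make the explicit-auxiliary-estimator case concrete, here is a minimal simulation sketch. It rests on assumptions that are not taken from the text: a toy exponential model with generating function $-\theta\ln(u)$, the sample mean as the explicit $\mathbf{h}$, and the reading that each draw solves $\mathbf{h}(\mathbcal{g}(\bm{\theta},\mathbf{u}_{s}))=\hat{\bm{\pi}}_{n}$ for a freshly simulated $\mathbf{u}_{s}$. The names `g`, `h`, `swizs_draws` and `coverage` are hypothetical, and the sketch does not check the conditions of Assumption 29; it only estimates the coverage of percentile intervals by Monte Carlo.

```python
import numpy as np

def g(theta, u):
    """Generating function (assumed for this sketch): exponential data with mean theta."""
    return -theta * np.log(u)

def h(x):
    """Explicit auxiliary estimator: the sample mean."""
    return x.mean(axis=-1)

def swizs_draws(pi_hat, n, S, rng):
    """Solve h(g(theta, u_s)) = pi_hat for S fresh draws u_s.

    Here h(g(theta, u)) = theta * w with w = h(g(1, u)), so the root is explicit.
    """
    u = 1.0 - rng.random((S, n))      # values in (0, 1], keeps the log finite
    w = h(g(1.0, u))
    return pi_hat / w

def coverage(theta0=2.0, n=10, S=1000, trials=1000, alpha=0.05, seed=2):
    """Proportion of independent trials whose percentile interval contains theta0."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x0 = g(theta0, 1.0 - rng.random(n))   # observed sample for this trial
        draws = swizs_draws(h(x0), n, S, rng)
        lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
        hits += int(lo < theta0 < hi)
    return hits / trials

cov = coverage()
print(f"estimated coverage of the 95% percentile interval: {cov:.3f}")
```

In this toy model the draws are $\hat{\pi}_{n}/w_{s}$ while $\hat{\pi}_{n}=\theta_{0}w_{0}$, so the cdf of the draws evaluated at $\theta_{0}$ equals $\Pr(w_{s}\geq w_{0})$, which is standard uniform across trials. The estimated coverage should therefore be close to $1-\alpha$ even with $n=10$, up to Monte Carlo error and the finite number of draws $S$.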

5 Asymptotic properties

When $n\to\infty$, assumptions different from those of Section 4 may be considered to derive the distribution of the SwiZs. By Theorem 5, the SwiZs in Definition 2 and the indirect inference estimators in Definition 3 are equivalent for any $n$. Yet, due to their different forms, the conditions to derive their asymptotic properties differ, at least in appearance. We treat the asymptotic properties of both the SwiZs and the indirect inference estimators in a unified fashion and highlight their differences. We do not attempt to give the weakest possible conditions, as our goal is primarily to demonstrate in what theoretical aspects the SwiZs differs from the indirect inference estimators. The asymptotic properties of the indirect inference estimators have already been derived by several authors in the literature, and we refer to [40], Chapter 4, for the comparison.

The following conditions are sufficient to prove the consistency of any estimator $\hat{\bm{\theta}}_{n}^{(s)}$ in Definitions 2 and 3. When it is clear from the context, we simply drop the suffix and write $\hat{\bm{\theta}}_{n}$ for any of these estimators.

Assumption 31.

The following hold:

i. The sets $\bm{\Theta},\bm{\Pi}$ are compact,

ii. For every $\bm{\pi}_{1},\bm{\pi}_{2}\in\bm{\Pi}$, $\bm{\theta}\in\bm{\Theta}$ and $\mathbf{u}\sim F_{\mathbf{u}}$, there exists a random value $A_{n}=\mathcal{O}_{p}(1)$ such that, for a sufficiently large $n$,

[TABLE]

iii. For every $\bm{\theta}\in\bm{\Theta}$, $\bm{\pi}\in\bm{\Pi}$, the estimating function $\bm{\Psi}_{n}\left(\bm{\theta},\mathbf{u},\bm{\pi}\right)$ converges pointwise to $\bm{\Psi}(\bm{\theta},\bm{\pi})$.

iv. For every $\bm{\theta}\in\bm{\Theta}$, $\bm{\pi}_{1},\bm{\pi}_{2}\in\bm{\Pi}$, we have

[TABLE]

if and only if $\bm{\pi}_{1}=\bm{\pi}_{2}$.

Assumption 32 (SwiZs).

The following hold:

i. For every $\bm{\theta}_{1},\bm{\theta}_{2}\in\bm{\Theta}$, $\bm{\pi}\in\bm{\Pi}$ and $\mathbf{u}\sim F_{\mathbf{u}}$, there exists a random value $B_{n}=\mathcal{O}_{p}(1)$ such that, for sufficiently large $n$,

[TABLE]

ii. For every $\bm{\theta}_{1},\bm{\theta}_{2}\in\bm{\Theta}$, $\bm{\pi}\in\bm{\Pi}$, we have

[TABLE]

if and only if $\bm{\theta}_{1}=\bm{\theta}_{2}$.

Assumption 33 (IIE).

The following hold:

i. For every $\bm{\theta}_{1},\bm{\theta}_{2}\in\bm{\Theta}$, there exists a random value $C_{n}=\mathcal{O}_{p}(1)$ such that, for sufficiently large $n$,

[TABLE]

ii. Let $\bm{\pi}(\bm{\theta})$ denote the mapping towards which $\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})$ converges pointwise for every $\bm{\theta}\in\bm{\Theta}$. For every $\bm{\theta}_{1},\bm{\theta}_{2}\in\bm{\Theta}$, we have

[TABLE]

if and only if $\bm{\theta}_{1}=\bm{\theta}_{2}$.

Theorem 34 (consistency).

Let $\{\hat{\bm{\pi}}_{n}\}$ be a sequence of estimators of $\{\bm{\Psi}_{n}(\bm{\pi})\}$. For any fixed $\bm{\theta}\in\bm{\Theta}$, let $\{\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})\}$ be the sequence of estimators of $\{\bm{\Psi}_{n}(\bm{\theta},\bm{\pi})\}$. Let $\{\hat{\bm{\theta}}_{n}\}$ be a sequence of estimators of $\{\bm{\Psi}_{n}(\bm{\theta})\}$. We have the following:

1. If Assumption 31 holds, then any sequence $\{\hat{\bm{\pi}}_{n}\}$ converges in probability to $\bm{\pi}_{0}$ and any sequence $\{\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})\}$ converges in probability to $\bm{\pi}(\bm{\theta})$;

2. Moreover, if one of Assumptions 32 or 33 holds, then any sequence $\{\hat{\bm{\theta}}_{n}\}$ converges in probability to $\bm{\theta}_{0}$.

Theorem 34 demonstrates the consistency of $\hat{\bm{\theta}}_{n}$ under two sets of conditions. Assumptions 31 and 33, or the conditions implied by these assumptions, are standard in the literature on indirect inference estimators (see [40], Chapter 4). More specifically, the mapping $\bm{\theta}\mapsto\bm{\pi}$, usually referred to as the “binding” function (see e.g. [2]) or the “bridge relationship” (see [69]), is central to the argument and is required to be one-to-one (Assumption 33 (ii)). Surprisingly, in Theorem 34, this requirement may be substituted by the bijectivity of the deterministic estimating function with respect to $\bm{\theta}$ (Assumption 32 (ii)). Whereas the bijectivity of $\bm{\pi}(\bm{\theta})$ can typically only be assumed (if $\bm{\theta}\mapsto\bm{\pi}$ were known explicitly, then one would not need to use the indirect inference estimator, unless of course one were willing to lose statistical efficiency and numerical stability for no gain), there is more hope for Assumption 32 (ii) to be verifiable. Since both Assumptions 32 and 33 lead to the same conclusion, one would expect some strong connections between them. Since $\bm{\pi}(\bm{\theta})$ may be interpreted as the implicit solution of $\bm{\Psi}(\bm{\theta},\bm{\pi}(\bm{\theta}))=\mathbf{0}$, it seems possible to link both assumptions with the help of an implicit function theorem, but this typically requires further conditions on the derivatives of $\bm{\Psi}$ that are not necessary for obtaining the consistency results, and we thus leave such considerations for further research.

Proving the consistency of an estimator relies on two major conditions: the uniform convergence of the stochastic objective function and the bijectivity of the deterministic objective function (Assumption 31 (iv), Assumption 32 (ii), Assumption 33 (ii)). This second condition is referred to as the identifiability condition. It can sometimes be verified, and is sometimes only assumed to hold, but it is typically assessed in accordance with the chosen probabilistic model. Discrepancies among approaches mainly occur in the demonstration of the uniform convergence. Here we rely on a stochastic version of the classical Arzelà-Ascoli theorem; see [70] for alternative approaches based on the theory of empirical processes. To satisfy this theorem, we require the parameter sets to be compact (Assumption 31 (i)), the stochastic objective function to converge pointwise (Assumption 31 (iii)) and the stochastic objective function to be Lipschitz (Assumption 31 (ii), Assumption 32 (i), Assumption 33 (i)). Note that the last requirement is in fact for the objective function to be stochastically equicontinuous, a requirement that is verified by the Lipschitz condition; see also [71] for a broad discussion of this condition and alternatives. Some authors have proposed to relax the compactness condition, see for example [72], but this is generally not a sensitive issue in practice. The pointwise convergence of the stochastic objective function may be assessed in more detail depending on the context. For identically and independently distributed observations, the weak law of large numbers may typically be employed, thus requiring the stochastic objective function to have the same finite expected value across the observations. Other laws of large numbers may be used for serially dependent processes (see Chapter 7 of [73]) and for non-identically distributed processes (see [74]), each result having its own conditions to satisfy.
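As a purely numerical illustration of consistency, the sketch below tracks the estimation error as $n$ grows. It assumes a toy exponential model (not taken from the text) in which the estimating equation $\theta w-\hat{\pi}_{n}=0$ has the explicit root $\hat{\theta}_{n}=\hat{\pi}_{n}/w$; the function name `median_abs_error` and all settings are hypothetical.

```python
import numpy as np

def median_abs_error(n, theta0=2.0, reps=400, seed=3):
    """Median |theta_hat_n - theta0| over independent replications of size n."""
    rng = np.random.default_rng(seed)
    # Observed data: exponential with mean theta0, so pi_hat_n is the sample mean.
    pi_hat = (-theta0 * np.log(1.0 - rng.random((reps, n)))).mean(axis=1)
    # One simulated pivot per replication; the toy estimating equation
    # theta * w - pi_hat = 0 has the explicit root theta_hat = pi_hat / w.
    w = (-np.log(1.0 - rng.random((reps, n)))).mean(axis=1)
    return float(np.median(np.abs(pi_hat / w - theta0)))

sizes = [10, 100, 1000, 10000]
errors = [median_abs_error(n) for n in sizes]
for n, e in zip(sizes, errors):
    print(f"n = {n:>5d}: median absolute error = {e:.4f}")
```

Both $\hat{\pi}_{n}$ and $w$ are sample means of identically and independently distributed terms with finite expectation, so the weak law of large numbers applies to each, and the error should shrink roughly at the rate $n^{-1/2}$.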

We now turn to the asymptotic distribution of an estimator $\hat{\bm{\theta}}_{n}$. As for the consistency result, the following sufficient conditions are separated to outline the difference between the SwiZs and the indirect inference estimators.

Assumption 35.

The following hold:

i. Let $\bm{\Theta}^{\circ},\bm{\Pi}^{\circ}$, the interior sets of $\bm{\Theta},\bm{\Pi}$, be open and convex subsets of ${\rm I\!R}^{p}$,

ii. $\bm{\theta}_{0}\in\bm{\Theta}^{\circ}$ and $\bm{\pi}_{0}\in\bm{\Pi}^{\circ}$,

iii. $\bm{\Psi}_{n}\in\mathcal{C}^{1}\left(\bm{\Theta}^{\circ}\times\bm{\Pi}^{\circ},{\rm I\!R}^{p}\times{\rm I\!R}^{p}\right)$ when $n$ is sufficiently large,

iv. For every $\bm{\theta}\in\bm{\Theta}^{\circ},\bm{\pi}\in\bm{\Pi}^{\circ}$, $D_{\bm{\pi}}\bm{\Psi}_{n}(\bm{\theta},\mathbf{u},\bm{\pi}),D_{\bm{\theta}}\bm{\Psi}_{n}(\bm{\theta},\mathbf{u},\bm{\pi})$ converge pointwise to $D_{\bm{\pi}}\bm{\Psi}(\bm{\theta},\bm{\pi})\equiv\mathbf{K}(\bm{\theta},\bm{\pi}),D_{\bm{\theta}}\bm{\Psi}(\bm{\theta},\bm{\pi})\equiv\mathbf{J}(\bm{\theta},\bm{\pi})$,

v. $\mathbf{K}\equiv\mathbf{K}(\bm{\theta}_{0},\bm{\pi}_{0}),\mathbf{J}\equiv\mathbf{J}(\bm{\theta}_{0},\bm{\pi}_{0})$ are nonsingular,

vi. $n^{1/2}\bm{\Psi}_{n}(\bm{\theta}_{0},\mathbf{u},\bm{\pi}_{0})\rightsquigarrow\mathcal{N}\left(\mathbf{0},\mathbf{Q}\right)$, $\lVert\mathbf{Q}\rVert_{\infty}<\infty$.

Assumption 36 (SwiZs II).

For every $\bm{\pi}_{1},\bm{\pi}_{2}\in\bm{\Pi}^{\circ}$ , $\bm{\theta}\in\bm{\Theta}^{\circ}$ and $\mathbf{u}\sim F_{\mathbf{u}}$ , there exists a random value $E_{n}=\mathcal{O}_{p}(1)$ such that, for sufficiently large $n$ ,

[TABLE]

Assumption 37 (IIE II).

The following hold:

i. $\hat{\bm{\pi}}_{\text{II},n}\in\mathcal{C}^{1}(\bm{\Theta}^{\circ},{\rm I\!R}^{p})$ for sufficiently large $n$;

ii. For every $\bm{\theta}\in\bm{\Theta}^{\circ}$, $D_{\bm{\theta}}\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})$ converges pointwise to $D_{\bm{\theta}}\bm{\pi}(\bm{\theta})$.

Theorem 38 (asymptotic normality).

If the conditions of Theorem 34 are satisfied, we have the following additional results:

1. If Assumption 35 holds, then

[TABLE]

and

[TABLE]

2. Moreover, if Assumption 36 or 37 holds, then

[TABLE]

Theorem 38 gives the asymptotic distribution of both the auxiliary estimator and the estimator of interest. The conditions to derive the asymptotic distribution of the auxiliary estimator as expressed in Assumption 35 are standard for most estimators in the statistical literature. The proof of the first statement relies on the possibility of applying a delta method, which requires the estimating function to be once continuously differentiable (Assumption 35 (i), (ii) and (iii)). This condition is typically not met when $\bm{\theta}_{0}$ is a boundary point of $\bm{\Theta}$. Although not devoid of interest, this case is atypical and deserves to be treated on its own; it is therefore excluded by Assumption 35 (ii). In contrast, relaxing the smoothness requirement on the estimating function has received much more attention in the literature (see [72, 75, 70] among others). Here we content ourselves with the stronger smoothness condition on the estimating function (Assumption 35 (iii)), partly because it is widely accepted, and partly because the smoothness of the estimating function is already required when $n$ is finite by Theorem 28 to demonstrate the exact coverage probabilities, a situation that encourages us to consider smooth estimating functions in the practical examples. The conditions for the Jacobian matrices to exist (Assumption 35 (iv)) and to be invertible (Assumption 35 (v)) are standard ones. The last condition is that a central limit theorem is applicable to the estimating equation (Assumption 35 (vi)). This statement is very general and its validity depends upon the context. For identically and independently distributed observations, one typically needs to verify Lindeberg’s conditions ([76]), which essentially require that the first two moments exist and are finite. The requirements are similar if the observations are non-identically distributed (see e.g. [77]). The conditions are also similar for stationary processes (see e.g. [78], for a review). Note finally that, as minor as it might be, the delta method (which is essentially a mean value theorem), widely used in the statistical literature, has recently been shown to be wrongly applied by many to vector-valued functions ([79]); this flaw has been taken into account in the present work.

The proof of the second statement of Theorem 38 on the asymptotic distribution of the estimator of interest is more specific to the indirect inference literature. Compared to the proof of the first statement, it requires in addition that, for $n$ large enough, the binding function be asymptotically differentiable with respect to $\bm{\theta}$ for the indirect inference estimator (Assumption 37), or that the derivative of the estimating function with respect to $\bm{\theta}$ be stochastically Lipschitz for the SwiZs (Assumption 36). By the same arguments we presented after the consistency Theorem 34, it may be more practical to verify Assumption 36, as the verification of Assumption 37 is impossible, at least directly, since the binding function is unknown. This is actually not entirely true, as one may express the derivative of the binding function by invoking an implicit function theorem; the condition may then be verified on the resulting explicit derivative. The proof we use under Assumption 37 uses this mechanism; the derivative of the binding function is thus given by

[TABLE]

for every $\bm{\theta}$ in a neighborhood of $\bm{\theta}_{0}$ (see the proof in the Appendix for more details). It is only by using this implicit function theorem argument that the exact same explicit distribution for both the SwiZs and the indirect inference estimators may be obtained. The same idea may then be used to find the derivative of $\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})$ and verify Assumption 37. Note finally that [40] have an extra condition not required here (but that would be required as well) because they include a stochastic covariate with their indirect inference estimator.

Having demonstrated the asymptotic properties of one of the SwiZs estimators, $\hat{\bm{\theta}}_{n}^{(s)}$ , $s\in\mathbb{N}^{+}_{S}$ , we finish this section by giving the property of the average of the SwiZs sequence. The mean is an interesting estimator on its own and it is often considered as a point estimate in a Bayesian context.

Proposition 39.

Let $\bar{\bm{\theta}}_{n}$ be the average of $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}^{+}_{S}\}$ . If the conditions of Theorem 38 are satisfied, then it holds that

[TABLE]

where the factor $\gamma=1+1/S$ .

The discussion of the proof and of the conditions needed to obtain Theorem 38 is also valid for Proposition 39. The only point that deserves further explanation is the factor $\gamma$ . This factor accounts for the numerical approximation of the $\hat{\bm{\pi}}_{n}$ -approximate posterior when $S$ is finite. It is not surprising for someone familiar with the indirect inference literature. What may appear unclear is how this factor passes from 2 for a single SwiZs estimate in Theorem 38 to $\gamma<2$ for the mean in Proposition 39. If the $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}^{+}_{S}\}$ are independent, then it is well known from the properties of the convolution of independent Gaussian random variables that $\gamma$ should equal 2. In fact, the pivotal quantities $\{\mathbf{u}_{s}:s\in\mathbb{N}^{+}_{S}\}$ are indeed independent, but each of the $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}^{+}_{S}\}$ shares a “common factor”, namely $\hat{\bm{\pi}}_{n}$ , and thus only the variability that is not common may be reduced by increasing $S$ . Note finally that the average estimator in Proposition 39 has the same asymptotic distribution as the two indirect inference estimators considered by [2] (given that the dimensions of $\bm{\theta}$ and $\bm{\pi}$ match and that our implicit function theorem argument is used).
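The role of the common factor $\hat{\bm{\pi}}_{n}$ can be checked numerically on a toy model. The sketch below is not taken from the paper: it assumes a normal location model with unit variance, the sample mean as auxiliary estimator and the generating function $g(\theta,u)=\theta+u$ , so that each SwiZs draw is $\hat{\theta}_{n}^{(s)}=\hat{\pi}_{n}-w_{s}$ with $w_{s}\sim\mathcal{N}(0,1/n)$ . All variable names and numerical values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, S, M = 50, 4, 200_000  # sample size, SwiZs draws per trial, Monte Carlo trials
theta0 = 1.0               # true location (illustrative value)

# Auxiliary estimator: the sample mean, pi_hat = theta0 + w0 with w0 ~ N(0, 1/n).
pi_hat = theta0 + rng.normal(0.0, 1.0 / np.sqrt(n), size=M)

# S SwiZs draws per trial: theta_hat^(s) = pi_hat - w_s, independent w_s ~ N(0, 1/n).
w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(M, S))
theta_s = pi_hat[:, None] - w
theta_bar = theta_s.mean(axis=1)

# Variance factors relative to the asymptotic variance 1/n of this toy model.
factor_single = n * theta_s[:, 0].var()  # one SwiZs draw
factor_mean = n * theta_bar.var()        # average of the S draws
print(f"single draw: {factor_single:.3f} (theory 2)")
print(f"average:     {factor_mean:.3f} (theory {1 + 1 / S:.3f})")
```

In this toy model the part of the variance due to $\hat{\pi}_{n}$ is unaffected by $S$ , while the part due to the $w_{s}$ is divided by $S$ , which is where the factor $1+1/S$ comes from.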

6 Examples

In this section, we illustrate the finite sample results of Section 4 with some examples for which explicit solutions exist. Indeed, for all the examples, we are able to demonstrate analytically that the SwiZs’ $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution follows a uniform distribution when evaluated at the true value $\bm{\theta}_{0}$ , and thus to conclude by Proposition 21 that any confidence region built from the percentiles of this posterior has exact coverage probabilities in the long-run. In addition, and maybe more surprisingly, for most examples we are able to derive the explicit posterior distribution that the SwiZs targets. This message is remarkable: one may not even need computations to characterize the distribution of $\hat{\bm{\theta}}_{n}$ given $\hat{\bm{\pi}}_{n}$ , but as one may foresee, these favorable situations are limited in number. Lastly, we illustrate Proposition 9 on the equivalence between the SwiZs and the parametric bootstrap with a Cauchy random variable in Example 40, where the two methods are indeed the same. Since the SwiZs and the parametric bootstrap are seldom equivalent (see the discussion after Theorem 8), we also demonstrate the nonequivalence of the two methods in the case of a uniform random variable with unknown upper bound (Example 41) and a gamma random variable with unknown rate (Example 43). The considerations of this section are not only theoretical but also practical, as we treat the linear regression (Example 45) and the geometric Brownian motion when observed irregularly (Example 48), two widely used models.

Example 40 (Cauchy with unknown location).

Let $x_{i}\sim\text{Cauchy}(\theta,\sigma)$ , $\sigma>0$ known, $i=1,\ldots,n$ , be identically and independently distributed. Consider the generating function $g(\theta,u)=\theta+u$ where $u\sim\text{Cauchy}(0,\sigma)$ and the average as the (explicit) auxiliary estimator, $\hat{\pi}_{n}=\bar{x}$ . We have

[TABLE]

where $w=\frac{1}{n}\sum_{i=1}^{n}u_{i}$ . By the properties of the Cauchy distribution, we have that $w\sim\text{Cauchy}(0,\sigma)$ , that is, the average of independent Cauchy variables has the same distribution as one of its components. Let $\hat{\theta}_{n}$ be the solution of $d(\hat{\pi}_{n},\hat{\theta}_{n}+w)=0$ , hence we have the explicit solution $\hat{\theta}_{n}=\hat{\pi}_{n}-w$ . Note that by symmetry of $w$ around 0 we have $w\overset{d}{=}-w$ , so $\hat{\theta}_{n}\overset{d}{=}\hat{\pi}_{n}+w$ . We therefore have that

[TABLE]

and by Proposition 21 the coverage probabilities obtained from the percentiles of the distribution of $\hat{\theta}_{n}|\hat{\pi}_{n}$ are exact in the long-run (frequentist).

The distribution of $\hat{\theta}_{n}|\hat{\pi}_{n}$ can be known in an explicit form. From the solution $\hat{\theta}_{n}=\hat{\pi}_{n}-w$ , we let $w=a(\theta)=\hat{\pi}_{n}-\theta$ . Following Proposition 30, we have

[TABLE]

Since $g(\theta,w)=\theta+w$ , the scaling factor is 1 and $\hat{\theta}_{n}|\hat{\pi}_{n}\sim\text{Cauchy}(\hat{\pi}_{n},\sigma)$ .

Finally, we illustrate Theorem 8, more specifically Proposition 9, by showing that the parametric bootstrap is equivalent. The bootstrap estimator is $\hat{\theta}_{\text{Boot},n}=\frac{1}{n}\sum_{i=1}^{n}g(\hat{\pi}_{n},u_{i})=\hat{\pi}_{n}+w$ . It follows immediately that $\hat{\theta}_{n}\overset{d}{=}\hat{\theta}_{\text{Boot},n}$ , so both estimators are equivalently distributed.
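The exactness claimed in Example 40 can be checked by simulation. The following is a minimal sketch, with illustrative values for $n$ , $\sigma$ , $\theta_{0}$ and the level that are our own choices: it builds equal-tailed intervals from the percentiles of $\text{Cauchy}(\hat{\pi}_{n},\sigma)$ and records how often they contain $\theta_{0}$ .

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, theta0 = 10, 2.0, 0.5   # illustrative values
M, alpha = 100_000, 0.10          # Monte Carlo trials, nominal error rate

def cauchy_quantile(p, loc, scale):
    # Quantile function of the Cauchy(loc, scale) distribution.
    return loc + scale * np.tan(np.pi * (p - 0.5))

# Auxiliary estimator in each trial: the mean of n Cauchy(theta0, sigma) observations.
x = theta0 + sigma * rng.standard_cauchy(size=(M, n))
pi_hat = x.mean(axis=1)

# Percentile interval from theta_hat | pi_hat ~ Cauchy(pi_hat, sigma).
lower = cauchy_quantile(alpha / 2, pi_hat, sigma)
upper = cauchy_quantile(1 - alpha / 2, pi_hat, sigma)

coverage = np.mean((lower <= theta0) & (theta0 <= upper))
print(f"empirical coverage: {coverage:.4f} (nominal {1 - alpha:.2f})")
```

Up to Monte Carlo error, the empirical coverage should agree with the nominal level for any $n$ , since the average of Cauchy variables does not concentrate as $n$ grows.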

Example 41 (uniform with unknown upper bound).

Let $x_{i}\sim\mathcal{U}(0,\theta)$ , $i=1,\ldots,n$ , be identically and independently distributed. Consider the generating function $g(\theta,u)=u\theta$ where $u\sim\mathcal{U}(0,1)$ and the (explicit) auxiliary estimator $\max_{i}{x_{i}}$ . Clearly, $\max_{i}{x_{i}}=\theta\max_{i}{u_{i}}$ . Denote $w=\max_{i}{u_{i}}$ so the auxiliary estimator on the sample is $\hat{\pi}_{n}=w_{0}\theta_{0}$ . Now define the estimator $\hat{\theta}_{n}$ to be the solution such that $d(\hat{\pi}_{n},\hat{\theta}w)=0$ . An explicit solution exists and is given by $\hat{\theta}_{n}=\frac{\theta_{0}w_{0}}{w}$ . We therefore have that

[TABLE]

and by Proposition 21 the coverage probabilities obtained from the percentiles of the distribution of $\hat{\theta}_{n}$ are exact in the frequentist sense.

We can even go further by making explicit the distribution of $\hat{\theta}_{n}$ given $\hat{\pi}_{n}$ . Let us define the mapping $a(\theta)=\frac{\theta_{0}w_{0}}{\theta}$ . By the change-of-variable formula we obtain:

[TABLE]

The maximum of $n$ standard uniform random variables has the density $f_{w}(w)=nw^{n-1}$ . The derivative is given by $\partial a(\theta)/\partial\theta=-\theta_{0}w_{0}/\theta^{2}$ . Note that by Proposition 30 we equivalently have

[TABLE]

Hence, we eventually obtain:

[TABLE]

Note that $\hat{\pi}_{n}$ is a sufficient statistic. Therefore we have obtained that the posterior distribution of $\hat{\theta}_{n}$ given $\hat{\pi}_{n}$ is a Pareto distribution parametrized by $\hat{\pi}_{n}$ , the minimum value of the support, and the sample size $n$ , as the shape parameter.

In view of the preceding display, it is not difficult to develop a similar result for the parametric bootstrap (see Definition 6). The bootstrap estimator is simple: it is given by $\hat{\theta}_{\text{Boot},n}=\max_{i}u_{i}\hat{\pi}_{n}=\theta_{0}w_{0}w$ . We thus obtain

[TABLE]

so it cannot be concluded that $F_{\hat{\theta}_{\text{Boot},n}|\hat{\pi}_{n}}(\theta_{0})$ follows a uniform distribution and we cannot invoke Proposition 21. Note however that, by virtue of Proposition 21, we cannot exclude that the parametric bootstrap leads to exact coverage probabilities (see Remark 8). The parametric bootstrap is well known to be inadequate in such problems. This fact may be made more explicit, as we now give the distribution of the parametric bootstrap estimator. Let us define the mapping $w=b(\tilde{\theta})=\frac{\tilde{\theta}}{\theta_{0}w_{0}}$ . Note that $b(\theta_{0})=1/w_{0}\neq w_{0}$ . We obtain by the change-of-variable formula

[TABLE]

This distribution is known to be the power-function distribution, a special case of the Pearson Type I distribution (see [80]). More interestingly, we have the following relationship between the parametric bootstrap and the SwiZs estimates:

[TABLE]

Finally, note that the support of the distribution of $\hat{\theta}_{\text{Boot},n}$ is $(0,\hat{\pi}_{n})$ whereas it is $(\hat{\pi}_{n},+\infty)$ for the SwiZs, so the two distributions never overlap! Since $\hat{\pi}_{n}$ is systematically biased downward relative to the true value $\theta_{0}$ , the coverage of the parametric bootstrap is always null. We illustrate this fact in the next figure.
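The contrast between the two methods in Example 41 can also be reproduced numerically. The sketch below uses the closed-form quantiles implied by the text: since $\hat{\theta}_{n}=\hat{\pi}_{n}/w$ and $\hat{\theta}_{\text{Boot},n}=\hat{\pi}_{n}w$ with $w$ the maximum of $n$ standard uniforms, the quantile functions are $\hat{\pi}_{n}(1-p)^{-1/n}$ (Pareto) and $\hat{\pi}_{n}p^{1/n}$ (power function). The numerical values of $n$ , $\theta_{0}$ and the level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta0 = 20, 3.0          # illustrative values
M, alpha = 100_000, 0.05     # Monte Carlo trials, nominal error rate

# The maximum of n U(0,1) variables has cdf w^n, so it can be drawn as U^(1/n).
w0 = rng.uniform(size=M) ** (1.0 / n)
pi_hat = theta0 * w0         # observed auxiliary estimator, max_i x_i

# SwiZs: theta_hat = pi_hat / w, a Pareto(min = pi_hat, shape = n) distribution.
swizs_lo = pi_hat * (1 - alpha / 2) ** (-1.0 / n)
swizs_hi = pi_hat * (alpha / 2) ** (-1.0 / n)

# Parametric bootstrap: theta_boot = pi_hat * w, supported on (0, pi_hat).
boot_lo = pi_hat * (alpha / 2) ** (1.0 / n)
boot_hi = pi_hat * (1 - alpha / 2) ** (1.0 / n)

cov_swizs = np.mean((swizs_lo <= theta0) & (theta0 <= swizs_hi))
cov_boot = np.mean((boot_lo <= theta0) & (theta0 <= boot_hi))
print(f"SwiZs coverage:     {cov_swizs:.4f} (nominal {1 - alpha:.2f})")
print(f"bootstrap coverage: {cov_boot:.4f}")
```

The bootstrap interval always lies below $\hat{\pi}_{n}<\theta_{0}$ , so it never contains the true value, whereas the SwiZs interval covers $\theta_{0}$ exactly when $\alpha/2\leq w_{0}^{n}\leq 1-\alpha/2$ , an event of probability $1-\alpha$ .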

Example 42 (exponential with unknown rate parameter).

Let $x_{i}\sim\mathcal{E}(\theta)$ , $i=1,\ldots,n$ , be identically and independently distributed. Consider the generating function $g(\theta,u)=\frac{u}{\theta}$ , where $u\sim\Gamma(1,1)$ , and the inverse of the average as auxiliary estimator, denoted $\bar{x}^{-1}$ . Clearly we have $\bar{x}^{-1}=\theta/w$ , where $w=\sum_{i=1}^{n}u_{i}/n$ , so $\hat{\pi}_{n}=\theta_{0}/w_{0}$ . The solution of $d(\hat{\pi}_{n},\theta/w)=0$ in $\theta$ is denoted $\hat{\theta}_{n}$ , it is given by $\hat{\theta}_{n}=\theta_{0}w/w_{0}=w\hat{\pi}_{n}$ . We therefore have

[TABLE]

It results from Proposition 21 that any interval built from the percentiles of the distribution of $\hat{\theta}_{n}$ has exact frequentist coverage. The distribution can be found in explicit form. We have by the additive property of the Gamma distribution that $w\sim\Gamma(n,1/n)$ (shape-rate parametrization). It immediately results from the change-of-variable formula that

[TABLE]

Note that $\hat{\pi}_{n}$ is a sufficient statistic so the obtained distribution is a posterior distribution.

This last example on an exponential variate can be (slightly) generalized to a gamma random variable as follows.

Example 43 (gamma with unknown rate parameter).

Consider the exact same setup as in Example 42 with the exception that $x_{i}\sim\Gamma(\alpha,\theta)$ and $u\sim\Gamma(\alpha,1)$ , where $\alpha>0$ is a known shape parameter. Following the same steps as in Example 42 we find the following posterior distribution:

[TABLE]

We also have that any intervals built from the percentiles of the posterior have exact frequentist coverage probabilities.

In view of this display and Example 42, we can derive the distribution of the parametric bootstrap. The estimator is obtained as follows:

[TABLE]

where $w\sim\Gamma\left(n\alpha,1/n\right)$ . It follows from the distribution of the inverse of a gamma variate and the change-of-variable formula that

[TABLE]

so $\hat{\theta}_{\text{Boot},n}\overset{d}{=}1/\hat{\theta}_{n}$ . Since $\hat{\pi}_{n}=\theta_{0}/w_{0}$ , we can also conclude that the parametric bootstrap is not uniformly distributed:

[TABLE]

The posterior distribution we obtained for the SwiZs in the last example coincides with the fiducial distribution [see Table 1 81], [see Example 21.2 82]. This correspondence is not surprising in view of the discussion held after Proposition 17. Indeed the gamma distribution is a member of the exponential family and we use a sufficient statistic as the auxiliary estimator, so the SwiZs and the generalized fiducial distribution are equivalent.
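The difference between the SwiZs and the parametric bootstrap in Examples 42 and 43 can be explored with a short simulation. This is only a sketch under stated assumptions: the SwiZs draw is $\hat{\theta}_{n}=w\hat{\pi}_{n}$ as in Example 42, and the bootstrap draw is taken to be $\hat{\pi}_{n}/w$ , obtained by applying the auxiliary estimator to data generated at $\theta=\hat{\pi}_{n}$ (by analogy with Example 41). Because both draws are $\hat{\pi}_{n}$ times a pivot, the percentiles of $w$ are estimated once and reused across trials. Names and numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, shape, theta0 = 10, 1.0, 1.5      # shape = 1 is the exponential case of Example 42
M, S, alpha = 50_000, 200_000, 0.05  # trials, pivot draws, nominal error rate

def draw_w(size):
    # Mean of n independent Gamma(shape, 1) variates, drawn directly:
    # the sum is Gamma(n * shape, 1), so the mean has shape n * shape and scale 1 / n.
    return rng.gamma(shape=n * shape, scale=1.0 / n, size=size)

# Observed auxiliary estimator in each trial: pi_hat = 1 / mean(x) = theta0 / w0.
pi_hat = theta0 / draw_w(M)

# Percentiles of the pivot w, estimated once from S draws.
q_lo, q_hi = np.quantile(draw_w(S), [alpha / 2, 1 - alpha / 2])

# SwiZs interval: percentiles of w * pi_hat.
cov_swizs = np.mean((pi_hat * q_lo <= theta0) & (theta0 <= pi_hat * q_hi))
# Bootstrap interval (assumed form): percentiles of pi_hat / w.
cov_boot = np.mean((pi_hat / q_hi <= theta0) & (theta0 <= pi_hat / q_lo))

print(f"SwiZs coverage:     {cov_swizs:.4f} (nominal {1 - alpha:.2f})")
print(f"bootstrap coverage: {cov_boot:.4f}")
```

With these settings the SwiZs coverage should match the nominal level up to Monte Carlo error, while the bootstrap coverage need not.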

We now turn our attention to more general examples where $\bm{\theta}$ is not a scalar.

Example 44 (normal with unknown mean and unknown variance).

Let $x_{i}\sim\mathcal{N}(\mu,\sigma^{2})$ be identically and independently distributed and consider $g(\mu,\sigma^{2},u)=\mu+\sigma u$ where $u\sim\mathcal{N}(0,1)$ . Take the following auxiliary estimator, $\hat{\bm{\pi}}_{n}={(\bar{x},ks^{2})}^{T}=\mathbf{h}(x)$ , where $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ , $s^{2}=\sum_{i=1}^{n}{\left(x_{i}-\bar{x}\right)}^{2}$ and $k\in{\rm I\!R}$ is any constant. Note for example that $k<0$ is allowed, so the auxiliary estimator of the variance may be negative. Indeed the SwiZs accepts situations for which $\bm{\Pi}\cap\bm{\Theta}=\emptyset$ ; this is clearly not the case for the parametric bootstrap, for example (see Remark 4). We have that

[TABLE]

An explicit solution exists for $d(\hat{\bm{\pi}}_{n},g(\mu,\sigma^{2},\mathbf{w}))=0$ in $(\mu,\sigma^{2})$ and is given by

[TABLE]

Note that $\bar{x}_{0}=\mu_{0}+\sigma_{0}w_{0,1}$ and $s^{2}_{0}=\sigma^{2}_{0}w_{0,2}$ . We obtain the following

[TABLE]

Therefore, by Proposition 21, any region built from the percentiles of the posterior distribution of $\hat{\bm{\theta}}_{n}$ has exact frequentist coverage. This posterior distribution has a closed form.

Note that $w_{1}\sim\mathcal{N}(0,1/n)$ . Once it is realized that $u_{i}-\frac{1}{n}\sum_{j=1}^{n}u_{j}\sim\mathcal{N}(0,(n-1)/n)$ , it is not difficult to obtain that $w_{2}\sim\Gamma(n/2,n/2(n-1))$ , a gamma random variable (shape-rate parametrization). It is straightforward to remark that

[TABLE]

where $\Gamma^{-1}$ represents the inverse gamma distribution. The joint distribution is known in the Bayesian literature as the normal-inverse-gamma distribution (see [83]). We thus have the following joint distribution

[TABLE]

The distribution of $\hat{\mu}$ unconditionally on $\hat{\sigma}^{2}$ is a non-standardized $t$ -distribution with $n$ degrees of freedom,

[TABLE]
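The coverage statement of Example 44 can be checked by simulation without using the closed-form posterior. The sketch below works directly from the relations $\bar{x}_{0}=\mu_{0}+\sigma_{0}w_{0,1}$ and $s^{2}_{0}=\sigma^{2}_{0}w_{0,2}$ : matching the auxiliary estimator gives $\hat{\sigma}^{2}=s_{0}^{2}/w_{2}$ and $\hat{\mu}=\bar{x}_{0}-\hat{\sigma}w_{1}$ (our own rearrangement, in which the constant $k$ cancels). Since both are location-scale transformations of pivots, the percentiles of the pivots are estimated once and reused across trials. Names and numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, mu0, sigma0 = 10, 1.0, 2.0        # illustrative values
M, S, alpha = 50_000, 100_000, 0.05  # trials, pivot draws, nominal error rate

def pivots(size):
    # w1: mean of n standard normals; w2: sum of squared deviations from that mean.
    u = rng.standard_normal((size, n))
    w1 = u.mean(axis=1)
    w2 = ((u - w1[:, None]) ** 2).sum(axis=1)
    return w1, w2

# Observed auxiliary estimators in each of the M trials.
w01, w02 = pivots(M)
xbar0 = mu0 + sigma0 * w01
s20 = sigma0**2 * w02

# SwiZs draws in pivotal form: mu_hat = xbar0 - sqrt(s20) * w1 / sqrt(w2),
# sigma2_hat = s20 / w2.
w1, w2 = pivots(S)
r_lo, r_hi = np.quantile(w1 / np.sqrt(w2), [alpha / 2, 1 - alpha / 2])
v_lo, v_hi = np.quantile(1.0 / w2, [alpha / 2, 1 - alpha / 2])

# Marginal percentile intervals and their empirical coverage.
mu_lo, mu_hi = xbar0 - np.sqrt(s20) * r_hi, xbar0 - np.sqrt(s20) * r_lo
cov_mu = np.mean((mu_lo <= mu0) & (mu0 <= mu_hi))
cov_var = np.mean((s20 * v_lo <= sigma0**2) & (sigma0**2 <= s20 * v_hi))
print(f"coverage for mu:      {cov_mu:.4f} (nominal {1 - alpha:.2f})")
print(f"coverage for sigma^2: {cov_var:.4f} (nominal {1 - alpha:.2f})")
```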

The results on the normal distribution (Example 44) can be generalized to the linear regression.

Example 45 (linear regression).

Consider the linear regression model $\mathbf{y}=\mathbf{X}\bm{\beta}+\bm{\epsilon}$ where $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{n})$ and $\dim(\bm{\beta})=p$ . Suppose the matrix $\mathbf{X}^{T}\mathbf{X}$ is of full rank. A natural generating function is $\mathbf{g}(\bm{\beta},\sigma^{2},\mathbf{X})=\mathbf{X}\bm{\beta}+\sigma\mathbf{u}$ where $\mathbf{u}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n})$ (see Example 1 for other suggestions). Take the ordinary least squares as the auxiliary estimator so we have the following explicit form:

[TABLE]

where $\mathbf{P}=\mathbf{I}_{n}-\mathbf{H}$ is the projection matrix, $\mathbf{H}=\mathbf{X}{\left(\mathbf{X}^{T}\mathbf{X}\right)}^{-1}\mathbf{X}^{T}$ is the hat matrix, $\mathbf{y}_{0}$ denotes the observed responses and $k\in{\rm I\!R}$ is any constant. Note that $\mathbf{P}$ and $\mathbf{H}$ are symmetric idempotent matrices and that $\mathbf{P}\mathbf{X}=\mathbf{0}$ . An explicit solution exists for $\hat{\bm{\theta}}_{n}={(\hat{\bm{\beta}}^{T}\,\hat{\sigma}^{2})}^{T}$ . To find it, we use the indirect inference estimator, which by Theorem 5 is equivalent to the SwiZs estimator. Using $\mathbf{y}\overset{d}{=}\mathbf{X}\bm{\beta}+\sigma\mathbf{u}$ , we have

[TABLE]

Since $\hat{\pi}_{2}(\bm{\theta})$ depends only on $\sigma^{2}$ , solving $d(\hat{\pi}_{2},\hat{\pi}_{2}(\bm{\theta}))=0$ in $\sigma^{2}$ leads to

[TABLE]

On the other hand, solving $d(\hat{\bm{\pi}}_{1},\hat{\bm{\pi}}_{1}(\bm{\theta}))=\mathbf{0}$ in $\bm{\beta}$ leads to

[TABLE]

Since $\mathbf{y}_{0}=\mathbf{X}\bm{\beta}_{0}+\sigma_{0}\mathbf{u}_{0}$ , we obtain the following:

[TABLE]

Note that at the third equality we use the fact that $\mathbf{u}\overset{d}{=}-\mathbf{u}$ since $\mathbf{u}$ is symmetric around $\mathbf{0}$ . The last development, together with Proposition 21, demonstrates that any region built on the percentiles of the distribution of $\hat{\bm{\theta}}_{n}$ leads to exact frequentist coverage probabilities. The distribution of $\hat{\bm{\theta}}_{n}$ can be obtained in an explicit form.

Since $\mathbf{P}$ is symmetric and idempotent, it is well known that $\mathbf{u}^{T}\mathbf{P}\mathbf{u}\sim\chi^{2}_{n-p}$ [see Theorem 5.1.1 84]. Hence we obtain that

[TABLE]

As shown in Example 44, it follows that the joint distribution of $\hat{\bm{\theta}}_{n}$ conditionally on $\hat{\bm{\pi}}_{n}$ is a normal-inverse-gamma distribution

[TABLE]

and the distribution of $\hat{\bm{\beta}}$ , unconditionally on $\hat{\sigma}^{2}$ , is a multivariate non-standardized $t$ distribution with $n-p$ degrees of freedom

[TABLE]

In this last example on the linear regression, we employed the OLS as the auxiliary estimator, which is known to be an unbiased estimator. In fact, it is not necessary to have an unbiased auxiliary estimator. The next example illustrates this point.

Example 46 (ridge regression).

Consider the same setup as in Example 45, $\mathbf{y}=\mathbf{X}\bm{\beta}+\bm{\epsilon}$ , $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{n})$ and $\operatorname{rank}(\mathbf{X}^{T}\mathbf{X})=p$ . Take the ridge estimator as the auxiliary estimator, so for the regression coefficients we have

[TABLE]

for some constant $\lambda\in{\rm I\!R}$ . Consider the squared residuals as an estimator of the variance, so after a few manipulations, we obtain

[TABLE]

where $\mathbf{P}_{\lambda}\equiv\mathbf{I}_{n}-\mathbf{H}_{\lambda}$ , $\mathbf{H}_{\lambda}\equiv\mathbf{X}{\left(\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I}_{p}\right)}^{-1}\mathbf{X}^{T}$ , and $k\in{\rm I\!R}$ is any constant. Note that $\mathbf{P}_{\lambda}$ is symmetric but not idempotent. As in Example 45, we use the indirect inference estimator with $\mathbf{y}\overset{d}{=}\mathbf{X}\bm{\beta}+\sigma\mathbf{u}$ . We obtain

[TABLE]

Let $\tilde{\bm{\beta}}$ denote the solution of $d(\hat{\bm{\pi}}_{1}^{R},\hat{\bm{\pi}}_{1}^{R}(\bm{\theta}))=0$ in $\bm{\beta}$ . We have the explicit solution given by

[TABLE]

Using $\tilde{\bm{\beta}}$ in $\hat{\pi}^{R}_{2}(\bm{\theta})$ leads to

[TABLE]

where $\mathbf{H}\equiv\mathbf{X}{\left(\mathbf{X}^{T}\mathbf{X}\right)}^{-1}\mathbf{X}^{T}$ and $\mathbf{P}\equiv\mathbf{I}_{n}-\mathbf{H}$ . We have the following identities: $\mathbf{H}\mathbf{H}_{\lambda}=\mathbf{H}_{\lambda}$ , $\mathbf{P}\mathbf{P}_{\lambda}=\mathbf{P}$ and $\mathbf{P}\mathbf{H}=\mathbf{0}$ . Finding $\tilde{\sigma}^{2}$ such that $d(\hat{\pi}_{2}^{R},\hat{\pi}_{2}^{R}(\tilde{\bm{\theta}}))=0$ gives

[TABLE]

which leads to the following solution:

[TABLE]

Therefore, $\tilde{\sigma}^{2}$ is the same as $\hat{\sigma}^{2}$ we found in Example 45, and we directly have that $\tilde{\bm{\beta}}=\hat{\bm{\beta}}$ . As a consequence, the distribution of $\tilde{\bm{\theta}}$ is exactly the same as $\hat{\bm{\theta}}_{n}$ in Example 45 and the frequentist coverage probabilities are exact.
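The matrix identities used in Example 46 are easy to verify numerically. The sketch below draws an arbitrary design matrix (dimensions and the value of $\lambda$ are illustrative assumptions) and checks that $\mathbf{H}\mathbf{H}_{\lambda}=\mathbf{H}_{\lambda}$ , $\mathbf{P}\mathbf{P}_{\lambda}=\mathbf{P}$ and $\mathbf{P}\mathbf{H}=\mathbf{0}$ , and that $\mathbf{P}_{\lambda}$ is not idempotent.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 30, 4, 2.5               # illustrative dimensions and ridge penalty
X = rng.standard_normal((n, p))      # arbitrary full-rank design matrix

I_n = np.eye(n)
H = X @ np.linalg.solve(X.T @ X, X.T)                        # hat matrix
H_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # ridge "hat" matrix
P, P_lam = I_n - H, I_n - H_lam

checks = {
    "H H_lam = H_lam": np.allclose(H @ H_lam, H_lam),
    "P P_lam = P": np.allclose(P @ P_lam, P),
    "P H = 0": np.allclose(P @ H, 0.0),
    "P_lam idempotent": np.allclose(P_lam @ P_lam, P_lam),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The first three checks hold for any $\lambda$ for which the inverse exists; the last one fails for $\lambda\neq 0$ , which is why the simplification in the example goes through $\mathbf{P}$ rather than $\mathbf{P}_{\lambda}$ .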

From Example 44 on the normal distribution, the extension to closely related distributions is straightforward, as we now see with the log-normal distribution.

Example 47 (log-normal with unknown mean and unknown variance).

Let $x_{i}\sim\log\text{-}\mathcal{N}(\mu,\sigma^{2})$ be identically and independently distributed and consider $g(\mu,\sigma^{2},u)=e^{\mu}e^{\sigma u}$ where $u\sim\mathcal{N}(0,1)$ . If we take the maximum likelihood estimator as the auxiliary estimator, we have

[TABLE]

The solution is the following

[TABLE]

where $w_{1}=\frac{1}{n}\sum_{i=1}^{n}u_{i}$ and $w_{2}=\sum_{i=1}^{n}{(u_{i}-\frac{1}{n}\sum_{j=1}^{n}u_{j})}^{2}$ . It is the same solution as in Example 44, hence the posterior distribution of $\hat{\bm{\theta}}_{n}$ is normal-inverse-gamma and any $\alpha$ -credible region built on this posterior has exact frequentist coverage.

Having illustrated the theory for random variables that are identically and independently distributed, we now show a last example on time series data. Note that variations of this example are numerically studied in [2].

Example 48 (irregularly observed geometric Brownian motion with unknown drift and unknown volatility).

Consider the stochastic differential equation

[TABLE]

where $\{W_{t}:t\geq 0\}$ is a Wiener process and $\bm{\theta}={(\mu\;\sigma^{2})}^{T}$ are the drift and volatility parameters. An explicit solution to Itô’s integral exists and is given by

[TABLE]

Suppose we observe the process at $n$ points in time: $t_{1}<t_{2}<\ldots<t_{n}$ , $\forall i$ $t_{i}\in{\rm I\!R}^{+}$ . Define the difference in time by $\Delta_{i}=t_{i}-t_{i-1}$ , so we have $n-1$ time differences. Note that all the time differences are positive, $\Delta_{i}>0$ , and we allow the process to be irregularly observed, $\Delta_{i}\neq\Delta_{j},i\neq j$ . Instead of working directly with the process $\{y_{t_{i}}:i\geq 1\}$ , it is more convenient to work with the following transformation of the process $\{x_{t_{i}}=\ln(y_{t_{i}}/y_{t_{i-1}}):i\geq 2\}$ . Indeed, we have

[TABLE]

By the properties of the Wiener process, we have $W_{t_{i}}-W_{t_{i-1}}\sim\mathcal{N}(0,\Delta_{i})$ and $W_{t_{i}}-W_{t_{i-1}}$ is independent from $W_{t_{j}}-W_{t_{j-1}}$ for $i\neq j$ . Hence the vector $\mathbf{x}={(x_{t_{2}}\;\dots\;x_{t_{n}})}^{T}$ is independently but non-identically distributed according to the joint normal distribution

[TABLE]

where $\bm{\Delta}={(\Delta_{2}\;\dots\;\Delta_{n})}^{T}$ and $\Sigma=\operatorname{diag}(\bm{\Delta})$ . Note that $\bm{\Delta}=\Sigma\mathbf{1}_{n-1}$ , where $\mathbf{1}_{n-1}$ is a vector of $n-1$ ones, and $\bm{\Delta}^{T}\mathbf{1}_{n-1}=\bm{\Delta}^{T/2}\bm{\Delta}^{1/2}$ since all the $\Delta_{i}$ are positive.

We consider the following auxiliary estimators:

[TABLE]

Since $\mathbf{x}\overset{d}{=}(\mu-\sigma^{2}/2)\bm{\Delta}+\sigma\Sigma^{1/2}\mathbf{z}$ , where $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_{n-1})$ , we obtain the following indirect inference estimators (or equivalently SwiZs),

[TABLE]

and

[TABLE]

Solving $d(\hat{\pi}_{1},\hat{\pi}_{1}(\hat{\bm{\theta}}))=0$ in $\hat{\mu}$ gives

[TABLE]

Now solving $d(\hat{\pi}_{2},\hat{\pi}_{2}(\hat{\bm{\theta}}))=0$ in $\hat{\sigma}^{2}$ and substituting $\hat{\mu}$ by the above expression in (3) leads to

[TABLE]

where $\mathbf{P}=\mathbf{I}_{n-1}-\bm{\Delta}^{1/2}{\left(\bm{\Delta}^{T/2}\bm{\Delta}^{1/2}\right)}^{-1}\bm{\Delta}^{T/2}$ is symmetric and idempotent, and $\mathbf{Q}=\Sigma^{-1}-\mathbf{1}_{n-1}{\left(\bm{\Delta}^{T/2}\bm{\Delta}^{1/2}\right)}^{-1}\mathbf{1}_{n-1}^{T}$ . By the properties of the rank of a matrix, we have $\operatorname{rank}(\mathbf{P})=\operatorname{trace}(\mathbf{P})=n-2$ . Note that by independence $\mathbf{z}^{T}\bm{\Delta}^{1/2}\overset{d}{=}z{(\bm{\Delta}^{T/2}\bm{\Delta}^{1/2})}^{1/2}$ , where $z$ is a single standard normal random variable. Similarly to the example on the linear regression (Example 45), we obtain the explicit distributions

[TABLE]

As with Example 45, these findings suggest that $\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}$ is jointly distributed according to a normal-inverse-gamma distribution. However, $\hat{\sigma}^{2}$ appears in the mean of $\hat{\mu}|(\hat{\bm{\pi}}_{n},\hat{\sigma}^{2})$ so such a conclusion is not straightforward. We leave the derivation of the joint distribution and the distribution of $\hat{\mu}$ unconditional on $\hat{\sigma}^{2}$ for further research.

We now demonstrate that the $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution of $\hat{\bm{\theta}}_{n}$ leads to exact frequentist coverage probabilities. Once it is realized that $\Sigma^{-1}=\Sigma^{-1/2}\Sigma^{-1/2}$ , $\Sigma^{1/2}\mathbf{1}_{n-1}=\bm{\Delta}^{1/2}$ , and $\bm{\Delta}^{T}\Sigma^{-1}=\mathbf{1}_{n-1}^{T}$ , it is not difficult to show that $\bm{\Delta}^{T}\mathbf{Q}\bm{\Delta}=0$ , $\bm{\Delta}^{T}\mathbf{Q}\Sigma^{1/2}=\mathbf{0}$ and $\Sigma^{1/2}\mathbf{Q}\Sigma^{1/2}=\mathbf{P}$ . Since $\mathbf{x}_{0}=(\mu_{0}-\sigma^{2}_{0}/2)\bm{\Delta}+\sigma_{0}\Sigma^{1/2}\mathbf{z}_{0}$ , we obtain

[TABLE]

Therefore,

[TABLE]

where $k_{0}=\sigma_{0}\sqrt{w_{0}}/2$ . Thus, any region on the joint distribution of $\hat{\bm{\theta}}_{n}$ leads to exact frequentist coverage by Proposition 21.
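The algebraic identities on $\mathbf{P}$ and $\mathbf{Q}$ used in Example 48 can be verified numerically for an arbitrary irregular grid of observation times. The grid below is drawn at random purely for illustration; the definitions of $\bm{\Delta}$ , $\Sigma$ , $\mathbf{P}$ and $\mathbf{Q}$ follow the text.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8                                          # number of observation times (illustrative)
t = np.cumsum(rng.uniform(0.1, 2.0, size=n))   # irregular times t_1 < ... < t_n
delta = np.diff(t)                             # the n - 1 positive time differences

ones = np.ones(n - 1)
d_half = np.sqrt(delta)                        # Delta^{1/2}
Sigma_half = np.diag(d_half)                   # Sigma^{1/2}
total = delta.sum()                            # Delta^{T/2} Delta^{1/2}

P = np.eye(n - 1) - np.outer(d_half, d_half) / total
Q = np.diag(1.0 / delta) - np.outer(ones, ones) / total

checks = {
    "Delta^T Q Delta = 0": np.isclose(delta @ Q @ delta, 0.0),
    "Delta^T Q Sigma^{1/2} = 0": np.allclose(delta @ Q @ Sigma_half, 0.0),
    "Sigma^{1/2} Q Sigma^{1/2} = P": np.allclose(Sigma_half @ Q @ Sigma_half, P),
    "P idempotent": np.allclose(P @ P, P),
    "trace(P) = n - 2": np.isclose(np.trace(P), n - 2),
    "rank(P) = n - 2": np.linalg.matrix_rank(P) == n - 2,
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```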

7 Simulation study

The main goal of this section is threefold. First, we illustrate the results of Section 4 on the frequentist properties in finite sample of the SwiZs in the general case where no solutions are known in explicit form, as opposed to Section 6, and which thus requires numerical solutions. In order to achieve this point, we measure at different levels the empirical coverage probabilities of the intervals built from the percentiles of the $\hat{\bm{\pi}}_{n}$ -approximate posterior obtained by the SwiZs. Note that for $\dim(\bm{\theta})>1$ , we only considered marginal intervals to avoid a supplementary layer of numerical nuisance; the coverage probabilities are not affected by this choice, only the length of the intervals. Second, we elaborate on the verification of the conditions of Theorem 28 with the examples at hand. As already motivated, the emphasis is on the estimating function. It seems easier to verify Assumption 26 than Assumption 27; since only one of them is necessary to satisfy Theorem 28, we concentrate our efforts on the former. We also broaden the study to situations where Assumption 26 does not entirely hold or cannot be verified, in order to measure the consequences empirically. Third, we give the general idea on how to implement the SwiZs. Indeed, anyone familiar with the numerical problem of solving a point estimator such as the maximum likelihood estimator has a very good idea on how to obtain the auxiliary estimator $\hat{\bm{\pi}}_{n}$ . Solving the estimating function for the parameters of interest is very similar: it requires the exact same tools but has the inconvenience of needing further analytical derivations and implementation details. As already remarked, the parametric bootstrap does not suffer from this inconvenience. The counterpart is that the SwiZs may lead to exact coverage probabilities. The motto “no pain, no gain” is particularly relevant here. For this purpose, the parametric bootstrap is proposed as the point of comparison for all the examples of this section. We measure the computational time as experienced by the user in order to appreciate the numerical burden. In case both the SwiZs and the parametric bootstrap have very similar coverage probabilities, we also quantify the length of the intervals as a means of comparison.
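To make the implementation idea concrete, the following minimal sketch solves a scalar problem numerically rather than in closed form. It is not the implementation used in the paper: it takes the setting of Example 42 (exponential with unknown rate), uses the indirect inference form of the estimator (matching the auxiliary estimator computed on simulated data to its observed value), and solves each matching equation with a hand-written bisection. The function names `aux_estimator`, `generate`, `bisect` and `swizs`, the bracketing interval and all numerical values are hypothetical choices made for illustration.

```python
import numpy as np

def aux_estimator(x):
    # Auxiliary estimator of Example 42: inverse of the sample mean.
    return 1.0 / x.mean()

def generate(theta, u):
    # Generating function g(theta, u) = u / theta with u ~ Gamma(1, 1).
    return u / theta

def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    # Plain bisection; assumes f(lo) and f(hi) have opposite signs.
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def swizs(pi_hat, n, S, rng, bracket=(1e-8, 1e4)):
    # For each draw s: simulate pivots u_s, then solve
    # aux_estimator(generate(theta, u_s)) = pi_hat in theta.
    draws = np.empty(S)
    for s in range(S):
        u = rng.exponential(size=n)
        draws[s] = bisect(lambda th: aux_estimator(generate(th, u)) - pi_hat, *bracket)
    return draws

rng = np.random.default_rng(7)
theta0, n, S = 2.0, 20, 500
x0 = generate(theta0, rng.exponential(size=n))   # observed sample
pi_hat = aux_estimator(x0)
draws = swizs(pi_hat, n, S, rng)
interval = np.quantile(draws, [0.025, 0.975])    # 95% percentile interval
print("pi_hat:", pi_hat, "interval:", interval)
```

In this example each numerical solution can be compared with the explicit one, $\hat{\theta}_{n}=w\hat{\pi}_{n}$ ; in the examples of this section no such explicit form is available and only the numerical route remains.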

As a subsidiary goal of this section, we study the point estimates of the SwiZs. Indeed, indirect inference is also a method for reducing the small sample bias of an initial (auxiliary) estimator, even in situations where it may be “unnatural” to call on such a method, as for example when a maximum likelihood estimator may be easily obtained (see [14]). Since the SwiZs is a special case of indirect inference, it would be interesting to gauge the ability of the SwiZs to correct the bias. We explore the properties of the mean and the median of the SwiZs. This choice is arbitrary but widely accepted.

There are common factors in the implementation of all the examples of this section so we start by mentioning them by category. For the design, we use $M=10,000$ independent trials so we can appreciate the coverage probabilities up to the fourth digit. We evaluate numerically the $\hat{\bm{\pi}}_{n}$ -approximate posterior distribution of the SwiZs and the parametric bootstrap distribution based on $S=10,000$ replicates. We measure the coverage probabilities at $50\%,75\%,90\%,95\%$ and $99\%$ levels. Although we sometimes do not report all of them for clarity of presentation, they are all shown in the Appendix for transparency.
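As a reading aid for the tables, the Monte Carlo variability of an empirical coverage computed from $M$ independent trials can be gauged with the usual binomial standard error $\sqrt{c(1-c)/M}$ . This is a standard calculation that is not part of the paper; the helper name `coverage_mc_se` is hypothetical.

```python
import math

def coverage_mc_se(level, M):
    # Binomial standard error of an empirical coverage with true value `level`,
    # estimated from M independent trials.
    return math.sqrt(level * (1.0 - level) / M)

M = 10_000
for level in (0.50, 0.75, 0.90, 0.95, 0.99):
    se = coverage_mc_se(level, M)
    print(f"level {level:.2f}: Monte Carlo s.e. = {100 * se:.2f} percentage points")
```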

We select five different scenarios. First, we start with a toy example by considering a standard Student’s $t$ -distribution with unknown degrees of freedom (Example 49). Although the Student distribution is ubiquitous in statistics since at least Gosset’s Biometrika paper ([85]), there is no simple tractable way to construct an interval of uncertainty around the degrees of freedom. In addition, the degrees of freedom is a parameter that gauges the tail of the distribution and is not particularly easy to handle. The existence of the moments of this distribution depends upon the values that this parameter takes. We take a particular interest in small values of this parameter, for which, for example, the variance or the kurtosis is infinite.

Example 49 (standard $t$ -distribution with unknown degrees of freedom).

Let $x_{i}\sim t(\theta)$ , $i=1,\cdots,n$ , be identically and independently distributed with density

[TABLE]

where $\theta$ represents the degrees of freedom and $\mathcal{B}$ is the beta function. We consider the likelihood score function as the estimating function and we take the MLE as the auxiliary estimator. In this situation, $\Theta$ and $\Pi$ are equivalent, and thus, there are no reasons to disqualify the parametric bootstrap. Substituting $\theta$ by $\pi$ in Equation 4, then taking the derivative of the log-density with respect to $\pi$ , leads to the following

[TABLE]

where $\psi$ is the digamma function. We now verify Assumption 26 so Theorem 28 can be invoked. Suppose Assumption 24 holds so we can write the following scalar-valued function

[TABLE]

where $\hat{\pi}_{n}$ is fixed. The first derivative with respect to $\theta$ is given by

[TABLE]

Substituting $(\partial/\partial\theta)g$ by $(\partial/\partial w)g$ gives the first derivative with respect to $w$ . The derivative exists everywhere so $K_{n}=\emptyset$ . Therefore, if the generating function $g(\theta,w)$ is once continuously differentiable in both its arguments then Assumption 26 (i) is satisfied.

The determinant here is $\lvert\frac{\partial}{\partial\theta}\varphi_{\hat{\pi}_{n}}(\theta,w)\rvert$ . It will be zero on a countable set of points: if $g(\theta,w)=0$ , if $(\partial/\partial\theta)g(\theta,w)=0$ or if the rightmost term of Equation 5 is 0. Substituting $(\partial/\partial\theta)g$ by $(\partial/\partial w)g$ gives the same analysis. Hence, the determinant of the derivatives of the estimating function is almost everywhere non-null and Assumption 26 (ii) is satisfied.

Finally, we clearly have that

[TABLE]

As a consequence, given that $\lim_{\lVert(\theta,w)\rVert\to\infty}\lvert g(\theta,w)\rvert=\infty$ , Assumption 26 (iii) is satisfied.

In the light of these findings, the choice of generating function is crucial and there are many candidates [see e.g. 86]. The inverse cumulative distribution function is a natural choice, but a numerically complicated one in this case. Indeed, it can be obtained by

[TABLE]

where $u_{1}\sim\mathcal{U}(0,1)$ and $z$ is equal to the incomplete beta function inverse parametrized by $\theta$ and depending on $u_{1}$ . An alternative choice, numerically and analytically simpler, is to consider Bailey’s polar algorithm [87], which is given by

[TABLE]

where $u_{2,2}\overset{d}{=}u_{2,1}^{2}+u_{2,3}^{2}$ if $u_{2,2}\leq 1$ and $u_{2,1},u_{2,3}\sim\mathcal{U}(-1,1)$ . Clearly $g_{2}(\theta,\mathbf{u}_{2})$ is once continuously differentiable in each of its arguments and the limit is $\lim_{(\theta,u_{2,1},u_{2,2})\to(\infty,1,1)}\lvert g_{2}(\theta,u_{2,1},u_{2,3})\rvert=\infty$ . Hence, even if $w$ is unknown, these results strongly suggest that the conditions of Theorem 28 hold and, consequently, that any interval built on the percentiles of the distribution of $\hat{\theta}_{n}$ given $\hat{\pi}_{n}$ has exact frequentist coverage.
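As an illustration of the polar generating function, the following Python sketch implements the usual textbook statement of Bailey's polar algorithm. It is assumed, not confirmed, that this matches the display above term by term; the function name `bailey_polar_t`, the seed and the number of draws are hypothetical choices made for the example.

```python
import math
import random

def bailey_polar_t(theta, rng):
    """Draw one Student t variate with `theta` degrees of freedom.

    Textbook form of Bailey's polar method (assumed to match the
    generating function g_2 discussed in the text).
    """
    while True:
        u = rng.uniform(-1.0, 1.0)   # plays the role of u_{2,1}
        v = rng.uniform(-1.0, 1.0)   # plays the role of u_{2,3}
        w = u * u + v * v            # plays the role of u_{2,2}
        if 0.0 < w <= 1.0:           # keep only points inside the unit disc
            return u * math.sqrt(theta * (w ** (-2.0 / theta) - 1.0) / w)

rng = random.Random(1)               # hypothetical seed
draws = [bailey_polar_t(6.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)
```

With $\theta=6$ the variance of the distribution is $\theta/(\theta-2)=1.5$, so `sample_var` should be in that neighbourhood. The map is smooth in $\theta$ for fixed uniforms, which is the property the differentiability argument relies on.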

The coverage probabilities in Table 1 below are computed for three different values of $\theta_{0}=\{1.5,3.5,6\}$ and a sample size of $n=50$ . When $\theta_{0}=1.5$ , the variance of a Student’s random variable is infinite and the skewness and kurtosis of the distribution are undefined. When $\theta_{0}=3.5$ , the variance is finite and the kurtosis is infinite. When $\theta_{0}=6$ , the first five moments exist.

The SwiZs is accurate at all the confidence levels, with a maximum discrepancy of 1.39% in absolute value. This is very reasonable considering the numerical task we perform. In comparison, the parametric bootstrap has a minimum discrepancy of $0.87\%$ and an average of $4.44\%$ . The SwiZs is also more efficient: it dominates the parametric bootstrap with a systematically smaller median interval length. The parametric bootstrap is, however, about six times faster than the SwiZs at computing the intervals. The comparison is not entirely fair to the SwiZs, as we were able here to use the log-likelihood directly for the parametric bootstrap, which is numerically simpler to evaluate than the estimating functions. We also include the bias-corrected and accelerated (BCa) resampling bootstrap of [88] in the comparison. The performance of this bootstrap scheme is comparable to that of the parametric bootstrap. Finally, in absolute terms, 0.2 seconds does not seem a high price for obtaining an interval that is nearly exact and shorter.

Second, we consider a more practical case with the two-parameter Lomax distribution ([89]) (Example 50), also known as the Pareto II distribution. This distribution has been used to characterise wealth and income distributions as well as business and actuarial losses (see [90] and the references therein). Because of this close relationship to applications, we also measure the coverage probabilities of the Gini index, the value-at-risk and the expected shortfall, quantities that may be of interest to the practitioner. The maximum likelihood estimator has been shown in [91] to suffer from small-sample bias when $n$ is relatively small and the parameters are close to the boundary of the parameter space. We add their proposal for bias adjustment to the basket of compared methods. To keep the comparison fair, we use a simulation scenario similar to the ones they proposed, which were also motivated by their closeness to situations encountered in practice. Situations where the Lomax distribution is employed have been shown to suffer from influential outliers since at least [92]. We therefore consider, in a second step, the weighted maximum likelihood ([93]) as the auxiliary estimator to gain robustness. Interestingly, the weighted maximum likelihood estimator is generally not a consistent estimator (see [94, 13]), so the parametric bootstrap cannot be invoked directly, whereas, on the contrary, the SwiZs may be employed without any particular care.

Example 50 (two-parameter Lomax distribution).

Let $x_{i}\sim\text{Lomax}(\bm{\theta})$ , $i=1,\cdots,n$ , $\bm{\theta}=(b,q)$ , be identically and independently distributed with density

[TABLE]

where $b,q>0$ are shape parameters. We consider the likelihood score function as the estimating function and we take the MLE as the auxiliary estimator. The parameter sets $\bm{\Theta}$ and $\bm{\Pi}$ are equivalent with this setup, and thus the parametric bootstrap may be employed. Substituting $\bm{\pi}$ for $\bm{\theta}$ in Equation 6 and then taking the derivative of the log-density with respect to $\bm{\pi}$ leads to the following

[TABLE]

We now verify Assumption 26 so Theorem 28 can be invoked. Suppose Assumption 24 on the existence of a random variable with the same dimensions as $\bm{\theta}$ holds, and denote it by $\mathbf{w}={(w_{1}\;w_{2})}^{T}$ . Now assume that we can re-express the estimating function as follows

[TABLE]

where $\hat{\bm{\pi}}_{n}$ is fixed. The Jacobian matrix with respect to $\bm{\theta}$ is given by

[TABLE]

where

[TABLE]

Note that $\hat{\bm{\pi}}_{n}$ and $g(\bm{\theta},\mathbf{w})$ are strictly positive, so $\kappa_{1}(\bm{\theta})<0$ and $\kappa_{2}(\bm{\theta})>0$ . Replacing $D_{\bm{\theta}}g$ by $D_{\mathbf{w}}g$ leads to the Jacobian matrix with respect to $\mathbf{w}$ , given by

[TABLE]

We see by inspection that the derivatives are defined everywhere and $\mathbf{K}_{n}=\{\emptyset\}$ . If $D_{\bm{\theta}}g$ and $D_{\mathbf{w}}g$ exist and are continuous, then Assumption 26 (i) is satisfied.

The determinants are given by

[TABLE]

where $\kappa(\bm{\theta},\mathbf{w})=\kappa_{1}(\bm{\theta})\kappa_{2}(\bm{\theta})$ and $\kappa(\bm{\theta},\mathbf{w})<0$ . The only scenarios where these determinants are zero are when all the partial derivatives are zero, or when $(\partial/\partial a)g(\bm{\theta},w_{1})(\partial/\partial b)g(\bm{\theta},w_{2})=(\partial/\partial a)g(\bm{\theta},w_{2})\;(\partial/\partial b)g(\bm{\theta},w_{1})$ . Since the Lomax random variables are absolutely continuous, it is impossible for the generating function to be flat on $\bm{\theta}$ and on $\mathbf{w}$ , except maybe in extreme cases. Therefore, situations where the determinants are zero are countable, and Assumption 26 (ii) is satisfied.

Suppose the generating function satisfies the following property:

[TABLE]

Since the limit of the natural logarithm tends to infinity when its argument diverges, we clearly have that

[TABLE]

and as a consequence, Assumption 26 (iii) is satisfied.

It remains to demonstrate that a generating function satisfies the above properties. A natural and computationally easy choice for the generating function is the inverse cdf, which is given by

[TABLE]

Clearly the generating function is once continuously differentiable in each $(b,q,u)$ . The only possibilities for the partial derivatives of $g$ to be zero are when $q=\{+\infty\}$ or $u=\{0\}$ . The generating function tends to infinity when $b$ diverges whereas it remains constant when $q$ or $u$ diverges. All these findings strongly suggest that Theorem 28 is applicable here and, consequently, that any interval built on the percentiles of the SwiZs distribution leads to exact frequentist coverage probabilities.
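The inverse-cdf generating function can be sketched in Python as follows. The parametrization is an assumption: the cdf is taken to be $F(x)=1-(1+x/b)^{-q}$, which is consistent with the remark that $g$ diverges with $b$, but the displayed density may use a different convention. The function names are hypothetical.

```python
import math

def lomax_cdf(b, q, x):
    """Assumed Lomax cdf: F(x) = 1 - (1 + x / b)^(-q) for x >= 0."""
    return 1.0 - (1.0 + x / b) ** (-q)

def lomax_inverse_cdf(b, q, u):
    """Generating function g(b, q, u) obtained by inverting the assumed cdf."""
    return b * ((1.0 - u) ** (-1.0 / q) - 1.0)

# Behaviour discussed in the text, at the simulation value theta_0 = (2, 2.3):
x_median = lomax_inverse_cdf(2.0, 2.3, 0.5)
g_at_zero = lomax_inverse_cdf(2.0, 2.3, 0.0)    # flat point u = 0 gives g = 0
g_large_b = lomax_inverse_cdf(2.0e6, 2.3, 0.5)  # grows linearly with b
g_large_q = lomax_inverse_cdf(2.0, 1.0e6, 0.5)  # collapses towards 0 as q grows
```

Under this assumed form, $g$ is linear in $b$ and smooth in $(q,u)$ on $(0,\infty)\times[0,1)$, which is what the differentiability claim requires.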

However, the situation is less optimistic with the weighted maximum likelihood. Indeed, the estimating function is typically modified as follows:

[TABLE]

where $\mathrm{w}(\bm{\theta},\mathbf{u},\bm{\pi},k)$ is some weight function typically taking values in $[0,1]$ that depends upon a tuning constant $k$ . Usual weight functions are Huber’s type ([95]) and Tukey’s biweight function ([96]); see [97] for a textbook on robust statistics. For an estimating function to be robust, the weight function either decreases to 0 or remains constant for large values of $x$ . As a consequence, at least two out of the three hypotheses of Assumption 26 do not hold. Indeed, the determinants will be zero on an uncountable set and $\lim_{\lVert(\bm{\theta},\mathbf{w})\rVert\to\infty}\widetilde{\bm{\Psi}}_{n}<\infty$ .
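To make the behaviour of the weights concrete, here is a minimal Python sketch of the two weight functions in their usual textbook form, written as functions of a scalar residual `r`. The scalar residual is a hypothetical stand-in for the argument $(\bm{\theta},\mathbf{u},\bm{\pi})$ used in the text, and the function names are not from the paper.

```python
def huber_weight(r, k):
    """Huber-type weight: 1 inside [-k, k], then decaying like k / |r|."""
    a = abs(r)
    return 1.0 if a <= k else k / a

def tukey_biweight(r, k):
    """Tukey biweight: smooth descent to exactly 0 at |r| = k and beyond."""
    a = abs(r)
    if a >= k:
        return 0.0
    return (1.0 - (a / k) ** 2) ** 2

# Weighted contributions w(r) * r for increasingly extreme residuals.
k = 2.0
huber_contrib = [huber_weight(r, k) * r for r in (1.0, 10.0, 1000.0)]
tukey_contrib = [tukey_biweight(r, k) * r for r in (1.0, 10.0, 1000.0)]
```

The Huber contribution stays at the constant `k` once the residual exceeds `k`, and the Tukey contribution is exactly zero there. A weighted estimating function of this kind is therefore bounded and flat over whole regions, which is how the zero determinants and the finite limit mentioned above arise.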

For the simulations, we set $\bm{\theta}_{0}={(2\;\;2.3)}^{T}$ and use $n=\{35,50,100,150,250,500\}$ as sample sizes. As already mentioned, this setup is close to the ones proposed in [91], and we thus add their proposal for correcting the bias of the maximum likelihood estimator to the basket of the compared methods. The bias-adjustment estimator is given by

[TABLE]

where

[TABLE]

and

[TABLE]

All the detailed simulation results are in Appendix D.1. In Figure 1, we see that the SwiZs has very accurate coverage probabilities at all levels and all sample sizes, which seems in accordance with Theorem 28 and the subsequent verification analysis for this example. For sample sizes greater than or equal to 250, the parametric bootstrap and the bias-adjustment proposal of [91] match the performance of the SwiZs at almost every level. However, below a sample size of 150, the performance of the bias adjustment is catastrophic. This may only be explained by the following phenomenon: the maximum likelihood estimator is adjusted too severely for small values of $n$ , and for a large proportion of the time the resulting bias-adjusted estimator is out of the parameter space $\bm{\Theta}$ . We report our empirical findings in Table 2. This phenomenon affects not only the coverage probabilities but also the variation of this estimator (Figure 3) and the length of the confidence intervals (Figure 2). Here we opted for discarding the inadmissible (negative) values, thereby artificially reducing the variance and the length of the confidence intervals of the bias adjustment. None of the other methods considered suffers from the positivity constraint on $\bm{\theta}$ , and thus we do not attempt to tackle this limitation of the bias-adjustment method.

The SwiZs has shorter uncertainty intervals than the parametric bootstrap; however, it is more computationally demanding (Figure 2). The computational comparison is not entirely fair to the SwiZs, as here we take advantage of the fact that the maximum likelihood estimator can be optimized directly on the log-likelihood, which is numerically easier to evaluate than the likelihood scores that constitute the estimating function. A pleasant surprise emerges from Figure 3, where it seems that taking the median of the SwiZs leads to almost median-unbiased point estimators. The same may be said when using the weighted maximum likelihood as the auxiliary estimator (Figure 5). However, using a robust estimator as the auxiliary estimator does not offer interesting coverage probabilities in small samples (Figure 4), which seems to indicate that Assumption 26 may not be easily relaxed. The parametric bootstrap unsurprisingly fails completely when considering an inconsistent estimator. Finally, the empirical distributions in Figure 6 remind us of the difficulty of estimating confidence regions.

Third, we investigate a linear mixed model. These models are very common in statistics as they incorporate both parameters associated with an entire population and parameters associated with individual experimental units, thereby facilitating the study of, for example, longitudinal data, multilevel data and repeated-measures data. Although these models are widespread, inference on the parameters remains a formidable task. We study a rather simple model, namely the random intercept and random slope model when data are balanced.

Example 51 (random intercept and random slope linear mixed model).

Consider the following balanced Gaussian mixed linear model expressed for the $i$ th individual as

[TABLE]

where $\bm{\epsilon}_{i},\alpha_{i}$ and $\gamma_{i}$ are identically and independently distributed according to centered Gaussian distributions with respective variances $\sigma^{2}_{\epsilon}\mathbf{I}_{m},\sigma^{2}_{\alpha}$ and $\sigma^{2}_{\gamma}$ , $m$ being the number of replicates, the same for each individual, and $\mathbf{1}_{m}$ is a vector of $m$ ones. The vector of parameters of interest is $\bm{\theta}={\left(\beta_{0},\beta_{1},\sigma^{2}_{\epsilon},\sigma^{2}_{\alpha},\sigma^{2}_{\gamma}\right)}^{T}$ . Let $\bm{\pi}={\left(\pi_{0},\ldots,\pi_{4}\right)}^{T}$ be the corresponding vector of auxiliary parameters. We take the MLE as the auxiliary estimator and thus consider the likelihood score function as the estimating function. With this setup, the parameter spaces $\bm{\Theta}$ and $\bm{\Pi}$ are equivalent, and the parametric bootstrap may be employed. Denote by $N=nm$ the total sample size. The negative log-likelihood may be expressed as

[TABLE]

for some constant $k$ and where $\bm{\Omega}_{i}(\bm{\theta})=\sigma^{2}_{\epsilon}\mathbf{I}_{m}+\sigma^{2}_{\alpha}\mathbf{1}_{m}\mathbf{1}^{T}_{m}+\sigma^{2}_{\gamma}\mathbf{x}_{i}\mathbf{x}^{T}_{i}$ is clearly a symmetric positive definite matrix. Taking the derivatives with respect to $\bm{\theta}$ , then replacing $\bm{\theta}$ by $\bm{\pi}$ and $\mathbf{y}_{i}$ by $\mathbf{g}(\bm{\theta},\mathbf{u}_{i})$ leads to

[TABLE]

where $\mathbf{z}(\bm{\theta},\mathbf{u}_{i},\bm{\pi})=\mathbf{g}(\bm{\theta},\mathbf{u}_{i})-\pi_{0}\mathbf{1}_{m}-\pi_{1}\mathbf{x}_{i}$ (see also [98] for more details on these derivations). The derivatives of $\bm{\Omega}_{i}(\bm{\pi})$ are easily obtained: $(\partial/\partial\pi_{2})\bm{\Omega}_{i}(\bm{\pi})=\mathbf{I}_{m}$ , $(\partial/\partial\pi_{3})\bm{\Omega}_{i}(\bm{\pi})=\mathbf{1}_{m}\mathbf{1}^{T}_{m}$ and $(\partial/\partial\pi_{4})\bm{\Omega}_{i}(\bm{\pi})=\mathbf{x}_{i}\mathbf{x}^{T}_{i}$ . Since they do not depend on parameters, we denote $(\partial/\partial\pi_{j})\bm{\Omega}_{i}(\bm{\pi})\equiv\mathbf{D}_{ij}$ .

We now motivate the possibility to employ Theorem 28 by verifying Assumption 26. First, we suppose that a random variable $\mathbf{w}$ of the same dimension as $\bm{\theta}$ exists. Then, we assume that the estimating function may be re-expressed as follows:

[TABLE]

where $\mathbf{z}_{i}(\bm{\theta},w_{j},\hat{\bm{\pi}}_{N})=\mathbf{g}(\bm{\theta},w_{j})-\hat{\pi}_{0}\mathbf{1}_{m}-\hat{\pi}_{1}\mathbf{x}_{i}$ , $j=0,1,2,3,4$ , and $\hat{\bm{\pi}}_{N}$ is fixed. The Jacobian matrix with respect to $\bm{\theta}$ is given by

[TABLE]

Replacing $D_{\bm{\theta}}\mathbf{g}^{T}$ by $D_{\mathbf{w}}\mathbf{g}^{T}$ in the above immediately delivers the Jacobian matrix with respect to $\mathbf{w}$ . Note that this second Jacobian is a diagonal matrix. Clearly, the differentiability and continuity of $\bm{\varphi}_{\hat{\bm{\pi}}_{N}}$ depend exclusively upon the differentiability and continuity of $\mathbf{g}$ . Ergo, if $D_{\bm{\theta}}\mathbf{g}$ and $D_{\mathbf{w}}\mathbf{g}$ exist and are continuous, then Assumption 26 (i) holds.

These Jacobian matrices may have a null determinant under two circumstances: when the generating function $\mathbf{g}$ is flat on $\bm{\theta}$ and/or $\mathbf{w}$ , and/or when their columns are linearly dependent. Since the Normal distribution is absolutely continuous, $\mathbf{g}$ may be flat only in extreme cases. The Jacobian $D_{\mathbf{w}}\bm{\varphi}_{\hat{\bm{\pi}}_{N}}$ is a diagonal matrix, so its determinant is null if and only if one of its diagonal elements is null. Since both the design and $\hat{\bm{\pi}}_{N}$ are fixed, situations where $D_{\bm{\theta}}\bm{\varphi}_{\hat{\bm{\pi}}_{N}}$ is linearly dependent may occur if the vectors $(\partial/\partial\theta_{j})\mathbf{g}(\bm{\theta},\mathbf{w})=k(\partial/\partial\theta_{j^{\prime}})\mathbf{g}(\bm{\theta},\mathbf{w}),j\neq j^{\prime},$ for some constant $k\in{\rm I\!R}$ . But because $\mathbf{w}$ is random, this situation is unlikely to occur, and, depending on $\mathbf{g}$ , Assumption 26 (ii) is plausible.

Finally, it clearly holds that

[TABLE]

if $\lVert\mathbf{g}(\bm{\theta},\mathbf{w})\rVert\to\infty$ as $\lVert(\bm{\theta},\mathbf{w})\rVert\to\infty$ , so Assumption 26 (iii) is satisfied given that $\mathbf{g}$ fulfills the requirement.

Once again, the plausibility of Assumption 26 is up to the choice of the generating function. A popular choice is the following:

[TABLE]

where $\mathbf{C}_{i}(\bm{\theta})$ is the lower triangular Cholesky factor such that $\mathbf{C}_{i}(\bm{\theta})\mathbf{C}_{i}^{T}(\bm{\theta})=\bm{\Omega}_{i}(\bm{\theta})$ . It is straightforward to remark that $\mathbf{g}$ is once continuously differentiable in $\beta_{0},\beta_{1}$ and $\mathbf{u}_{i}$ . For the variance components, the partial derivatives of the Cholesky factor are given by Theorem A.1 in [99]:

[TABLE]

where the function $L$ returns the lower triangular and half of the diagonal elements of the input matrix, that is:

[TABLE]

The partial derivatives of the covariance matrix are given by: $(\partial/\partial\sigma^{2}_{\epsilon})\bm{\Omega}_{i}(\bm{\theta})=\mathbf{I}_{m}$ , $(\partial/\partial\sigma^{2}_{\alpha})\bm{\Omega}_{i}(\bm{\theta})=\mathbf{1}_{m}\mathbf{1}_{m}^{T}$ and $(\partial/\partial\sigma^{2}_{\gamma})\bm{\Omega}_{i}(\bm{\theta})=\mathbf{x}_{i}\mathbf{x}_{i}^{T}$ . Hence, $\mathbf{C}_{i}(\bm{\theta})$ is once differentiable. For the continuity of the partial derivative of $\mathbf{C}_{i}(\bm{\theta})$ , note that $\mathbf{C}_{i}(\bm{\theta})$ and $\mathbf{C}^{-1}_{i}(\bm{\theta})$ are once differentiable and thus continuous. Indeed, $(\partial/\partial\theta_{j})\mathbf{C}^{-1}_{i}(\bm{\theta})=-\mathbf{C}^{-1}_{i}(\bm{\theta})[(\partial/\partial\theta_{j})\mathbf{C}_{i}(\bm{\theta})]\mathbf{C}_{i}^{-1}(\bm{\theta})$ . Finally, $(\partial/\partial\theta_{j})\bm{\Omega}_{i}(\bm{\theta})$ is constant in $\bm{\theta}$ , and therefore continuous. Since the matrix product preserves continuity, the Cholesky factor is once continuously differentiable. The partial derivatives of $\mathbf{g}$ may be zero if the design is null or if the pivotal quantity is zero, two extreme situations unlikely to be encountered. It is straightforward to remark that the estimating function diverges as $\bm{\theta}$ and $\mathbf{u}_{i}$ tend to infinity. All these findings make usage of Theorem 28 highly plausible.
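The Cholesky-based generating function can be sketched in plain Python as follows. It is assumed here that the generating function reads $\mathbf{g}(\bm{\theta},\mathbf{u}_{i})=\beta_{0}\mathbf{1}_{m}+\beta_{1}\mathbf{x}_{i}+\mathbf{C}_{i}(\bm{\theta})\mathbf{u}_{i}$ with $\mathbf{u}_{i}$ standard Gaussian; the design vector `x_i`, the seed and all function names are hypothetical.

```python
import math
import random

def omega(theta, x):
    """sigma_eps^2 I_m + sigma_alpha^2 1 1^T + sigma_gamma^2 x x^T as nested lists."""
    s2_eps, s2_alpha, s2_gamma = theta[2], theta[3], theta[4]
    m = len(x)
    return [[(s2_eps if r == c else 0.0) + s2_alpha + s2_gamma * x[r] * x[c]
             for c in range(m)] for r in range(m)]

def cholesky_lower(a):
    """Lower triangular factor C with C C^T = a (a symmetric positive definite)."""
    m = len(a)
    c = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(a[i][i] - s)
            else:
                c[i][j] = (a[i][j] - s) / c[j][j]
    return c

def generate(theta, x, u):
    """Assumed generating function: beta0 + beta1 * x + C(theta) u."""
    beta0, beta1 = theta[0], theta[1]
    c = cholesky_lower(omega(theta, x))
    return [beta0 + beta1 * x[r] + sum(c[r][k] * u[k] for k in range(r + 1))
            for r in range(len(x))]

theta0 = (1.0, 0.5, 0.5 ** 2, 0.5 ** 2, 0.2 ** 2)  # simulation value from the text
x_i = [0.0, 1.0, 2.0, 3.0, 4.0]                     # hypothetical design, m = 5
rng = random.Random(0)                              # hypothetical seed
u_i = [rng.gauss(0.0, 1.0) for _ in x_i]
y_i = generate(theta0, x_i, u_i)
```

Holding `u_i` fixed and varying `theta` changes `y_i` smoothly through $\beta_{0}$, $\beta_{1}$ and the Cholesky factor, which is the property used in the continuity argument above.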

Let us turn our attention to simulations. We set $\bm{\theta}_{0}=(1,0.5,0.5^{2},0.5^{2},0.2^{2})^{T}$ and considered $n=m=\{5,10,20,40\}$ such that $N=nm=\{25,\;100,\;400,\;1600\}$ . The detailed simulation results may be found in the tables of Appendix D.2. In Figure 7, we can observe the outstanding performance of the SwiZs in terms of coverage probabilities, which supports our analysis and the possibility of using Theorem 28. The parametric bootstrap matches the performance of the SwiZs as the sample size increases; however, when the sample size is small, it is off the ideal level for the variance components. The lengths of the marginal intervals of uncertainty are comparable between the two methods, except for the smallest sample size considered, where it is in any case harder to interpret the size of the parametric bootstrap interval since it is off the confidence level. We also include profile likelihood confidence intervals, which are based on the likelihood ratio test, in the comparison. The coverage probabilities are almost indistinguishable from those of the SwiZs, whereas interval lengths for variance components are the shortest. We interpret such good performance as follows: first, as shown in Example 45 on linear regression, asymptotic and finite-sample distributions coincide in theory, a coincidence that may still hold in the present case of a balanced linear mixed model; second, larger intervals account for the fact that no simulations are needed. A pleasant surprise appears in Figure 8, where the median of the SwiZs shows good performance in terms of relative median bias.

Fourth, we study inference in queueing theory models (see [100] for a monograph). In particular, we re-investigate the M/G/1 model studied by [12, 101, 52]. Although the underlying process is relatively simple, there is no known closed-form for the likelihood function and inference is not easy to conduct.

Example 52 (M/G/1-queueing model).

Consider the following stochastic process

[TABLE]

for $i=1,\cdots,n,$ where $\sigma^{\varepsilon}_{i}=\sum_{j=1}^{i}\varepsilon_{j}$ , $\sigma^{x}_{i}=\sum_{j=1}^{i}x_{j}$ , $v_{i}$ is identically and independently distributed according to a uniform distribution $\mathcal{U}(\theta_{1},\theta_{2})$ , $0\leq\theta_{1}<\theta_{2}<\infty$ and $\varepsilon_{i}$ is identically and independently distributed according to an exponential distribution $\mathcal{E}(\theta_{3})$ , $\theta_{3}>0$ . In queueing theory, the random variables have a special meaning for the $i$ th customer: $x_{i}$ represents the interdeparture time, $v_{i}$ is the service time and $\varepsilon_{i}$ corresponds to the interarrival time. Only the interdeparture times $x_{i}$ are observed; $v_{i}$ and $\varepsilon_{i}$ are latent. All past information influences the current observation and therefore this process is not Markovian. Finding an “appropriate” auxiliary estimator is challenging, as we now discuss.

In this context, semi-automatic ABC approaches by [101] and [52] use several quantiles as summary statistics for the auxiliary estimator. This method cannot be employed here for the SwiZs because, first, the restriction that $\dim(\bm{\theta})=\dim(\bm{\pi})$ would be violated, and second, the quantiles are non-differentiable with respect to $\mathbf{g}$ and consequently, as already discussed, Assumptions 26 and 27 would not hold. However, [12] present different choices and motivate a particular auxiliary model with the following closed form:

[TABLE]

where $-1\leq\alpha\leq 1$ is some constant. Motivations for this auxiliary model are based on a graphical analysis of the sensitivity of $\hat{\bm{\pi}}_{n}(\bm{\theta})$ with respect to $\bm{\theta}$ and the root mean squared errors performances of $\hat{\bm{\theta}}_{n}$ on simulations. Unfortunately, Assumption 26 is not satisfied with this choice. Indeed, by taking the likelihood scores of the auxiliary model as the estimating equation, one can realize that the score relative to $\pi_{2}$ is

[TABLE]

hence, it does not depend on $\bm{\theta}$ ! This result implies directly that all the partial derivatives with respect to $\bm{\theta}$ and $\mathbf{w}$ are null and $\det(\bm{\varphi}_{\hat{\bm{\pi}}_{n}})=0$ for all $(\bm{\theta},\mathbf{w})\in(\bm{\Theta}_{n}\times W_{n})$ . Assumption 27 is also violated and Theorem 28 cannot be invoked. Worse, the behaviour of this score does not depend on $n$ and the identifiability condition in Assumption 32 (ii) does not hold since $\Phi_{2}(\bm{\theta}_{1},\bm{\pi})=\Phi_{2}(\bm{\theta}_{2},\bm{\pi})$ for all $(\bm{\theta}_{1},\bm{\theta}_{2})\in\bm{\Theta}$ , so using this auxiliary model does not lead to a consistent estimator. It is however not clear whether Assumption 33, the alternative to Assumption 32, holds or not because the quantities to verify are unknown. Note however that in view of the equivalence theorem between the SwiZs and the indirect inference estimator (Theorem 5), it would appear as a contradiction for Assumption 32 not to hold but Assumption 33 to be satisfied.

The idea of [12] is to select an auxiliary model where $\hat{\bm{\pi}}_{n}(\bm{\theta})$ is both sensitive to $\bm{\theta}$ and efficient for a given $\bm{\theta}$ . Since they justify their choice by a graphical analysis with simulated samples, one may wonder whether the authors were unlucky or misled by the graphics in this particular example. In fact, although $\hat{\bm{\pi}}_{n}(\bm{\theta})$ is unknown in an explicit form, its Jacobian may be derived explicitly by means of an implicit function theorem, so for a given $\bm{\theta}_{1}\in\bm{\Theta}$ we have:

[TABLE]

The Jacobian $D_{\bm{\pi}}\bm{\Psi}_{n}$ is non-zero. Yet, as already discussed, the second partial derivative of $\bm{\Psi}_{n}$ with respect to $\bm{\theta}$ is null. Because only the second row of $D_{\bm{\theta}}\bm{\Psi}_{n}$ has zero entries, there is no reason to believe that $D_{\bm{\theta}}\hat{\bm{\pi}}_{n}(\bm{\theta})$ has zero entries. Consequently, the authors were neither misled by the graphics nor unlucky; it is the criterion itself that is misleading.

We now face the delicate task of choosing an auxiliary model which not only respects the constraint $\dim(\bm{\theta})=\dim(\bm{\pi})$ , but also makes Assumption 26 plausible. In view of this particular M/G/1 stochastic process, using the convolution between a gamma distribution with shape parameter $n$ and unknown rate parameter and a uniform distribution may be a “natural” choice, yet terms that are computationally complicated to evaluate readily appear. We propose instead to use Fréchet’s three-parameter extreme value distribution, whose density is given, for $i=1,\ldots,n$ , by:

[TABLE]

where $\pi_{1}>0$ is a shape parameter, $\pi_{2}>0$ is a scale parameter and $\pi_{3}\in{\rm I\!R}$ is a parameter representing the location of the minimum. The relationship between $\pi_{3}$ and $\theta_{1}$ as the minimum of the distribution seems natural and we thus further constrain $\pi_{3}$ here to be non-negative, so $\bm{\pi}>0$ . The existence of a potential link between ${(\theta_{2},\theta_{3})}^{T}$ and ${(\pi_{1},\pi_{2})}^{T}$ is not self-evident, but the shape ( $\pi_{1}$ ) and scale ( $\pi_{2}$ ) parameters certainly offer enough flexibility to “encompass” the distribution of the M/G/1 stochastic process, as illustrated in Figure 9. Note that the “closeness” between the M/G/1 and Fréchet models also depends on the parametrization.
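For concreteness, the following Python sketch evaluates the three-parameter Fréchet log-density in its standard form, $f(x)=(\pi_{1}/\pi_{2})\,z^{-1-\pi_{1}}\exp(-z^{-\pi_{1}})$ with $z=(x-\pi_{3})/\pi_{2}$ and $x>\pi_{3}$. This form is assumed to agree with the density displayed above; the function and argument names are hypothetical.

```python
import math

def frechet_cdf(x, shape, scale, loc):
    """Standard three-parameter Frechet cdf: exp(-z^(-shape)), z = (x - loc) / scale."""
    if x <= loc:
        return 0.0
    z = (x - loc) / scale
    return math.exp(-z ** (-shape))

def frechet_logpdf(x, shape, scale, loc):
    """Log-density matching the cdf above; -inf outside the support x > loc."""
    if x <= loc:
        return -math.inf
    z = (x - loc) / scale
    return math.log(shape / scale) - (1.0 + shape) * math.log(z) - z ** (-shape)

def frechet_loglik(xs, shape, scale, loc):
    """Log-likelihood of an i.i.d. sample; its gradient is the likelihood score."""
    return sum(frechet_logpdf(x, shape, scale, loc) for x in xs)

# Hypothetical parameter values, location fixed at 0.3 purely for illustration.
value = frechet_loglik([0.8, 1.1, 1.5, 2.4], 2.0, 1.0, 0.3)
```

Any observation at or below the location parameter sends the log-likelihood to $-\infty$, which is one reason the location is naturally tied to the minimum of the data.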

It remains to advocate this choice in the light of Assumption 26. We take the maximum likelihood estimator of Fréchet’s distribution as the auxiliary estimator and thus the likelihood score as the estimating function, which is given by:

[TABLE]

Let us assume that a random variable $\mathbf{w}$ with the same dimension as $\bm{\theta}$ exists such that the estimating function may be expressed as follows:

[TABLE]

where $\hat{\bm{\pi}}_{n}$ is fixed and $z_{i}\equiv\frac{\mathbf{g}(\bm{\theta},w_{i})-\hat{\pi}_{3}}{\hat{\pi}_{2}}$ , $i=1,2,3$ . The Jacobian matrix with respect to $\bm{\theta}$ is given by:

[TABLE]

Replacing $D_{\bm{\theta}}\mathbf{g}^{T}$ by $D_{\mathbf{w}}\mathbf{g}^{T}$ in the above equation gives the Jacobian matrix with respect to $\mathbf{w}$ , a matrix which is diagonal. It is straightforward to remark that the differentiability and continuity depend exclusively on the smoothness of $\mathbf{g}$ . Thus, if $\mathbf{g}$ is once continuously differentiable in both $\bm{\theta}$ and $\mathbf{w}$ , then Assumption 26 (i) holds.

Concerning the determinants of these Jacobian matrices, they may be null only in unlikely situations: first, if $\mathbf{g}$ equals $\hat{\pi}_{3}$ then $z_{i}$ is zero for $i=1,2,3$ ; second, if $D_{\bm{\theta}}\mathbf{g}$ or $D_{\mathbf{w}}\mathbf{g}$ are zero. The choice of $\mathbf{g}$ may be guided by this restriction, so typically the determinants may be null, but only on a countable set, and Assumption 26 (ii) is verified. For Assumption 26 (iii), it is straightforward to remark that

[TABLE]

as long as $\lim_{\lVert(\bm{\theta},\mathbf{w})\rVert\to\infty}\lVert\mathbf{g}(\bm{\theta},\mathbf{w})\rVert=\infty$ , since $\log(z_{1})$ would diverge. Depending on $g$ , Assumption 26 (iii) is satisfied.

Therefore, the plausibility of Assumption 26 is up to the choice of the generating equation $g$ . Here, the choice is quasi immediate as it is driven by the form of the process:

[TABLE]

where $\mathbf{u}_{i}={(u_{1i},u_{2i})}^{T}$ , $u_{ji}\sim\mathcal{U}(0,1)$ , $j=1,2$ , $u_{1i}$ and $u_{2i}$ are independent, $v_{i}(\bm{\theta})\overset{d}{=}\theta_{1}+(\theta_{2}-\theta_{1})u_{1i}$ , $\sigma^{\varepsilon}_{i}(\bm{\theta})=\sum_{j=1}^{i}{\varepsilon}_{j}(\bm{\theta})$ , $\varepsilon_{j}(\bm{\theta})=-\theta_{3}^{-1}\log(u_{2j})$ and $\sigma^{g}_{i}=\sum_{j=1}^{i}g(\bm{\theta},\mathbf{u}_{j})$ . Let $E_{i}$ correspond to the event $\{\sigma^{\varepsilon}_{i}(\bm{\theta})\leq\sigma^{g}_{i-1}(\bm{\theta})\}$ and $\bar{E}_{i}$ be its complement. The partial derivatives may be found recursively as follows:

[TABLE]

Clearly $g$ is once continuously differentiable in both its arguments with non-zero derivatives. Finally, we have that $v_{i}(\bm{\theta})$ goes to $\infty$ when $\theta_{1}\to\infty$ , $\theta_{2}\to\infty$ and $u_{1i}\to 1$ , whereas $\varepsilon_{i}(\bm{\theta})$ tends to zero whenever $\theta_{3}\to\infty$ and $u_{2i}\to 1$ . It is not clear whether $v_{i}(\bm{\theta})+\sigma^{\varepsilon}_{i}(\bm{\theta})-\sigma^{g}_{i}(\bm{\theta})$ diverges or converges to 0 when $\lVert(\bm{\theta},\mathbf{u}_{i})\rVert\to\infty$ , but in any case $\lVert g(\bm{\theta},\mathbf{u}_{i})\rVert$ tends to $\infty$ since $v_{i}(\bm{\theta})$ diverges. As a consequence, Assumption 26 is highly plausible and thus Theorem 28 seems invokable.
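The recursive structure of this generating function can be sketched in Python as follows. It assumes the usual M/G/1 recursion, $x_{i}=v_{i}$ on $E_{i}$ and $x_{i}=v_{i}+\sigma^{\varepsilon}_{i}-\sigma^{g}_{i-1}$ otherwise, which is consistent with the definitions of $v_{i}(\bm{\theta})$, $\varepsilon_{j}(\bm{\theta})$ and $E_{i}$ above; the function name and the seed are hypothetical.

```python
import math
import random

def mg1_generate(theta, u1, u2):
    """Interdeparture times of an M/G/1 queue driven by fixed uniforms u1, u2."""
    theta1, theta2, theta3 = theta
    x = []
    arrival = 0.0     # sigma^eps_i: arrival time of customer i
    departure = 0.0   # sigma^g_{i-1}: departure time of the previous customer
    for a, b in zip(u1, u2):
        service = theta1 + (theta2 - theta1) * a   # v_i(theta)
        arrival += -math.log(b) / theta3           # adds eps_i(theta)
        idle = max(arrival - departure, 0.0)       # zero on the event E_i
        x_i = service + idle
        departure += x_i
        x.append(x_i)
    return x

rng = random.Random(0)                         # hypothetical seed
n = 100
u1 = [rng.random() for _ in range(n)]
u2 = [1.0 - rng.random() for _ in range(n)]    # in (0, 1], keeps the log finite
x = mg1_generate((0.3, 0.9, 1.0), u1, u2)      # theta_0 and n used in the simulation
```

Keeping `u1` and `u2` fixed while varying `theta` is what makes the simulated sample a deterministic function of $\bm{\theta}$, as the SwiZs requires. The `max` switches branch exactly when the event $E_{i}$ changes, which is why the derivatives are obtained recursively and case by case.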

For the simulation, we set $\bm{\theta}_{0}={[0.3\;0.9\;1]}^{T}$ and $n=100$ as in [12]. We compare the SwiZs with indirect inference in Definition 3 and the parametric bootstrap using the indirect inference with $B=1$ as the initial consistent estimator (see Definition 6). By Theorem 5, the SwiZs and the indirect inference are equivalent but, as argued, the computational price of indirect inference is higher, so here we seek empirical evidence. Table 3 speaks for itself: the difference is indeed enormous. The parametric bootstrap is even worse in terms of computational time. It may be useful to remind the reader that the comparison is fair: all three methods benefit from the same level of implementation and use the very same technology.

The complete results may be found in Appendix D.3. In Figure 10 we see that the SwiZs does not offer exact coverage in this case; it is even far from ideal for $\hat{\theta}_{2}$ . It is nonetheless better than the parametric bootstrap. In particular, the coverages of $\hat{\theta}_{1}$ and $\hat{\theta}_{3}$ are close to the ideal level. Considering the context of this simulation (moderate sample size, no closed form for the likelihood), the results are very encouraging. A pleasant surprise appears in Figure 11, where the point estimates of the SwiZs (mean and median) perform better than indirect inference approaches in terms of absolute median bias and mean absolute deviation.

It is, however, not clear whether the failure to reach exact coverage should be blamed on our analysis of the applicability of Theorem 28 to this case, on the numerical optimization procedure, or on both. The previous examples seem to point to the latter. To investigate, we re-run the same experiment for the SwiZs only (for purely operational reasons), changing the starting values to the true parameter $\bm{\theta}_{0}$ to measure the implication. Indeed, starting values are a sensitive matter for quasi-Newton routines and, since $\hat{\bm{\pi}}_{n}$ is not a consistent estimator of $\bm{\theta}_{0}$ , using it as a starting value might have a persistent influence on the sequence $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}^{+}_{S}\}$ . Results are reported in a table in Appendix D.3. The coverage probabilities of $\hat{\theta}_{1}$ and $\hat{\theta}_{3}$ become nearly perfect, which shows that good starting values may indeed reduce the numerical error in the coverage probabilities. However, the coverage probability for $\hat{\theta}_{2}$ persistently stays off the desired levels, which seems rather to indicate a problem related to the applicability of Theorem 28. Increasing the sample size to $n=1000$ (see Table 19) makes the coverage of all three parameters nearly perfect.

Fifth and last, we consider logistic regression. This is certainly one of the most widely used statistical models in practice. This case is challenging in at least two respects. First, the random variable is discrete and the finite sample theory in Section 4 does not hold. Second, the generating function is non-differentiable with respect to $\bm{\theta}$ , therefore gradient-based optimization routines cannot be employed. In what follows, we circumvent this inconvenience by smoothing the generating function. To this end, we start by introducing the continuous latent representation of the logistic regression.

Example 53.

Suppose we have the model

[TABLE]

where $\bm{\epsilon}=\left(\epsilon_{1},\cdots,\epsilon_{n}\right)^{T}$ and $\epsilon_{i}$ , $i=1,\cdots,n$ , are identically and independently distributed according to a logistic distribution with mean 0 and unity variance. This distribution belongs to symmetric location-scale families. It is similar to the Gaussian distribution with heavier tails. The unknown parameters $\bm{\theta}$ of this model could be easily estimated by ordinary least squares:

[TABLE]

The corresponding estimating function is:

[TABLE]

A straightforward generating function is $\mathbf{g}(\bm{\theta},\mathbf{u})=\mathbf{X}\bm{\theta}+\mathbf{u}$ where $u_{i}\sim\emph{Logistic}(0,1)$ . Evaluating this function at $\bm{\pi}=\hat{\bm{\pi}}_{n}$ leads to

[TABLE]

Solving for the root of this function in $\bm{\theta}$ gives the following explicit solution:

[TABLE]

Following Example 45 on linear regression, it is easy to show that inference based on the distribution of this estimator leads to exact frequentist coverage probabilities.
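As an illustration of this latent linear case, the following minimal sketch assumes that the estimating equation is the usual least squares one, $\mathbf{X}^{T}(\mathbf{X}\bm{\theta}+\mathbf{u}_{s})-\mathbf{X}^{T}\mathbf{X}\hat{\bm{\pi}}_{n}=\mathbf{0}$ , so that each SwiZs draw is $\hat{\bm{\pi}}_{n}-(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{u}_{s}$ . The function name, the design matrix, the sample size and the number of draws are hypothetical choices made for the example, not the settings of the paper.

```python
import numpy as np

def swizs_latent_logistic(X, y, S, rng):
    """Draw S SwiZs solutions for the latent linear model y = X theta + eps."""
    XtX = X.T @ X
    pi_hat = np.linalg.solve(XtX, X.T @ y)          # OLS auxiliary estimate
    # Simulated errors u_s ~ Logistic(0, 1), one row per draw s.
    U = rng.logistic(loc=0.0, scale=1.0, size=(S, X.shape[0]))
    # Root in theta of X^T (X theta + u_s) - X^T X pi_hat = 0 (assumed form).
    shifts = np.linalg.solve(XtX, X.T @ U.T).T      # shape (S, p)
    return pi_hat, pi_hat - shifts, U

# Hypothetical data: intercept plus two Gaussian covariates.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta0 = np.array([1.0, -0.5, 2.0])
y = X @ theta0 + rng.logistic(size=n)

pi_hat, thetas, U = swizs_latent_logistic(X, y, S=1000, rng=rng)
# Percentile intervals from the SwiZs draws, one column per coefficient.
ci = np.quantile(thetas, [0.025, 0.975], axis=0)
```

Each row of `thetas` solves the assumed estimating equation exactly for its own draw `U[s]`, which is the property the explicit solution relies on; the intervals in `ci` are simple percentile intervals of those draws.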

Let us turn our attention to logistic regression. In this case, $\bm{\mathit{y}}$ is not observed. Instead, we observe a binary random variable $\mathbf{y}$ , whose elements are:

[TABLE]

where $\mathbf{X}_{i}$ is the $i$ th row of $\mathbf{X}$ . Put differently, the generating function is modified to the following indicator function:

[TABLE]

Clearly, this change implies that $\bm{\Psi}_{n}$ has a flat Jacobian matrix and Assumptions 26 and 27 do not hold. Moreover, this problem becomes numerically more involved, especially if we want to proceed with a gradient-based optimization routine. As mentioned, in practice we seek the solution of the following problem:

[TABLE]

Note that $\mathbf{X}^{T}\mathbf{y}$ is the sufficient statistic for a logistic regression (see Chapter 2 in [102]). The gradient of $f(\bm{\theta})$ is

[TABLE]

However, the Jacobian $D_{\bm{\theta}}\mathbf{g}(\bm{\theta},\mathbf{u})$ is 0 almost everywhere and alternatives are necessary for using gradient-based methods. A possibility is to smooth $\mathbf{g}(\bm{\theta},\mathbf{u})$ by using for example a sigmoid function:

[TABLE]

The value of $t$ tunes the approximation and the value of the gradient. However, from our experience, large values of $t$ , say $t>0.1$ , lead to poor results and small values, say $t<0.1$ , lead to numerical instability. We thus prefer to use a different strategy by taking $-f(\bm{\theta})$ as the gradient. This strategy corresponds to the iterative bootstrap procedure ([14]). In Figure 12, we illustrate the difference between these two approximations and the “ideal” distribution we would have obtained by observing the continuous underlying latent process.

Clearly, the loss of information induced by observing only a binary outcome results in an increase of variability. Nonetheless, the difference is not enormous. Both approximations lead to similar distributions in terms of shape. We can notice a small difference in their modes. Since the iterative bootstrap approximation is numerically advantageous, we use it in the next study.
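The two strategies can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the study: the binary response is taken to be 1 when the latent value $\mathbf{X}_{i}\bm{\theta}+u_{i}$ is positive, the sigmoid surrogate is taken to be the logistic function of that latent value divided by $t$ , and $f(\bm{\theta})$ is taken to be the scaled gap between the observed and simulated $\mathbf{X}^{T}\mathbf{y}$ . The function names, the step size, the number of iterations and the data are all hypothetical.

```python
import numpy as np

def indicator_g(theta, u, X):
    """Binary response (assumed form): 1 when the latent X theta + u is positive."""
    return (X @ theta + u > 0).astype(float)

def smooth_g(theta, u, X, t):
    """Sigmoid surrogate of indicator_g; the tanh form avoids overflow for small t."""
    z = (X @ theta + u) / t
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def iterative_bootstrap(X, y, u, theta_start, step=1.0, n_iter=200):
    """Gradient-free updates theta <- theta + step * f(theta), keeping the best iterate."""
    n = X.shape[0]

    def f(theta):
        # Assumed estimating function: scaled gap between observed and simulated X^T y.
        return X.T @ (y - indicator_g(theta, u, X)) / n

    theta = np.asarray(theta_start, dtype=float).copy()
    best, best_norm = theta.copy(), np.linalg.norm(f(theta))
    for _ in range(n_iter):
        theta = theta + step * f(theta)
        r = np.linalg.norm(f(theta))
        if r < best_norm:
            best, best_norm = theta.copy(), r
    return best, best_norm

# Hypothetical data: intercept plus one Gaussian covariate.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.5, -1.0])
y = indicator_g(theta0, rng.logistic(size=n), X)   # observed binary responses
u = rng.logistic(size=n)                           # one fixed simulated draw u_s

# Mean gap between the sigmoid surrogate and the indicator for several t.
gaps = {t: np.mean(np.abs(smooth_g(theta0, u, X, t) - indicator_g(theta0, u, X)))
        for t in (1.0, 0.1, 0.01)}

start = np.zeros(2)
start_norm = np.linalg.norm(X.T @ (y - indicator_g(start, u, X)) / n)
theta_s, final_norm = iterative_bootstrap(X, y, u, start)
```

The surrogate approaches the indicator as $t$ decreases, while its slope grows like $1/t$ , which is one way to read the trade-off between poor approximation and numerical instability described above. The iterative update needs no Jacobian at all; because the simulated response is discrete, the estimating function may not reach exactly zero, so the sketch simply returns the iterate with the smallest norm.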

For the simulation, we set $\bm{\theta}_{0}={(0,5,5,-7,-7,\underbrace{0,\ldots,0}_{15})}^{T}$ and the sample size $n=200$ . We compare coverage probabilities of 95% confidence intervals obtained by the SwiZs and by asymptotic theory. We report results in Table 4. We can clearly see that the SwiZs has the most precise confidence intervals for all coefficients, with coverage close to the target level of 95%.

Appendix A Technical results

Lemma 54.

Let $X$ and $Y$ be open subsets of ${\rm I\!R}^{n}$ . If $\mathbf{f}:X\to Y$ is a $\mathcal{C}^{1}$ -diffeomorphism, then the Jacobian matrices of the maps $x\mapsto\mathbf{f}$ and $y\mapsto\mathbf{f}^{-1}$ are invertible, and the derivatives at the points $a\in X$ and $b\in Y$ , are given by:

[TABLE]

Proof.

By assumption, $\mathbf{f}$ is invertible, once continuously differentiable and $\mathbf{f}^{-1}$ is once continuously differentiable.

We have $\mathbf{f}^{-1}\circ\mathbf{f}=\operatorname{id}_{X}$ , where $\operatorname{id}_{X}$ is the identity function on the set $X$ . Fix $a\in X$ . By the chain rule, the derivative at $a$ is the following:

[TABLE]

where $\mathbf{I}_{n}$ is the identity matrix. Since $D_{y}\mathbf{f}^{-1}$ and $D_{x}\mathbf{f}$ are square matrices, we have:

[TABLE]

Since the product of these two matrices is the identity, neither determinant can be 0; ergo, the Jacobian matrices are invertible and we can write

[TABLE]

The proof for $\mathbf{f}\circ\mathbf{f}^{-1}=\operatorname{id}_{Y}$ follows by symmetry. ∎
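The statement of Lemma 54 can be checked numerically on a simple example. The sketch below uses a hypothetical $\mathcal{C}^{1}$ -diffeomorphism from ${\rm I\!R}^{2}$ onto $(0,\infty)\times{\rm I\!R}$ , chosen only because its inverse and both Jacobians are available in closed form.

```python
import numpy as np

def f(x):
    """Hypothetical diffeomorphism f(x1, x2) = (exp(x1), x2 + x1^2)."""
    return np.array([np.exp(x[0]), x[1] + x[0] ** 2])

def f_inv(y):
    """Closed-form inverse of f on (0, inf) x R."""
    x0 = np.log(y[0])
    return np.array([x0, y[1] - x0 ** 2])

def jac_f(x):
    """Jacobian of f at x."""
    return np.array([[np.exp(x[0]), 0.0],
                     [2.0 * x[0], 1.0]])

def jac_f_inv(y):
    """Jacobian of f_inv at y."""
    x0 = np.log(y[0])
    return np.array([[1.0 / y[0], 0.0],
                     [-2.0 * x0 / y[0], 1.0]])

a = np.array([0.3, -1.2])
b = f(a)

# Chain rule applied to f_inv(f(x)) = x: the product of Jacobians is the identity.
product = jac_f_inv(b) @ jac_f(a)
det_f = np.linalg.det(jac_f(a))
det_f_inv = np.linalg.det(jac_f_inv(b))
```

Here `product` is the identity matrix up to rounding, and the two determinants are reciprocals of each other (their common product is 1) without either of them being equal to 1 or -1.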

Lemma 55.

Let $\bm{\Theta}$ and $W$ be open subsets of ${\rm I\!R}^{p}$ . If there exists a $\mathcal{C}^{1}$ -diffeomorphic mapping $\mathbf{a}:W\to\bm{\Theta}$ , that is, $\mathbf{w}\mapsto\mathbf{a}$ is continuously once differentiable in $\bm{\Theta}\times W$ and the inverse map $\bm{\theta}\mapsto\mathbf{a}^{-1}$ is continuously once differentiable in $\bm{\Theta}\times W$ , then the cumulative distribution function of $\{\hat{\bm{\theta}}_{n}^{(s)}:s\in\mathbb{N}\}$ is given by:

[TABLE]

provided that $f$ is a nonnegative Borel function and $\Pr\left(\hat{\bm{\pi}}_{n}\neq\emptyset\right)=1$ .

Proof of Lemma 55.

By assumption, $\mathbf{w}\mapsto\mathbf{a}$ is a $\mathcal{C}^{1}$ -diffeomorphism so by Lemma 54 the Jacobians of $\mathbf{a}$ and $\mathbf{a}^{-1}$ are invertible. All the conditions of the change-of-variable formula for multidimensional Lebesgue integrals in [77, Theorem 17.2, p.239] are satisfied, so we obtain

[TABLE]

By Lemma 54, we have that $D_{\bm{\theta}}\mathbf{a}^{-1}={\left[D_{\mathbf{w}}\mathbf{a}\right]}^{-1}$ . Taking the determinant ends the proof. ∎
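A one-dimensional instance of this change of variable can be verified numerically. The sketch below assumes, purely for illustration, that $w$ is standard normal and that $\mathbf{a}(w)=\exp(w)$ , so the density of $\theta=\mathbf{a}(w)$ is the density of $w$ evaluated at $\mathbf{a}^{-1}(\theta)$ times the absolute derivative of $\mathbf{a}^{-1}$ . The grid size and the evaluation point are arbitrary.

```python
import math
import numpy as np

def density_w(w):
    """Standard normal density of w (an assumption made for the example)."""
    return np.exp(-0.5 * w ** 2) / math.sqrt(2.0 * math.pi)

def a_inv(theta):
    """Inverse of the hypothetical diffeomorphism a(w) = exp(w)."""
    return np.log(theta)

def d_a_inv(theta):
    """Derivative of a_inv; its absolute value is the Jacobian factor."""
    return 1.0 / theta

def density_theta(theta):
    """Density of theta = a(w) obtained by the change-of-variable formula."""
    return density_w(a_inv(theta)) * np.abs(d_a_inv(theta))

def cdf_theta(c, n_grid=200_001):
    """Trapezoidal integral of density_theta over (0, c]."""
    grid = np.linspace(1e-12, c, n_grid)
    vals = density_theta(grid)
    h = grid[1] - grid[0]
    return h * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

c = 2.0
numeric = cdf_theta(c)
# Direct computation: Pr(exp(w) <= c) = Pr(w <= log c) for standard normal w.
exact = 0.5 * (1.0 + math.erf(math.log(c) / math.sqrt(2.0)))
```

The two values `numeric` and `exact` agree up to the quadrature error, which is the scalar counterpart of taking the determinant of the Jacobian of $\mathbf{a}^{-1}$ in the multidimensional formula.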

Appendix B Finite sample

Proof of Theorem 5.

We proceed by showing first that $\bm{\Theta}^{(s)}_{\text{II},n}\subset\bm{\Theta}^{(s)}_{n}$ , and second that $\bm{\Theta}^{(s)}_{\text{II},n}\supset\bm{\Theta}^{(s)}_{n}$ .

It follows from Assumption 4 that $\hat{\bm{\pi}}_{n}$ is the unique solution of $\operatorname*{argzero}_{\bm{\pi}\in\bm{\Pi}}\bm{\Psi}_{n}(\bm{\theta}_{0},\mathbf{u}_{0},\bm{\pi})$ , ergo $\bm{\Pi}_{n}$ in the Definition 2 is a singleton.

(1). Fix $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{\text{II},n}$ . By Definition 3, it holds that

[TABLE]

where $\hat{\bm{\pi}}_{\text{II},n}^{(s)}$ is the unique solution of $\operatorname*{argzero}_{\bm{\pi}\in\bm{\Pi}}\bm{\Psi}_{n}(\bm{\theta}_{1},\mathbf{u}_{s},\bm{\pi})$ . Ergo, it holds as well that

[TABLE]

implying that $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{n}$ by Definition 2. Thus $\bm{\Theta}^{(s)}_{\text{II},n}\subset\bm{\Theta}_{n}^{(s)}$ .

(2). Fix $\bm{\theta}_{2}\in\bm{\Theta}^{(s)}_{n}$ . By Definition 2 we have

[TABLE]

By Definition 3, we also have

[TABLE]

where $\hat{\bm{\pi}}^{(s)}_{\text{II},n}(\bm{\theta}_{2})$ is the unique solution of $\operatorname*{argzero}_{\bm{\pi}\in\bm{\Pi}}\bm{\Psi}_{n}(\bm{\theta}_{2},\mathbf{u}_{s},\bm{\pi})$ . It follows that $\hat{\bm{\pi}}_{n}=\hat{\bm{\pi}}^{(s)}_{\text{II},n}\left(\bm{\theta}_{2}\right)$ uniquely, implying that $\bm{\theta}_{2}\in\bm{\Theta}^{(s)}_{\text{II},n}$ by Definition 3. Thus $\bm{\Theta}^{(s)}_{\text{II},n}\supset\bm{\Theta}^{(s)}_{n}$ , which concludes the proof. ∎

Proof of Theorem 8.

We proceed by showing first that (A) $\bm{\Theta}^{(s)}_{n}=\bm{\Theta}^{(s)}_{\text{Boot},n}$ implies (B) $\bm{\Psi}_{n}(\bm{\theta},\mathbf{u}_{s},\bm{\pi})=\bm{\Psi}_{n}(\bm{\pi},\mathbf{u}_{s},\bm{\theta})=\mathbf{0}$ , then that (B) implies (A).

Suppose (A) holds. Fix $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{n}$ and $\hat{\bm{\pi}}_{n}\in\bm{\Pi}_{n}$ . We have by the Definition 2

[TABLE]

By (A), we also have that $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{\text{Boot},n}$ so by the Definition 6

[TABLE]

Since both estimating equations equal zero, we have

[TABLE]

Hence (A) implies (B).

Suppose now that (B) holds. Fix $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{n}$ and $\hat{\bm{\pi}}_{n}\in\bm{\Pi}_{n}$ so $\bm{\Psi}_{n}(\bm{\theta}_{1},\mathbf{u}_{s},\hat{\bm{\pi}}_{n})=\mathbf{0}$ . By (B), we have

[TABLE]

so $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{\text{Boot},n}$ and thus $\bm{\Theta}^{(s)}_{n}\subset\bm{\Theta}^{(s)}_{\text{Boot},n}$ . The same argument shows that $\bm{\Theta}^{(s)}_{n}\supset\bm{\Theta}^{(s)}_{\text{Boot},n}$ which ends the proof. ∎

Proof of Proposition 9.

Since $\hat{\pi}_{n}=\bar{\mathbf{x}}=\frac{1}{n}\sum_{i=1}^{n}x_{i}$ , the sample average, we can write the following estimating equation

[TABLE]

where $x\overset{d}{=}g(\theta_{0},u_{0})$ . Since $x$ follows a location family, we have that $x\overset{d}{=}\theta_{0}+g(0,u_{0})\overset{d}{=}\theta_{0}+y$ .

The SwiZs is defined as

[TABLE]

On the other hand, the parametric bootstrap estimator is

[TABLE]

Eventually, we obtain that

[TABLE]

where we use the fact that $\bar{\mathbf{y}}\overset{d}{=}-\bar{\mathbf{y}}$ . Therefore, $\hat{\theta}^{(s)}_{n}=\hat{\theta}^{(s)}_{\text{Boot},n}$ , or equivalently $\Phi_{n}\left(\theta,\mathbf{u}_{s},\pi\right)=\Phi_{n}\left(\pi,\mathbf{u}_{s},\theta\right)=0$ , which ends the proof. ∎
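The argument can be illustrated by simulation. The sketch below assumes a symmetric location family (a shifted Student $t$ distribution, chosen arbitrarily) and uses the same simulated averages $\bar{\mathbf{y}}$ for both procedures: solving $\theta+\bar{\mathbf{y}}-\hat{\pi}_{n}=0$ in $\theta$ for the SwiZs, and solving $\hat{\pi}_{n}+\bar{\mathbf{y}}-\pi=0$ in $\pi$ for the parametric bootstrap. The sample size, the number of draws and the distribution are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, S, theta0 = 30, 5000, 1.5

# Observed sample from a symmetric location family: x = theta0 + y.
x = theta0 + rng.standard_t(df=5, size=n)
pi_hat = x.mean()                                  # auxiliary estimator: sample average

# S simulated averages of g(0, u), one per draw s.
y_bar = rng.standard_t(df=5, size=(S, n)).mean(axis=1)

theta_swizs = pi_hat - y_bar   # root in theta of theta + y_bar - pi_hat = 0
theta_boot = pi_hat + y_bar    # root in pi of pi_hat + y_bar - pi = 0

# Percentile bounds of the two simulated distributions.
q_swizs = np.quantile(theta_swizs, [0.025, 0.975])
q_boot = np.quantile(theta_boot, [0.025, 0.975])
```

For a given draw the two solutions are mirror images of each other around $\hat{\pi}_{n}$ ; they share the same distribution because $\bar{\mathbf{y}}$ and $-\bar{\mathbf{y}}$ do, so the two sets of percentile bounds agree up to Monte Carlo error.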

Proof of Theorem 13.

Fix $\varepsilon=0$ . The conditions of Theorem 5 are satisfied so $\bm{\Theta}_{n}^{(s)}=\bm{\Theta}_{\text{II},n}^{(s)}$ for any $s$ . It is then sufficient to prove $\bm{\Theta}^{(s)}_{\text{ABC},n}(0)=\bm{\Theta}^{(s)}_{\text{II},n}$ for any $s\in\mathbb{N}^{+}_{S}$ . We proceed by verifying first that $\bm{\Theta}^{(s)}_{\text{ABC},n}(0)\subset\bm{\Theta}^{(s)}_{\text{II},n}$ , and second that $\bm{\Theta}^{(s)}_{\text{ABC},n}(0)\supset\bm{\Theta}^{(s)}_{\text{II},n}$ .

(1). Fix $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{\text{ABC},n}(0)$ . By the Assumption 12, $\bm{\theta}_{1}$ is also a realization from the prior distribution $\mathcal{P}$ . By Definition 10, we have

[TABLE]

By Definition 3, $\bm{\theta}_{1}\in\bm{\Theta}^{(s)}_{\text{II},n}$ , thus $\bm{\Theta}^{(s)}_{\text{ABC},n}(0)\subset\bm{\Theta}^{(s)}_{\text{II},n}$ .

(2). Fix $\bm{\theta}_{2}\in\bm{\Theta}^{(s)}_{\text{II},n}$ . By Definition 3, we have

[TABLE]

By Assumption 12 and Definition 10, $\bm{\theta}_{2}\in\bm{\Theta}^{(s)}_{\text{ABC},n}(0)$ , ergo $\bm{\Theta}^{(s)}_{\text{ABC},n}(0)\supset\bm{\Theta}^{(s)}_{\text{II},n}$ , which ends the proof. ∎

Proof of Proposition 21.

Fix $\alpha_{1},\alpha_{2}>0$ such that $\alpha_{1}+\alpha_{2}=\alpha\in(0,1)$ . Since we consider an exact $\alpha$ -credible set $C_{\hat{\bm{\pi}}_{n}}$ , we have

[TABLE]

Consider the event $E=\{u\in(\alpha_{1},1-\alpha_{2})\}$ , taking value 1 with probability $p$ if $u$ is inside the interval and 0 otherwise. Let $u=F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}(\bm{\theta}_{0})$ so at each trial there is one such event. Now consider indefinitely many trials, so we have $\{E_{i}:i\in\mathbb{N}^{+}\}$ where $\mathbb{E}(E_{i})=\Pr(E_{i}=1)=p_{i}$ . Denote by $N$ the number of trials. The frequentist coverage probability is given by

[TABLE]

By assumption, $u$ is an independent standard uniform variable, so the events are independent and $p_{i}=1-\alpha$ for all $i\geq 1$ and for every $\alpha\in(0,1)$ . It follows that $\{E_{i}:i\in\mathbb{N}^{+}\}$ are identically and independently distributed Bernoulli random variables. The proof follows by Borel’s strong law of large numbers (see [103]). ∎
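The law of large numbers step can be visualised with a short simulation. The sketch below draws independent standard uniform variables in place of $F_{\hat{\bm{\theta}}_{n}|\hat{\bm{\pi}}_{n}}(\bm{\theta}_{0})$ and tracks the running frequency of the event $E_{i}$ ; the values of $\alpha_{1}$ , $\alpha_{2}$ and $N$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha1, alpha2 = 0.02, 0.03      # alpha = alpha1 + alpha2 = 0.05
N = 200_000                      # number of trials

# One standard uniform draw per trial, standing in for the CDF evaluated at theta_0.
u = rng.uniform(size=N)

# Bernoulli indicators of the events E_i = {alpha1 < u_i < 1 - alpha2}.
E = (u > alpha1) & (u < 1.0 - alpha2)

# Running relative frequency of coverage after 1, 2, ..., N trials.
running_coverage = np.cumsum(E) / np.arange(1, N + 1)
coverage = running_coverage[-1]
```

The running frequency settles near $1-\alpha$ as the number of trials grows, which is the statement obtained from the strong law of large numbers for independent Bernoulli variables.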

Proof of Lemma 22.

Fix $\mathbf{u}_{0}$ . Fix $\bm{\theta}_{1}\in\bm{\Theta}$ . By definition we have

[TABLE]

By assumption, the following equation

[TABLE]

is uniquely defined. Now fix $\bm{\pi}_{1}\in\bm{\Pi}$ . By definition we have

[TABLE]

and by assumption

[TABLE]

is uniquely defined. It follows that $\bm{\theta}_{1}=\hat{\bm{\theta}}_{n}$ if and only if $\bm{\pi}_{1}=\hat{\bm{\pi}}_{n}$ . ∎

Proof of Theorem 28.

We give the proof under Assumptions 26 and 27 separately.

We proceed by showing that we have a $\mathcal{C}^{1}$ -diffeomorphism which is unique so Lemma 55 and Lemma 22 apply. We then demonstrate that the obtained cumulative distribution function evaluated at $\bm{\theta}_{0}\in\bm{\Theta}$ is a realization from a standard uniform random variable. The conclusion is eventually reached by the Proposition 21.

Let $\pi_{1}:\bm{\Theta}_{n}\times W_{n}\to\bm{\Theta}_{n}$ and $\pi_{2}:\bm{\Theta}_{n}\times W_{n}\to W_{n}$ be the projections defined by $\pi_{1}(\bm{\theta},\mathbf{w})=\bm{\theta}$ and $\pi_{2}(\bm{\theta},\mathbf{w})=\mathbf{w}$ if $(\bm{\theta},\mathbf{w})\in\bm{\Theta}_{n}\times W_{n}$ . By Assumption 26 the conditions of the global implicit function theorem of [68, Theorem 1] are satisfied, so there exists a unique (global) continuous implicit function $\mathbf{a}:W_{n}\to\bm{\Theta}_{n}$ such that $\mathbf{a}(\mathbf{w}_{0})=\bm{\theta}_{0}$ and $\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\mathbf{w},\mathbf{a}(\mathbf{w}))=\mathbf{0}$ for every $\mathbf{w}\in W_{n}$ . In addition, the mapping is continuously differentiable on $W_{n}\setminus\pi_{2}(K_{n})$ with derivative given by

[TABLE]

for every $\mathbf{w}\in W_{n}\setminus\pi_{2}(K_{n})$ . Clearly the map $\mathbf{a}$ is invertible with a continuous inverse. Since the derivative $D_{\mathbf{w}}\bm{\varphi}_{p}$ is continuous and invertible for $(\bm{\theta},\mathbf{w})\in\bm{\Theta}_{n}\times W_{n}\setminus K_{n}$ , we immediately have that $\mathbf{a}$ is a $\mathcal{C}^{1}$ -diffeomorphism with derivative of the inverse given by

[TABLE]

for $\bm{\theta}\in\bm{\Theta}_{n}\setminus\pi_{1}(K_{n})$ . The conditions of Lemma 55 are satisfied and we obtain the cumulative distribution function

[TABLE]

proving point (i). Since $\hat{\bm{\pi}}_{n}$ is the unique zero of $\bm{\Psi}_{n}(\bm{\theta}_{0},\mathbf{u}_{0},\bm{\pi})$ , and hence of $\bm{\varphi}_{p}(\bm{\theta}_{0},\mathbf{w}_{0},\bm{\pi})$ , and $\bm{\theta}=\mathbf{a}(\mathbf{w})$ is the unique zero of $\bm{\varphi}_{p}(\bm{\theta},\mathbf{w},\hat{\bm{\pi}}_{n})$ , we have by Lemma 22 that $\bm{\theta}_{0}=\mathbf{a}(\mathbf{w}_{0})$ , and therefore that $\mathbf{w}_{0}=\mathbf{a}^{-1}(\bm{\theta}_{0})$ . In consequence, evaluating the above distribution at $\bm{\theta}_{0}$ leads to

[TABLE]

that is, the distribution evaluated at $\bm{\theta}_{0}$ is a realization from a standard uniform random variable. The conclusion follows by the Proposition 21.

Fix $\bm{\theta}_{0}\in\bm{\Theta}_{n}$ and $\mathbf{w}_{0}\in W_{n}$ . Fix $\hat{\bm{\pi}}_{n}\in\bm{\Pi}_{n}$ , the point such that $\bm{\varphi}_{p}(\bm{\theta}_{0},\mathbf{w}_{0},\hat{\bm{\pi}}_{n})=\mathbf{0}$ . Let $\pi_{1}:W_{n}\times\bm{\Pi}_{n}\to W_{n}$ and $\pi_{2}:W_{n}\times\bm{\Pi}_{n}\to\bm{\Pi}_{n}$ be the projections such that $\pi_{1}(\mathbf{w},\bm{\pi})=\mathbf{w}$ and $\pi_{2}(\mathbf{w},\bm{\pi})=\bm{\pi}$ if $(\mathbf{w},\bm{\pi})\in W_{n}\times\bm{\Pi}_{n}$ . By Assumption 27 ((i), (iii), (v)), the Theorem 1 in [68] is satisfied, as a consequence it holds that $\bm{\varphi}_{\bm{\theta}_{0}}$ admits a unique global implicit function $\bm{\pi}_{\bm{\theta}_{0}}:W_{n}\to\bm{\Pi}_{n}$ such that $\bm{\varphi}_{\bm{\theta}_{0}}(\mathbf{w},\bm{\pi}_{\bm{\theta}_{0}}(\mathbf{w}))=\mathbf{0}$ for every $\mathbf{w}\in W_{n}$ , $\bm{\pi}_{\bm{\theta}_{0}}(\mathbf{w}_{0})=\hat{\bm{\pi}}_{n}$ , and $\bm{\pi}_{\bm{\theta}_{0}}$ is once continuously differentiable on $W_{n}\setminus\pi_{1}(K_{1n})$ with derivative given by

[TABLE]

Clearly $\mathbf{w}\mapsto\bm{\pi}_{\bm{\theta}_{0}}$ is a homeomorphism. Since $D_{\mathbf{w}}\bm{\varphi}_{\bm{\theta}_{0}}$ is continuous and invertible on $W_{n}\times\bm{\Pi}\setminus K_{1n}$ , we have that $\bm{\pi}_{\bm{\theta}_{0}}$ is a $\mathcal{C}^{1}$ -diffeomorphism with differentiable inverse function on $\bm{\Pi}\setminus\pi_{2}(K_{1n})$ given by Lemma 54:

[TABLE]

Let $\pi_{3}:\bm{\Theta}_{n}\times\bm{\Pi}_{n}\to\bm{\Theta}_{n}$ and $\pi_{4}:\bm{\Theta}_{n}\times\bm{\Pi}_{n}\to\bm{\Pi}_{n}$ denote the projections such that $\pi_{3}(\bm{\theta},\bm{\pi})=\bm{\theta}$ and $\pi_{4}(\bm{\theta},\bm{\pi})=\bm{\pi}$ . By using the same argument as above, Assumption 27 ((ii), (iv), (vi)) permits us to have an implicit $\mathcal{C}^{1}$ -diffeomorphism $\bm{\pi}_{\mathbf{w}_{0}}:\bm{\Theta}_{n}\to\bm{\Pi}_{n}$ with the following continuous derivatives:

[TABLE]

Now define the function $\bm{\xi}(\bm{\theta})=\bm{\pi}_{\bm{\theta}_{0}}^{-1}\circ\bm{\pi}_{\mathbf{w}_{0}}(\bm{\theta})$ . It is trivial to show that this mapping $\bm{\theta}\mapsto\bm{\xi}$ is a $\mathcal{C}^{1}$ -diffeomorphism. We have from the preceding results and the chain rule that

[TABLE]

We make the following remarks. First, note that all these derivatives are square matrices of dimension $p\times p$ . Second, we have that $D_{\bm{\pi}}\bm{\varphi}_{\bm{\theta}_{0}}(\mathbf{w}_{0},\hat{\bm{\pi}}_{n})=D_{\bm{\pi}}\bm{\varphi}_{p}(\bm{\theta}_{0},\mathbf{w}_{0},\hat{\bm{\pi}}_{n})=D_{\bm{\pi}}\bm{\varphi}_{\mathbf{w}_{0}}(\bm{\theta}_{0},\hat{\bm{\pi}}_{n})$ so $D_{\bm{\pi}}\bm{\varphi}_{\bm{\theta}_{0}}{\left[D_{\bm{\pi}}\bm{\varphi}_{\mathbf{w}_{0}}\right]}^{-1}=\mathbf{I}_{p}$ . Third, it holds that $D_{\mathbf{w}}\bm{\varphi}_{\bm{\theta}_{0}}(\mathbf{w}_{0},\hat{\bm{\pi}}_{n})=D_{\mathbf{w}}\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\bm{\theta}_{0},\mathbf{w}_{0})$ and $D_{\bm{\theta}}\bm{\varphi}_{\mathbf{w}_{0}}(\bm{\theta}_{0},\hat{\bm{\pi}}_{n})=D_{\bm{\theta}}\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\bm{\theta}_{0},\mathbf{w}_{0})$ . As a consequence, we obtain that

[TABLE]

Using Lemma 55 ends the proof of point (i) in Theorem 28. From the above display, we have that the relation $\bm{\pi}_{\bm{\theta}_{0}}(\mathbf{w}_{0})=\hat{\bm{\pi}}_{n}=\bm{\pi}_{\mathbf{w}_{0}}(\bm{\theta}_{0})$ is uniquely defined, so $\bm{\xi}(\bm{\theta}_{0})=\bm{\pi}_{\bm{\theta}_{0}}^{-1}(\hat{\bm{\pi}}_{n})=\mathbf{w}_{0}$ . Since $\bm{\xi}$ is a diffeomorphism, then $\bm{\xi}^{-1}(\mathbf{w}_{0})=\bm{\theta}_{0}$ , which finishes the proof. ∎

Proof of Proposition 30.

This is a special case of Theorem 28. Define $\bm{\varphi}_{\hat{\bm{\pi}}_{n}}(\mathbf{w},\bm{\theta})=\mathbf{h}(\mathbf{x}_{0})-\mathbcal{g}(\bm{\theta},\mathbf{w})$ , where $\mathbf{h}(\mathbf{x}_{0})=\hat{\bm{\pi}}_{n}$ is fixed. Following the proof of Theorem 28, we have by assumption that $\mathbf{a}:W_{n}\to\bm{\Theta}_{n}$ is a $\mathcal{C}^{1}$ -diffeomorphism with derivatives

[TABLE]

The rest of the proof is identical to the proof of Theorem 28. ∎

Appendix C Asymptotics

Proof of Theorem 34.

We start by showing the claim 1: the pointwise convergence of $\hat{\bm{\pi}}_{n}$ . Then we demonstrate the claim 2 with two different approaches corresponding respectively to the Assumptions 32 and 33.

Fix $\bm{\pi}_{0}\in\bm{\Pi}$ . Since $\{\bm{\Psi}_{n}(\bm{\theta},\mathbf{u},\bm{\pi})\}$ is stochastically Lipschitz in $\bm{\pi}$ , it is stochastically equicontinuous by the Lemma 59. In addition, $\bm{\Pi}$ is compact and $\{\bm{\Psi}_{n}\}$ is pointwise convergent by assumption, so by the Lemma 58 $\{\bm{\Psi}_{n}\}$ converges uniformly and the limit $\bm{\Psi}$ is uniformly continuous. By $\bm{\Pi}$ compact and the continuity of the norm, the infimum of the norm of $\bm{\Psi}$ exists. The infimum of $\bm{\Psi}$ is well-separated by the bijectivity of the function. Therefore, all the conditions of Lemma 56 are satisfied and $\{\hat{\bm{\pi}}_{n}\}$ converges pointwise to $\bm{\pi}_{0}$ .

2 (i). For this proof, we consider $\bm{\theta}$ and $\bm{\pi}$ jointly. Let $\mathcal{K}=\bm{\Theta}\cap\bm{\Pi}$ be the set for both $\bm{\theta}$ and $\bm{\pi}$ . Fix $(\bm{\theta}_{0},\bm{\pi}_{0})\in\mathcal{K}$ . Since $\bm{\Pi}\subset{\rm I\!R}^{p}$ and $\bm{\Theta}\subset{\rm I\!R}^{p}$ are compact subsets of a metric space, they are closed (see the Theorem 2.34 in [104]), and $\mathcal{K}$ is compact (see the Corollary to the Theorem 2.35 in [104]) and nonempty (Theorem 2.36 in [104]). Having $\mathcal{K}$ compact, it is now sufficient to show that $\{\bm{\Psi}_{n}\}$ is jointly stochastically Lipschitz as the rest of the proof follows exactly the same steps as the claim 1.

For every $(\bm{\theta}_{1},\bm{\pi}_{1}),(\bm{\theta}_{2},\bm{\pi}_{2})\in\mathcal{K}$ , $n$ and $\mathbf{u}\sim F_{\mathbf{u}}$ , we have by the triangle inequality that

[TABLE]

where for the last inequality we make use of the marginal stochastic Lipschitz assumptions and $D_{n}=\max(A_{n},B_{n})$ . Let $a=\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert$ and $b=\lVert\bm{\pi}_{1}-\bm{\pi}_{2}\rVert$ . Now remark that for the $\ell_{2}$ -norm we have

[TABLE]

Since $a,b$ are positive real numbers, a direct application of the inequality of arithmetic and geometric means gives

[TABLE]

Therefore, we have that

[TABLE]

where $D_{n}^{\star}=\sqrt{2}D_{n}$ . Consequently, $\{\bm{\Psi}_{n}\}$ is jointly stochastically Lipschitz, and following the proof of claim 1 we have that $\hat{\bm{\theta}}_{n}\overset{p}{\rightarrow}\bm{\theta}_{0}$ . More precisely, we even have that $(\hat{\bm{\theta}}_{n},\hat{\bm{\pi}}_{n})\overset{p}{\rightarrow}(\bm{\theta}_{0},\bm{\pi}_{0})$ .
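The step from the marginal bounds to the joint bound can be spelled out as follows. This is a sketch that assumes the triangle-inequality step yields a bound of the form $D_{n}(a+b)$ and that the joint distance is the $\ell_{2}$ -norm of the stacked vector, $\sqrt{a^{2}+b^{2}}$ ; the symbols are those of the proof.

```latex
\begin{aligned}
(a+b)^{2} &= a^{2}+2ab+b^{2}
  \leq 2\left(a^{2}+b^{2}\right)
  && \text{since } 2ab\leq a^{2}+b^{2}, \\
a+b &\leq \sqrt{2}\,\sqrt{a^{2}+b^{2}}, \\
D_{n}\,(a+b) &\leq \sqrt{2}\,D_{n}\,\sqrt{a^{2}+b^{2}}
  = D_{n}^{\star}\,\bigl\lVert(\bm{\theta}_{1},\bm{\pi}_{1})-(\bm{\theta}_{2},\bm{\pi}_{2})\bigr\rVert_{2}.
\end{aligned}
```

The inequality $2ab\leq a^{2}+b^{2}$ is the arithmetic and geometric means inequality applied to $a^{2}$ and $b^{2}$ , and it holds for all nonnegative $a$ and $b$ , with equality when $a=b$ .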

2 (ii). This proof is different from 2 (i) since $\hat{\bm{\pi}}_{\text{II},n}$ is considered as a function of $\bm{\theta}$ . Fix $\bm{\pi}_{0}\in\bm{\Pi}$ . Since $\{\hat{\bm{\pi}}_{\text{II},n}\}$ is stochastically Lipschitz in $\bm{\theta}$ , it is stochastically equicontinuous by the Lemma 59. In addition, $\bm{\Theta}$ is compact and $\{\hat{\bm{\pi}}_{\text{II},n}\}$ is pointwise convergent by the claim 1, so by the Lemma 58 $\{\hat{\bm{\pi}}_{\text{II},n}\}$ converges uniformly and the limit $\bm{\pi}$ is uniformly continuous in $\bm{\theta}$ . Let the stochastic and deterministic objective functions be $Q_{n}(\bm{\theta})=\lVert\hat{\bm{\pi}}_{n}-\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})\rVert$ and $Q(\bm{\theta})=\lVert\bm{\pi}_{0}-\bm{\pi}(\bm{\theta})\rVert$ , for any norms. Now, we have by using successively the reverse and the regular triangle inequalities

[TABLE]

By the convergence of $\{\hat{\bm{\pi}}_{n}\}$ and the uniform convergence of $\{\hat{\bm{\pi}}_{\text{II},n}\}$ , we have

[TABLE]

By $\bm{\Theta}$ compact and the continuity of the norm, the infimum of $Q$ exists. The infimum of $Q$ is well-separated by the bijectivity of the function $\bm{\pi}$ . Therefore, all the conditions of Lemma 56 are satisfied and $\{\hat{\bm{\theta}}_{n}\}$ converges in probability to $\bm{\theta}_{0}$ . ∎

Proof of Theorem 38.

We first derive the asymptotic distribution of the auxiliary estimator, then separately show the result for $\hat{\bm{\theta}}_{n}$ using Assumptions 36 and 37 independently.

The result on $\hat{\bm{\pi}}_{n}$ is a special case of $\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})$ . Fix $\bm{\theta}_{0}\in\bm{\Theta}^{\circ}$ and denote $\bm{\pi}(\bm{\theta}_{0})\equiv\bm{\pi}_{0}$ . By assumptions, the conditions for the delta method in Lemma 63 are satisfied so we have

[TABLE]

By Definition 3, we have $\bm{\Psi}_{n}\left(\bm{\theta}_{0},\mathbf{u}_{s},\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta}_{0})\right)=\mathbf{0}$ . By Theorem 34, $\mathcal{o}_{p}\left(\left\lVert\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta}_{0})-\bm{\pi}_{0}\right\rVert\right)=\mathcal{o}_{p}(1)$ . By assumption, $D_{\bm{\pi}}\bm{\Psi}_{n}\left(\bm{\theta}_{0},\mathbf{u}_{s},\bm{\pi}_{0}\right)\overset{p}{\rightarrow}\mathbf{K}$ , with $\mathbf{K}$ nonsingular. Multiplying by square-root $n$ , the result follows from the central limit theorem assumption on $\bm{\Psi}_{n}$ and Slutsky’s lemma.

2 (i). From the delta method in Lemma 63, we obtain

[TABLE]

By definition we have $\bm{\Psi}_{n}\left(\hat{\bm{\theta}}_{n},\mathbf{u}_{s},\hat{\bm{\pi}}_{n}\right)=\mathbf{0}$ . Using again the delta method on the non-zero left-hand side element, we obtain from (9)

[TABLE]

Since $\{D_{\bm{\theta}}\bm{\Psi}_{n}(\bm{\theta}_{0},\mathbf{u}_{s},\bm{\pi})\}$ is stochastically Lipschitz in $\bm{\pi}$ , it is stochastically equicontinuous by the Lemma 59. In addition, $\bm{\Pi}$ is compact and $\{D_{\bm{\theta}}\bm{\Psi}_{n}\}$ is pointwise convergent by assumption, so by the Lemma 58 $\{D_{\bm{\theta}}\bm{\Psi}_{n}\}$ converges uniformly and the limit $\mathbf{J}$ is uniformly continuous in $\bm{\pi}$ .

Next, we obtain the following

[TABLE]

By uniform convergence $\sup_{\bm{\pi}\in\bm{\Pi}}\left\lVert D_{\bm{\theta}}\bm{\Psi}_{n}(\bm{\pi})-\mathbf{J}(\bm{\pi})\right\rVert=\mathcal{o}_{p}(1)$ and by the continuous mapping theorem $\left\lVert\mathbf{J}(\hat{\bm{\pi}}_{n})-\mathbf{J}(\bm{\pi}_{0})\right\rVert=\mathcal{o}_{p}(1)$ .

The central limit theorem is satisfied for the estimating equation thus $n^{1/2}\bm{\Psi}_{n}\rightsquigarrow\mathcal{N}\left(\mathbf{0},\mathbf{Q}\right)$ . Let $\mathbf{y}$ be a random variable identically and independently distributed according to $\mathcal{N}(\mathbf{0},\mathbf{Q})$ . Therefore, by multiplying by square-root $n$ we obtain

[TABLE]

By the Theorem 34, we have $\mathcal{o}_{p}\left(\left\lVert\hat{\bm{\pi}}_{n}-\bm{\pi}_{0}\right\rVert\right)=\mathcal{o}_{p}(1)$ and $\mathcal{o}_{p}\left(\left\lVert\hat{\bm{\theta}}_{n}-\bm{\theta}_{0}\right\rVert\right)=\mathcal{o}_{p}(1)$ . By the result of the claim 1 and the nonsingularity of $\mathbf{J}$ , we have

[TABLE]

Slutsky’s lemma ends the proof.

2 (ii). Let $\mathbf{g}_{n}(\bm{\theta})=\hat{\bm{\pi}}_{n}-\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})$ . The conditions for the delta method in Lemma 63 are satisfied by assumption so we have

[TABLE]

Since $\hat{\bm{\theta}}_{n}=\operatorname*{argzero}_{\bm{\theta}}d(\hat{\bm{\pi}}_{n},\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta}))$ , we have $\hat{\bm{\pi}}_{n}-\hat{\bm{\pi}}_{\text{II},n}(\hat{\bm{\theta}}_{n})=\mathbf{0}$ and thus $\mathbf{g}_{n}(\hat{\bm{\theta}}_{n})=\mathbf{0}$ . By Theorem 34, we have $\mathcal{o}_{p}\left(\left\lVert\hat{\bm{\theta}}_{n}-\bm{\theta}_{0}\right\rVert\right)=\mathcal{o}_{p}(1)$ . We have $D_{\bm{\theta}}\mathbf{g}_{n}(\bm{\theta}_{0})=-D_{\bm{\theta}}\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta}_{0})$ which, by assumption, converges pointwise to $D_{\bm{\theta}}\bm{\pi}(\bm{\theta}_{0})$ . By claim 1, we have $n^{1/2}(\hat{\bm{\pi}}_{n}-\bm{\pi}_{0})\overset{d}{=}n^{1/2}(\hat{\bm{\pi}}_{\text{II},n}(\bm{\theta})-\bm{\pi}_{0})\overset{d}{=}\mathbf{K}^{-1}\mathbf{y}$ as $n\to\infty$ . Hence, multiplying Equation 10 by square-root $n$ gives the following

[TABLE]

for sufficiently large $n$ . Remark that the mapping $\bm{\theta}\mapsto\bm{\pi}$ is implicitly defined by

[TABLE]

Since $\bm{\Psi}$ is once continuously differentiable in $(\bm{\theta},\bm{\pi})$ and the partial derivatives are invertible, the conditions for invoking an implicit function theorem are satisfied (see for example Theorem 9.28 in [104]) and one of the conclusions is that

[TABLE]

Since $\mathbf{J}$ is invertible, the conclusion follows by Slutsky’s lemma. ∎

Proof of Proposition 39.

The proof follows essentially the same steps as the proof of Theorem 38. From the proof of Theorem 38, the following holds: $n^{1/2}\left(\hat{\bm{\pi}}_{n}-\bm{\pi}_{0}\right)\overset{\mathop{}\!\mathrm{d}}{=}\mathbf{K}^{-1}\mathbf{y}_{0}$ and $n^{1/2}\bm{\Psi}_{n}\left(\bm{\theta}_{0},\mathbf{u}_{s},\bm{\pi}_{0}\right)\overset{\mathop{}\!\mathrm{d}}{=}\mathbf{y}_{s}$ as $n\to\infty$ where $\mathbf{y}_{j}\sim\mathcal{N}\left(\mathbf{0},\mathbf{Q}\right)$ , $j\in\mathbb{N}^{+}$ , $D_{\bm{\pi}}\bm{\Psi}_{n}\left(\bm{\theta}_{0},\mathbf{u}_{0},\bm{\pi}_{0}\right)$ converges in probability to $\mathbf{K}$ and $D_{\bm{\theta}}\bm{\Psi}_{n}\left(\bm{\theta}_{0},\mathbf{u}_{s},\bm{\pi}\right)$ converges uniformly in probability to $\mathbf{J}$ . The $\{\mathbf{u}_{j}:j\in\mathbb{N}_{S}\}$ are assumed independent and so are $\{\mathbf{y}_{j}:j\in\mathbb{N}_{S}\}$ .

From the delta method in Lemma 63, we obtain

[TABLE]

By definition $\frac{1}{S}\sum_{s\in\mathbb{N}^{+}_{S}}\bm{\Psi}_{n}\left(\hat{\bm{\theta}}_{n}^{(s)},\mathbf{u}_{s},\hat{\bm{\pi}}_{n}\right)=\mathbf{0}$ . Using the delta method on $\frac{1}{S}\sum_{s\in\mathbb{N}^{+}_{S}}\bm{\Psi}_{n}\left(\bm{\theta}_{0},\mathbf{u}_{s},\hat{\bm{\pi}}_{n}\right)$ , multiplying by square-root $n$ , we obtain from the results of Theorem 38:

[TABLE]

Clearly $\frac{1}{S}\sum_{s\in\mathbb{N}^{+}_{S}}\mathbf{y}_{s}\sim\mathcal{N}\left(\mathbf{0},\frac{1}{S}\mathbf{Q}\right)$ . The conclusion follows from Slutsky’s lemma. ∎
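The variance reduction from averaging over $S$ independent draws can be checked by Monte Carlo. The sketch below uses an arbitrary covariance matrix $\mathbf{Q}$ , an arbitrary $S$ and a hypothetical number of replications $R$ , and compares the empirical covariance of the averages with $\mathbf{Q}/S$ .

```python
import numpy as np

rng = np.random.default_rng(4)
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # arbitrary covariance matrix for the example
S, R = 5, 20_000                 # S draws per average, R Monte Carlo replications

# y[r, s] ~ N(0, Q), independent across r and s; resulting shape is (R, S, 2).
y = rng.multivariate_normal(np.zeros(2), Q, size=(R, S))

# Average over the S draws within each replication.
y_bar = y.mean(axis=1)

# Empirical covariance of the averages, to be compared with Q / S.
emp_cov = np.cov(y_bar, rowvar=False)
target = Q / S
```

The matrices `emp_cov` and `target` agree up to Monte Carlo error, in line with the distribution $\mathcal{N}\left(\mathbf{0},\frac{1}{S}\mathbf{Q}\right)$ of the average.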

Appendix D Additional simulation results

D.1 Lomax distribution

D.2 Random intercept and random slope linear mixed model

D.3 M/G/1 queueing model

Appendix E Generic results

This chapter assembles some generic theoretical results useful for the other Chapters.

We generically denote by $\{\mathbf{g}_{n}:n\geq 1\}$ a sequence of random vector-valued functions and by $\bm{\theta}\in\bm{\Theta}$ a vector of parameters.

The next Lemma is Theorem 5.9 in [70]. The proof is given for the sake of completeness.

Lemma 56 (weak consistency).

Let $\{\mathbf{g}_{n}(\bm{\theta})\}$ be a sequence of random vector-valued functions of the vector parameter $\bm{\theta}$ with a deterministic limit $\mathbf{g}(\bm{\theta})$ . If $\bm{\Theta}$ is compact, if the random function sequence converges uniformly as $n\to\infty$

[TABLE]

and if there exists $\delta>0$ such that

[TABLE]

then any sequence of estimators $\{\hat{\bm{\theta}}_{n}\}$ converges weakly in probability to $\bm{\theta}_{0}$ .

Proof.

Choose $\hat{\bm{\theta}}_{n}$ that nearly minimises $\lVert\mathbf{g}_{n}(\bm{\theta})\rVert$ so that

[TABLE]

Clearly we have $\inf_{\bm{\theta}}\lVert\mathbf{g}_{n}(\bm{\theta})\rVert\leq\lVert\mathbf{g}_{n}(\bm{\theta}_{0})\rVert$ , and by (11) $\lVert\mathbf{g}_{n}(\bm{\theta}_{0})\rVert\overset{p}{\rightarrow}\lVert\mathbf{g}(\bm{\theta}_{0})\rVert$ so that

[TABLE]

Now, subtracting $\lVert\mathbf{g}(\hat{\bm{\theta}}_{n})\rVert$ from both sides, we have by the reverse triangle inequality

[TABLE]

The left-hand side is bounded by the negative supremum, thus

[TABLE]

It follows from (11) that the limit in probability of the right-hand side tends to 0. Let $\varepsilon>0$ and choose a $\delta>0$ as in (12) so that

[TABLE]

for every $\bm{\theta}\notin\mathcal{B}(\bm{\theta}_{0},\delta)$ . If $\hat{\bm{\theta}}_{n}\notin\mathcal{B}(\bm{\theta}_{0},\delta)$ , we have

[TABLE]

The probability of this event converges to 0 as $n\to\infty$ . ∎

The next definition is taken from [105] (see also [106, Chapter 7.1])

Definition 57.

$\{\mathbf{g}_{n}(\bm{\theta})\}$ is stochastically uniformly equicontinuous on $\bm{\Theta}$ if for every $\varepsilon>0$ there exists a real $\delta>0$ such that

[TABLE]

Lemma 58 (uniform consistency).

If $\bm{\Theta}$ is compact, if the sequence of random vector-valued functions $\{\mathbf{g}_{n}(\bm{\theta})\}$ is pointwise convergent for all $\bm{\theta}\in\bm{\Theta}$ and is stochastically uniformly equicontinuous on $\bm{\Theta}$ , then

i. $\{\mathbf{g}_{n}(\bm{\theta})\}$ converges uniformly,

ii. $\mathbf{g}$ is uniformly continuous.

Proof.

(i) (Inspired by [104, Theorem 7.25(b)]). Let $\varepsilon>0$ and choose $\delta>0$ so as to satisfy stochastic uniform equicontinuity in (13). Let $\mathcal{B}(\bm{\theta},\delta)=\{\bm{\theta}^{\prime}\in\bm{\Theta}:d(\bm{\theta},\bm{\theta}^{\prime})<\delta\}$ . Since $\bm{\Theta}$ is compact, there are finitely many points $\bm{\theta}_{1},\dots,\bm{\theta}_{k}$ in $\bm{\Theta}$ such that

[TABLE]

Since $\{\mathbf{g}_{n}(\bm{\theta})\}$ converges pointwise for every $\bm{\theta}\in\bm{\Theta}$ , we have

[TABLE]

whenever $1\leq l\leq k$ . If $\bm{\theta}\in\bm{\Theta}$ , then $\bm{\theta}\in\mathcal{B}(\bm{\theta}_{l},\delta)$ for some $l$ , so that

[TABLE]

Then, by the triangle inequality we have

[TABLE]

(ii). The proof follows the same steps. ∎

The next Lemma is similar to [105, Lemma 1]. The result of [105] is on the difference between a random and a nonrandom function and requires the extra assumption of absolute continuity of the nonrandom function. The proof provided here is also different.

Lemma 59.

If for all $\bm{\theta},\bm{\theta}^{\prime}\in\bm{\Theta}$ , $\lVert\mathbf{g}_{n}(\bm{\theta})-\mathbf{g}_{n}(\bm{\theta}^{\prime})\rVert\leq B_{n}d(\bm{\theta},\bm{\theta}^{\prime})$ with $B_{n}=\mathcal{O}_{p}(1)$ , then $\{\mathbf{g}_{n}(\bm{\theta})\}$ is stochastically uniformly equicontinuous.

Proof.

Let $\varepsilon>0$ . By $B_{n}=\mathcal{O}_{p}(1)$ , there is $M>0$ such that for all $n$ , $\Pr(\lvert B_{n}\rvert>M)<\varepsilon$ . Set $\tau=\varepsilon/M$ and choose a sufficiently small $\delta>0$ with $\delta\leq\tau$ , so that $d(\bm{\theta},\bm{\theta}^{\prime})<\delta$ implies $d(\bm{\theta},\bm{\theta}^{\prime})<\tau$ for all $\bm{\theta}^{\prime},\bm{\theta}\in\bm{\Theta}$ . Let $\mathcal{B}(\bm{\theta},\delta)=\{\bm{\theta}^{\prime}\in\bm{\Theta}:d(\bm{\theta},\bm{\theta}^{\prime})<\delta\}$ . Then, we have

[TABLE]

∎

The next Lemma is a special case of [107, Corollary 3.1].

Lemma 60.

Let $\{\mathbf{x}_{i}:i\geq 1\}$ be an i.i.d. sequence of random variables and let $\mathbf{g}_{n}(\bm{\theta})=n^{-1}\sum_{i=1}^{n}\mathbf{g}(\mathbf{x}_{i},\bm{\theta})$ . If for all $i=1,\dots,n$ and $\bm{\theta},\bm{\theta}^{\prime}\in\bm{\Theta}$ , $\lVert\mathbf{g}(\mathbf{x}_{i},\bm{\theta})-\mathbf{g}(\mathbf{x}_{i},\bm{\theta}^{\prime})\rVert\leq b_{n}(\mathbf{x}_{i})d(\bm{\theta},\bm{\theta}^{\prime})$ with $\mathbb{E}[b_{n}(\mathbf{x}_{i})]=\mu_{n}=\mathcal{O}(1)$ , then $\{\mathbf{g}_{n}(\bm{\theta})\}$ is stochastically uniformly equicontinuous.

Proof.

Let $B_{n}=n^{-1}\sum_{i=1}^{n}b_{n}(\mathbf{x}_{i})$ , so that $\mathbb{E}[B_{n}]=\mathcal{O}(1)$ . By the triangle inequality, we have

[TABLE]

The rest of the proof follows from Lemma 59. ∎
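The Lipschitz condition of Lemma 60 can be checked numerically on a toy case. The sketch below is only an illustration under assumptions that are not in the paper: scalar $\theta\in[-2,2]$, standard normal data, and $g(x,\theta)=\cos(\theta x)$, for which $b(x)=\lvert x\rvert$ is a valid Lipschitz constant. The function name `lipschitz_check` is hypothetical.

```python
import math
import random


def lipschitz_check(n=500, pairs=200, seed=0):
    """Check |g_n(t) - g_n(s)| <= B_n |t - s| for g(x, t) = cos(t x)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # b(x) = |x| bounds the slope of t -> cos(t x); B_n is its sample mean.
    B_n = sum(abs(x) for x in xs) / n

    def g_n(t):
        return sum(math.cos(t * x) for x in xs) / n

    worst = 0.0  # largest observed ratio |g_n(t) - g_n(s)| / (B_n |t - s|)
    for _ in range(pairs):
        t, s = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
        if t != s:
            worst = max(worst, abs(g_n(t) - g_n(s)) / (B_n * abs(t - s)))
    return B_n, worst


B_n, worst = lipschitz_check()
print(B_n, worst)
```

The returned ratio should not exceed one, and $B_{n}$ stays close to $\mathbb{E}\lvert x\rvert$, which is the boundedness in probability that Lemma 59 requires.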

Lemma 61 (uniform weak law of large numbers).

If, in addition to the conditions of Lemma 60, $\mathbf{g}_{n}(\bm{\theta})$ is pointwise convergent for each $\bm{\theta}\in\bm{\Theta}$ , then $\{\mathbf{g}_{n}(\bm{\theta})\}$ converges uniformly.

Proof.

The proof is an immediate consequence of Lemma 60 and Lemma 58. ∎
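A small simulation can make the uniform convergence of Lemma 61 concrete. This is a sketch under assumptions of our own, not taken from the paper: standard normal data, $g(x,\theta)=\cos(\theta x)$ with limit $\mathbb{E}[\cos(\theta X)]=e^{-\theta^{2}/2}$, and a finite grid on $[-2,2]$ standing in for the supremum over $\bm{\Theta}$. The name `sup_deviation` is hypothetical.

```python
import math
import random


def sup_deviation(n, seed=0, grid_size=41):
    """Largest gap over a grid between the sample mean of cos(t x) and its limit."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    grid = [-2.0 + 4.0 * j / (grid_size - 1) for j in range(grid_size)]
    worst = 0.0
    for t in grid:
        g_n = sum(math.cos(t * x) for x in xs) / n
        limit = math.exp(-0.5 * t * t)  # E[cos(t X)] for X ~ N(0, 1)
        worst = max(worst, abs(g_n - limit))
    return worst


for n in (100, 1000, 10000):
    print(n, sup_deviation(n))
```

The printed deviations should shrink as $n$ grows, roughly at the $n^{-1/2}$ rate, although a single seed gives only an indication.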

The next Lemma is essentially a combination of Theorem 4.2 and Corollary 4.3 in [108]. The proof is given for the sake of completeness.

Lemma 62 (mean value inequality).

Let $U$ be a convex open set in $\bm{\Theta}$ . Let $\bm{\theta}_{1}\in U$ and $\bm{\theta}_{2}\in\bm{\Theta}$ . If $\mathbf{g}:U\to F$ is a $\mathcal{C}^{1}$ -mapping, then

i. $\mathbf{g}(\bm{\theta}_{1}+\bm{\theta}_{2})-\mathbf{g}(\bm{\theta}_{1})=\int_{0}^{1}D\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})dt\cdot\bm{\theta}_{2}$ ,

ii. $\lVert\mathbf{g}(\bm{\theta}_{1}+\bm{\theta}_{2})-\mathbf{g}(\bm{\theta}_{1})\rVert\leq\sup_{0\leq t\leq 1}\lVert D\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})\rVert\cdot\lVert\bm{\theta}_{2}\rVert$ .

Proof.

(i). Fix $\bm{\theta}_{1}\in U$ , $\bm{\theta}_{2}\in\bm{\Theta}$ . Let $\bm{\theta}_{3}=\bm{\theta}_{1}+\bm{\theta}_{2}$ and $\lambda_{t}=(1-t)\bm{\theta}_{1}+t\bm{\theta}_{3}$ . For $t\in[0,1]$ we have by the convexity of $U$ that $\lambda_{t}\in U$ , and so $\bm{\theta}_{1}+t\bm{\theta}_{2}=\lambda_{t}$ is in $U$ as well. Put $\mathbf{h}(t)=\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})$ , so $D\mathbf{h}(t)=D\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})\cdot\bm{\theta}_{2}$ . By the fundamental theorem of calculus we have that

$$\mathbf{h}(1)-\mathbf{h}(0)=\int_{0}^{1}D\mathbf{h}(t)dt.$$

Since $\mathbf{h}(1)=\mathbf{g}(\bm{\theta}_{1}+\bm{\theta}_{2})$ , $\mathbf{h}(0)=\mathbf{g}(\bm{\theta}_{1})$ , and $\bm{\theta}_{2}$ is allowed to be pulled out of the integral, part (i) is proven.

(ii). We have that

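$$\lVert\mathbf{g}(\bm{\theta}_{1}+\bm{\theta}_{2})-\mathbf{g}(\bm{\theta}_{1})\rVert=\left\lVert\int_{0}^{1}D\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})dt\cdot\bm{\theta}_{2}\right\rVert\leq\left\lVert\int_{0}^{1}D\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})dt\right\rVert\cdot\lVert\bm{\theta}_{2}\rVert\leq\sup_{0\leq t\leq 1}\lVert D\mathbf{g}(\bm{\theta}_{1}+t\bm{\theta}_{2})\rVert\cdot\lVert\bm{\theta}_{2}\rVert,$$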

where the first inequality follows from the Cauchy-Schwarz inequality and the second from bounding the integral by the supremum of the integrand. The supremum of the norm exists because the line segment $\{\bm{\theta}_{1}+t\bm{\theta}_{2}:t\in[0,1]\}$ is compact and the Jacobian is continuous. ∎
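Both parts of Lemma 62 can be verified numerically on a simple map. The following is a minimal sketch, assuming a hypothetical $\mathcal{C}^{1}$ map $\mathbf{g}(a,b)=(\sin a+b^{2},\,ab)$ on ${\rm I\!R}^{2}$ that does not come from the paper. The integral is approximated by the midpoint rule, and the Frobenius norm is used for the Jacobian; since it dominates the spectral norm, the bound in (ii) still has to hold.

```python
import math


def g(theta):
    a, b = theta
    return (math.sin(a) + b * b, a * b)


def jacobian(theta):
    a, b = theta
    return ((math.cos(a), 2.0 * b), (b, a))


def mean_value_check(theta1, theta2, steps=1000):
    """Return (error in part (i), left side of (ii), right side of (ii))."""
    avg = [[0.0, 0.0], [0.0, 0.0]]  # midpoint-rule integral of Dg over [0, 1]
    sup_norm = 0.0                  # sup of the Frobenius norm over the grid
    for k in range(steps):
        t = (k + 0.5) / steps
        point = (theta1[0] + t * theta2[0], theta1[1] + t * theta2[1])
        J = jacobian(point)
        for i in range(2):
            for j in range(2):
                avg[i][j] += J[i][j] / steps
        frob = math.sqrt(sum(J[i][j] ** 2 for i in range(2) for j in range(2)))
        sup_norm = max(sup_norm, frob)
    end = (theta1[0] + theta2[0], theta1[1] + theta2[1])
    diff = tuple(u - v for u, v in zip(g(end), g(theta1)))
    integral_form = tuple(
        avg[i][0] * theta2[0] + avg[i][1] * theta2[1] for i in range(2)
    )
    err_i = max(abs(d - f) for d, f in zip(diff, integral_form))
    return err_i, math.hypot(*diff), sup_norm * math.hypot(*theta2)


err_i, lhs, rhs = mean_value_check((0.3, -0.2), (0.5, 0.4))
print(err_i, lhs, rhs)
```

The first value is the discrepancy between the two sides of (i) and should be at the level of the quadrature error; the last two values are the two sides of (ii).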

Lemma 63 (delta method).

If the conditions of Lemma 62 hold, then

[TABLE]

Proof.

Fix $\bm{\theta}_{1}\in U$ and $\bm{\theta}_{2}\in\bm{\Theta}$ . By Lemma 62, we have

[TABLE]

Let $\bm{\theta}_{3}=\bm{\theta}_{1}+\bm{\theta}_{2}$ , so that $\lambda_{t}=(1-t)\bm{\theta}_{1}+t\bm{\theta}_{3}=\bm{\theta}_{1}+t\bm{\theta}_{2}$ , $t\in[0,1]$ , is in $U$ . Let $\mathcal{B}^{c}(\bm{\theta}_{1},\lVert\bm{\theta}_{2}\rVert)=\{\bm{\theta}\in\bm{\Theta}:\lVert\bm{\theta}_{1}-\bm{\theta}\rVert\leq\lVert\bm{\theta}_{2}\rVert\}$ denote the closed ball of radius $\lVert\bm{\theta}_{2}\rVert$ centered at $\bm{\theta}_{1}$ . We have

[TABLE]

so the line segment $\lambda_{t}$ is in the closed ball. Hence, we have

[TABLE]

Finally, by continuity of the Jacobian in a neighborhood of $\bm{\theta}_{1}$ , we have that

[TABLE]

as $\lVert\bm{\theta}_{2}\rVert\rightarrow 0$ . ∎
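The limit in the proof can be illustrated numerically. The sketch below assumes that the expansion takes the usual first-order form $\mathbf{g}(\bm{\theta}_{1}+\bm{\theta}_{2})-\mathbf{g}(\bm{\theta}_{1})=D\mathbf{g}(\bm{\theta}_{1})\cdot\bm{\theta}_{2}+o(\lVert\bm{\theta}_{2}\rVert)$ , and uses a hypothetical map $\mathbf{g}(a,b)=(\sin a+b^{2},\,ab)$ that is not from the paper. It reports the remainder divided by $\lVert\bm{\theta}_{2}\rVert$ for shrinking increments.

```python
import math


def g(theta):
    a, b = theta
    return (math.sin(a) + b * b, a * b)


def jacobian(theta):
    a, b = theta
    return ((math.cos(a), 2.0 * b), (b, a))


def remainder_ratio(theta1, direction, scale):
    """Norm of g(t1 + h) - g(t1) - Dg(t1) h divided by the norm of h."""
    h = (scale * direction[0], scale * direction[1])
    end = (theta1[0] + h[0], theta1[1] + h[1])
    J = jacobian(theta1)
    linear = (J[0][0] * h[0] + J[0][1] * h[1], J[1][0] * h[0] + J[1][1] * h[1])
    g_end, g_start = g(end), g(theta1)
    remainder = tuple(g_end[i] - g_start[i] - linear[i] for i in range(2))
    return math.hypot(*remainder) / math.hypot(*h)


for scale in (1e-1, 1e-2, 1e-3):
    print(scale, remainder_ratio((0.3, -0.2), (0.5, 0.4), scale))
```

For this smooth map the ratio decreases roughly linearly in the scale, which is consistent with a remainder of smaller order than $\lVert\bm{\theta}_{2}\rVert$ .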

Lemma 64 (asymptotic normality).

Let $U$ be a convex open set in $\bm{\Theta}$ . Let $\{\hat{\bm{\theta}}_{n}\}$ be a sequence of estimators defined as roots of the mapping $\mathbf{g}_{n}:U\to F$ . If

i. $\hat{\bm{\theta}}_{n}$ converges in probability to $\bm{\theta}_{0}\in U$ ,

ii. $\{\mathbf{g}_{n}\}$ is a $\mathcal{C}^{1}$ -mapping,

iii. $n^{1/2}\mathbf{g}_{n}(\bm{\theta}_{0})\rightsquigarrow\mathcal{N}(\mathbf{0},\mathbf{V})$ ,

iv. $D\mathbf{g}_{n}(\bm{\theta}_{0})$ converges in probability to $\mathbf{M}$ ,

v. $D\mathbf{g}_{n}(\bm{\theta}_{0})$ is nonsingular,

then

$$n^{1/2}(\hat{\bm{\theta}}_{n}-\bm{\theta}_{0})\rightsquigarrow\mathcal{N}(\mathbf{0},\bm{\Sigma}),$$

where $\bm{\Sigma}=\mathbf{M}^{-1}\mathbf{V}\mathbf{M}^{-T}$ .

Proof.

Fix $\bm{\theta}_{1}=\bm{\theta}_{0}$ and $\bm{\theta}_{2}=\hat{\bm{\theta}}_{n}-\bm{\theta}_{0}$ . From Lemma 62 and Lemma 63, we have

[TABLE]

By definition, $\mathbf{g}_{n}(\hat{\bm{\theta}}_{n})=\mathbf{0}$ . Multiplying by $n^{1/2}$ leads to

[TABLE]

By condition iv and the continuity of matrix inversion, $[D\mathbf{g}_{n}(\bm{\theta}_{0})]^{-1}\overset{p}{\rightarrow}\mathbf{M}^{-1}$ . Since the central limit theorem holds for $n^{1/2}\mathbf{g}_{n}(\bm{\theta}_{0})$ , the result follows from Slutsky’s lemma. ∎
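The sandwich form $\bm{\Sigma}=\mathbf{M}^{-1}\mathbf{V}\mathbf{M}^{-T}$ can be compared with a Monte Carlo variance in a scalar case. The sketch below is an illustration under assumptions of our own: i.i.d. exponential data with rate $\lambda_{0}$ and the estimating function $g_{n}(\lambda)=1/\lambda-\bar{x}_{n}$ , whose root is $\hat{\lambda}_{n}=1/\bar{x}_{n}$ . Then $M=-1/\lambda_{0}^{2}$ , $V=1/\lambda_{0}^{2}$ and $\Sigma=\lambda_{0}^{2}$ . The function names are hypothetical.

```python
import random
import statistics


def simulate_scaled_errors(rate0=2.0, n=200, reps=2000, seed=0):
    """Draw reps replicates of sqrt(n) * (rate_hat - rate0)."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        mean_x = sum(rng.expovariate(rate0) for _ in range(n)) / n
        rate_hat = 1.0 / mean_x  # root of g_n(rate) = 1/rate - mean_x
        out.append(n ** 0.5 * (rate_hat - rate0))
    return out


def sandwich_variance(rate0):
    M = -1.0 / rate0 ** 2  # derivative of g_n at rate0
    V = 1.0 / rate0 ** 2   # variance of a single observation
    return V / (M * M)     # M^{-1} V M^{-T} in the scalar case


errors = simulate_scaled_errors()
print(statistics.variance(errors), sandwich_variance(2.0))
```

The empirical variance should be close to the sandwich value, with a small upward finite-sample discrepancy for moderate $n$ .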

The next Lemma is Theorem 9.4 in [109] and is given without proof.

Lemma 65 (implicit function theorem).

Let $\bm{\Xi}\times\bm{\Theta}$ be an open subset of ${\rm I\!R}^{m}\times{\rm I\!R}^{p}$ . Let $\mathbf{g}:\bm{\Xi}\times\bm{\Theta}\rightarrow{\rm I\!R}^{p}$ be a function of the form $\mathbf{g}(\bm{\xi},\bm{\theta})=k$ . Let the solution at the points $(\bm{\xi}_{0},\bm{\theta}_{0})\in\bm{\Xi}\times\bm{\Theta}$ and $k_{0}\in{\rm I\!R}^{p}$ be

$$\mathbf{g}(\bm{\xi}_{0},\bm{\theta}_{0})=k_{0}.$$

If

i. $\mathbf{g}$ is differentiable in $\bm{\Xi}\times\bm{\Theta}$ ,

ii. The partial derivative $D_{\bm{\xi}}\mathbf{g}$ is continuous in $\bm{\Xi}\times\bm{\Theta}$ ,

iii. The partial derivative $D_{\bm{\theta}}\mathbf{g}$ is invertible at the point $(\bm{\xi}_{0},\bm{\theta}_{0})\in\bm{\Xi}\times\bm{\Theta}$ ,

then there are neighborhoods $X\subset\bm{\Xi}$ and $O\subset\bm{\Theta}$ of $\bm{\xi}_{0}$ and $\bm{\theta}_{0}$ on which the function $\hat{\bm{\theta}}:X\rightarrow O$ is uniquely defined, and such that:

1. $\mathbf{g}(\bm{\xi},\hat{\bm{\theta}}(\bm{\xi}))=k_{0}$ for all $\bm{\xi}\in X$ ,

2. For each $\bm{\xi}\in X$ , $\hat{\bm{\theta}}(\bm{\xi})$ is the unique solution lying in $O$ such that $\hat{\bm{\theta}}(\bm{\xi}_{0})=\bm{\theta}_{0}$ ,

3. $\hat{\bm{\theta}}$ is differentiable on $X$ and

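$$D\hat{\bm{\theta}}(\bm{\xi})=-\left[D_{\bm{\theta}}\mathbf{g}(\bm{\xi},\hat{\bm{\theta}}(\bm{\xi}))\right]^{-1}D_{\bm{\xi}}\mathbf{g}(\bm{\xi},\hat{\bm{\theta}}(\bm{\xi})).$$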
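The implicit function theorem can be illustrated on a scalar equation. This is a sketch under assumptions not taken from the paper: the hypothetical function $g(\xi,\theta)=\theta^{3}+\theta-\xi$ with $k_{0}=0$ , solved for $\theta$ by Newton's method, and the standard derivative formula $D\hat{\theta}(\xi)=-[D_{\theta}g]^{-1}D_{\xi}g$ , which is compared with a central finite difference.

```python
def g(xi, theta):
    return theta ** 3 + theta - xi


def d_theta(xi, theta):
    return 3.0 * theta ** 2 + 1.0  # never zero, so always invertible


def d_xi(xi, theta):
    return -1.0


def theta_hat(xi, start=0.0, tol=1e-14, max_iter=100):
    """Solve g(xi, theta) = 0 for theta with Newton's method."""
    theta = start
    for _ in range(max_iter):
        step = g(xi, theta) / d_theta(xi, theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta


def implicit_derivative(xi):
    th = theta_hat(xi)
    return -d_xi(xi, th) / d_theta(xi, th)


xi0, h = 2.0, 1e-4
finite_diff = (theta_hat(xi0 + h) - theta_hat(xi0 - h)) / (2.0 * h)
print(theta_hat(xi0), implicit_derivative(xi0), finite_diff)
```

At $\xi_{0}=2$ the solution is $\theta_{0}=1$ and the formula gives $1/(3\theta_{0}^{2}+1)=1/4$ , which the finite difference should reproduce up to discretization error.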

Bibliography

[1] A Ronald Gallant and George Tauchen. Which moments to match? Econometric Theory, 12(4):657–681, 1996.
[2] Christian Gourieroux, Alain Monfort, and Eric Renault. Indirect inference. Journal of Applied Econometrics, 8(S1), 1993.
[3] Anthony A Smith. Estimating nonlinear time-series models using simulated vector autoregressions. Journal of Applied Econometrics, 8(S1), 1993.
[4] René Garcia, Eric Renault, and David Veredas. Estimation of stable distributions by indirect inference. Journal of Econometrics, 161(2):325–337, 2011.
[5] Chiara Monfardini. Estimating stochastic volatility models through indirect inference. The Econometrics Journal, 1(1):113–128, 1998.
[6] Marco J Lombardi and Giorgio Calzolari. Indirect estimation of $\alpha$-stable stochastic volatility models. Computational Statistics & Data Analysis, 53(6):2298–2308, 2009.
[7] Peter CB Phillips and Jun Yu. Simulation-based estimation of contingent-claims prices. The Review of Financial Studies, 22(9):3669–3705, 2009.
[8] Christian Gouriéroux, Peter CB Phillips, and Jun Yu. Indirect inference for dynamic panel models. Journal of Econometrics, 157(1):68–77, 2010.


