Estimation and adaptive-to-model testing for regressions with diverging number of predictors

Falong Tan; Lixing Zhu

arXiv:1706.07664·stat.ME·June 26, 2017

TL;DR

This paper develops a new test for parametric single-index models with a diverging number of predictors, analyzing estimator properties and constructing an adaptive test statistic suitable for high-dimensional settings.

Contribution

It introduces an adaptive-to-model residual empirical process and a martingale transformation for model checking in high-dimensional regressions with diverging predictors.

Findings

01

The test maintains good size and power in simulations.

02

Asymptotic properties are established under null and alternative hypotheses.

03

Estimator properties are characterized for diverging dimensions.

Abstract

The research described in this paper is motivated by model checking for parametric single-index models with diverging number of predictors. To construct a test statistic, we first study the asymptotic property of the estimators of involved parameters of interest under the null and alternative hypothesis when the dimension is divergent to infinity as the sample size goes to infinity. For the testing problem, we study an adaptive-to-model residual-marked empirical process as the basis for constructing a test statistic. By modifying the approach in the literature to suit the diverging dimension settings, we construct a martingale transformation. Under the null, local and global alternative hypothesis, the weak limits of the empirical process are derived and then the asymptotic properties of the test statistic are investigated. Simulation studies are carried out to examine the performance…

Tables (6)

Table 1: Empirical sizes and powers of $ACM_{n}^{2}$, $T_{n}^{SZ}$, $PCvM_{n}$, $ICM_{n}$, $T_{n}^{ZH}$ and $T_{n}^{GWZ}$ for $H_{0}$ vs. $H_{11}$ in Study 1.

	a	n=100	n=200	n=400	n=800
		p=7	p=10	p=12	p=16
$A C M_{n}^{2}, α = 0.10$	0.0	0.0970	0.0905	0.0890	0.1020
	0.5	0.8650	0.9915	1.0000	1.0000
$A C M_{n}^{2}, α = 0.05$	0.0	0.0500	0.0530	0.0500	0.0505
	0.5	0.7770	0.9810	1.0000	1.0000
$A C M_{n}^{2}, α = 0.01$	0.0	0.0085	0.0105	0.0115	0.0130
	0.5	0.5620	0.9095	0.9975	1.0000
$T_{n}^{S Z}, α = 0.10$	0.0	0.0915	0.0995	0.1060	0.0985
	0.5	0.8675	0.9865	1.0000	1.0000
$T_{n}^{S Z}, α = 0.05$	0.0	0.0510	0.0470	0.0420	0.0495
	0.5	0.7825	0.9795	1.0000	1.0000
$T_{n}^{S Z}, α = 0.01$	0.0	0.0120	0.0090	0.0120	0.0100
	0.5	0.5290	0.9065	0.9990	1.0000
$P C v M_{n}, α = 0.10$	0.0	0.1140	0.1220	0.0980	0.1190
	0.5	0.8850	0.9880	1.0000	1.0000
$P C v M_{n}, α = 0.05$	0.0	0.0480	0.0590	0.0650	0.0490
	0.5	0.8110	0.9860	1.0000	1.0000
$P C v M_{n}, α = 0.01$	0.0	0.0150	0.0100	0.0110	0.0090
	0.5	0.6190	0.9310	0.9970	1.0000
$I C M_{n}, α = 0.10$	0.0	0.0390	0.0010	0.0000	0.0000
	0.5	0.5490	0.2910	0.1760	0.0000
$I C M_{n}, α = 0.05$	0.0	0.0070	0.0000	0.0000	0.0000
	0.5	0.3900	0.0910	0.0180	0.0000
$I C M_{n}, α = 0.01$	0.0	0.0000	0.0000	0.0000	0.0000
	0.5	0.1220	0.0060	0.0020	0.0000
$T_{n}^{Z H}, α = 0.10$	0.0	0.0805	0.0950	0.1055	0.1060
	0.5	0.2240	0.2205	0.2420	0.2430
$T_{n}^{Z H}, α = 0.05$	0.0	0.0305	0.0300	0.0330	0.0310
	0.5	0.1460	0.1285	0.1445	0.0980
$T_{n}^{Z H}, α = 0.01$	0.0	0.0015	0.0020	0.0025	0.0025
	0.5	0.0420	0.0210	0.0225	0.0150
$T_{n}^{G W Z}, α = 0.10$	0.0	0.0710	0.0755	0.0850	0.0830
	0.5	0.8170	0.9795	1.0000	1.0000
$T_{n}^{G W Z}, α = 0.05$	0.0	0.0525	0.0430	0.0585	0.0475
	0.5	0.7690	0.9690	1.0000	1.0000
$T_{n}^{G W Z}, α = 0.01$	0.0	0.0220	0.0170	0.0205	0.0170
	0.5	0.6510	0.9455	0.9995	1.0000

Table 2: Empirical sizes and powers of $ACM_{n}^{2}$, $T_{n}^{SZ}$, $PCvM_{n}$, $ICM_{n}$, $T_{n}^{ZH}$ and $T_{n}^{GWZ}$ for $H_{0}$ vs. $H_{12}$ in Study 1.

	a	n=100	n=200	n=400	n=800
		p=7	p=10	p=12	p=16
$A C M_{n}^{2}, α = 0.10$	0.0	0.1010	0.0925	0.1055	0.0900
	0.5	0.2550	0.5135	0.9190	1.0000
$A C M_{n}^{2}, α = 0.05$	0.0	0.0520	0.0465	0.0445	0.0515
	0.5	0.1445	0.3225	0.7550	1.0000
$A C M_{n}^{2}, α = 0.01$	0.0	0.0095	0.0090	0.0120	0.0070
	0.5	0.0460	0.1060	0.3485	0.9140
$T_{n}^{S Z}, α = 0.10$	0.0	0.0980	0.0990	0.0865	0.0930
	0.5	0.2630	0.5265	0.9240	1.0000
$T_{n}^{S Z}, α = 0.05$	0.0	0.0530	0.0480	0.0515	0.0495
	0.5	0.1760	0.3235	0.7350	0.9970
$T_{n}^{S Z}, α = 0.01$	0.0	0.0100	0.0060	0.0085	0.0105
	0.5	0.0470	0.1145	0.3580	0.9350
$P C v M_{n}, α = 0.10$	0.0	0.1080	0.1170	0.1230	0.1000
	0.5	0.2560	0.3390	0.5160	0.7590
$P C v M_{n}, α = 0.05$	0.0	0.0530	0.0590	0.0440	0.0700
	0.5	0.1470	0.2320	0.4080	0.6250
$P C v M_{n}, α = 0.01$	0.0	0.0130	0.0130	0.0080	0.0130
	0.5	0.0450	0.1020	0.2010	0.4080
$I C M_{n}, α = 0.10$	0.0	0.0370	0.0000	0.0000	0.0000
	0.5	0.1950	0.0330	0.0020	0.0000
$I C M_{n}, α = 0.05$	0.0	0.0110	0.0000	0.0000	0.0000
	0.5	0.0790	0.0020	0.0000	0.0000
$I C M_{n}, α = 0.01$	0.0	0.0020	0.0000	0.0000	0.0000
	0.5	0.0110	0.0000	0.0000	0.0000
$T_{n}^{Z H}, α = 0.10$	0.0	0.0805	0.0830	0.0800	0.1095
	0.5	0.1630	0.1515	0.1825	0.1665
$T_{n}^{Z H}, α = 0.05$	0.0	0.0325	0.0350	0.0320	0.0330
	0.5	0.0755	0.0775	0.0940	0.0615
$T_{n}^{Z H}, α = 0.01$	0.0	0.0045	0.0015	0.0035	0.0035
	0.5	0.0155	0.0095	0.0125	0.0060
$T_{n}^{G W Z}, α = 0.10$	0.0	0.0820	0.0725	0.0810	0.0745
	0.5	0.6765	0.9460	1.0000	1.0000
$T_{n}^{G W Z}, α = 0.05$	0.0	0.0495	0.0500	0.0500	0.0535
	0.5	0.6035	0.9335	0.9995	1.0000
$T_{n}^{G W Z}, α = 0.01$	0.0	0.0190	0.0165	0.0180	0.0210
	0.5	0.4660	0.8705	0.9980	1.0000

Table 3: Empirical sizes and powers of $ACM_{n}^{2}$, $T_{n}^{SZ}$, $PCvM_{n}$, $ICM_{n}$, $T_{n}^{ZH}$ and $T_{n}^{GWZ}$ for $H_{0}$ vs. $H_{13}$ in Study 1.

	a	n=100	n=200	n=400	n=800
		p=7	p=10	p=12	p=16
$A C M_{n}^{2}, α = 0.10$	0.00	0.0985	0.1050	0.1085	0.1090
	0.25	0.7130	0.9410	0.9955	1.0000
$A C M_{n}^{2}, α = 0.05$	0.00	0.0500	0.0455	0.0435	0.0450
	0.25	0.5970	0.8945	0.9980	1.0000
$A C M_{n}^{2}, α = 0.01$	0.00	0.0095	0.0090	0.0095	0.0090
	0.25	0.3470	0.7225	0.9840	1.0000
$T_{n}^{S Z}, α = 0.10$	0.00	0.0960	0.1055	0.1060	0.0960
	0.25	0.7190	0.9405	0.9975	1.0000
$T_{n}^{S Z}, α = 0.05$	0.00	0.0505	0.0420	0.0470	0.0495
	0.25	0.5940	0.8980	0.9945	1.0000
$T_{n}^{S Z}, α = 0.01$	0.00	0.0080	0.0125	0.0095	0.0115
	0.25	0.3310	0.7190	0.9705	0.9995
$P C v M_{n}, α = 0.10$	0.00	0.1030	0.0980	0.1140	0.1210
	0.25	0.7180	0.9500	0.9970	1.0000
$P C v M_{n}, α = 0.05$	0.00	0.0580	0.0600	0.0440	0.0570
	0.25	0.6160	0.8980	0.9970	1.0000
$P C v M_{n}, α = 0.01$	0.00	0.0060	0.0150	0.0080	0.0070
	0.25	0.3870	0.7360	0.9780	1.0000
$I C M_{n}, α = 0.10$	0.00	0.0290	0.0010	0.0000	0.0000
	0.25	0.1590	0.0190	0.0030	0.0000
$I C M_{n}, α = 0.05$	0.00	0.0110	0.0000	0.0000	0.0000
	0.25	0.0590	0.0010	0.0000	0.0000
$I C M_{n}, α = 0.01$	0.00	0.0010	0.0000	0.0000	0.0000
	0.25	0.0140	0.0000	0.0000	0.0000
$T_{n}^{Z H}, α = 0.10$	0.00	0.0765	0.0810	0.0940	0.0970
	0.25	0.1135	0.1185	0.1400	0.1305
$T_{n}^{Z H}, α = 0.05$	0.00	0.0275	0.0310	0.0315	0.0340
	0.25	0.0730	0.0485	0.0745	0.0625
$T_{n}^{Z H}, α = 0.01$	0.00	0.0030	0.0020	0.0030	0.0010
	0.25	0.0055	0.0060	0.0080	0.0030
$T_{n}^{G W Z}, α = 0.10$	0.00	0.0800	0.0735	0.0770	0.0765
	0.25	0.4580	0.7430	0.9795	0.9995
$T_{n}^{G W Z}, α = 0.05$	0.00	0.0510	0.0505	0.0540	0.0490
	0.25	0.3840	0.6660	0.9465	1.0000
$T_{n}^{G W Z}, α = 0.01$	0.00	0.0200	0.0225	0.0235	0.0240
	0.25	0.2590	0.5570	0.9040	0.9995

Table 4: Empirical sizes and powers of $ACM_{n}^{2}$, $T_{n}^{SZ}$, $PCvM_{n}$, $ICM_{n}$, $T_{n}^{ZH}$ and $T_{n}^{GWZ}$ for $H_{0}$ vs. $H_{14}$ in Study 1.

	a	n=100	n=200	n=400	n=800
		p=7	p=10	p=12	p=16
$A C M_{n}^{2}, α = 0.10$	0.00	0.1130	0.1000	0.0970	0.0955
	0.25	0.9825	1.0000	1.0000	1.0000
$A C M_{n}^{2}, α = 0.05$	0.00	0.0520	0.0460	0.0545	0.0490
	0.25	0.9525	1.0000	1.0000	1.0000
$A C M_{n}^{2}, α = 0.01$	0.00	0.0110	0.0090	0.0075	0.0105
	0.25	0.8680	0.9950	1.0000	1.0000
$T_{n}^{S Z}, α = 0.10$	0.00	0.1090	0.0970	0.0910	0.1090
	0.25	0.9805	0.9990	1.0000	1.0000
$T_{n}^{S Z}, α = 0.05$	0.00	0.0475	0.0490	0.0460	0.0555
	0.25	0.9605	0.9995	1.0000	1.0000
$T_{n}^{S Z}, α = 0.01$	0.00	0.0095	0.0115	0.0075	0.0090
	0.25	0.8700	0.9970	1.0000	1.0000
$P C v M_{n}, α = 0.10$	0.00	0.0950	0.1130	0.1110	0.1040
	0.25	0.9960	1.0000	1.0000	1.0000
$P C v M_{n}, α = 0.05$	0.00	0.0580	0.0540	0.0570	0.0540
	0.25	0.9690	0.9990	1.0000	1.0000
$P C v M_{n}, α = 0.01$	0.00	0.0140	0.0170	0.0080	0.0150
	0.25	0.8730	0.9980	1.0000	1.0000
$I C M_{n}, α = 0.10$	0.00	0.0290	0.0010	0.0000	0.0000
	0.25	0.5680	0.2420	0.1330	0.0000
$I C M_{n}, α = 0.05$	0.00	0.0050	0.0000	0.0000	0.0000
	0.25	0.3670	0.0740	0.0120	0.0000
$I C M_{n}, α = 0.01$	0.00	0.0010	0.0000	0.0000	0.0000
	0.25	0.1060	0.0040	0.0000	0.0000
$T_{n}^{Z H}, α = 0.10$	0.00	0.0700	0.0910	0.0875	0.0985
	0.25	0.2420	0.2125	0.2680	0.2210
$T_{n}^{Z H}, α = 0.05$	0.00	0.0320	0.0295	0.0325	0.0380
	0.25	0.1145	0.1195	0.1410	0.1145
$T_{n}^{Z H}, α = 0.01$	0.00	0.0015	0.0045	0.0050	0.0035
	0.25	0.0335	0.0230	0.0220	0.0095
$T_{n}^{G W Z}, α = 0.10$	0.00	0.0780	0.0805	0.0815	0.0830
	0.25	0.8645	0.9935	1.0000	1.0000
$T_{n}^{G W Z}, α = 0.05$	0.00	0.0455	0.0560	0.0540	0.0625
	0.25	0.8405	0.9870	1.0000	1.0000
$T_{n}^{G W Z}, α = 0.01$	0.00	0.0210	0.0195	0.0225	0.0195
	0.25	0.7285	0.9735	1.0000	1.0000

Table 5: Empirical sizes and powers of $ACM_{n}^{2}$, $T_{n}^{SZ}$, $PCvM_{n}$, $ICM_{n}$, $T_{n}^{ZH}$ and $T_{n}^{GWZ}$ for $H_{0}$ vs. $H_{21}$ in Study 2.

	a	n=100	n=200	n=400	n=800
		p=7	p=10	p=12	p=16
$A C M_{n}^{2}, α = 0.10$	0.00	0.1075	0.0965	0.0910	0.1035
	0.25	0.6185	0.8980	0.9955	1.0000
$A C M_{n}^{2}, α = 0.05$	0.00	0.0520	0.0490	0.0495	0.0570
	0.25	0.4895	0.8185	0.9925	1.0000
$A C M_{n}^{2}, α = 0.01$	0.00	0.0100	0.0085	0.0100	0.0115
	0.25	0.2505	0.5920	0.9450	0.9995
$T_{n}^{S Z}, α = 0.10$	0.00	0.0935	0.0935	0.1070	0.1055
	0.25	0.7005	0.9120	0.9965	1.0000
$T_{n}^{S Z}, α = 0.05$	0.00	0.0515	0.0425	0.0460	0.0445
	0.25	0.5600	0.8505	0.9940	1.0000
$T_{n}^{S Z}, α = 0.01$	0.00	0.0080	0.0100	0.0060	0.0100
	0.25	0.3180	0.6680	0.9665	1.0000
$P C v M_{n}, α = 0.10$	0.00	0.1150	0.0910	0.1090	0.1050
	0.25	0.7080	0.9320	0.9990	1.0000
$P C v M_{n}, α = 0.05$	0.00	0.0560	0.0480	0.0570	0.0430
	0.25	0.6230	0.9080	0.9960	1.0000
$P C v M_{n}, α = 0.01$	0.00	0.0080	0.0120	0.0100	0.0090
	0.25	0.3810	0.7230	0.9820	1.0000
$I C M_{n}, α = 0.10$	0.00	0.0180	0.0010	0.0000	0.0000
	0.25	0.1220	0.0060	0.0000	0.0000
$I C M_{n}, α = 0.05$	0.00	0.0040	0.0000	0.0000	0.0000
	0.25	0.0470	0.0010	0.0000	0.0000
$I C M_{n}, α = 0.01$	0.00	0.0000	0.0000	0.0000	0.0000
	0.25	0.0070	0.0000	0.0000	0.0000
$T_{n}^{Z H}, α = 0.10$	0.00	0.1100	0.1020	0.0960	0.1110
	0.25	0.1420	0.1370	0.1550	0.1545
$T_{n}^{Z H}, α = 0.05$	0.00	0.0400	0.0410	0.0365	0.0390
	0.25	0.0710	0.0700	0.0610	0.0550
$T_{n}^{Z H}, α = 0.01$	0.00	0.0045	0.0035	0.0040	0.0035
	0.25	0.0140	0.0075	0.0065	0.0035
$T_{n}^{G W Z}, α = 0.10$	0.00	0.1135	0.1045	0.1115	0.1240
	0.25	0.5275	0.8140	0.9860	0.9995
$T_{n}^{G W Z}, α = 0.05$	0.00	0.0790	0.0760	0.0775	0.0750
	0.25	0.4625	0.7300	0.9610	1.0000
$T_{n}^{G W Z}, α = 0.01$	0.00	0.0340	0.0345	0.0310	0.0305
	0.25	0.3175	0.6015	0.9295	0.9985

Table 6: Empirical sizes and powers of $ACM_{n}^{2}$, $T_{n}^{SZ}$, $PCvM_{n}$, $ICM_{n}$, $T_{n}^{ZH}$ and $T_{n}^{GWZ}$ for $H_{0}$ vs. $H_{22}$ in Study 2.

	a	n=100	n=200	n=400	n=800
		p=7	p=10	p=12	p=16
$A C M_{n}^{2}, α = 0.10$	0.0	0.1180	0.1190	0.1095	0.1060
	0.5	0.2255	0.3090	0.4805	0.7390
$A C M_{n}^{2}, α = 0.05$	0.0	0.0575	0.0550	0.0585	0.0530
	0.5	0.1295	0.1895	0.3030	0.5790
$A C M_{n}^{2}, α = 0.01$	0.0	0.0110	0.0135	0.0115	0.0120
	0.5	0.0325	0.0605	0.1155	0.2830
$T_{n}^{S Z}, α = 0.10$	0.0	0.1110	0.1075	0.0980	0.1010
	0.5	0.1335	0.1480	0.1580	0.1920
$T_{n}^{S Z}, α = 0.05$	0.0	0.0650	0.0535	0.0550	0.0550
	0.5	0.0755	0.0970	0.0835	0.1195
$T_{n}^{S Z}, α = 0.01$	0.0	0.0085	0.0140	0.0095	0.0120
	0.5	0.0205	0.0285	0.0180	0.0330
$P C v M_{n}, α = 0.10$	0.0	0.1110	0.1160	0.1010	0.1180
	0.5	0.2370	0.3480	0.4730	0.6630
$P C v M_{n}, α = 0.05$	0.0	0.0470	0.0560	0.0690	0.0510
	0.5	0.1310	0.2000	0.2760	0.4450
$P C v M_{n}, α = 0.01$	0.0	0.0070	0.0100	0.0240	0.0100
	0.5	0.0430	0.0580	0.0930	0.1700
$I C M_{n}, α = 0.10$	0.0	0.0200	0.0000	0.0000	0.0000
	0.5	0.0980	0.0140	0.0030	0.0020
$I C M_{n}, α = 0.05$	0.0	0.0050	0.0000	0.0000	0.0000
	0.5	0.0210	0.0020	0.0000	0.0000
$I C M_{n}, α = 0.01$	0.0	0.0000	0.0000	0.0000	0.0000
	0.5	0.0000	0.0000	0.0000	0.0000
$T_{n}^{Z H}, α = 0.10$	0.0	0.0940	0.0915	0.0985	0.1135
	0.5	0.1325	0.1455	0.1625	0.1455
$T_{n}^{Z H}, α = 0.05$	0.0	0.0445	0.0365	0.0410	0.0380
	0.5	0.0690	0.0765	0.0770	0.0545
$T_{n}^{Z H}, α = 0.01$	0.0	0.0050	0.0035	0.0020	0.0020
	0.5	0.0125	0.0090	0.0070	0.0040
$T_{n}^{G W Z}, α = 0.10$	0.0	0.1015	0.1020	0.0995	0.1125
	0.5	0.2380	0.3745	0.5450	0.8265
$T_{n}^{G W Z}, α = 0.05$	0.0	0.0615	0.0675	0.0670	0.0580
	0.5	0.1700	0.2750	0.4560	0.7725
$T_{n}^{G W Z}, α = 0.01$	0.0	0.0240	0.0270	0.0290	0.0335
	0.5	0.1015	0.1655	0.3360	0.6260

Equations (485)

$$Y=g(\beta_{0}^{\top}X,\theta_{0})+\varepsilon\quad\text{for some }\beta_{0}\in\mathbb{R}^{p},\ \theta_{0}\in\mathbb{R}^{d},$$

$$Y=G(B^{\top}X)+\varepsilon.$$

$$H_{0}:\ Y=g(\beta_{0}^{\top}X,\theta_{0})+\varepsilon\quad\text{for some }\beta_{0}\in\mathbb{R}^{p},\ \theta_{0}\in\mathbb{R}^{d}.$$

$$(\hat{\beta}_{n},\hat{\theta}_{n})=\mathop{\rm arg\,min}_{\beta,\theta}\sum_{i=1}^{n}[Y_{i}-g(\beta^{\top}X_{i},\theta)]^{2}.$$

$$(\tilde{\beta}_{0},\tilde{\theta}_{0})=\mathop{\rm arg\,min}_{\beta,\theta}E[Y-g(\beta^{\top}X,\theta)]^{2}.$$

$$g^{\prime}(\beta,\theta,x)=\frac{\partial g(\beta^{\top}x,\theta)}{\partial(\beta,\theta)},\quad g^{\prime\prime}(\beta,\theta,x)=\frac{\partial g^{\prime}(\beta,\theta,x)}{\partial(\beta,\theta)}.$$

$$\Sigma_{n}=E[g^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)g^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)^{\top}]-E[eg^{\prime\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)]=:\Sigma_{1n}-\Sigma_{2n}.$$

$$\hat{\gamma}_{n}-\tilde{\gamma}_{0}=\Sigma_{n}^{-1}\frac{1}{n}\sum_{i=1}^{n}[Y_{i}-g(\tilde{\beta}_{0}^{\top}X_{i},\tilde{\theta}_{0})]g^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X_{i})+o_{p}\left(\frac{1}{\sqrt{n}}\right).$$

$$H_{0}:\ P\{E(Y|X)=g(\beta_{0}^{\top}X,\theta_{0})\}=1\quad\text{for some }\beta_{0}\in\mathbb{R}^{p},\ \theta_{0}\in\mathbb{R}^{d},$$

$$H_{1}:\ P\{E(Y|X)=G(B^{\top}X)=g(\beta^{\top}X,\theta)\}<1\quad\forall\,\beta\in\mathbb{R}^{p},\ \theta\in\mathbb{R}^{d}$$

$$E[eI(B^{\top}X\leq u)]=E[eI(\kappa\beta_{0}^{\top}X\leq u)]=0.$$

$$E[eI(\alpha^{\top}B^{\top}X\leq u)]\neq 0$$

$$V_{n}(\hat{\alpha},u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}[Y_{i}-g(\hat{\beta}_{n}^{\top}X_{i},\hat{\theta}_{n})]I(\hat{\alpha}^{\top}\hat{B}_{n}^{\top}X_{i}\leq u),$$

$$V_{n}(u)=\sup_{\hat{\alpha}\in\mathcal{S}_{\hat{q}}^{+}}|V_{n}(\hat{\alpha},u)|$$

$$Y=G(B^{\top}X)+\varepsilon,$$

$$Y\perp\!\!\!\perp E(Y|X)\,|\,\beta_{0}^{\top}X,\quad\text{and}\quad Y\perp\!\!\!\perp E(Y|X)\,|\,B^{\top}X,$$

$$Y\perp\!\!\!\perp X\,|\,B^{\top}X\iff h_{t}(Y)\perp\!\!\!\perp X\,|\,B^{\top}X,\quad\forall\,t\in\mathbb{R}.$$

$$M=\int E[Xh_{t}(Y)]E[X^{\top}h_{t}(Y)]\,dF_{Y}(t),$$

$$\hat{M}_{n}=\frac{1}{n}\sum_{j=1}^{n}\hat{\alpha}_{Y_{j}}\hat{\alpha}_{Y_{j}}^{\top}.$$

$$\lambda_{1}\geq\dots\geq\lambda_{q}>\lambda_{q+1}=\dots=\lambda_{p}=0.$$

$$\hat{q}=\mathop{\rm arg\,min}_{1\leq i\leq p}\left\{i:\frac{\hat{\lambda}_{i+1}^{2}+c}{\hat{\lambda}_{i}^{2}+c}\right\}.$$

$$V_{n}^{0}(u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}[Y_{i}-g(\beta_{0}^{\top}X_{i},\theta_{0})]I(\kappa\beta_{0}^{\top}X_{i}\leq u).$$

$\sigma_{n}^{2}(v)$

$\psi_{n}(u)$

$$Cov[V_{n}^{0}(s),V_{n}^{0}(t)]=\psi_{n}(s\wedge t).$$

$$V_{n}^{0}(u)\longrightarrow V_{\infty}(u)\quad\text{in distribution},$$

$$V_{n}(\hat{\alpha},u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}[Y_{i}-g(\hat{\beta}_{n}^{\top}X_{i},\hat{\theta}_{n})]I(\hat{B}_{n}^{\top}X_{i}\leq u)$$

$$V_{n}(\hat{\alpha},u)=V_{n}^{0}(u)-\sqrt{n}(\hat{\gamma}_{n}-\gamma_{0})^{\top}M_{n}(u)+o_{p}(1)$$

$$V_{n}(\hat{\alpha},u)=V_{n}^{0}(u)-\frac{1}{\sqrt{n}}M_{n}(u)^{\top}\Sigma_{n}^{-1}\sum_{i=1}^{n}g^{\prime}(\beta_{0},\theta_{0},X_{i})\varepsilon_{i}+o_{p}(1)$$

$$V_{n}(u)\longrightarrow|V_{\infty}^{1}(u)|,$$

$K_{n}(s,t)$

Taxonomy

Topics: Statistical Methods and Inference · Statistical Methods and Bayesian Inference · Bayesian Methods and Mixture Models

Full text

Estimation and adaptive-to-model testing for regressions with diverging number of predictors

Lixing Zhu is a Chair Professor in the Department of Mathematics at Hong Kong Baptist University, Hong Kong, China. He was supported by a grant from the University Grants Council of Hong Kong, Hong Kong, China.

Falong Tan and Lixing Zhu

Department of Mathematics, Hong Kong Baptist University, Hong Kong

Abstract

The research described in this paper is motivated by model checking for parametric single-index models with diverging number of predictors. To construct a test statistic, we first study the asymptotic property of the estimators of involved parameters of interest under the null and alternative hypothesis when the dimension is divergent to infinity as the sample size goes to infinity. For the testing problem, we study an adaptive-to-model residual-marked empirical process as the basis for constructing a test statistic. By modifying the approach in the literature to suit the diverging dimension settings, we construct a martingale transformation. Under the null, local and global alternative hypothesis, the weak limits of the empirical process are derived and then the asymptotic properties of the test statistic are investigated. Simulation studies are carried out to examine the performance of the test.

Key words: Adaptive-to-model test; Empirical process; Martingale transformation; Parametric single-index models; Sufficient dimension reduction.

1 Introduction

Regression modelling is a central problem in statistical analysis. One important step in regression modelling is to check the adequacy of a model that will be used in further analysis, so as to prevent possibly wrong conclusions. There are a number of proposals available in the literature, which will be reviewed later. However, there is an important issue that has not been well studied. In high-dimensional data analysis, the dimension $p$ of the predictor vector is often large even though it is still small compared with the sample size $n$. In this case, we often regard $p$ as diverging as $n$ goes to infinity. A relevant reference is Huber (1973), who considered a problem where $p$ goes to infinity at the rate $O(n^{1/4})$.

In this paper, we focus on inference for parametric single-index models. Although they have the form of generalized linear models, we do not use this name because generalized linear models have their own definition in the literature. Let $Y$ be a response variable associated with a $p$-dimensional predictor vector $X\in\mathbb{R}^{p}$. If $Y$ is integrable, the regression function $g(x)=E(Y|X=x)$ is well-defined. Let $\mathcal{G}=\{g(\beta^{\top}\cdot,\theta):\beta\in\mathbb{R}^{p},\theta\in\mathbb{R}^{d}\}$ be a given parametric family of functions. The present study is motivated by checking whether $g(\cdot,\cdot)$ belongs to $\mathcal{G}$ or not. Thus the null hypothesis we want to test is that $(Y,X)$ follows the parametric single-index model

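$$Y=g(\beta_{0}^{\top}X,\theta_{0})+\varepsilon\quad\text{for some }\beta_{0}\in\mathbb{R}^{p},\ \theta_{0}\in\mathbb{R}^{d},\tag{1.1}$$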

where $\varepsilon=Y-E(Y|X)$ is the error term, $d$ is fixed, $p$ diverges as the sample size $n$ tends to infinity, and $\top$ denotes the transposition.

We now review existing methodologies in the literature. There are two major classes of tests: locally smoothing tests and globally smoothing tests. Locally smoothing tests use nonparametric smoothing estimators to construct test statistics; see Härdle and Mammen (1993), Zheng (1996), Fan and Li (1996), Dette (1999), Fan and Huang (2001), Koul and Ni (2004), and Van Keilegom et al. (2008) as examples. Globally smoothing tests construct test statistics based on averages of functionals of empirical processes and thus avoid nonparametric estimation. They are called globally smoothing tests because averaging is also a globally smoothing step. Examples include Bierens (1982, 1990), Stute (1997), Stute, Thies, and Zhu (1998), Stute et al. (1998), and Khmaladze and Koul (2004).

All existing methods are limited to fixed-dimension settings. The extension to the diverging-dimension case is by no means trivial. When the dimension $p$ is large, most existing tests, especially locally smoothing tests, perform badly. The test of Stute and Zhu (2002) can be regarded as a dimension reduction-based test. A martingale transformation makes it asymptotically distribution-free. This test has proved to be powerful in many cases, even when $p$ is large. But Stute and Zhu’s (2002) test is not omnibus, i.e., it fails to be consistent against all alternative hypotheses and thus is a directional test. Escanciano (2006) gave detailed comments on this issue and proposed, as did Lavergne and Patilea (2008, 2012), tests based on projected covariates. Guo et al. (2016) also did so and put forward a model adaptation notion in hypothesis testing. This innovative notion provides a deep insight into model checking for regressions, and the adaptive-to-model approach can make full use of the model structures under both the null and alternative hypotheses. Recently, with the help of sufficient dimension reduction techniques, Tan et al. (2017) generalized Stute and Zhu’s (2002) method and obtained an omnibus test that is asymptotically distribution-free and inherits the dimension reduction properties. It performs very well, but still requires $p$ to be fixed. In this paper, we develop a consistent diagnostic test for checking the adequacy of a single-index model when the dimension $p$ of the predictor vector diverges to infinity as the sample size $n$ tends to infinity.

To make full use of the model structure under both the null hypothesis and the alternative hypothesis, we consider the following alternative model

$$Y=G(B^{\top}X)+\varepsilon,\tag{1.2}$$

where $E(\varepsilon|X)=0$, $G(\cdot)$ is an unknown smooth function and $B$ is a $p\times q$ orthonormal matrix with an unknown $q$ satisfying $1\leq q\leq p$. Note that model (1.2) is more general than the nonparametric model $Y=G(X)+\varepsilon$, which is the special case in which $B$ is a $p\times p$ orthonormal matrix with $q=p$.

As in Stute and Zhu (2002), we use a residual-marked empirical process and the martingale transformation to construct a test statistic based on a projected predictor vector. However, when only the projected predictor vector under the null hypothesis is used, as in Stute and Zhu (2002), the resulting test cannot be omnibus. Stute et al. (1998a) constructed a residual-marked empirical process by using the original predictor vector. When $p$ diverges, that test suffers severely from the curse of dimensionality in theory. To alleviate these difficulties, we adopt a model adaptation strategy as Tan et al. (2017) did. It adaptively uses projected predictors under the null and alternative hypotheses. Under the null, only one projected predictor is used, as in Stute and Zhu’s construction, while under the alternatives it automatically uses all projections on the $q$-dimensional unit sphere to guarantee the omnibus property. Although this idea seems workable, the theoretical investigation becomes very complicated because of the diverging dimension. There are no relevant results in the literature about the convergence of the residual-marked empirical process with diverging $p$. Even when we can obtain its limiting Gaussian process, the shift term created by estimating the parameter of interest has no simple formula from which we could easily motivate the martingale transformation construction proposed by Stute, Thies, and Zhu (1998) to make the test asymptotically distribution-free. This is a typical problem when $p$ diverges, and it does not arise when $p$ is fixed.

The paper is organized as follows. Section 2 contains the asymptotic properties of the ordinary least squares estimator in the diverging dimension setting. Based on this, we define an adaptive-to-model residual-marked empirical process as the basis of the proposed test statistic. Since sufficient dimension reduction theory plays a crucial role in achieving the adaptive-to-model property, we give a brief review in this section and study the convergence rate of the relevant estimators. In Section 3, we present the limit of the adaptive-to-model empirical process under the null hypothesis and investigate its asymptotics. Then we use a modified approach to define a martingale transformation because the shift term has no closed form in the diverging dimension setting. The asymptotic properties of the martingale transformation-based innovation process under both the null and the alternatives are studied. We also show that when $p$ is fixed, this transformation is equivalent to Stute and Zhu’s (2002) martingale transformation. In Section 4, we give the test statistic for practical use and conduct several simulation studies. A real data example is analysed in Section 5 for illustration. Section 6 contains a discussion. Technical proofs are deferred to the Appendix.

2 Adaptive-to-model residual-marked empirical process

2.1 Preliminary

Let $\{(X_{1},Y_{1}),\cdots,(X_{n},Y_{n})\}$ be an i.i.d. sample with the same distribution as $(X,Y)$ and let $\varepsilon=Y-E(Y|X)$ be the unpredictable part of $Y$ given $X$. Recall that $\mathcal{G}=\{g(\beta^{\top}\cdot,\theta):\beta\in\mathbb{R}^{p},\theta\in\mathbb{R}^{d}\}$. We want to test whether or not

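$$H_{0}:\ Y=g(\beta_{0}^{\top}X,\theta_{0})+\varepsilon\quad\text{for some }\beta_{0}\in\mathbb{R}^{p},\ \theta_{0}\in\mathbb{R}^{d}.$$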

To estimate the unknown $(\beta_{0},\theta_{0})$, we restrict ourselves in this paper to the ordinary least squares method. Let

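$$(\hat{\beta}_{n},\hat{\theta}_{n})=\mathop{\rm arg\,min}_{\beta,\theta}\sum_{i=1}^{n}[Y_{i}-g(\beta^{\top}X_{i},\theta)]^{2}.$$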

To analyze the asymptotic properties of $(\hat{\beta}_{n},\hat{\theta}_{n})$, define

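$$(\tilde{\beta}_{0},\tilde{\theta}_{0})=\mathop{\rm arg\,min}_{\beta,\theta}E[Y-g(\beta^{\top}X,\theta)]^{2}.$$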

It is easy to see that if $g(\cdot,\cdot)\in\mathcal{G}$, we have $(\tilde{\beta}_{0},\tilde{\theta}_{0})=(\beta_{0},\theta_{0})$. If $g\notin\mathcal{G}$, $(\tilde{\beta}_{0},\tilde{\theta}_{0})$ typically depends on the distribution of $X$. Let $e=Y-g(\tilde{\beta}_{0}^{\top}X,\tilde{\theta}_{0})$. Then under the null hypothesis we have $e=\varepsilon$.

To study the asymptotic properties of $(\hat{\beta}_{n},\hat{\theta}_{n})$ when $p$ diverges, we first introduce some notation; the regularity conditions are postponed to the Appendix. Suppose that $g(\beta^{\top}x,\theta)$ is three times differentiable with respect to $(\beta,\theta)$. Let

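$$g^{\prime}(\beta,\theta,x)=\frac{\partial g(\beta^{\top}x,\theta)}{\partial(\beta,\theta)},\quad g^{\prime\prime}(\beta,\theta,x)=\frac{\partial g^{\prime}(\beta,\theta,x)}{\partial(\beta,\theta)}.$$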

The matrix $g^{\prime\prime}(\beta,\theta,x)$ is used in the following matrix $\Sigma_{n}$ which will play a crucial role in deriving the asymptotic properties of $(\hat{\beta}_{n},\hat{\theta}_{n})$ :

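$$\Sigma_{n}=E[g^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)g^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)^{\top}]-E[eg^{\prime\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)]=:\Sigma_{1n}-\Sigma_{2n}.$$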
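A short derivation may help explain where $\Sigma_{n}$ comes from. It is a sketch rather than part of the paper's proofs, and it assumes that differentiation and expectation can be interchanged. The symbols $\gamma=(\beta^{\top},\theta^{\top})^{\top}$ and $L(\gamma)$, the population least squares criterion, are notation introduced only for this illustration.

```latex
\begin{aligned}
L(\gamma) &= \tfrac{1}{2}E[Y-g(\beta^{\top}X,\theta)]^{2},\\
\nabla L(\gamma) &= -E\{[Y-g(\beta^{\top}X,\theta)]\,g^{\prime}(\beta,\theta,X)\},\\
\nabla^{2}L(\gamma) &= E[g^{\prime}(\beta,\theta,X)g^{\prime}(\beta,\theta,X)^{\top}]
  -E\{[Y-g(\beta^{\top}X,\theta)]\,g^{\prime\prime}(\beta,\theta,X)\}.
\end{aligned}
```

At the minimizer $(\tilde{\beta}_{0},\tilde{\theta}_{0})$ the residual $Y-g$ equals $e$, so the Hessian is $E[g^{\prime}g^{\prime\top}]-E[eg^{\prime\prime}]$ (arguments suppressed), which is the matrix $\Sigma_{n}$, and the first-order condition reads $E[eg^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)]=0$. Under the null hypothesis $e=\varepsilon$ with $E(\varepsilon|X)=0$, so the second term vanishes and $\Sigma_{n}$ reduces to its first term.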

The next two results give the norm consistency of $(\hat{\beta}_{n},\hat{\theta}_{n})$ with respect to $(\tilde{\beta}_{0},\tilde{\theta}_{0})$ and the decomposition of $\left(\begin{array}{c}\hat{\beta}_{n}-\tilde{\beta}_{0}\\ \hat{\theta}_{n}-\tilde{\theta}_{0}\end{array}\right)$ into independent and identically distributed summands. This decomposition generalizes the results of White (1981) to the case where the dimension $p$ of the predictor vector diverges. For simplicity, we define hereafter $\hat{\gamma}_{n}=(\hat{\beta}_{n}^{\top},\hat{\theta}_{n}^{\top})^{\top}$, $\tilde{\gamma}_{0}=(\tilde{\beta}_{0}^{\top},\tilde{\theta}_{0}^{\top})^{\top}$ and $\gamma_{0}=(\beta_{0}^{\top},\theta_{0}^{\top})^{\top}$.

Proposition 1.

Suppose that conditions (A1)-(A6) in the Appendix hold. If $p^{4}/n\to 0$, then $\hat{\gamma}_{n}$ is a norm consistent estimator of $\tilde{\gamma}_{0}$ in the sense that $\|\hat{\gamma}_{n}-\tilde{\gamma}_{0}\|=O_{p}(\sqrt{p/n})$, where $\|\cdot\|$ denotes the Frobenius norm.

The convergence rate of order $\sqrt{p/n}$ is in line with the results for M-estimators obtained by Huber (1973) and Portnoy (1984) when the number of parameters $p$ diverges. For the asymptotic decomposition, we have the following result.

Proposition 2.

If $p^{5}/n\rightarrow 0$ and conditions (A1)-(A6) in Appendix hold, we then have

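$$\hat{\gamma}_{n}-\tilde{\gamma}_{0}=\Sigma_{n}^{-1}\frac{1}{n}\sum_{i=1}^{n}[Y_{i}-g(\tilde{\beta}_{0}^{\top}X_{i},\tilde{\theta}_{0})]g^{\prime}(\tilde{\beta}_{0},\tilde{\theta}_{0},X_{i})+o_{p}\left(\frac{1}{\sqrt{n}}\right).$$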

Remark 1.

The rate $p^{4}/n\to 0$ or $p^{5}/n\to 0$ as $n\to\infty$ seems slow. From the arguments used to prove Propositions 1 and 2 in the Appendix, we can see that if the model is linear, that is, $g(\beta^{\top}X,\theta)=\beta^{\top}X$, then $g^{\prime\prime}(\beta,\theta,x)=0$ and $g^{\prime\prime\prime}(\beta,\theta,x)=0$. In that case we obtain the norm consistency of $\hat{\gamma}_{n}$ to $\tilde{\gamma}_{0}$ and the asymptotic decomposition of $\hat{\gamma}_{n}-\tilde{\gamma}_{0}$ under the conditions $p^{2}/n\to 0$ and $p^{3}/n\to 0$, respectively. This condition is the same as that of Huber (1973), who considered only the linear model. Portnoy (1984, 1985) obtained the norm consistency and the asymptotic normality under weaker conditions, again for linear models. However, it is hard to check in practice which models, other than linear models, can satisfy his conditions. Further, extending these results to the parametric single-index models considered here is, to the best of our knowledge, still an open question.
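For the linear case singled out in this remark, the $\sqrt{p/n}$ scale of Proposition 1 can be illustrated numerically. The following is a minimal sketch using only the Python standard library. The design (standard normal predictors, a coefficient vector of ones, standard normal errors) and all function names are assumptions made for illustration, not the paper's simulation setup.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(x_rows, y):
    """Least squares for the linear link: solve (X^T X) beta = X^T y."""
    p = len(x_rows[0])
    xtx = [[sum(row[j] * row[k] for row in x_rows) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(x_rows, y)) for j in range(p)]
    return solve_linear_system(xtx, xty)


def estimation_error(n, p, seed=0):
    """Euclidean norm of beta_hat - beta_0 in one simulated linear model."""
    rng = random.Random(seed)
    beta0 = [1.0] * p  # hypothetical true coefficient vector
    x_rows = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [sum(b * x for b, x in zip(beta0, row)) + rng.gauss(0.0, 1.0)
         for row in x_rows]
    beta_hat = ols(x_rows, y)
    return math.sqrt(sum((bh - b) ** 2 for bh, b in zip(beta_hat, beta0)))
```

For instance, `estimation_error(500, 4)` can be compared with `math.sqrt(4 / 500)`. A single draw only indicates the order of magnitude, so repeating the call over several seeds gives a fairer picture.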

2.2 Basic test statistic construction

Recall the null hypothesis:

[TABLE]

against the alternative hypothesis:

[TABLE]

where $G(\cdot)$ is an unknown smooth function and the $p\times q$ orthonormal matrix $B$ is given in (1.2). We assume that $\tilde{\beta}_{0}\in\mathcal{S}_{E(Y|X)}$ under both the null and alternative hypotheses, where $\mathcal{S}_{E(Y|X)}$ is the central mean subspace such that $\mathcal{S}_{E(Y|X)}={\rm span}(B)$ . Under the null hypothesis, this is obvious. Under the alternative hypothesis, $\tilde{\beta}_{0}$ is not necessarily parallel to $\beta_{0}$ , but it can reasonably be expected to be a linear combination of the columns of the matrix $B$ . Thus the assumption is not restrictive.

Also recall $\varepsilon=Y-E(Y|X)$ and $e=Y-g(\tilde{\beta}_{0}^{\top}X,\tilde{\theta}_{0})$ . Under the null hypothesis, $e=\varepsilon,q=1$ and $B=\kappa\beta_{0}$ with $\kappa=\pm\frac{1}{\|\beta_{0}\|}$ . Therefore, we obtain that $E(e|B^{\top}X)=E(e|\beta_{0}^{\top}X)=0$ . Under the alternative hypothesis, we have $E(e|B^{\top}X)=G(B^{\top}X)-g(\tilde{\beta}_{0}^{\top}X,\tilde{\theta}_{0})\neq 0$ . Then it follows that under the null hypothesis

[TABLE]

Under the alternative, by Lemma 1 of Escanciano (2006), there exists an $\alpha\in\mathcal{S}_{q}^{+}$ such that $E(e|\alpha^{\top}B^{\top}X)\neq 0$ , where $\mathcal{S}_{q}^{+}=\{\alpha=(a_{1},\cdots,a_{q})^{\top}\in\mathbb{R}^{q}:\|\alpha\|=1\ {\rm and}\ a_{1}\geq 0\}$ . Then it follows that

[TABLE]

Note that under the null we have $q=1$ and $\mathcal{S}_{q}^{+}=\{1\}$ . Thus the quantity $E[eI(\alpha^{\top}B^{\top}X\leq u)]$ actually has the same form in both (2.2) and (2.3). Define an adaptive-to-model residual marked empirical process $V_{n}(u)$ in the diverging dimension setting as below

[TABLE]

where $\hat{\beta}_{n}$ and $\hat{\theta}_{n}$ are defined as before and $\hat{B}_{n}$ is the sufficient dimension reduction estimator of $B$ with an estimated structural dimension $\hat{q}$ of $q$ , which will be specified later. For $V_{n}(u)$ , one can also use the integral over $\mathcal{S}_{\hat{q}}^{+}$ to define a test statistic.

To achieve the model adaptation property of the process, we need sufficient dimension reduction (SDR) techniques to identify the structural dimension $q$ and the matrix $B$ , when $p$ diverges to infinity. We give a brief review below on this topic.

2.3 Adaptive-to-model approach

In this methodology, we need to identify the dimension $q$ and the matrix $B$ , which can be done by the methods of sufficient dimension reduction. We give a brief description here. Recall that under the alternative hypothesis the model is

[TABLE]

where $E(\varepsilon|X)=0$ and $G(\cdot)$ is an unknown smooth function and $B$ is a $p\times q$ orthonormal matrix with $1\leq q\leq p$ . We can see that under both the null and alternative hypothesis, the conditional independence holds respectively:

[TABLE]

where $\bot\!\!\!\bot$ means statistical independence. Define $\mathcal{S}_{E(Y|X)}$ as the central mean subspace of $Y$ with respect to $X$ (see Cook and Li, 2002), that is, the intersection of all subspaces ${\rm span}(A)$ spanned by the columns of a matrix $A$ such that $Y\bot\!\!\!\bot E(Y|X)|A^{\top}X$ . The dimension of $\mathcal{S}_{E(Y|X)}$ is called the structural dimension, denoted as $d_{E(Y|X)}$ . Under mild conditions, such a subspace $\mathcal{S}_{E(Y|X)}$ always exists (see Cook and Li, 2002). If $\mathcal{S}_{E(Y|X)}=\rm{span}(A)$ , then $E(Y|X)=E(Y|A^{\top}X)$ . Under the null hypothesis (1.1), $d_{E(Y|X)}=1$ and $\mathcal{S}_{E(Y|X)}=\rm{span}(\beta_{0}/\|\beta_{0}\|)$ . Under the alternative (1.2), $d_{E(Y|X)}=q$ and $\mathcal{S}_{E(Y|X)}=\rm{span}(B)$ . For simplicity, we assume throughout this paper that $\mathcal{S}_{E(Y|X)}=\mathcal{S}_{Y|X}$ . Here $\mathcal{S}_{Y|X}$ is the central subspace of $Y$ with respect to $X$ (see Cook, 1998).

There are several estimation proposals available in the literature, for instance, sliced inverse regression (SIR, Li (1991)), sliced average variance estimation (SAVE, Cook and Weisberg (1991)), minimum average variance estimation (MAVE, Xia et al. (2002)), directional regression (DR, Li and Wang (2007)), and discretization-expectation estimation (DEE, Zhu et al. (2010a)). All these methods assume that $p$ is fixed. Zhu, Miao, and Peng (2006) first discussed the asymptotic properties of SIR when $p$ diverges to infinity. In this paper, we adapt cumulative slicing estimation (CSE, Zhu, Zhu, and Feng (2010b)) to identify the central subspace, which is similar to DEE. We do so because both methods are easy to implement and easy to extend to the case where the dimension $p$ grows to infinity.

The procedure of CSE is as follows. For simplicity, we assume $E(X)=0,Var(X)=I_{p}$ for the moment. If the linearity condition (see Li, 1991) holds, it is easy to see that $E[Xh(Y)]\in\mathcal{S}_{Y|X}$ for any function $h(\cdot)$ . Theoretically, we can thus obtain infinitely many vectors in $\mathcal{S}_{Y|X}$ . Zhu et al. (2010b) suggested a determining class of indicator functions to replace $h(\cdot)$ . Let $h_{t}(Y)=I(Y\leq t)$ . It follows that

[TABLE]

Define the target matrix

[TABLE]

where $F_{Y}$ denotes the cumulative distribution function of $Y$ . If the rank of $M$ is $q$ , then ${\rm span}(M)=\mathcal{S}_{Y|X}$ . Based on this, it is easy to obtain the sample version of $M$ . Let $Z_{i}$ be the standardized $X_{i}$ and $\hat{\alpha}_{t}=\frac{1}{n}\sum_{i=1}^{n}Z_{i}I(Y_{i}\leq t)$ . The estimator of $M$ is given by

[TABLE]

If the structural dimension $q$ is given, an estimator $\hat{B}_{n}(q)$ of $B$ consists of the eigenvectors corresponding to the largest $q$ eigenvalues of $\hat{M}_{n}$ . Throughout this paper, we assume that $q$ is fixed.
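The CSE steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `cse_target_matrix` and `cse_directions` are hypothetical, and because the display defining $\hat{M}_{n}$ is not reproduced here, the code assumes the sample form $\hat{M}_{n}=\frac{1}{n}\sum_{j=1}^{n}\hat{\alpha}_{Y_{j}}\hat{\alpha}_{Y_{j}}^{\top}$ , that is, the cut points $t$ are taken to be the observed responses.

```python
import numpy as np

def cse_target_matrix(X, Y):
    """Sample CSE target matrix for predictors X of shape (n, p) and responses Y of shape (n,)."""
    n, p = X.shape
    # Standardize the predictors: Z_i = Sigma_hat^{-1/2} (X_i - X_bar).
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ inv_sqrt
    # Row j holds alpha_hat_t at t = Y_j: (1/n) * sum_i Z_i * I(Y_i <= Y_j).
    indicators = (Y[None, :] <= Y[:, None]).astype(float)
    alpha_hat = indicators @ Z / n
    # Assumed sample form: average of the outer products over the observed cut points.
    return alpha_hat.T @ alpha_hat / n

def cse_directions(X, Y, q):
    """Eigenvalues of M_hat in decreasing order and the q leading eigenvectors (Z scale)."""
    M_hat = cse_target_matrix(X, Y)
    evals, evecs = np.linalg.eigh(M_hat)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order[:q]]
```

The eigenvectors returned here live on the scale of the standardized predictors $Z_{i}$ ; when $Var(X)$ is not the identity they would have to be mapped back to the original scale.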

Yet we need a consistent estimator $\hat{q}$ of $q$ as $q$ is usually unknown under the alternative hypothesis. Later we will see that even when $q$ is given, we still want a consistent estimator because we wish the test to have model adaptation property to fully use the dimension reduction structure under the null hypothesis. Inspired by Xia et al. (2015), we suggest a minimum ridge-type eigenvalue ratio estimator (MRER) to determine $q$ . Let $\hat{\lambda}_{1}\geq\cdots\geq\hat{\lambda}_{p}$ and ${\lambda}_{1}\geq\cdots\geq{\lambda}_{p}$ be the eigenvalues of the matrix $\hat{M}_{n}$ and $M$ respectively. Since $rank(M)=q$ , it follows that

[TABLE]

Hence we estimate the structural dimension $q$ by

[TABLE]

Here $\hat{\lambda}_{p+1}$ is defined as [math] and the ridge $c$ is a positive constant. The following result shows that the consistency of MRER is adaptive to the underlying models when $c$ equals an appropriate constant. Its proof is given in the Appendix.

Proposition 3.

*Suppose that the regularity conditions of Theorem 3 in Zhu et al. (2010b) hold. Let $\hat{B}_{n}(q)$ be a matrix whose columns are the eigenvectors that are associated with the largest $q$ eigenvalues of $\hat{M}_{n}$ . If $c=\log{n}/n$ , then

(1) under $H_{0}$ , we have $\mathbb{P}(\hat{q}=1)\to 1$ and $\|\hat{B}_{n}(1)-\kappa\beta_{0}\|=O_{p}(\sqrt{p/n})$ ;

(2) under $H_{1}$ , we have $\mathbb{P}(\hat{q}=q)\to 1$ and $\|\hat{B}_{n}(q)-B\|=O_{p}(\sqrt{p/n})$ .*
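A ridge-type eigenvalue ratio rule of the kind described above can be sketched as follows. The displayed definition of $\hat{q}$ is not reproduced here, so the exact criterion in the code is an assumption: it minimizes $(\hat{\lambda}_{j+1}^{2}+c)/(\hat{\lambda}_{j}^{2}+c)$ over $1\leq j\leq p$ and pads the spectrum with $\hat{\lambda}_{p+1}=0$ . The function name `mrer_dimension` is hypothetical.

```python
import numpy as np

def mrer_dimension(eigenvalues, c):
    """Estimate the structural dimension by minimizing a ridge-type eigenvalue ratio."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    # Assumed convention: pad the spectrum with lambda_hat_{p+1} = 0.
    lam = np.append(lam, 0.0)
    # Assumed criterion: (lambda_{j+1}^2 + c) / (lambda_j^2 + c) for j = 1, ..., p.
    ratios = (lam[1:] ** 2 + c) / (lam[:-1] ** 2 + c)
    return int(np.argmin(ratios)) + 1

# Ridge suggested in Proposition 3, here for a sample of size n = 400.
n = 400
c = np.log(n) / n
q_hat = mrer_dimension([0.07, 0.002, 0.001], c)
```

The ridge keeps the ratios of near-zero eigenvalues close to one, so the minimum is attained at the last eigenvalue that is clearly separated from zero.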

3 Main results

3.1 Basic properties of the process

First, we discuss the asymptotic properties of the process $V_{n}(\hat{\alpha},u)$ under the null hypothesis. Since the distributional limit theory becomes much simpler if we replace the estimators by their true values, we define the following process

[TABLE]

Put

[TABLE]

Then we have $\sigma_{n}^{2}(v)=E(\varepsilon^{2}|\kappa\beta_{0}^{\top}X=v)$ and $\psi_{n}(u)=\int_{-\infty}^{u}\sigma_{n}^{2}(v)F_{\kappa\beta_{0}}(dv)$ where $F_{\kappa\beta_{0}}$ is the cumulative distribution function of $\kappa\beta_{0}^{\top}X$ . Obviously, $\psi_{n}(u)$ is a nondecreasing and nonnegative function. Since $V_{n}^{0}(u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varepsilon_{i}I(\kappa\beta_{0}^{\top}X_{i}\leq u)$ is a centered residual cusum process, it is readily seen that

[TABLE]

By Theorem 2.11.22 in Van Der Vaart and Wellner (1996), we obtain that $V_{n}^{0}(u)$ is asymptotically tight. If $\psi_{n}(u)\to\psi(u)$ pointwise in $u$ , it follows that

[TABLE]

in the space $\ell^{\infty}(\overline{R})$ , where $V_{\infty}(u)$ is a centred Gaussian process with the covariance function $\psi(s\wedge t)$ . Since $\psi(u)$ is also nondecreasing and nonnegative, it follows that $V_{\infty}(u)=B(\psi(u))$ in distribution, where $B(u)$ is a standard Brownian motion.
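To make the residual cusum process concrete, the sketch below evaluates $V_{n}^{0}(u)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varepsilon_{i}I(\kappa\beta_{0}^{\top}X_{i}\leq u)$ at the observed values of the projected covariate, where the process has its jumps. The function name `residual_cusum` is hypothetical, and the code assumes there are no ties among the projections.

```python
import numpy as np

def residual_cusum(residuals, projections):
    """Evaluate the residual-marked cusum process at the sorted projection values."""
    residuals = np.asarray(residuals, dtype=float)
    projections = np.asarray(projections, dtype=float)
    n = len(residuals)
    # Sorting by the projected covariate turns the indicator sum into a cumulative sum.
    order = np.argsort(projections)
    u_grid = projections[order]
    values = np.cumsum(residuals[order]) / np.sqrt(n)
    return u_grid, values

u_grid, values = residual_cusum([1.0, -2.0, 3.0], [0.3, 0.1, 0.2])
```

Replacing the errors by fitted residuals and the true direction by its estimate gives the sample analogue used for composite model checks.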

For composite model checks, the unknown parameters in $V_{n}^{0}(u)$ should be replaced by their estimators, so we go back to $V_{n}(\hat{\alpha},u)$ as defined in (2.4). By Proposition 3, $\mathbb{P}(\hat{q}=1)\to 1$ under the null hypothesis. Thus we only need to work on the event $\{\hat{q}=1\}$ . Consequently, $\mathcal{S}_{\hat{q}}^{+}=\{1\}$ and $V_{n}(\hat{\alpha},u)$ can be rewritten as

[TABLE]

Under some regularity conditions stated in Appendix and on the event $\{\hat{q}=1\}$ , we can show that under the null hypothesis

[TABLE]

uniformly in $u$ , where $M_{n}(u)=E[g^{\prime}(\beta_{0},\theta_{0},X)I(\kappa\beta_{0}^{\top}X\leq u)]$ . A proof of (3.2) will be given in the Appendix. Combining (3.2) with Proposition 2 and some elementary calculations, we have

[TABLE]

uniformly in $u$ . It is easy to see that the second term of the right hand side of (3.3) is also asymptotically tight. Altogether we then obtain the following result.

Theorem 3.1.

Suppose that the regularity conditions in the Appendix hold. If $p^{5}/n\to 0$ , then under the null hypothesis, we have in distribution

[TABLE]

where $V_{\infty}^{1}(u)$ is a zero mean Gaussian process with a covariance function $K(s,t)$ that is the pointwise limit of $K_{n}(s,t)$ as

[TABLE]

3.2 Martingale transformation

If $p$ is fixed, $V_{\infty}^{1}(u)$ can be rewritten as $V_{\infty}^{1}(u)=V_{\infty}(u)+M(u)^{\top}V$ in distribution and its covariance function can be specified. The shift term $M(u)^{\top}V$ is brought out from the second term in (3.3). Stute, Thies, and Zhu (1998) first proposed a martingale transformation to eliminate $M(u)^{\top}V$ in $V_{\infty}^{1}(u)$ and then obtain a tractable limiting distribution of a functional of $V_{\infty}(u)$ . This has become one of the basic methodologies in the area of model checking to derive asymptotically distribution-free tests. It was motivated by the Khmaladze martingale transformation in constructing convenient goodness of fit tests for hypothetical distribution functions (Khmaladze, 1982). There are a number of follow-up studies in the literature that extend this methodology to various high-dimensional models, such as Khmaladze and Koul (2004) and Stute, Xu and Zhu (2008). However, when $p$ diverges as $n$ goes to infinity, the form of the shift term that would be the limit of $M(u)^{\top}V$ cannot be given explicitly, as stated in the above theorem, so the martingale transformation cannot directly target $M(u)^{\top}V$ . We bypass this difficulty by handling the shift term at the sample level. Note that the shift term comes from the second term in (3.2); in the case with fixed $p$ , $M(u)^{\top}V$ is just the weak limit of that term. Thus we target that term directly at the sample level.

Following Stute, Thies, and Zhu (1998) or Stute and Zhu (2002), recall that $M_{n}(u)=E[g^{\prime}(\beta_{0},\theta_{0},X)I(\kappa\beta_{0}^{\top}X\leq u)]$ and $\psi_{n}(u)=\int_{-\infty}^{u}\sigma_{n}^{2}(v)F_{\kappa\beta_{0}}(dv)$ . Let

[TABLE]

be the Radon-Nikodym derivative of $M_{n}(u)$ with respect to $\psi_{n}(u)$ . Next, define a $(p+d)\times(p+d)$ matrix

[TABLE]

It can also be written as

[TABLE]

Mimicking the martingale transformation in Stute and Zhu (2002) at the sample level, we have

[TABLE]

Here we assume that $A_{n}(u)$ is nonsingular and that the process $f_{n}(u)$ is either of bounded variation or a Brownian motion.

Some elementary computation concludes that $T_{n}(\sqrt{n}(\hat{\gamma}_{n}-\gamma_{0})^{\top}M_{n})=0$ . Next, we discuss the approximation properties of $T_{n}V_{n}^{0}$ . Note that

[TABLE]

and

[TABLE]

Combining these two formulas, we obtain that

[TABLE]

Therefore, $T_{n}V_{n}^{0}$ is also an i.i.d. centered residual cusum process with a covariance function

[TABLE]

This means that $T_{n}V_{n}^{0}(u)$ admits the same limiting distribution as that of $V_{n}^{0}(u)$ , i.e.,

[TABLE]

Consequently, we get rid of the annoying shift term $\sqrt{n}(\hat{\gamma}_{n}-\gamma_{0})^{\top}M_{n}$ and obtain the process $V_{\infty}(u)$ whose supremum over all $u$ has a tractable limiting distribution. The assertions (3.5) and (3.6) will be justified in Appendix (Lemma 1).

The transformation $T_{n}$ obviously contains some unknown quantities and therefore needs to be substituted by their empirical analogues. For this, let $g_{1}^{\prime}(t,\theta)=\frac{\partial g(t,\theta)}{\partial t}$ and $g_{2}^{\prime}(t,\theta)=\frac{\partial g(t,\theta)}{\partial\theta}$ . It follows that

[TABLE]

Consequently, we have

[TABLE]

where $r_{n}(v)=E(X|\kappa\beta_{0}^{\top}X=v)$ . Conclude that

[TABLE]

Since $a_{n}(u)$ depends on $r_{n}(u)$ and $\sigma_{n}^{2}(u)$ , on which we make no assumption other than smoothness, they need to be estimated in a nonparametric way. For instance, we may adopt a standard Nadaraya-Watson estimator for $r_{n}(v)$ :

[TABLE]

where $K(\cdot)$ is a univariate kernel function and $h$ is a bandwidth. The function $\sigma_{n}^{2}(u)$ can be estimated similarly. Thus we obtain the empirical estimators $\hat{a}_{n}(u)$ and $\hat{A}_{n}(u)$ of $a_{n}(u)$ and $A_{n}(u)$ respectively:

[TABLE]
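The Nadaraya-Watson step for $r_{n}(v)=E(X|\kappa\beta_{0}^{\top}X=v)$ described above can be sketched as follows. The names `quartic_kernel` and `nadaraya_watson` are hypothetical, and the choice of kernel is an assumption made for illustration (it is the quartic kernel that appears in the simulation section for a competing test); the text only requires a univariate kernel $K(\cdot)$ and a bandwidth $h$ .

```python
import numpy as np

def quartic_kernel(u):
    """Quartic (biweight) kernel, supported on [-1, 1]."""
    return (15.0 / 16.0) * (1.0 - u ** 2) ** 2 * (np.abs(u) <= 1.0)

def nadaraya_watson(v, index, X, h, kernel=quartic_kernel):
    """Estimate E(X | index = v) at each point of v; returns an array of shape (len(v), p)."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    # weights[k, i] = K((v_k - index_i) / h)
    weights = kernel((v[:, None] - index[None, :]) / h)
    denom = weights.sum(axis=1, keepdims=True)
    # Return NaN where no observation falls within one bandwidth of v_k.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(denom > 0, weights @ X / denom, np.nan)
```

Passing the squared residuals as a single-column matrix in place of `X` gives the analogous estimator of $\sigma_{n}^{2}(v)$ .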

Finally, we can give an estimator $\hat{T}_{n}$ of $T_{n}$ :

[TABLE]

where $\hat{\kappa}_{n}$ is the estimator of $\kappa$ and $F_{\hat{\alpha}}$ is the empirical distribution function of $\hat{\alpha}^{\top}\hat{B}^{\top}_{n}X_{i},1\leq i\leq n$ . Making sure the columns of $\hat{B}_{n}$ have the same direction as $\hat{\beta}_{n}$ , we can assume $\kappa=1/\|\beta_{0}\|$ and $\hat{\kappa}_{n}=1/\|\hat{\beta}_{n}\|$ .

Theorem 3.2.

Suppose that $A_{n}(u)$ is nonsingular and $\sigma_{n}^{2}(u)$ is bounded away from zero for all $u$ . If $p^{5}/n\to 0$ , under the null hypothesis $H_{0}$ and the regularity conditions in Appendix, we have

[TABLE]

in distribution in the space $\ell^{\infty}([-\infty,x_{0}])$ for any $x_{0}\in\mathbb{R}$ .

Note that we use $\hat{A}_{n}(u)$ in the process $\hat{T}_{n}V_{n}(\hat{\alpha},u)$ . In concrete data analysis, these matrices may be unbounded for large $u$ and thus the distributional behavior of the underlying process may become very unstable in the extreme right tails. This may severely damage the approximation accuracy of a test statistic based on the whole of $\hat{T}_{n}V_{n}$ . Therefore, we restrict $\hat{T}_{n}V_{n}$ to compact intervals $[-\infty,x_{0}]$ and obtain the convergence of $\sup_{\hat{\alpha}\in\mathcal{S}_{\hat{q}}^{+}}|\hat{T}_{n}V_{n}(\hat{\alpha},u)|$ in the space $\ell^{\infty}([-\infty,x_{0}])$ .

In a special case where the predictor $X$ follows a spherically contoured distribution or its extension, the elliptically contoured distribution, we can show that the calculations of the martingale transformation become much simpler. The idea is similar to Stute and Zhu (2002). Without loss of generality, we only consider spherically contoured distributions. Here we shall assume the regression function $g$ does not depend on $\theta$ . Let $g^{\prime}(t)$ be the derivative of $g(t)$ with respect to $t$ . It follows that

[TABLE]

where $\Gamma$ is a $p\times p$ orthonormal matrix with the first row $\kappa\beta_{0}^{\top}$ (or $\beta_{0}^{\top}/\|\beta_{0}\|$ ). Since the conditional expectation of the other components of $\Gamma X$ given the first is zero, it follows that

[TABLE]

whence,

[TABLE]

Note that $A_{n}(z)$ is a matrix with rank $1$ and is singular when $p>1$ . Thus the martingale transformation cannot be applied directly. However, if we go back to (3.2) and set

[TABLE]

then (3.2) can be rewritten as

[TABLE]

Conclude that the new $a_{n}(u)$ and $A_{n}(u)$ become the real-valued

[TABLE]

Clearly, Theorem 3.2 can be applied to these new functions.

Hall and Li (1993) showed that, if $p\to\infty$ as $n\to\infty$ , expectation over a large number of random variables behaves more or less like expectation over the multivariate normal distribution. Note that $M_{n}(u)=E[g^{\prime}(\beta_{0}^{\top}X)XI(\kappa\beta_{0}^{\top}X\leq u)]$ and that the multivariate normal distribution is elliptically contoured. Consequently, even when $X$ is not multivariate normally distributed, $M_{n}(u)$ can be viewed as an expectation over a multivariate normal distribution, and the martingale transformation $T_{n}$ can then be applied to the real-valued $a_{n}(u)$ and $A_{n}(u)$ in practice for large $p$ .

3.3 The properties under the alternative hypothesis

Now we discuss the asymptotic properties of $\sup_{\hat{\alpha}\in\mathcal{S}_{\hat{q}}^{+}}|\hat{T}_{n}V_{n}(\hat{\alpha},u)|$ under a sequence of local alternatives converging to the null hypothesis at a parametric rate $1/\sqrt{n}$ . Consider

[TABLE]

where $E(\varepsilon|X)=0$ , $G(X)$ is a random variable with zero mean and satisfies $\mathbb{P}\{G(X)=0\}<1$ . To derive the asymptotic distribution of $\hat{T}_{n}V_{n}(\hat{\alpha},u)$ under $H_{1n}$ , we need the asymptotic properties of $\hat{q}$ and $\hat{\gamma}_{n}$ , when $p$ diverges to infinity.

Proposition 4.

Assume the regularity conditions of Theorem 3 in Zhu et al. (2010b) hold. Let $\hat{B}_{n}(1)$ be an eigenvector associated with the largest eigenvalue of $\hat{M}_{n}$ . Then we have, under $H_{1n}$ , $\mathbb{P}(\hat{q}=1)\to 1$ and $\|\hat{B}_{n}(1)-\kappa\beta_{0}\|=O_{p}(\sqrt{p/n})$ .

Next, we derive the norm consistency of $\hat{\gamma}_{n}$ with respect to $\gamma_{0}$ and an asymptotic decomposition of $\hat{\gamma}_{n}-\gamma_{0}$ under $H_{1n}$ . Here $\hat{\gamma}_{n}=(\hat{\beta}_{n}^{\top},\hat{\theta}_{n}^{\top})^{\top}$ and $\gamma_{0}=(\beta_{0}^{\top},\theta_{0}^{\top})^{\top}$ as mentioned before.

Proposition 5.

Suppose the regularity conditions in Appendix and (3.8) hold. If $p^{4}/n\to 0$ , then $\hat{\gamma}_{n}$ is a norm consistent estimator for $\gamma_{0}$ with $\|\hat{\gamma}_{n}-\gamma_{0}\|=O_{p}(\sqrt{p/n})$ . Moreover, if $p^{5}/n\to 0$ , we have

[TABLE]

The following theorem states the asymptotic results under various alternatives.

Theorem 3.3.

*Suppose the regularity conditions in Appendix hold. If $p^{5}/n\to 0$ ,

(1) under the global alternative $H_{1}$ , we have in probability*

[TABLE]

*where $L(u)$ is some nonzero function;

(2) under the local alternative $H_{1n}$ , we have in distribution*

[TABLE]

where $V_{\infty}(u)$ is a zero-mean Gaussian process given by (3.1) and $G_{1}(u)$ and $G_{2}(u)$ are the uniform limits of $G_{1n}(u)$ and $G_{2n}(u)$ , respectively, which are given by

[TABLE]

These results show that under the global alternative, the process diverges to infinity at the rate of order $\sqrt{n}$ , and under local alternatives that are distinct from the null at the rate of order $1/\sqrt{n}$ , the process converges to a stochastic process. Thus, the test that is based on this process can detect such alternatives.

4 Numerical studies

4.1 Test statistics in practical use

In this subsection, we use the Cramér-von Mises (CM) functional to construct the test statistic. Consider

[TABLE]

where $F_{n}$ is the empirical distribution function of $\beta_{0}^{\top}X_{i}/\|\beta_{0}\|$ , $1\leq i\leq n$ . According to Theorem 3.2 and the Extended Continuous Mapping Theorem (see Theorem 1.11.1 in Van Der Vaart and Wellner (1996)), we obtain, under the null,

[TABLE]

where $B(t)$ is a standard Brownian motion and $\sigma^{2}(u)$ is the pointwise limit of $\sigma_{n}^{2}(u)$ . Since $B(t\psi(u_{0}))/\sqrt{\psi(u_{0})}=B(t)$ in distribution, it follows that

[TABLE]

Consequently, we consider

[TABLE]

Here we use $\hat{\psi}_{n}(u_{0})=\frac{1}{n}\sum_{i=1}^{n}(Y_{i}-g(\hat{\beta}_{n}^{\top}X_{i},\hat{\theta}_{n}))^{2}I(\hat{\alpha}^{\top}\hat{B}_{n}^{\top}X_{i}\leq u_{0})$ as an estimator of $\psi(u_{0})$ . Therefore, we obtain

[TABLE]

In the homoscedastic models, $\sigma_{n}^{2}(u)$ is free of $u$ and thus we can estimate it by

[TABLE]

Now we also have $\psi_{n}(u_{0})=\sigma_{n}^{2}F_{\kappa\beta_{0}}(u_{0})$ and thus it can be estimated by $\hat{\sigma}_{n}^{2}F_{n}(u_{0})$ . Consequently, $ACM_{n}^{2}$ becomes

[TABLE]

For $u_{0}$ , as suggested by Stute and Zhu (2002), we take $99\%$ quantile of $F_{n}$ in the simulation studies.
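Critical values and $p$ -values for a statistic of this kind can be approximated by simulating the limiting functional. The sketch below assumes that the null limit of the normalized statistic is $\int_{0}^{1}B^{2}(t)dt$ with $B$ a standard Brownian motion; the display stating the limit is not reproduced here, so this form is an assumption, and the function names are hypothetical. The integral is approximated by a Riemann sum over a discretized Brownian path.

```python
import numpy as np

def simulate_cvm_brownian(n_rep=4000, n_steps=400, seed=0):
    """Monte Carlo draws of the integral of B(t)^2 over [0, 1] via discretized paths."""
    rng = np.random.default_rng(seed)
    # Independent Gaussian increments with variance 1 / n_steps.
    increments = rng.standard_normal((n_rep, n_steps)) / np.sqrt(n_steps)
    paths = np.cumsum(increments, axis=1)
    # Riemann sum of B(t)^2 on the grid t = 1/n_steps, ..., 1.
    return (paths ** 2).mean(axis=1)

def mc_p_value(statistic, draws):
    """Share of simulated draws at least as large as the observed statistic."""
    return float(np.mean(draws >= statistic))

draws = simulate_cvm_brownian()
critical_values = np.quantile(draws, [0.90, 0.95, 0.99])
```

For an observed value `stat` of the statistic, `mc_p_value(stat, draws)` gives the simulated $p$ -value; the accuracy depends on the number of replications and on the grid size.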

4.2 Numerical studies

In this subsection we conduct simulation studies to examine the performance of the proposed test. In line with the theoretical results, we set $p=[4n^{1/4}]-5$ with $n=100,200,400\ {\rm and}\ 800$ , as used in Fan and Peng (2004). As there are no existing tests designed for the case of divergent dimension, we compare with some existing tests that were developed for fixed dimension, since they remain workable in practice.
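As a quick check of the dimensions implied by $p=[4n^{1/4}]-5$ , the following snippet evaluates the rule for the four sample sizes. It reads $[\cdot]$ as the integer part, which is an assumption about the notation; the function name `predictor_dimension` is hypothetical.

```python
import math

def predictor_dimension(n):
    """Dimension rule p = [4 n^(1/4)] - 5, with [x] read as the integer part of x."""
    return math.floor(4 * n ** 0.25) - 5

dims = {n: predictor_dimension(n) for n in (100, 200, 400, 800)}
print(dims)  # {100: 7, 200: 10, 400: 12, 800: 16}
```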

Stute and Zhu’s (2002) test is given by

[TABLE]

where

[TABLE]

For $\hat{\psi}_{n}(x_{0}),\hat{\sigma}_{n}^{2},\hat{a}_{n}(z),\hat{A}_{n}^{-1}(z)$ , one can refer to their paper for details.

Bierens (1982) proposed an integrated conditional moment (ICM) test which is based on the following statistic:

[TABLE]

where $\hat{e}_{i}=Y_{i}-g(\hat{\beta}_{n}^{\top}X_{i},\hat{\theta}_{n})$ .

Escanciano’s (2006) test statistic is defined as

[TABLE]

with the critical value determination by the wild bootstrap. More details can be found in Escanciano (2006).

Zheng (1996) proposed a locally smoothing test whose statistic is given by

[TABLE]

Guo et al. (2016) proposed an adaptive-to-model test with the test statistic:

[TABLE]

Here we use the kernel function $K(u)=(15/16)(1-u^{2})^{2}I(|u|\leq 1)$ and the bandwidth $h=1.5n^{-1/(4+\hat{q})}$ as in Guo et al. (2016), and $\hat{B}_{n}$ is a sufficient dimension reduction estimate of $B$ with an estimated structural dimension $\hat{q}$ of $q$ .

The significance levels are set to be $\alpha=0.1$ , $0.05$ , and $0.01$ . The simulation results are based on the averages of $2000$ replications. In the following simulation studies, $a=0$ corresponds to the null while $a\neq 0$ to the alternatives.
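When reading the tables of empirical sizes, it helps to know how much Monte Carlo noise $2000$ replications leave. The sketch below computes the binomial standard error $\sqrt{\alpha(1-\alpha)/2000}$ of an empirical rejection rate when the true size equals the nominal level, and shows how a rejection rate is obtained from replicated $p$ -values. Both function names are hypothetical and the snippet is only an illustration, not part of the original simulation code.

```python
import math

def mc_standard_error(alpha, n_rep=2000):
    """Binomial standard error of an empirical rejection rate with true size alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_rep)

def empirical_rejection_rate(p_values, alpha):
    """Fraction of replications whose p-value does not exceed alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

standard_errors = {alpha: mc_standard_error(alpha) for alpha in (0.1, 0.05, 0.01)}
```

At the $0.05$ level the standard error is roughly $0.005$ , so empirical sizes within about $0.01$ of the nominal level are compatible with simulation noise alone.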

$Study$ 1. The data are generated from the following models:

[TABLE]

where $\beta_{0}=(1,\cdots,1)^{\top}/\sqrt{p}$ , $\beta_{1}=(\underbrace{1,\dots,1}_{p_{1}},0,\dots,0)^{\top}/\sqrt{p_{1}}$ and $\beta_{2}=(0,\dots,0,\underbrace{1,\dots,1}_{p_{1}})^{\top}/\sqrt{p_{1}}$ with $p_{1}=[p/2]$ . The predictors $\{X_{i},1\leq i\leq n\}$ are i.i.d. from $N(0,I_{p})$ and $\varepsilon$ is Gaussian white noise with variance $1$ . $H_{12}$ is a high-frequency/oscillating model and the other three are low-frequency models. In $H_{11}$ and $H_{12}$ , the structural dimension equals $1$ under both the null and the alternative, while, in $H_{13}$ and $H_{14}$ , the structural dimension is $2$ under the alternatives.
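The direction vectors, predictors and errors of Study 1 can be generated as in the sketch below. The regression functions of $H_{11}$ to $H_{14}$ are not reproduced here, so the code stops at the design; the function name `study1_design` and the seed are hypothetical choices made for illustration.

```python
import numpy as np

def study1_design(n, p, seed=0):
    """Generate the Study 1 directions, Gaussian predictors and Gaussian errors."""
    rng = np.random.default_rng(seed)
    p1 = p // 2  # p_1 = [p / 2]
    beta0 = np.ones(p) / np.sqrt(p)
    # beta1 loads on the first p1 coordinates, beta2 on the last p1 coordinates.
    beta1 = np.concatenate([np.ones(p1), np.zeros(p - p1)]) / np.sqrt(p1)
    beta2 = np.concatenate([np.zeros(p - p1), np.ones(p1)]) / np.sqrt(p1)
    X = rng.standard_normal((n, p))   # rows are i.i.d. N(0, I_p)
    eps = rng.standard_normal(n)      # Gaussian noise with variance 1
    return beta0, beta1, beta2, X, eps

beta0, beta1, beta2, X, eps = study1_design(n=100, p=7)
```

Since $2p_{1}\leq p$ , the supports of $\beta_{1}$ and $\beta_{2}$ do not overlap, so the two directions are orthogonal unit vectors.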

The simulation results are reported in Tables 1 to 4. We can see that both $ACM_{n}^{2}$ and $T_{n}^{SZ}$ maintain the significance levels very well. The empirical sizes of $PCvM_{n}$ are also very close to the significance levels, but slightly more unstable in some cases. $T^{GWZ}_{n}$ can only maintain the significance level when it is $\alpha=0.05$ . $T_{n}^{ZH}$ can maintain the significance levels occasionally, but generally, it is conservative with smaller sizes. $ICM_{n}$ is the worst among these tests in both the significance level maintenance and power performance. According to our experience, when $p$ is smaller than $5$ , $ICM_{n}$ could work well. The powers of $ACM_{n}^{2}$ , $T_{n}^{SZ}$ , $PCvM_{n}$ and $T^{GWZ}_{n}$ are all very high for models $H_{11}$ , $H_{13}$ and $H_{14}$ . But $T^{GWZ}_{n}$ ’s power grows slightly slower than the other three, while, for model $H_{12}$ , $T^{GWZ}_{n}$ beats the other competitors. These may validate again the empirical experience in this area that locally smoothing tests perform better for high frequency/oscillating models, while globally smoothing tests work better for low frequency models. Nevertheless, $T_{n}^{ZH}$ , a representative of locally smoothing tests, has very low power for model $H_{12}$ . This is because $T_{n}^{ZH}$ severely suffers from the dimensionality problem, while $T^{GWZ}_{n}$ uses a dimension reduction technique to greatly alleviate the curse of dimensionality.

[TABLE]

The null models are all linear in $Study$ 1. We then consider nonlinear hypothetical models in the next simulation study.

$Study$ 2. The data are generated from the following models

[TABLE]

where $\beta_{1}=(\underbrace{1,\dots,1}_{p_{1}},0,\dots,0)^{\top}/\sqrt{p_{1}}$ and $\beta_{2}=(0,\dots,0,\underbrace{1,\dots,1}_{p_{1}})^{\top}/\sqrt{p_{1}}$ with $p_{1}=[p/2]$ , $\varepsilon$ is $N(0,1)$ , and $X$ is $N(0,I_{p})$ independent of $\varepsilon$ .

We report the empirical sizes and powers in Tables 5 and 6. For model $H_{21}$ , the conclusions are very similar to those in $Study$ 1. For model $H_{22}$ , we can see that the empirical sizes of $ACM_{n}^{2}$ , $T_{n}^{SZ}$ and $PCvM_{n}$ are very close to the significance levels, while $T_{n}^{ZH}$ and $T^{GWZ}_{n}$ can only control the level of $\alpha=0.1$ . $ICM_{n}$ is still the worst one. The empirical powers of $T^{GWZ}_{n}$ and $ACM_{n}^{2}$ are higher than those of the other competitors, while the empirical power of $T_{n}^{SZ}$ grows very slowly in this case. This is consistent with the theoretical result that $T_{n}^{SZ}$ is not an omnibus test.

[TABLE]

Therefore, overall, the proposed test in this paper performs well and can detect different alternatives. Further, the dimension of predictors has less negative impact on its performance.

4.3 A real data example

In this subsection we analyze the baseball salary data set that can be obtained from the website http://www4.stat.ncsu.edu/~boos/var.select/baseball.html. This data set contains the 1992 salaries $Y$ of 337 Major League Baseball players and 16 performance measures from the year 1991. The performance measures are $X_{1}$ : Batting average, $X_{2}$ : On-base percentage, $X_{3}$ : runs, $X_{4}$ : hits, $X_{5}$ : doubles, $X_{6}$ : triples, $X_{7}$ : home runs, $X_{8}$ : runs batted in, $X_{9}$ : walks, $X_{10}$ : strike-outs, $X_{11}$ : stolen bases, and $X_{12}$ : errors; and $X_{13}$ : Indicator of free agency eligibility, $X_{14}$ : Indicators of free agent in 1991/2, $X_{15}$ : Indicators of arbitration eligibility, and $X_{16}$ : Indicators of arbitration in 1991/2. The dummy variables $X_{13}-X_{16}$ measure the freedom of movement of a player to another team. For easy interpretation, we standardize all variables separately. To obtain the regression relationship between $Y$ and the performance measures $X=(X_{1},\cdots,X_{16})^{\top}$ , we first test for a linear regression model by the proposed test, because the dimension $16\approx(337)^{0.476}$ and, in the simplest case of a linear model, the proposed test can theoretically handle $p=O(n^{1/2})$ . The value of the test statistic is $ACM_{n}^{2}=1.3651$ with the $p$ -value equal to $0.077$ . Since the $p$ -value is small, although it is larger than, say, $0.05$ , an often used significance level, we may consider a more plausible model to better fit this dataset. Hence we apply the dimension reduction techniques. Recall from Section 2.3 that the CSE method is used to estimate the central subspace. The estimated structural dimension of this dataset is $\hat{q}=1$ . This means that $Y$ may be conditionally independent of $X$ given the projected covariate $\hat{\beta}_{1}^{\top}X$ where

[TABLE]

is the first direction obtained by CSE. The scatter plot of $Y$ against $\hat{\beta}_{1}^{\top}X$ is presented in Figure 1(a). It indicates that a linear regression model for $(Y,X)$ is not reasonable.

[TABLE]

To further exhaust possible projected covariates, we consider the second projected covariate $\hat{\beta}_{2}^{\top}X$ obtained by CSE. The scatter plot of $Y$ against $(\hat{\beta}_{1}^{\top}X,\hat{\beta}_{2}^{\top}X)$ is presented in Figure 2.

[TABLE]

This figure shows that the second projected covariate $\hat{\beta}_{2}^{\top}X$ carries no information for predicting the response $Y$ , as the plot shows almost no variation along $\hat{\beta}_{2}^{\top}X$ . This means that the projection of the data onto the direction $\hat{\beta}_{1}$ already contains most of the regression information of $(Y,X)$ . Figure 1(a) seems to suggest a quadratic polynomial of $\hat{\beta}_{1}^{\top}X$ to fit the data. Hence we use the following regression model:

[TABLE]

Figure 1(b) adds the fitted curve on the scatter plot. The value of the test statistic $ACM_{n}^{2}=0.1038$ and the $p$ -value is about $0.83$ . Therefore the above regression model is plausible.

5 Discussions

In this paper, we investigate model checking for regressions when the dimension of predictors diverges to infinity as the sample size tends to infinity. Three features are worth discussing. First, although the empirical process is similar to that in Stute and Zhu (2002), it involves much more difficult estimation issues in the construction of the test statistic. Second, since the Khmaladze martingale transformation has become an important methodology for model checking owing to its asymptotically distribution-free property, we suggest another way to construct the transformation, rather than directly targeting the limit of the shift term as in the fixed dimension case. The transformed process still has the same limiting Gaussian process as that with fixed dimension. This provides an easy way to handle the cases with divergent dimension. Third, the model adaptation property shows its advantage in maintaining the significance level and enhancing power performance. The research also leaves some unsolved topics. An important one is how to relax the condition on the diverging rate of the dimension. In this paper, we cannot have a faster rate than $p=o(n^{1/4})$ in general, although for some special regression models such as linear models, it can achieve $p=o(n^{1/2}).$ This is mainly because of technical difficulties in estimation. Thus, to attack this problem, we need to improve the asymptotic properties of the estimators involved. This is beyond the scope of this paper and deserves further study.

6 Appendix

6.1 Regularity Conditions

In this subsection we present some regularity conditions for the theoretical results. Although these conditions may not be the weakest possible, they make technical arguments easy to understand. In the following, $C$ always stands for a constant which may be different in different cases.

First, we give some regularity conditions for the norm consistency of $(\hat{\beta}_{n},\hat{\theta}_{n})$ to $(\tilde{\beta}_{0},\tilde{\theta}_{0})$ and the decomposition of $\left(\begin{array}[]{c}\hat{\beta}_{n}-\tilde{\beta}_{0}\\ \hat{\theta}_{n}-\tilde{\theta}_{0}\\ \end{array}\right)$ .

(A1) The matrix $\Sigma_{n}$ is positive definite and satisfies the following condition

[TABLE]

where $\lambda_{min}(\Sigma_{n})$ and $\lambda_{max}(\Sigma_{n})$ are the smallest and largest eigenvalues of $\Sigma_{n}$ , respectively.

The first to third derivatives of the regression function $g(\cdot)$ satisfy the conditions:

(A2) $E|Y|^{4}\leq C$ , $E|e|^{8}\leq C$ ; $E|g^{\prime}_{j}(\tilde{\beta}_{0},\tilde{\theta}_{0},X)|^{8}\leq C$ ;

(A3) $|g(\beta^{\top}x,\theta)|\leq F(x)$ with $EF(X)^{4}\leq C$ for all $(\beta,\theta)$ ;

(A4) $|g^{\prime}_{j}(\beta,\theta,x)|\leq F_{j}(x)$ with $EF_{j}(X)^{4}\leq C$ for all $j$ and $(\beta,\theta)$ ;

(A5) $|g^{\prime\prime}_{jk}(\beta,\theta,x)|\leq F_{jk}(x)$ with $EF_{jk}(X)^{4}\leq C$ for all $j,k$ , and $(\beta,\theta)$ ;

(A6) $|g^{\prime\prime\prime}_{jkl}(\beta,\theta,x)|\leq F_{jkl}(x)$ with $EF_{jkl}(X)^{4}\leq C$ for all $j,k,l$ , and $(\beta,\theta)$ ;

where $g^{\prime}_{j}(\beta,\theta,x)$ is the $j$ -th component of $g^{\prime}(\beta,\theta,x)$ , $g^{\prime\prime}_{jk}(\beta,\theta,x)$ is the $(j,k)$ -element of $g^{\prime\prime}(\beta,\theta,x)$ , and $g^{\prime\prime\prime}_{jkl}(\beta,\theta,x)$ is the $(j,k,l)$ -element of $g^{\prime\prime\prime}(\beta,\theta,x)$ .

Condition (A1) is similar to the regularity condition on the Fisher information matrix $I_{n}$ proposed by Fan and Peng (2004), where the Fisher information matrix $I_{n}$ plays the same role in deriving the asymptotic theory as the matrix $\Sigma_{n}$ does here. Conditions (A2)-(A6) are standard for nonlinear least squares estimation, see, e.g., Jennrich (1969) and White (1981).

Next, we present some regularity conditions for the convergence of the adaptive-to-model residual-marked empirical process.

(B1) There exists a constant $C$ such that if $\|\beta-\kappa\beta_{0}\|\leq C\sqrt{p\log{n}/n}$ , then

[TABLE]

where $\vartriangle$ denotes the symmetric difference of two sets. This condition is given by Zhu (1993) who showed the existence of distributions satisfying this condition.

(B2) If $M_{n}(u)=E[g^{\prime}(\beta_{0},\theta_{0},X)I(\kappa\beta_{0}^{\top}X\leq u)]$ , $\|M_{n}(u)\|=O(1)$ uniformly in $u$ .

(B3) For any unit non-random vector $\gamma\in\mathbb{R}^{p}$ , there exist $F_{Y}$ -integrable functions $h_{i}(t)$ such that

[TABLE]

where $G(X)$ is given by (3.8) and $f_{Y|X}$ is the conditional density of $Y$ given $X$ .

6.2 Lemmas

In this subsection we present some lemmas that are needed to prove the propositions and theorems. Since we consider the empirical process with diverging dimension, no relevant results are available in the literature. Thus, in the following lemmas, we give results on the convergence rate of the empirical processes involved, which differ from the classical fixed-dimension results in the literature.

Lemma 1.

Suppose $A_{n}(u)$ is nonsingular for all $u$ , then we have

[TABLE]

that is, (3.5) and thus (3.6) hold.

Proof. Assume $t\leq s$ . By the definition of $T_{n}V_{n}^{0}$ and the Fubini Theorem, the left-hand side of (3.5) equals

[TABLE]

It is easy to see that the sum of the last three terms is equal to zero. Thus we complete the proof. $\Box$

Next we consider the convergence rate of the empirical processes involved when the dimension diverges. Let $F(x)$ be a fixed function and $\mathcal{F}_{n}$ be a VC-class of functions with a VC-index $V(\mathcal{F}_{n})$ which may depend on $n$ . Let $N_{i}(\epsilon,\mathcal{F}_{n},L_{i}(Q))$ be the covering number of $\mathcal{F}_{n}$ with respect to the seminorm $L_{i}(Q)$ . See, e.g., Pollard (1984) for details. Suppose $\sup_{\mathcal{F}_{n}}|f(x)|\leq 1$ for any $n$ and $x\in\mathbb{R}^{p}$ and

[TABLE]

Set $\tilde{\mathcal{F}}_{n}=\{F(x)f(x):f(x)\in\mathcal{F}_{n}\}$ . By some elementary calculations, we have

[TABLE]

whence

[TABLE]

Lemma 2.

Let $\tau_{n}$ and $\epsilon_{n}$ be positive sequences. If $E|F|^{4}<\infty$ and $Var(P_{n}Ff)/(4\epsilon_{n})^{2}\leq 1/2$ for $n$ large enough, then

[TABLE]

where $A_{n}$ and $W_{n}$ are constants which may depend on $n$ .

Proof. The proof is similar to Theorem 37 in Chapter 2 of Pollard (1984) and Theorem 3.1 in Zhu (1993). Since $Var(P_{n}Ff)/(4\epsilon_{n})^{2}\leq 1/2$ for $n$ large enough, by the formula (30) in Chapter 2 of Pollard (1984), we have

[TABLE]

conditionally on $\{X_{1},\cdots,X_{n}\}$ . Using the same argument as that for proving the inequality (31) in Chapter 2 of Pollard (1984), it follows that

[TABLE]

Taking expectation, we obtain that

[TABLE]

Consequently,

[TABLE]

Therefore, we complete the proof. $\Box$

Lemma 3.

If $|g^{\prime}_{j}(\beta_{0},\theta_{0},x)|\leq F(x)$ and $E[F(X)]^{4}<\infty$ , then we have

[TABLE]

where $M_{n}(u)=E[g^{\prime}(\beta_{0},\theta_{0},X)I(\kappa\beta_{0}^{\top}X\leq u)]$ .

Proof. Fix $\epsilon>0$ and let $\epsilon_{n}^{2}=\epsilon^{2}(\log{n})^{2}\sqrt{p}/n$ . We have

[TABLE]

For every term in the last sum, we use Lemma 2. Let

[TABLE]

It is easy to see that $\mathcal{F}_{1n}$ is a VC-class with VC-index $V(\mathcal{F}_{1n})=2$ . By Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we obtain that

[TABLE]

where $K$ is a universal constant. Set $A=2K(16e)^{2}$ and $\tau_{n}^{2}=\sqrt{p\log{n}}$ . Lemma 2 leads to

[TABLE]

whence

[TABLE]

Therefore, we obtain the result. $\Box$
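To make the object in Lemma 3 concrete, the following minimal Python sketch computes the sample analogue of $M_{n}(u)$, that is, the average of $g^{\prime}(\beta_{0},\theta_{0},X_{i})I(\kappa\beta_{0}^{\top}X_{i}\leq u)$ over the sample. It assumes, purely for illustration, a linear model so that the gradient equals $X_{i}$; the function name `empirical_M` and the toy data are hypothetical and do not come from the paper.

```python
def empirical_M(grad_rows, index_values, u):
    """Sample analogue of M_n(u): (1/n) * sum_i grad_i * I(index_i <= u).

    grad_rows[i] is the gradient vector for observation i (assumed already evaluated),
    index_values[i] is the single-index value for observation i.
    """
    n = len(index_values)
    p = len(grad_rows[0])
    total = [0.0] * p
    for grad, idx in zip(grad_rows, index_values):
        if idx <= u:  # indicator I(index_i <= u)
            for j in range(p):
                total[j] += grad[j]
    return [t / n for t in total]


# Toy data (illustrative assumption): linear model, so the gradient is X itself.
X = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [2.0, 2.0]]
beta = [1.0, 1.0]
index_values = [sum(b * x for b, x in zip(beta, row)) for row in X]  # 1, 2, 0, 4
M_hat = empirical_M(X, index_values, u=1.5)
```

Only the first and third observations have index values below 1.5, so `M_hat` averages those two rows over the full sample size. Lemma 3 concerns how fast such averages approach $M_{n}(u)$ uniformly in $u$ when $p$ grows with $n$.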

Lemma 4.

Let $\mathcal{F}$ be a permissible class of functions with $|f|\leq 1$ and $P|f|\leq\delta$ for all $f\in\mathcal{F}$ . Then

[TABLE]

For the definition of a “permissible class of functions”, see Chapter 2 of Pollard (1984).

Proof. This lemma is a slightly modified version of Lemma 33 in Chapter 2 of Pollard (1984), as we need the result with diverging $p$ . The proof is very similar and is therefore omitted. $\Box$

Lemma 5.

Let $\delta_{n}$ and $\alpha_{n}$ be positive real valued sequences. Suppose $P|f|\leq\delta_{n}$ for all $f(x)\in\mathcal{F}_{n}$ and $Var(P_{n}Ff)/(4\epsilon_{n})^{2}\leq 1/2$ for $n$ large enough. If $E|F|^{8}<\infty$ , then

[TABLE]

where $A_{n}$ and $W_{n}$ are constants which may depend on $n$ .

Proof. The proof is similar to Theorem 37 in Chapter 2 of Pollard (1984) and Theorem 3.1 of Zhu (1993). Since $Var(P_{n}Ff)/(4\epsilon_{n})^{2}\leq 1/2$ , similar to the proof for Lemma 2, we have

[TABLE]

Conditionally on $\{X_{1},\cdots,X_{n}\}$ , we obtain

[TABLE]

Take expectation to obtain

[TABLE]

The last inequality is due to Lemma 4. Altogether we complete the proof. $\Box$

Lemma 6.

Suppose $H_{0}$ and condition (B1) hold. If $E\varepsilon^{8}<\infty$ and $p^{4}/n\to 0$ , then we have

[TABLE]

Proof. Fix $\epsilon>0$ and set $H_{C}=\{\beta:\beta\in\mathbb{R}^{p},\|\beta\|\leq 1,\|\beta-\kappa\beta_{0}\|\leq C\sqrt{p\log{n}/n}\}$ . Since $\|\hat{B}_{n}-\kappa\beta_{0}\|=O_{p}(\sqrt{p/n})$ , by condition (B1), it suffices to prove

[TABLE]

Let

[TABLE]

Then it is easy to see that

[TABLE]

Since $\mathcal{F}_{2n}^{1}$ and $\mathcal{F}_{2n}^{2}$ are both VC-classes with the VC-index $p+2$ and $2$ respectively, by Theorem 2.6.7 in Van Der Vaart and Wellner (1996), we obtain that

[TABLE]

whence

[TABLE]

Let $\tilde{\mathcal{F}}_{2n}=\{\varepsilon f_{\beta,u}(x):f_{\beta,u}\in\mathcal{F}_{2n}\}$ . It follows that

[TABLE]

Let $\delta_{n}=\sqrt{2p\log{n}/n}$ , $\alpha_{n}^{8}=\log{n}$ and $\epsilon_{n}=\epsilon/\sqrt{n}$ . Since

[TABLE]

by Lemma 5, we have

[TABLE]

Since $p^{4}/n\to 0$ , it follows that $\mathbb{P}\left\{\sup_{\mathcal{F}_{2n}}|P_{n}f_{\beta,u}|>\frac{8}{\sqrt{n}}\epsilon\right\}\to 0$ which completes our proof. $\Box$

Next, we consider the convergence rate of the following process

[TABLE]

Lemma 7.

Let $\tilde{M}_{n}(\beta,u)=E\{g^{\prime}(\beta_{0},\theta_{0},X)[I(\beta^{\top}X\leq u)-I(\kappa\beta_{0}^{\top}X\leq u)]\}$ . Suppose conditions (A2) and (B1) hold. If $p^{4}/n\to 0$ , then

[TABLE]

Proof. Fix $\epsilon>0$ and set $\epsilon_{n}=\epsilon\sqrt{p^{1/2}\log{n}/n}$ , $\delta_{n}=\sqrt{2p\log{n}/n}$ , and $\alpha_{n}^{8}=p\log{n}$ . Similar to the proof for Lemma 6, it suffices to prove

[TABLE]

By the same argument as in the proof of Lemma 3, we obtain

[TABLE]

For every term in the last sum, we use Lemma 5 to derive the result. Let

[TABLE]

Then we have

[TABLE]

where $K$ is a universal constant free of $n$ .

Recall $\epsilon_{n}=\epsilon\sqrt{p^{1/2}\log{n}/n}$ and $\delta_{n}=\sqrt{2p\log{n}/n}$ . By conditions (A2) and (B1), we have

[TABLE]

By Lemma 5, we obtain that

[TABLE]

whence

[TABLE]

Since $p^{4}/n\to 0$ , it follows that the right-hand side of the above inequality tends to zero. Hence we complete the proof. $\Box$

In the next lemma, we give the convergence rate of the kernel regression function estimator $\hat{r}_{n}(y)$ . Let $(X_{1},Y_{1}),\cdots,(X_{n},Y_{n})$ be a sample from $(X,Y)$ , $f(y)$ be the density function of $Y$ with a support $\mathcal{C}$ and $m_{n}(y)=r_{n}(y)f(y)=\{E(X_{11}|Y=y)f(y),\cdots,E(X_{1p}|Y=y)f(y)\}^{\top}$ . Suppose that

[TABLE]

It follows that $m_{n}(y)=O_{p}(\sqrt{p})$ uniformly in $y$ . Set

[TABLE]

Then $\hat{r}_{n}(y)=\hat{m}_{n}(y)/\hat{f}(y)$ . Here $K(\cdot)$ is the kernel function and $h$ is a bandwidth.
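The ratio form $\hat{r}_{n}(y)=\hat{m}_{n}(y)/\hat{f}(y)$ can be sketched in a few lines of Python. The sketch below assumes the usual kernel forms $\hat{m}_{n}(y)=\frac{1}{nh}\sum_{i}X_{i}K\{(y-Y_{i})/h\}$ and $\hat{f}(y)=\frac{1}{nh}\sum_{i}K\{(y-Y_{i})/h\}$, and a Gaussian kernel; both are assumptions made for illustration, since the paper's exact definitions and kernel choice are not reproduced here. The function names are hypothetical.

```python
import math


def gaussian_kernel(t):
    """Standard normal density, used here as an illustrative kernel K."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kernel_regression(X, Y, y, h, kernel=gaussian_kernel):
    """Return (r_hat, m_hat, f_hat) at the point y.

    X is a list of n predictor vectors of length p, Y a list of n responses,
    h a positive bandwidth. r_hat estimates E(X | Y = y) componentwise.
    """
    n = len(Y)
    p = len(X[0])
    weights = [kernel((y - Yi) / h) for Yi in Y]
    f_hat = sum(weights) / (n * h)  # kernel density estimate of f(y)
    m_hat = [
        sum(w * row[j] for w, row in zip(weights, X)) / (n * h)
        for j in range(p)
    ]  # estimate of E(X_j | Y = y) f(y)
    r_hat = [m / f_hat for m in m_hat]
    return r_hat, m_hat, f_hat


# Sanity check on toy data: if every X_i is the same vector, r_hat must return it.
X = [[2.0, -1.0], [2.0, -1.0], [2.0, -1.0]]
Y = [0.0, 0.5, 1.0]
r_hat, m_hat, f_hat = kernel_regression(X, Y, y=0.4, h=0.5)
```

Each of the $p$ components of $\hat{r}_{n}(y)$ is a separate one-dimensional kernel regression on $Y$, which is why the rate in Lemma 8 picks up a factor depending on $p$.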

Lemma 8.

Suppose the above conditions hold. If $ph^{3}\log{n}\to 0$ , then we have

[TABLE]

Proof. Let $r_{ni}(y),\hat{r}_{ni}(y),m_{ni}(y)$ , and $\hat{m}_{ni}(y)$ be the $i$ -th component of $r_{n}(y),\hat{r}_{n}(y),m_{n}(y)$ , and $\hat{m}_{n}(y)$ respectively. For fixed $\epsilon$ , set $\epsilon_{n}^{2}=\frac{\log{n}}{n}\epsilon^{2}$ , $\delta_{n}=h$ , and $\alpha_{n}^{8}=p\log{n}$ . Then

[TABLE]

Define

[TABLE]

Without loss of generality, assume $|K(x)|\leq 1$ and $f(y)\leq 1$ . By the arguments in Example 38 of Chapter 2 of Pollard (1984), we obtain that

[TABLE]

where $A$ and $W$ are free of $n$ . Let $\tilde{\mathcal{F}}_{4n}=\{zf_{y,h}(u):f_{y,h}(u)\in\mathcal{F}_{4n}\}$ . Then

[TABLE]

Since

[TABLE]

and

[TABLE]

for $n$ large enough, Lemma 5 yields that

[TABLE]

whence

[TABLE]

Since $ph^{3}\log{n}\to 0$ , it is easy to see that the right-hand side of the inequality tends to zero. Thus $\sup_{y}\|\hat{m}_{n}(y)-E\hat{m}_{n}(y)\|=o_{p}(\sqrt{p\log{n}/(nh^{2})})$ . By the arguments for proving Lemma 3.3 of Zhu and Fang (1996), we obtain that

[TABLE]

Consequently,

[TABLE]

Thus we obtain the first result. For the second, note that

[TABLE]

and

[TABLE]

Combining these with the uniform boundedness of $f(y)$ , the proof is concluded. $\Box$

6.3 Proofs of the Propositions and Theorems

For simplicity of notation, we consider a parametric family of functions $\mathcal{G}=\{g(\beta,\cdot):\beta\in\Theta\subset\mathbb{R}^{p}\}$ . Let $\hat{\beta}_{n}=\mathop{\rm argmin}\limits_{\beta}\sum_{i=1}^{n}[Y_{i}-g(\beta,X_{i})]^{2}$ and $\tilde{\beta}_{0}=\mathop{\rm argmin}\limits_{\beta}E[Y-g(\beta,X)]^{2}.$
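As a concrete illustration of the least squares estimator $\hat{\beta}_{n}$ and of the estimating function $\sum_{i}[Y_{i}-g(\beta,X_{i})]g^{\prime}(\beta,X_{i})$ used in the proofs below, here is a minimal Python sketch for a scalar parameter. The model $g(\beta,x)=\exp(\beta x)$, the Gauss-Newton iteration, and all function names are assumptions chosen for illustration; the paper does not prescribe a particular model or optimization algorithm here.

```python
import math


def g(beta, x):
    """Illustrative regression function g(beta, x) = exp(beta * x)."""
    return math.exp(beta * x)


def g_prime(beta, x):
    """Derivative of g with respect to beta."""
    return x * math.exp(beta * x)


def score(beta, xs, ys):
    """Estimating function: sum_i [Y_i - g(beta, X_i)] * g'(beta, X_i)."""
    return sum((y - g(beta, x)) * g_prime(beta, x) for x, y in zip(xs, ys))


def gauss_newton(xs, ys, beta_init=0.0, n_iter=50):
    """Scalar Gauss-Newton iteration for the least squares problem."""
    beta = beta_init
    for _ in range(n_iter):
        denom = sum(g_prime(beta, x) ** 2 for x in xs)
        beta += score(beta, xs, ys) / denom
    return beta


# Noise-free toy data generated with beta = 0.5, so the minimizer is known.
xs = [i / 10.0 for i in range(-10, 11)]
ys = [g(0.5, x) for x in xs]
beta_hat = gauss_newton(xs, ys)
```

At the minimizer the estimating function vanishes, which is the property the proofs of Propositions 1 and 2 exploit through a Taylor expansion around $\tilde{\beta}_{0}$.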

Proof of Proposition 1. Let $\beta=\tilde{\beta}_{0}+\alpha$ and $F(\alpha)=\sum_{i=1}^{n}[Y_{i}-g(\tilde{\beta}_{0}+\alpha,X_{i})]g^{\prime}(\tilde{\beta}_{0}+\alpha,X_{i})$ . Then it suffices to show that there is a root $\alpha_{n}$ of $F(\alpha)$ such that $\|\alpha_{n}\|^{2}=O_{p}(p/n)$ . By the result (6.3.4) of Ortega and Rheinboldt (1970), it in turn suffices to show that $\alpha^{\top}F(\alpha)<0$ for $\|\alpha\|^{2}=Cp/n$ , where $C$ is a sufficiently large constant.

Let $\alpha=\sqrt{p/n}U$ with $\|U\|=C$ , and $e_{i}=Y_{i}-g(\tilde{\beta}_{0},X_{i})$ . Using Taylor’s expansion we obtain

[TABLE]

where $\beta_{1n},\beta_{2n},\beta_{3n}$ lie between $\tilde{\beta}_{0}$ and $\tilde{\beta}_{0}+\alpha$ and

[TABLE]

Then we have $|A_{1}|\leq\sqrt{p/n}\|U\|\|\sum_{i=1}^{n}g^{\prime}(\tilde{\beta}_{0},X_{i})e_{i}\|$ . Since $E[g^{\prime}(\tilde{\beta}_{0},X_{i})e_{i}]=0$ , it follows that

[TABLE]

Thus $A_{1}=p\|U\|O_{p}(1)$ . Recall that $\Sigma_{n1}=E[g^{\prime}(\tilde{\beta}_{0},X)g^{\prime}(\tilde{\beta}_{0},X)^{\top}],\Sigma_{n2}=E[eg^{\prime\prime}(\tilde{\beta}_{0},X)]$ , and $\Sigma_{n}=\Sigma_{n1}-\Sigma_{n2}$ . Then we decompose the term $A_{2}$ as follows

[TABLE]

By condition (A2), we obtain that

[TABLE]

It follows that $\frac{1}{n}\sum_{i=1}^{n}[g^{\prime}(\tilde{\beta}_{0},X_{i})g^{\prime}(\tilde{\beta}_{0},X_{i})^{\top}-\Sigma_{n1}]=\frac{p}{\sqrt{n}}O_{p}(1)$ . By the same argument, we have

[TABLE]

Therefore $A_{2}=pU^{\top}\Sigma_{n}U+\frac{p^{2}}{\sqrt{n}}\|U\|^{2}O_{p}(1)=pU^{\top}\Sigma_{n}U+p\|U\|^{2}o_{p}(1)$ . For the first term of $A_{3}$ , by the triangle inequality and condition (A6), we have

[TABLE]

For the second term of $A_{3}$ , we have

[TABLE]

By the same argument for the third and fourth terms of $A_{3}$ , we obtain that $A_{3}=\frac{p^{3}}{\sqrt{n}}\|U\|^{3}O_{p}(1)+\frac{p^{4}}{n}\|U\|^{4}O_{p}(1)$ . Therefore

[TABLE]

If $\|U\|=C$ is large enough, then for any $\epsilon>0$ , we have

[TABLE]

Thus our result follows from (6.3.4) of Ortega and Rheinboldt (1970). $\Box$

If $g(\beta,X)=\beta^{\top}X$ follows a linear regression model, then $g^{\prime\prime}(\beta,x)=0$ and $g^{\prime\prime\prime}(\beta,x)=0$ . According to the proof of Proposition 1, we can obtain the norm consistency of $\hat{\beta}_{n}$ under the weaker condition $p^{2}/n\to 0$ .
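The difference between the two rate conditions can be read off from the orders of $A_{2}$ and $A_{3}$ obtained in the proof of Proposition 1. The following display is a sketch of that arithmetic, with $\|U\|=C$ held fixed and every term compared with the leading term of order $p$; it assumes that $A_{3}$ collects exactly the terms involving $g^{\prime\prime}$ and $g^{\prime\prime\prime}$, so that it vanishes for a linear model.

```latex
\begin{align*}
\text{general } g:\quad
  & \frac{A_{3}}{p}
    = \frac{p^{2}}{\sqrt{n}}\|U\|^{3}O_{p}(1) + \frac{p^{3}}{n}\|U\|^{4}O_{p}(1)
    = o_{p}(1)
    \quad\text{when } \frac{p^{4}}{n}\to 0, \\
\text{linear } g:\quad
  & A_{3}=0, \qquad
    \frac{A_{2}-pU^{\top}\Sigma_{n}U}{p}
    = \frac{p}{\sqrt{n}}\|U\|^{2}O_{p}(1)
    = o_{p}(1)
    \quad\text{when } \frac{p^{2}}{n}\to 0 .
\end{align*}
```

In the general case the cubic Taylor terms dominate the remainder and force $p=o(n^{1/4})$, whereas in the linear case only the fluctuation of the sample second-moment matrix remains, which allows $p=o(n^{1/2})$.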

Proof of Proposition 2. We use the same notation as in the proof of Proposition 1. Let $\Psi_{n}(\beta)=\sum_{i=1}^{n}[Y_{i}-g(\beta,X_{i})]g^{\prime}(\beta,X_{i})$ . Then $\Psi_{n}(\hat{\beta}_{n})=0$ . Applying Taylor’s expansion around $\tilde{\beta}_{0}$ , we obtain

[TABLE]

where $\beta_{4n}$ lies between $\hat{\beta}_{n}$ and $\tilde{\beta}_{0}$ . Therefore

[TABLE]

Note that

[TABLE]

Following the same arguments as in the proof of Proposition 1, we obtain that $\Sigma_{n}+\frac{1}{n}\Psi^{{}^{\prime}}_{n}(\tilde{\beta}_{0})=\frac{p}{\sqrt{n}}O_{p}(1)$ and $\Psi^{{}^{\prime\prime}}_{n}(\beta_{4n})=n\sqrt{p^{3}}O_{p}(1)$ . Since $\|\hat{\beta}_{n}-\tilde{\beta}_{0}\|=O_{p}(\sqrt{p/n})$ , it follows that

[TABLE]

Because $\Sigma_{n}^{-1}O_{p}(1)=O_{p}(1)$ , the result follows. Indeed, $\|\Sigma_{n}^{-1}O_{p}(1)\|^{2}=O_{p}(1)^{\top}\Sigma_{n}^{-2}O_{p}(1)\leq\lambda_{\rm max}(\Sigma_{n}^{-2})\|O_{p}(1)\|^{2}=O_{p}(1).$ $\Box$

If $g(\beta,X)=\beta^{\top}X$ , it is easy to see that $\Psi^{{}^{\prime\prime}}_{n}(\beta_{4n})=0$ . Consequently,

[TABLE]

Therefore only the convergence rate $p^{3}/n\to 0$ is needed to obtain the result in Proposition 2.
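The rate $p^{3}/n\to 0$ in the linear case can be traced to a single product of two orders already established in the proof of Proposition 2. The display below is a sketch of that arithmetic; it assumes the remainder is normalized by $\sqrt{n}$ and that the order $\frac{p}{\sqrt{n}}O_{p}(1)$ of $\Sigma_{n}+\frac{1}{n}\Psi^{\prime}_{n}(\tilde{\beta}_{0})$ holds in a norm compatible with the Euclidean norm of $\hat{\beta}_{n}-\tilde{\beta}_{0}$.

```latex
\[
\sqrt{n}\,\Big\|\Big(\Sigma_{n}+\tfrac{1}{n}\Psi_{n}^{\prime}(\tilde{\beta}_{0})\Big)
  (\hat{\beta}_{n}-\tilde{\beta}_{0})\Big\|
\;\leq\; \sqrt{n}\cdot\frac{p}{\sqrt{n}}\,O_{p}(1)\cdot O_{p}\Big(\sqrt{\tfrac{p}{n}}\Big)
\;=\; \sqrt{\frac{p^{3}}{n}}\;O_{p}(1),
\]
```

which is $o_{p}(1)$ when $p^{3}/n\to 0$. Multiplying by $\Sigma_{n}^{-1}$ does not change the order because the eigenvalues of $\Sigma_{n}$ are bounded away from zero under condition (A1).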

Proof of Proposition 3. (1) Suppose that $M_{n}\beta_{i}=\lambda_{i}\beta_{i}$ and $\hat{M}_{n}\hat{\beta}_{i}=\hat{\lambda}_{i}\hat{\beta}_{i}$ for $1\leq i\leq p$ . Similar to the arguments of Theorem 2.2 in Zhu and Fang (1996), we have

[TABLE]

By Theorem 3 in Zhu et al. (2010b), we obtain that $\sqrt{n}\beta_{i}^{\top}(\hat{M}_{n}-M_{n})\beta_{i}$ is asymptotically normal. Thus $\hat{\lambda}_{i}-\lambda_{i}=O_{p}(1/\sqrt{n})$ . Following the arguments of Lemma 1 in Tan et al. (2017), we obtain $\mathbb{P}(\hat{q}=1)\to 1$ . Again, by Theorem 2.2 in Zhu and Fang (1996), we obtain

[TABLE]

Note that $\hat{B}_{n}(1)=\hat{\beta}_{1}$ and $\beta_{1}=\kappa\beta_{0}$ under $H_{0}$ . Then we have

[TABLE]

Since $\sqrt{n}\beta_{1}^{\top}(\hat{M}_{n}-M_{n})\beta_{1}=O_{p}(1)$ and $\|\sum_{i=2}^{p}\frac{\beta_{i}}{\lambda_{1}-\lambda_{i}}\|^{2}=O(p)$ , it follows that $\|\sqrt{n}(\hat{B}_{n}(q)-\kappa\beta_{0})\|=O_{p}(\sqrt{p})$ .

(2) Note that $q$ is free of $n$ under $H_{1}$ . The proof is concluded from the argument for proving (1). $\Box$

Proof of Theorem 3.1. Under the null hypothesis, we have $\mathbb{P}(\hat{q}=1)\rightarrow 1$ . Thus we need only work on the event $\{\hat{q}=1\}$ . It follows that $\hat{\alpha}=1$ and we can rewrite $V_{n}(\hat{\alpha},u)$ as

[TABLE]

Let $\gamma=(\beta^{\top},\theta^{\top})^{\top}$ . Then we obtain that

[TABLE]

where $(\beta_{1n},\theta_{1n})$ lies between $(\hat{\beta}_{n},\hat{\theta}_{n})$ and $(\beta_{0},\theta_{0})$ . For the third term $V_{n13}$ in $V_{n1}$ , note that

[TABLE]

Therefore $V_{n13}=\frac{1}{\sqrt{n}}\frac{p}{n}n(p+d)O_{p}(1)=o_{p}(1)$ uniformly in $u$ . For $V_{n12}$ , recall that $M_{n}(u)=E[g^{\prime}(\beta_{0},\theta_{0},X)I(\kappa\beta_{0}^{\top}X\leq u)]$ . Then we decompose $V_{n12}$ as follows

[TABLE]

For the second term in $V_{n12}$ , by Lemma 3, we have

[TABLE]

Conclude that

[TABLE]

Since $\|M_{n}(u)\|=O(1)$ uniformly in $u$ , by Proposition 2, we have

[TABLE]

Therefore, we obtain that

[TABLE]

Now we consider the term $V_{n2}$ . It can be decomposed as follows

[TABLE]

By Lemma 6, we obtain that $V_{n21}=o_{p}(1)$ uniformly in $u$ . For the second term $V_{n22}$ , let

[TABLE]

By Lemma 7, we have

[TABLE]

Therefore, we derive that

[TABLE]

Let $M_{n}(\beta,u)=E[g^{\prime}(\beta_{0},\theta_{0},X)I(\beta^{\top}X_{i}\leq u)]$ . By condition (B1), it is easy to see that

[TABLE]

Consequently,

[TABLE]

It follows that $V_{n22}=o_{p}(1)$ uniformly in $u$ .

Similar to the term $V_{n13}$ , we obtain that $V_{n23}=o_{p}(1)$ uniformly in $u$ . Combining these with (6.1), we obtain that

[TABLE]

It is easy to see that the first and second terms of the right-hand side of (6.2) are asymptotically tight.

Now we consider the convergence of finite-dimensional distributions. Let $Y_{ni}=(Y_{ni}(u_{1}),\cdots,Y_{ni}(u_{m}))^{\top}$ where

[TABLE]

For any $\delta>0$ , we have

[TABLE]

Since

[TABLE]

and

[TABLE]

it follows that $\mathbb{P}(\|Y_{n1}\|>\delta)=O(p/n)$ . For $E\|Y_{n1}\|^{4}$ , it is easy to see that

[TABLE]

Since

[TABLE]

it follows that $EY_{n1}(u)^{4}=O(p^{2}/n^{2})$ . Hence $\sum_{i=1}^{n}E\|Y_{ni}\|^{2}I(\|Y_{ni}\|>\delta)=O(\sqrt{p^{3}/n})=o(1).$

For the covariance matrix $\sum_{i=1}^{n}Cov(Y_{ni})$ , we only need to consider $\sum_{i=1}^{n}Cov\{Y_{ni}(s),Y_{ni}(t)\}$ . It is easy to see that

[TABLE]

Thus $\sum_{i=1}^{n}Cov\{Y_{ni}(s),Y_{ni}(t)\}=K_{n}(s,t)$ . Since $K_{n}(s,t)\to K(s,t)$ , it follows that $Y_{ni}$ satisfies the conditions of the Lindeberg-Feller central limit theorem. Hence convergence of the finite-dimensional distributions holds. Altogether we have

[TABLE]

where $V_{\infty}^{1}(u)$ is a zero mean Gaussian process with covariance function $K(s,t)$ . Hence we complete the proof. $\Box$
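For readers who want to see the kind of process whose weak limit is derived above, the following Python sketch evaluates a residual-marked empirical process at its jump points. It assumes the standard form $n^{-1/2}\sum_{i}e_{i}I(\text{index}_{i}\leq u)$ in the spirit of Stute and Zhu (2002), with residuals and index values supplied by the user and no ties among the index values. The function name and the toy numbers are hypothetical, and the supremum computed at the end is only an illustrative functional, not the martingale-transformed statistic of the paper.

```python
import math


def marked_empirical_process(residuals, index_values):
    """Evaluate V_n(u) = n^{-1/2} * sum_i e_i * I(index_i <= u) at each jump point.

    Returns the sorted index values and the process value at each of them.
    Assumes the index values are distinct.
    """
    n = len(residuals)
    order = sorted(range(n), key=lambda i: index_values[i])
    points, values, running = [], [], 0.0
    for i in order:
        running += residuals[i]  # cumulative sum of residuals in index order
        points.append(index_values[i])
        values.append(running / math.sqrt(n))
    return points, values


# Toy residuals and projected predictors (illustrative numbers only).
residuals = [1.0, -1.0, 2.0]
index_values = [0.1, 0.5, 0.3]
points, values = marked_empirical_process(residuals, index_values)
sup_stat = max(abs(v) for v in values)
```

Because the process is piecewise constant with jumps only at the observed index values, evaluating it at those points is enough to compute any supremum-type functional.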

Proof of Theorem 3.2. Similar to the proof for Theorem 3.1, we only need to work on the event $\{\hat{q}=1\}$ . Let

[TABLE]

On the event $\{\hat{q}=1\}$ , we have $\mathcal{S}_{\hat{q}}^{+}=\{1\}$ and then $\hat{\alpha}=1$ . Consequently, $V_{n}^{1}(\hat{\alpha},u)$ can be rewritten as

[TABLE]

Next we divide the whole proof of Theorem 3.2 into three parts.

(I) First, we prove that $\hat{T}_{n}V_{n}(\hat{\alpha},u)-\hat{T}_{n}V_{n}^{1}(\hat{\alpha},u)=o_{p}(1)$ uniformly in $u$ . Recall that

[TABLE]

Since

[TABLE]

by the same arguments in the proof of Theorem 3.1, we obtain that

[TABLE]

uniformly in $u$ . The two integrals in $\hat{T}_{n}V_{n}(\hat{\alpha},u)$ and $\hat{T}_{n}V_{n}^{1}(\hat{\alpha},u)$ differ by

[TABLE]

It equals

[TABLE]

where $(\hat{\beta}_{1n},\hat{\theta}_{1})$ and $(\hat{\beta}_{2n},\hat{\theta}_{2})$ both lie between $(\hat{\beta}_{n},\hat{\theta}_{n})$ and $(\beta_{0},\theta_{0})$ . Recall that

[TABLE]

Then the two integrals differ by

[TABLE]

Since $\int_{-\infty}^{u}a_{n}(z)\sigma_{n}^{2}(z)F_{\kappa\beta_{0}}(dz)=M_{n}(u)$ , it follows that $\hat{T}_{n}V_{n}(\hat{\alpha},u)-\hat{T}_{n}V_{n}^{1}(\hat{\alpha},u)=o_{p}(1)$ uniformly in $u$ .

(II) Second, we prove that $T_{n}V_{n}^{1}(\hat{\alpha},u)-\hat{T}_{n}V_{n}^{1}(\hat{\alpha},u)=o_{p}(1)$ uniformly in $u$ . Indeed,

[TABLE]

Putting

[TABLE]

it follows that

[TABLE]

By the uniform boundedness of $\sigma_{n}^{2}(z)$ , the sequence $\{h_{n}(z)\}$ is asymptotically tight. According to Lemma 3.4 in Stute, Thies, and Zhu (1998) and the arguments thereafter, we obtain that $T_{n3}=o_{p}(1)$ uniformly in $u\in[-\infty,u_{0}]$ . For $T_{n1}-T_{n2}$ , since both $a_{n}(z)$ and $A_{n}(z)$ depend on $(\beta_{0},\theta_{0})$ , we rewrite $a_{n}(z)$ and $A_{n}(z)$ as $a_{n}(\beta_{0},\theta_{0},z)$ and $A_{n}(\beta_{0},\theta_{0},z)$ respectively and define

[TABLE]

By the boundedness of $\sigma_{n}(u)$ and Condition (B1), we obtain that $l_{n}(\hat{\beta}_{n},\hat{\theta}_{n},u)-l_{n}(\beta_{0},\theta_{0},u)=o_{p}(1)$ . By Lemma 8, we show that

[TABLE]

Combining this with the uniform boundedness of $\hat{\sigma}_{n}^{2}$ , we obtain that $T_{n1}-T_{n2}$ tends to zero in probability.

(III) Finally, we prove that $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)=o_{p}(1)$ uniformly in $u$ .

[TABLE]

Since

[TABLE]

by the same argument in Theorem 3.1, we obtain that $V_{n}^{1}(\hat{\alpha},u)-V_{n}^{0}(u)=o_{p}(1)$ uniformly in $u$ . For the integrals in $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)$ , note that the two integrals differ by

[TABLE]

Since $\|\hat{B}_{n}-\kappa\beta_{0}\|=O_{p}(\sqrt{p/n})$ , similar to the arguments in Lemma 6, the difference between the two integrals in $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)$ tends to zero. Hence $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)=o_{p}(1)$ uniformly in $u$ . Altogether we conclude that

[TABLE]

in distribution. $\Box$

Proof of Proposition 4. Let $Y=g(\beta_{0}^{\top}X,\theta_{0})+\varepsilon$ , $\alpha_{t}=E[XI(Y\leq t)]$ , $\tilde{\alpha}_{t}=E[XI(Y_{n}\leq t)]$ , $M_{n}=\int\alpha_{t}\alpha_{t}^{\top}F_{Y}(dt)$ , and $\tilde{M}_{n}=\int\tilde{\alpha}_{t}\tilde{\alpha}_{t}^{\top}F_{Y_{n}}(dt)$ . Then ${\rm span}(M_{n})\subseteq\mathcal{S}_{Y|X}$ and ${\rm span}(\tilde{M}_{n})\subseteq\mathcal{S}_{Y_{n}|X}$ . If we show that $\sqrt{n}\gamma^{\top}(\hat{M}_{n}-M_{n})\gamma$ is asymptotically normal for any unit vector $\gamma$ , the result of this proposition follows from the same arguments as in the proof of Proposition 3.

We now prove the above asymptotic normality. Under $H_{1n}$ , we have

[TABLE]

where $F_{Y|X}$ is the conditional distribution of $Y$ given $X$ . By Taylor’s expansion, we derive

[TABLE]

Here $\xi_{t}(X)$ lies between $t-\frac{1}{\sqrt{n}}G(X)$ and $t$ and $f_{Y|X}$ is the conditional density function of $Y$ given $X$ . Therefore,

[TABLE]

Note that $F_{Y_{n}}(t)=F_{Y}(t)-\frac{1}{\sqrt{n}}E[G(X)f_{Y|X}(t)]+\frac{1}{2n}E[f^{\prime}_{Y|X}(\xi_{t}(X))G(X)^{2}]$ . Consequently,

[TABLE]

By Theorem 3 in Zhu et al. (2010b), $\sqrt{n}\gamma^{\top}(\hat{M}_{n}-\tilde{M}_{n})\gamma$ is asymptotically normal. By condition (B3) in the Appendix, $\sqrt{n}\gamma^{\top}(\hat{M}_{n}-M_{n})\gamma$ is also asymptotically normal. $\Box$
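The target matrix $M_{n}=\int\alpha_{t}\alpha_{t}^{\top}F_{Y}(dt)$ with $\alpha_{t}=E[XI(Y\leq t)]$ has a natural plug-in estimator obtained by replacing expectations with sample averages. The Python sketch below is one such plug-in version under stated assumptions: predictors are centered at their sample mean, the responses have no ties, and the leading eigenvector is found by plain power iteration. The function names, the simulated linear model, and the seed are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def dee_matrix(X, Y):
    """Plug-in estimate (1/n) * sum_j a_j a_j^T, a_j = (1/n) * sum_i (X_i - Xbar) I(Y_i <= Y_j)."""
    n, p = len(X), len(X[0])
    xbar = [sum(row[k] for row in X) / n for k in range(p)]
    order = sorted(range(n), key=lambda i: Y[i])
    M = [[0.0] * p for _ in range(p)]
    running = [0.0] * p
    for i in order:
        # After adding observation i, running / n is the estimate of alpha_t at t = Y_i.
        for k in range(p):
            running[k] += X[i][k] - xbar[k]
        a = [r / n for r in running]
        for k in range(p):
            for l in range(p):
                M[k][l] += a[k] * a[l] / n
    return M


def leading_eigenvector(M, n_iter=200):
    """Power iteration for the eigenvector of the largest eigenvalue of a PSD matrix."""
    p = len(M)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(M[k][l] * v[l] for l in range(p)) for k in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Simulated single-index data (illustrative): Y depends on X only through beta^T X.
rng = random.Random(0)
n, p = 400, 3
beta = [1.0, 0.5, 0.0]
scale = math.sqrt(sum(b * b for b in beta))
beta = [b / scale for b in beta]
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
Y = [sum(b * x for b, x in zip(beta, row)) + 0.2 * rng.gauss(0.0, 1.0) for row in X]

M_hat = dee_matrix(X, Y)
direction = leading_eigenvector(M_hat)
cosine = abs(sum(d * b for d, b in zip(direction, beta)))  # alignment with beta up to sign
```

In this simulated setting the leading eigenvector of `M_hat` should align closely with the true direction, which mirrors the role of $\hat{B}_{n}(1)=\hat{\beta}_{1}$ as an estimator of $\kappa\beta_{0}$ in Proposition 3.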

Proof of Proposition 5. The proof is similar to that for proving Propositions 1 and 2 with $e_{i}=\varepsilon_{i}$ and $\Sigma_{n}=E[g^{\prime}(\beta_{0},\theta_{0},X)g^{\prime}(\beta_{0},\theta_{0},X)^{\top}]$ . $\Box$

Proof of Theorem 3.3. (1) Under $H_{1}$ , Proposition 3 asserts that $P(\hat{q}=q)\to 1$ . Thus we need only work on the event $\{\hat{q}=q\}$ . It follows that $\sup_{\hat{\alpha}\in\mathcal{S}_{\hat{q}}^{+}}|\hat{T}_{n}V_{n}(\hat{\alpha},u)|=\sup_{\alpha\in\mathcal{S}_{q}^{+}}|\hat{T}_{n}V_{n}(\alpha,u)|$ .

Putting

[TABLE]

and

[TABLE]

Following the arguments in Theorem 3.2, we obtain that

[TABLE]

where

[TABLE]

and $F_{\alpha}$ is the cumulative distribution function of $\alpha^{\top}B^{\top}X$ . Consider

[TABLE]

Since

[TABLE]

it follows that $\frac{1}{\sqrt{n}}(\tilde{V}_{n}^{1}(\alpha,u)-\tilde{V}_{n}^{0}(\alpha,u))=o_{p}(1)$ . For the two integrals in $T_{n}\tilde{V}_{n}^{1}(\alpha,u)-T_{n}\tilde{V}_{n}^{0}(\alpha,u)$ , we have

[TABLE]

Therefore, we obtain that

[TABLE]

Note that

[TABLE]

It follows that

[TABLE]

where

[TABLE]

Therefore, we obtain that

[TABLE]

where $L(u)$ is a nonzero function.

(2) We use the same notation as in the proof of Theorem 3.2. Under the local alternatives (3.8), by Proposition 4, we have $\mathbb{P}\{\hat{q}=1\}\to 1$ . Thus we need only work on the event $\{\hat{q}=1\}$ . Hence $\mathcal{S}_{\hat{q}}^{+}=\{1\}$ and $\sup_{\hat{\alpha}\in\mathcal{S}_{\hat{q}}^{+}}|\hat{T}_{n}V_{n}(\hat{\alpha},u)|=|\hat{T}_{n}V_{n}(\hat{\alpha},u)|$ .

Following the same arguments for Theorem 3.2, we obtain that

[TABLE]

Next, we consider $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)$ . Recall that

[TABLE]

Under $H_{1n}$ , we have

[TABLE]

Then $V_{n}^{1}(\hat{\alpha},u)-V_{n}^{0}(u)=o_{p}(1)$ . For the integrals in $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)$ , since

[TABLE]

by the same arguments for Theorem 3.2, we have

[TABLE]

Hence we obtain that $T_{n}V_{n}^{1}(\hat{\alpha},u)-T_{n}V_{n}^{0}(u)=o_{p}(1)$ .

To complete the proof, it remains to derive the asymptotic distribution of $T_{n}V_{n}^{0}(u)$ . Under the alternatives, note that

[TABLE]

It follows that

[TABLE]

By the Glivenko-Cantelli theorem, we have

[TABLE]

Since $E[G(X)I(\kappa\beta_{0}^{\top}X\leq u)]\to G_{1}(u)$ and

[TABLE]

we conclude that

[TABLE]

where $V_{\infty}(u)$ is a zero-mean Gaussian process given by (3.6). $\Box$

Bibliography

Bierens, H. J. (1982). Consistent model specification tests. Journal of Econometrics, 20, 105-134.

Bierens, H. J. (1990). A consistent conditional moment test of functional form. Econometrica, 58, 1443-1458.

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. New York: Wiley.

Cook, R. D. and Weisberg, S. (1991). Discussion of Sliced inverse regression for dimension reduction, by K. C. Li. Journal of the American Statistical Association, 86, 316-342.
