Testing for the Rank of a Covariance Operator

Anirvan Chakraborty; Victor M. Panaretos

arXiv:1901.02333·stat.ME·August 11, 2020


TL;DR

This paper introduces a novel bootstrap-based testing procedure to determine the rank of a covariance operator in functional data, effectively handling measurement errors and discretization without smoothing.

Contribution

It develops a matrix-completion inspired test statistic and a stepwise testing procedure with proven consistency and validity, advancing rank determination methods for functional data.

Findings

01 The procedure performs well across diverse simulation settings.

02 It effectively controls the family-wise error rate.

03 The method is demonstrated on real data analyses.



Tables

Table 1: True rank (in bold) and the empirical distribution of the estimated rank under Models A1–A5 with homoskedastic errors for $(n,L)=(150,25)$

	Model A1					Model A2						Model A3
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	$\geq 5$		1	2	3	4	$\geq 5$
Proposed test	0	2	97	0	1	0	1	98	1	0		0	0	100	0	0
$A I C_{y a o}$	0	0	13	59	26	0	0	16	64	20		0	0	74	25	1
$A I C_{m}$	34	54	12	0	0	41	53	6	0	0		80	20	0	0	0
$B I C_{m}$	0	0	100	0	0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	67	33	0	0	0	70	30	0	0	0		99	1	0	0	0
$I C_{p 1}$	53	44	3	0	0	55	45	0	0	0		92	8	0	0	0
	Model A4					Model A5
$\hat{r}$	1	2	3	4	$\geq 5$	1	2	3	4	5	6	7	$\geq 8$
Proposed test	0	0	100	0	0	0	0	0	0	0	97	2	1
$A I C_{y a o}$	0	0	68	32	1	0	0	0	0	0	100	0	0
$A I C_{m}$	77	23	0	0	0	77	19	4	0	0	0	0	0
$B I C_{m}$	0	0	100	0	0	0	0	0	0	33	67	0	0
$P C_{p 1}$	92	8	0	0	0	89	9	2	0	0	0	0	0
$I C_{p 1}$	90	10	0	0	0	90	8	2	0	0	0	0	0

Table 2: True rank (in bold) and the empirical distribution of the estimated rank under Models A1–A5 with homoskedastic errors for $(n,L)=(150,50)$

	Model A1					Model A2						Model A3
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	$\geq 5$		1	2	3	4	$\geq 5$
Proposed test	0	0	100	0	0	0	0	100	0	0		0	0	100	0	0
$A I C_{y a o}$	0	0	0	0	100	0	0	0	1	99		0	0	0	2	98
$A I C_{m}$	9	38	52	1	0	8	37	55	0	0		47	46	7	0	0
$B I C_{m}$	0	0	100	0	0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	41	51	8	0	0	37	54	9	0	0		76	24	0	0	0
$I C_{p 1}$	21	49	30	0	0	12	51	37	0	0		60	39	1	0	0
	Model A4					Model A5
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	5	6	7	$\geq 8$
Proposed test	0	0	100	0	0	0	0	0	0	0	100	0	0
$A I C_{y a o}$	0	0	0	1	99	0	0	0	0	0	92	8	0
$A I C_{m}$	37	49	14	0	0	24	28	40	6	2	0	0	0
$B I C_{m}$	0	0	100	0	0	0	0	0	0	0	100	0	0
$P C_{p 1}$	78	22	0	0	0	40	40	19	1	0	0	0	0
$I C_{p 1}$	62	36	2	0	0	46	37	16	1	0	0	0	0

Table 3: True rank (in bold) and the empirical distribution of the estimated rank under Models S1–S5 with homoskedastic errors for $(n,L)=(150,25)$

	Model S1									Model S2
Selected rank	1	2	3	4	5	6	7	$\geq 8$		1	2	3	4	5	6	7	$\geq 8$
Proposed test	0	0	0	0	0	99	1	0		0	0	0	0	0	95	4	1
$A I C_{y a o}$	0	0	0	0	0	100	0	0		0	0	0	0	0	100	0	0
$A I C_{m}$	31	31	38	0	0	0	0	0		33	35	32	0	0	0	0	0
$B I C_{m}$	0	0	0	0	77	23	0	0		0	0	0	2	77	21	0	0
$P C_{p 1}$	33	36	31	0	0	0	0	0		34	37	29	0	0	0	0	0
$I C_{p 1}$	65	31	4	0	0	0	0	0		63	32	5	0	0	0	0	0
	Model S3							Model S4							Model S5
Selected rank	1	2	3	4	5	$\geq 6$		1	2	3	4	5	$\geq 6$		1	2	3	4	$\geq 5$
Proposed test	0	0	0	93	5	2		0	0	0	94	4	2		0	0	94	4	2
$A I C_{y a o}$	0	0	0	100	0	0		0	0	0	100	0	0		0	0	77	22	1
$A I C_{m}$	1	8	56	35	0	0		1	8	53	38	0	0		1	20	79	0	0
$B I C_{m}$	0	0	0	100	0	0		0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	1	10	60	29	0	0		2	11	55	32	0	0		1	20	79	0	0
$I C_{p 1}$	7	14	64	15	0	0		16	17	55	12	0	0		3	23	74	0	0

Table 4: True rank (in bold) and the empirical distribution of the estimated rank under Models S1–S5 with homoskedastic errors for $(n,L)=(150,50)$

	Model S1									Model S2
Selected rank	1	2	3	4	5	6	7	$\geq 8$		1	2	3	4	5	6	7	$\geq 8$
Proposed	0	0	0	0	0	100	0	0		0	0	0	0	0	100	0	0
$A I C_{y a o}$	0	0	0	0	0	83	17	0		0	0	0	0	0	80	20	0
$A I C_{m}$	0	2	3	13	57	25	0	0		0	0	7	14	51	28	0	0
$B I C_{m}$	0	0	0	0	0	100	0	0		0	0	0	0	0	100	0	0
$P C_{p 1}$	0	3	8	17	54	18	0	0		0	0	12	21	49	18	0	0
$I C_{p 1}$	1	11	27	36	25	0	0	0		1	10	27	37	25	0	0	0
	Model S3							Model S4							Model S5
Selected rank	1	2	3	4	5	$\geq 6$		1	2	3	4	5	$\geq 6$		1	2	3	4	$\geq 5$
Proposed	0	0	0	100	0	0		0	0	0	100	0	0		0	0	100	0	0
$A I C_{y a o}$	0	0	0	3	42	55		0	0	0	3	43	54		0	0	0	1	99
$A I C_{m}$	0	1	16	83	0	0		0	0	9	91	0	0		0	8	92	0	0
$B I C_{m}$	0	0	0	100	0	0		0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	0	2	17	81	0	0		0	0	9	91	0	0		0	7	93	0	0
$I C_{p 1}$	0	2	15	83	0	0		0	0	18	82	0	0		0	9	91	0	0

Table 5: True rank (in bold) and the empirical distribution of the estimated rank under Models A1–A5 with heteroskedastic errors for $(n,L)=(150,25)$

	Model A1					Model A2						Model A3
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	$\geq 5$		1	2	3	4	$\geq 5$
Proposed	0	0	95	4	1	0	0	94	5	1		0	0	93	6	1
$A I C_{y a o}$	0	0	25	57	18	0	0	39	56	5		0	0	22	62	16
$A I C_{m}$	21	45	34	0	0	27	51	22	0	0		93	7	0	0	0
$B I C_{m}$	0	0	100	0	0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	60	38	2	0	0	63	37	0	0	0		100	0	0	0	0
$I C_{p 1}$	33	47	20	0	0	38	52	10	0	0		100	0	0	0	0
	Model A4					Model A5
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	5	6	7	$\geq 8$
Proposed	0	0	94	5	1	0	0	0	0	0	100	0	0
$A I C_{y a o}$	0	0	25	55	20	0	0	0	0	0	100	0	0
$A I C_{m}$	95	5	0	0	0	86	13	1	0	0	0	0	0
$B I C_{m}$	0	1	99	0	0	0	0	0	0	45	55	0	0
$P C_{p 1}$	99	1	0	0	0	99	1	0	0	0	0	0	0
$I C_{p 1}$	98	2	0	0	0	97	3	0	0	0	0	0	0

Table 6: True rank (in bold) and the empirical distribution of the estimated rank under Models A1–A5 with heteroskedastic errors for $(n,L)=(150,50)$

	Model A1					Model A2						Model A3
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	$\geq 5$		1	2	3	4	$\geq 5$
Proposed	0	0	100	0	0	0	0	100	0	0		0	0	100	0	0
$A I C_{y a o}$	0	0	0	0	100	0	0	0	4	96		0	0	0	1	99
$A I C_{m}$	4	30	65	1	0	6	24	70	0	0		62	37	1	0	0
$B I C_{m}$	0	0	100	0	0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	35	53	12	0	0	27	51	22	0	0		84	16	0	0	0
$I C_{p 1}$	10	41	49	0	0	7	44	49	0	0		79	21	0	0	0
	Model A4					Model A5
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	5	6	7	$\geq 8$
Proposed	0	0	100	0	0	0	0	0	0	0	100	0	0
$A I C_{y a o}$	0	0	0	1	99	0	0	0	0	0	65	35	0
$A I C_{m}$	63	34	3	0	0	22	29	40	7	2	0	0	0
$B I C_{m}$	0	0	100	0	0	0	0	0	0	0	100	0	0
$P C_{p 1}$	86	14	0	0	0	37	40	22	1	0	0	0	0
$I C_{p 1}$	85	15	0	0	0	44	38	17	1	0	0	0	0

Table 7: True rank (in bold) and the empirical distribution of the estimated rank under Models S1–S5 with heteroskedastic errors for $(n,L)=(150,25)$

	Model S1									Model S2
Selected rank	1	2	3	4	5	6	7	$\geq 8$		1	2	3	4	5	6	7	$\geq 8$
Proposed	0	0	0	0	3	97	0	0		0	0	0	0	1	96	3	0
$A I C_{y a o}$	0	0	0	0	0	100	0	0		0	0	0	0	0	100	0	0
$A I C_{m}$	60	34	6	0	0	0	0	0		65	31	4	0	0	0	0	0
$B I C_{m}$	1	0	0	4	69	23	0	0		20	0	0	5	51	24	0	0
$P C_{p 1}$	64	31	5	0	0	0	0	0		77	21	2	0	0	0	0	0
$I C_{p 1}$	82	18	0	0	0	0	0	0		89	11	0	0	0	0	0	0
	Model S3							Model S4							Model S5
Selected rank	1	2	3	4	5	$\geq 6$		1	2	3	4	5	$\geq 6$		1	2	3	4	$\geq 5$
Proposed	0	0	0	93	5	2		0	0	0	94	6	0		0	0	100	0	0
$A I C_{y a o}$	0	0	0	66	34	0		0	0	0	71	29	0		0	0	65	34	1
$A I C_{m}$	41	37	22	0	0	0		43	34	23	0	0	0		2	23	75	0	0
$B I C_{m}$	0	0	0	100	0	0		0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	38	38	23	1	0	0		38	34	27	1	0	0		4	23	73	0	0
$I C_{p 1}$	64	31	5	0	0	0		66	31	3	0	0	0		8	30	62	0	0

Table 8: True rank (in bold) and the empirical distribution of the estimated rank under Models S1–S5 with heteroskedastic errors for $(n,L)=(150,50)$

	Model S1									Model S2
Selected rank	1	2	3	4	5	6	7	$\geq 8$		1	2	3	4	5	6	7	$\geq 8$
Proposed	0	0	0	0	0	100	0	0		0	0	0	0	0	100	0	0
$A I C_{y a o}$	0	0	0	0	0	14	57	29		0	0	0	0	0	12	62	26
$A I C_{m}$	0	8	20	35	35	2	0	0		1	7	22	37	33	0	0	0
$B I C_{m}$	0	0	0	0	0	100	0	0		0	0	0	0	0	100	0	0
$P C_{p 1}$	0	9	27	38	23	3	0	0		1	9	25	38	27	0	0	0
$I C_{p 1}$	13	26	39	20	2	0	0	0		16	21	43	19	1	0	0	0
	Model S3							Model S4							Model S5
Selected rank	1	2	3	4	5	$\geq 6$		1	2	3	4	5	$\geq 6$		1	2	3	4	$\geq 5$
Proposed	0	0	0	100	0	0		0	0	0	100	0	0		0	0	100	0	0
$A I C_{y a o}$	0	0	0	0	4	96		0	0	0	0	5	95		0	0	1	6	93
$A I C_{m}$	0	1	18	81	0	0		0	0	13	87	0	0		0	4	96	0	0
$B I C_{m}$	0	0	0	100	0	0		0	0	0	100	0	0		0	0	100	0	0
$P C_{p 1}$	0	2	18	80	0	0		0	0	16	84	0	0		0	5	95	0	0
$I C_{p 1}$	0	4	34	62	0	0		0	2	28	70	0	0		0	8	92	0	0

Table 9: True rank (in bold) and the empirical distribution of the estimated rank under Models SF1–SF3

$(n, L) = (150, 25)$
	Model SF1					Model SF2								Model SF3
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	5	6	7	$\geq 8$	1	2	3	4	5	6	7	$\geq 8$
Proposed	0	2	98	0	0	0	0	0	0	0	100	0	0	0	0	0	0	44	49	7	0
$A I C_{y a o}$	0	0	13	49	38	0	0	0	0	6	94	0	0	0	0	0	0	66	34	7	0
$A I C_{m}$	100	0	0	0	0	54	46	0	0	0	0	0	0	59	27	14	0	0	0	0	0
$B I C_{m}$	100	0	0	0	0	0	98	2	0	0	0	0	0	0	0	100	0	0	0	0	0
$P C_{p 1}$	100	0	0	0	0	80	20	0	0	0	0	0	0	75	21	4	0	0	0	0	0
$I C_{p 1}$	100	0	0	0	0	64	36	0	0	0	0	0	0	74	22	4	0	0	0	0	0
$(n, L) = (150, 50)$
	Model SF1					Model SF2								Model SF3
Selected rank	1	2	3	4	$\geq 5$	1	2	3	4	5	6	7	$\geq 8$	1	2	3	4	5	6	7	$\geq 8$
Proposed	0	0	100	0	0	0	0	0	0	0	100	0	0	0	0	0	0	0	100	0	0
$A I C_{y a o}$	0	0	1	0	99	0	0	0	0	0	41	55	4	0	0	0	0	0	13	40	47
$A I C_{m}$	82	17	1	0	0	22	78	0	0	0	0	0	0	10	28	62	0	0	0	0	0
$B I C_{m}$	67	6	27	0	0	0	25	1	20	12	42	0	0	0	0	1	9	87	3	0	0
$P C_{p 1}$	100	0	0	0	0	45	55	0	0	0	0	0	0	24	32	44	0	0	0	0	0
$I C_{p 1}$	98	2	0	0	0	31	69	0	0	0	0	0	0	23	32	45	0	0	0	0	0

Table 10: True rank (in bold) and the empirical distribution of the estimated rank under Models I1–I4 with $L=25$. The procedure labelled ‘Proposed (hom.)’ corresponds to our bootstrap procedure, modified to make use of homoskedasticity, as per the comment in Section 2.5.

	Model I1				Model I2
Selected rank	1-3	4-6	7-9	$\geq$ 10	1-3	4-6	7-9	$\geq$ 10
Proposed	2	9	33	56	0	7	31	62
Proposed (hom.)	0	0	5	95	0	0	4	96
$A I C_{y a o}$	2	97	1	0	28	71	1	0
$A I C_{m}$	96	4	0	0	100	0	0	0
$B I C_{m}$	100	0	0	0	100	0	0	0
$P C_{p 1}$	88	12	0	0	100	0	0	0
$I C_{p 1}$	100	0	0	0	100	0	0	0
	Model I3				Model I4
Selected rank	1-3	4-6	7-9	$\geq$ 10	1-3	4-6	7-9	$\geq$ 10
Proposed	0	1	24	75	1	4	11	84
$A I C_{y a o}$	5	93	2	0	49	51	0	0
$A I C_{m}$	64	36	0	0	100	0	0	0
$B I C_{m}$	100	0	0	0	100	0	0	0
$P C_{p 1}$	42	58	0	0	100	0	0	0
$I C_{p 1}$	87	13	0	0	100	0	0	0

Table 11: True rank (in bold) and the empirical distribution of the estimated rank under Models I1–I4 with $L=50$

	Model I1						Model I2
Selected rank	1-3	4-6	7-9	10-12	13-15	$>$ 15	1-3	4-6	7-9	10-12	13-15	$>$ 15
Proposed	0	0	0	0	0	100	0	0	0	0	0	100
$A I C_{y a o}$	0	6	92	1	0	0	1	50	49	0	0	0
$A I C_{m}$	54	46	0	0	0	0	100	0	0	0	0	0
$B I C_{m}$	100	0	0	0	0	0	100	0	0	0	0	0
$P C_{p 1}$	30	70	0	0	0	0	100	0	0	0	0	0
$I C_{p 1}$	87	13	0	0	0	0	100	0	0	0	0	0
	Model I3						Model I4
Selected rank	1-3	4-6	7-9	10-12	13-15	$>$ 15	1-3	4-6	7-9	10-12	13-15	$>$ 15
Proposed	0	0	0	0	0	100	0	0	0	0	0	100
$A I C_{y a o}$	0	30	60	10	0	0	3	73	24	0	0	0
$A I C_{m}$	15	76	9	0	0	0	99	1	0	0	0	0
$B I C_{m}$	93	7	0	0	0	0	100	0	0	0	0	0
$P C_{p 1}$	2	64	34	0	0	0	100	0	0	0	0	0
$I C_{p 1}$	41	59	0	0	0	0	100	0	0	0	0	0

Table 12: Estimated rank of the Tecator data set under different error variances

Error variance	1	0.5	0.1	0.05	0.01	0.005	0.001	0.0005	0.0001
Proposed method	2	2	2	2	3	3	4	4	6
$A I C_{y a o}$	7	8	11	12	12	12	9	1	1
$A I C_{m}$	2	2	1	1	1	1	1	1	1
$B I C_{m}$	1	1	1	1	1	1	1	1	1
$P C_{p 1}$	2	2	2	1	1	1	1	1	1
$I C_{p 1}$	2	2	1	1	1	1	1	1	1

Equations

$$k_{X}(s,t)=E[X(s)X(t)],\qquad (s,t)\in[0,1]^{2}.$$

$$k_{X}(s,t)=\sum_{m\geq 1}\lambda_{m}\varphi_{m}(s)\varphi_{m}(t)$$

$$X(t)=\sum_{m\geq 1}Y_{m}\varphi_{m}(t),$$

$$W_{ij}=X_{i}(t_{j})+\varepsilon_{ij},\qquad i=1,\dots,n,\ j=1,\dots,L,$$

$$0\leq t_{1}<t_{2}<\dots<t_{L}\leq 1.$$

$$E[\varepsilon_{ij}]=0\quad\&\quad\mathrm{var}[\varepsilon_{ij}]=\sigma_{j}^{2}<\infty,\qquad i=1,\dots,n,\ j=1,\dots,L.$$

$$K_{W,L}=\mathrm{cov}\{(W_{11},W_{12},\dots,W_{1L})^{\top}\}=K_{X,L}+D,$$

$$\left\{\begin{array}{rl}H_{0}&:\mathrm{rank}(k_{X})\leq d\\ H_{1}&:\mathrm{rank}(k_{X})>d\end{array}\right\}$$

$$\left\{\begin{array}{rl}H_{0,q}&:\mathrm{rank}(k_{X})=q\\ H_{1,q}&:\mathrm{rank}(k_{X})>q\end{array}\right\},\qquad q=1,\ldots,d.$$

$$K_{X,L}(i,j)=K_{W,L}(i,j),\qquad \forall\, i\neq j$$

$$L_{\dagger}=2d+1$$

$$\min_{\Theta:\,\mathrm{rank}(\Theta)\leq q}\left\|P_{L}\circ(K_{W,L}-\Theta)\right\|_{F}^{2}=0.$$

$$K_{W,L}:=\frac{1}{n}\sum_{i=1}^{n}(W_{i1},\dots,W_{iL})(W_{i1},\dots,W_{iL})^{\top}.$$

$$T_{q}=\min_{\Theta\in\mathbb{R}^{L\times L}:\,\mathrm{rank}(\Theta)\leq q}\left\|P_{L}\circ(K_{W,L}-\Theta)\right\|_{F}^{2},$$

$$r:=\min\{q\geq 1:p_{q}>\alpha\},$$

$$\{V>0\}\Leftrightarrow\{V=1\}\Leftrightarrow\{H_{0,q_{0}}\ \mbox{has been rejected}\}\Leftrightarrow\{p_{r_{0}}\leq\alpha\}.$$

$$P(r>r_{true})\leq P(p_{r_{true}}\leq\alpha)=P(V>0)\leq\alpha.$$

$$\Psi:\mathbb{R}^{L\times q}\to[0,\infty),\qquad \Psi(C)=\left\|P_{L}\circ(K_{W,L}-CC^{\top})\right\|_{F}^{2}.$$

$$nT_{q}\overset{d}{\to}\left\|P_{L}\circ Z\right\|_{F}^{2}-8\,(\mathrm{vec}(P_{L}\circ Z))^{\top}\left\{(C_{0}\otimes I_{L})(\nabla^{2}\Psi(C_{0}))^{-1}(C_{0}^{\top}\otimes I_{L})\right\}\mathrm{vec}(P_{L}\circ Z)$$

$$m(W_{i})=\overline{W}+\Theta_{q}K_{W,L}^{-1}(W_{i}-\overline{W}).$$

$$D(j,j)=\max\{K_{W,L}(j,j)-\Theta_{M}(j,j),0\},$$

$$M=m_{n}\mathbf{1}\{m_{n}<d\}+d\,\mathbf{1}\{m_{n}\geq d\}$$

$$m_{n}=\min\left\{m\geq q:T_{m}\leq\epsilon\,\frac{\log n}{n}\right\},$$

$$p_{q,B}^{*}=\frac{1}{B}\sum_{b=1}^{B}\mathbf{1}\{T_{q,b}^{*}\leq T_{q}\}=F_{q,B}^{*}(T_{q})$$

$$T_{q,b}^{*}=\min_{\Theta\in\mathbb{R}^{L\times L}:\,\mathrm{rank}(\Theta)\leq q}\left\|P_{L}\circ\left(\frac{1}{n}\sum_{j=1}^{n}\zeta_{j,b}\zeta_{j,b}^{\top}-\Theta\right)\right\|_{F}^{2}.$$

$$\check{D}=\mathrm{diag}\left\{L^{-1}\sum_{j=1}^{L}D(j,j),\dots,L^{-1}\sum_{j=1}^{L}D(j,j)\right\}.$$

$$W_{i}^{(q)}=X_{i}^{(q)}+\delta_{i},\qquad \delta_{i}\ \mbox{sampled randomly with replacement from}\ \{\varepsilon_{1},\dots,\varepsilon_{n}\}.$$

$$K_{X,L}^{(q)}(i,j)=\sum_{m=1}^{q}\lambda_{m}\varphi_{m}(t_{i})\varphi_{m}(t_{j}).$$

$$m(W_{i})=\overline{W}+K_{X,L}^{(q)}K_{W,L}^{-1}(W_{i}-\overline{W})$$

$$V_{i}\sim N_{L}\left(0,\ D+K_{X,L}^{(q)}-K_{X,L}^{(q)}K_{W,L}^{-1}K_{X,L}^{(q)}\right).$$


Taxonomy

Topics: Statistical Methods and Inference · Neural Networks and Applications · Statistical Methods and Bayesian Inference

Full text

Testing for the Rank of a Covariance Operator

Anirvan Chakraborty and Victor M. Panaretos

Indian Institute of Science Education and Research Kolkata & Ecole Polytechnique Fédérale de Lausanne

Abstract

How can we discern whether the covariance operator of a stochastic process is of reduced rank, and if so, what its precise rank is? And how can we do so at a given level of confidence? This question is central to a great deal of methods for functional data, which require low-dimensional representations whether by functional PCA or other methods. The difficulty is that the determination is to be made on the basis of i.i.d. replications of the process observed discretely and with measurement error contamination. This adds a ridge to the empirical covariance, obfuscating the underlying dimension. We build a matrix-completion inspired test statistic that circumvents this issue by measuring the best possible least square fit of the empirical covariance’s off-diagonal elements, optimised over covariances of given finite rank. For a fixed grid of sufficiently large size, we determine the statistic’s asymptotic null distribution as the number of replications grows. We then use it to construct a bootstrap implementation of a stepwise testing procedure controlling the family-wise error rate corresponding to the collection of hypotheses formalising the question at hand. Under minimal regularity assumptions we prove that the procedure is consistent and that its bootstrap implementation is valid. The procedure circumvents smoothing and associated smoothing parameters, is indifferent to measurement error heteroskedasticity, and does not assume a low-noise regime. An extensive simulation study reveals an excellent practical performance, stably across a wide range of settings, and the procedure is further illustrated by means of two data analyses.

AMS subject classifications: 62G05, 62M40, 15A99.

Keywords: bootstrap, functional data analysis, functional PCA, Karhunen-Loève expansion, matrix completion, measurement error, scree-plot.

1 Introduction
2 Methodology
2.1 Problem Statement and Background
2.2 Identifiability
2.3 The Testing Procedure
2.4 Asymptotic Theory
2.5 Bootstrap Calibration
2.6 Practical Implementation
3 Simulation study
3.1 Homoskedastic errors
3.2 Heteroskedastic errors
3.3 Spiked functional data
3.4 Infinite dimensional models
4 Data Analysis
5 Appendix
5.1 Proofs of Formal Statements
5.2 On the Critical Grid Size
5.3 On the Invertibility of the Hessian $\nabla^{2}\Psi$

1 Introduction

Principal component analysis (PCA) plays a fundamental role in statistics due to its ability to focus on a parsimonious data subspace that is most relevant for many practical purposes. In the case of functional data, it assumes an even more prominent role because a reduction in the data dimension maps the statistical problem back to a more familiar multivariate setting. Furthermore, regularization techniques that are necessary for functional regression, testing, prediction, and classification typically hinge on the identification of the most prominent sources of variation in the data.

One of the main drawbacks of principal component analysis for any kind of data is that the procedure of estimation/selection of the number of components to retain is often exploratory in nature. Indeed, one has to either inspect the scree plot or select the first few components that explain, say, $85\%$ of the total variation (see, e.g., Jolliffe (2002)). There are but a few confirmatory procedures to this end (see, e.g., Horn (1965), Velicer (1976) and Peres-Neto et al. (2005)). However, each of these procedures relies on its own assessment of what is an appropriate definition of the dimension of the data corresponding to how many components to retain. In this paper, we view the problem from the perspective of hypothesis testing. Indeed, the high level or global problem that is being considered is that of $\{H_{0}:\mbox{rank}(\Sigma)<d\}$ versus $\{H_{1}:\mbox{rank}(\Sigma)=d\}$ , where $\Sigma$ is the covariance matrix of the $d$ -dimensional distribution that generates the data. Thus, we want to test whether the data gives us enough evidence to conclude that its intrinsic variation is lower dimensional. If this null hypothesis is rejected based on the observed data, one can also consider a more detailed analysis and test the local hypotheses $\{H_{0,q}:\mbox{rank}(\Sigma)=q\}$ versus $\{H_{1,q}:\mbox{rank}(\Sigma)>q\}$ for each $q=1,2,\ldots,(d-1)$ .

When dealing with functional data, the object of interest is a covariance kernel $k_{X}$ (rather than a covariance matrix), which a priori is an infinite dimensional object. One would then wish to test $\{H_{0}:\mbox{rank}(k_{X})\leq d\}$ versus $\{H_{1}:\mbox{rank}(k_{X})>d\}$ at a global level, for some finite integer $d$ where the rank of $k_{X}$ is at most $d$ if and only if its Mercer expansion has no more than $d$ terms.

In practice, we can observe each of a sample of $n$ curves on a finite number, say $L$ , of grid nodes, so $d$ will certainly have to be at most $(L\wedge n)-1$ for the testing problem to not be vacuous. The local hypotheses will then be $\{H_{0,q}:\mbox{rank}(k_{X})=q\}$ versus $\{H_{1,q}:\mbox{rank}(k_{X})>q\}$ for $1\leq q\leq d$ . Based on the $n$ sample curves, and assuming that they are observed on the same grid, the simplest procedure would be to look at the rank of the $L\times L$ empirical covariance matrix.

At first sight, this problem seems simple: if the number of observations exceeds $d$ , then perfect inference will be feasible. However, functional data are most often additively corrupted by unobservable measurement errors that are usually modelled as independent random variables indexed by the grid points for each sample function. This additional noise adds a “ridge” to the true covariance. More specifically, the covariance matrix of the observed erroneous data is of full rank. Clearly, this gives rise to a problem of the true rank being confounded by the additive noise. One way of removing the effect of the errors is to use some smoothing procedure on the data (see, e.g., Ramsay and Silverman (2005)). But this smoothing step obfuscates the problem since the relationship between the rank of $k_{X}$ and the rank of the smoothed data is unknown, and further depends on the choice of tuning parameter(s) used for smoothing. At this stage, it would seem that the problem is “almost insolubly difficult” as pointed out by Hall and Vial (2006), who further concluded that “conventional approaches based on formal hypothesis testing will not be effective”. As a workaround, Hall and Vial (2006) considered a “low noise” setting (assuming the noise variance vanishes as the number of observations increases) and used an unconventional rank selection procedure based on the amount of unconfounded noise variance. A weakness of the procedure was that it required the analyst to provide acceptable values of the noise variance for the procedure to be implemented in practice, and these bounds are to be selected in an ad hoc manner.

An alternative approach altogether is to view the problem not as one of testing, but rather as one of model selection. For instance, as part of their PACE method, and assuming Gaussian data, Yao et al. (2005) offer a solution based on a pseudo-AIC criterion applied to a smoothed covariance whose diagonal has been removed. Later work by Li et al. (2013) provides estimates of the effective dimension based on a BIC criterion employing the estimate of the error variance obtained using the PACE approach, with the difference being that they used an adaptive penalty term in place of that used in the classical BIC technique. For densely observed functional data, Li et al. (2013) also studied a modification of the AIC technique in Yao et al. (2005) by assuming a Gaussian likelihood for the data. Li et al. (2013) finally considered versions of information theoretic criteria studied earlier by Bai and Ng (2002) in the context of factor models in econometrics, where the latter method is used to choose the number of factors. For all of the procedures studied by Yao et al. (2005) and Li et al. (2013), the main drawback is the involvement of smoothing parameters, which enter due to the use of smoothing prior to dimension estimation. The asymptotic consistency of these procedures also depends on specific decay rates for the smoothing parameters, as well as on assumptions on the regularity of the true mean and covariance functions. In early work, not explicitly framed in the context of FDA, Kneip (1994) used several smooth versions of the data constructed using a progression of smoothing parameters, to select a dimension based on a sum of residual estimated eigenvalues. Here too, the method’s performance and asymptotics depend on regularity assumptions and decay rates for smoothing parameters.

In this paper, we steer back to a formal hypothesis testing perspective for the dimension problem. We demonstrate that it is possible to construct a valid test that circumvents the smoothing step entirely, by means of matrix completion. The proposed test statistic measures the best possible least square fit of the empirical covariance’s off-diagonal elements by nonnegative matrices of a given finite rank, exploiting the fact that the corruption affects only the diagonal.
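To illustrate the idea of fitting only the off-diagonal entries, the following is a minimal NumPy sketch, not the authors' implementation. It approximates the off-diagonal least squares criterion $\min_{\mathrm{rank}(\Theta)\leq q}\|P_{L}\circ(K-\Theta)\|_{F}^{2}$ by a simple alternating heuristic (impute the diagonal from the current fit, then truncate the eigendecomposition), which is an assumption made here for brevity and in general only reaches a local minimum. The function name `offdiag_rank_fit`, the rank-3 Fourier kernel, the eigenvalues and the noise variances are all hypothetical choices.

```python
import numpy as np

def offdiag_rank_fit(K, q, n_iter=200):
    """Heuristic value of min over rank(Theta) <= q of ||P o (K - Theta)||_F^2,
    where P masks the diagonal. Alternates between (a) replacing the diagonal
    of K by the diagonal of the current fit and (b) taking the best
    nonnegative-definite rank-q approximation of the resulting matrix."""
    L = K.shape[0]
    off = ~np.eye(L, dtype=bool)               # off-diagonal mask (P_L)
    M = K.copy()
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
        top = np.clip(vals[-q:], 0.0, None)    # keep the q largest, clipped at 0
        Theta = (vecs[:, -q:] * top) @ vecs[:, -q:].T
        M = K.copy()
        np.fill_diagonal(M, np.diag(Theta))    # the diagonal of K is never fitted
    return float(np.sum((K - Theta)[off] ** 2)), Theta

# Hypothetical population example: rank-3 kernel plus a heteroskedastic ridge.
L = 25
t = (np.arange(L) + 0.5) / L
Phi = np.column_stack([np.ones(L),
                       np.sqrt(2) * np.sin(2 * np.pi * t),
                       np.sqrt(2) * np.cos(2 * np.pi * t)])
K_X = (Phi * np.array([3.0, 2.0, 1.0])) @ Phi.T
K_W = K_X + np.diag(0.5 + t)                   # noise touches the diagonal only

T2, _ = offdiag_rank_fit(K_W, q=2)             # rank too small: clear misfit
T3, _ = offdiag_rank_fit(K_W, q=3)             # true rank: misfit essentially 0
print(T2, T3)
```

In this noiseless (population) example the criterion is essentially zero at the true rank and clearly positive below it, even though the diagonal of `K_W` is contaminated. With an empirical covariance the value at the true rank is small but nonzero, which is why a calibrated test is needed.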

Compared to smoothing based alternatives, our approach presents the following advantages:

• It provides a genuine testing procedure, inferring the rank with confidence guarantees.

• It does not rely on pre-smoothing and consequently on the choice of smoothing parameters.

• It rests on minimal regularity, in particular continuity of the covariance and sample paths.

• It can handle heteroskedastic measurement errors, which are detrimental to smoothing.

• It does not require a “low noise” regime; indeed, the noise variances can be arbitrary.

• It exhibits excellent finite sample performance, stably across a wide range of scenarios.

The paper is organized as follows. In Section 2.1, we discuss the problem statement and setup in detail. We then develop a key identifiability result in Section 2.2, which elucidates how the rank can be identified. Exploiting this result, Section 2.3 describes the testing procedure. The asymptotic distribution of the test statistic, and a valid bootstrap-based calibration approach, are introduced in Sections 2.4 and 2.5. Practical and computational aspects of its implementation are discussed in Section 2.6. An extensive simulation study is presented in Section 3, where we benchmark the performance of our procedure relative to those studied by Yao et al. (2005) and Li et al. (2013). Two illustrative data analyses are presented in Section 4. Proofs of formal statements are collected in Section 5.1, and further technical details are given in Sections 5.2 and 5.3.

2 Methodology

2.1 Problem Statement and Background

Let $X=\{X(t):t\in[0,1]\}$ be the stochastic process in question, assumed zero mean and with continuous covariance kernel on $[0,1]^{2}$ ,

$$k_{X}(s,t)=E[X(s)X(t)],\qquad (s,t)\in[0,1]^{2}.$$

Continuity of $k_{X}$ implies that it admits the Mercer expansion,

$$k_{X}(s,t)=\sum_{m\geq 1}\lambda_{m}\varphi_{m}(s)\varphi_{m}(t) \tag{2.1}$$

with the series on the right hand side converging uniformly and absolutely. Consequently, $X$ is mean square continuous and admits a Karhunen-Loève expansion

$$X(t)=\sum_{m\geq 1}Y_{m}\varphi_{m}(t),$$

where $\{Y_{m}\}$ is a sequence of uncorrelated zero-mean random variables with variances $\lambda_{m}$ , respectively. Convergence of the series is in the mean square sense, uniformly in $t$ . Given $n$ i.i.d. replications $\{X_{1},\ldots,X_{n}\}$ of $X$ , we observe the noise-corrupted discrete measurements

$$W_{ij}=X_{i}(t_{j})+\varepsilon_{ij},\qquad i=1,\dots,n,\ j=1,\dots,L,$$

for a grid of $L$ points

$$0\leq t_{1}<t_{2}<\dots<t_{L}\leq 1.$$

We will assume that the grid nodes are regularly spaced, i.e. $(j-1)/L\leq t_{j}<j/L$ , for the sake of simplifying our statements, but this can be considerably relaxed. We assume that the $n\times L$ errors $\varepsilon_{ij}$ are continuous random variables, independent of the $X_{i}$ and themselves independent across both indices, with moments up to second order given by

$$E[\varepsilon_{ij}]=0\quad\&\quad\mathrm{var}[\varepsilon_{ij}]=\sigma_{j}^{2}<\infty,\qquad i=1,\dots,n,\ j=1,\dots,L.$$

Note, in particular that the $\varepsilon_{ij}$ are allowed to be heteroskedastic in $j$ , i.e. the measurement precision may vary over the grid points. The measured vectors $\{(W_{i1},\ldots,W_{iL})^{\top}\}_{i=1}^{n}$ are now i.i.d. random vectors in $\mathbb{R}^{L}$ with $L\times L$ covariance matrix

$$K_{W,L}=\mathrm{cov}\{(W_{11},W_{12},\dots,W_{1L})^{\top}\}=K_{X,L}+D,$$

where:

– $K_{X,L}:=\{k_{X}(t_{p},t_{q})\}_{p,q=1}^{L}$ is the $L\times L$ matrix obtained by pointwise evaluation of $k_{X}(\cdot,\cdot)$ on the pairs $(t_{p},t_{q})$ , and

– $D=\mathrm{diag}\{\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{L}^{2}\}$ is the $L\times L$ covariance matrix of the $L$ -vector $(\varepsilon_{i1},\ldots,\varepsilon_{iL})^{\top}$ .
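The decomposition $K_{W,L}=K_{X,L}+D$ can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical rank-3 kernel built from three Fourier functions with eigenvalues $(3,2,1)$, a midpoint grid with $L=25$, Gaussian scores and errors, and a hypothetical heteroskedastic variance profile $\sigma_{j}^{2}=0.5+t_{j}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 150, 25
t = (np.arange(L) + 0.5) / L                      # (j-1)/L <= t_j < j/L

# Hypothetical rank-3 covariance kernel evaluated on the grid.
Phi = np.column_stack([np.ones(L),
                       np.sqrt(2) * np.sin(2 * np.pi * t),
                       np.sqrt(2) * np.cos(2 * np.pi * t)])
lam = np.array([3.0, 2.0, 1.0])
K_X = (Phi * lam) @ Phi.T                         # K_{X,L}, rank 3

sigma2 = 0.5 + t                                  # heteroskedastic in j (assumed)
D = np.diag(sigma2)
K_W = K_X + D                                     # K_{W,L}

# The ridge D inflates the rank but leaves off-diagonal entries untouched.
off = ~np.eye(L, dtype=bool)
rank_X = np.linalg.matrix_rank(K_X)
rank_W = np.linalg.matrix_rank(K_W)
offdiag_equal = np.allclose(K_X[off], K_W[off])

# Simulated measurements W_ij = X_i(t_j) + eps_ij and their empirical covariance.
Y = rng.normal(size=(n, 3)) * np.sqrt(lam)        # uncorrelated scores Y_m
X = Y @ Phi.T
eps = rng.normal(size=(n, L)) * np.sqrt(sigma2)
W = X + eps
K_hat = W.T @ W / n                               # zero-mean model, no centring
rank_hat = np.linalg.matrix_rank(K_hat)

print(rank_X, rank_W, rank_hat, offdiag_equal)
```

The signal covariance has rank 3, whereas both the population and the empirical covariance of the measurements are of full rank $L$; the two population matrices nevertheless agree exactly off the diagonal.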

In this setup, we wish to use the observations $\{W_{ij}:i\leq n,j\leq L\}$ in order to infer whether the stochastic process $X$ is, in fact, low dimensional, and if so what its dimension might be. We use the term infer in its formal sense, i.e. we wish to be able to make statements in the form of hypothesis tests with a given level of significance. Concretely, the question posed pertains to whether the covariance $k_{X}$ is of reduced rank, in the sense of a finite Mercer expansion (2.1), and if so of what rank.

Formally, for some dimension $d<\infty$ , we wish to test the hypothesis pair

$$\left\{\begin{array}{rl}H_{0}&:\mathrm{rank}(k_{X})\leq d\\ H_{1}&:\mathrm{rank}(k_{X})>d\end{array}\right\}$$

Notice that we can never actually choose $d=\infty$ , since we have finite data, which is why we have to settle for some $d<L\wedge n$ . Typically $n\gg L$ so that $L\wedge n=L$ . This global hypothesis pair is related to the sequence of local hypotheses

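$$H_{0,q}:\mathrm{rank}(k_{X})=q\quad\text{versus}\quad H_{1,q}:\mathrm{rank}(k_{X})>q,\qquad q=1,\ldots,d.$$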

In particular, if we can sequentially test all $d$ local hypotheses with a controlled family-wise error rate, then we will have a test for the global hypothesis, and a means to infer what the rank is, when $H_{0}$ is valid (more details in the next section). In any case, $k_{X}$ can be replaced by $K_{X,L}$ in the null hypotheses $\{H_{0,q}\}_{q=1}^{d}$ , provided $L$ is sufficiently large relative to $d$ :

Proposition 1.

Let $k_{X}:[0,1]^{2}\rightarrow\mathbb{R}$ be a continuous covariance kernel and $K_{X,L}=\{k_{X}(t_{i},t_{j})\}_{i,j=1}^{L}$ . If $\mathrm{rank}(k_{X})\geq d$ there exists $L_{*}<\infty$ such that $\mathrm{rank}(K_{X,L})\geq d$ whenever $L\geq L_{*}$ .

As noted in the introduction, while this question is of clear intrinsic theoretical interest, it also arises very prominently when carrying out a functional PCA as a first step for further analysis, in particular when evaluating a scree plot to choose a truncation level: the choice of a truncation dimension $q$ can be translated into testing whether the rank of $k_{X}$ is equal to $q$ .

The frustrating tradeoff faced by the statistician in the context of this problem is that:

1. Without any smoothing, the noise covariance $D$ confounds the problem by the addition of a ridge to the empirical covariance, leading to an inflation of the underlying dimensionality. Specifically, the rank of $K_{W,L}=K_{X,L}+D$ is at most $n\wedge L$, with probability 1.

2. Attempts to denoise $K_{W,L}$ and approximate $K_{X,L}$ by means of smoothing will obfuscate the problem, since the choice of smoothing/tuning parameters will interfere with the problem of rank selection.

It is this tradeoff that Hall and Vial, (2006) presumably had in mind when referring to this problem of rank inference as “almost insolubly difficult”. Despite the apparent difficulty, we wish to challenge their statement that “conventional approaches based on formal hypothesis testing will not be effective”, demonstrating that this can be achieved via matrix completion. The crucial observation is that the corrupted diagonal can be entirely disregarded, while still being able to identify the rank, owing to the continuity of the problem. How precisely is described in the next section.

2.2 Identifiability

The main idea we wish to put forward here is that it is feasible to make inferences about the rank of $K_{X,L}$ without resorting to smoothing or low noise assumptions, simply by focussing on the off-diagonal elements of the matrix $K_{W,L}$ for any sufficiently large but finite grid size $L$. The point is that we have no information whatsoever on the diagonal matrix $D$, and cannot attempt to annihilate it by means of smoothing without biasing inference on the rank. Still, we have

$$K_{W,L}(i,j)=K_{X,L}(i,j)\qquad\text{for all } i\neq j,$$

i.e. the matrices are equal off the diagonal, even if their relationship on the diagonal is completely unknown. So the rank of $K_{X,L}$ may still be identifiable from its off-diagonal entries. The first of our main results shows that this is indeed the case, owing to the continuity of $k_{X}$ .

Theorem 1 (Identifiability).

Assume that the kernel $k_{X}$ is continuous on $[0,1]^{2}$ , let $d\geq 1$ , and let $q\in\{1,\ldots,d\}$ . Then, there exists a critical $L_{\dagger}=L_{\dagger}(d)<\infty$ such that, for all $L>L_{\dagger}$ , the functional

$\Theta\mapsto\sum_{i\neq j}\Big(K_{W,L}(i,j)-\Theta(i,j)\Big)^{2}$

restricted on the set $\mathcal{M}_{q}=\{\Theta\in\mathbb{R}^{L\times L}:\mathrm{rank}(\Theta)\leq q\}$ of matrices of rank at most $q$ ,

1. Vanishes uniquely at $K_{X,L}$ when $\mathrm{rank}(k_{X})=q$.

2. Is bounded below by a positive constant when $\mathrm{rank}(k_{X})>q$.

Remark 1 (Notation).

The sum-of-squares term $\sum_{i\neq j}\left(K_{W,L}(i,j)-\Theta(i,j)\right)^{2}$ is simply the squared Frobenius distance between $\Theta$ and $K_{W,L}$ when disregarding their diagonal entries. We can re-write it more compactly as $\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}^{2}$ , where $P_{L}=\{\mathbf{1}\{i\neq j\}\}_{i,j=1}^{L}$ , $\|A\|_{F}=\sqrt{\mathrm{trace}(A^{\top}A)}$ is the Frobenius matrix norm, and ‘ $\circ$ ’ denotes the Hadamard (element-wise) product.
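As a small illustration of this notation, the following sketch computes the masked distance in plain Python. The helper name `offdiag_sq_frobenius` and the $3\times 3$ rank-one example are made up for illustration. It shows that $K_{W,L}=K_{X,L}+D$ and $K_{X,L}$ are at distance zero once the diagonal is disregarded, although their ordinary squared Frobenius distance equals $\sum_{j}\sigma_{j}^{4}$.

```python
def offdiag_sq_frobenius(A, B):
    """Squared Frobenius distance between A and B, ignoring diagonal entries."""
    L = len(A)
    return sum((A[i][j] - B[i][j]) ** 2
               for i in range(L) for j in range(L) if i != j)

# Illustrative rank-one K_X = c c^T and noise variances (values are made up).
c = [1.0, 2.0, 3.0]
sigma2 = [0.5, 1.0, 2.0]
K_X = [[ci * cj for cj in c] for ci in c]
K_W = [[K_X[i][j] + (sigma2[i] if i == j else 0.0) for j in range(3)]
       for i in range(3)]

masked = offdiag_sq_frobenius(K_W, K_X)      # diagonal disregarded
full = sum((K_W[i][j] - K_X[i][j]) ** 2      # ordinary squared distance
           for i in range(3) for j in range(3))
```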

Remark 2 (Critical Grid Size).

The precise critical value $L_{\dagger}<\infty$ in Theorem 1 will generally depend on the boundary value $d$ in the global hypothesis pair (2.4), and the spectrum of $k_{X}$. For most scenarios encountered in functional data analysis, the value

[TABLE]

suffices. This includes polynomial or trigonometric eigensystems and warped versions thereof, systems comprised of splines or other piecewise (non-vanishing) analytic basis elements, and more generally systems with eigenfunctions that are linearly independent over sets of positive Lebesgue measure. Note that it is not the regularity of the eigenfunctions that is elemental here – for instance, the last class described can include eigenfunctions that are nowhere differentiable. See Section 5.2 for a detailed discussion.

The theorem affirms that sequentially checking whether the rank of $k_{X}$ is equal to $q$ or exceeds $q$ , for $q\in\{1,...,d\}$ , is feasible by means of the off-diagonal entries of $K_{X,L}$ alone, and indeed for any finite grid $L>L_{\dagger}$ . That is, the collection of local hypothesis pairs $\{H_{0,q},H_{1,q}\}_{q=1}^{d}$ is identifiable non-asymptotically in the grid size, even when observation is discrete and noisy. Consequently, we will henceforth be working in a framework where $L$ is assumed fixed but sufficiently large relative to $d$ (i.e. $L>L_{\dagger}$ , where $L_{\dagger}$ is as in Theorem 1).

Indeed, the identifiability is constructive, in that if we had access to the true matrix $K_{W,L}$ , starting with $q=1$ and proceeding sequentially, we could discern all $d$ hypothesis pairs as follows:

1. For any candidate rank $q\leq d$, we check whether

$$\min_{\Theta\in\mathcal{M}_{q}}\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}^{2}=0.$$

2. If the minimum is positive, we are certain that $\mathrm{rank}(K_{X,L})>q$.
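The check above can be sketched numerically on a toy example. The code below is a minimal illustration, not the solver used in the paper: it parametrises $\Theta=CC^{\top}$ and runs plain gradient descent with step halving, both of which are assumptions made here for simplicity. The rank-2 kernel, the grid of $L=6$ midpoints and the unit noise variances are likewise illustrative choices.

```python
import math

def objective(K, C):
    """Sum over i != j of (K[i][j] - (C C^T)[i][j])^2."""
    L, q = len(C), len(C[0])
    return sum((K[i][j] - sum(C[i][k] * C[j][k] for k in range(q))) ** 2
               for i in range(L) for j in range(L) if i != j)

def gradient(K, C):
    """Gradient -4 R C, where R is the off-diagonal residual (K symmetric)."""
    L, q = len(C), len(C[0])
    R = [[0.0 if i == j else
          K[i][j] - sum(C[i][k] * C[j][k] for k in range(q))
          for j in range(L)] for i in range(L)]
    return [[-4.0 * sum(R[i][j] * C[j][k] for j in range(L))
             for k in range(q)] for i in range(L)]

def minimise(K, C, iters=300):
    """Gradient descent with step halving; the objective never increases."""
    f = objective(K, C)
    for _ in range(iters):
        G = gradient(K, C)
        step = 1.0
        while step > 1e-12:
            trial = [[c - step * g for c, g in zip(c_row, g_row)]
                     for c_row, g_row in zip(C, G)]
            f_trial = objective(K, trial)
            if f_trial < f:
                C, f = trial, f_trial
                break
            step /= 2.0
        else:
            break  # no step improved the fit: stop at a stationary point
    return C, f

# Rank-2 kernel on L = 6 midpoints, with unit noise variances (illustrative).
L = 6
t = [(j + 0.5) / L for j in range(L)]
C0 = [[math.sqrt(0.6), math.sqrt(0.3) * math.sqrt(2) * math.sin(2 * math.pi * u)]
      for u in t]
K_X = [[sum(a * b for a, b in zip(C0[i], C0[j])) for j in range(L)]
       for i in range(L)]
K_W = [[K_X[i][j] + (1.0 if i == j else 0.0) for j in range(L)]
       for i in range(L)]

f_true_factor = objective(K_W, C0)                     # rank-2 fit at the truth
_, f_rank1 = minimise(K_W, [[1.0] for _ in range(L)])  # best rank-1 fit found
```

In this toy example the off-diagonal misfit at the true rank-2 factor is zero despite the noise on the diagonal, whereas no rank-one matrix can fit the off-diagonal entries: the $2\times 2$ off-diagonal block with rows $\{2,5\}$ and columns $\{1,4\}$ has second singular value $0.6$, so the rank-one misfit is at least $2\times 0.36$ whatever the optimiser does.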

2.3 The Testing Procedure

This constructive identifiability can be leveraged to construct a testing procedure. Of course, in practice the matrix $K_{W,L}$ is unobservable and we must rely on $\widehat{K}_{W,L}$ , the empirical covariance of the observed vector $(W_{i1},\dots,W_{iL})^{\top}$ ,

[TABLE]

This motivates testing the local hypothesis pair $\{H_{0,q},H_{1,q}\}$ by means of the test statistic

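$$T_{q}=\min_{\Theta\in\mathbb{R}^{L\times L}:\,\mathrm{rank}(\Theta)\leq q}\left\|P_{L}\circ\left(\widehat{K}_{W,L}-\Theta\right)\right\|_{F}^{2},$$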

rejecting $H_{0,q}$ in favour of $H_{1,q}$ for large values of $T_{q}$ . Note the interpretation of the test statistic: to test whether the rank is $q$ , we measure the best possible fit of the off-diagonal elements of the empirical covariance $\widehat{K}_{W,L}$ by a matrix of rank $q$ . We reject when this fit is poor, and the calibration of $T_{q}$ is considered in the next two sections, via an asymptotic analysis based on $M$ -estimation, and hinging on Theorem 1.

For the moment, though, assume that we can obtain a $p$ -value $p_{q}$ for $T_{q}$ (or some appropriately re-scaled version, e.g. $n\times T_{q}$ ) under the hypothesis $H_{0,q}$ . In order to be able to test the global pair $\{H_{0},H_{1}\}$ (2.4), and infer the rank when the global null $\{H_{0}:\mathrm{rank}(k_{X})\leq d\}$ is valid, we consider a stepwise procedure, for a given significance level $\alpha$ :

Step 1:

Test $H_{0,1}:\mathrm{rank}(K_{X,L})=1$ vs $H_{1,1}:\mathrm{rank}(K_{X,L})>1$ by means of $T^{(1)}$.

Stop if the corresponding $p$-value, $p_{1}$, exceeds $\alpha$; otherwise continue to Step 2.

Step 2:

Test $H_{0,2}:\mathrm{rank}(K_{X,L})=2$ vs $H_{1,2}:\mathrm{rank}(K_{X,L})>2$ by means of $T^{(2)}$.

Stop if the corresponding $p$-value, $p_{2}$, exceeds $\alpha$; otherwise continue similarly.

$\qquad\vdots$

We reject the global null $\{H_{0}:\mathrm{rank}(k_{X})\leq d\}$ in (2.4) if and only if the sequential procedure terminates with the rejection of all local hypotheses up to and including the $d$ -th one. If the procedure terminates earlier, the global null is not rejected, and we subsequently declare the rank of the functional data to be the value

[TABLE]

i.e. the smallest $q$ for which we fail to reject $H_{0,q}$ . This stepwise procedure strongly controls the Family Wise Error Rate (FWER) at level $\alpha$ (see Maurer et al., (1995) and Lynch et al., (2017)). Indeed, observe that at most one of the hypotheses $\{H_{0,q}\}_{q=1}^{d}$ can be true, and suppose it corresponds to $q=q_{0}$ . Then, if $V$ denotes the number of false discoveries among the number of rejections, one has

[TABLE]

So, FWER = $P(V>0)\leq P(p_{q_{0}}\leq\alpha)\leq\alpha$, since the true null $H_{0,q_{0}}$ can only be rejected if its $p$-value does not exceed $\alpha$. Here the probabilities are calculated under the given configuration of true and false null hypotheses, equivalently, under the assumption that $\{H_{0,q_{0}}:\mathrm{rank}(K_{X,L})=q_{0}\}$ is true (which automatically ensures that the other hypotheses are false). Since $q_{0}$ is arbitrary, the FWER is controlled at level $\alpha$.

Finally, if $\mathrm{rank}(k_{X})<d$ , we have

[TABLE]

Thus, the control over the FWER translates into a control over the probability of over-estimating the true rank.
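The stepwise rule described in this section can be sketched as follows, assuming the $p$-values $p_{1},\ldots,p_{d}$ have already been obtained. The function name `stepwise_rank` and its return convention are hypothetical, chosen for illustration only.

```python
def stepwise_rank(p_values, alpha=0.05):
    """Bottom-up stepwise procedure.

    p_values[q - 1] is the p-value for H_{0,q}, q = 1, ..., d.
    Returns (reject_global_null, declared_rank); the declared rank is None
    when every local null up to and including the d-th one is rejected.
    """
    for q, p in enumerate(p_values, start=1):
        if p > alpha:          # fail to reject H_{0,q}: stop and declare rank q
            return False, q
    return True, None          # all d local nulls rejected: reject the global null
```

For instance, with $p$-values $(0.001, 0.01, 0.4, 0.0001)$ and $\alpha=0.05$ the procedure stops at the third step and declares rank 3; the small fourth $p$-value is never examined.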

To implement the procedure, we will require the $p$ -values $\{p_{q}\}$ corresponding to (an appropriately re-scaled version of) the test statistic $T_{q}$ under $H_{0,q}$ . To this aim, the next two sections determine the large- $n$ sampling distribution of $n\times T_{q}$ under $H_{0,q}$ and describe a valid bootstrap procedure for approximating $p$ -values $\{p^{*}_{q}\}$ under $H_{0,q}$ in practice. En route, they also establish the consistency of the resulting test (and bootstrap procedure) as $n\rightarrow\infty$ under $H_{1,q}$ , for all $L$ sufficiently large.

2.4 Asymptotic Theory

To justify the use of the test statistic $T_{q}$ for testing $\{H_{0,q}\,\mathrm{vs}\,H_{1,q}\}$ (for some given $q\leq d$ ), we will derive its asymptotic distribution under the null $H_{0,q}$ and the alternative $H_{1,q}$ as $n\rightarrow\infty$ for any $q\leq d$ and $L>L_{\dagger}$ , after appropriate re-scaling (by $n$ , in particular). To this aim, we introduce the functional,

[TABLE]

Furthermore, we collect the following assumptions:

Assumption (C):

The covariance kernel $k_{X}(\cdot,\cdot)$ is continuous on $[0,1]^{2}$ , the grid nodes $\{t_{1},...,t_{L}\}$ are regularly spaced, and $\mathrm{var}[\varepsilon_{ij}]=\sigma^{2}_{j}\in[0,\infty)$ .

Assumption (H):

Under $H_{0,q}$ , there exists a factor $C_{0}\in\mathbb{R}^{L\times q}$ of $K_{X,L}$ , i.e. $K_{X,L}=C_{0}C_{0}^{\top}$ , such that the Hessian $\nabla^{2}\Psi(C_{0})$ is non-singular.

Remark 3 (On The Hessian Condition).

A sufficient condition for (H) to hold true is

Assumption (E):

The $q$ leading eigenvectors of $K_{X,L}$ have non-zero entries.

In particular, if (E) is valid, then $C_{0}$ can be taken to be equal to $V\Lambda^{1/2}$ where $K_{X,L}=V\Lambda V^{\top}$ is the eigendecomposition of $K_{X,L}$ , and the Hessian $\nabla^{2}\Psi(V\Lambda^{1/2})$ is provably non-singular. Condition (E), and hence Assumption (H), is automatically satisfied in all the settings listed in Remark 2. See Section 5.3 for more details.

We can now state our second main result:

Theorem 2 (Asymptotic Distribution of the Test Statistic).

Suppose that Assumptions (C) and (H) hold and let $q\leq d\leq L_{\dagger}<\infty$ be as in Theorem 1. Denote the weak (centered Gaussian) limit of $\sqrt{n}(\widehat{K}_{W,L}-K_{W,L})$ by the random matrix $Z$ . Then, for any $L>L_{\dagger}$ ,

•

When $H_{0,q}$ is valid, we have as $n\rightarrow\infty$

[TABLE]

•

When $H_{1,q}$ is valid, $nT_{q}$ diverges to infinity as $n\rightarrow\infty$ .

The theorem justifies the use of $nT_{q}$ as a test statistic: though $T_{q}$ will not be precisely zero even when the true rank is $q$ , the test statistic will converge to zero under $H_{0,q}$ , with an asymptotic variance of the order of $n^{-2}$ . The diffuse limiting law of $nT_{q}$ under $H_{0,q}$ in principle allows for calibration (though it does depend on unknown quantities, see the next Section). That $nT_{q}$ diverges under $H_{1,q}$ establishes the consistency of a test based on $nT_{q}$ .

2.5 Bootstrap Calibration

Since the limiting null distribution of $nT_{q}$ established in Theorem 2 depends on unknown quantities, we consider a bootstrap strategy in order to generate approximate $p$ -values of $nT_{q}$ for testing the pair $\{H_{0,q},H_{1,q}\}$ . If $H_{0,q}$ is truly valid, then a naïve bootstrap would suffice. But if $H_{1,q}$ is actually valid instead, a naïve bootstrap will fail to correctly approximate the sought $p$ -values under $H_{0,q}$ . In effect, we need a re-centering (or rather, re-ranking) scheme in order to generate bootstrap replications “conforming” to $H_{0,q}$ , even when $H_{1,q}$ holds true in reality. The purpose of this section is to present such a scheme and establish its validity.

The proposed bootstrap scheme is:

(1)

Find a minimizer $\widehat{\Theta}_{q}$ of $\|P_{L}\circ(\widehat{K}_{W,L}-\Theta)\|_{F}$ over nonnegative definite matrices $\Theta$ satisfying $\mathrm{rank}(\Theta)\leq q$ .

(2)

For each $1\leq i\leq n$ , define

$\widehat{m}({\bf W}_{i})=\overline{{\bf W}}+\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}({\bf W}_{i}-\overline{{\bf W}}).$

where $\overline{{\bf W}}=n^{-1}\sum_{i=1}^{n}{\bf W}_{i}$. Under the null hypothesis $\{H_{0,q}:\mathrm{rank}(K_{X,L})=q\}$, this is an estimator of the best linear predictor of the discretely sampled curve ${\bf X}_{i}$ given the noise-corrupted version ${\bf W}_{i}$, i.e. $m({\bf W}_{i})=\overline{{\bf W}}+K^{(q)}_{X,L}{K}_{W,L}^{-1}({\bf W}_{i}-\overline{{\bf W}}).$

(3)

Estimate $D$ by $\widehat{D}$, the diagonal matrix with $j$th diagonal element

$\widehat{D}(j,j)=\max\{\widehat{K}_{W,L}(j,j)-\widehat{\Theta}_{M}(j,j),0\},$

where $\widehat{\Theta}_{M}$ is a minimiser of $\|P_{L}\circ(\widehat{K}_{W,L}-\Theta)\|_{F}$ over nonnegative definite matrices $\Theta$ satisfying $\mathrm{rank}(\Theta)\leq M$ , and

$M=m_{n}\mathbf{1}\{m_{n}<d\}+d\mathbf{1}\{m_{n}\geq d\}$

with

${m_{n}=\min\left\{m\geq q:T_{m}\leq\epsilon\frac{{\log n}}{n}\right\}},$

and $0<\epsilon\leq 1$ an arbitrary constant.

(4a)

Draw $n$ bootstrap observations $U^{*}_{1},U^{*}_{2},\ldots,U^{*}_{n}$ from $\{\widehat{m}({\bf W}_{i}):1\leq i\leq n\}$ .

(4b)

Draw $n$ i.i.d. observations $V^{*}_{1},V^{*}_{2},\ldots,V^{*}_{n}$ from an $L$ -dimensional centered Gaussian distribution with covariance matrix $\widehat{D}+\widehat{A}$ , where $\widehat{A}:=\widehat{\Theta}_{q}-\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q}$ .

(5)

Define the $L$ -vectors $\bm{\zeta}_{j}=U^{*}_{j}+V^{*}_{j}$ for $j=1,2,\ldots,n$ .

(6)

Let $F^{*}_{q}$ be the law of

$T_{q}^{*}=\min_{\Theta\in\mathbb{R}^{L\times L}:\,\mathrm{rank}(\Theta)\leq q}\left\|P_{L}\circ\left(\frac{1}{n}\sum_{j=1}^{n}\bm{\zeta}_{j}\bm{\zeta}_{j}^{\top}-\Theta\right)\right\|^{2}_{F}.$

(7)

To test the pair $\{H_{0,q},H_{1,q}\}$ use the bootstrap $p$ -value $p^{*}_{q}=F^{*}_{q}(T_{q}).$
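The two scalar rules inside Step (3) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the statistics $T_{m}$ are assumed to be available as a dictionary keyed by rank, and the diagonals of $\widehat{K}_{W,L}$ and $\widehat{\Theta}_{M}$ are passed as plain lists.

```python
import math

def select_M(T, q, d, n, eps=1.0):
    """M = m_n if m_n < d, else d, with m_n = min{m >= q : T[m] <= eps*log(n)/n}.

    T maps a rank m to the statistic T_m; 0 < eps <= 1 is the arbitrary constant.
    """
    threshold = eps * math.log(n) / n
    for m in range(q, d):
        if T[m] <= threshold:
            return m           # first rank below the threshold, and m < d
    return d                   # m_n >= d (or no rank qualifies): cap at d

def estimate_noise_diag(K_hat_diag, Theta_M_diag):
    """Diagonal of D-hat: max{K_hat(j,j) - Theta_M(j,j), 0} for each j."""
    return [max(k - th, 0.0) for k, th in zip(K_hat_diag, Theta_M_diag)]
```

The truncation at zero keeps every estimated noise variance nonnegative, even when the fitted low-rank diagonal overshoots the empirical one.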

Of course, in practice we use $B<\infty$ random samples $\{\bm{\zeta}_{1,b},...,\bm{\zeta}_{n,b}\}_{b=1}^{B}$ to approximate the $p$-value $p^{*}_{q}$ in Step (7) by

[TABLE]

where

[TABLE]
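A Monte Carlo approximation of this kind can be sketched as below. The sketch assumes that the $p$-value is the upper-tail proportion of bootstrap statistics that are at least as large as the observed one, which is the natural calibration for a test that rejects for large values of $T_{q}$; whether the comparison is strict is an assumption made here. The function name is hypothetical.

```python
def bootstrap_p_value(T_obs, T_boot):
    """Proportion of the B bootstrap statistics that are >= the observed one."""
    B = len(T_boot)
    return sum(1 for t_star in T_boot if t_star >= T_obs) / B
```

With $B=500$ replicates, as used in the simulations reported later, the resulting $p$-values are resolved in steps of $1/500$.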

If one is willing to assume that the measurement errors are homoskedastic, one can replace $\widehat{D}$ in Step (3) by its diagonally averaged version,

[TABLE]

The next remark explains the heuristic behind the bootstrap procedure, and the theorem succeeding it establishes the bootstrap procedure’s validity. The procedure’s finite sample performance is investigated thoroughly in the next Section.

Remark 4 (Bootstrap Heuristic).

Assume that the errors $\varepsilon_{ij}$ in (2.3) are Gaussian. Let $X^{(q)}_{i}(u)=\sum_{m=1}^{q}\langle X_{i},\varphi_{m}\rangle_{L^{2}}\varphi_{m}(u)$ be the $q$ -truncated Karhunen-Loève expansion of the curve $X_{i}$ and $\mathbf{X}_{i}^{(q)}=\{X_{i}^{(q)}(t_{j})\}_{j=1}^{L}$ its discrete version when evaluated at the $\{t_{j}\}_{j=1}^{L}$ . If we had access to the $L$ -vectors $\{\mathbf{X}_{i}^{(q)}\}_{i=1}^{n}$ and $\{\bm{\varepsilon}_{i}\}_{i=1}^{n}$ , then we would generate a bootstrap sample conforming to $H_{0,q}$ by means of constructing $n$ random $L$ -vectors

[TABLE]

These bootstrapped vectors would have covariance matrix $K^{(q)}_{X,L}+D$ , where

[TABLE]

If instead of observing $\{\mathbf{X}_{i}^{(q)}\}_{i=1}^{n}$ and $\{\bm{\varepsilon}_{i}\}_{i=1}^{n}$ , we only had access to their covariance $K^{(q)}_{X,L}$ and $D$ , then we would do the “next best thing”, i.e. replace $\mathbf{X}^{(q)}_{i}$ by its best linear predictor given the actual observations,

[TABLE]

and replace $\bm{\delta}_{i}$ by

[TABLE]

The reason this is the “next best thing” is that the resulting $m(\mathbf{W}_{i})+V_{i}$ has zero mean and covariance matrix

[TABLE]

In other words, $\zeta_{i}=m(\mathbf{W}_{i})+V_{i}$ is a

“rank $q$ proxy version of $\bm{X}_{i}$ + Gaussian measurement error”

whose first and second moments match those of the ideal (but unobservable) bootstrap samples $\mathbf{W}^{(q)}_{i}=\mathbf{X}^{(q)}_{i}+\bm{\delta}_{i}$ (and thus when $X$ and $\varepsilon$ are Gaussian, their laws match, too).

The idea of the bootstrap procedure is to materialise this heuristic, replacing the unknown matrices $\{K_{W,L}^{-1},K^{(q)}_{X,L},D\}$ by their “hat counterparts” $\{\widehat{K}^{-1}_{W,L},\widehat{\Theta}_{q},\widehat{D}\}$ . In particular, as part of the next theorem, the informal statement that the bootstrap scheme generates samples conforming to $H_{0,q}$ even when $H_{1,q}$ is true will be made rigorous, by means of establishing validity of the bootstrap.
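To see why the “hat” version of the heuristic reproduces the intended second-order structure, one can track covariances conditionally on the data. The following short calculation is a sketch under two assumptions: that $\widehat{K}_{W,L}$ is the empirical covariance of the $\mathbf{W}_{i}$ with the same normalisation as the resampling distribution in Step (4a), and that $U^{*}_{j}$ and $V^{*}_{j}$ are drawn independently. Writing $\mathrm{cov}^{*}$ for covariance under the bootstrap law, and using the symmetry of $\widehat{\Theta}_{q}$ and $\widehat{K}_{W,L}$,

$$\mathrm{cov}^{*}(U^{*}_{j})=\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\,\widehat{K}_{W,L}\,\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q}=\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q},$$

$$\mathrm{cov}^{*}(\bm{\zeta}_{j})=\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q}+\widehat{D}+\widehat{\Theta}_{q}-\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q}=\widehat{\Theta}_{q}+\widehat{D}.$$

The bootstrap vectors thus have a covariance whose off-diagonal part is that of the rank-$q$ matrix $\widehat{\Theta}_{q}$, so that they conform to $H_{0,q}$ by construction.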

Theorem 3 (Bootstrap Validity).

Let $q\leq d\leq L_{\dagger}<\infty$ be as in Theorem 1 and assume that (C) and (H) hold true. Let $p^{*}_{q}=F^{*}_{q}(T_{q})$ be the bootstrapped $p$ -value as defined in Step (7) of the bootstrap procedure above. Then, for all $L>L_{\dagger}$ ,

•

When $H_{0,q}$ holds true, one has

[TABLE]

provided the underlying processes $\{X_{i}\}$ and errors $\{\varepsilon_{ij}\}$ are Gaussian.

•

When $H_{1,q}$ holds true, one has

[TABLE]

Remark 5.

Regardless of whether or not the $\{\mathbf{W}_{i}\}$ are Gaussian, as part of the proof of the theorem we establish that under $H_{0,q}$ the (random) bootstrap law $F_{q}^{*}$ converges pointwise almost surely to the distribution function of the random variable

[TABLE]

where $C_{0}$ is as in Assumption (H), and the random vector $Z_{\dagger}$ is the (centred Gaussian) weak limit of $\sqrt{n}\{\frac{1}{n}\sum_{j=1}^{n}\zeta_{j}\zeta_{j}^{\top}-(\widehat{\Theta}_{q}+\widehat{D})\}$ under $H_{0,q}$. When the $\{\mathbf{W}_{i}\}$ are Gaussian, the covariance of $Z_{\dagger}$ coincides with that of the centred Gaussian $Z$ encountered in Theorem 2, and so the bootstrap distribution asymptotically coincides with the limiting law of $nT_{q}$ under $H_{0,q}$ as given by Theorem 2.

When the $\{\mathbf{W}_{i}\}$ are not Gaussian, it is not guaranteed that the centred Gaussians $Z_{\dagger}$ and $Z$ will share the same covariance. Hence the large $n$ limit of $p^{*}_{q}=F^{*}_{q}(T_{q})$ (given by $G(T_{q})$) may not behave as a uniform random variable under $H_{0,q}$, leading to a significance level different from the nominal one. We investigate the potential effect of non-Gaussianity on calibration of the bootstrap in our simulation study (Section 3), and find that this effect is negligible (in fact undetectable). We expect that Gaussianity can be weakened to higher-order moment conditions, at the expense of an even lengthier proof.

2.6 Practical Implementation

We now discuss practical aspects related to the implementation of our procedure.

2.6.1 Hypothesis Boundary, Grid Size, Bootstrap Parameters

Recall that the global hypothesis pair (2.4) to be tested is given by $\{H_{0}:\mathrm{rank}(k_{X})\leq d\}$ versus $\{H_{1}:\mathrm{rank}(k_{X})>d\}$ for some prescribed $d<\infty$. Notice, furthermore, that the bottom-up nature of our iterative testing procedure (Section 2.3) translates to the FWER remaining invariant to the choice of boundary value $d$ in the global hypothesis pair (2.4). This means that as far as FWER control is concerned, we may choose $d$ as we wish. Indeed, we are even free to “data snoop” when choosing $d$ to set up the global hypothesis pair, i.e. formulate our hypothesis boundary by looking at the data.

The only constraint on the choice of $d$ is the need to ensure that the grid size $L$ is sufficiently large relative to $d$ for our identifiability result (Theorem 1) to hold true. As per Remark 2, it suffices to have grid size $L\geq 2d+1$ for virtually any type of covariance operator encountered in FDA practice, so it is prudent to always respect the constraint $d\leq\lfloor(L-1)/2\rfloor$. Of course, one can always choose $d$ to be smaller if an inspection of the data suggests so: for instance we can set $d$ to be a value near an elbow of the off-diagonal scree plot

$$j\mapsto T^{(j)}-T^{(j-1)},$$

provided this choice does not exceed $\lfloor(L-1)/2\rfloor$. (Use of the off-diagonal rather than the classical scree plot is recommended, since the former is immune to the presence of measurement errors when $n$ is large.)

The value $M$ in Step (3) of the bootstrap procedure can similarly be chosen by inspection of the off-diagonal scree plot, as its formal definition suggests: it should represent an elbow of the graph, but can be taken no larger than our choice of $d$ .

These observations motivate the following practical recommendations:

(I)

The boundary $d$ should be no larger than $\lfloor(L-1)/2\rfloor$ .

(II)

In particular, $d$ can be chosen empirically, for instance as a value near an elbow of the off-diagonal scree-plot $j\mapsto T^{(j)}-T^{(j-1)}$ .

(III)

If the empirical choice is equivocal or exceeds $\lfloor(L-1)/2\rfloor$ , we simply recommend fixing $d=\lfloor(L-1)/2\rfloor$ .

(IV)

Either way, we recommend setting $M$ in Step (3) of the bootstrap procedure as the minimum of $d$ or a value slightly above an elbow of the off-diagonal scree-plot $j\mapsto T^{(j)}-T^{(j-1)}$ .
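Recommendations (I) to (IV) amount to two small capping rules, sketched below. The function names, the convention that an equivocal elbow is passed as `None`, and the `slack` parameter standing in for “slightly above an elbow” are all hypothetical choices made for illustration.

```python
def hypothesis_boundary(L, elbow=None):
    """Boundary d: the scree-plot elbow if usable, else floor((L - 1) / 2)."""
    cap = (L - 1) // 2
    if elbow is None or elbow > cap:   # equivocal or too large: recommendation (III)
        return cap
    return elbow                       # recommendation (II)

def bootstrap_rank_M(d, elbow, slack=1):
    """M: a value slightly above the elbow, capped at d (recommendation (IV))."""
    return min(d, elbow + slack)
```

For a grid of size $L=25$ the cap is $d=12$, and for $L=50$ it is $d=24$.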

In our simulations, we set $d=\lfloor(L-1)/2\rfloor$ throughout for reasons of automation. As for $M$, we inspected the off-diagonal scree plots from a sample simulation run in each scenario, and fixed the value of $M$ as a value distinctly above an apparent elbow in that run’s plot, to accommodate potential variation in other realisations of the plot (unless this exceeded $d$, in which case we took $M=d$). This yielded excellent results irrespective of the simulation setting.

2.6.2 Computation

Recall that evaluation of the test statistic $T^{(j)}$ requires the solution of the optimisation problem

[TABLE]

This being a non-convex optimisation problem, we cannot ensure that standard techniques like gradient descent will converge to a global minimum (note that there are infinitely many minima when using the parametrisation $CC^{\top}$, due to the fact that if $C_{1}$ is a minimum, so is $C_{1}V$ for any $j\times j$ orthogonal matrix $V$).

However, recent work by Chen and Wainwright, (2015) shows that projected gradient descent methods with a suitable starting point have a high probability of returning a “good” local optimum in factorised matrix completion problems. For our simulation study, we used the in-built solver optim in the R software with starting point $C_{1}=U_{j}\Sigma_{j}^{1/2}$ , where $U\Sigma U^{\top}$ is the spectral decomposition of $\widehat{K}_{W,L}$ , $U_{j}$ is the matrix obtained by retaining the first $j$ columns of $U$ , and $\Sigma_{j}$ is the matrix obtained by keeping the first $j$ rows and columns of $\Sigma$ . Although we do not exactly use the approach by Chen and Wainwright, (2015), it is seen in the simulations that our chosen method of optimisation converges reasonably quickly and yields stable results.

Although our procedure involves bootstrapping a statistic whose value is the solution of a non-convex problem, its implementation was feasible in quite reasonable computational time in all the simulations that we carried out. A single implementation of our bootstrap test procedure, when run on a 64-bit Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz machine with 16 GB RAM, typically took about 10 seconds when the sample size was $n=150$ and the grid size was $L=50$.

3 Simulation study

We will now investigate the finite sample performance of our procedure. Recall that in our notation

[TABLE]

where $\{\varphi_{j},\lambda_{j}\}$ are the eigenfunction/eigenvalue pairs of $k_{X}$ and the principal component scores $Y_{j}=\int_{0}^{1}X(u)\varphi_{j}(u)du$ satisfy $E(Y_{j})=0$ and $Var(Y_{j})=\lambda_{j}$ for all $1\leq j\leq r_{\mathrm{true}}$. We observe $W_{ij}=X_{i}(t_{j})+\varepsilon_{ij}$ for $1\leq i\leq n$ and $1\leq j\leq L$, where $0<t_{1}<t_{2}<\ldots<t_{L}<1$ are equispaced grid points. For the purposes of the simulation, the errors $\{\varepsilon_{ij}\}$ are taken to be independent and normally distributed, potentially heteroskedastic in the grid index, $\varepsilon_{ij}\stackrel{\mathrm{i.i.d.}}{\sim}N(0,\sigma_{j}^{2})$ for each $1\leq j\leq L$. We will initially define our simulation scenarios with homoskedastic errors, and in a later section switch to heteroskedastic regimes.

3.1 Homoskedastic errors

In the case of homoskedastic measurement errors, we consider the following models (and we comment on their features as we define them):

Model A1

$r_{\mathrm{true}}=3$ , $\mu(t)=5(t-0.6)^{2}$ , $(\lambda_{1},\lambda_{2},\lambda_{3})=(0.6,0.3,0.1)$ , $Y_{j}\sim N(0,\lambda_{j})$ , $\varphi_{1}(t)=1$ , $\varphi_{2}(t)=\sqrt{2}\sin(2{\pi}t)$ , $\varphi_{3}(t)=\sqrt{2}\cos(2{\pi}t)$ , and $\sigma_{j}^{2}=1$ for all $j$ .

Model A2

Same as Model A1 except that we now set $\varphi_{3}(t)=\sqrt{2}\cos(4{\pi}t)$ , and $Y_{j}$ now has a mixture distribution that is $N(2\sqrt{\lambda_{j}/3},\lambda_{j}/3)$ with probability $1/3$ and $N(-\sqrt{\lambda_{j}/3},\lambda_{j}/3)$ with probability $2/3$ . Thus, the $X$ -paths are somewhat “curvier” and the principal component scores follow skewed Gaussian mixture models. The latter is chosen to investigate the behaviour of the bootstrap procedure for non-Gaussian processes (see Remark 5).

Model A3

$r_{\mathrm{true}}=3$ , $\mu(t)=12.5(t-0.5)^{2}-1.25$ , $(\lambda_{1},\lambda_{2},\lambda_{3})=(4,2,1)$ , $Y_{j}\sim N(0,\lambda_{j})$ , $\varphi_{1}(t)=1$ , $\varphi_{2}(t)=\sqrt{2}\cos(2{\pi}t)$ , $\varphi_{3}(t)=\sqrt{2}\sin(4{\pi}t)$ , and $\sigma_{j}^{2}=2$ for all $j$ .

Model A4

Same as Model A3 but with principal component scores having a skewed Gaussian mixture law as in Model A2.

Model A5

$r_{\mathrm{true}}=6$ , $\mu(t)=0$ , $(\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5},\lambda_{6})=(4,3.5,3,2.5,2,1.5)$ , $Y_{j}\sim N(0,\lambda_{j})$ , $\varphi_{1}(t)=1$ , $\varphi_{2k}(t)=\sqrt{2}\sin(2k{\pi}t)$ for $k=1,2,3$ , $\varphi_{2k+1}(t)=\sqrt{2}\cos(2k{\pi}t)$ for $k=1,2$ , and $\sigma_{j}^{2}=3$ for all $j$ .
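It may help to verify that the skewed mixture used for the scores in Models A2 and A4 has the stated first two moments. Writing $a=\sqrt{\lambda_{j}/3}$ (a shorthand introduced here), the mixture puts mass $1/3$ on $N(2a,a^{2})$ and mass $2/3$ on $N(-a,a^{2})$, so that

$$E(Y_{j})=\tfrac{1}{3}(2a)+\tfrac{2}{3}(-a)=0,$$

$$Var(Y_{j})=a^{2}+\tfrac{1}{3}(2a)^{2}+\tfrac{2}{3}(-a)^{2}=a^{2}+2a^{2}=\lambda_{j},$$

$$E(Y_{j}^{3})=\tfrac{1}{3}(2a)^{3}+\tfrac{2}{3}(-a)^{3}+3a^{2}\,E(Y_{j})=2a^{3}>0.$$

The scores are therefore centred with variance $\lambda_{j}$, as required, but are right-skewed and hence non-Gaussian.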

Models (A1)-(A3) are similar to those considered in Li et al., (2013). To go beyond globally defined eigenfunctions, the next set of models feature piecewise polynomial eigenfunctions.

Model S1

$r_{\mathrm{true}}=6$, $\mu(t)=5(t-0.6)^{2}$, $(\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5},\lambda_{6})=(2,1.7,1.4,1.1,0.8,0.5)$, $Y_{j}\sim N(0,\lambda_{j})$, the eigenfunctions $\varphi_{j}$ are orthonormalised functions obtained from the basis of cubic splines with knots at $(0.3,0.5,0.7)$, and $\sigma_{j}^{2}=3$ for all $j$.

Model S2

The model parameters are the same as in Model S1, the only difference being that the principal component scores are now distributed according to the skewed Gaussian mixture form in Model A2.

Model S3

$r_{\mathrm{true}}=4$, $\mu(t)=5(t-0.6)^{2}$, $(\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4})=(1.4,1.1,0.8,0.5)$, $Y_{j}\sim N(0,\lambda_{j})$, the $\varphi_{j}$’s are orthonormalised functions obtained from the basis of quadratic splines with knots at $(0.2,0.6)$, and $\sigma_{j}^{2}=2$ for all $j$.

Model S4

The model parameters are the same as in Model S3 except that the principal component scores now have a skewed Gaussian mixture distribution as in Model A2.

Model S5

$r_{\mathrm{true}}=3$, $\mu(t)=5(t-0.6)^{2}$, $(\lambda_{1},\lambda_{2},\lambda_{3})=(1.1,0.8,0.5)$, the $\varphi_{j}$’s are orthonormalised functions obtained from the basis of linear splines with knots at $(0.2,0.6)$, and $\sigma_{j}^{2}=1$ for all $j$. The principal component scores now have the same skewed Gaussian mixture form as in Model A2.

For each of these models, we have considered two combinations of sample size $n$ and grid size $L$, namely $(n,L)$ equal to $(150,25)$ and $(150,50)$, to emulate more sparsely/densely observed settings. The parameter $M$ described in the bootstrap algorithm in the previous section is set to $M=10$ for all the simulations in this and the next sub-section. As discussed in the previous section, we can choose $M$ by visual inspection of the off-diagonal scree plot $j\mapsto T^{(j)}-T^{(j-1)}$. When using this approach in trial runs from each scenario, the plot was suggestive of $M=9$ for models A5, S1 and S2, and $M=6$ for the other models. The fixed value of $M=10$ was thus chosen for use across the simulation scenarios.
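To make the data-generating mechanism concrete, the following sketch draws one data set from Model A1 with $(n,L)=(150,25)$ and forms the empirical covariance of the observed vectors. It is an illustration only: the function names are hypothetical, the placement of the equispaced nodes at $t_{j}=j/(L+1)$ is an assumption (the text only requires $0<t_{1}<\ldots<t_{L}<1$), and so is the normalisation of the covariance by $n$.

```python
import math
import random

def simulate_model_a1(n=150, L=25, seed=0):
    """Draw W[i][j] = X_i(t_j) + eps_ij under Model A1 (Gaussian scores, unit noise)."""
    rng = random.Random(seed)
    # Equispaced nodes strictly inside (0, 1); the exact placement is an assumption.
    t = [(j + 1) / (L + 1) for j in range(L)]
    lam = (0.6, 0.3, 0.1)
    phi = (lambda u: 1.0,
           lambda u: math.sqrt(2) * math.sin(2 * math.pi * u),
           lambda u: math.sqrt(2) * math.cos(2 * math.pi * u))
    W = []
    for _ in range(n):
        scores = [rng.gauss(0.0, math.sqrt(l)) for l in lam]
        W.append([5 * (u - 0.6) ** 2                      # mean function mu(t)
                  + sum(y * f(u) for y, f in zip(scores, phi))
                  + rng.gauss(0.0, 1.0)                   # measurement error
                  for u in t])
    return t, W

def empirical_covariance(W):
    """L x L sample covariance of the rows of W, normalised by n (an assumption)."""
    n, L = len(W), len(W[0])
    mean = [sum(row[j] for row in W) / n for j in range(L)]
    return [[sum((row[p] - mean[p]) * (row[q] - mean[q]) for row in W) / n
             for q in range(L)] for p in range(L)]

t, W = simulate_model_a1()
K_hat = empirical_covariance(W)
```

The matrix `K_hat` plays the role of $\widehat{K}_{W,L}$; only its off-diagonal entries enter the test statistic.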

To probe the performance of the bootstrap procedure, we set the number of bootstrap samples to $B=500$ and set the significance level to $\alpha=0.05$. For each model, we carried out $100$ independent replications to report the empirical distribution of the estimated rank. We benchmark the performance of our procedure against several well-known techniques for selecting the rank of functional data, namely: the AIC based criterion ($AIC_{yao}$) in Sec. 2.5 of Yao et al., (2005); the modified AIC ($AIC_{m}$) and modified BIC ($BIC_{m}$) criteria proposed in equations (16) and (6), respectively, in Li et al., (2013); and the modified information theoretic criteria $PC_{p1}$ and $IC_{p1}$ given in equation (20) in Li et al., (2013). The information theoretic criteria are inspired by similar techniques in Bai and Ng, (2002) who used them to estimate the number of factors in an approximate factor model. We underline that these procedures are used purely for the purpose of benchmarking, since these are procedures whose purpose is model selection and thus are geared toward inducing parsimony, though some come with theoretical guarantees of consistently selecting the true rank asymptotically, if the rank is truly finite. The results are tabulated in Tables 1–4.

It is observed from Tables 1–4 that the proposed method selects the true rank in at least $90\%$ of the iterations for all of the chosen models, irrespective of whether the true rank is large/small, the observation grid is sparse/dense, the distribution is Gaussian or not, the signal is smooth/rough, and the noise is large/small compared to the signal. In fact, when $(n,L)=(150,50)$, the bootstrap procedure chooses the true rank in all the $100$ iterations under all of the above simulation models. Moreover, the evidence (as seen from the magnitude of the $p$-values) is quite strong. In cases where the detection of the true rank is not perfect, we found that on making the test procedure more conservative (by choosing a smaller $\alpha$, e.g., $\alpha=0.01$ or $0.001$), the rate of correct identification of the rank surged to $100\%$.

It is observed from the results shown in Tables 1–4 that $AIC_{yao}$ estimates the true rank accurately if the rank is large (equal to $6$ as in Models (A5), (S1) and (S2)). When the rank is small, the performance of $AIC_{yao}$ varies depending on the model. Investigating a bit more, it may be observed that it overestimates the rank under Models A1 and A2 (rank = 3), where the error dominates the leading eigenvalue of the signal. On the other hand, for Models (S3) and (S4) (rank = 4), $AIC_{yao}$ accurately selects the true rank. When the rank is small (equal to $3$ or $4$ ) but the grid is dense ( $L=50$ ), it is seen that $AIC_{yao}$ grossly over-estimates the true rank in almost all models. This over-estimation is exacerbated when the eigenfunctions are trigonometric, which is surprising since one would expect this to be an easier setting than in (S3), (S4) and (S5). The over-estimation of the rank by $AIC_{yao}$ was also observed by Li et al., (2013).

The $AIC_{m}$ , $PC_{p1}$ and $IC_{p1}$ criteria do not perform well in general and mostly under-estimate the rank irrespective of the sample size and the sparse/dense regime. The $BIC_{m}$ procedure, on the other hand, yields the same perfect estimation results as our procedure when the grid is dense ( $L=50$ ). It does so also when the grid is sparse provided that the true rank is small (equal to $3$ or $4$ ). However, for Models (S1) and (S2) with $L=25$ , where the rank is large (equal to $6$ ) and the grid is sparse, the $BIC_{m}$ criterion mostly selects a smaller rank. This is different from its performance under Model (A5) with $L=25$ (which is also of rank $6$ ), where it selects the true rank in $67\%$ of iterations. This difference in the behaviour of $BIC_{m}$ may be attributed to the fact that for Model (A5), the eigenfunctions are smooth, while they are only twice continuously differentiable for Models (S1) and (S2) due to the presence of knots.

Summarising the observations from Tables 1–4, it may be concluded that the $BIC_{m}$ and the $AIC_{yao}$ criteria are the most appropriate among the competing information-based procedures. Some tentative conclusions on the two methods are as follows. While the latter works well when the rank is large (irrespective of the sparsity/denseness of the grid), the former is suited when the grid is dense (irrespective of the magnitude of the rank). The $BIC_{m}$ procedure also works very well when the grid is sparse, provided that the rank is small. However, both procedures appear to be quite sensitive to departures from the above situations – $AIC_{yao}$ grossly over-estimates, while $BIC_{m}$ mildly under-estimates. Note that the difference in performance is observed between $L=25$ and $L=50$ . This change in the number of observations is not so stark as to be classified immediately as sparse versus dense, and the fact that the performance of these two procedures varies under such a moderate change of grid size is concerning. We also mention in passing that the performance of the $AIC_{yao}$ and the $BIC_{m}$ procedures crucially depends on the choice of the smoothing parameters. Indeed, Li et al., (2013) considered models similar to Models (A1)-(A5) but worked with an undersmoothing choice of the bandwidth parameter, and the relative performance of the above two procedures differs from that observed in our simulation results.

On the other hand, Tables 1–4 show that our proposed procedure always selects the true rank in at least $90\%$ of the iterations (the percentage being much higher in most cases), irrespective of the magnitude of the rank and the sparsity/denseness of the grid. Thus, the proposed method seems to provide an effective and stable alternative. Beyond this advantage, our method also comes with a probabilistic guarantee on overestimation, and hence provides an automatic quantification of uncertainty about the true rank, while not relying on smoothing.

3.2 Heteroskedastic errors

Our theory suggests that our testing procedure automatically adapts to a heteroskedastic variance structure for the measurement errors. We therefore consider the same model scenarios as before, but this time with heteroskedastic errors, in order to gauge how this translates into practical performance. All else being the same, the measurement error variances are now given by

[TABLE]

where $U=L/5$ , $k=1,2,\ldots,U$ and $p=1,2,\ldots,5$ . This specific error structure may be viewed from the perspective of a local averaging of the signal along with a downscaling by a factor of $3/2$ . For these simulation models, the results obtained are provided in Tables 5 to 8. It is observed that the performance of the proposed procedure remains invariant to the presence of heteroskedasticity, as our theory predicts.

3.3 Spiked functional data

One may also consider a spiked covariance model, in analogy to high-dimensional statistics (see, e.g., Johnstone, (2001), Paul, (2007)) where some of the eigenvalues are considerably larger than the rest (Amini and Wainwright, 2012). One instance of the latter setting is the Tecator data set considered in Section 4. This is a particularly challenging setting: heuristically, a prominent bend is expected to appear in the scree plot, well before the index value of the true rank (see Figure 2 which shows the scree plots for the spiked models considered immediately below). To probe the performance of our method in this setting, we also consider the following spiked scenarios:

Model SF1

Model (A1) is modified to now have $(\lambda_{1},\lambda_{2},\lambda_{3})=(4,0.2,0.1)$ and $\sigma_{j}^{2}=1$ for all $j$ . Here the first eigenvalue explains about $93\%$ of the total variation in the signal. Note that the error variance is five and ten times the size of the penultimate and last eigenvalue, respectively.

Model SF2

Model (A5) is modified to have $(\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5},\lambda_{6})=(5,4,0.2,0.2,0.1,0.1)$ and $\sigma_{j}^{2}=1$ . Here the top two eigenvalues explain $93.75\%$ of the total variation in the signal, and the trailing eigenvalues are of order between 1/5 and 1/10 the size of the noise variance.

Model SF3

Model (A5) is modified to have $(\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5},\lambda_{6})=(4,3.5,3,0.3,0.2,0.1)$ and $\sigma_{j}^{2}=3$ . Here the top three eigenvalues explain about $95\%$ of the total variation in the signal, and the last three are between 1/10 and 1/300 the size of the noise variance. This is a very challenging setup which features several trailing eigenvalues, the last of which has size negligible relative to the noise variance.
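The explained-variation figures quoted for the three spiked models can be recomputed directly from the listed eigenvalues. The following Python sketch does so; the helper name `explained_fraction` is ours and is introduced only for illustration.

```python
import numpy as np

# Signal eigenvalues of the three spiked models, copied from the text.
spiked_models = {
    "SF1": [4, 0.2, 0.1],
    "SF2": [5, 4, 0.2, 0.2, 0.1, 0.1],
    "SF3": [4, 3.5, 3, 0.3, 0.2, 0.1],
}

def explained_fraction(eigenvalues, k):
    """Fraction of the total signal variance carried by the k largest eigenvalues."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return lam[:k].sum() / lam.sum()

frac_sf1 = explained_fraction(spiked_models["SF1"], 1)  # leading eigenvalue
frac_sf2 = explained_fraction(spiked_models["SF2"], 2)  # top two eigenvalues
frac_sf3 = explained_fraction(spiked_models["SF3"], 3)  # top three eigenvalues

for name, frac in [("SF1", frac_sf1), ("SF2", frac_sf2), ("SF3", frac_sf3)]:
    print(f"{name}: {100 * frac:.2f}% of signal variation in the spiked part")
```

The same helper can be applied to any other spectrum to gauge how strongly spiked it is before choosing $M$ .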

Table 9 gives the empirical distribution of the selected rank for each of the above three models when $(n,L)=(150,25)$ and $(150,50)$ – sparse and dense grids, contrasting our proposed method with the benchmark methods. The value of $M=10$ was used again in these settings. Intriguingly, it is observed that the proposed method yields near perfect estimation of the rank under all of the above models. The only exception is the most challenging model (SF3), and this only when the grid is sparse, where our method returns the true rank or the true rank minus one in approximately a 50-50 split. Note that this is the scenario where the smallest eigenvalue is 1/300 the size of the error variance and the grid is sparse. The results suggest a certain degree of robustness of the proposed procedure against extreme forms of the spectrum.

By comparison, the benchmark procedures markedly underperform relative to our procedure. The $AIC_{m}$ , $PC_{p1}$ and $IC_{p1}$ procedures yield poor results as in the previous two subsections. The $AIC_{yao}$ procedure performs similarly to the earlier simulations, namely, it does well when the rank is large, the grid is sparse and the error variance is small. Its performance degrades significantly if the grid is dense or the rank is small, which results in overestimation. When the error variance is not small, $AIC_{yao}$ underestimates the true rank. The most striking difference in performance is observed for the $BIC_{m}$ procedure, which now heavily underestimates the true rank in the spiked regime. The situation does not improve much even if we take dense grids (here $L=50$ ). While this can be explained by the fact that $BIC_{m}$ is a model selection procedure targeting parsimonious models, it does also show that the consistency of $BIC_{m}$ may be slow to manifest in unbalanced spectra.

3.4 Infinite dimensional models

We now probe the finite sample performance of the procedure when the data are truly infinite dimensional, even prior to noise contamination; and we compare this with the output of model selection-based alternative procedures in such situations. To this aim, we consider infinite dimensional models $X(t)=\sum_{j=1}^{\infty}Y_{j}\varphi_{j}(t),t\in[0,1]$ with $E(Y_{j})=0$ , $Var(Y_{j})=\lambda_{j}>0$ for all $j=1,2,\ldots$ . The measurement error again satisfies $\epsilon_{ij}\stackrel{\mathrm{i.i.d.}}{\sim}N(0,\sigma_{j}^{2})$ for each $i$ . We consider four settings:

Model I1

$X$ is a standard Brownian motion, which features polynomial decay of eigenvalues and non-differentiable sample paths. Also, $\sigma_{j}^{2}=1$ for $1\leq j\leq L$ .

Model I2

$X$ is a Gaussian process with $k_{X}(t,s)=\exp\{-(t-s)^{2}/10\}$ – which features exponential decay of eigenvalues and infinitely smooth paths. Also, $\sigma_{j}^{2}=1$ for $1\leq j\leq L$ .

Model I3

$X$ is as in Model (I1). However, $\sigma_{j}^{2}=t_{j}$ for $1\leq j\leq L$ , where $t_{1}<t_{2}\ldots<t_{L}<1$ is the observation grid.

Model I4

$X$ is as in Model (I2). However, $\sigma_{j}^{2}=t_{j}$ for $1\leq j\leq L$ , where $0<t_{1}<t_{2}\ldots<t_{L}$ is the observation grid.
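For concreteness, data from Models (I1)-(I4) could be generated along the following lines. This is a minimal Python sketch under stated assumptions: the regular grid $t_{j}=j/(L+1)$ , the function name `simulate_noisy_curves` and the random seed are our own illustrative choices, and the Brownian motion covariance $\min(t,s)$ is the standard one.

```python
import numpy as np

def simulate_noisy_curves(kernel, noise_var, n, L, rng):
    """Draw n centred Gaussian curves on a grid of size L and add independent noise.

    kernel    : vectorised function (t, s) -> k_X(t, s)
    noise_var : vectorised function t_j -> sigma_j^2
    """
    # Assumed grid: L equally spaced interior points of (0, 1).
    t = np.arange(1, L + 1) / (L + 1)
    K = kernel(t[:, None], t[None, :])
    # Symmetric square root via the eigendecomposition; tiny negative
    # eigenvalues caused by rounding are clipped to zero.
    w, V = np.linalg.eigh(K)
    root = V * np.sqrt(np.clip(w, 0.0, None))
    X = rng.standard_normal((n, L)) @ root.T
    noise = rng.standard_normal((n, L)) * np.sqrt(noise_var(t))
    return t, X + noise

def brownian(t, s):
    """Covariance of standard Brownian motion (Models I1 and I3)."""
    return np.minimum(t, s)

def gaussian(t, s):
    """Squared-exponential covariance of Models I2 and I4."""
    return np.exp(-(t - s) ** 2 / 10)

rng = np.random.default_rng(0)
n, L = 150, 25
t, W_I1 = simulate_noisy_curves(brownian, lambda t: np.ones_like(t), n, L, rng)
t, W_I2 = simulate_noisy_curves(gaussian, lambda t: np.ones_like(t), n, L, rng)
t, W_I3 = simulate_noisy_curves(brownian, lambda t: t, n, L, rng)
t, W_I4 = simulate_noisy_curves(gaussian, lambda t: t, n, L, rng)
```

Each returned array has one row per curve and one column per grid point, so its empirical covariance plays the role of $\widehat{K}_{W,L}$ .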

Inspection of the off-diagonal scree plot in a trial run from each scenario suggested no evident elbow below $\lfloor(L-1)/2\rfloor$ , and so as per the recommendations of Section 2.6.1, we chose $M=\lfloor(L-1)/2\rfloor$ in each case. Tables 10 and 11 give the estimated ranks in $100$ iterations under Models (I1)-(I4) for $(n,L)=(150,25)$ and $(150,50)$ .

It is observed that the model selection procedures like $AIC_{yao}$ , $AIC_{m}$ and $BIC_{m}$ target some level of parsimonious representation of the data, the degree of parsimony depending on the method used. Unsurprisingly, they fail to inform us on whether the model is truly infinite dimensional or not (similar to having low power in the testing paradigm). In the majority of cases, regardless of scenario, the chosen rank is in fact between 1 and 3. By contrast, the proposed method exhibits very good performance in terms of power, typically rejecting low-dimensional representations across all scenarios. In the case of dense grids ( $L=50$ ), the procedure never chose a rank below 15. In the case of a sparser grid ( $L=25$ ), the results varied somewhat between homoskedastic and heteroskedastic noise settings. In the two heteroskedastic scenarios, the procedure chose a rank of at least ten in 75% and 85% of runs. In the two homoskedastic scenarios, these percentages were modestly lower at about 56% and 62%. When we incorporated the assumption of homoskedasticity in the procedure (as per the comment in Section 2.5), the performance surged in the two homoskedastic scenarios, with a rank of at least ten being chosen in 95% and 96% of runs. This suggests that, when operating with sparse grids, it can be beneficial in terms of power to make use of homoskedasticity if this can indeed be assumed.

4 Data Analysis

We will apply the bootstrap technique for estimating the rank to some benchmark data sets. The first of these is the well-known Tecator dataset which contains spectrometric curves for $n=215$ samples of finely chopped meat (see Ferraty and Vieu, (2006)). Each curve corresponds to the absorbances measured over $L=100$ wavelengths. A standard functional PCA followed by a scree plot of the eigenvalues reveals an essentially finite dimensional structure since the eigenvalues decay to zero very fast. A scree-plot approach would suggest the underlying rank to be three or four. In fact, the top four eigenvalues are $0.2613$ , $0.0024$ , $0.0008$ and $0.0003$ . The percentages of total variation explained by these principal components are $98.679\%$ , $0.901\%$ , $0.296\%$ and $0.114\%$ , respectively. So the first four eigenvalues explain $99.99\%$ of the total variation. Since these data are recorded to high precision, and the curves are very smooth, it may be safely assumed that the measurements are essentially error-free. We will artificially add i.i.d. noise to the data and then apply our method and the alternative procedures considered in the previous section to evaluate their performance. Also, we will vary the error variance to investigate the effect of the magnitude of the signal-to-noise ratio on the rank selection algorithms.

The errors are taken to be i.i.d. centered Gaussian with variances $1,0.5,0.1,0.05,0.01,0.005,0.001,0.0005$ and $0.0001$ . These values range from “noise dominating signal completely” to “noise smaller than fourth largest eigenvalue”. For our procedure and each value of the noise variance, we choose $M=10$ as suggested by the off-diagonal scree plot.

Table 12 shows the estimated ranks obtained from the different procedures under the chosen levels of the error variance. It is seen that unless the error variance is very small (comparable to the fourth largest eigenvalue), $AIC_{yao}$ generally chooses unrealistically high values of the rank. In the other situations, the rank is chosen to be one. On the other hand, all of $AIC_{m}$ , $PC_{p1}$ and $IC_{p1}$ select the rank to be one unless the error variance completely overwhelms the signal. The $BIC_{m}$ procedure always selects the rank as one. These observations can be explained by noting that the Tecator data is an example of a spiked functional dataset, and the same behaviour of these model selection procedures for such data was found in Section 3.3. The procedure proposed in the paper estimates the rank to be three or four in all cases where the error variance is interlaced and comparable with the second/third/fourth eigenvalues. Only when the error variance is very small (1/3 of the fourth largest eigenvalue) is the rank overestimated (as being six), which is arguably a modest deviation.

Thus, when the error variance is moderate (neither too small nor overwhelming the signal), only the proposed method seems to provide a proper estimate of the rank of the Tecator data.

The next data set that we consider concerns the number of eggs laid by each of $1000$ female Mediterranean fruit flies (medflies), Ceratitis capitata, in a fertility study described in Carey et al., (1998). The data (accessible at http://anson.ucdavis.edu/~mueller/data/medfly1000.txt) contain the total number of eggs laid by each medfly as well as the daily breakup of the number of eggs laid. It is discussed in Carey et al., (1998) that there is a change in the pattern of egg production at day $51$ post birth for those medflies which lived past that age. Also, the variation in the number of eggs laid from day $51$ onwards is in general much larger than that before day $51$ . Taking these observations into account, it seems more pertinent to look at the egg-laying data till the age of $50$ days for those medflies that live past that age. This results in a sample of $n=145$ medflies. Since the number of eggs laid in days $1$ to $3$ for these medflies equals zero, we only keep the number of eggs laid from day $4$ onwards for our analysis.

Among the competing procedures, $AIC_{yao}$ estimates the rank of the data to be equal to $9$ while $BIC_{m}$ selects the rank to be $7$ . All of $AIC_{m}$ , $PC_{p1}$ and $IC_{p1}$ select the rank to be one, which appears way off based on a visual inspection of the data. The bootstrap procedure proposed in this paper is carried out by selecting $M=10$ . In fact, the off-diagonal scree plot as well as the results obtained from the competing methods indicate that the rank is likely smaller than $10$ . Our procedure selects the rank to be $7$ at significance level $\alpha=1\%$ . Further, our bootstrap test rejects the hypotheses $H_{0,q}$ for $q=1,2,\ldots,6$ with $p$ -values that are numerically zero.
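The way a sequence of tests of $H_{0,q}$ translates into a rank estimate can be sketched as follows. The Python snippet assumes the stepwise rule "report the smallest $q$ whose null hypothesis is not rejected at level $\alpha$ "; the function name `stepwise_rank` and the $p$ -values fed to it are made up for illustration and are not results from the paper.

```python
def stepwise_rank(p_values, alpha):
    """Return the smallest q such that H_{0,q} is not rejected at level alpha.

    p_values[q - 1] is the p-value of the test of H_{0,q}. Returns None when
    every hypothesis in the supplied list is rejected.
    """
    for q, p in enumerate(p_values, start=1):
        if p > alpha:  # first non-rejection stops the procedure
            return q
    return None

# Made-up p-values, for illustration only: six rejections, then a non-rejection.
toy_p_values = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.37]
estimated_rank = stepwise_rank(toy_p_values, alpha=0.01)
all_rejected = stepwise_rank([0.0, 0.0], alpha=0.01)
```

With these toy inputs the first six hypotheses are rejected and the procedure stops at $q=7$ , mirroring the pattern reported above for the medfly data.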

Our procedure thus yields the same result as the $BIC_{m}$ approach, in this case, in addition to providing a confidence level. We compared the $AIC_{m}$ , the $BIC_{m}$ and the $AIC_{yao}$ approaches by computing the average relative squared error

[TABLE]

where $\widehat{X_{i}}(\cdot)=\widehat{\mu}(\cdot)+\sum_{j=1}^{\widehat{r}}\widehat{\xi}_{ij}\widehat{\phi}_{j}(\cdot)$ is the prediction of $X_{i}(\cdot)$ using the PACE estimates of $\mu$ , the $\phi_{j}$ ’s and the $\xi_{ij}$ ’s (see Yao et al., (2005)). For computing the $ARSE$ for each approach, we use the estimated value $\widehat{r}$ of the rank obtained from the corresponding approach. It is found that the $ARSE$ for the $AIC_{yao}$ approach (with $\widehat{r}=9$ ) equals $0.200$ and the $ARSE$ for the $BIC_{m}$ approach (with $\widehat{r}=7$ ) is $0.204$ . Note that since our approach yields the same estimate of the rank as the $BIC_{m}$ approach, the $ARSE$ for our approach is also equal to $0.204$ . Thus, there is no significant improvement in the $ARSE$ by considering $9$ principal components (obtained using $AIC_{yao}$ ) instead of $7$ (obtained using our approach or $BIC_{m}$ ). The $ARSE$ of the $AIC_{m}$ approach (as well as that of the $PC_{p1}$ and the $IC_{p1}$ approaches) equals $4.258$ . It would seem that these three approaches perform poorly in determining the true rank of the process in this example.

5 Appendix

5.1 Proofs of Formal Statements

We will first state and prove some auxiliary results that will simplify the proofs of our main results.

Lemma 1.

If the covariance kernel $k_{X}$ is continuous and $\mathrm{rank}(k_{X})\geq d$ , then we can find $u_{1}<\ldots<u_{d}$ such that the matrix $\{k_{X}(u_{i},u_{j})\}_{i,j=1}^{d}$ is of full rank $d$ .

Proof.

Using Mercer’s theorem we may write

[TABLE]

for any collection of $d$ points $\{x_{j}\}_{j=1}^{d}$ . Note that both $A$ and $B$ are non-negative definite matrices, so it suffices to prove that we can find $(u_{1},...,u_{d})$ such that the matrix $\{A(u_{i},u_{j})\}_{i,j=1}^{d}$ is of full rank $d$ . We may write

[TABLE]

where

[TABLE]

and of course $\mathrm{det}(A)=\mathrm{det}^{2}(U)$ . We claim that there exists a $d$ -tuple such that $\mathrm{det}(U)\neq 0$ . For suppose that $\mathrm{det}(U)=0$ for all $(x_{1},\ldots,x_{d})$ . Using the Leibniz formula for the determinant this translates to

[TABLE]

where $\mathrm{Sym}(d)$ is the permutation group on $d$ elements and $\mathrm{sgn}(\pi)$ is the signature of a permutation $\pi$ . Keeping $(x_{1},\ldots,x_{d-1})$ fixed, multiply both sides of the equation by $\lambda^{1/2}_{d}\varphi_{d}(x_{d})$ and integrate with respect to $x_{d}$ to get:

[TABLE]

Repeating the same process, multiplying both sides of the equation by $\lambda^{1/2}_{d-j}\varphi_{d-j}(x_{d-j})$ for $j\in\{1,\ldots,d-1\}$ while keeping the remaining variables fixed, and then integrating with respect to $x_{d-j}$ eventually yields

[TABLE]

This last equality contradicts the fact that $\mathrm{rank}(k_{X})\geq d$ . Thus $\mathrm{det}(U)\neq 0$ for at least one $d$ -tuple, say $(v_{1},\ldots,v_{d})$ . The elements of this $d$ -tuple will necessarily be distinct, because $U$ would otherwise have two coincident rows, contradicting $\mathrm{det}(U)\neq 0$ , and hence we may re-order them to get the sought $u_{1}<\ldots<u_{d}$ . ∎
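Lemma 1 can be illustrated numerically on a toy kernel. In the Python sketch below, the cosine basis $\sqrt{2}\cos(j\pi x)$ , the eigenvalues and the nodes are illustrative assumptions of ours (not objects from the paper); the rank is computed with an explicit tolerance to avoid rounding artefacts.

```python
import numpy as np

def toy_kernel(s, t, eigenvalues):
    """Finite-rank kernel sum_j lambda_j phi_j(s) phi_j(t), phi_j(x) = sqrt(2) cos(j pi x)."""
    total = 0.0
    for j, lam in enumerate(eigenvalues, start=1):
        total = total + lam * 2.0 * np.cos(j * np.pi * s) * np.cos(j * np.pi * t)
    return total

eigenvalues = [3.0, 1.0, 0.5]        # illustrative spectrum, so d = 3
d = len(eigenvalues)

# Distinct nodes u_1 < u_2 < u_3 at which the d x d kernel matrix has full rank.
u = np.array([0.1, 0.4, 0.7])
gram = toy_kernel(u[:, None], u[None, :], eigenvalues)
rank_distinct = np.linalg.matrix_rank(gram, tol=1e-8)

# A repeated node gives two coincident rows, so full rank is lost.
u_rep = np.array([0.1, 0.1, 0.7])
gram_rep = toy_kernel(u_rep[:, None], u_rep[None, :], eigenvalues)
rank_repeated = np.linalg.matrix_rank(gram_rep, tol=1e-8)
```

The second computation mirrors the closing remark of the proof: a $d$ -tuple with non-vanishing determinant must consist of distinct points.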

Corollary 1.

If the covariance kernel $k_{X}$ is continuous and $\mathrm{rank}(k_{X})\geq d$ , then we can find $u_{1}<\ldots<u_{d}$ and $\delta>0$ such that:

1. The balls $B_{\delta}(u_{j})=[u_{j}-\delta,u_{j}+\delta]$ are pairwise disjoint;

2. The matrix $\{k_{X}(v_{i},v_{j})\}_{i,j=1}^{d}$ is of full rank $d$ for all $(v_{1},\ldots,v_{d})$ such that $v_{i}\in B_{\delta}(u_{i})$ .

Proof.

By Lemma 1 we know that there exist $u_{1}<\ldots<u_{d}$ such that $\{k_{X}(u_{i},u_{j})\}_{i,j=1}^{d}$ is of full rank. Define the function $\Delta:[0,1]^{d}\rightarrow\mathbb{R}$ as

[TABLE]

Since $k_{X}$ is uniformly continuous on $[0,1]^{2}$ it follows that so is $\Delta$ on $[0,1]^{d}$ . Now

[TABLE]

It follows that there exists a $\delta>0$ depending on the modulus of continuity of $k_{X}$ , such that $|\Delta(v_{1},\ldots,v_{d})|>0$ whenever $|v_{j}-u_{j}|<\delta$ , i.e. whenever $v_{j}\in B_{\delta}(u_{j})$ , the ball of radius $\delta$ centred at $u_{j}$ . Since the $\{u_{j}\}$ are pairwise distinct, we can take $\delta$ sufficiently small so that the balls $B_{\delta}(u_{j})$ are also pairwise disjoint. ∎

Proof of Proposition 1.

If $\mathrm{rank}(k_{X})\geq d$ , we can choose nodes $u_{1}<\ldots<u_{d}$ and corresponding balls $\{B_{\delta}(u_{j})\}_{j=1}^{d}$ as in the statement of Corollary 1. Since $\{t_{1},\ldots,t_{L}\}$ are regularly spaced nodes for any $L$ , there is a finite $L_{*}$ such that for $L>L_{*}$ each of the $d$ balls $B_{\delta}(u_{j})$ contains at least $1$ grid point. Since the balls are disjoint, one can thus choose a subcollection $\{t_{j_{1}},\ldots,t_{j_{d}}\}$ of $d$ distinct grid points such that the matrix $\{k_{X}(t_{j_{p}},t_{j_{q}})\}_{p,q=1}^{d}$ has full rank $d$ , as ensured by Corollary 1. It follows that $K_{X,L}$ has a non-vanishing minor of order $d$ , and hence $\mathrm{rank}(K_{X,L})\geq d$ . ∎

The next lemma will be used in the proof of Theorem 1. Informally, it states that a diagonal entry $a_{q,q}$ of an $L\times L$ matrix $A$ of rank $d<L$ can be imputed from the off-diagonal entries of $A$ provided there is a non-vanishing $d$ -minor of $A$ that does not depend on $a_{q,q}$ .

Lemma 2.

Let $A=\{a_{i,j}\}_{i,j=1}^{L}$ be an $L\times L$ matrix of rank $d<L$ and let $C=\{a_{i_{p},j_{p^{\prime}}}\}_{p,p^{\prime}=1}^{d}$ be a $d\times d$ submatrix of $A$ that is also of rank $d$ . If $i_{p}\neq q$ and $j_{p}\neq q$ for all $p\in\{1,\ldots,d\}$ , it follows that the diagonal element $a_{q,q}$ is uniquely determined as a continuous function of $C$ and the entries $\{a_{q,j_{p}}\}_{p=1}^{d}\cup\{a_{i_{p},q}\}_{p=1}^{d}$ .

Proof.

Since the statement is invariant to conjugations $A\mapsto PAP^{\top}$ by permutation matrices $P$ , we may assume without loss of generality that $q>i_{d}$ and $q>j_{d}$ . Thus we may extract a $(d+1)\times(d+1)$ submatrix $D$ of $A$ , of the form

[TABLE]

where $u=(a_{q,j_{1}},\ldots,a_{q,j_{d}})$ , $v^{\top}=(a_{i_{1},q},...,a_{i_{d},q})$ . Now $\mathrm{det}(D)=0$ because $\mathrm{rank}(A)=d$ , so we may write

[TABLE]

showing that $a_{q,q}$ is uniquely determined as a rational function of the entries of $C$ , $u$ , and $v$ . ∎
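The imputation in Lemma 2 can be checked numerically. The Python sketch below uses the block-determinant identity $\mathrm{det}(D)=\mathrm{det}(C)\,(a_{q,q}-uC^{-1}v)$ , so that $\mathrm{det}(D)=0$ with $C$ invertible forces $a_{q,q}=uC^{-1}v$ . The random test matrix, the index sets and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
L, d = 8, 3

# A symmetric L x L matrix of rank d built from a generic factor.
F = rng.standard_normal((L, d))
A = F @ F.T

q = 7                       # (0-based) index of the diagonal entry to impute
rows = [0, 1, 2]            # i_1, ..., i_d, none equal to q
cols = [3, 4, 5]            # j_1, ..., j_d, none equal to q

C = A[np.ix_(rows, cols)]   # d x d block made of off-diagonal entries only
u = A[q, cols]              # entries a_{q, j_p}
v = A[rows, q]              # entries a_{i_p, q}

# det(D) = det(C) * (a_qq - u C^{-1} v) = 0 and det(C) != 0 give a_qq = u C^{-1} v.
a_qq_imputed = u @ np.linalg.solve(C, v)
a_qq_true = A[q, q]
```

Because `rows` and `cols` are disjoint and exclude `q`, the imputed value uses no diagonal entry of `A`, which is exactly what is needed when the diagonal is contaminated by noise.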

Proof of Theorem 1.

Without loss of generality, we will prove the theorem for the largest possible value of $q$ , i.e. for $q=d$ . When $\mathrm{rank}(k_{X})\geq d$ , as is considered in the conclusions (1) and (2) of the Theorem’s statement, we can choose $u_{1}<\ldots<u_{d}$ and $\{B_{\delta}(u_{j})\}_{j=1}^{d}$ as in the statement of Corollary 1. Since $\{t_{1},\ldots,t_{L}\}$ are regularly spaced for any $L$ , there is a finite $L_{\dagger}=L_{\dagger}(d)$ such that for all $L>L_{\dagger}$ each of the $d$ balls $B_{\delta}(u_{j})$ contains at least $3$ grid points. It follows that for any $L>L_{\dagger}$ the matrix $K_{X,L}$ contains a $3d\times 3d$ submatrix $S_{X,L}$ , which can be organised into a $d\times d$ matrix of $3\times 3$ blocks, with the property that: any $d\times d$ submatrix of $S_{X,L}$ extracted by retaining one row from each of the $d$ consecutive triples of rows, and one column from each of the $d$ consecutive triples of columns, has rank $d$ (Figure 3 provides a visualisation).

Moreover, it is possible to extract such rank- $d$ submatrices of $S_{X,L}$ that contain no diagonal cells of $K_{X,L}$ among their entries (simply by making sure that we do not choose the same order of row and column from corresponding consecutive triples of rows/columns, e.g. always picking the first row from each consecutive row triple and the second column from each consecutive column triple).

Finally, given any specific diagonal element $K_{X,L}(p,p)$ of $K_{X,L}$ , we can extract a $d\times d$ submatrix of $S_{X,L}$ of rank $d$ that: (a) contains no diagonal cells of $K_{X,L}$ , and (b) has row/column indices distinct from the index $p$ of the diagonal entry $K_{X,L}(p,p)$ . To see this, notice that any submatrix constructed as in the preceding paragraph satisfies both (a) and (b) whenever the index $p$ is not among the row/column indices forming $S_{X,L}$ . Otherwise, the diagonal element $K_{X,L}(p,p)$ in question is contained on the diagonal of one of the $d$ blocks of size $3\times 3$ along the diagonal of $S_{X,L}$ , say the bottom right block without loss of generality. In this case choose the first row from every row triple and the second column from every column triple, except for the last row/column triples. From the last row/column triples: choose the third row and second column of this block, if $K_{X,L}(p,p)$ is the top-left element in that block; choose the first row and third column of this block, if $K_{X,L}(p,p)$ is the central element in that block; choose the first row and second column of this block, if $K_{X,L}(p,p)$ is the bottom-right element in that block. See Figure 4 for an illustration.

Collecting all these facts, we can now see that for $L>L_{\dagger}$ ,

1. When $\mathrm{rank}(k_{X})=d$ , the matrix $K_{X,L}$ uniquely solves the equation $\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}=0$ among $L\times L$ matrices $\Theta$ of rank $d$ . To see this, let $K_{X,L}(q,q)$ be an element on the diagonal of $K_{X,L}$ , for some $q\leq L$ . Then, based on the discussion in the previous paragraph, we can find a $d\times d$ submatrix $C_{q}$ of $S_{X,L}$ of rank $d$ that contains no diagonal elements of $K_{X,L}$ and no elements from the $q$ th column or row of $K_{X,L}$ . As a result, Lemma 2 implies that we can determine $K_{X,L}(q,q)$ uniquely as a continuous function of the elements of $C_{q}$ and of the off-diagonal entries in the $q$ th row and column. Repeating this process for all $q\in\{1,\ldots,L\}$ effectively shows that any rank $d$ matrix that coincides with $K_{X,L}$ off the diagonal must also coincide with $K_{X,L}$ on the diagonal (equivalently that there exists a continuous function $\Xi$ , such that $K_{X,L}=\Xi(P_{L}\circ K_{W,L})$ ).

2. When $\mathrm{rank}(k_{X})>d$ , there is no $L\times L$ matrix $\Theta$ of rank less than $d$ such that $\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}=0$ . This is because there exists a $d\times d$ submatrix $A$ of $S_{X,L}$ of full rank $d$ that contains no diagonal elements of $K_{X,L}$ . So $\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}=0$ would imply that $\mathrm{rank}(\Theta)\geq d$ . Indeed, when $\mathrm{rank}(k_{X})>d$ , we have the stronger statement that $\inf_{\mathrm{rank}(\Theta)\leq d-1}\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}>0$ . This is because

[TABLE]

where $A$ is the rank $d$ submatrix of $K_{X,L}$ as in point 1 above and $\Theta_{A}$ is the corresponding submatrix of $\Theta$ . Since $\Theta$ is of rank at most $d-1$ , so is $\Theta_{A}$ . Hence

[TABLE]

where $\gamma_{d}(A)\neq 0$ is the $d$ -th singular value of $A$ . The set of matrices of rank at most $d-1$ being closed now shows that $\inf_{\mathrm{rank}(\Theta)\leq d-1}\|P_{L}\circ(K_{W,L}-\Theta)\|_{F}>0$ .

Taken together, statements 1 and 2 above yield the theorem as stated, and thus complete the proof.

∎
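The role of the off-diagonal mask $P_{L}$ in Theorem 1 can be seen in a small numerical example. The Python sketch below assumes the additive structure $K_{W,L}=K_{X,L}+\mathrm{diag}(\sigma_{1}^{2},\ldots,\sigma_{L}^{2})$ implied by independent measurement errors; the particular matrices and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 10, 3

F = rng.standard_normal((L, d))
K_X = F @ F.T                               # rank-d signal covariance on the grid
sigma2 = rng.uniform(0.5, 2.0, size=L)      # heteroskedastic error variances
K_W = K_X + np.diag(sigma2)                 # covariance of the noisy observations

def off_diagonal(M):
    """P_L ∘ M: keep the off-diagonal entries and set the diagonal to zero."""
    return M - np.diag(np.diag(M))

rank_signal = np.linalg.matrix_rank(K_X, tol=1e-8)
rank_noisy = np.linalg.matrix_rank(K_W, tol=1e-8)
same_off_diagonal = np.allclose(off_diagonal(K_W), off_diagonal(K_X))
```

The ridge on the diagonal makes `K_W` full rank even though `K_X` has rank `d`, while the off-diagonal entries of the two matrices agree; this is why fitting only the off-diagonal entries recovers the rank of the signal.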

In order to prove Theorem 2, we introduce some additional shorthand notation, in the form of the following spaces and functionals:

[TABLE]

Proof of Theorem 2.

Assumption C guarantees the validity of Theorem 1. We will throughout assume that $L$ is finite and fixed, and satisfies $L\geq L_{\dagger}$ , where $L_{\dagger}<\infty$ is the critical grid size whose existence is guaranteed by Theorem 1. We will divide the proof into several parts. As defined earlier, $\psi(C)=\|P_{L}\circ(K_{X,L}-CC^{\top})\|_{F}=\|P_{L}\circ(K_{W,L}-CC^{\top})\|_{F}$ . Also, $\widehat{\psi}(C)=\|P_{L}\circ(\widehat{K}_{W,L}-CC^{\top})\|_{F}$ . Then, the test statistic can be written as

[TABLE]

The proof will be broken down into the following sequence of steps:

1. First we will determine the gradients and Hessians of the functionals $\Psi=\psi^{2}$ and $\widehat{\Psi}=\widehat{\psi}^{2}$ .

2. Then we will show the strong consistency of any empirical minimiser $\widehat{K}_{X,L}$ of the functional $\widehat{\pi}$ for the minimiser $K_{X,L}$ of $\pi$ .

3. We will translate this into consistency of an appropriately chosen factor $\widehat{C}$ of $\widehat{K}_{X,L}$ (i.e. $\widehat{K}_{X,L}=\widehat{C}\widehat{C}^{\top}$ ) to the factor $C_{0}$ of $K_{X,L}$ defined in Assumption (H).

4. Finally, we will use the penultimate step combined with a Taylor expansion of $\Psi$ and $\widehat{\Psi}$ in order to determine the sought weak convergence.

Step 1: We begin by determining the gradient and Hessian of $\Psi:=\psi^{2}$ , denoted by $\nabla\Psi$ and $\nabla^{2}\Psi$ , respectively. Since $\psi$ is a real valued function of a matrix, $\nabla\Psi(C)$ is a matrix and $\nabla^{2}\Psi(C)$ is a tensor (Kronecker product). Note that for any $S\in\mathbb{R}^{L\times q}$ , we have

[TABLE]

The last equality is obtained by using the fact that for a symmetric matrix $A$ , we have $\langle A,CS^{\top}\rangle_{F}=\mathrm{tr}(SC^{\top}A)=\mathrm{tr}(ACS^{\top})=\langle AC,S\rangle_{F}$ and $\langle A,SC^{\top}\rangle_{F}=\mathrm{tr}(CS^{\top}A)=\mathrm{tr}(ACS^{\top})=\langle AC,S\rangle_{F}$ . Thus,

[TABLE]

where $\widehat{\Psi}:=\widehat{\psi}^{2}$ , and the form of $\nabla\widehat{\Psi}$ follows from the same calculations as above.
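As a sanity check on the directional-derivative computation, one can compare a finite-difference quotient with the closed form that the calculation above leads to, namely $\nabla\Psi(C)=-4\,\{P_{L}\circ(K-CC^{\top})\}C$ for a symmetric matrix $K$ . The Python sketch below does this; the matrices, the step size and the function names are illustrative choices of ours.

```python
import numpy as np

def off_diagonal(M):
    """P_L ∘ M: zero out the diagonal of M."""
    return M - np.diag(np.diag(M))

def Psi(C, K):
    """Squared Frobenius norm of the off-diagonal residual K - C C^T."""
    return np.sum(off_diagonal(K - C @ C.T) ** 2)

def grad_Psi(C, K):
    """Gradient of Psi with respect to C, valid for symmetric K."""
    return -4.0 * off_diagonal(K - C @ C.T) @ C

rng = np.random.default_rng(3)
L, q = 6, 2
G = rng.standard_normal((L, L))
K = G @ G.T                                  # symmetric test matrix
C = rng.standard_normal((L, q))
S = rng.standard_normal((L, q))              # direction of differentiation

t = 1e-6
finite_difference = (Psi(C + t * S, K) - Psi(C - t * S, K)) / (2 * t)
analytic = np.sum(grad_Psi(C, K) * S)        # <grad Psi(C), S>_F
```

The two numbers should agree up to discretisation and rounding error; the same check applies verbatim to $\widehat{\Psi}$ with $K$ replaced by an empirical covariance matrix.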

To determine the Hessian, we note that for any $R\in\mathbb{R}^{L\times q}$ , we have

[TABLE]

Now observe that for any $L\times L$ matrix $A$ , we have $P_{L}\circ A=A-\sum_{j=1}^{L}\mathcal{P}_{j}A\mathcal{P}_{j}$ , where $\mathcal{P}_{j}$ is the matrix whose $(j,j)$ th entry is one and all other entries are zero. So,

[TABLE]

Next, recall that for compatible matrices $Q_{1},Q_{2},Q_{3}$ and $Q_{4}$ , we have $\langle Q_{2}Q_{3}Q_{4}^{\top},Q_{1}\rangle_{F}=\mathrm{tr}(Q_{1}^{\top}Q_{2}Q_{3}Q_{4}^{\top})=(\mathrm{vec}(Q_{1}))^{\top}(Q_{4}\otimes Q_{2})\mathrm{vec}(Q_{3})$ , where $\mathrm{vec}$ denotes the standard vectorization operator. Hence,

[TABLE]

where $M$ is the commutation matrix of order $(L,q)$ , i.e. the permutation matrix satisfying $\mathrm{vec}(R^{\top})=M\mathrm{vec}(R)$ for $R\in\mathbb{R}^{L\times q}$ . Further, $\mathrm{vec}(\mathcal{P}_{j}R)=\mathrm{vec}(\mathcal{P}_{j}RI_{q})=(I_{q}\otimes\mathcal{P}_{j})\mathrm{vec}(R)$ , which implies that $(\mathrm{vec}(\mathcal{P}_{j}S))^{\top}=(\mathrm{vec}(S))^{\top}(I_{q}\otimes\mathcal{P}_{j})^{\top}=(\mathrm{vec}(S))^{\top}(I_{q}^{\top}\otimes\mathcal{P}_{j}^{\top})=(\mathrm{vec}(S))^{\top}(I_{q}\otimes\mathcal{P}_{j})$ . So,

[TABLE]
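The vectorisation identities used in this part of the calculation are easy to verify numerically. The following Python sketch checks the trace/Kronecker identity, the defining property of the commutation matrix, and the identity for $\mathrm{vec}(\mathcal{P}_{j}R)$ , using column-stacking for $\mathrm{vec}$ . The dimensions and random inputs are illustrative, and the helper names are ours; to avoid a clash with the truncation level $M$ , the commutation matrix is called `Kmat` in the code.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorisation operator."""
    return M.reshape(-1, order="F")

def commutation_matrix(L, q):
    """Permutation matrix with commutation_matrix(L, q) @ vec(R) == vec(R.T) for R of size L x q."""
    Kmat = np.zeros((L * q, L * q))
    for i in range(L):
        for j in range(q):
            # R[i, j] sits at position j*L + i in vec(R) and at i*q + j in vec(R.T).
            Kmat[i * q + j, j * L + i] = 1.0
    return Kmat

rng = np.random.default_rng(4)
L, q = 4, 3
Q1 = rng.standard_normal((L, L))
Q2 = rng.standard_normal((L, q))
Q3 = rng.standard_normal((q, q))
Q4 = rng.standard_normal((L, q))
R = rng.standard_normal((L, q))

# <Q2 Q3 Q4^T, Q1>_F = vec(Q1)^T (Q4 ⊗ Q2) vec(Q3)
trace_side = np.trace(Q1.T @ Q2 @ Q3 @ Q4.T)
kron_side = vec(Q1) @ np.kron(Q4, Q2) @ vec(Q3)

# vec(R^T) = Kmat vec(R)
Kmat = commutation_matrix(L, q)
commutation_ok = np.allclose(Kmat @ vec(R), vec(R.T))

# vec(P_j R) = (I_q ⊗ P_j) vec(R), with P_j the (j, j) coordinate projection
j = 2
P_j = np.zeros((L, L))
P_j[j, j] = 1.0
projection_ok = np.allclose(np.kron(np.eye(q), P_j) @ vec(R), vec(P_j @ R))
```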

Observe that $\lim_{t\rightarrow 0}t^{-1}\{\langle\nabla\Psi(C+tR)-\nabla\Psi(C),S\rangle_{F}\}$ equals $\langle\nabla^{2}\Psi(C)\mathrm{vec}(R),\mathrm{vec}(S)\rangle$ . Thus, using equations (5.4), (5.5) and (5.6), we have

[TABLE]

Now note that $\sum_{j=1}^{L}(I_{q}\otimes\mathcal{P}_{j})=I_{q}\otimes(\sum_{j=1}^{L}\mathcal{P}_{j})=I_{q}\otimes I_{L}=I_{qL}$ . Also, $I_{q}\otimes\mathcal{P}_{j}$ is the projection matrix onto the rows $\{j,j+L,j+2L,\ldots,j+(q-1)L\}$ for each $j=1,2,\ldots,L$ . Thus, for a matrix $B$ of order $qL$ , we have $P_{qL}\circ B=B-\sum_{j=1}^{L}(I_{q}\otimes\mathcal{P}_{j})B(I_{q}\otimes\mathcal{P}_{j})$ . Hence,

[TABLE]

Let us also note that for a matrix $A$ of order $L$ ,

[TABLE]

and $P_{qL}\circ(I_{q}\otimes A)$ sets the diagonal entries of this matrix equal to zero, equivalently, the diagonal entries of each $A$ on the diagonal equal to zero. Thus, $P_{qL}\circ(I_{q}\otimes A)=I_{q}\otimes(P_{L}\circ A)$ . Next, for a matrix $E=\{e_{i,j}\}_{i,j=1}^{q}$ , we have

[TABLE]

These two observations yield the form of the Hessian as

[TABLE]
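The identity $P_{qL}\circ(I_{q}\otimes A)=I_{q}\otimes(P_{L}\circ A)$ used in deriving the Hessian can likewise be confirmed on a random example. The Python sketch below is only a numerical check with illustrative dimensions.

```python
import numpy as np

def off_diagonal(M):
    """Zero out the diagonal of a square matrix (the Hadamard mask P ∘ M)."""
    return M - np.diag(np.diag(M))

rng = np.random.default_rng(6)
L, q = 4, 3
A = rng.standard_normal((L, L))

masked_then_kron = np.kron(np.eye(q), off_diagonal(A))   # I_q ⊗ (P_L ∘ A)
kron_then_masked = off_diagonal(np.kron(np.eye(q), A))   # P_{qL} ∘ (I_q ⊗ A)
identity_holds = np.allclose(masked_then_kron, kron_then_masked)
```

The check works because $I_{q}\otimes A$ is block diagonal with copies of $A$ , so its diagonal consists exactly of the diagonals of those copies.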

Step 2: Strong Consistency of Empirical Minimizers. We will now show that any minimizer $\widehat{\Theta}$ of the functional $\Theta\mapsto\widehat{\pi}^{2}(\Theta):=\|P_{L}\circ(\widehat{K}_{W,L}-\Theta)\|_{F}^{2}$ over the space of $L\times L$ matrices $\Theta$ with $\mathrm{rank}(\Theta)\leq q$ is consistent for $K_{X,L}$ as $n\rightarrow\infty$ when $H_{0,q}$ is valid.

To show this, let $\widehat{K}_{X,L}$ be the (unobservable) random $L\times L$ matrix

[TABLE]

Under $H_{0,q}$ , $\mathrm{rank}(\widehat{K}_{X,L})\leq q$ . Thus, if $\widehat{\Theta}$ is a local minimiser of $\widehat{\pi}$ , it must be that

[TABLE]

The right hand side, however, converges to zero almost surely by the strong law of large numbers and the continuous mapping theorem (note that since $k_{X}$ is continuous, the covariance operator of the process $X$ is trace-class, and so is any discretization thereof). Since $\widehat{K}_{W,L}$ converges to $K_{W,L}$ almost surely as $n\rightarrow\infty$ , we therefore have

[TABLE]

as $n\rightarrow\infty$ , which implies that

[TABLE]

This being the case, the event

[TABLE]

satisfies $P(\liminf A_{n})=1$ . This is because matrices of rank bounded by an integer form a closed set, and thus no sequence $B_{n}$ can converge almost surely to $B$ unless eventually $\mathrm{rank}(B_{n})\geq\mathrm{rank}(B)$ , almost surely.

Consequently, with probability 1, eventually in $n$ , a minor of $P_{L}\circ\widehat{\Theta}_{n}$ is non-vanishing if and only if the corresponding minor of $P_{L}\circ K_{X,L}$ is non-vanishing. But $\widehat{\Theta}$ is always at most rank $q$ , by definition. Thus, under $H_{0,q}$ , we follow the exact same steps as in the proof of Theorem 1, by forming the submatrix $S_{X,L}$ of $K_{X,L}$ and corresponding submatrix $\widehat{S}_{X,L}$ of $\widehat{\Theta}$ , in order to obtain that

•

$K_{X,L}=\Xi(P_{L}\circ K_{X,L})$ for a continuous map $\Xi$ (i.e. the diagonal of $K_{X,L}$ , and hence the entire matrix $K_{X,L}$ can be determined as a continuous function of the off-diagonal entries).

•

with probability 1, eventually in $n$ , $\widehat{\Theta}=\Xi(P_{L}\circ\widehat{\Theta})$ (i.e. the diagonal of $\widehat{\Theta}$ , and hence the entire matrix $\widehat{\Theta}$ , can be determined as a continuous function of the off-diagonal entries; indeed the function is the same function we use for $K_{X,L}$ ).

Thus, with probability 1, for all $n$ sufficiently large, $\Xi(P_{L}\circ\widehat{\Theta})$ is well defined, equals $\widehat{\Theta}$ , and thus we have

[TABLE]

In summary, we have established strong consistency of any minimising sequence $\widehat{\Theta}$ of $\widehat{\pi}$ .

Step 3: Consistency of an Appropriately Chosen Factor. Since $\mathrm{rank}(\widehat{\Theta})\leq q$ , we can write $\widehat{\Theta}=\widecheck{C}\widecheck{C}^{\top}$ , where $\widecheck{C}\in\mathbb{R}^{L\times q}$ . Thus, $\widehat{\pi}(\widehat{\Theta})=\min_{\Theta}\widehat{\Pi}(\Theta)=\min_{C}\widehat{\Psi}(C)=\widehat{\Psi}(\widecheck{C})$ . Of course, $\widecheck{C}U$ will also yield the same minimum value for any $q\times q$ orthogonal matrix $U$ . Since we are not interested in the law of the argument $\widecheck{C}U$ itself, but in the law of the optimised objective $\widehat{\Psi}(\widecheck{C}U)=\widehat{\Psi}(\widecheck{C})$ , we can work with any choice of $U$ , even an “oracle choice”. Define, in particular,

[TABLE]

and subsequently, define $\widehat{C}=\widecheck{C}\widehat{U}$ . Thus, $\widehat{C}$ is the version of $\widecheck{C}$ that is “aligned” with $C_{0}$ in the above sense. The solution of the above minimization problem, known as the orthogonal Procrustes problem, is given by $\widehat{U}=\widecheck{U}\widecheck{V}^{\top}$ , where $\widecheck{U}\widecheck{D}\widecheck{V}$ is the singular value decomposition of the matrix $C_{0}^{\top}\widecheck{C}$ . Furthermore, as earlier remarked,

[TABLE]

so the asymptotic distributions of $\min_{C\in\mathbb{R}^{L\times q}}\widehat{\Psi}(C)$ and $\widehat{\Psi}(\widehat{C})$ will agree.
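The alignment step can be illustrated with a short numerical sketch. It assumes that the criterion being minimised is the Frobenius distance $\|\widecheck{C}U-C_{0}\|_{F}$ over orthogonal $U$ (the defining display is not reproduced here), and it writes the singular value decomposition in NumPy's convention rather than the paper's; the function name `procrustes_align` and the test matrices are hypothetical.

```python
import numpy as np

def procrustes_align(C_check, C0):
    """Return (C_hat, U_hat), where U_hat is the orthogonal matrix minimising
    ||C_check @ U - C0||_F and C_hat = C_check @ U_hat."""
    # SVD of the q x q cross-product C_check^T C0 = P diag(s) Qt;
    # the orthogonal Procrustes minimiser is P @ Qt.
    P, _, Qt = np.linalg.svd(C_check.T @ C0)
    U_hat = P @ Qt
    return C_check @ U_hat, U_hat

rng = np.random.default_rng(1)
L, q = 8, 3                                   # illustrative sizes (assumption)
C0 = rng.standard_normal((L, q))              # stand-in for the reference factor
# Another factor of the same matrix C0 C0^T, rotated by an arbitrary orthogonal matrix.
rotation, _ = np.linalg.qr(rng.standard_normal((q, q)))
C_check = C0 @ rotation

C_hat, U_hat = procrustes_align(C_check, C0)
```

In this noiseless example the aligned factor coincides with $C_{0}$, while the product $\widehat{C}\widehat{C}^{\top}$ is unchanged by the rotation, which is why the optimised objective is unaffected by the alignment.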

We will now show that $\widehat{C}=\widehat{C}_{n}$ (to be explicit about the dependence on $n$ ) converges to $C_{0}$ almost surely as $n\rightarrow\infty$ under $H_{0,q}$ . Note that since $\widehat{\Theta}=\widehat{\Theta}_{n}=\widehat{C}_{n}\widehat{C}_{n}^{\top}$ converges almost surely to $K_{X,L}=C_{0}C_{0}^{\top}$ , we have that $\|\widehat{C}_{n}\|_{F}=\sqrt{\mathrm{tr}(\widehat{C}_{n}\widehat{C}_{n}^{\top})}$ converges almost surely to $\|C_{0}\|_{F}=\sqrt{\mathrm{tr}(K_{X,L})}$ . Thus, the set $\Omega=\{\omega:\|\widehat{C}_{n}(\omega)\|_{F}\leq 2\|C_{0}\|_{F}\ \mbox{and}\ \widehat{C}_{n}(\omega)\widehat{C}_{n}(\omega)^{\top}\rightarrow K_{X,L}\ \mbox{as}\ n\rightarrow\infty\}$ has probability measure one. Fix any $\omega\in\Omega$ . Since $\widehat{C}_{n}(\omega)$ lies in a compact set (the closed ball of radius $2||C_{0}||_{F}$ ) for all large $n$ , in order to show that $\widehat{C}_{n}(\omega)$ converges to $C_{0}$ , it suffices to show that every convergent subsequence of $\widehat{C}_{n}(\omega)$ converges to $C_{0}$ . Suppose that there exists a subsequence $\{k^{\prime}\}$ of $\{n\}$ such that $\widehat{C}_{k^{\prime}}(\omega)\rightarrow C_{1}(\omega)$ as $k^{\prime}\rightarrow\infty$ , where $C_{1}(\omega)\neq C_{0}$ . But then $\widehat{C}_{k^{\prime}}(\omega)\widehat{C}_{k^{\prime}}(\omega)^{\top}\rightarrow C_{1}(\omega)C_{1}(\omega)^{\top}=K_{X,L}$ . Thus, $C_{1}(\omega)=C_{0}V(\omega)$ for some $q\times q$ orthogonal matrix $V(\omega)$ . Suppose that $C_{1}(\omega)\neq C_{0}$ , equivalently, $V(\omega)\neq I_{q}$ . Define $\widehat{C}_{k^{\prime}}^{(0)}(\omega)=\widehat{C}_{k^{\prime}}(\omega)V(\omega)^{\top}$ for each $k^{\prime}\geq 1$ . Then,

[TABLE]

On the other hand, $\|\widehat{C}_{k^{\prime}}(\omega)-C_{0}\|_{F}\rightarrow\|C_{1}(\omega)-C_{0}\|_{F}>0.$ Recall that $\widehat{C}_{k^{\prime}}(\omega)=\widecheck{C}_{k^{\prime}}(\omega)\widehat{U}(\omega)$ as per our construction. So, there exists $k^{\prime}_{0}\geq 1$ such that

[TABLE]

This leads to a contradiction unless $V(\omega)=I_{q}$ . Hence, $C_{1}(\omega)=C_{0}$ so that the limit does not depend on $\omega\in\Omega$ . A standard subsequence argument (using the fact that $\widehat{C}_{n}$ lies in a compact set for all sufficiently large $n$ almost surely) now shows that the entire sequence $\widehat{C}_{n}$ must converge to $C_{0}$ as $n\rightarrow\infty$ on $\Omega$ .

Step 4: Determination of the Asymptotic Distribution. We will now proceed to derive the asymptotic distribution of $\min_{C}\widehat{\Psi}(C)$ under $H_{0}$ . First, observe that $\nabla^{2}\widehat{\Psi}(C)-\nabla^{2}\Psi(C)$ equals

[TABLE]

and is thus not dependent on $C$ . Furthermore, it is $O_{\mathbb{P}}(n^{-1/2})$ as $n\rightarrow\infty$ .
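To make the role of the objective concrete, here is a minimal numerical sketch. It assumes $\Psi(C)=\|P_{L}\circ(K_{W,L}-CC^{\top})\|_{F}^{2}$ and its empirical analogue with $\widehat{K}_{W,L}$ in place of $K_{W,L}$, for which a direct calculation gives the gradient $-4\{P_{L}\circ(K-CC^{\top})\}C$ when $K$ is symmetric. The Hessian difference used in the code, $-4I_{q}\otimes(P_{L}\circ(\widehat{K}_{W,L}-K_{W,L}))$, mirrors the form stated for the bootstrap functional in Step 2c of the proof of Theorem 3. All matrices below are synthetic stand-ins, and the names `psi`, `grad_psi`, `K_hat` are hypothetical.

```python
import numpy as np

def off_diag(A):
    """Zero out the diagonal (Hadamard product with the zero-diagonal mask)."""
    return A - np.diag(np.diag(A))

def psi(C, K):
    """Off-diagonal least squares objective || P o (K - C C^T) ||_F^2."""
    return np.sum(off_diag(K - C @ C.T) ** 2)

def grad_psi(C, K):
    """Gradient of psi in C for symmetric K: -4 (P o (K - C C^T)) C."""
    return -4.0 * off_diag(K - C @ C.T) @ C

def vec(A):
    """Column-stacking vectorisation."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(2)
L, q = 6, 2                                    # illustrative sizes (assumption)
C0 = rng.standard_normal((L, q))
K_X = C0 @ C0.T                                # rank-q signal covariance
D = np.diag(rng.uniform(0.5, 1.5, size=L))     # synthetic diagonal noise "ridge"
K_W = K_X + D
E = rng.standard_normal((L, L))
K_hat = K_W + 0.1 * (E + E.T) / 2              # synthetic symmetric perturbation

# Central finite differences to check the gradient formula at an arbitrary C.
C = rng.standard_normal((L, q))
eps = 1e-6
fd = np.zeros_like(C)
for i in range(L):
    for j in range(q):
        step = np.zeros_like(C)
        step[i, j] = eps
        fd[i, j] = (psi(C + step, K_W) - psi(C - step, K_W)) / (2 * eps)
grad_error = np.max(np.abs(fd - grad_psi(C, K_W)))

# The gradient difference is linear in C, so the Hessian difference is constant in C.
H_diff = -4.0 * np.kron(np.eye(q), off_diag(K_hat - K_W))
grad_diff = grad_psi(C, K_hat) - grad_psi(C, K_W)
```

The sketch also shows why the diagonal ridge is harmless: `psi(C0, K_W)` is zero even though `K_W` has full rank, because only off-diagonal entries enter the objective.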

Observe that by Taylor’s theorem,

[TABLE]

where $\widetilde{C}=\alpha\widehat{C}+(1-\alpha)C_{0}$ for some $0<\alpha<1$ . Since $\sqrt{n}(\widehat{K}_{W,L}-K_{W,L})$ converges weakly to a centered Gaussian random matrix $Z$ , and we have shown above that $\widehat{C}\rightarrow C_{0}$ in probability, it follows that

[TABLE]

as $n\rightarrow\infty$ . Now $\nabla^{2}\Psi(C_{0})$ is non-singular by Assumption (H), so the inverse function theorem applied to the function $\nabla\Psi$ implies that:

(i)

the function $\nabla\Psi$ is invertible in a neighbourhood of $C_{0}$ .

(ii)

the function $(\nabla\Psi)^{-1}$ is continuously differentiable in that same neighbourhood.

(iii)

$\nabla((\nabla\Psi)^{-1})(\nabla\Psi(C))=(\nabla^{2}\Psi(C))^{-1}$ in that same neighbourhood.

Since $\widehat{C}\rightarrow C_{0}$ in probability, we have $\mathbb{P}(\widetilde{C}\ \mbox{lies in the above neighbourhood})\rightarrow 1$ . Also, from the fact that $(\nabla\Psi)^{-1}$ is continuously differentiable in that neighbourhood, it follows that $\mathbb{P}(\nabla^{2}\Psi(\widetilde{C})\ \mbox{is invertible})\rightarrow 1$ . Further,

[TABLE]

in probability as $n\rightarrow\infty$ by the continuity of $\nabla\Psi$ and the continuous differentiability of $(\nabla\Psi)^{-1}$ . It now follows from (5.9) and (5.10) that

[TABLE]

as $n\rightarrow\infty$ .

Next note that for some $\widetilde{C}_{1}=\beta\widehat{C}+(1-\beta)C_{0}$ with $0<\beta<1$ , we have

[TABLE]

It now follows from equations (5.10), (5.11) and (5.12) that

[TABLE]

as $n\rightarrow\infty$ . Thus, as argued earlier, we have

[TABLE]

as $n\rightarrow\infty$ .

The proof of the last statement will follow by establishing that

[TABLE]

as $n\rightarrow\infty$ , the latter term being strictly positive under $H_{1,q}$ by statement (2) of Theorem 1. To establish this limit, apply the reverse triangle inequality to obtain

[TABLE]

Thus

[TABLE]

and notice that the right hand side converges almost surely to $\inf_{\Theta\in\mathcal{M}_{q}}\|P_{L}\circ\Theta-P_{L}\circ{K}_{W,L}\|_{F}$ . It follows that $nT_{q}$ diverges almost surely as $n\rightarrow\infty$ , and the proof is complete. ∎

Proof of Theorem 3.

The proof will be broken down into the following series of steps:

1.

We will first show that the quantities $\widehat{\Theta}_{q}$ , $\widehat{K}_{W,L}^{-1}$ , $M$ , $\widehat{\Theta}_{M}$ , and $\widehat{D}$ either converge $\mathbb{P}$ -almost surely as $n\rightarrow\infty$ , or are $\mathbb{P}$ -strongly tight as $n\rightarrow\infty$ (meaning they eventually lie in a compact set with $\mathbb{P}$ -probability 1). This will be broken down into three sub-cases: when $H_{0,q}$ is true; when $H_{1,q}$ and $H_{0}$ are both true; and, finally, when $H_{1}$ is true.

2.

We will then prove statement (a) in the present theorem under the assumption that $H_{0,q}$ is true. This proof will be further broken down into the following five sub-steps:

2a.

We will first show that the minimizer, say $\Theta^{*}_{q}$ , of the bootstrap functional $\Theta\mapsto||P_{L}\circ(\overline{K}_{\zeta,L}-\Theta)||_{F}$ over ${\cal M}_{q}$ converges to $K_{X,L}$ $\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely.

2b.

Then we will show that an appropriately chosen rank $q$ factorization of $\Theta^{*}_{q}$ converges to the rank $q$ factorization $C_{0}$ of $K_{X,L}$ defined in Assumption (H).

2c.

We will next derive the $\mathbb{P}^{*}$ -weak convergence of $\overline{K}_{\zeta,L}$ to an appropriate limit, $\mathbb{P}$ -almost surely.

2d.

Step (2c) will be used along with Taylor’s formula on the bootstrap functional in Step (2a) to derive the $\mathbb{P}^{*}$ -weak limit of the bootstrapped test statistic $nT^{*}_{q}$ .

2e.

We will conclude step (2) by showing that the weak limit obtained in Step (2d) is the same as the weak limit of $nT_{q}$ , thus establishing (a) in the theorem’s statement.

3.

We will then prove statement (b) in the present theorem under the assumption that $H_{1,q}$ is true.

Step 1: Empirical estimators are either strongly convergent or strongly tight. Consider first the case when $H_{0,q}$ is true. We will show that, in this case, we have strong consistency of the required quantities. Suppose that we can show that under $H_{0,q}$ , $M$ converges almost surely to $q$ as $n\rightarrow\infty$ . Since $M$ is integer-valued, this would imply that $M=q$ eventually, $\mathbb{P}$ -almost surely. Consequently, $\widehat{\Theta}_{M}=\widehat{\Theta}_{q}$ eventually, $\mathbb{P}$ -almost surely. So, following the same steps as in the proof of Theorem 2, we would obtain that $\widehat{\Theta}_{M}$ converges $\mathbb{P}$ -almost surely to $K_{X,L}$ , which in turn would imply that $\widehat{D}$ converges $\mathbb{P}$ -almost surely to $D$ .

To prove that $M$ converges almost surely to $q$ , we claim that

[TABLE]

To see this, we first note that $T_{q}$ converges to zero $\mathbb{P}$ -almost surely as $n\rightarrow\infty$ by the continuous mapping theorem (applied to the strong convergence of $\widehat{\Theta}_{q}$ to $K_{X,L}$ ). Next, we observe that $\frac{n}{2\log\log n}T_{q}$ is eventually bounded $\mathbb{P}$ -almost surely, by means of the law of the iterated logarithm. Recall that under $H_{0,q}$ , it $\mathbb{P}$ -almost surely holds that $\widehat{\Theta}=\Xi(P_{L}\circ\widehat{\Theta})$ for all $n$ sufficiently large. This, combined with (5.8), yields (for all $n$ sufficiently large)

${\alpha(n)\widehat{\pi}^{2}(\widehat{\Theta})=\alpha(n)\|P_{L}\circ\widehat{\Theta}-P_{L}\circ\widehat{K}_{W,L}\|^{2}_{F}\leq\|\sqrt{\alpha(n)}(P_{L}\circ\widehat{K}_{X,L}-P_{L}\circ K_{X,L})\|^{2}_{F}=\|\sqrt{\alpha(n)}Y_{n}\|^{2}_{F}.}$

Now each entry of $Y_{n}$ is an average of $n$ i.i.d. random variables of mean zero and finite variance. So if we pick $\alpha(n)=n/(2\log\log n)$ , the LIL will imply that $\limsup_{n\rightarrow\infty}\alpha(n)\widehat{\pi}^{2}(\widehat{\Theta})$ is finite, almost surely. Picking $\alpha(n)=n/\log n$ will then yield convergence of $\alpha(n)T_{q}$ to zero, $\mathbb{P}$ -almost surely. So

[TABLE]

for any $\epsilon>0$ . The term $\|P_{L}\circ\widehat{K}_{W,L}\|^{2}_{F}$ converges $\mathbb{P}$ -almost surely to the positive constant $\|P_{L}\circ{K}_{W,L}\|^{2}_{F}$ . Therefore $m_{n}$ , and consequently $M$ , converges $\mathbb{P}$ -almost surely to $q$ under $H_{0,q}$ .

Now consider the setting where both $H_{1,q}$ and the global null $H_{0}$ are true. In this case, one of $H_{0,q+1},...,H_{0,d}$ is true. Say it’s $H_{0,r}$ for $r\in\{q+1,...,d\}$ . By arguments similar to those above, we can show that $M$ converges $\mathbb{P}$ -almost surely to $r$ . Thus, eventually, $\widehat{\Theta}_{M}$ equals $\widehat{\Theta}_{r}$ . Since $H_{0,r}$ is true, $\widehat{\Theta}_{r}$ converges $\mathbb{P}$ -almost surely to $K_{X,L}$ , yielding strong consistency of $\widehat{D}$ for $D$ . As for $\widehat{\Theta}_{q}$ , this may not converge almost surely, since $q<r$ , but we will show that it is strongly tight. Define $K^{(q)}_{X,L}(i,j)=\sum_{l=1}^{q}\lambda_{l}\varphi_{l}(t_{i})\varphi_{l}(t_{j})$ so that $\mathrm{rank}(K^{(q)}_{X,L})\leq q$ . Since $\widehat{\Theta}_{q}$ is a minimizer of the function $\Theta\mapsto||P_{L}\circ(\widehat{K}_{W,L}-\Theta)||_{F}$ over matrices of rank at most $q$ ,

[TABLE]

The right hand side converges almost surely to a constant, therefore $P_{L}\circ\widehat{\Theta}_{q}$ eventually lies in a closed ball of finite radius centred at $P_{L}\circ{K}_{W,L}$ . To see that the diagonal elements of $\widehat{\Theta}_{q}$ must also eventually lie in some compact set almost surely, consider an arbitrary such diagonal element $\widehat{\theta}_{ii}$ of $\widehat{\Theta}_{q}$ . Since $L>L_{\dagger}$ , where $L_{\dagger}$ is as in Theorem 1, there exists a $(q+1)\times(q+1)$ submatrix $S$ of $\widehat{\Theta}_{q}$ that contains $\widehat{\theta}_{ii}$ but contains no other diagonal elements of $\widehat{\Theta}_{q}$ (see, e.g., the first part of the proof of Theorem 1). The matrix $S$ is clearly of rank at most $q$ , thus the column $S_{i}$ that contains $\widehat{\theta}_{ii}$ is in the span of the remaining columns of $S$ , $\{S_{j}\}_{j\neq i}$ . In other words, there exist $q$ coefficients $\{\alpha_{j}\}_{j\neq i}$ such that

[TABLE]

Now all entries of $S$ other than $\widehat{\theta}_{ii}$ are elements of $P_{L}\circ\widehat{\Theta}_{q}$ , and the latter eventually lies in a compact set almost surely. It follows that

(i)

Any coefficients $\{\alpha_{j}\}_{j\neq i}$ such that $S_{i}=\sum_{j\neq i}\alpha_{j}S_{j}$ also eventually lie in a compact set almost surely. This is because $S_{k,i}=\sum_{j\neq i}\alpha_{j}S_{k,j}$ for all $k\neq i$ and the $\{S_{k,j}\}_{k\neq i}$ are all eventually bounded almost surely.

(ii)

Thus, $\widehat{\theta}_{ii}$ lies in a compact set eventually almost surely, since $\widehat{\theta}_{ii}=\sum_{j\neq i}\alpha_{j}S_{i,j}$ .

Since $\widehat{\theta}_{ii}$ was arbitrarily chosen, we establish that $\lim\sup_{n\rightarrow\infty}\|\widehat{\Theta}_{q}\|_{F}<\infty$ almost surely.
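The submatrix argument can be sketched in code: for a matrix of rank at most $q$, a diagonal entry is determined by off-diagonal entries through the vanishing determinant of a $(q+1)\times(q+1)$ submatrix containing no other diagonal entry. The sketch below assumes $L\geq 2q+1$ and that the chosen $q\times q$ off-diagonal block is invertible; the particular index choice and the function name `recover_diagonal_entry` are illustrative and need not match the construction in the proof of Theorem 1.

```python
import numpy as np

def recover_diagonal_entry(M, i, q):
    """Recover M[i, i] of a symmetric matrix of rank <= q using off-diagonal entries only.

    Rows {i} + R and columns {i} + Cc form a (q+1) x (q+1) submatrix whose only
    diagonal entry of M is M[i, i]; R and Cc are disjoint and exclude i.
    """
    L = M.shape[0]
    others = [k for k in range(L) if k != i]
    R, Cc = others[:q], others[q:2 * q]
    block = M[np.ix_(R, Cc)]                   # q x q block of off-diagonal entries
    # Zero determinant of the submatrix (Schur complement) gives
    # M[i, i] = M[i, Cc] @ inv(block) @ M[R, i].
    return M[i, Cc] @ np.linalg.solve(block, M[R, i])

rng = np.random.default_rng(3)
L, q = 7, 3                                    # illustrative sizes with L >= 2q + 1
C0 = rng.standard_normal((L, q))
K_X = C0 @ C0.T                                # rank-q matrix
K_W = K_X + np.diag(rng.uniform(0.5, 1.5, L))  # diagonal contaminated by noise

# Only off-diagonal entries of K_W are read, and these coincide with those of K_X.
recovered = np.array([recover_diagonal_entry(K_W, i, q) for i in range(L)])
```

Because the recovered value is a continuous function of bounded off-diagonal entries (as long as the block stays invertible), boundedness of the off-diagonal part carries over to the diagonal, which is the point of items (i) and (ii).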

To conclude Step 1 of the proof, consider now the case where the global alternative $H_{1}$ is true (i.e. the true rank exceeds $d$ ). In this case, we will show that $M$ converges almost surely to $d$ . We will also show that $\widehat{\Theta}_{d}$ is eventually almost surely in a compact set. To prove that $M\stackrel{{\scriptstyle a.s.}}{{\rightarrow}}d$ , let us revisit the quantity

[TABLE]

Recalling the reverse triangle inequality (applied to the Frobenius norm)

[TABLE]

we may write

[TABLE]

Under $H_{1}$ , for any $m\leq d$ , we thus have

[TABLE]

and we notice that the right hand side converges to $\inf_{\Theta\in\mathcal{M}_{m}}\|P_{L}\circ\Theta-P_{L}\circ{K}_{W,L}\|_{F}$ . This is a strictly positive quantity from part 2 of Theorem 1. It follows that $m_{n}$ diverges almost surely, and thus $M$ converges to $d$ almost surely. The strong tightness of $\widehat{\Theta}_{d}$ is established in exactly the same way as the strong tightness of $\widehat{\Theta}_{q}$ in the case where $H_{1,q}\cap H_{0}$ is valid, considered in the previous paragraph.

Step 2: Asymptotic theory for bootstrap under $H_{0,q}$ (statement (a) of the Theorem). As announced earlier in the proof, this step is broken down into five substeps, and we now establish these in order.

Step 2a: Consistency of minimizer of bootstrap functional. We can, for simplicity, assume that $E(W)=0$ , and drop the term $\overline{{\bf W}}$ in the definition of $\widehat{m}({\bf W}_{i})$ . We will fix a set $\Omega_{0}$ of $\mathbb{P}$ -measure one on which a.s. convergence and law of iterated logarithm results hold as will be required in the proof. Fix any $\omega\in\Omega_{0}$ and work with the resulting population $\{{\bf W}_{1}(\omega),\ldots,{\bf W}_{n}(\omega),\ldots\}$ . All statements will be conditional on this population. We will drop the dependence on $\omega$ for simplicity of notation.

Assume that $H_{0,q}$ is true, and recall that $\zeta_{j}=U^{*}_{j}+V^{*}_{j}$ . Define

[TABLE]

Now, observe that

[TABLE]

where the last equality follows from the independence of $U^{*}_{1}$ and $V^{*}_{1}$ along with the fact that $\mathbb{E}^{*}(V^{*}_{1})=0$ . It is standard that all of $t_{1}-\mathbb{E}^{*}(U^{*}_{1}U^{*^{\top}}_{1})$ , $t_{2}-\mathbb{E}^{*}(V^{*}_{1}V^{*^{\top}}_{1})$ and $t_{3}$ converge to zero $\mathbb{P}^{*}$ -almost surely $\mathbb{P}$ -almost surely (though, for completeness, we show this in Lemma 3, after this proof). Thus, combining these statements, it follows that $\overline{K}_{\zeta,L}-(\widehat{\Theta}_{q}+\widehat{D})$ converges to zero $\mathbb{P}^{*}$ -almost surely.

Now define the function $\overline{\Pi}(\Theta)=\|P_{L}\circ(\overline{K}_{\zeta,L}-\Theta)\|_{F}^{2}$ . Define $\Theta^{*}_{q}$ to be a minimizer of the functional $\overline{\Pi}(\cdot)$ over $\mathcal{M}_{q}$ . We will now prove that $\Theta^{*}_{q}$ converges to $K_{X,L}$ $\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely under $H_{0,q}$ . To show this, let $\overline{K}_{X,L}$ be the (unobservable) random $L\times L$ matrix satisfying

[TABLE]

where $\{{\bf X}_{k}^{*}:k=1,2,\ldots,n\}$ is the (unobservable) bootstrap sample from the (also unobservable) $\{{\bf X}_{k}:k=1,2,\ldots,n\}$ (i.e. with those indices sampled by the bootstrap). Under $H_{0,q}$ , $\mathrm{rank}(\overline{K}_{X,L})\leq q$ . Also, define $\widehat{K}_{X,L}=n^{-1}\sum_{i=1}^{n}{\bf X}_{i}{\bf X}_{i}^{\top}$ . Thus, it must hold that

[TABLE]

Under $H_{0,q}$ , it has been proved above that $\widehat{\Theta}_{q}$ converges $\mathbb{P}$ -almost surely to $K_{X,L}$ . Further, $\widehat{D}$ converges $\mathbb{P}$ -almost surely to $D$ . So, $\widehat{\Theta}_{q}+\widehat{D}$ converges $\mathbb{P}$ -almost surely to $K_{X,L}+D=K_{W,L}$ . Also, $P_{L}\circ K_{X,L}=P_{L}\circ K_{W,L}$ . Thus, the right hand side converges to zero $\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely by the classical as well as the bootstrap almost sure convergence results and the continuous mapping theorem (note that since $k_{X}$ is continuous, the covariance operator of the process $X$ is trace-class, and so is any discretization thereof). We therefore have

[TABLE]

$\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely as $n\rightarrow\infty$ , which implies that

[TABLE]

$\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely as $n\rightarrow\infty$ . Now, the same arguments as in the remainder of the proof of Step 2 (Consistency of Empirical Minimizers) in Theorem 2 show that $\Theta^{*}_{q}$ converges $\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely as $n\rightarrow\infty$ to $K_{X,L}$ .

Step 2b: Consistency of appropriate rank factorization. Since $\mathrm{rank}(\Theta^{*}_{q})\leq q$ , we can write $\Theta^{*}_{q}=\widetilde{C}\widetilde{C}^{\top}$ , where $\widetilde{C}\in\mathbb{R}^{L\times q}$ . Thus, $\overline{\Pi}(\Theta^{*}_{q})=\min_{\Theta}\overline{\Pi}(\Theta)=\min_{C}\overline{\Psi}(C)=\overline{\Psi}(\widetilde{C})$ , where $\overline{\Psi}(C)=\|P_{L}\circ(\overline{K}_{\zeta,L}-CC^{\top})\|_{F}^{2}$ . We now make the observation that $\widetilde{C}U$ will also yield the same minimum value for any $q\times q$ orthogonal matrix $U$ . So, we will work with the following modified estimator instead. Define

[TABLE]

and subsequently, define $\overline{C}=\widetilde{C}\overline{U}$ . Thus, $\overline{C}$ is the version of $\widetilde{C}$ that is “aligned” with $C_{0}$ in the above Procrustes distance minimization sense. It is well known that the solution of the above minimization problem is given by $\overline{U}=\widetilde{U}\widetilde{V}^{\top}$ , where $\widetilde{U}\widetilde{D}\widetilde{V}$ is the singular value decomposition of the matrix $C_{0}^{\top}\widetilde{C}$ . Further,

[TABLE]

So, the bootstrap asymptotic distributions of $\min_{C\in\mathbb{R}^{L\times q}}\overline{\Psi}(C)$ and $\overline{\Psi}(\overline{C})$ will agree $\mathbb{P}$ -almost surely. Since it is these asymptotic distributions that we wish to determine, we can work with $\overline{C}$ , even though it is an oracle quantity (similar to Step 3 of Theorem 2).

We will now show that $\overline{C}=\overline{C}_{n}$ (to be explicit about the dependence on $n$ ) converges to $C_{0}$ $\mathbb{P}^{*}$ -almost surely as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely under $H_{0,q}$ . Note that since $\Theta^{*}_{q}=\Theta^{*}_{q,n}=\overline{C}_{n}\overline{C}_{n}^{\top}$ converges $\mathbb{P}^{*}$ -almost surely to $K_{X,L}=C_{0}C_{0}^{\top}$ $\mathbb{P}$ -almost surely (proven in Step (2a)), we have that $\|\overline{C}_{n}\|_{F}=\sqrt{\mathrm{tr}(\overline{C}_{n}\overline{C}_{n}^{\top})}$ converges $\mathbb{P}^{*}$ -almost surely to $\|C_{0}\|_{F}=\sqrt{\mathrm{tr}(K_{X,L})}$ $\mathbb{P}$ -almost surely. Thus, the set $\Omega^{*}=\{\omega^{*}:\|\overline{C}_{n}(\omega^{*})\|_{F}\leq 2\|C_{0}\|_{F}\ \mbox{and}\ \overline{C}_{n}(\omega^{*})\overline{C}_{n}(\omega^{*})^{\top}\rightarrow K_{X,L}\ \mbox{as}\ n\rightarrow\infty\}$ has $\mathbb{P}^{*}$ -probability measure one.

Fix any $\omega^{*}\in\Omega^{*}$ . Since $\overline{C}_{n}(\omega^{*})$ lies in the closed ball of radius $2||C_{0}||_{F}$ for all large $n$ , in order to show that $\overline{C}_{n}(\omega^{*})$ converges to $C_{0}$ , we will show that all subsequences of $\overline{C}_{n}(\omega^{*})$ converge to $C_{0}$ . Suppose that there exists a subsequence $\{k^{\prime}\}$ of $\{n\}$ such that $\overline{C}_{k^{\prime}}(\omega^{*})\rightarrow C_{1}(\omega^{*})$ as $k^{\prime}\rightarrow\infty$ , where $C_{1}(\omega^{*})\neq C_{0}$ . But then $\overline{C}_{k^{\prime}}(\omega^{*})\overline{C}_{k^{\prime}}(\omega^{*})^{\top}\rightarrow C_{1}(\omega^{*})C_{1}(\omega^{*})^{\top}=K_{X,L}$ . Thus, $C_{1}(\omega^{*})=C_{0}V(\omega^{*})$ for some $q\times q$ orthogonal matrix $V(\omega^{*})$ . Suppose that $C_{1}(\omega^{*})\neq C_{0}$ , equivalently, $V(\omega^{*})\neq I_{q}$ . Define $\overline{C}_{k^{\prime}}^{(0)}(\omega^{*})=\overline{C}_{k^{\prime}}(\omega^{*})V(\omega^{*})^{\top}$ for each $k^{\prime}\geq 1$ . Then,

[TABLE]

On the other hand, $\|\overline{C}_{k^{\prime}}(\omega^{*})-C_{0}\|_{F}\rightarrow\|C_{1}(\omega^{*})-C_{0}\|_{F}>0.$ Recall that $\overline{C}_{k^{\prime}}(\omega^{*})=\widetilde{C}_{k^{\prime}}(\omega^{*})\overline{U}(\omega^{*})$ as per our construction. So, there exists $k^{\prime}_{0}\geq 1$ such that

[TABLE]

This leads to a contradiction unless $V(\omega^{*})=I_{q}$ . Hence, $C_{1}(\omega^{*})=C_{0}$ so that the limit does not depend on $\omega^{*}\in\Omega^{*}$ . A standard subsequence argument (using the fact that $\overline{C}_{n}$ lies in a compact set for all sufficiently large $n$ almost surely) now shows that the entire sequence $\overline{C}_{n}$ must converge to $C_{0}$ as $n\rightarrow\infty$ on $\Omega^{*}$ .

We will now derive the asymptotic distribution of $\min_{C}\overline{\Psi}(C)$ under $H_{0}$ . Define $\widecheck{\Psi}(C)=||P_{L}\circ((\widehat{\Theta}_{q}+\widehat{D})-CC^{\top})||_{F}^{2}=||P_{L}\circ(\widehat{\Theta}_{q}-CC^{\top})||_{F}^{2}$ .

Step 2c: Asymptotic distribution of bootstrap covariance matrix. We now determine the asymptotic distribution of the bootstrap covariance matrix $\overline{K}_{\zeta,L}$ . First, use the form of the Hessian established in Step 1 of Theorem 2 to observe that $\nabla^{2}\overline{\Psi}(C)-\nabla^{2}\widecheck{\Psi}(C)=-4I_{q}\otimes(P_{L}\circ(\overline{K}_{\zeta,L}-(\widehat{\Theta}_{q}+\widehat{D})))$ is free of $C$ . We will now derive the asymptotic distribution of $\sqrt{n}(\overline{K}_{\zeta,L}-(\widehat{\Theta}_{q}+\widehat{D}))$ . This will then imply that $\nabla^{2}\overline{\Psi}(C)-\nabla^{2}\widecheck{\Psi}(C)=O_{\mathbb{P}^{*}}(n^{-1/2})$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely. For simplicity of notation, denote $\widehat{\Theta}_{q}+\widehat{D}$ by $\widehat{K}$ . In order to derive the $\mathbb{P}^{*}$ -weak convergence of $\sqrt{n}(\overline{K}_{\zeta,L}-\widehat{K})$ , it is enough to derive the $\mathbb{P}^{*}$ -weak convergence of $\sqrt{n}({\bf a}^{\top}\overline{K}_{\zeta,L}{\bf b}-{\bf a}^{\top}\widehat{K}{\bf b})$ for each fixed ${\bf a},{\bf b}\in\mathbb{R}^{L}$ . Observe that from the definition of $\overline{K}_{\zeta,L}$ , it follows that ${\bf a}^{\top}\overline{K}_{\zeta,L}{\bf b}=n^{-1}\sum_{j=1}^{n}\mathbf{1}^{\top}{\bf R}_{j}$ , where ${\bf R}_{j}$ is a $4$ -tuple given by

[TABLE]

Clearly,

[TABLE]

We now proceed to find $\mathrm{Cov}^{*}({\bf R}_{j})$ . Clearly,

[TABLE]

Using the fact that for compatible matrices $Q_{1},Q_{2},Q_{3}$ and $Q_{4}$ , we have $\mbox{tr}\{Q_{1}^{\top}Q_{2}Q_{3}Q_{4}^{\top}\}=(\mbox{vec}(Q_{1}))^{\top}(Q_{4}\otimes Q_{2})\mbox{vec}(Q_{3})$ , the above term equals

[TABLE]

Using the properties of Kronecker products, it follows that $AA^{\top}\otimes BB^{\top}=(A\otimes B)(A^{\top}\otimes B^{\top})=(A\otimes B)(A\otimes B)^{\top}$ . So, the above term equals

[TABLE]
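The two matrix identities just used (the trace-to-vec identity and the Kronecker factorisation) are easy to confirm numerically. The sketch below takes $\mathrm{vec}$ to be column-stacking; the matrix sizes are arbitrary illustrative choices.

```python
import numpy as np

def vec(A):
    """Column-stacking vectorisation."""
    return A.reshape(-1, order="F")

rng = np.random.default_rng(4)

# tr(Q1^T Q2 Q3 Q4^T) = vec(Q1)^T (Q4 (x) Q2) vec(Q3), for compatible shapes.
Q1 = rng.standard_normal((3, 4))
Q2 = rng.standard_normal((3, 5))
Q3 = rng.standard_normal((5, 2))
Q4 = rng.standard_normal((4, 2))
trace_lhs = np.trace(Q1.T @ Q2 @ Q3 @ Q4.T)
trace_rhs = vec(Q1) @ np.kron(Q4, Q2) @ vec(Q3)

# (A A^T) (x) (B B^T) = (A (x) B)(A (x) B)^T, via the mixed-product property.
A = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 5))
kron_lhs = np.kron(A @ A.T, B @ B.T)
kron_rhs = np.kron(A, B) @ np.kron(A, B).T
```

The first identity follows from $\mathrm{vec}(Q_{2}Q_{3}Q_{4}^{\top})=(Q_{4}\otimes Q_{2})\mathrm{vec}(Q_{3})$ together with $\mathrm{tr}(Q_{1}^{\top}G)=\mathrm{vec}(Q_{1})^{\top}\mathrm{vec}(G)$.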

The matrix $Q$ above converges $\mathbb{P}$ -almost surely under $H_{0,q}$ to

[TABLE]

Assuming Gaussianity of the observations $\{{\bf W}_{i}\}$ , we have that ${\bf W}_{1}{\bf W}_{1}^{\top}$ follows a central $L$ -dimensional Wishart distribution with parameter $K_{W,L}$ . Also, observe that for any vector ${\bf x}\in\mathbb{R}^{L}$ , we have ${\bf x}\otimes{\bf x}=\mbox{vec}({\bf x}{\bf x}^{\top})$ . It now follows from equations (3.131), (3.132) and (3.135) in Izenman (2008, p. 64) that

[TABLE]

where $\eta_{L}=\mbox{vec}(K_{W,L})$ is a $L^{2}$ -dimensional vector, and $M_{L^{2}}$ is the $L^{2}\times L^{2}$ commutation matrix that satisfies $M_{L^{2}}\mbox{vec}(F)=\mbox{vec}(F^{\top})$ for any $L\times L$ matrix $F$ . So, $Q$ converges $\mathbb{P}$ -almost surely to

[TABLE]
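For readers who want to experiment with the commutation matrix, the following sketch builds $M_{L^{2}}$ explicitly and checks its defining property together with ${\bf x}\otimes{\bf x}=\mathrm{vec}({\bf x}{\bf x}^{\top})$. It also checks a standard identity for a centred Gaussian vector ${\bf W}$ with covariance $\Sigma$, namely $\mathrm{Cov}(\mathrm{vec}({\bf W}{\bf W}^{\top}))=(I_{L^{2}}+M_{L^{2}})(\Sigma\otimes\Sigma)$, by comparing it entrywise with Isserlis' formula. Whether this is exactly the form of the displayed equations is an assumption, since those displays are not reproduced here; `Sigma` is a synthetic covariance.

```python
import numpy as np

def vec(A):
    """Column-stacking vectorisation, so vec(F)[i + j*L] = F[i, j]."""
    return A.reshape(-1, order="F")

def commutation_matrix(L):
    """L^2 x L^2 permutation matrix M satisfying M @ vec(F) = vec(F.T)."""
    M = np.zeros((L * L, L * L))
    for i in range(L):
        for j in range(L):
            # Entry (i, j) of F.T is F[j, i], stored at position j + i*L of vec(F).
            M[i + j * L, j + i * L] = 1.0
    return M

L = 3                                          # illustrative dimension (assumption)
rng = np.random.default_rng(5)
M = commutation_matrix(L)
F = rng.standard_normal((L, L))
x = rng.standard_normal(L)

G = rng.standard_normal((L, L))
Sigma = G @ G.T + np.eye(L)                    # synthetic positive definite covariance

# Isserlis: Cov(w_i w_j, w_k w_l) = Sigma_ik Sigma_jl + Sigma_il Sigma_jk.
gauss_cov = np.zeros((L * L, L * L))
for i in range(L):
    for j in range(L):
        for k in range(L):
            for l in range(L):
                gauss_cov[i + j * L, k + l * L] = (
                    Sigma[i, k] * Sigma[j, l] + Sigma[i, l] * Sigma[j, k]
                )

closed_form = (np.eye(L * L) + M) @ np.kron(Sigma, Sigma)
```

The term involving $M_{L^{2}}$ accounts for the symmetry of ${\bf W}{\bf W}^{\top}$: entries $(i,j)$ and $(j,i)$ are the same random variable.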

Observe that $\mathbb{E}^{*}({\bf a}^{\top}U^{*}_{j}{\bf b}^{\top}U^{*}_{j})={\bf a}^{\top}\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q}{\bf b}$ converges $\mathbb{P}$ -almost surely to ${\bf a}^{\top}K_{X,L}K_{W,L}^{-1}K_{X,L}{\bf b}$ . Also, note that for vectors ${\bf x},{\bf y}\in\mathbb{R}^{L}$ and for $L\times L$ matrices $E$ and $F$ , we have

[TABLE]

Thus, it follows that

[TABLE]

Next, let us consider $\mathrm{Var}^{*}[({\bf a}^{\top}V^{*}_{j})({\bf b}^{\top}V^{*}_{j})]$ . This can be simplified as follows by using the fact that the $V^{*}_{j}$ ’s are themselves centered Gaussians.

[TABLE]

where $\widehat{\gamma}_{L}=\mbox{vec}(\widehat{D}+\widehat{A})$ and $\gamma_{L}=\mbox{vec}(D+A)$ .

Next, we use the independence of the $U^{*}_{j}$ ’s and the $V^{*}_{j}$ ’s to write

[TABLE]

Similarly,

[TABLE]

Now note that by independence of $U^{*}_{j}$ ’s and $V^{*}_{j}$ ’s, and using the fact that the $V^{*}_{j}$ ’s are centered, we have

[TABLE]

Further,

[TABLE]

since $\mathbb{E}^{*}[({\bf a}^{\top}V^{*}_{j})({\bf b}^{\top}V^{*}_{j})^{2}]$ is bounded $\mathbb{P}$ -almost surely and $\mathbb{E}^{*}[{\bf a}^{\top}U^{*}_{j}]={\bf a}^{\top}\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\overline{W}$ converges to zero $\mathbb{P}$ -almost surely since $\mathbb{E}({\bf W}_{1})=0$ by assumption. Similarly,

[TABLE]

Finally,

[TABLE]

So, collecting all the expressions together, we get that

[TABLE]

Observe that for an $L\times L$ matrix $E$ and an $L$ -dimensional vector ${\bf x}$ , we have

[TABLE]

by the definition of $M_{L^{2}}$ and noting that $(E{\bf x})(E{\bf x})^{\top}$ is a symmetric $L\times L$ matrix. Thus,

[TABLE]

Now, observe that for any two vectors ${\bf x},{\bf y}\in\mathbb{R}^{L}$ and any $L\times L$ covariance matrix $\Sigma$ equalling $\mathbb{E}({\bf S}{\bf S}^{\top})$ for a centered $L$ -dimensional random variable ${\bf S}$ , we have

[TABLE]

which follows by using the fact that for two $L$ -dimensional vectors ${\bf u}$ and ${\bf v}$ , we have $({\bf u}^{\top}\otimes{\bf u}^{\top})({\bf v}\otimes{\bf v})=({\bf u}^{\top}{\bf v})^{2}$ . Thus, we have that

[TABLE]

Now, in order to derive the asymptotic distribution of $\sqrt{n}({\bf a}^{\top}\overline{K}_{\zeta,L}{\bf b}-{\bf a}^{\top}\widehat{K}{\bf b})$ we verify the Lyapunov condition. Recall that ${\bf a}^{\top}\overline{K}_{\zeta,L}{\bf b}=n^{-1}\sum_{j=1}^{n}\mathbf{1}^{\top}{\bf R}_{j}$ and ${\bf a}^{\top}\widehat{K}{\bf b}=\mathbb{E}^{*}(n^{-1}\sum_{j=1}^{n}\mathbf{1}^{\top}{\bf R}_{j})$ . Also, $s_{n}^{2}:=\mathrm{Var}^{*}(n^{-1}\sum_{j=1}^{n}\mathbf{1}^{\top}{\bf R}_{j})=n^{-1}\mathrm{Var}^{*}(\mathbf{1}^{\top}{\bf R}_{1})$ , where $ns_{n}^{2}=\mathrm{Var}^{*}(\mathbf{1}^{\top}{\bf R}_{1})$ converges $\mathbb{P}$ -almost surely to a positive constant (derived previously). We will now show that

[TABLE]

So, it is enough to show that

[TABLE]

Now,

[TABLE]

Since the right hand side of the above expression converges $\mathbb{P}$ -almost surely to a positive finite constant, it follows that $n^{-2}\sum_{j=1}^{n}\mathbb{E}^{*}[(\mathbf{1}^{\top}{\bf R}_{j})^{4}]\rightarrow 0$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely.

Hence, by the Lindeberg CLT

[TABLE]

where the second statement follows upon using Slutsky’s theorem combined with the fact that $ns_{n}^{2}$ converges to $v_{{\bf a},{\bf b}}$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely. This concludes Step (2c). As a side remark, observe that even if the Gaussian assumption is not true, we still have the above weak convergence (under $H_{0,q}$ ) albeit with a different expression for the asymptotic variance.

Step 2d: Asymptotic distribution of $nT^{*}_{q}$ . Denote the Procrustes aligned rank factorization of $\widehat{\Theta}_{q}$ by $\mathring{C}$ . Since $\widehat{\Theta}_{q}$ converges $\mathbb{P}$ -almost surely to $K_{X,L}$ under $H_{0,q}$ , it can be shown that $\mathring{C}$ converges $\mathbb{P}$ -almost surely to $C_{0}$ by using arguments similar to those used to prove the almost sure convergence of $\overline{C}$ .

Recall that we denoted $\widehat{\Theta}_{q}+\widehat{D}$ by $\widehat{K}$ . First, use Taylor’s formula to observe that

[TABLE]

where $\widetilde{C}_{1}=\alpha\overline{C}+(1-\alpha)\mathring{C}$ for some $0<\alpha<1$ . We have already proved that $\sqrt{n}(\overline{K}_{\zeta,L}-\widehat{K})$ converges $\mathbb{P}^{*}$ -weakly to a centered Gaussian random matrix $Z_{\dagger}$ as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely, in Step (2c) above. Further, we have that $\overline{C}\rightarrow C_{0}$ $\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely (Step (2b)), and similarly that $\mathring{C}$ converges $\mathbb{P}$ -almost surely to $C_{0}$ . This statement implies that $\overline{C}-\mathring{C}\rightarrow 0$ $\mathbb{P}^{*}$ -almost surely, $\mathbb{P}$ -almost surely. As earlier, by the invertibility of $\nabla^{2}\Psi(C_{0})$ , there is an open neighbourhood $\mathcal{N}$ of $C_{0}$ where (i) the function $\nabla\Psi$ is invertible,

(ii) the function $(\nabla\Psi)^{-1}$ is continuously differentiable, and

(iii) $\nabla((\nabla\Psi)^{-1})(\nabla\Psi(C))=(\nabla^{2}\Psi(C))^{-1}$ for any $C$ in that neighbourhood.

Since $\overline{C}\rightarrow C_{0}$ in $\mathbb{P}^{*}$ -probability $\mathbb{P}$ -almost surely, we have

[TABLE]

as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely. Also, from the fact that $(\nabla\Psi)^{-1}$ is continuously differentiable in that neighbourhood, it follows that $\mathbb{P}^{*}(\nabla^{2}\Psi(\widetilde{C}_{1})\ \mbox{is invertible})\rightarrow 1$ as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely. Moreover, $(\nabla^{2}\Psi(\widetilde{C}_{1}))^{-1}\rightarrow(\nabla^{2}\Psi(C_{0}))^{-1}$ in $\mathbb{P}^{*}$ -probability as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely.

It now follows from the above equations that

[TABLE]

Since $I_{q}\otimes(P_{L}\circ(\widehat{K}-K_{W,L}))\rightarrow 0$ $\mathbb{P}$ -almost surely, and $(\nabla^{2}\Psi(\widetilde{C}_{1}))^{-1}\rightarrow(\nabla^{2}\Psi(C_{0}))^{-1}$ $\mathbb{P}^{*}$ -almost surely as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely, it follows that $\|(\nabla^{2}\Psi(\widetilde{C}_{1}))^{-1}[I_{q}\otimes(P_{L}\circ(\widehat{K}-K_{W,L}))]\|_{F}\rightarrow 0$ in $\mathbb{P}^{*}$ -probability as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely. Hence, $\mathbb{P}^{*}(I_{qL}-4(\nabla^{2}\Psi(\widetilde{C}_{1}))^{-1}[I_{q}\otimes(P_{L}\circ(\widehat{K}-K_{W,L}))]\ \mbox{is invertible})\rightarrow 1$ as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely. Also, the inverse converges to $I_{qL}$ in $\mathbb{P}^{*}$ -probability as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely. Combining these facts, we get that

[TABLE]

as $n\rightarrow\infty$ , $\mathbb{P}$ -almost surely.

Next note that for some $\widetilde{C}_{1}=\beta\overline{C}+(1-\beta)\mathring{C}$ with $0<\beta<1$ , we have

[TABLE]

By (5.15), the facts that $\widehat{K}\rightarrow K_{W,L}$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely, and $\overline{K}_{\zeta,L}-\widehat{K}\rightarrow 0$ $\mathbb{P}^{*}$ -almost surely as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely, the convergence $\mathbb{P}^{*}$ -almost surely of $\widetilde{C}_{1}$ to $C_{0}$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely, and the continuity of $\nabla^{2}\Psi$ , it follows that

[TABLE]

Thus,

[TABLE]

as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely. Thus, the bootstrap version of the test statistic, namely, $T^{*}_{q}=\min_{\Theta\in{\cal M}_{q}}\|P_{L}\circ(\overline{K}_{\zeta,L}-\Theta)\|_{F}^{2}$ satisfies

[TABLE]

as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely.
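To make the minimisation defining $T^{*}_{q}$ concrete, the following is a minimal numerical sketch. It assumes that the objective takes the form $\Psi(C)=\|P_{L}\circ(K-CC^{\top})\|_{F}^{2}$ with $\Theta=CC^{\top}$, so that its gradient is $-4[P_{L}\circ(K-CC^{\top})]C$ for symmetric $K$. The function names, the eigenvector starting point and the backtracking rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def off_diag(A):
    """Zero out the diagonal, i.e. the Hadamard product P_L ∘ A."""
    return A - np.diag(np.diag(A))

def psi(C, K):
    """Assumed objective: || P_L ∘ (K - C C^T) ||_F^2."""
    return np.sum(off_diag(K - C @ C.T) ** 2)

def grad_psi(C, K):
    """Gradient of psi in C for symmetric K: -4 [P_L ∘ (K - C C^T)] C."""
    return -4.0 * off_diag(K - C @ C.T) @ C

def minimise_psi(K, q, n_iter=500):
    """Gradient descent with backtracking; returns (C, psi(C))."""
    # Hypothetical starting point: leading q eigenpairs of K.
    vals, vecs = np.linalg.eigh(K)
    C = vecs[:, -q:] * np.sqrt(np.clip(vals[-q:], 0.0, None))
    f = psi(C, K)
    for _ in range(n_iter):
        g = grad_psi(C, K)
        step = 1.0
        # Halve the step until the Armijo decrease condition holds.
        while step > 1e-12:
            C_new = C - step * g
            f_new = psi(C_new, K)
            if f_new <= f - 1e-4 * step * np.sum(g ** 2):
                C, f = C_new, f_new
                break
            step *= 0.5
        else:
            break  # no acceptable step was found: stop iterating
    return C, f

rng = np.random.default_rng(0)
L, q = 7, 2
C0 = rng.normal(size=(L, q))
K = C0 @ C0.T + np.diag(rng.uniform(0.5, 1.5, size=L))  # rank q plus a ridge
C_hat, T_value = minimise_psi(K, q)
```

Because only off-diagonal entries enter the objective, the ridge added to the diagonal of `K` does not affect the value of `psi`; this is the feature of the statistic that the proof exploits.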

Step 2e: Bootstrap weak limit coincides with original weak limit in Theorem 2. We will now conclude Step 2 by showing that $Z_{\dagger}$ has the same distribution as $Z$, which was the weak limit of $\sqrt{n}(\widehat{K}_{W,L}-K_{W,L})$ under $H_{0,q}$ in the statement of Theorem 2. For this, observe that it is enough to show that $\sqrt{n}({\bf a}^{\top}\widehat{K}_{W,L}{\bf b}-{\bf a}^{\top}K_{W,L}{\bf b})$ converges $\mathbb{P}$-weakly to $N(0,v_{{\bf a},{\bf b}})$. Note that ${\bf a}^{\top}\widehat{K}_{W,L}{\bf b}=n^{-1}\sum_{i=1}^{n}({\bf a}^{\top}{\bf W}_{i})({\bf b}^{\top}{\bf W}_{i})$. Now, under Gaussianity of the ${\bf W}_{i}$’s and the assumption that $\mathbb{E}({\bf W}_{1})=0$, we have

[TABLE]

So, using the classical CLT, it follows that $\sqrt{n}({\bf a}^{\top}\widehat{K}_{W,L}{\bf b}-{\bf a}^{\top}K_{W,L}{\bf b})$ converges $\mathbb{P}$-weakly to $N(0,v_{{\bf a},{\bf b}})$. Hence, $Z_{\dagger}\stackrel{d}{=}Z$ under Gaussianity of the observations and $H_{0,q}$. Consequently, the asymptotic distribution of the bootstrap statistic is the same as that of the original statistic under Gaussianity and $H_{0,q}$.

Wrapping up Step 2 of our proof, we are now ready to establish statement (a) of the present theorem. Denote by $H^{*}_{q}$ the empirical CDF of $nT^{*}_{q}$ and by $F^{*}_{q}$ that of $T^{*}_{q}$. Then, $F^{*}_{q}(x)=H^{*}_{q}(nx)$ for each $x\in\mathbb{R}$ and each $n\geq 1$. Let $G^{*}_{q}$ denote the generalized inverse CDF of $H^{*}_{q}$. From the usual properties of the generalized inverse of a CDF, it follows that the two events $\{F^{*}_{q}(T_{q})\geq u\}$ (equivalently, $\{H^{*}_{q}(nT_{q})\geq u\}$) and $\{nT_{q}\geq G^{*}_{q}(u)\}$ are the same for any $u\in(0,1)$. Let us denote the $u$-quantile of the asymptotic limit, say $Y_{\dagger}$, of $nT^{*}_{q}$ by $y_{u}$. Since $H^{*}_{q}$ converges $\mathbb{P}^{*}$-weakly, $\mathbb{P}$-almost surely, to the distribution of $Y_{\dagger}$, and $Y_{\dagger}$ has a continuous distribution (being a continuous map of the Gaussian random matrix $Z_{\dagger}$), it follows from Lemma 21.2 in van der Vaart (1998) that $G^{*}_{q}(u)$ converges to $y_{u}$ for all $u\in(0,1)$, $\mathbb{P}$-almost surely. Now, $nT_{q}$ converges $\mathbb{P}$-weakly to $Y$, which has the same distribution as $Y_{\dagger}$ (since $Z$ and $Z_{\dagger}$ have the same distribution). So, by Slutsky’s theorem, $nT_{q}-G^{*}_{q}(u)$ converges $\mathbb{P}$-weakly to $Y-y_{u}$ for any $u\in(0,1)$. Hence,

[TABLE]

for any $u\in(0,1)$ . Observe that $p^{*}_{q}=1-F^{*}_{q}(T_{q})$ . Thus,

[TABLE]

for any $u\in(0,1)$ . This completes the proof of the first conclusion of the present theorem.
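The equivalence between the events $\{H^{*}_{q}(x)\geq u\}$ and $\{x\geq G^{*}_{q}(u)\}$ used above can be checked numerically for an empirical CDF. The sketch below is only an illustration: the helper names are hypothetical, and the bootstrap draws are replaced by simulated chi-square values.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` at x: the fraction of values <= x."""
    return np.mean(np.asarray(sample) <= x)

def generalised_inverse(sample, u):
    """G(u) = inf{x : ecdf(x) >= u} for u in (0, 1)."""
    s = np.sort(np.asarray(sample))
    # Smallest k with k / B >= u; the small offset guards against rounding.
    k = int(np.ceil(u * len(s) - 1e-12))
    return s[max(k, 1) - 1]

def bootstrap_p_value(t_obs, t_boot):
    """p* = 1 - F*(t_obs), with F* the empirical CDF of the bootstrap draws."""
    return 1.0 - ecdf(t_boot, t_obs)

rng = np.random.default_rng(0)
t_boot = rng.chisquare(3, size=97)   # stand-in for bootstrap replicates
t_obs = 4.2                          # stand-in for the observed statistic
p_star = bootstrap_p_value(t_obs, t_boot)
```

With these definitions, `ecdf(t_boot, x) >= u` holds exactly when `x >= generalised_inverse(t_boot, u)`, which is the property that lets the proof pass from the bootstrap p-value to a quantile comparison.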

Step 3: Asymptotic theory for bootstrap under $H_{1,q}$ . For the second claim in the theorem, assume $H_{1,q}$ . We will first consider the case when the global null is also still true, i.e., $H_{0,r}$ is true for some $r\in\{q+1,q+2,\ldots,d\}$ . In this case, it has been proven in Step 1 that $M$ converges almost surely to $r$ so that $\widehat{D}$ converges almost surely to $D$ . Though $\widehat{\Theta}_{q}$ might not converge, it has also been shown in Step 1 of the proof that $\widehat{\Theta}_{q}$ is almost surely tight. So, $\mathbb{P}(\widehat{\Theta}_{q}\ \mbox{lies in a compact set eventually as}\ n\rightarrow\infty)=1$ . We will work with an $\omega$ that satisfies all the almost sure convergences and laws of iterated logarithm as needed earlier along with the previous tightness requirement (such $\omega$ ’s comprise an event of $\mathbb{P}$ -measure one). So, by the compactness condition, there will exist a subsequence $\{n^{\prime}\}$ (possibly depending on $\omega$ ) such that $\widehat{\Theta}_{q}$ converges to some $K_{q}=K_{q}(\omega)$ as $n^{\prime}\rightarrow\infty$ . Clearly, $\widehat{D}$ converges to $D$ along this subsequence. We will work by viewing this subsequence as our original sequence, and all convergence statements will be as $n^{\prime}\rightarrow\infty$ .

Since $\widehat{K}$ converges to $K_{q}+D$ along $\{n^{\prime}\}$, observe that we will be able to prove (following the same arguments as in Step 2d) that $\sqrt{n}({\bf a}^{\top}\overline{K}_{\zeta,L}{\bf b}-{\bf a}^{\top}\widehat{K}{\bf b})$ converges $\mathbb{P}^{*}$-weakly to $N(0,u_{{\bf a},{\bf b}})$ along $\{n^{\prime}\}$ for each ${\bf a},{\bf b}\in\mathbb{R}^{L}$. Note that the limiting variance term will be different from that in Step 2, which assumes $H_{0,q}$ to be true. We can denote the limiting random matrix of $\sqrt{n}(\overline{K}_{\zeta,L}-\widehat{K})$ by $Z^{(1)}_{\dagger}$.

Next observe that since $\mathrm{rank}(\widehat{\Theta}_{q})\leq q$ (by construction), it follows from the definition of a minimum that

[TABLE]

for each $n\geq 1$ . So, $n^{\prime}T^{*}_{q}$ is $\mathbb{P}^{*}$ -tight along $\{n^{\prime}\}$ since it is bounded above by a $\mathbb{P}^{*}$ -weakly convergent sequence. Note that since $T_{q}>0$ almost surely, the entire sequence $\{nT_{q}\}$ diverges to $+\infty$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely. So, for each $\omega$ in a $\mathbb{P}$ -measure one set and along the corresponding sequence $\{n^{\prime}\}$ (depending possibly on $\omega$ as discussed thus far in the proof), we have

[TABLE]

as $n^{\prime}\rightarrow\infty$. In fact, a stronger statement is actually true: for each $u\in(0,1)$, there exists $n(\omega)\geq 1$ such that $F^{*}_{q}(T_{q})\geq u$ for all $n>n(\omega)$. This is because of the following: if there exists an infinite sequence $\{\widetilde{n}\}$ (possibly depending on $u$ and $\omega$) such that $F^{*}_{q}(T_{q})<u$ along $\{\widetilde{n}\}$, one can find a further subsequence, say $\{\widecheck{n}\}$ (obtained in a similar way as discussed previously in the case of the original subsequence $\{n^{\prime}\}$), such that $F^{*}_{q}(T_{q})\rightarrow 1$ along this subsequence $\{\widecheck{n}\}$ of $\{\widetilde{n}\}$. This would lead to a contradiction. Hence,

[TABLE]

Observe that $p^{*}_{q}=1-F^{*}_{q}(T_{q})$ . So, replacing $u$ by $1-u$ in the previous displayed equation, it follows that

[TABLE]

Consequently, $\mathbb{P}\{p^{*}_{q}\leq u\}\rightarrow 1$ as $n\rightarrow\infty$ for each $u\in(0,1)$ . This completes the proof of the second statement of the present theorem in case $H_{0,q}$ is not true but the global null is true.

Finally, consider the situation when the global null is not true, i.e. $H_{1}$ is true. In this case, it follows that both $\widehat{\Theta}_{q}$ and $\widehat{D}$ are strongly tight. A simple extension of the subsequence arguments provided in the previous situation (to accommodate $\widehat{D}$ in addition to $\widehat{\Theta}_{q}$) carries over and proves the second statement of the present theorem.

∎

Lemma 3.

In the setting and notation of Theorem 3 and its proof, all of $t_{1}-\mathbb{E}^{*}(U^{*}_{1}U^{*^{\top}}_{1})$, $t_{2}-\mathbb{E}^{*}(V^{*}_{1}V^{*^{\top}}_{1})$ and $t_{3}$ converge to zero $\mathbb{P}^{*}$-almost surely, $\mathbb{P}$-almost surely, where $t_{1}=n^{-1}\sum_{j=1}^{n}U^{*}_{j}U^{*^{\top}}_{j}$, $t_{2}=n^{-1}\sum_{j=1}^{n}V^{*}_{j}V^{*^{\top}}_{j}$ and $t_{3}=n^{-1}\sum_{j=1}^{n}(U^{*}_{j}V^{*^{\top}}_{j}+V^{*}_{j}U^{*^{\top}}_{j})$.

Proof.

We will only give the proof of the convergence of $t_{1}-\mathbb{E}^{*}(U^{*}_{1}U^{*^{\top}}_{1})$ . The proofs of the convergence of the other two terms are similar. Note that $t_{1}-\mathbb{E}^{*}(U^{*}_{1}U^{*^{\top}}_{1})=n^{-1}\sum_{j=1}^{n}S_{j}$ , where $S_{j}=U^{*}_{j}U^{*^{\top}}_{j}-\mathbb{E}^{*}(U^{*}_{1}U^{*^{\top}}_{1})$ has zero mean. We will show that

[TABLE]

$\mathbb{P}$ -almost surely for any choice of ${\bf a},{\bf b}\in\mathbb{R}^{L}$ . We will then be able to conclude the first convergence by using the Borel-Cantelli lemma. Define $s_{j}={\bf a}^{\top}S_{j}{\bf b}$ (omitting the dependence on ${\bf a}$ and ${\bf b}$ for simplicity of notation). Observe that

[TABLE]

since the other terms vanish by using the fact that the $s_{j}$ ’s are i.i.d. and have zero mean. Now,

[TABLE]

We have already derived the expression of the above variance term while deriving the weak convergence of $\sqrt{n}({\bf a}^{\top}\overline{K}_{\zeta,L}{\bf b}-{\bf a}^{\top}\widehat{K}{\bf b})$ in the proof of Theorem 3 (see Step (2c) for details). Since the above variance term converges $\mathbb{P}$ -almost surely to a constant (depending on ${\bf a}$ and ${\bf b}$ ), the first term in (5.19) is $O(n^{-2})$ as $n\rightarrow\infty$ $\mathbb{P}$ -almost surely. Next, observe that

[TABLE]

We have already shown in the proof of Theorem 3 (see Step (2c) for details) that the first term in (5.20) is bounded above by a quantity which converges to a constant $\mathbb{P}$-almost surely. Further, $\widehat{\Theta}_{q}\widehat{K}_{W,L}^{-1}\widehat{\Theta}_{q}$ converges to $K_{X,L}K_{W,L}^{-1}K_{X,L}$ $\mathbb{P}$-almost surely. These facts show that the remaining term in (5.19) is $O(n^{-3})$ as $n\rightarrow\infty$, $\mathbb{P}$-almost surely. So, (5.18) is true $\mathbb{P}$-almost surely. This completes the proof of the almost sure convergence of $t_{1}-\mathbb{E}^{*}(U^{*}_{1}U^{*^{\top}}_{1})$. ∎
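The two rates in this proof are consistent with the standard fourth-moment expansion for i.i.d. zero-mean variables, $\mathbb{E}(\sum_{j=1}^{n}s_{j})^{4}=n\,\mathbb{E}(s_{1}^{4})+3n(n-1)\{\mathbb{E}(s_{1}^{2})\}^{2}$, which after division by $n^{4}$ gives one term of order $n^{-3}$ and one of order $n^{-2}$. We assume this is the expansion behind (5.19). The sketch below verifies the identity exactly by enumeration for a small hypothetical three-point distribution.

```python
from fractions import Fraction
from itertools import product

# Hypothetical zero-mean distribution: values and their probabilities.
values = [Fraction(-2), Fraction(0), Fraction(1)]
probs = [Fraction(1, 6), Fraction(1, 2), Fraction(1, 3)]

def moment(k):
    """k-th moment of a single draw."""
    return sum(p * v ** k for v, p in zip(values, probs))

def fourth_moment_of_sum(n):
    """E[(s_1 + ... + s_n)^4] by exhaustive enumeration of all outcomes."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=n):
        p, s = Fraction(1), Fraction(0)
        for i in idx:
            p *= probs[i]
            s += values[i]
        total += p * s ** 4
    return total

def fourth_moment_formula(n):
    """n E[s^4] + 3 n (n - 1) (E[s^2])^2 for i.i.d. zero-mean draws."""
    return n * moment(4) + 3 * n * (n - 1) * moment(2) ** 2
```

Both orders are summable in $n$, which is what the Borel-Cantelli argument requires.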

5.2 On the Critical Grid Size

As noted in Remark 2, the critical value $L_{\dagger}<\infty$ in Theorem 1 depends on the choice of hypothesis boundary $d$ , and the spectrum of $k_{X}$ . The purpose of this section is to show that, for a very wide variety of continuous kernels, we have

$L_{\dagger}\leq 2d+1.$

Namely, we will show the following for continuous $k_{X}$ :

•

If $k_{X}$ is strictly positive definite, and hence certainly of infinite rank, then the bound $L_{\dagger}\leq 2d+1$ holds true for any placement of the pairwise distinct grid nodes (not just for regular grids) without any additional assumptions on the form of the eigenfunctions.

•

If $k_{X}$ is positive semidefinite (whether finite or infinite rank), and the Reproducing Kernel Hilbert Space (RKHS) of $k_{X}$ contains

–

the collection of monomials $\{1,x,x^{2},...,x^{d-1}\}$ , then the bound $L_{\dagger}\leq 2d+1$ holds true for any placement of the distinct grid nodes (not necessarily equi-spaced); if the monomials are replaced by $d$ linearly independent polynomials of highest degree greater than $d-1$ , the bound $L_{\dagger}\leq 2d+1$ still holds true for all but finitely many configurations of the grid nodes.

–

the collection of the first $d$ Fourier basis elements, then the bound $L_{\dagger}\leq 2d+1$ holds true for any placement of the grid nodes; if the first $d$ Fourier elements are replaced by $d$ arbitrary linearly independent trigonometric polynomials, the bound $L_{\dagger}\leq 2d+1$ still holds true for all but finitely many configurations of the grid nodes.

–

a collection $\{1,F(x),F^{2}(x),...,F^{d-1}(x)\}$ , where $F:[0,1]\rightarrow[0,1]$ is any strictly increasing function, then the bound $L_{\dagger}\leq 2d+1$ holds true for any placement of the distinct grid nodes (not necessarily equi-spaced); if the exponents are replaced by $d$ arbitrary exponents, the bound $L_{\dagger}\leq 2d+1$ still holds true for all but finitely many configurations of the grid nodes.

–

a collection of $d$ functions $\{h_{j}\}_{j=1}^{d}$ that are linearly independent on any subset $K\subseteq[0,1]$ of positive Lebesgue measure, then the bound $L_{\dagger}\leq 2d+1$ holds true for almost all configurations of the grid nodes. Collections of functions $\{h_{j}\}_{j=1}^{d}$ of this type are ubiquitous, and include collections of linearly independent splines, or more generally of linearly independent piecewise analytic functions; such collections need not be comprised of smooth functions alone. One can easily produce examples of collections that contain nowhere differentiable functions (to see a concrete case, take $\{h_{j}\}$ to be $d$ independent realisations of a standard Brownian motion on $[0,1]$).

Notice that the eigenfunctions $\{\varphi_{n}\}$ of $k_{X}$ are by default elements of the RKHS of $k_{X}$. So if the eigensystem of $k_{X}$ includes $d$ (orthonormalised) functions as described in the cases above, then certainly so does $\mathrm{RKHS}(k_{X})$.

To show why the statements listed above hold true, let $r_{\mathrm{true}}\leq\infty$ be the true rank of $k_{X}$ . Since $k_{X}$ is continuous, it admits the Mercer expansion

$k_{X}(s,t)=\sum_{n=1}^{r_{\mathrm{true}}}\lambda_{n}\varphi_{n}(s)\varphi_{n}(t),\qquad s,t\in[0,1].$

This yields

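$k_{X}(t_{i},t_{j})=\sum_{n=1}^{r_{\mathrm{true}}}\lambda_{n}\varphi_{n}(t_{i})\varphi_{n}(t_{j}),\qquad 1\leq i,j\leq L,$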

on our grid points $\{t_{j}\}_{j=1}^{L}$ . It follows that the $L\times L$ matrix $K_{X,L}$ is represented as

$K_{X,L}=UU^{\top},$

where the $k$ -th row of $U\in\mathbb{R}^{L\times r_{\mathrm{true}}}$ is comprised of the sequence $\{\lambda_{n}^{1/2}\varphi_{n}(t_{k})\}_{n=1}^{r_{\mathrm{true}}}$ , for $1\leq k\leq L$ . Schematically,

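$U=\begin{pmatrix}\lambda_{1}^{1/2}\varphi_{1}(t_{1})&\lambda_{2}^{1/2}\varphi_{2}(t_{1})&\cdots\\ \vdots&\vdots&\\ \lambda_{1}^{1/2}\varphi_{1}(t_{L})&\lambda_{2}^{1/2}\varphi_{2}(t_{L})&\cdots\end{pmatrix},$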

where the horizontal dots signify that there may be finitely or infinitely many columns depending on whether $r_{\mathrm{true}}<\infty$ or $r_{\mathrm{true}}=\infty$.

If $K_{X,L}^{A,B}$ is the submatrix of $K_{X,L}$ obtained by retaining rows in the index set $A\subseteq\{1,...,L\}$ and columns in the index set $B\subseteq\{1,...,L\}$ , then

$K_{X,L}^{A,B}=U^{A}(U^{B})^{\top},$

where $U^{A}$ (resp. $U^{B}$) represents the submatrix of $U$ obtained by retaining rows in the index set $A$ (resp. $B$). Formally, we can view the matrices $U^{A}$ and $U^{B}$ as linear operators from $(\ell_{2})^{d}$ into $\mathbb{R}^{d}$. Continuity of $k_{X}$ ensures that they are indeed finite rank Hilbert-Schmidt operators. To see this, note that

$\|U^{A}\|_{\mathrm{HS}}^{2}=\mathrm{trace}[(U^{A})^{\top}U^{A}]=\mathrm{trace}[U^{A}(U^{A})^{\top}]=\mathrm{trace}\left[\{k_{X}(t_{i},t_{j})\}_{i,j\in A}\right]=\sum_{i\in A}k_{X}(t_{i},t_{i})<\infty.$

So $\mathrm{det}(K_{X,L}^{A,B})\neq 0$ for a pair of index sets $A,B\subseteq\{1,...,L\}$ of cardinality $d\leq L$ if and only if $U^{A}$ and $U^{B}$ are both of full column rank $d$. In summary, since $A$ and $B$ are arbitrary, we have the following implication:

$U^{A}$ of column rank $d$ for any $A\subset\{1,...,L\}$ of cardinality $d$ $\implies$ all $d$ -minors of $K_{X,L}$ are non-vanishing

We will show in Subsection 5.2.2 that $U^{A}(t_{1},...,t_{L})$ is indeed of full column rank $d$ for any index set $A$ of cardinality $d\leq r_{\mathrm{true}}$ in the scenarios described at the top of this section. First, though, we will show in Subsection 5.2.1 why $L_{\dagger}\leq 2d+1$ when the $d$-minors of $K_{X,L}$ are non-vanishing.
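The factorisation of $K_{X,L}$ through the scaled eigenfunction matrix $U$, and of its submatrices through $U^{A}$ and $U^{B}$, can be illustrated numerically. The sketch below assumes a hypothetical rank-3 kernel built from the first three Fourier functions with arbitrarily chosen eigenvalues; none of these choices come from the paper.

```python
import numpy as np

def scaled_eigen_matrix(t, lambdas, eigenfunctions):
    """U with (k, n) entry lambda_n^{1/2} * phi_n(t_k)."""
    cols = [np.sqrt(lam) * phi(t) for lam, phi in zip(lambdas, eigenfunctions)]
    return np.column_stack(cols)

# Hypothetical eigen-system: three Fourier functions on [0, 1].
eigenfunctions = [
    lambda t: np.ones_like(t),
    lambda t: np.sqrt(2.0) * np.cos(2.0 * np.pi * t),
    lambda t: np.sqrt(2.0) * np.sin(2.0 * np.pi * t),
]
lambdas = [1.0, 0.5, 0.25]

t = np.linspace(0.05, 0.95, 7)     # L = 7 = 2 * 3 + 1 grid nodes
U = scaled_eigen_matrix(t, lambdas, eigenfunctions)
K = U @ U.T                        # discretised kernel K_{X,L}

A, B = [0, 2, 5], [1, 3, 6]        # two index sets of cardinality d = 3
K_AB = K[np.ix_(A, B)]             # rows in A, columns in B
rank_UA = np.linalg.matrix_rank(U[A])
det_K_AB = np.linalg.det(K_AB)
```

Here `K_AB` coincides with `U[A] @ U[B].T`, and its determinant is non-zero because both `U[A]` and `U[B]` have full column rank.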

5.2.1 Showing that $L_{\dagger}\leq 2d+1$ When the $d$ -Minors of $K_{X,L}$ Are Non-Vanishing

Said differently, let us show that whenever all order $d$ minors of $K_{X,L}$ can be guaranteed to be non-zero, the critical value satisfies $L_{\dagger}\leq 2d+1$. We will do this by showing that when $L\geq 2d+1$, each diagonal entry $K_{X,L}(i,i)$, $1\leq i\leq L$, of $K_{X,L}$ is a (rational) function of some of the off-diagonal entries $\{K_{X,L}(i,j)\}_{i\neq j}$ (and thus the diagonal entries are uniquely imputed by the off-diagonal entries).

Assume first that $L=2r_{\mathrm{true}}+1$ exactly. Let $D$ be the $(r_{\mathrm{true}}+1)\times(r_{\mathrm{true}}+1)$ submatrix of $K_{X,L}$ obtained by retaining the last $(r_{\mathrm{true}}+1)$ rows and first $(r_{\mathrm{true}}+1)$ columns of $K_{X,L}$ . Partition $D$ into four blocks,

$D=\begin{pmatrix}u&x\\ C&v\end{pmatrix},$

where:

•

$C$ is the $r_{\mathrm{true}}\times r_{\mathrm{true}}$ submatrix of $D$ obtained by retaining the last $r_{\mathrm{true}}$ rows and first $r_{\mathrm{true}}$ columns of $D$ (equivalently, of $K_{X,L}$), so that it consists of off-diagonal entries of $K_{X,L}$ only.

•

$u$ is the $1\times r_{\mathrm{true}}$ row vector with the first $r_{\mathrm{true}}$ entries of the first row of $D$.

•

$v$ is the $r_{\mathrm{true}}\times 1$ column vector with the last $r_{\mathrm{true}}$ entries of the last column of $D$.

•

$x\in\mathbb{R}$ is the middle element on the diagonal of $K_{X,L}$, or equivalently the top right entry of the matrix $D$.

Note that $\mathrm{det}(D)=0$ (because $\mathrm{rank}(K_{X,L})=r_{\mathrm{true}}$) whereas $\mathrm{det}(C)\neq 0$ (because we are operating in the regime where $r_{\mathrm{true}}$-minors of $K_{X,L}$ are non-zero). Since $\mathrm{det}(D)=(-1)^{r_{\mathrm{true}}}\mathrm{det}(C)(x-uC^{-1}v)$, it follows that

$x=uC^{-1}v,$

showing that $x$ is a rational function of the entries of $C$ , $u$ , and $v$ . It follows that the middle element of $K_{X,L}$ is uniquely specified by the off-diagonal elements of $K_{X,L}$ .

Notice that any diagonal element of $K_{X,L}$ can be brought to the middle position of the diagonal by means of the conjugation $\Pi K_{X,L}\Pi^{\top}$ , with $\Pi$ a suitable permutation matrix. This operation maps diagonal elements onto diagonal elements, and preserves the property that $r_{\mathrm{true}}$ -minors of $K_{X,L}$ are non-vanishing. It follows that the diagonal elements of $K_{X,L}$ are uniquely determined by its off diagonal elements when $L=2r_{\mathrm{true}}+1$ .

For any $L>2r_{\mathrm{true}}+1$ one can apply the exact same procedure working with the top-left $(2r_{\mathrm{true}}+1)\times(2r_{\mathrm{true}}+1)$ submatrix of $K_{X,L}$ instead of the entire matrix $K_{X,L}$, and using permutations to bring the remaining diagonal elements in-and-out of the said submatrix. It follows that $L_{\dagger}\leq 2r_{\mathrm{true}}+1$.
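The imputation argument of this subsection can be sketched numerically. Under the assumption that the relevant $r\times r$ off-diagonal block is non-singular, the middle diagonal entry of a $(2r+1)\times(2r+1)$ rank-$r$ matrix is recovered from the lower-left $(r+1)\times(r+1)$ block via a Schur complement. The function name and the random test matrix are illustrative only.

```python
import numpy as np

def impute_middle_diagonal(K, r):
    """Recover K[r, r] (0-based) of a (2r+1) x (2r+1) rank-r matrix
    using off-diagonal entries of K only."""
    D = K[r:, : r + 1]     # last r+1 rows, first r+1 columns of K
    u = D[0, :r]           # first row of D without its last entry
    v = D[1:, r]           # last column of D without its first entry
    C = D[1:, :r]          # r x r block of off-diagonal entries of K
    # det(D) = 0 and det(C) != 0 force x = u C^{-1} v.
    return u @ np.linalg.solve(C, v)

rng = np.random.default_rng(1)
r = 3
U = rng.normal(size=(2 * r + 1, r))
K = U @ U.T                                  # rank r; generic minors are non-zero
K_noisy = K + np.diag(rng.uniform(0.5, 1.5, size=2 * r + 1))  # ridge on the diagonal
x_hat = impute_middle_diagonal(K_noisy, r)   # the ridge is never read
```

The value `x_hat` agrees with the uncontaminated entry `K[r, r]` even though only `K_noisy` was supplied, which illustrates why a diagonal ridge does not obstruct identification of the low-rank part once $L\geq 2r+1$.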

5.2.2 Covariance Spectra Guaranteeing That The $d$ -Minors of $K_{X,L}$ Are Non-Vanishing

In any scenario, we need to show (in the notation introduced in the beginning of the main Section) that

$U^{A}(t_{1},\ldots,t_{L})$ is of full column rank $d$ for any index set $A\subset\{1,...,L\}$ of cardinality $d$

We will do this by means of verifying an equivalent condition stated in the next lemma:

Lemma 4.

Let $k_{X}$ be continuous, $\{t_{1},...,t_{L}\}$ a collection of nodes, and $A\subset\{1,...,L\}$ an index set of cardinality $d$ . The matrix $U^{A}(t_{1},\ldots,t_{L})$ is of full column rank $d$ if and only if there exist $d$ functions $h_{1}(\cdot),\ldots,h_{d}(\cdot)\in\mathrm{RKHS}(k_{X})$ such that the matrix $\{h_{j}(t_{i})\}_{i\in A,1\leq j\leq d}$ is non-singular.

Proof.

Assume first that there are $d$ functions $h_{j}\in\mathrm{RKHS}(k_{X})$ such that $\{h_{j}(t_{i})\}$ is non-singular. The function $h_{j}$ being in the RKHS is equivalent to the existence of a square summable sequence $\{\theta_{j,n}\}_{n\geq 1}$ such that $h_{j}=\sum_{n\geq 1}\theta_{j,n}\lambda_{n}^{1/2}\varphi_{n}$. Using the Cauchy-Schwarz inequality and Mercer’s theorem, the series can be seen to converge uniformly and absolutely, as its square is bounded by

[TABLE]

Hence, we may write $h_{j}(t_{i})=\sum_{n\geq 1}\theta_{j,n}\lambda_{n}^{1/2}\varphi_{n}(t_{i})$ , which shows that the range of $U^{A}$ contains $d$ linearly independent vectors – namely the columns of the $d\times d$ non-singular matrix $\{h_{j}(t_{i})\}$ . It follows that $\mathrm{rank}(U^{A})=d$ .

To prove the converse, assume that $\mathrm{rank}(U^{A})=d$. Then $\mathrm{rank}(U^{A}(U^{A})^{\top})=d$ too. But $U^{A}(U^{A})^{\top}=\{k_{X}(t_{i},t_{j})\}_{i,j\in A}$. Define $h_{j}(\cdot)=k_{X}(\cdot,t_{j})$ for $j\in A$. It follows that there exist $d$ functions $h_{j}\in\mathrm{RKHS}(k_{X})$ such that $\{h_{j}(t_{i})\}$ is non-singular.

∎

Let us now revisit the scenarios enumerated earlier, in light of the Lemma:

Case where $k_{X}$ is strictly positive definite. In this case, define $h_{j}(\cdot)=k_{X}(\cdot,t_{j})$ , and notice that each such function is an element of the RKHS of $k_{X}$ . Indeed, the matrix $\{h_{j}(t_{i})\}$ is simply the matrix $\{k_{X}(t_{i},t_{j})\}_{i,j\in A}$ which is strictly positive definite by strict positive definiteness of $k_{X}$ for any $d$ pairwise distinct nodes $\{t_{j}\}_{j\in A}$ .
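As a small numerical illustration of the strictly positive definite case, the sketch below evaluates a kernel on irregular but pairwise distinct nodes and checks that the resulting matrix $\{k_{X}(t_{i},t_{j})\}$ has a strictly positive smallest eigenvalue. The Ornstein-Uhlenbeck kernel $e^{-|s-t|}$ and the nodes are hypothetical choices made for the example.

```python
import numpy as np

def kernel_matrix(kernel, nodes):
    """Matrix {k(t_i, t_j)} over the given nodes."""
    t = np.asarray(nodes, dtype=float)
    return kernel(t[:, None], t[None, :])

def ou_kernel(s, t):
    """Ornstein-Uhlenbeck kernel exp(-|s - t|), strictly positive definite."""
    return np.exp(-np.abs(s - t))

nodes = [0.03, 0.11, 0.4, 0.42, 0.9]   # irregular, pairwise distinct
G = kernel_matrix(ou_kernel, nodes)
smallest_eigenvalue = np.linalg.eigvalsh(G).min()
```

A strictly positive `smallest_eigenvalue` confirms that the functions $h_{j}=k_{X}(\cdot,t_{j})$ yield a non-singular matrix on these nodes, regardless of how irregularly they are placed.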

Cases where $k_{X}$ is positive semidefinite.

•

If $\{1,x,x^{2},...,x^{d-1}\}\subset\mathrm{RKHS}(k_{X})$, we can define $h_{j}(t_{i})=t_{i}^{j-1}$. This is a $d\times d$ Vandermonde matrix, and hence non-singular for any pairwise distinct $d$-tuple $\{t_{i}\}$. Clearly, the same discussion applies if $\{1,F(x),F^{2}(x),...,F^{d-1}(x)\}\subset\mathrm{RKHS}(k_{X})$ for any strictly increasing $F$, by defining $h_{j}(t_{i})=[F(t_{i})]^{j-1}$, and noting that $F(t_{i})=\tau_{i}$ simply yields a different grid of distinct points, so that $\{\tau_{i}^{j-1}\}$ is again a Vandermonde matrix over distinct nodes.

•

If $\mathrm{RKHS}(k_{X})$ contains $d$ linearly independent polynomials $\{h_{j}\}_{j=1}^{d}$ of highest degree greater than $d-1$, the matrix $\{h_{j}(t_{i})\}$ is a polynomial matrix of generalised Vandermonde type, with a determinant proportional to $Q_{A}(\{t_{m}\}_{m\in A})\prod_{\{(i,j)\in A\times A:\,i<j\}}(t_{i}-t_{j})$, where $Q_{A}$ is a finite degree polynomial. Hence, provided the grid points are distinct, this determinant vanishes nowhere but at the finitely many $\{t_{i}\}_{i\in A}$ satisfying polynomial restrictions dictated by the root structure of $Q_{A}$. Since there are finitely many index sets $A\subset\{1,\ldots,L\}$ of cardinality $d$, there are also finitely many corresponding polynomials $Q_{A}$, and hence only a finite number of grids $\{t_{i}\}_{i=1}^{L}$ for which $Q_{A}(\{t_{j}\}_{j\in A})$, for some choice of $A$, vanishes. The same discussion applies if $\mathrm{RKHS}(k_{X})$ contains $d$ linearly independent polynomials of highest degree greater than $d-1$, each composed with a monotone map $F$, by switching to the grid $\tau_{i}=F(t_{i})$.

•

If $\mathrm{RKHS}(k_{X})$ contains the collection of the first $d$ Fourier basis elements, or $d$ linearly independent trigonometric polynomials, the same discussion can be repeated as in the polynomial case, except with trigonometric polynomials (seen as polynomials of pairwise distinct unit modulus complex arguments).

•

If $\mathrm{RKHS}(k_{X})$ contains a collection of $d$ functions $\{h_{j}\}_{j=1}^{d}$ that are linearly independent on any subset $K\subseteq[0,1]$ of positive Lebesgue measure, we claim that $\mathrm{det}(\{h_{j}(t_{i})\}_{1\leq j\leq d,i\in A})$ is non-vanishing for almost all $d$-tuples $\{t_{i}\}_{i\in A}$. This will require a more lengthy argument, and to relax the indexing notation, we will write $\{t_{j}\}_{j\in A}=\{x_{j}\}_{j=1}^{d}$. For $q\in\{1,...,d\}$, write $H_{q}=\{h_{j}(x_{i})\}_{i,j=1}^{q}$. To show that $\mathrm{det}(H_{d})$ is non-vanishing for almost all $d$-tuples $\{x_{1},...,x_{d}\}$, we will use induction:

1. We will first prove that $\mathrm{det}[H_{1}(x_{1})]\neq 0$ almost everywhere on $[0,1]$.

2. Then we will prove that $\mathrm{det}[H_{q-1}(x_{1},...,x_{q-1})]\neq 0$ almost everywhere on $[0,1]^{q-1}$ implies that $\mathrm{det}[H_{q}(x_{1},...,x_{q})]\neq 0$ almost everywhere on $[0,1]^{q}$, for any $2\leq q\leq d$.

Step 1: Case $q=1$ . We need to show that $\mathrm{det}[H_{1}(x_{1})]\neq 0$ for almost all $x_{1}\in[0,1]$ . Equivalently, that $h_{1}(y)$ cannot vanish on a set $K\subseteq[0,1]$ of positive Lebesgue measure. Since the $d$ functions $\{h_{j}\}_{j=1}^{d}$ are linearly independent on any set of positive Lebesgue measure, $h_{1}(y)$ cannot vanish uniformly on such a set.

Step 2: Induction step. Now take $2\leq q\leq d$ , and suppose that $\mathrm{det}(H_{q-1})\neq 0$ almost everywhere on $[0,1]^{q-1}$ , but that $\mathrm{det}(H_{q})=0$ for all $(x_{1},...,x_{q})\in G_{q}$ , for some $G_{q}\subseteq[0,1]^{q}$ of positive $q$ -Lebesgue measure. We will obtain a contradiction. Note that,

[TABLE]

where $G^{x_{1},...,x_{q-1}}=\{y\in[0,1]:(x_{1},...,x_{q-1},y)\in G_{q}\}\subseteq[0,1]$ is the $(x_{1},...,x_{q-1})$-section of $G_{q}$. It follows that $\mathrm{Leb}_{1}(G^{x_{1},...,x_{q-1}})>0$ for all $(x_{1},...,x_{q-1})$ in a set $G_{q-1}\subseteq[0,1]^{q-1}$ of positive $(q-1)$-Lebesgue measure, i.e. $\mathrm{Leb}_{q-1}(G_{q-1})>0$.

With this observation in mind, we use the Leibniz formula for the determinant to translate the statement that

[TABLE]

into the equivalent statement

[TABLE]

where $\mathrm{Sym}\{1,...,q\}$ denotes the group of permutations on $\{1,...,q\}$ , and $\mathrm{sgn}(\pi)$ is the signature of a permutation $\pi$ . Thus, for any $(q-1)$ -tuple $(x_{1},\ldots,x_{q-1})\in G_{q-1}$ , we may view the last expression as a function of the last coordinate $y=x_{q}$ , and write

[TABLE]

Regrouping the summations now yields

[TABLE]

where for $1\leq i\leq q$ , the mapping $\rho_{i}(j)$ gives the rank (in the sense of sequentially increasing order) of any $j\in\{1,...,q\}\setminus\{i\}$ , thus providing a bijection between $\{1,...,q\}\setminus\{i\}$ and $\{1,...,q-1\}$ .

But we have observed that $G^{x_{1},...,x_{q-1}}$ has positive $\mathrm{Leb}_{1}$-measure for any $(x_{1},...,x_{q-1})\in G_{q-1}$, and the $\{h_{i}\}_{i=1}^{q}$ are linearly independent on any set of positive $\mathrm{Leb}_{1}$-measure. Hence, it must be that

[TABLE]

for any $(q-1)$ -tuple $(x_{1},...,x_{q-1})\in G_{q-1}$ , where $\mathrm{Leb}_{q-1}(G_{q-1})>0$ . But now notice that

[TABLE]

because $\rho_{q}(j)=j$ for any $j\in\{1,...,q-1\}$. We have thus arrived at a contradiction of our induction assumption that $\mathrm{det}[H_{q-1}(x_{1},...,x_{q-1})]\neq 0$ almost everywhere on $[0,1]^{q-1}$.
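Returning to the monomial scenario among the positive semidefinite cases above, the non-singularity of $\{t_{i}^{j-1}\}$ over distinct nodes can be checked against the classical closed form $\prod_{i<j}(t_{j}-t_{i})$ of the Vandermonde determinant. The nodes and the map $F(t)=t^{3}$ below are arbitrary illustrative choices.

```python
import numpy as np
from itertools import combinations

def monomial_matrix(nodes):
    """Matrix {h_j(t_i)} with h_j(t) = t^(j-1), j = 1, ..., d."""
    return np.vander(np.asarray(nodes, dtype=float), increasing=True)

def vandermonde_det(nodes):
    """Closed form: product of (t_j - t_i) over all pairs i < j."""
    return float(np.prod([b - a for a, b in combinations(nodes, 2)]))

nodes = [0.1, 0.25, 0.6, 0.95]            # pairwise distinct, not equi-spaced
det_numeric = np.linalg.det(monomial_matrix(nodes))
det_closed_form = vandermonde_det(nodes)

# A strictly increasing map F just relabels the nodes as tau_i = F(t_i).
tau = [t ** 3 for t in nodes]
det_transformed = np.linalg.det(monomial_matrix(tau))
```

Both determinants are non-zero, since each is a product of differences of pairwise distinct nodes.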

5.3 On the Invertibility of the Hessian $\nabla^{2}\Psi$

The purpose of this section is to further analyse Assumption (H), used to deduce the large sample distribution of $nT_{q}$ under $H_{0,q}$ ,

Assumption (H):

Under $H_{0,q}$ , there exists a factorisation $K_{X,L}=C_{0}C_{0}^{\top}$ , where $C_{0}\in\mathbb{R}^{L\times q}$ , such that $\mathrm{det}(\nabla^{2}\Psi(C_{0}))\neq 0$ .

In particular, we will show:

1. That Assumption (H) is satisfied if Assumption (E) below holds true:

Assumption (E):

The $q$ leading eigenvectors of $K_{X,L}$ have non-zero entries.

2. That Assumption (E) is satisfied in the scenarios mentioned in Remark 2 and listed in detail at the beginning of Section 5.2, for almost all grids $t_{1}<...<t_{L}$.

To show the first point, choose $C$ to be

[TABLE]

i.e. exactly as in Equation (5.22), which reduces to an equation for an $L\times q$ matrix under $H_{0,q}$ . Mercer’s theorem implies that, indeed, $K_{X,L}=CC^{\top}$ . Let

$C=V\Gamma W^{\top}$

be the singular value decomposition of $C$ , where $\Gamma$ is diagonal, and $V$ and $W$ are orthogonal. Define

$H=CW=V\Gamma,$

and note that

[TABLE]

In particular note that the first and second line imply that $V$ has the leading $q$ eigenvectors of $K_{X,L}$ as its columns. Our aim will be to show that choosing the $L\times q$ factor $H$ of $K_{X,L}$ yields

[TABLE]

provided $H$ has non-zero entries – equivalently, provided $V$ has non-zero entries, i.e. Assumption (E) holds true. Note that the form of the Hessian at any $C\in\mathbb{R}^{L\times q}$ has been shown to be

[TABLE]

where $\circ$ is the Hadamard product, $P_{m}$ is the $m\times m$ matrix containing $0$’s on the diagonal and $1$’s everywhere else, and $M$ is the order $(L,q)$ commutation matrix, i.e. the unique permutation matrix satisfying

$M\,\mathrm{vec}(R)=\mathrm{vec}(R^{\top})$

for any $R\in\mathbb{R}^{L\times q}$ . Plugging in $H$ , the Hessian $\nabla^{2}\Psi(H)$ reduces to

[TABLE]

because $HH^{\top}=K_{X,L}$ and $H^{\top}H=\Gamma^{2}$ is a diagonal matrix. Therefore, it suffices to show that

[TABLE]

To this aim, we will make use of two Lemmas, the first of which probes the structure of $(H^{\top}\otimes H)M$ :

Lemma 5.

The $(Lq)\times(Lq)$ matrix $(H^{\top}\otimes H)M$ is a $q\times q$ block matrix of $L\times L$ rank 1 blocks $\{H_{ij}\}_{i,j=1}^{q}$ , defined as

[TABLE]

i.e. $H_{j}$ is the $j$ th column of $H$ .

Proof.

Let $v,u\in\mathbb{R}^{Lq}$ and define $A,B$ to be the $L\times q$ matrices such that $v=\mathrm{vec}(A)$ and $u=\mathrm{vec}(B)$ . Then, for $\langle\cdot,\cdot\rangle_{F}$ the Frobenius inner product, and recalling that $M$ is the order $(L,q)$ commutation matrix, we may write

[TABLE]

Let $G$ be the stipulated block matrix. We will now show that

[TABLE]

thus establishing that $G=(H^{\top}\otimes H)M$ , by arbitrary choice of $u,v$ . To this aim, partition $A$ and $B$ as

[TABLE]

where $A_{j},B_{j}\in\mathbb{R}^{L}$ are the $j$ th columns of $A$ and $B$ , respectively. This partitions the coordinates of $u$ and $v$ into groups of $L$ ,

[TABLE]

We can now calculate

[TABLE]

and

[TABLE]

so that

[TABLE]

The $i$ th diagonal element of the last expression is seen to be equal to $\sum_{j=1}^{q}B_{i}^{\top}H_{j}A_{j}^{\top}H_{i}$ . Consequently,

[TABLE]

Now we turn our attention to $v^{\top}Gv$ which is seen to be

[TABLE]

Upon observing that the last line coincides with expression (5.29), the proof is complete.

∎
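The commutation matrix used in this lemma can be built explicitly. The sketch below assumes the column-stacking convention for $\mathrm{vec}$ and checks two facts: that $M\,\mathrm{vec}(A)=\mathrm{vec}(A^{\top})$, and that $(H^{\top}\otimes H)M\,\mathrm{vec}(A)=\mathrm{vec}(HA^{\top}H)$, which follows from the standard identity $\mathrm{vec}(AXB)=(B^{\top}\otimes A)\mathrm{vec}(X)$. The dimensions and random matrices are illustrative.

```python
import numpy as np

def vec(A):
    """Stack the columns of A (column-major vectorisation)."""
    return A.reshape(-1, order="F")

def commutation_matrix(L, q):
    """Permutation matrix M with M @ vec(R) == vec(R.T) for any L x q matrix R."""
    M = np.zeros((L * q, L * q))
    for i in range(L):
        for j in range(q):
            # R[i, j] sits at position i + j*L in vec(R)
            # and at position j + i*q in vec(R.T).
            M[j + i * q, i + j * L] = 1.0
    return M

rng = np.random.default_rng(2)
L, q = 4, 2
H = rng.normal(size=(L, q))
A = rng.normal(size=(L, q))
M = commutation_matrix(L, q)

lhs = np.kron(H.T, H) @ M @ vec(A)   # action of (H^T ⊗ H) M on vec(A)
rhs = vec(H @ A.T @ H)               # the same action written in matrix form
```

Writing the action of $(H^{\top}\otimes H)M$ as $A\mapsto HA^{\top}H$ is often the easiest way to read off its block structure column by column.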

The second Lemma is a standard fact about Hadamard products, stated here without proof for completeness (see, e.g., Horn and Johnson (2012)).

Lemma 6.

Let $x,y\in\mathbb{R}^{m}$ and $A\in\mathbb{R}^{m\times m}$ . Then,

$(xy^{\top})\circ A=\Delta_{x}A\Delta_{y},$

for $\Delta_{v}$ the $m\times m$ diagonal matrix with the elements of $v\in\mathbb{R}^{m}$ on its diagonal.
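A quick numerical check of the standard Hadamard identity $(xy^{\top})\circ A=\Delta_{x}A\Delta_{y}$, which we assume is the fact intended by this lemma, can be written as follows; the vectors and matrix are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 5
x, y = rng.normal(size=m), rng.normal(size=m)
A = rng.normal(size=(m, m))

lhs = np.outer(x, y) * A             # Hadamard product (x y^T) ∘ A
rhs = np.diag(x) @ A @ np.diag(y)    # Delta_x A Delta_y
```

Entry $(i,j)$ of both sides equals $x_{i}A_{ij}y_{j}$, which is why rank one Hadamard factors can be pulled out as diagonal scalings.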

Using the fact that the entries of $H$ are assumed non-zero, and armed with the last two Lemmas, we will now show that $\mathrm{det}\Big{\{}P_{qL}\circ\{(H^{\top}\otimes H)M\}\Big{\}}\neq 0$ . Clearly, it suffices to show that $\mathrm{det}\Big{\{}Q[P_{qL}\circ\{(H^{\top}\otimes H)M\}]Q^{\top}\Big{\}}\neq 0$ for any non-singular $(Lq)\times(Lq)$ matrix $Q$ . Define $Q$ as a block diagonal matrix comprised of $q^{2}$ blocks of dimension $L\times L$ ,

[TABLE]

The matrix $Q$ is well defined since the entries of $H$ are all non-zero, and is clearly of full rank. Moreover, by Lemma 5,

[TABLE]

We claim that the last matrix is equal to $P_{qL}$ . To show this, we will show that:

•

Diagonal blocks equal $\bm{P_{L}}$. The typical diagonal block is of the form $Q_{i}(P_{L}\circ H_{i,i})Q_{i}^{\top}=Q_{i}(P_{L}\circ H_{i}H_{i}^{\top})Q_{i}^{\top}=Q_{i}\Delta_{H_{i}}P_{L}\Delta_{H_{i}}Q_{i}^{\top}$, where $\Delta_{H_{i}}$ is the diagonal matrix whose diagonal contains the elements of the column vector $H_{i}$, and we have made use of Lemma 6 to obtain the last equality. Now we calculate

[TABLE]

where $E$ is known as the exchange matrix of order $L$ . Noting that $\Delta_{H_{i}}Q_{i}^{\top}=(Q_{i}\Delta_{H_{i}})^{\top}$ and $E^{\top}=E$ , we conclude that the typical diagonal block equals $EP_{L}E=P_{L}$ .

•

Off-diagonal blocks equal $\mathbf{1}_{L}\mathbf{1}_{L}^{\top}$ . The typical off-diagonal block is of the form $Q_{i}H_{i,j}Q_{j}^{\top}=Q_{i}H_{i}H_{j}^{\top}Q_{j}^{\top}=(Q_{i}H_{i})(Q_{j}H_{j})^{\top}$ , and the latter equals

[TABLE]

In summary, we have shown that

[TABLE]

provided $H$ has non-zero entries. Now we use the fact that $\mathrm{det}(P_{qL})=(qL-1)(-1)^{qL-1}$ (which can be readily checked by row reduction) to conclude that the determinant is non-zero provided that $H=V\Gamma$ has non-zero entries. In summary, we have established:

Assumption (E) $\implies$ Assumption (H)
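The block computation above can be sanity-checked on a small instance. The sketch below rests on assumptions that are not spelled out in this section: that $P_{n}$ denotes the $n\times n$ matrix with zeros on the diagonal and ones elsewhere, that the masked matrix has $(i,j)$ block $H_{i}H_{j}^{\top}$ with its main diagonal set to zero, and that $Q_{i}=E\Delta_{H_{i}}^{-1}$, which is what $Q_{i}\Delta_{H_{i}}=E$ forces. The matrix `H` and all function names are hypothetical.

```python
from fractions import Fraction


def off_diagonal_mask(n):
    """Assumed form of P_n: zeros on the diagonal, ones elsewhere."""
    return [[Fraction(int(i != j)) for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def det(A):
    """Exact determinant by Gaussian elimination over the rationals."""
    A = [row[:] for row in A]
    n, sign, value = len(A), 1, Fraction(1)
    for k in range(n):
        pivot = next((r for r in range(k, n) if A[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != k:
            A[k], A[pivot] = A[pivot], A[k]
            sign = -sign
        value *= A[k][k]
        for r in range(k + 1, n):
            factor = A[r][k] / A[k][k]
            A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    return sign * value


L, q = 3, 2
# Hypothetical H (L x q) with non-zero entries; column i plays the role of H_i.
H = [[Fraction(2), Fraction(-1)],
     [Fraction(1, 3), Fraction(5)],
     [Fraction(-4), Fraction(7, 2)]]
h = [H[r][c] for c in range(q) for r in range(L)]  # columns of H stacked
n = q * L
P = off_diagonal_mask(n)

# Masked matrix with blocks H_i H_j^T; the mask zeroes only the main diagonal.
M = [[P[a][b] * h[a] * h[b] for b in range(n)] for a in range(n)]

# Q = blockdiag(Q_1, ..., Q_q) with Q_i = E * Delta_{H_i}^{-1} (assumed form):
# row a of block i has the single entry 1 / H[L-1-a][i] in column L-1-a.
Q = [[Fraction(0)] * n for _ in range(n)]
for i in range(q):
    for a in range(L):
        Q[i * L + a][i * L + (L - 1 - a)] = 1 / H[L - 1 - a][i]

conjugated = matmul(matmul(Q, M), transpose(Q))
print(conjugated == P)                        # conjugation recovers P_{qL}
print(det(P) == (n - 1) * (-1) ** (n - 1))    # determinant formula for P_n
print(det(M) != 0)                            # hence the masked matrix is non-singular
```

Replacing any entry of `H` by zero makes the corresponding $\Delta_{H_{i}}$ singular, so `Q` can no longer be formed, which is where the non-zero requirement on $H$ enters.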

Now we move to the second point, i.e. showing that Assumption (E) is satisfied in the scenarios listed in the beginning of Section 5.2, for almost all grids $t_{1}<\ldots<t_{L}$. To see this, recall the definition of $H$ as

[TABLE]

where the function $h_{j}$ is a linear combination of the scaled eigenfunctions,

[TABLE]

Now suppose that the collection $\{\varphi_{1},\ldots,\varphi_{q}\}$ remains linearly independent when restricted to any subset of $[0,1]$ of positive Lebesgue measure (as is the case in all of the scenarios listed at the top of Section 5.2). Then $\{h_{1},\ldots,h_{q}\}$ has the same property. To see this, let $G\subset[0,1]$ be any subset of positive measure and assume that $\alpha_{1}h_{1}(u)+\ldots+\alpha_{q}h_{q}(u)=(h_{1}(u),\ldots,h_{q}(u))\bm{\alpha}=0$ for all $u\in G$. By definition, this implies that $(\lambda^{1/2}_{1}\varphi_{1}(u),\ldots,\lambda^{1/2}_{q}\varphi_{q}(u))(W\bm{\alpha})=0$ on $G$, which can only happen if $W\bm{\alpha}=0$, because $\{\varphi_{1},\ldots,\varphi_{q}\}$ are linearly independent on $G$ and the scaling factors $\lambda_{k}^{1/2}$ are non-zero. Since $W$ is orthogonal, $W\bm{\alpha}=0$ implies that $\alpha_{1}=\ldots=\alpha_{q}=0$. In particular, $h_{j}(u)\neq 0$ almost everywhere on $[0,1]$ for each $j$: if some $h_{j}$ vanished on a set of positive measure, then $\{h_{1},\ldots,h_{q}\}$ would be linearly dependent on that set. Therefore, $H$ has non-zero entries for almost all grids $\{t_{1},\ldots,t_{L}\}$, or equivalently, Assumption (E) is satisfied for almost all grids.
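To illustrate what "almost all grids" means, here is a small exact computation under assumptions that are not taken from the paper: a hypothetical rank-2 process with eigenfunctions $\varphi_{1}(u)=1$ and $\varphi_{2}(u)=\sqrt{3}(2u-1)$, eigenvalues $\lambda_{1}=1$ and $\lambda_{2}=1/3$, a rational rotation matrix $W$, and the layout $H_{ij}=h_{j}(t_{i})$. Polynomials are linearly independent on any set of positive measure, so the argument above applies.

```python
from fractions import Fraction as F


def scaled_eigenfunctions(u):
    """Row (lambda_1^{1/2} phi_1(u), lambda_2^{1/2} phi_2(u)) = (1, 2u - 1)."""
    return [F(1), 2 * u - 1]


# A rational orthogonal matrix W (a rotation), chosen only for illustration.
W = [[F(3, 5), F(-4, 5)],
     [F(4, 5), F(3, 5)]]


def h(u):
    """Row (h_1(u), h_2(u)) = (scaled eigenfunctions at u) times W."""
    g = scaled_eigenfunctions(u)
    return [sum(g[k] * W[k][j] for k in range(2)) for j in range(2)]


def build_H(grid):
    """L x q matrix with entries h_j(t_i) (assumed layout of H)."""
    return [h(t) for t in grid]


def has_nonzero_entries(H):
    return all(entry != 0 for row in H for entry in row)


generic_grid = [F(k, 10) for k in range(1, 10)]    # 0.1, 0.2, ..., 0.9
exceptional_grid = [F(1, 8), F(1, 2), F(3, 4)]     # contains the zero of h_1

print(has_nonzero_entries(build_H(generic_grid)))
print(has_nonzero_entries(build_H(exceptional_grid)))
```

In this example $h_{1}$ vanishes only at $u=1/8$ and $h_{2}$ has no zero in $[0,1]$, so the only grids for which $H$ has a zero entry are those containing the point $1/8$, a set of grids of Lebesgue measure zero.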

Bibliography

Amini, A. A. and Wainwright, M. J. (2012). Sampled forms of functional PCA in reproducing kernel Hilbert spaces. Ann. Statist., 40(5):2483–2510.
Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221.
Carey, J., Liedo, P., Müller, H., Wang, J., and Chiou, J. (1998). Relationship of age patterns of fecundity to mortality, longevity, and lifetime reproduction in a large cohort of Mediterranean fruit fly females. J. of Gerontology - Biological Sciences, 53:245–251.
Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. Tech. Report arXiv:1509.03025.
Ferraty, F. and Vieu, P. (2006). Nonparametric functional data analysis: Theory and practice. Springer Series in Statistics. Springer, New York.
Hall, P. and Vial, C. (2006). Assessing the finite dimensionality of functional data. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(4):689–705.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2):179–185.
Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.
