A kernel- and optimal transport-based test of independence between covariates and right-censored lifetimes

David Rindt; Dino Sejdinovic; David Steinsaltz

arXiv:1906.03866·math.ST·November 3, 2020


TL;DR

This paper introduces optHSIC, a nonparametric independence test for censored lifetimes using optimal transport to handle censoring, enabling more flexible detection of dependencies than traditional methods.

Contribution

The paper presents a novel kernel-based independence test that effectively manages right-censored data through optimal transport, extending the applicability of dependence testing in survival analysis.

Findings

1. optHSIC controls type 1 error under independent censoring
2. It has greater power than Cox regression against various alternatives
3. Effective even when censoring depends on covariates

Abstract

We propose a nonparametric test of independence, termed optHSIC, between a covariate and a right-censored lifetime. Because the presence of censoring creates a challenge in applying the standard permutation-based testing approaches, we use optimal transport to transform the censored dataset into an uncensored one, while preserving the relevant dependencies. We then apply a permutation test using the kernel-based dependence measure as a statistic to the transformed dataset. The type 1 error is proven to be correct in the case where censoring is independent of the covariate. Experiments indicate that optHSIC has power against a much wider class of alternatives than Cox proportional hazards regression and that it has the correct type 1 control even in the challenging cases where censoring strongly depends on the covariate.

Tables

Table 1: An overview of the type 1 error results obtained for optHSIC in the case when $C\perp\!\!\!\perp X$ and when $C\not\perp\!\!\!\perp X$. $B$ denotes the number of permutations. The $p$-value is defined in Algorithm 3.

	Correct type 1 error rate in theory: $P_{H_{0}}(p\leq\alpha)\leq\alpha$	Distribution of $p$ under $H_{0}$	Correct type 1 error rate in simulations
$C\perp\!\!\!\perp X$	Yes (Theorem 5.1)	$\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$ (Theorem 5.1)	Yes (Sec 7.1)
$C\not\perp\!\!\!\perp X$	Unknown	Unknown, but experiments suggest approximately $\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$ (Figure 10 Appendix)	Yes (Sec 7.1)

Table 2: The 8 scenarios in which we simulate the type 1 error rate. $N_{10}(0,\Sigma)$ is a 10-dimensional Gaussian random variable, with $0$ denoting a vector of zeros of length 10, and $\Sigma=MM^{T}$ where $M$ is a $10\times 10$ matrix of i.i.d. $N(0,1)$ entries. Note that $C$ depends on $X$ in D.3-8.

D.	$X$	$T \mid X$	$C \mid X$	$\%\ \delta = 1$
1	$Unif [- 1, 1]$	$Unif [0, 1]$	$Unif [0, 1.5]$	$66 %$
2	$Unif [- 1, 1]$	$Exp (mean = 5 / 2)$	$Exp (mean = 5 / 3)$	$40 %$
3	$Unif [- 1, 1]$	$Exp (mean = 2 / 3)$	$Exp (mean = \exp (X))$	$60 %$
4	$Unif [- 1, 1]$	$Exp (mean = 1.6)$	$Exp (mean = \exp (9 X^{2}))$	$60 %$
5	$Unif [- 1, 1]$	$Exp (mean = 0.9)$	$Weib (shape = 1.75 X + 3.25)$	$60 %$
6	$Unif [- 1, 1]$	$Exp (mean = 0.9)$	$1 + X$	$60 %$
7	$N_{10} (0, Σ)$	$Exp (mean = 0.6)$	$Exp (mean = \exp (1^{T} X))$	$60 %$
8	$N_{10} (0, Σ)$	$Exp (mean = 0.6)$	$Exp (mean = \exp (X_{1} / 8))$	$60 %$

Table 3: The rejection rate of optHSIC for the distributions D.1-8 of Table 2. Note the type 1 error rate is very close to the level $\alpha=0.05$ as desired, also in the scenarios where $C\not\perp\!\!\!\perp X$.

$n =$	$40$	$80$	$120$	$160$	$200$	$240$	$280$	$320$	$360$	$400$
D.1	0.047	0.053	0.051	0.050	0.054	0.051	0.045	0.056	0.049	0.048
D.2	0.049	0.051	0.057	0.044	0.053	0.050	0.047	0.049	0.052	0.048
D.3	0.052	0.052	0.048	0.049	0.044	0.050	0.048	0.054	0.054	0.048
D.4	0.050	0.054	0.049	0.054	0.054	0.054	0.050	0.053	0.050	0.046
D.5	0.050	0.056	0.056	0.051	0.053	0.051	0.046	0.055	0.050	0.049
D.6	0.050	0.049	0.048	0.047	0.048	0.057	0.050	0.050	0.048	0.050
D.7	0.050	0.055	0.047	0.048	0.056	0.054	0.054	0.052	0.051	0.048
D.8	0.041	0.050	0.051	0.048	0.056	0.049	0.054	0.050	0.054	0.052

Table 4: The 8 scenarios in which we test the power in the main text. $N_{10}(0,\Sigma)$ is a 10-dimensional Gaussian random variable, with $0$ denoting a vector of zeros of length 10, and $\Sigma=MM^{T}$ where $M$ is a $10\times 10$ matrix of i.i.d. $N(0,1)$ entries. In these distributions about $60\%$ of the individuals are observed. Distributions with dependent censoring and varying censoring percentages are discussed in Section A.10.2.

D.	$X$	$T \mid X$	$C \mid X$
1	$N (0, 1)$	$Exp (mean = \exp (X / 6))$	$Exp (mean = 1.5)$
2	$N (0, 1)$	$Exp (mean = \exp (X^{2} / 5))$	$Exp (mean = 2.25)$
3	$Unif [- 1, 1]$	$Weib (shape = 1.75 X + 3.25)$	$Exp (mean = 1.75)$
4	$Unif [- 1, 1]$	$N (100 - X, 2 X + 5.5)$	$82 + Exp (mean = 35)$
5	$N_{10} (0, Σ)$	$Exp (mean = \exp (1^{T} X / 30))$	$Exp (mean = 1.5)$
6	$N_{10} (0, Σ)$	$Exp (mean = \exp (X_{1} / 8))$	$Exp (mean = 1.5)$
7	$N_{10} (0, Σ)$	$Exp (mean = \exp (X_{1}^{2} / 4))$	$Exp (mean = 10)$
8	$N_{10} (0, Σ)$	$Exp (mean = \exp (X_{1}^{2} / 4 + X_{2} / 7))$	$Exp (mean = 8)$

Table 5: An example dataset $D$ for which we will demonstrate the transformation.

$i$	$x_{i}$	$z_{i}$	$δ_{i}$
1	$1.3$	13	1
2	$0.5$	22	0
3	$0.3$	24	1
4	$- 1.1$	45	1
5	$- 0.9$	81	0

Table 6: The rejection rate of zHSIC against the distributions D.1-8 of Table 2.

$n =$	$40$	$80$	$120$	$160$	$200$	$240$	$280$	$320$	$360$	$400$
D.1	0.048	0.050	0.051	0.052	0.052	0.049	0.047	0.054	0.051	0.049
D.2	0.048	0.051	0.047	0.047	0.049	0.051	0.048	0.054	0.050	0.047
D.3	0.243	0.461	0.630	0.774	0.860	0.909	0.953	0.970	0.985	0.991
D.4	0.142	0.232	0.343	0.487	0.610	0.734	0.812	0.880	0.932	0.959
D.5	0.075	0.116	0.142	0.168	0.210	0.243	0.278	0.316	0.357	0.397
D.6	0.932	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
D.7	0.308	0.594	0.761	0.856	0.906	0.937	0.960	0.971	0.980	0.984
D.8	0.078	0.122	0.152	0.211	0.264	0.307	0.355	0.388	0.439	0.467

Table 7: The rejection rate of wHSIC against the distributions D.1-8 of Table 2.

$n =$	$40$	$80$	$120$	$160$	$200$	$240$	$280$	$320$	$360$	$400$
D.1	0.045	0.047	0.049	0.050	0.049	0.051	0.049	0.054	0.048	0.049
D.2	0.050	0.047	0.048	0.047	0.050	0.048	0.047	0.045	0.048	0.049
D.3	0.079	0.166	0.235	0.326	0.410	0.466	0.540	0.597	0.658	0.700
D.4	0.161	0.204	0.232	0.270	0.309	0.350	0.394	0.456	0.506	0.549
D.5	0.057	0.084	0.110	0.131	0.163	0.191	0.216	0.258	0.274	0.299
D.6	0.267	0.672	0.900	0.971	0.990	0.998	0.999	0.999	1.000	1.000
D.7	0.071	0.107	0.143	0.192	0.260	0.305	0.369	0.412	0.480	0.527
D.8	0.057	0.078	0.081	0.095	0.120	0.142	0.153	0.162	0.177	0.190

Table 8: The rejection rate of the Cox proportional hazards likelihood ratio test against the distributions D.1-8 of Table 2.

$n =$	$40$	$80$	$120$	$160$	$200$	$240$	$280$	$320$	$360$	$400$
D.1	0.056	0.054	0.050	0.047	0.055	0.048	0.048	0.055	0.050	0.051
D.2	0.055	0.056	0.055	0.050	0.055	0.051	0.049	0.050	0.052	0.050
D.3	0.057	0.056	0.051	0.053	0.054	0.051	0.046	0.057	0.048	0.050
D.4	0.057	0.061	0.050	0.051	0.054	0.058	0.049	0.053	0.051	0.051
D.5	0.058	0.058	0.055	0.052	0.053	0.053	0.048	0.054	0.048	0.049
D.6	0.053	0.051	0.051	0.050	0.046	0.048	0.057	0.053	0.054	0.055
D.7	0.148	0.094	0.084	0.062	0.063	0.061	0.058	0.055	0.060	0.058
D.8	0.143	0.086	0.074	0.064	0.067	0.058	0.059	0.058	0.061	0.060

Table 9: The parametrized distributions to test the power under different censoring rates. Here $\Sigma_{10}=MM^{T}$ where $M$ is a $10\times 10$ matrix of i.i.d. standard normal entries. $M$ is sampled once and then kept fixed. The parameter $\theta$ varies such that $20, 40, 60, 80, 100\%$ of the individuals are observed (i.e. $\Delta=1$). The sample size is $n=200$ in each case.

D.	$Z \mid X$	$C \mid X$	$X$
1	$Exp (mean = \exp (X / 5))$	$Exp (mean = θ)$	$N (0, 1)$
2	$Exp (mean = \exp (X^{2}) / 5)$	$Exp (mean = θ \exp (X))$	$N (0, 1)$
3	$Weib (shape = 1.75 X + 3.25)$	$Exp (mean = θ X^{2})$	$Unif [- 1, 1]$
4	$N (mean = 100 - X, var = 2 X + 5.5)$	$82 + Exp (mean = θ)$	$Unif [- 1, 1]$
5	$Exp (mean = \exp (1^{T} X / 30))$	$Exp (mean = θ)$	$N_{10} (0, cov = Σ_{10})$
6	$Exp (mean = \exp (X_{4} / 7))$	$Exp (mean = θ \exp (1^{T} X / 30))$	$N_{10} (0, cov = Σ_{10})$
7	$Exp (mean = \exp (X_{4}^{2} / 20))$	$Exp (mean = θ \exp (X_{2}^{2}) / 20)$	$N_{10} (0, cov = Σ_{10})$
8	$Exp (mean = \exp (X_{10}^{2} + 2 X_{8}) / 20)$	$Exp (mean = θ \exp (X_{2} / 7))$	$N_{10} (0, cov = Σ_{10})$

Table 10: The rejection rates of the various methods against distributions D.1-D.8 given in Table 9. When $C\not\perp\!\!\!\perp X$, we only show rejection rates of the CPH test and optHSIC, because wHSIC and zHSIC have highly inflated rejection rates due to the dependency of $C$ and $X$. The top row shows the percentage of observed events ($\Delta=1$).

$\%\ \Delta = 1$		$20\%$	$40\%$	$60\%$	$80\%$	$100\%$
D.1	Cph	0.243	0.422	0.565	0.705	0.753
	optHSIC	0.229	0.382	0.501	0.634	0.699
	wHSIC	0.038	0.062	0.182	0.395	0.703
	zHSIC	0.066	0.168	0.329	0.525	0.701
D.2	Cph	0.180	0.267	0.268	0.225	0.108
	optHSIC	0.087	0.171	0.258	0.378	0.686
D.3	Cph	0.073	0.056	0.107	0.223	0.288
	optHSIC	0.242	0.177	0.399	0.886	0.968
D.4	Cph	0.187	0.091	0.064	0.046	0.039
	optHSIC	0.346	0.224	0.275	0.509	0.779
	wHSIC	0.138	0.285	0.452	0.654	0.770
	zHSIC	0.105	0.172	0.274	0.410	0.759
D.5	Cph	0.315	0.487	0.610	0.705	0.836
	optHSIC	0.268	0.439	0.546	0.629	0.775
	wHSIC	0.055	0.072	0.169	0.409	0.786
	zHSIC	0.083	0.229	0.362	0.605	0.760
D.6	Cph	0.461	0.732	0.834	0.916	0.939
	optHSIC	0.396	0.681	0.801	0.876	0.952
D.7	Cph	0.055	0.068	0.077	0.078	0.107
	optHSIC	0.043	0.100	0.134	0.289	0.669
D.8	Cph	0.162	0.313	0.431	0.517	0.572
	optHSIC	0.164	0.335	0.498	0.619	0.916

Table 11: The 4 scenarios in which we perform two-sample tests. In D.3, $T_{1}$ is $0.43$ w.p. $0.75$ and $1.39+\text{Exp}(1)$ w.p. $0.25$. Note that in D.4 the null hypothesis holds.

D.	$T_{0}$	$T_{1}$	$C_{0}$	$C_{1}$	% Observed
1	$Exp (1)$	$Exp (1 / 1.6)$	$Exp (1 / 2)$	$Exp (1 / 2)$	60 %
2	$Weib (1, 5)$	$Weib (1, 1.5)$	$Exp (1 / 2)$	$Exp (1 / 2)$	60 %
3	$Exp (1)$	$(0.43,\ 1.39 + Exp (1))$	$1 + Exp (1 / 2)$	$1 + Exp (1 / 2)$	90 %
4	$Exp (1)$	$Exp (1)$	$Exp (2)$	None	65 %

Equations

$$\langle f,k(x,\cdot)\rangle=f(x)\quad\text{for all } f\in H,\ x\in\mathcal{X}.$$

$$\operatorname{MMD}(P,Q)\coloneqq\|\mu_{P}-\mu_{Q}\|_{H_{\mathcal{X}}},$$

$$\|\mu_{P}-\mu_{Q}\|_{H_{\mathcal{X}}}=\sup_{f\in H_{\mathcal{X}}:\,\|f\|\leq 1}\bigl(E_{P}f(X)-E_{Q}f(X)\bigr)$$

$$K((x,y),(x^{\prime},y^{\prime}))\coloneqq k(x,x^{\prime})\,l(y,y^{\prime}).$$

$$\operatorname{HSIC}(X,Y)\coloneqq\|\mu_{P_{XY}}-\mu_{P_{X}P_{Y}}\|_{H}^{2}$$

$$\operatorname{HSIC}(X,Y)=E_{XY}E_{X^{\prime}Y^{\prime}}k(X,X^{\prime})l(Y,Y^{\prime})+E_{XX^{\prime}}k(X,X^{\prime})E_{YY^{\prime}}l(Y,Y^{\prime})-2E_{XY}E_{X^{\prime}Y^{\prime\prime}}k(X,X^{\prime})l(Y,Y^{\prime\prime}).$$

$$\operatorname{HSIC}(D)\coloneqq\bigg\|\frac{1}{n}\sum_{i=1}^{n}K((x_{i},y_{i}),\cdot)-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}K((x_{i},y_{j}),\cdot)\bigg\|_{H}^{2}.$$

$$\operatorname{HSIC}(D)=\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(x_{i},x_{j})l(y_{i},y_{j})+\frac{1}{n^{4}}\sum_{i,j,q,r=1}^{n}k(x_{i},x_{j})l(y_{q},y_{r})-\frac{2}{n^{3}}\sum_{i,j,r=1}^{n}k(x_{i},x_{j})l(y_{i},y_{r}).$$

$$\bigl(\operatorname{HSIC}(D),\operatorname{HSIC}(\pi_{1}D),\operatorname{HSIC}(\pi_{2}D),\dots,\operatorname{HSIC}(\pi_{B}D)\bigr)$$

$$k(x,x^{\prime})=(\|x\|+\|x^{\prime}\|-\|x-x^{\prime}\|),$$

$$l(y,y^{\prime})=(|y|+|y^{\prime}|-|y-y^{\prime}|),$$

$$\min_{P\in\mathbb{R}^{|A|\times|B|}}\sum_{i=1}^{|A|}\sum_{j=1}^{|B|}P_{ij}\,m(a_{i},b_{j})$$

$$P_{ij}\geq 0,\qquad\sum_{j=1}^{|B|}P_{ij}=v^{A}_{i},\qquad\sum_{i=1}^{|A|}P_{ij}=v^{B}_{j}.$$

$$[T(D),\pi_{1}(T(D)),\dots,\pi_{B}(T(D))]\overset{d}{=}[T(D),T_{1}(\pi_{1}D),\dots,T_{B}(\pi_{B}D)]$$

$$P_{H_{0}}(p\leq\alpha)\leq\alpha.$$

$$\bigl(H(D),H(\pi_{1}D),\dots,H(\pi_{B}D)\bigr)$$

$$g(t,x)=P(C>t\mid X=x).$$

$$W=\begin{cases}0&\text{if }\Delta=0\\ \frac{1}{g(Z,X)}&\text{if }\Delta=1.\end{cases}$$

$$\hat{S}_{T}(z_{k})=\prod_{i=1}^{k}\Bigl(\frac{n-i}{n-i+1}\Bigr)^{\delta_{i}}.$$

$$w_{k}=\hat{P}(T=z_{k})$$

$$w_{k}=\frac{1}{n}\,\frac{1}{\hat{P}(C>z_{k})}$$

$$\text{wHSIC}(D)\coloneqq\bigg\|\sum_{i=1}^{n}w_{i}K\bigl((x_{i},z_{i}),\cdot\bigr)-\sum_{i=1}^{n}\sum_{j=1}^{n}w_{i}w_{j}K\bigl((x_{i},y_{j}),\cdot\bigr)\bigg\|^{2}_{H}.$$

$$\text{wHSIC}(D)=\operatorname{tr}(H_{w}KH_{w}L),$$

$$\bigl(\text{wHSIC}(D),\text{wHSIC}(\pi_{1}D),\text{wHSIC}(\pi_{2}D),\dots,\text{wHSIC}(\pi_{B}D)\bigr)$$

$$\bigl(\operatorname{HSIC}(\tilde{D}),\operatorname{HSIC}(\pi_{1}\tilde{D}),\operatorname{HSIC}(\pi_{2}\tilde{D}),\dots,\operatorname{HSIC}(\pi_{B}\tilde{D})\bigr)$$

$$P\leftarrow\begin{array}{cccccc}x_{1}&x_{2}&x_{3}&x_{4}&x_{5}&\\ 0.2&0&0&0&0&x_{1}\\ 0&0.2&0&0&0&x_{2}\\ 0&0&0.2&0&0&x_{3}\\ 0&0&0&0.2&0&x_{4}\\ 0&0&0&0&0.2&x_{5}\end{array}$$


Code & Models

Repositories

davidrindt/kernel_logrank_python_code

Full text

**A kernel- and optimal transport-based test of independence between covariates and right-censored lifetimes**

D. Rindt, D. Sejdinovic and D. Steinsaltz.

Department of Statistics, University of Oxford, UK

1 Abstract

We propose a nonparametric test of independence, termed optHSIC, between a covariate and a right-censored lifetime. Because the presence of censoring creates a challenge in applying the standard permutation-based testing approaches, we use optimal transport to transform the censored dataset into an uncensored one, while preserving the relevant dependencies. We then apply a permutation test using the kernel-based dependence measure as a statistic to the transformed dataset. The type 1 error is proven to be correct in the case where censoring is independent of the covariate. Experiments indicate that optHSIC has power against a much wider class of alternatives than Cox proportional hazards regression and that it has the correct type 1 control even in the challenging cases where censoring strongly depends on the covariate.

2 Introduction

We propose a nonparametric test of independence between a possibly multidimensional covariate and a right-censored lifetime. Existing approaches to this problem suffer from several limitations: if we cluster the continuous covariate into groups, and then test for equality of lifetime distributions among the groups, the results will depend on the arbitrary choice of boundaries between the groups, while the spread of covariates within groups reduces power. Alternatively, one might fit a (semi-)parametric regression model, and test whether the regression coefficient corresponding to the covariate differs significantly from zero. The most commonly used such method is the Cox proportional hazards (CPH) model, which makes two assumptions (Cox (1972)): first, the hazard function must factorize into a function of time and a function of the covariate (the proportional hazards or relative risk condition); second, the effect of a covariate on the logarithm of the hazard function must be linear. Although this is a flexible model, in some cases these assumptions are violated. More complicated hazards are found, for example, when studying the relationship between body mass index and mortality (Zajacova and Burgard (2012) report $U$- or $V$-shaped hazards) or between diastolic blood pressure and various health outcomes (Lip et al. (2019)).

Since distance- and kernel-based approaches have been used successfully for independence testing on uncensored data (Székely and Rizzo (2009), Gretton et al. (2008)), it is natural to investigate whether these methods can be extended to the case of right-censored lifetimes. To this end we propose applying optimal transport to transform the censored dataset into an uncensored dataset in such a way that, 1) the new uncensored dataset preserves the dependencies of the original dataset, and 2) we can apply a standard permutation test to the new dataset with test statistic given by Distance Covariance (DCOV) (Székely and Rizzo (2009)) or, equivalently, the Hilbert–Schmidt Independence Criterion (HSIC) (Gretton et al. (2008)).

Progress in kernel-based independence testing for censored data is further motivated by the fact that, in the simpler context of uncensored data, the corresponding methods have been developed further into tests of conditional independence and mutual independence, and have been applied to causal inference (Zhang et al. (2011), Pfister et al. (2018)), in particular to the detection of confounders. While the present work does not propose methods for testing conditional or mutual independence, we believe independence testing is a first step towards those ends. Additionally, since our method allows for multidimensional covariates, one can first test for a dependency based on the full multidimensional covariate, and then test whether the dependency remains when certain sub-dimensions are omitted from the covariate.

Section 3 overviews relevant concepts in survival analysis, distance- and kernel-based independence testing, and optimal transport. Section 4 proposes a transformation of the data based on optimal transport. Section 5 introduces our testing procedure named optHSIC. Although we have not yet been able to prove control of the type 1 error rate in full generality, we do show the type 1 error rate to be correct in the case where censoring is independent of the covariate. Furthermore we obtain very promising results in simulation studies, showing correct type 1 error control even under censoring that depends strongly on the covariate. Section 6 explores alternative kernel-based approaches under the additional assumption that censoring is independent of the covariate. These methods serve as benchmarks for the power performance of optHSIC. Section 7 compares the power and type 1 error of all tests and CPH regression in simulated data.

3 Background Material

3.1 Right-Censored Lifetimes

Let $T\in\mathbb{R}_{\geq 0}$ be a lifetime subject to right-censoring, so that we do not observe $T$ directly, but instead observe $Z\coloneqq\min\{T,C\}$ for some censoring time $C\in\mathbb{R}_{\geq 0}$ , as well as the indicator $\Delta\coloneqq 1\{C>T\}$ . We further observe a covariate vector $X\in\mathbb{R}^{d}$ , where $\mathbb{R}^{d}$ is equipped with the Borel sigma algebra. In total, for an i.i.d. sample of size $n$ , we thus obtain the data $D\coloneqq((x_{i},z_{i},\delta_{i}))_{i=1}^{n}\in(\mathbb{R}^{d}\times\mathbb{R}_{\geq 0}\times\{0,1\})^{n}$ .
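To make the observation scheme concrete, the following is a minimal simulation sketch of how a right-censored sample $D$ arises. The distributions (uniform covariate, exponential lifetime and censoring time) and the function name `simulate_censored_sample` are illustrative assumptions, loosely modelled on the scenarios of Table 2; they are not part of the method itself.

```python
import random


def simulate_censored_sample(n, seed=0):
    """Draw n triples (x_i, z_i, delta_i) with z_i = min(t_i, c_i)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)        # covariate X (assumed Unif[-1, 1])
        t = rng.expovariate(1.0)          # lifetime T, here independent of X
        c = rng.expovariate(1.0 / 1.5)    # censoring time C with mean 1.5
        z = min(t, c)                     # observed time Z = min{T, C}
        delta = 1 if c > t else 0         # Delta = 1{C > T}: 1 means T was observed
        data.append((x, z, delta))
    return data


sample = simulate_censored_sample(5)
for x, z, delta in sample:
    print(f"x={x:+.2f}  z={z:.2f}  delta={delta}")
```

Only `z` and `delta` are available to the analyst; the underlying `t` is hidden whenever `delta` is 0.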

The main goal of this paper is developing a test of $H_{0}:X\perp\!\!\!\perp T$ versus $H_{1}:X\not\perp\!\!\!\perp T$ based on the sample $D$. Throughout this paper we will make the following assumptions.

Assumption 1: We assume that conditional on $\{X_{i}\}_{i=1}^{n}$ , the random variables $\{(T_{i},C_{i})\}_{i=1}^{n}$ are mutually independent.

Assumption 2: We assume that $C\perp\!\!\!\perp T\mid X$.

Denote $F_{T|X}(t|x)=P(T\leq t|X=x)$ and $F_{C|X}(t|x)=P(C\leq t|X=x)$. We assume $F_{T|X}$ has a density $f_{T|X}$. Let $S_{T|X}(t|x)=1-F_{T|X}(t|x)$. We define the hazard rate of an individual with covariate $x$ to be $\lambda_{T|X}(t|x)=f_{T|X}(t|x)/S_{T|X}(t|x)$. The Cox proportional hazards (CPH) model assumes that the hazard rate can be written as $\lambda_{T|X}(t|x)=\lambda(t)\exp(\beta^{T}x)$ for some baseline hazard $\lambda(t)$ and a vector $\beta\in\mathbb{R}^{d}$. This model enables estimation of $\beta$ and testing whether the entries of $\beta$ differ significantly from zero. The CPH model is the most commonly used regression method in survival analysis.

A last important concept is that of the ‘individuals at risk at a time $t$ ’. By this we mean the set $\{i:Z_{i}\geq t\}$ . We also use the notation $[a_{1},\dots,a_{k}]$ for a multiset (a set with potentially repeated elements) with elements $a_{1},\dots,a_{k}$ . We refer, for example, to the multiset $\operatorname{AR}_{t}=[x_{i}:i\in\{1,\dots,n\},Z_{i}\geq t]$ , the multiset of ‘covariates at risk at time $t$ ’.
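As a small illustration of the at-risk notation, the sketch below computes the set of individuals at risk and the multiset $\operatorname{AR}_{t}$ for the example dataset of Table 5. The helper names `at_risk_indices` and `covariates_at_risk` are hypothetical, and a Python list stands in for a multiset.

```python
# Example dataset from Table 5, as triples (x_i, z_i, delta_i).
TABLE_5 = [(1.3, 13, 1), (0.5, 22, 0), (0.3, 24, 1), (-1.1, 45, 1), (-0.9, 81, 0)]


def at_risk_indices(data, t):
    """Indices i (1-based, as in the text) with z_i >= t."""
    return [i for i, (_x, z, _delta) in enumerate(data, start=1) if z >= t]


def covariates_at_risk(data, t):
    """The multiset AR_t of covariates x_i with z_i >= t (repeats are kept)."""
    return [x for (x, z, _delta) in data if z >= t]


print(at_risk_indices(TABLE_5, 24))     # [3, 4, 5]
print(covariates_at_risk(TABLE_5, 24))  # [0.3, -1.1, -0.9]
```

Note that censored individuals (here $i=5$) remain at risk until their censoring time.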

3.2 Independence testing using kernels

Kernel methods have been successfully used for nonparametric independence- and two-sample testing (Gretton et al. (2008), Gretton et al. (2012)). We now give some of the relevant background in kernel methods.

Definition 3.1.

(Reproducing Kernel Hilbert Space) (B. Schölkopf (2001)) Let $\mathcal{X}$ be a non-empty set and $H$ a Hilbert space of functions $f:\mathcal{X}\to\mathbb{R}$ endowed with dot product $\langle\cdot,\cdot\rangle$. Then $H$ is called a reproducing kernel Hilbert space (RKHS) if there exists a function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ with the following properties.

1. $k$ satisfies the reproducing property

$$\langle f,k(x,\cdot)\rangle=f(x)\quad\text{for all } f\in H,\ x\in\mathcal{X}.$$

2. $k$ spans $H$, that is, $H=\overline{\mathrm{span}\{k(x,\cdot)\mid x\in\mathcal{X}\}}$, where the bar denotes the completion of the space.

Let $\mathcal{X}$ together with a sigma-algebra be a measurable space and let $H_{\mathcal{X}}$ be an RKHS on $\mathcal{X}$ with reproducing kernel $k$ . Let $P$ be a probability measure on $\mathcal{X}$ . If $E_{P}\sqrt{k(X,X)}<\infty$ , then there exists an element $\mu_{P}\in H_{\mathcal{X}}$ such that $E_{P}f(X)=\langle f,\mu_{P}\rangle$ for all $f\in H_{\mathcal{X}}$ (Gretton et al. (2012)), where the notation $E_{P}f(X)$ is defined to be $\int f(x)P(dx)$ . The element $\mu_{P}$ is called the mean embedding of $P$ in $H_{\mathcal{X}}$ . Given a second distribution $Q$ on $\mathcal{X}$ , for which a mean embedding exists, we can measure the dissimilarity of $P$ and $Q$ by the distance between their mean embeddings in $H_{\mathcal{X}}$ :

$$\operatorname{MMD}(P,Q)\coloneqq\|\mu_{P}-\mu_{Q}\|_{H_{\mathcal{X}}},$$

which is also called the Maximum Mean Discrepancy ( $\operatorname{MMD}$ ). The name comes from the following equality (Gretton et al. (2012)),

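$$\|\mu_{P}-\mu_{Q}\|_{H_{\mathcal{X}}}=\sup_{f\in H_{\mathcal{X}}:\,\|f\|\leq 1}\bigl(E_{P}f(X)-E_{Q}f(X)\bigr),$$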

showing that MMD is an integral probability metric. Given a sample $\{x_{i}\}_{i=1}^{n}$ and the empirical distribution, $\sum_{i=1}^{n}\delta({x_{i}})/n$ , where $\delta(x)$ denotes the Dirac measure at $x$ , the corresponding mean embedding is given by $\sum_{i=1}^{n}k(x_{i},\cdot)/n.$
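For two finite samples, the distance between the empirical mean embeddings can be computed from kernel evaluations alone, since $\|\mu_{P}-\mu_{Q}\|^{2}=\langle\mu_{P},\mu_{P}\rangle+\langle\mu_{Q},\mu_{Q}\rangle-2\langle\mu_{P},\mu_{Q}\rangle$ and each inner product is an average of $k(x_{i},y_{j})$ by the reproducing property. The sketch below assumes real-valued samples and, as an illustrative choice, the kernel $k(a,b)=|a|+|b|-|a-b|$ used later in Section 3.2.1; the function names are hypothetical.

```python
import math


def brownian_kernel(a, b):
    """k(a, b) = |a| + |b| - |a - b| for real numbers."""
    return abs(a) + abs(b) - abs(a - b)


def mean_kernel(xs, ys, kernel):
    """Inner product of the two empirical mean embeddings."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def empirical_mmd(xs, ys, kernel=brownian_kernel):
    """Distance between the mean embeddings of the two empirical distributions."""
    squared = (
        mean_kernel(xs, xs, kernel)
        + mean_kernel(ys, ys, kernel)
        - 2.0 * mean_kernel(xs, ys, kernel)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))


print(empirical_mmd([0.1, 0.4, 0.9], [0.1, 0.4, 0.9]))  # 0.0 for identical samples
print(empirical_mmd([0.1, 0.4, 0.9], [1.5, 2.0, 2.5]))
```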

Suppose now that $\mathcal{Y}$ together with some sigma algebra is a second measurable space, and let $H_{\mathcal{Y}}$ be an RKHS on $\mathcal{Y}$ with kernel $l$ . Let $X$ be a random variable in $\mathcal{X}$ with law $P_{X}$ and similarly let $Y$ be a random variable in $\mathcal{Y}$ with law $P_{Y}$ . Finally let $P_{XY}$ denote the joint distribution on $\mathcal{X}\times\mathcal{Y}$ equipped with the product sigma-algebra. We let $H$ denote the RKHS on $\mathcal{X}\times\mathcal{Y}$ with kernel

$$K((x,y),(x^{\prime},y^{\prime}))\coloneqq k(x,x^{\prime})\,l(y,y^{\prime}).$$

In Gretton et al. (2008) it was proposed that the dependence of $X$ and $Y$ could be quantified by the following measure:

Definition 3.2.

The Hilbert–Schmidt independence criterion (HSIC) of $X$ and $Y$ is defined by

$$\operatorname{HSIC}(X,Y)\coloneqq\|\mu_{P_{XY}}-\mu_{P_{X}P_{Y}}\|_{H}^{2}$$

where $P_{X}P_{Y}$ denotes the product measure of $P_{X}$ and $P_{Y}$ .

Let $(X,Y)$ , $(X^{\prime},Y^{\prime})$ and $(X^{\prime\prime},Y^{\prime\prime})$ be three mutually independent copies of the same random variable with law $P_{XY}$ . Using the reproducing property and the definition of mean embeddings, it can be shown that

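$$\operatorname{HSIC}(X,Y)=E_{XY}E_{X^{\prime}Y^{\prime}}k(X,X^{\prime})l(Y,Y^{\prime})+E_{XX^{\prime}}k(X,X^{\prime})E_{YY^{\prime}}l(Y,Y^{\prime})-2E_{XY}E_{X^{\prime}Y^{\prime\prime}}k(X,X^{\prime})l(Y,Y^{\prime\prime}).$$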

Now assume we are given a sample $D=((x_{i},y_{i}))_{i=1}^{n}$ of independent observations of the random pair $(X,Y)$ . An empirical estimate of HSIC $(X,Y)$ can be obtained by measuring the distance between the embedding of the empirical distribution of the data and the embedding of the product of the marginal empirical distributions. That is, we define $\operatorname{HSIC}(D)$ by

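$$\operatorname{HSIC}(D)\coloneqq\bigg\|\frac{1}{n}\sum_{i=1}^{n}K((x_{i},y_{i}),\cdot)-\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}K((x_{i},y_{j}),\cdot)\bigg\|_{H}^{2}.$$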

Using the reproducing property of the kernel and the definition of $K$ in terms of $k$ and $l$ , $\operatorname{HSIC}(D)$ can be shown to equal

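$$\operatorname{HSIC}(D)=\frac{1}{n^{2}}\sum_{i,j=1}^{n}k(x_{i},x_{j})l(y_{i},y_{j})+\frac{1}{n^{4}}\sum_{i,j,q,r=1}^{n}k(x_{i},x_{j})l(y_{q},y_{r})-\frac{2}{n^{3}}\sum_{i,j,r=1}^{n}k(x_{i},x_{j})l(y_{i},y_{r}).$$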

While the biased $\operatorname{HSIC}(D)$ defined above is the most commonly used estimator in the literature (Székely and Rizzo (2009), Gretton et al. (2008)), unbiased estimators of $\operatorname{HSIC}(X,Y)$ exist too (Song et al. (2012)). The bias of $\operatorname{HSIC}(D)$ is $O(n^{-1})$ (Gretton et al. (2008)), and for appropriate choices of kernels the permutation test with the biased statistic and the permutation test with the unbiased statistic are both consistent (see Section 3.2.1 for details). Both tests also have the correct type 1 error rate. Because the biased $\operatorname{HSIC}(D)$ is more commonly encountered in the literature and has a slightly easier analytic form, we use $\operatorname{HSIC}(D)$ throughout our paper. The following algorithm shows how $\operatorname{HSIC}$ is commonly combined with a permutation test for independence testing.
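A permutation test of this kind can be sketched in a few lines of Python for uncensored, real-valued data. This is a hedged illustration rather than the authors' implementation: the statistic is the biased $\operatorname{HSIC}(D)$ computed from Gram matrices, the kernel is the one from Section 3.2.1, and the $p$-value convention (the rank of the observed statistic among the $B+1$ values, divided by $B+1$) is an assumption chosen to be consistent with $p$ taking values in $\{\frac{1}{B+1},\dots,\frac{B+1}{B+1}\}$. All function names are hypothetical.

```python
import random


def brownian_kernel(a, b):
    """k(a, b) = |a| + |b| - |a - b| for real numbers."""
    return abs(a) + abs(b) - abs(a - b)


def gram_matrix(values, kernel=brownian_kernel):
    return [[kernel(u, v) for v in values] for u in values]


def hsic_biased(K, L, perm=None):
    """Biased HSIC(D) from Gram matrices K (covariates) and L (responses).

    If perm is given, responses are permuted: pair i becomes (x_i, y_perm[i]).
    """
    n = len(K)
    p = list(range(n)) if perm is None else perm
    # (1/n^2) sum_{i,j} k(x_i, x_j) l(y_i, y_j)
    pair_sum = sum(K[i][j] * L[p[i]][p[j]] for i in range(n) for j in range(n))
    # (1/n^4) sum_{i,j,q,r} k(x_i, x_j) l(y_q, y_r); unaffected by the permutation
    total_k = sum(map(sum, K))
    total_l = sum(map(sum, L))
    # (2/n^3) sum_{i,j,r} k(x_i, x_j) l(y_i, y_r), via row sums
    cross = sum(sum(K[i]) * sum(L[p[i]]) for i in range(n))
    return pair_sum / n**2 + total_k * total_l / n**4 - 2.0 * cross / n**3


def hsic_permutation_test(xs, ys, num_permutations=199, seed=0):
    """Return (observed HSIC, permutation p-value)."""
    rng = random.Random(seed)
    K, L = gram_matrix(xs), gram_matrix(ys)
    observed = hsic_biased(K, L)
    at_least_as_large = 1  # the unpermuted dataset counts once
    indices = list(range(len(xs)))
    for _ in range(num_permutations):
        rng.shuffle(indices)
        if hsic_biased(K, L, indices) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (num_permutations + 1)


rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [x * x + 0.1 * rng.gauss(0, 1) for x in xs]  # nonlinear dependence on x
print(hsic_permutation_test(xs, ys))
```

Computing the Gram matrices once and permuting indices avoids recomputing kernel values for every permutation.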

3.2.1 Choice of kernel

Throughout this paper we assume the covariates take values in the Euclidean space $\mathcal{X}=\mathbb{R}^{d}$ for some $d\in\mathbb{N}$ . Note that in our case $\mathcal{Y}=\mathbb{R}_{\geq 0}$ . We let both $k$ and $l$ be instances of the covariance kernel of Brownian motion. That is,

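$$k(x,x^{\prime})=(\|x\|+\|x^{\prime}\|-\|x-x^{\prime}\|),$$

and

$$l(y,y^{\prime})=(|y|+|y^{\prime}|-|y-y^{\prime}|),$$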

where $\|\cdot\|$ denotes the Euclidean norm. See Sejdinovic et al. (2012) for a discussion of this kernel.
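The two kernels can be evaluated directly from their definitions. The sketch below (with hypothetical function names) also illustrates the link to Brownian motion: for $y,y^{\prime}\geq 0$ the expression $|y|+|y^{\prime}|-|y-y^{\prime}|$ equals $2\min(y,y^{\prime})$, which is the covariance $\min(s,t)$ of standard Brownian motion up to a constant factor.

```python
import math


def euclidean_norm(v):
    return math.sqrt(sum(c * c for c in v))


def covariate_kernel(x, x_prime):
    """k(x, x') = ||x|| + ||x'|| - ||x - x'|| for vectors in R^d."""
    diff = [a - b for a, b in zip(x, x_prime)]
    return euclidean_norm(x) + euclidean_norm(x_prime) - euclidean_norm(diff)


def lifetime_kernel(y, y_prime):
    """l(y, y') = |y| + |y'| - |y - y'| for real numbers."""
    return abs(y) + abs(y_prime) - abs(y - y_prime)


print(lifetime_kernel(3.0, 5.0))                 # 6.0, i.e. 2 * min(3, 5)
print(covariate_kernel((0.0, 0.0), (3.0, 4.0)))  # 0.0: the kernel vanishes at the origin
```

No bandwidth parameter appears anywhere, in contrast to a Gaussian kernel.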

The reason for this choice is three-fold. Firstly, under this choice of kernels, $\operatorname{HSIC}$ coincides with Distance Covariance (DCOV) (Székely and Rizzo (2009)). The equivalence between $\operatorname{HSIC}$ and DCOV was proved in Sejdinovic et al. (2012). DCOV is a well studied measure of dependence and if $X,Y$ are random variables with compact support, then $\text{DCOV}(X,Y)=0$ if and only if $X\perp\!\!\!\perp Y$. Furthermore, the permutation test with test statistic DCOV is consistent in the sense that, for each distribution $P_{XY}$ with compact support and such that $X\not\perp\!\!\!\perp Y$, the probability that the permutation test rejects the null hypothesis converges to 1 as the sample size converges to infinity (Rindt et al. (2020)). Secondly, unlike other potential kernels, such as the Gaussian kernel, the covariance kernel of Brownian motion spares us the need to select a bandwidth. Thirdly, our test relies on optimal transport to transform the original data into a transformed dataset. The optimal transport procedure, discussed in the next section, requires a metric with respect to which similarity is preserved. It appears natural to measure the dependency in the transformed dataset using the same metric as the one underlying the transformation.

3.3 Optimal transport

Let $A=[a_{1},\dots,a_{|A|}]$ and $B=[b_{1},\dots,b_{|B|}]$ be multisets of length $|A|$ and $|B|$ with $a_{i},b_{j}\in\mathbb{R}^{d}$ . Let $v^{A}$ be a distribution vector on $A$ , by which we mean: $v^{A}=(v^{A}_{1},\dots,v^{A}_{|A|})\in\mathbb{R}^{|A|}$ so that $v^{A}_{i}\geq 0$ and $\sum_{i=1}^{|A|}v^{A}_{i}=1$ . Let $v^{B}$ be a distribution vector on $B$ . Let $m(a,b)$ be a metric on $\mathbb{R}^{d}$ . Then the earth mover’s distance (EMD) between $v^{A}$ and $v^{B}$ is defined as

[TABLE]

subject to

[TABLE]

Let $U$ be a random variable taking values in $A$ with distribution $v^{A}$ , and $U^{\prime}$ a random variable taking values in $B$ with distribution $v^{B}$ . In the next section we will write ‘let $P$ be the joint distribution that solves the optimal transport problem between $U$ and $U^{\prime}$ ’. By that we mean $P(U=a_{i},U^{\prime}=b_{j})\coloneqq P_{ij}$ for $i=1,\dots,|A|,j=1,\dots,|B|$ , where $P_{ij}$ is the matrix that minimizes Equation (2). Note that $P$ is a valid joint distribution of $U$ and $U^{\prime}$ . Furthermore, for any $i$ with $v^{A}_{i}>0$ it holds that $P(U^{\prime}=b_{j}|U=a_{i})=P_{ij}/v^{A}_{i}$ .
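As an illustration, the coupling $P$ can be computed with a generic linear-programming solver. The sketch below assumes the usual linear-programming form of the EMD (minimise $\sum_{i,j}P_{ij}m(a_{i},b_{j})$ over nonnegative matrices whose row sums are $v^{A}$ and whose column sums are $v^{B}$), takes $m$ to be the Euclidean metric, and uses `scipy.optimize.linprog`; the function name `emd_coupling` is ours and this is not the implementation used for the experiments.

```python
import numpy as np
from scipy.optimize import linprog

def emd_coupling(A, B, vA, vB):
    """Return (P, emd) for point sets A, B (one point per row) and weights vA, vB."""
    A = np.asarray(A, dtype=float).reshape(len(vA), -1)
    B = np.asarray(B, dtype=float).reshape(len(vB), -1)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    nA, nB = cost.shape
    # P is flattened row-major, so entry (i, j) sits at position i * nB + j.
    rows = np.kron(np.eye(nA), np.ones((1, nB)))   # sum_j P_ij = vA_i
    cols = np.kron(np.ones((1, nA)), np.eye(nB))   # sum_i P_ij = vB_j
    # The last marginal constraint is implied by the others, so it is dropped.
    A_eq = np.vstack([rows, cols])[:-1]
    b_eq = np.concatenate([vA, vB])[:-1]
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    P = res.x.reshape(nA, nB)
    return P, float(np.sum(P * cost))

vA = np.array([0.5, 0.5])
vB = np.array([1 / 3, 1 / 3, 1 / 3])
P, emd = emd_coupling([[0.0], [1.0]], [[0.0], [1.0], [2.0]], vA, vB)
cond = P[0] / vA[0]   # conditional distribution of U' given U = a_1
print(emd, cond)
```

The minimising matrix need not be unique (it is not in this toy example), so the conditional distribution printed above may depend on the solver, whereas the minimal cost does not.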

4 Data transformation based on optimal transport

4.1 Objective of the algorithm

To use HSIC for an independence test of $X$ and $T$ , one requires the explicit values $l(T_{i},T_{j})$ for $i,j=1,\dots,n$ . Since for right-censored data $T_{i}$ may be known only in terms of a lower bound, there is no straightforward way to apply these methods to right-censored data. In this section we propose a transformation of the original, censored dataset, into a synthetic dataset consisting of $n$ observed events. The algorithm uses optimal transport and its goal is twofold: first, it should return a dataset to which we can apply a permutation test with test-statistic HSIC, and obtain correct $p$ -values under the null hypothesis; second, it should return a dataset in which the dependence between $X$ and $T$ is similar to the dependence in the original dataset. Indeed, the transformation of the data is of independent interest: we use the standard permutation test with test statistic HSIC/DCOV, but other statistics could be considered. We do believe that the main value of the transformation lies in how it can be straightforwardly combined with permutation testing on the transformed data and we expect it may be difficult to use the transformation for other purposes.

4.2 Definition of the transformation

To define the algorithm, we use the following notation. Let $A=[a_{1},\dots,a_{|A|}]$ be a multiset where $a_{i}\in\mathbb{R}^{d}$ for $i=1,\dots,|A|$ . We define $\text{Uniform}(A)=\sum_{i=1}^{|A|}\delta(a_{i})/|A|$ , to be the uniform distribution over the elements, where $\delta(a)$ is the Dirac measure at $a$ . By $B=\text{Remove}(a,A)$ we set $B$ to be the multiset that remains when one instance of $a$ is removed from $A$ , and by $B=\text{Add}(a,A)$ we set $B$ to be the multiset that consists of the elements of $A$ , with an added element $a$ . We further assume that we are given a sample $D=((x_{i},z_{i},\delta_{i}))_{i=1}^{n}$ so that $z_{1}<z_{2}<\dots<z_{n}$ : that is, we have $n$ distinct event times and they are labeled in increasing order. For a variable $a$ we use $a\leftarrow b$ to change the value of $a$ to $b$ . In computing the optimal transport coupling as defined in Section 3.3, we use the Euclidean metric $\|x-y\|=\sqrt{\sum_{i=1}^{d}(x_{i}-y_{i})^{2}}$ for $x\in\mathbb{R}^{d}$ . The proposed transformation is defined in Algorithm 2.

4.3 Comments on the transformation algorithm

We now give a verbal explanation of Algorithm 2.

Initialization: The input is simply $D=((x_{i},z_{i},\delta_{i}))_{i=1}^{n}$ where $z_{1}<\dots<z_{n}$ . We initialize an empty transformed dataset $\tilde{D}$ , to which we will add $n$ observations of the form $(x_{i},t_{j})$ for some $i,j\in\{1,\dots,n\}$ in the following way. We will loop over the times $z_{i}$ from $i=1,\dots,n$ . At each time, the multiset $\operatorname{AR}$ lists the covariates at risk in the dataset $D$ and $\operatorname{AR}$ is initialized as the multiset containing all covariates. The multiset $\operatorname{L}$ will list all covariates that have not been added to $\tilde{D}$ yet. Indeed, as $\tilde{D}$ is initialized empty, $\operatorname{L}$ initially contains all $n$ covariates. The variable $d$ will count the number of observed $\delta=1$ events and is initialized 0.

First loop from $i=1,\dots,n-1$ : At time $z_{i}$ we distinguish two cases: if $\delta_{i}=0$ we leave $\operatorname{L}$ and $\tilde{D}$ unchanged, and simply remove one instance of $x_{i}$ from $\operatorname{AR}$ . If $\delta_{i}=1$ we add $1$ to $d$ (to count the number of observed events). We also select a covariate $\tilde{x}$ from $\operatorname{L}$ as follows: First a joint distribution of $\text{Uniform}(\operatorname{AR})$ and $\text{Uniform}(\operatorname{L})$ is computed using optimal transport. This is a matrix $P$ of size $|\operatorname{AR}|\times|\operatorname{L}|$ . This distribution is then conditioned on the event that $\text{Uniform}(\operatorname{AR})=x_{i}$ , yielding a distribution over $\operatorname{L}$ . We sample from this distribution to obtain the covariate $\tilde{x}$ . The pair $(\tilde{x},z_{i})$ is then added to $\tilde{D}$ , and the covariate $\tilde{x}$ is removed from $\operatorname{L}$ . There are now $d$ observations in $\tilde{D}$ and there are $n-d$ covariates left in $\operatorname{L}$ . We also remove $x_{i}$ from $\operatorname{AR}$ .

When we finish the first loop: It is now the case that $\operatorname{AR}=[x_{n}]$ (note the previous loop went up to $i=n-1$ ) and the multiset $\operatorname{L}$ contains $n-d$ covariates. In the second loop, that runs from $d+1$ to $n$ , each remaining covariate in $\operatorname{L}$ is combined with the time $z_{n}$ and added to $\tilde{D}$ . After this loop, $\operatorname{L}$ is empty, and $\tilde{D}$ contains all $n$ covariates $x_{1},\dots,x_{n}$ , each associated with a time. The transformed dataset $\tilde{D}$ is returned.

Consider the special case in which there is no censoring and $\delta_{i}=1$ for $i=1,\dots,n$ . Then it is easy to verify that in the first loop $\operatorname{AR}=\operatorname{L}$ at each step $i$ , so that the optimal transport algorithm returns a scaled identity matrix. Consequently, at step $i$ the covariate $\tilde{x}$ is equal to $x_{i}$ with probability 1, and the final output is $\tilde{D}=D$ . A nontrivial example is worked out in Section A.1.
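The verbal description above can be turned into a short program. The following is a minimal sketch written from that description rather than from Algorithm 2 itself; the function names are ours, covariates are stored as rows of an array sorted by increasing $z_{i}$, and the coupling is obtained from a generic linear-programming solver (`scipy.optimize.linprog`), which is only practical for small $n$.

```python
import numpy as np
from scipy.optimize import linprog

def ot_coupling(A, B):
    """OT coupling of Uniform(A) and Uniform(B), Euclidean cost, points as rows."""
    nA, nB = len(A), len(B)
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    rows = np.kron(np.eye(nA), np.ones((1, nB)))   # row sums equal 1/nA
    cols = np.kron(np.ones((1, nA)), np.eye(nB))   # column sums equal 1/nB
    A_eq = np.vstack([rows, cols])[:-1]            # last constraint is redundant
    b_eq = np.concatenate([np.full(nA, 1 / nA), np.full(nB, 1 / nB)])[:-1]
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(nA, nB)

def transform(x, z, delta, rng):
    """x: (n, d) covariates, z: increasing times, delta: event indicators."""
    n = len(z)
    at_risk = list(range(n))   # indices of covariates in AR
    left = list(range(n))      # indices of covariates in L
    new_x, new_t = [], []
    for i in range(n - 1):     # first loop
        if delta[i] == 1:
            P = ot_coupling(x[at_risk], x[left])
            row = np.clip(P[at_risk.index(i)], 0.0, None)
            j = rng.choice(len(left), p=row / row.sum())  # condition on x_i
            new_x.append(x[left[j]])
            new_t.append(z[i])
            left.pop(j)        # remove the sampled covariate from L
        at_risk.remove(i)      # x_i leaves the risk set in both cases
    for j in left:             # second loop: remaining covariates get time z_n
        new_x.append(x[j])
        new_t.append(z[-1])
    return np.array(new_x), np.array(new_t)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 1))
z = np.arange(1.0, 7.0)
tx, tt = transform(x, z, np.ones(6, dtype=int), rng)            # no censoring
tx_c, tt_c = transform(x, z, np.array([1, 0, 1, 0, 1, 1]), rng)  # with censoring
```

In the uncensored call the output reproduces the input, as in the special case discussed above; in the censored call the three covariates not selected at an event time are all attached to $z_{n}$.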

4.4 Intuition behind the transformation

Before we prove properties of the proposed transformation, we briefly comment on the intuition behind the transformation. To this end, first consider a permutation test in the absence of censoring, when we simply observe $D=((x_{i},t_{i}))_{i=1}^{n}$ . The permuted datasets can be generated as follows: loop through the events in order of time, and to each time, associate a covariate that you have not associated to any earlier time. Comparing the original dataset with the datasets obtained in this manner is justified for the following reason: Under the null hypothesis a sample can be generated by firstly sampling $t_{i}$ for $i=1,\dots n$ i.i.d. and independently sampling $x_{i}$ for $i=1,\dots n$ i.i.d., and secondly looping through the times in order, and associating to each time a covariate that has not yet been chosen at a previous time. The original dataset and the permuted datasets are thus equal in distribution: intuitively, the permutation test checks whether the dataset looks as if, at each time, a covariate is picked uniformly from those not chosen before.

It is not obvious how to translate this to censored data. Due to censoring, it may not be true that the $i$-th event covariate is chosen uniformly from the covariates that have not had an observed event time before. For this reason survival analysis often compares the $i$-th event covariate with the covariates at risk (not failed and not censored) just before the $i$-th event. If the null hypothesis holds, then intuitively the $i$-th event covariate is chosen uniformly from the covariates at risk at time $z_{i}$. That is, if $\operatorname{AR}$ denotes the multiset of covariates at risk and $x_{i}$ is the event covariate, we would like to test whether $x_{i}$ was chosen uniformly from $\operatorname{AR}$.

To do so using a permutation test, our algorithm couples $\text{Uniform}(\operatorname{AR})$ to a uniform choice from those that have not yet been chosen in the synthetic dataset: $\text{Uniform}(\operatorname{L})$ (see Algorithm 2). In this manner, when the null hypothesis is true, (intuitively) it holds that in the synthetic dataset the covariates are chosen uniformly from those not chosen before. But this last statement is exactly our intuition behind a permutation test: namely, we use a permutation test to see if the dataset looks as if, at each time, a covariate is picked uniformly from those not chosen before.

The intuition discussed thus far concerns the workings of our transformation under the null hypothesis, and has at its core that $\text{Uniform}(\operatorname{AR})$ is coupled to $\text{Uniform}(\operatorname{L})$ in Algorithm 2. However, there are many couplings between these two distributions. The reason we choose the coupling that solves the optimal transport problem is the following: assume the alternative hypothesis holds and that, given $\operatorname{AR}_{i}$, certain covariates have a higher hazard rate than others. Then we would like this bias towards these covariates to be visible in $\tilde{D}$. In other words, we would like $\tilde{x}$ to be close to $x_{i}$ in Algorithm 2. That is precisely what the optimal transport coupling achieves.

Finally, including the remaining covariates in the transformed dataset at the last time ensures the permuted datasets correctly reflect all alternative choices that can be made when covariates are chosen at random from those not chosen before. The association of these remaining covariates to the last event time reflects they have not been selected at each of the earlier times, which may indicate they have had less risk of having an event up to that point.

The goal of the transformation is thus twofold: firstly, ensuring that under the null hypothesis a permutation test is appropriate on the transformed dataset, which motivates the coupling of $\text{Uniform}(\operatorname{AR})$ and $\text{Uniform}(\operatorname{L})$ ; secondly, given the first goal, ensure covariates in $\tilde{D}$ associated to times $z_{i}$ for $\delta_{i}=1$ are as close as possible to the covariates associated to $z_{i}$ in $D$ , which motivates the chosen coupling to be the coupling that solves the optimal transport problem defined in Section 3.3.

An interesting alternative approach would be to compare $x_{i}$ with $\text{Uniform}(\operatorname{AR}_{i})$ directly, without first transforming the data. We do not see, however, how to combine that approach with a permutation test, or with another procedure to test for significance.

5 Applying HSIC to the transformed dataset: optHSIC

We have thus far described how to transform the dataset. For the hypothesis test of $H_{0}:X\perp\!\!\!\perp T$, we propose to apply a permutation test with test statistic DCOV/HSIC to the transformed dataset. This approach is summarized in Algorithm 3.

5.1 Computational cost of optHSIC

The algorithm optHSIC consists of two parts: first, the dataset is transformed; second, a permutation test with test statistic HSIC/DCOV is performed. The computational cost of the transformation may be high: solving the earth mover’s distance exactly has $O(n^{3}\log n)$ complexity (Shirdhonkar and Jacobs (2008)). Since the earth mover’s distance is computed for each $i=1,\dots,n$ such that $\delta_{i}=1$, that is up to $n$ times, the exact transformation can cost as much as $O(n^{4}\log n)$. Luckily, fast approximations of the earth mover’s distance exist that are linear in time (e.g. Shirdhonkar and Jacobs (2008)), making the transformation $O(n^{2})$. Also, in the case where the covariates are 1-dimensional, the optimal transport problem has a simple solution (Cohen (1999)). Different algorithms and/or approximations also exist for the EMD with other metrics (Shirdhonkar and Jacobs (2008)). The second step, the computation of HSIC, is also an $O(n^{2})$ operation for which large-scale approximations can be made (Zhang et al. (2018)). Furthermore, both the computation of HSIC and the permutation test computations can be easily parallelized. In our simulations we used approximations neither of the earth mover’s distance nor of HSIC, as for samples up to about 500 the optHSIC test can be performed in about a second on an ordinary PC.
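To illustrate the 1-dimensional case, the sketch below builds the monotone (sorted) coupling between two uniform distributions on real-valued points, which is optimal for the cost $|x-y|$. It is offered as one simple solution of this kind and is not claimed to be the exact procedure of Cohen (1999); the function name is ours.

```python
import numpy as np

def ot_coupling_1d(a, b):
    """Monotone coupling of Uniform(a) and Uniform(b) for real-valued points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    ia, ib = np.argsort(a), np.argsort(b)
    P = np.zeros((na, nb))
    # Masses are tracked in integer units of 1 / (na * nb) to avoid rounding.
    rem_a, rem_b = nb, na      # mass left on the current a-point and b-point
    i = j = 0
    while i < na and j < nb:
        m = min(rem_a, rem_b)
        P[ia[i], ib[j]] = m / (na * nb)   # match smallest unmatched points
        rem_a -= m
        rem_b -= m
        if rem_a == 0:
            i, rem_a = i + 1, nb
        if rem_b == 0:
            j, rem_b = j + 1, na
    return P

a = np.array([3.0, 1.0, 2.0])
b = np.array([1.5, 2.5, 0.5])
P_equal = ot_coupling_1d(a, b)
cost_equal = np.sum(P_equal * np.abs(a[:, None] - b[None, :]))
P_unequal = ot_coupling_1d([0.0, 1.0], [0.5])
```

After sorting, the coupling is filled in a single pass over the two point sets, so the cost per event time is dominated by the $O(n\log n)$ sort.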

5.2 Theoretical results on optHSIC

We say a test with $p$-value $p$ has correct type 1 error rate if under the null hypothesis $P_{H_{0}}(p\leq\alpha)\leq\alpha$. The main theoretical result we obtained for optHSIC is that the type 1 error rate is correct when $C\perp\!\!\!\perp X$. Unfortunately, we have not been able to prove other important results such as (asymptotically) correct type 1 error rate when $C\not\perp\!\!\!\perp X$, or consistency (power converging to 1 for each alternative hypothesis). However, as Section 7 details, extensive simulations indicate that optHSIC achieves correct type 1 error rate also when $C\not\perp\!\!\!\perp X$. Simulations further show that optHSIC is able to detect a wide range of dependencies between $X$ and $T$ without losing much power relative to the CPH likelihood ratio test, even when the CPH model assumptions hold. Obtaining further theoretical results is an important future challenge.
In the uncensored case a permutation test with covariance kernel of Brownian motion is consistent for distributions with compact support and has correct type 1 error rate (Rindt et al. (2020)). The difficulty of extending these proofs to optHSIC is that optHSIC requires sequential analysis of the optimal transport distributions, breaking independence between observations in the transformed dataset.

We first prove an auxiliary result: namely, although we propose to permute the transformed dataset, this is equivalent to permuting the original dataset, and then transforming the permuted datasets.

Lemma 5.1.

Let $\pi_{1},\dots,\pi_{B}$ be independent uniform random permutations, and let $T,T_{1},\dots,T_{B}$ be independent optHSIC transformations.

[TABLE]

Proof.

See Appendix. ∎

Note that this implies that in the definition of optHSIC we could equally have permuted $D$ first, and then applied the transformation to each of the permuted datasets. However, this is computationally more expensive. Lemma 5.1 enables us to show that optHSIC has correct type 1 error rate when $C\perp\!\!\!\perp X$.

Theorem 5.1.

Let $D=((X_{i},Z_{i},\Delta_{i}))_{i=1}^{n}$ be an i.i.d. sample. Assume that $C\perp\!\!\!\perp X$. Let $p$ be the $p$-value resulting from applying optHSIC (Algorithm 3) to $D$ with $B$ permutations and level $\alpha\in[0,1]$. If the null hypothesis holds, i.e. $T\perp\!\!\!\perp X$, then $p\sim\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$ and in particular it holds that

[TABLE]

Proof.

See Appendix. ∎

Table 1 summarizes our findings of the type 1 error of optHSIC.

5.3 Using multiple transformations

Algorithm 2 defines the transformation used in optHSIC. In line 7, a covariate is sampled from the multiset $\operatorname{L}$ with distribution $v$ . The sampled element may differ for different iterations of the algorithm. Consequently, given a fixed dataset $D$ , the transformed dataset will look different for different iterations of the algorithm. The optHSIC algorithm proposes to use only one transformation, resulting in a single $p$ -value.

Say that instead of applying the optHSIC algorithm once, we repeat the optHSIC algorithm $m$ times, resulting in $m$ $p$ -values $p_{1},\dots,p_{m}$ . An example of the distribution of $p$ -values for a fixed dataset is given in Figure 5.

Figure 5 shows that in the dataset used, where $39\%$ of observations are censored and $n=300$, the variance of the obtained $p$-values is not excessively high, and in particular all $p$-values would have led to the decision to reject the null hypothesis.

We also studied several techniques one may use to combine $p$ -values, typically at the cost of being more conservative when there is little variance among $p$ -values, like in the example above. This is discussed in Section A.9.

6 Alternative approaches when censoring is independent of the covariate

We are not aware of fully nonparametric methods to test independence between right-censored times and continuous covariates. When $C$ is independent of $X$ the challenge is mitigated, since in that case any statistic can be combined with a permutation test that permutes the covariates (see Theorem 6.2). Using that approach, this section proposes some alternative tests for the case $C\perp\!\!\!\perp X$ that will serve, in addition to the Cox proportional hazards likelihood ratio test, as benchmarks for studying the power of optHSIC. In Section 7 we compare the performance of these methods with optHSIC.

Before we propose alternative test statistics, we state formally why standard permutation tests can be used when $C\perp\!\!\!\perp X$. We first state a theorem on the correctness of the type 1 error rate of permutation tests without censoring. For completeness, we include the proof, following Berrett and Samworth (2019), in the Appendix.

Theorem 6.1.

Let $D=(X_{i},Y_{i})_{i=1}^{n}$ be an i.i.d. sample of size $n$ from distribution $P_{XY}$ on $\mathcal{X}\times\mathcal{Y}$ . Denote $\pi D=(X_{\pi(i)},Y_{i})_{i=1}^{n}$ . Let $\pi_{1},\dots\pi_{B}$ be permutations sampled uniformly and independently from $S_{n}$ . Let $H$ be any statistic of the data. Let $R$ be the rank of the first coordinate in the vector:

[TABLE]

when ties are broken at random and where $R=1$ is the rank of the largest element and $R=B+1$ the rank of the smallest element. Define $p\coloneqq R/(B+1)$. Then under the null hypothesis $H_{0}:X\perp\!\!\!\perp Y$ it holds that $p\sim\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$. So in particular for $\alpha\in[0,1]$

[TABLE]

Proof.

See Appendix. ∎

In survival analysis we apply the above, setting $Y=(Z,\Delta)$ , where $Z=\min\{T,C\}$ and $\Delta=1\{T\leq C\}$ .
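The rank-based $p$-value of Theorem 6.1, with ties broken at random, can be sketched as follows. The helper names and the toy statistic (absolute correlation between $X$ and $Z$) are ours and serve only as an example; any statistic of the data could be passed in, and the toy censoring below is drawn independently of $X$.

```python
import numpy as np

def permutation_p_value(x, y, statistic, B, rng):
    """p = R / (B + 1), where R is the rank of the observed statistic."""
    n = len(x)
    observed = statistic(x, y)
    permuted = np.array([statistic(x[rng.permutation(n)], y) for _ in range(B)])
    n_greater = int(np.sum(permuted > observed))
    n_ties = int(np.sum(permuted == observed))
    # Rank 1 is the largest value; ties are broken uniformly at random.
    rank = 1 + n_greater + int(rng.integers(0, n_ties + 1))
    return rank / (B + 1)

def abs_corr_with_time(x, y):
    """Toy statistic: |correlation| between x and the first column of y."""
    return abs(np.corrcoef(x, y[:, 0])[0, 1])

rng = np.random.default_rng(1)
n, B = 60, 199
x = rng.normal(size=n)
t = rng.exponential(size=n) * np.exp(-x)   # lifetime depends on x
c = rng.exponential(size=n) * 2.0          # censoring independent of x
y = np.column_stack([np.minimum(t, c), (t <= c).astype(float)])  # Y = (Z, Delta)
p = permutation_p_value(x, y, abs_corr_with_time, B, rng)

# Deterministic check: a perfectly dependent sample attains the smallest p-value.
x_lin = np.arange(30.0)
y_lin = np.column_stack([x_lin, np.ones(30)])
p_lin = permutation_p_value(x_lin, y_lin, abs_corr_with_time, 99, rng)
```

Only the covariates are permuted, while each pair $(Z_{i},\Delta_{i})$ is kept together.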

Theorem 6.2.

Assume $C\perp\!\!\!\perp X$ and $T\perp\!\!\!\perp C|X$. Write $D=(X_{i},Z_{i},\Delta_{i})_{i=1}^{n}$, and let $\pi_{1},\dots,\pi_{B}$ be sampled i.i.d. uniformly from $S_{n}$. Write $\pi D:=(X_{\pi(i)},Z_{i},\Delta_{i})_{i=1}^{n}$. Let $H(D)$ be any statistic of the data. Let $R$ be the rank of the first coordinate in the vector:

[TABLE]

where ties are broken at random and $R=1$ is the rank of the largest element and $R=B+1$ is the rank of the smallest element. Define $p\coloneqq R/(B+1)$. Then under the null hypothesis $H_{0}:X\perp\!\!\!\perp T$ it holds that $p\sim\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$. So in particular for $\alpha\in[0,1]$

[TABLE]

Proof.

See Appendix. ∎

Having shown that we can use a permutation test with any test statistic when $C\perp\!\!\!\perp X$, the next two sections explore meaningful measures of dependency on right-censored data. Such methods may, however, have inflated type 1 error rates when $C\not\perp\!\!\!\perp X$.

6.1 wHSIC

HSIC relies on estimating the mean embedding of the joint distribution $P_{XT}$ . The estimated embedding is then compared to the embedding of the product of marginal distributions. In the uncensored case the empirical distribution equals $\sum_{i=1}^{n}\delta_{(x_{i},t_{i})}/n$ which has corresponding mean embedding $\sum_{i=1}^{n}K((x_{i},t_{i}),\cdot)/n$ .

Since we do not observe the $t_{i}$’s, we could consider replacing the empirical distribution by a weighted version $\sum_{i=1}^{n}w_{i}\delta_{(x_{i},z_{i})}$, where we try to find weights $w_{i}$ such that $\sum_{i=1}^{n}w_{i}f(x_{i},z_{i})\approx Ef(X,T)$ for every measurable function $f$ such that the expectation exists. A natural idea is to give an observed point $(x,z,\delta)$ a weight of zero if it is censored, and a weight equal to the inverse probability of being uncensored otherwise. This can be motivated by the following lemma, which applies also if $C\not\perp\!\!\!\perp X$. We first define

[TABLE]

The following lemma proposes a weight function $W$ in terms of $g$ .

Lemma 6.1.

Let $f$ be an $(X,T)$ - measurable function such that $Ef(X,T)<\infty$ and define $W$ by

[TABLE]

Then $EWf(X,Z)=Ef(X,T).$

Proof.

See Appendix. ∎

We would thus like to use weights $w_{i}=0$ if $\delta_{i}=0$ and $w_{i}=1/(ng(x_{i},z_{i}))$ if $\delta_{i}=1$. As we will be working under the assumption $C\perp\!\!\!\perp X$, we may estimate $1/P(C>t|X)=1/P(C>t)$ using Kaplan–Meier weights, which we define now. Assume that there are no ties in the data and that $z_{i}<z_{i+1}$ for $i=1,\dots,n-1$. Then the Kaplan–Meier survival curve is given by:

[TABLE]

Kaplan–Meier weights are then defined by

[TABLE]

One can also estimate the survival probability of the censoring time using Kaplan–Meier weights (simply replace $\delta_{i}$ by $1-\delta_{i}$ in the above formula). The following lemma shows that the weights $w_{k}$ defined above, correspond up to a constant of $1/n$ , to the inverse of the estimated probability of being uncensored.

Lemma 6.2.

Let $\hat{S}_{C}(z_{k})=\hat{P}(C>z_{k})$ denote the Kaplan–Meier estimate of the survival probability of the censoring distribution. Then:

[TABLE]

Proof.

See Appendix. ∎
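The relation in Lemma 6.2 can be checked numerically. Since the displayed formulas are not reproduced here, the sketch assumes the standard Kaplan–Meier jump weights $w_{k}=\frac{\delta_{k}}{n-k+1}\prod_{j<k}\bigl(\frac{n-j}{n-j+1}\bigr)^{\delta_{j}}$ for data sorted by increasing $z_{k}$ without ties, and evaluates the censoring survival estimate just before $z_{k}$ (which coincides with its value at $z_{k}$ whenever $\delta_{k}=1$). The function names are ours.

```python
import numpy as np

def km_weights(delta):
    """Kaplan-Meier jump weights for observations sorted by time, no ties."""
    n = len(delta)
    w = np.zeros(n)
    surv = 1.0   # Kaplan-Meier survival estimate of T just before z_k
    for k in range(1, n + 1):
        d = delta[k - 1]
        w[k - 1] = surv * d / (n - k + 1)
        surv *= ((n - k) / (n - k + 1)) ** d
    return w

def km_censoring_survival(delta):
    """Kaplan-Meier estimate of P(C > z) just before each z_k."""
    n = len(delta)
    out = np.ones(n)
    surv = 1.0
    for k in range(1, n + 1):
        out[k - 1] = surv
        surv *= ((n - k) / (n - k + 1)) ** (1 - delta[k - 1])
    return out

delta = np.array([1, 0, 1, 1, 0, 1])
n = len(delta)
w = km_weights(delta)
inverse_prob = delta / (n * km_censoring_survival(delta))
print(w)             # weights of censored observations are zero
print(inverse_prob)  # agrees with w entry by entry
```

In this example the largest observation is an event, so the weights sum to one; if it were censored, the weights would sum to less than one.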

We use these weights to define the statistic wHSIC as the RKHS distance between the embedding of $\sum_{i=1}^{n}w_{i}\delta_{(x_{i},z_{i})}$ and the embedding of the product of the marginals, $\sum_{i=1}^{n}\sum_{j=1}^{n}w_{i}w_{j}\delta_{x_{i}}\delta_{z_{j}}.$ That is:

[TABLE]

Here $K\bigl{(}(x,y),(x^{\prime},y^{\prime})\bigr{)}=k(x,x^{\prime})l(y,y^{\prime})$ . The following theorem shows how to compute $\operatorname{wHSIC}$ efficiently.

Theorem 6.3.

(Computation of wHSIC) Given a dataset $D=\left((x_{i},y_{i})\right)_{i=1}^{n}$ with a weight vector $w\in\mathbb{R}^{n}$ ,

[TABLE]

where $K_{ij}=k(x_{i},x_{j})$ and $L_{ij}=l(y_{i},y_{j})$ and $H_{w}=D_{w}-ww^{\top}$ , where $D_{w}=\text{diag}(w)$ , a diagonal matrix with $(D_{w})_{ii}=w_{i}$ .

Proof.

See Appendix. ∎

It is not difficult to see wHSIC has the same computational time as (uncensored) HSIC.
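As a numerical illustration, the sketch below compares a direct evaluation of the squared RKHS distance between the two embeddings with a trace expression built from the matrices $K$, $L$ and $H_{w}$ named in Theorem 6.3. The displayed formula of the theorem is not reproduced here, so both the form $\operatorname{tr}(H_{w}KH_{w}L)$ and the reading of wHSIC as a squared distance are our assumptions; the kernel and the weights in the example are arbitrary.

```python
import numpy as np

def whsic_trace(K, L, w):
    """Assumed closed form: trace(H_w K H_w L) with H_w = diag(w) - w w^T."""
    Hw = np.diag(w) - np.outer(w, w)
    return np.trace(Hw @ K @ Hw @ L)

def whsic_direct(K, L, w):
    """Squared RKHS distance between the weighted joint and product embeddings."""
    joint = np.einsum("i,k,ik,ik->", w, w, K, L)         # <joint, joint>
    cross = np.einsum("i,k,l,ik,il->", w, w, w, K, L)    # <joint, product>
    product = (w @ K @ w) * (w @ L @ w)                  # <product, product>
    return joint - 2.0 * cross + product

def brownian_kernel_1d(u):
    """k(u, u') = (|u| + |u'| - |u - u'|) / 2 for real-valued u."""
    u = np.asarray(u, dtype=float)
    return 0.5 * (np.abs(u)[:, None] + np.abs(u)[None, :]
                  - np.abs(u[:, None] - u[None, :]))

rng = np.random.default_rng(2)
n = 25
x = rng.normal(size=n)
z = rng.exponential(size=n)
w = rng.uniform(size=n) * (rng.uniform(size=n) < 0.6)  # zero weight = censored
w = w / w.sum()
K, L = brownian_kernel_1d(x), brownian_kernel_1d(z)
v_trace, v_direct = whsic_trace(K, L, w), whsic_direct(K, L, w)
print(v_trace, v_direct)  # equal up to floating-point error
```

The trace form needs only a few $n\times n$ matrix products, whereas the direct triple sum involves $n^{3}$ terms.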

6.2 zHSIC

If $C\perp\!\!\!\perp X$, then any dependence between $X$ and $Z$ must be due to dependence between $X$ and $T$. Hence we may estimate $\operatorname{HSIC}(X,Z)$ to measure the strength of the dependence. This test is expected to yield false rejections if $C\perp\!\!\!\perp X$ fails to hold, and to lack power when a large portion of the events is censored. We include this test mainly to see how the power of optHSIC and wHSIC compares with this approach.

7 Numerical evaluation of the methods

We generate data from various distributions of $X$, $T$ and $C$ to compare the power and type 1 error rate of optHSIC, wHSIC, zHSIC and CPH. CPH stands for the Cox proportional hazards likelihood ratio test. In each scenario, we let the sample size range from $n=40$ to $n=400$ in steps of $40$. To obtain $p$-values in the three HSIC-based methods we use a permutation test with 1999 permutations. We reject the null hypothesis if the obtained $p$-value is less than $0.05$. In the main paper, we present rejection rates under distributions chosen such that approximately $60\%$ of the event times are observed ($\delta=1$). We investigate the rejection rates under varying censoring regimes in Section A.10.2 of the Appendix.

7.1 Type 1 error rate

We begin by investigating the type 1 error rate. The distributions in which we test the type 1 error rate are found in Table 2. In these distributions the null hypothesis holds, i.e. $X\perp\!\!\!\perp T$, and we consider both cases where $C\perp\!\!\!\perp X$ and $C\not\perp\!\!\!\perp X$, as well as the case of multidimensional covariates. We estimate the rejection rates by sampling 5000 times from each distribution for each sample size and applying the different tests to the samples. The obtained rejection rates of optHSIC are found in Table 3. The type 1 error rate of the remaining methods is found in Section A.10.1.
The most important finding is that optHSIC is found to have correct type 1 error rate of $\alpha=0.05$ both when $C\perp\!\!\!\perp X$ and when $C\not\perp\!\!\!\perp X$. In contrast, wHSIC and zHSIC yield many false rejections when $C\not\perp\!\!\!\perp X$, as expected, but have the correct type 1 error rate when $C\perp\!\!\!\perp X$.
Investigation of the $p$-values of optHSIC furthermore showed that $p$-values are distributed approximately according to $\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$ under the null hypothesis, even when $C\not\perp\!\!\!\perp X$ (see Figure 10 in Section A.10.1).
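
As a rough guide to reading the estimated rejection rates, their Monte Carlo standard error follows from the binomial variance. The sketch below is not part of the authors' code: the function name `rejection_rate_se` is hypothetical, and it assumes the repetitions are independent.

```python
import math

def rejection_rate_se(rate, repetitions):
    """Binomial standard error of a rejection rate estimated from independent repetitions."""
    return math.sqrt(rate * (1.0 - rate) / repetitions)

# Nominal level 0.05 estimated from 5000 repetitions (type 1 error study).
se_type1 = rejection_rate_se(0.05, 5000)
# Worst case (true rate 0.5) for 1000 repetitions (power study).
se_power = rejection_rate_se(0.5, 1000)
print(round(se_type1, 4), round(se_power, 4))
```

Under these assumptions, an estimated type 1 error rate within about two standard errors (roughly 0.006) of 0.05 is consistent with the nominal level.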

7.2 Comparison of power

To compare the power of the four methods we consider a number of distributions in which $T\not\perp\!\!\!\perp X$. For each distribution we let the sample size range from $n=40$ to $n=400$ in steps of $40$. At each sample size we take 1000 samples to estimate the rejection rate. In these distributions censoring is independent of the covariate, so that wHSIC and zHSIC do not have an inflated rejection rate due to dependence between $C$ and $X$. Distributions with varying censoring rates and dependent censoring are investigated in Section A.10.2. Parameters are chosen so that $60\%$ of the observations are uncensored ($60\%$ have $\delta_{i}=1$) and so that rejection rates take a range of values (i.e. to exclude trivial distributions in which each method rejects with probability 1 at every sample size).

7.2.1 Power for 1-dimensional covariates

In distributions 1-4 of Table 4 the covariate is 1-dimensional. Scatterplots of the samples and rejection rates are displayed in Figure 8. Note how in D.1 the CPH assumption holds, so the CPH method suits this example very well. We find however that the rejection rate of optHSIC is very similar to that of the CPH likelihood ratio test (first row, right of Figure 8). D.2 features a case in which hazard is highest in the middle, and lower for extreme values of the covariate. The CPH likelihood ratio test does not have power to detect this relationship, and optHSIC is the most powerful method. In D.3 and D.4 the hazard functions of different covariates cross each other. In this case, wHSIC is the top-performing method, but optHSIC is also able to detect the more complicated relationship, while CPH is not able to do so.

7.2.2 Power for multidimensional covariates

In D.5-8 of Table 4 the covariates are multidimensional. Figure 9 shows the rejection rates of the four methods, both when the dimension is fixed at $10$ and the sample size increases (left column) and when the sample size is fixed at 200 and the dimension increases (right column). In D.5 and D.6 the CPH assumption holds. Again we see that the power of optHSIC is relatively close to the power of the CPH likelihood ratio test, which is specifically designed for these assumptions. This holds both when the dependence is on a single sub-dimension of the covariate (D.6) and when the dependence is on all covariates (D.5). In D.7 and D.8 a non-linear term is present in the hazard rate. Together with zHSIC, optHSIC is now the best performing method. Note that as the dimension increases, the dependence in D.6-D.8 becomes harder to detect, whereas the dependence in D.5 becomes easier to detect, since in D.5 $T$ depends on all covariates.

We also investigated the power when $C\not\perp\!\!\!\perp X$ and as the censoring percentage varies from $0\%$ to $80\%$. Distributions and rejection rates are presented in Section A.10.2. The main observation is that, although all methods lose power as the censoring rate increases, the relative performance of the four methods remains similar to the results presented in the main text.

We thus find that for continuous covariates optHSIC is able to detect a wider range of dependencies than the CPH likelihood ratio test, while not losing much power when the CPH assumptions hold. This is true both when the covariate is 1-dimensional and when the covariate is multidimensional, even when the dependence arises from a lower-dimensional subspace of the covariate. Importantly, unlike the methods wHSIC and zHSIC, the type 1 error rate of optHSIC is found to be of the correct level both when $C\perp\!\!\!\perp X$ and when $C\not\perp\!\!\!\perp X$.

7.3 Binary covariates

Lastly, we did experiments with binary covariates, in which case independence testing is equivalent to two-sample testing. For two-sample testing there are other alternative approaches, including Ditzhaus and Friedrich (2020) and Matabuena (2019). Furthermore, the wHSIC approach can be modified firstly to allow for the computation of weights within groups, and secondly a permutation test can be constructed that is valid also when censoring differs between groups. This permutation test was proposed by Wang et al. (2010). This special case of wHSIC coincides with the approach discussed in Matabuena (2019) for two-sample testing.

Our main finding is that optHSIC has a decent all-round performance for two-sample testing. However, especially when the distributions meet additional assumptions used in semi-parametric tests, these semi-parametric tests outperform optHSIC. We also note that, as optHSIC relies on choosing ‘similar covariates’ during the transformation of Algorithm 2, it is not ideally suited for the two-sample case, as there is a chance of choosing the opposite covariate.

We also provide a ‘failure mode’ of optHSIC (we thank a reviewer for this example): a scenario in which optHSIC performs much worse than the CPH test. In this example, covariates are binary (i.e. there are two groups), group sizes are very unequal, $90\%$ of observations are censored, censoring occurs only in the large group, and survival in the two groups is identical until after the censoring has occurred. See Section A.12 for a detailed description.

A full presentation of the experiments and methods used for the case of binary covariates is given in Section A.11. We conclude that the main value of optHSIC lies in independence testing with continuous covariates.

8 Discussion

The main contribution of this paper is the proposal of optHSIC, combining a novel way of using optimal transport to transform right-censored datasets with nonparametric permutation-based independence testing using HSIC/DCOV. We have shown optHSIC has power against a wider range of alternatives than the commonly used CPH model, while forfeiting little power when the CPH assumptions are satisfied exactly. Extensive numerical simulations suggest that the approach is well calibrated, yielding reliable $p$-values even when censoring strongly depends on the covariate. Under the assumption that censoring does not depend on the covariate, we have proven correct type 1 error rate of optHSIC. Theoretical guarantees for the type 1 error under dependent censoring are a topic of future work. We furthermore proposed reweighting the original dataset, and measuring the distance between the resulting weighted mean embeddings. While these methods showed some promise, they do rely on the very strong assumption that $C$ and $X$ are independent. An interesting future challenge is to develop nonparametric tests of mutual independence and of conditional independence between covariates and right-censored times; both kinds of test are used in causal inference and are relevant to many problems in biostatistics.

9 Supplementary materials and code

Supplementary materials contain a worked-out example of the transformation proposed in Algorithm 2, proofs of all results, and details on additional experiments. They also contain methods of combining $p$-values from multiple transformations of the same dataset. Code implementing the tests and replicating the experiments is available at www.github.com/davidrindt/opthsic.

10 Acknowledgment

We thank the reviewers and editor for their helpful comments.

Appendix A Appendix

A.1 Example of transformation

Let $D$ be the following dataset consisting of $n=5$ individuals:

We initialize $\tilde{D}\leftarrow[\ ]$ to be an empty multiset and set $\operatorname{L}\leftarrow[1.3,0.5,0.3,-1.1,-0.9]$ and $\operatorname{AR}\leftarrow[1.3,0.5,0.3,-1.1,-0.9]$ . We loop over the events $i=1,\dots,4$ .

At the first time $z_{1}=13$ it holds that $\delta_{1}=1$. We compute the joint distribution $P_{UU^{\prime}}$ that solves the optimal transport problem between $U\sim\text{Uniform}(\operatorname{AR})$ and $U^{\prime}\sim\text{Uniform}(\operatorname{L})$. Since it holds that $\operatorname{AR}=\operatorname{L}$, it follows that $P_{UU^{\prime}}$ is the matrix:

[TABLE]

Conditioning $P$ on $U=x_{1}$ yields $v\leftarrow[1,0,0,0,0]$ , corresponding to the first row of $P$ . Sampling from $\operatorname{L}$ with distribution $v$ yields $\tilde{x}\leftarrow 1.3=x_{1}$ with probability 1. We update $\tilde{D}\leftarrow[(1.3,13)]$ . We also replace $\operatorname{L}\leftarrow[0.5,0.3,-1.1,-0.9]$ and $\operatorname{AR}\leftarrow[0.5,0.3,-1.1,-0.9]$ . We move to the next event time.

At $z_{2}=22$ we note that $\delta_{2}=0$ , so we only remove $x_{2}=0.5$ from $\operatorname{AR}$ and update $\operatorname{AR}\leftarrow[0.3,-1.1,-0.9]$ , while leaving $\operatorname{L}$ and $\tilde{D}$ unchanged.

At the third event $z_{3}=24$ it holds that $\delta_{3}=1$, $\operatorname{AR}=[0.3,-1.1,-0.9]$ and $\operatorname{L}=[0.5,0.3,-1.1,-0.9]$. We couple the random variables $U\sim\text{Uniform}(\operatorname{AR})$ and $U^{\prime}\sim\text{Uniform}(\operatorname{L})$ using optimal transport. The resulting distribution equals:

[TABLE]

We condition this distribution on $U=x_{3}=0.3$ . This corresponds to the first row of $P$ , and the resulting distribution over $\operatorname{L}$ equals: $v\leftarrow[0.75,0.25,0,0]$ . We now sample a point from this distribution and, suppose, it turns out to be $\tilde{x}\leftarrow 0.5=x_{2}$ , which has $75\%$ chance. We update $\tilde{D}\leftarrow[(1.3,13),(0.5,24)]$ . We also replace $\operatorname{L}\leftarrow[0.3,-1.1,-0.9]$ and $\operatorname{AR}\leftarrow[-1.1,-0.9]$ before moving to the next event.

At $i=4$ it holds that $z_{4}=45$ and $\delta_{4}=1$. We note $\operatorname{AR}=[-1.1,-0.9]$ and $\operatorname{L}=[0.3,-1.1,-0.9]$. We couple the random variables $U\sim\text{Uniform}(\operatorname{AR})$ and $U^{\prime}\sim\text{Uniform}(\operatorname{L})$ using optimal transport. The resulting distribution equals:

[TABLE]

We condition $P$ on $U=x_{4}=-1.1$ , resulting in $v\leftarrow[0,0.67,0.33]$ . In this case our sample turns out to be $\tilde{x}\leftarrow-1.1=x_{4}$ , which happens with probability $0.67$ . Hence $\tilde{D}\leftarrow[(1.3,13),(0.5,24),(-1.1,45)]$ . We also replace $\operatorname{L}\leftarrow[0.3,-0.9]$ and $\operatorname{AR}\leftarrow[-0.9]$ .

We have now finished the loop $i=1,\dots,4$. Since $z_{5}=81$ and $\operatorname{L}=[0.3,-0.9]$, we add the two datapoints $(0.3,81)$ and $(-0.9,81)$ to $\tilde{D}$. The finalized transformed dataset equals

[TABLE]
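
The conditional distributions $v$ in the example above can be reproduced with a short sketch. It assumes that, for one-dimensional covariates, the optimal transport plan between the two uniform distributions is the monotone (sorted) coupling, which holds for costs that are convex in the distance. The cost function used by the authors is not restated in this section, and the function names `monotone_coupling` and `conditional_row` are hypothetical.

```python
from fractions import Fraction

def monotone_coupling(ar, l):
    """Couple Uniform(ar) with Uniform(l) by matching sorted values.

    Returns a matrix whose (i, j) entry is the mass moved from ar[i] to l[j].
    """
    ar_order = sorted(range(len(ar)), key=lambda i: ar[i])
    l_order = sorted(range(len(l)), key=lambda j: l[j])
    plan = [[Fraction(0)] * len(l) for _ in ar]
    a_left = Fraction(1, len(ar))  # mass still to be sent from the current point
    b_left = Fraction(1, len(l))   # mass still to be received by the current point
    i = j = 0
    while i < len(ar) and j < len(l):
        mass = min(a_left, b_left)
        plan[ar_order[i]][l_order[j]] += mass
        a_left -= mass
        b_left -= mass
        if a_left == 0:
            i += 1
            a_left = Fraction(1, len(ar))
        if b_left == 0:
            j += 1
            b_left = Fraction(1, len(l))
    return plan

def conditional_row(plan, i):
    """Distribution over l obtained by conditioning the coupling on U = ar[i]."""
    total = sum(plan[i])
    return [p / total for p in plan[i]]

# First event: AR and L coincide, so the coupling is diagonal.
full = [1.3, 0.5, 0.3, -1.1, -0.9]
v1 = conditional_row(monotone_coupling(full, full), 0)
# Third event: condition on U = 0.3 (first entry of AR).
v3 = conditional_row(monotone_coupling([0.3, -1.1, -0.9], [0.5, 0.3, -1.1, -0.9]), 0)
# Fourth event: condition on U = -1.1 (first entry of AR).
v4 = conditional_row(monotone_coupling([-1.1, -0.9], [0.3, -1.1, -0.9]), 0)
print(v1, v3, v4)
```

Under this assumption the sketch gives $[1,0,0,0,0]$ at the first event, $[3/4,1/4,0,0]$ at the third and $[0,2/3,1/3]$ at the fourth, matching the values of $v$ quoted in the example.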

A.2 Proof of Lemma 5.1

Let $D=((x_{i},z_{i},\delta_{i}))_{i=1}^{n}$ where $z_{i}$ is increasing, and assume for convenience there are no ties in $z$. Denote by $k\coloneqq|\{i:\delta_{i}=1\}|$ the number of observed events. We do not view $D$ as random in this section. Applying the optimal transport algorithm results in a random, transformed dataset, which we denote by $T(D)$. Note that the times and covariates in $T(D)$ are not random, since they are determined by $D$, but the way in which they are paired up in the transformation $T$ may be random. The same set of times and covariates is obtained in $\pi(T(D))$ and $T(\pi(D))$ for any $\pi\in S_{n}$. Denote the times in $T(D)$ by $t_{1}\leq\dots\leq t_{n}$ and define a standard pairing $\tilde{D}=\left((x_{i},t_{i})\right)_{i=1}^{n}$. We will often use that $T(D),\pi(T(D)),T(\pi(D))$ are all permutations (possibly random) of $\tilde{D}$. Finally, define $h:\{1,\dots,k\}\to\{1,\dots,n\}$ so that $t_{i}=z_{h(i)}$, which says that the $i$-th observed event is the $h(i)$-th overall event. As a last piece of notation, we will use $\Pi$ to denote a uniform random permutation, and $\pi$ to denote a specific instance of a permutation. In particular we denote $\Pi_{i}=\Pi(i)$ and $\Pi_{1:h(i)-1}=[\Pi_{1},\dots,\Pi_{h(i)-1}]$. This corresponds to the covariates in the permuted dataset $\Pi(D)$ until just before the time of the $i$-th observed event.

We prove the lemma by showing that the left- and right-hand sides of Lemma 5.1 are both equal in distribution to

[TABLE]

This is done in separate lemmas.

Lemma A.1.

[TABLE]

Proof.

By the above remarks we see that $T(D)=\Pi^{D}(\tilde{D})$ for some random permutation $\Pi^{D}$ . (Note: The randomness in $\Pi^{D}$ comes from the transformation $T$ , not from the dataset $D$ , which is fixed.) It suffices to show that

[TABLE]

This is easy to see by conditioning on $\Pi^{D}$. Let $\pi^{0},\dots,\pi^{B}$ be arbitrary permutations. Then

[TABLE]

Since $(\Pi^{1},\dots,\Pi^{B})$ are independent uniform permutations, this is the same as

[TABLE]

∎

We now consider the effect of first permuting and then transforming the data.

Lemma A.2.

Let $\Pi$ be a uniformly chosen permutation of $S_{n}$ and let $T$ be defined through optHSIC. It holds that

[TABLE]

Proof.

By the comments above, we can define a random permutation $\Sigma$ by $\Sigma(\tilde{D})\coloneqq T(\Pi(D))$ . We wish to show that $P(\Sigma=\sigma)=1/n!$ for all $\sigma\in S_{n}$ . To do so, we will condition on events of the form

[TABLE]

which determines the covariates in the permuted dataset up to (just before) the time of the $i$-th observed event. We also condition on $\Sigma_{1:i-1}$, fixing the covariates in the transformed dataset up to the $i$-th observed event. Note that this conditioning fixes the coupling defined in the optimal transport algorithm. Namely, we let $U^{\prime}$ and $U$ be the coupled random variables resulting from optimal transport between choosing uniformly from the covariates indexed by $[n]\setminus\{\sigma_{1:i-1}\}$ and choosing uniformly from the covariates indexed by $[n]\setminus\{\pi_{1:h(i)-1}\}$, respectively. Then the transformation samples $U^{\prime}$ conditional on $U=x_{\Pi_{h(i)}}$. Because $\Pi$ is a uniformly chosen permutation, given the events we conditioned on so far, $U$ is uniformly chosen from the covariates indexed by $[n]\setminus\{\pi_{1:h(i)-1}\}$. By the definition of the coupling, $U^{\prime}$ is thus uniform on the covariates indexed by $[n]\setminus\{\sigma_{1:i-1}\}$. That is, $\Sigma_{i}$ is chosen uniformly from $[n]\setminus\{\sigma_{1:i-1}\}$. Mathematically, for any $\sigma,\pi\in S_{n}$ such that the conditioning event has nonzero probability, it holds that

[TABLE]

Having shown that, conditioned on what happened in both the permuted dataset, and the synthetic dataset, the new synthetic covariate is chosen uniformly from those not chosen before, we aim to derive a recurrence relation so as to apply this result at each successive time. To this end note that

[TABLE]

where we use the previously established equality in the first equality. This allows us to compute

[TABLE]

Since the indices $\Sigma_{(k+1):n}$ are added in uniform random order by definition of the transformation algorithm, this concludes the lemma. ∎

Lemma A.3.

[TABLE]

Proof.

The left hand side can be written as $[\Pi^{D}(\tilde{D}),\Sigma^{1}(\tilde{D}),\dots,\Sigma^{B}(\tilde{D})]$ . The lemma above shows that the $\Sigma^{i}$ for $i\geq 1$ have the correct distributions. We only need to show they and $\Pi^{D}$ are a sequence of mutually independent permutations. But $\Sigma^{i}$ is determined completely by $\Pi^{i}$ and $T_{i}$ , and $\Pi^{D}$ is determined by $T$ . The proof follows since all these variables are mutually independent.∎

Lemmas A.1 and A.3 together prove Lemma 5.1.

A.3 Proof of Theorem 5.1

The proof of Theorem 6.2 shows that, if $C\perp\!\!\!\perp X$, then

[TABLE]

is an exchangeable vector. In particular, if $T,T_{1},\dots,T_{B}$ are independent identically distributed transformations of the data, then also

[TABLE]

is exchangeable. We let $T$ be the transformation of the data given by the optimal transport algorithm. By Lemma 5.1 the above vector is equal in distribution to

[TABLE]

implying that the latter is also exchangeable. For an arbitrary statistic $H$ ,

[TABLE]

is thus exchangeable too. In particular, the rank of the first entry is uniformly distributed on $1,\dots,B+1$ , which proves the theorem.

A.4 Proof of Theorem 6.1

Proof.

This proof is based on the proof of Lemma 3 of Berrett and Samworth (2019). Since $H_{0}:X\perp\!\!\!\perp Y$ implies that $(X_{i},Y_{j})\stackrel{d}{=}(X_{i},Y_{i})$, it is easy to see that $\pi(D)\stackrel{d}{=}D$ for any permutation $\pi$. Writing $\Pi^{0}=id$, and $\Pi^{1},\dots,\Pi^{B}$ for i.i.d. uniform permutations, we aim to show that, for any permutation $\sigma$ of $\{0,1,\dots,B\}$

[TABLE]

that is, that the random vector on the left is exchangeable. We observed above that the first entries are equal in distribution. It remains to show that the other entries of the right-hand side are uniform and independently chosen permutations of the first entry. Indeed, writing $\Pi^{\sigma_{0}}(D)=\tilde{D}$ , we can rewrite the right-hand side as:

[TABLE]

So it remains to show that $\bigl(\Pi^{\sigma_{j}}(\Pi^{\sigma_{0}})^{-1},1\leq j\leq B\bigr)$ are independent uniformly chosen permutations of $S_{n}$. If $\sigma_{0}=0$, then $\Pi^{\sigma_{0}}=id$ and $\tilde{D}=D$ and the result is obvious. Now assume that $\sigma_{i}=0$ for some $i\geq 1$.

[TABLE]

It follows that the vector

[TABLE]

is indeed exchangeable. Letting $H$ denote any arbitrary function on data, it follows that:

[TABLE]

is also exchangeable. If we break ties at random, this implies that every ordering of the $B+1$ elements is equally likely. In particular, the rank of an individual element is uniformly distributed on $\{1,\dots,B+1\}$ , and the result follows.

∎
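
The rank argument above can be illustrated by a generic permutation test. The sketch below is an illustration rather than the authors' implementation: Algorithm 3 is not reproduced in this section, so the convention $p=\text{rank}/(B+1)$ with random tie-breaking is an assumption, the absolute covariance is a stand-in for the HSIC statistic, and the function names are hypothetical.

```python
import random

def abs_covariance(x, y):
    """Stand-in test statistic (the paper uses HSIC): absolute sample covariance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n)

def permutation_p_value(x, y, statistic, num_permutations, rng):
    """Rank of the observed statistic among permuted ones, divided by B + 1."""
    observed = statistic(x, y)
    greater = 0
    ties = 0
    for _ in range(num_permutations):
        permuted = list(x)
        rng.shuffle(permuted)  # uniform random permutation of the covariates
        value = statistic(permuted, y)
        if value > observed:
            greater += 1
        elif value == observed:
            ties += 1
    # Break ties at random: the observed value is placed uniformly among its ties.
    rank = 1 + greater + rng.randint(0, ties)
    return rank / (num_permutations + 1)

rng = random.Random(0)
x = [float(i) for i in range(20)]
y = list(x)  # perfectly dependent toy data
p = permutation_p_value(x, y, abs_covariance, 99, rng)
print(p)
```

By construction the returned value lies on the grid $\{\frac{1}{B+1},\dots,\frac{B+1}{B+1}\}$, which is the support of the null distribution discussed in the proof.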

A.5 Proof of Theorem 6.2

Proof.

When we assume that $C\perp\!\!\!\perp X$ then, under the null hypothesis $H_{0}:T\perp\!\!\!\perp X$, it follows that the pair $(T,C)\perp\!\!\!\perp X$. As $(Z,D)$ is $(T,C)$-measurable, also $(Z,D)\perp\!\!\!\perp X$. If we write $Y=(Z,D)$, then Theorem 6.1 applies.

∎

A.6 Proof of Lemma 6.1

The following computation shows that $EWf(X,Z)=Ef(X,T)$ for all functions $f$ . We denote the distribution of $(X,T,C)$ on $\mathcal{X}\times\mathbb{R}_{\geq 0}\times\mathbb{R}_{\geq 0}$ by $\mu_{XTC}$ . As we are assuming independence of $T$ and $C$ given $X$ we can decompose $\mu_{XTC}=\mu_{XT}\times\mu_{C|X}$ .

[TABLE]

where the penultimate equality follows because $\int_{t}^{\infty}\mu_{C|x}(dc)=\mathbb{P}(C>t|X=x)=g(t,x)$ .

A.7 Proof of Lemma 6.2

Estimating the survival function of the censoring distribution amounts to replacing $\delta$ by $1-\delta$ in the Kaplan–Meier survival curve. This yields:

[TABLE]

Thus the probability of being uncensored by time $z_{k}$ equals:

[TABLE]

Note now that

[TABLE]

for points that are uncensored. That is, Kaplan–Meier weights equal a re-scaled inverse of the probability of being uncensored by that time.
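
The relation can be checked numerically on a small example. The sketch below assumes the standard Kaplan–Meier jump formula without ties, takes the rescaling factor to be $1/n$, and evaluates the censoring survival estimate just before each time; the displayed formulas of the lemma are not reproduced here, so these conventions, the indicator values and the function names are assumptions for illustration.

```python
from fractions import Fraction

def kaplan_meier_weights(deltas):
    """Jumps of the Kaplan-Meier curve; deltas are ordered by increasing time, no ties."""
    n = len(deltas)
    survival = Fraction(1)
    weights = []
    for i, delta in enumerate(deltas):
        at_risk = n - i
        jump = survival * Fraction(delta, at_risk)
        weights.append(jump)
        survival -= jump
    return weights

def censoring_survival_before(deltas):
    """Kaplan-Meier estimate for the censoring time (delta swapped with 1 - delta),
    evaluated just before each observed time."""
    n = len(deltas)
    survival = Fraction(1)
    values = []
    for i, delta in enumerate(deltas):
        values.append(survival)
        at_risk = n - i
        survival *= 1 - Fraction(1 - delta, at_risk)
    return values

deltas = [1, 0, 1, 1, 1]  # hypothetical event indicators, ordered by time
n = len(deltas)
weights = kaplan_meier_weights(deltas)
g_before = censoring_survival_before(deltas)
rescaled_inverse = [Fraction(d) / (n * g) for d, g in zip(deltas, g_before)]
print(weights)
print(rescaled_inverse)
```

For this input both lists coincide, which is the identity stated in the lemma under the conventions assumed here.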

A.8 Proof of Theorem 6.3

Proof.

The squared norm, written as the inner product with itself, can be expanded into three terms $a_{1}+a_{2}-2a_{3}$ that we compute in turn. We denote by $A\circ B$ the entrywise product of the matrices $A$ and $B$ . Using the Hadamard product property $\alpha^{\top}\left(A\circ B\right)\beta=\operatorname{tr}\left(D_{\alpha}AD_{\beta}B^{\top}\right)$ where $D_{\alpha}=\text{diag}(\alpha)$ , $D_{\beta}=\text{diag}(\beta)$ , we have the following identities:

[TABLE]

As the entrywise product is symmetric in its arguments, we see that also

[TABLE]

Thus the weighted HSIC is

[TABLE]

with $H_{w}=\left(D_{w}-ww^{\top}\right)$. In the standard HSIC case $w=\frac{1}{n}(1,1,\dots,1)\coloneqq 1_{n}$ and $D_{w}=\frac{1}{n}I$, so that $H_{w}=\frac{1}{n}I-1_{n}1_{n}^{\top}$ is the standard (scaled) centering matrix. ∎
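
Two ingredients of this proof are easy to verify numerically: the Hadamard product identity and the reduction of $H_{w}$ to the scaled centering matrix for uniform weights. The following sketch uses small exact matrices; the helper names and the example matrices are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def diag(v):
    return [[v[i] if i == j else 0 for j in range(len(v))] for i in range(len(v))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Check alpha^T (A o B) beta = tr(D_alpha A D_beta B^T) on a non-symmetric example.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
alpha = [1, 2]
beta = [3, 4]
lhs = sum(alpha[i] * A[i][j] * B[i][j] * beta[j] for i in range(2) for j in range(2))
rhs = trace(matmul(matmul(matmul(diag(alpha), A), diag(beta)), transpose(B)))

def weighted_centering(w):
    """H_w = D_w - w w^T."""
    n = len(w)
    return [[(w[i] if i == j else 0) - w[i] * w[j] for j in range(n)] for i in range(n)]

# With uniform weights, H_w equals (I - (1/n) 1 1^T) / n.
n = 4
uniform = [Fraction(1, n)] * n
h_w = weighted_centering(uniform)
h_scaled = [[(Fraction(int(i == j)) - Fraction(1, n)) / n for j in range(n)]
            for i in range(n)]
print(lhs == rhs, h_w == h_scaled)
```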

A.9 Using multiple transformations

We list 4 ways of combining $p$ -values.

Method 1:

Use a Bonferroni correction and reject $H_{0}$ if for the smallest $p$ -value, denoted by $p_{(1)}$ , it holds that $p_{(1)}\leq\alpha/m$ .

Method 2:

Make the following (random) rejection decision: reject $H_{0}$ with probability $\sum_{i=1}^{m}1\{p_{i}\leq\alpha\}/m$ , and accept $H_{0}$ otherwise.

Method 3:

Fix $\beta\leq\alpha$ and reject if $\sum_{i=1}^{m}1\{p_{i}\leq\beta\}/m\geq\beta/\alpha$ . For example, reject if $\sum_{i=1}^{m}1\{p_{i}\leq 3\alpha/4\}/m\geq 3/4$ .

Method 4:

Reject if $2\sum_{i=1}^{m}p_{i}/m\leq\alpha$ . This has the advantage that it results in a quantity that can be used as a $p$ -value: $2\sum_{i=1}^{m}p_{i}/m$ .
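
The four rules above can be written down directly. The sketch below is a minimal illustration with hypothetical function names and made-up $p$-values; for Method 2 it returns the rejection probability rather than drawing the random decision.

```python
def bonferroni_reject(p_values, alpha):
    """Method 1: reject if the smallest p-value is at most alpha / m."""
    return min(p_values) <= alpha / len(p_values)

def randomized_reject_probability(p_values, alpha):
    """Method 2: probability with which H0 is rejected."""
    return sum(p <= alpha for p in p_values) / len(p_values)

def fraction_rule_reject(p_values, alpha, beta):
    """Method 3: reject if the fraction of p-values at most beta is >= beta / alpha."""
    fraction = sum(p <= beta for p in p_values) / len(p_values)
    return fraction >= beta / alpha

def twice_mean_p_value(p_values):
    """Method 4: combined p-value, twice the average of the individual p-values."""
    return 2.0 * sum(p_values) / len(p_values)

p_values = [0.01, 0.02, 0.02, 0.2]  # illustrative values, not from the paper
alpha = 0.05
reject_1 = bonferroni_reject(p_values, alpha)
prob_2 = randomized_reject_probability(p_values, alpha)
reject_3 = fraction_rule_reject(p_values, alpha, beta=alpha / 2)
combined_4 = twice_mean_p_value(p_values)
reject_4 = combined_4 <= alpha
print(reject_1, prob_2, reject_3, combined_4, reject_4)
```

In this toy example Methods 1 and 3 reject, Method 2 rejects with probability 0.75, and Method 4 does not reject because a single large $p$-value dominates the average.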

Throughout this section, assume that the null hypothesis holds. Let $p$ be the $p$-value resulting from sampling a dataset $D$ once, followed by running optHSIC once (so exactly one transformation and one permutation test on the transformed data). We aim to show that the methods 1 and 2 above have correct type 1 error under the assumption that $P_{H_{0}}(p\leq\alpha)\leq\alpha$ for $\alpha\in[0,1]$, which we proved for $C\perp\!\!\!\perp X$ and expect to be (approximately) true for $C\not\perp\!\!\!\perp X$.
We aim to show that methods 3 and 4 have asymptotically (as the number of $p$-values goes to infinity) correct type 1 error rate under the assumption that $p\sim\text{Unif}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$, which we proved for $C\perp\!\!\!\perp X$ and expect to be (approximately) true also for $C\not\perp\!\!\!\perp X$. See Table 1. While method 2 is less conservative, it is a random rejection decision which is less desirable. We can imagine Method 3 being not too conservative when $\beta=3\alpha/4$.

A.9.1 Method 1

Assume it holds that $P_{H_{0}}(p\leq\alpha)\leq\alpha$ (see comments at the start of the section). Let $p_{(1)},\dots,p_{(m)}$ be the $p$ -values obtained from applying optHSIC $m$ times to $D$ , in ascending order. The Bonferroni correction procedure rejects $H_{0}$ if $p_{(1)}\leq\alpha/m$ . This has the correct type 1 error probability because by the union bound under the null hypothesis

[TABLE]

A.9.2 Method 2

Assume it holds that $P_{H_{0}}(p\leq\alpha)\leq\alpha$ (see comments at the start of the section). Given the $p$ -values $p_{1},\dots,p_{m}$ , the second method makes a random rejection decision in the following way: Reject $H_{0}$ with probability $\sum_{i=1}^{m}1\{p_{i}\leq\alpha\}/m$ , and accept $H_{0}$ otherwise. This has correct type 1 error because

[TABLE]

A.9.3 Method 3

Fix $\beta\leq\alpha$ and reject if $\sum_{i=1}^{m}1\{p_{i}\leq\beta\}/m\geq\beta/\alpha$. An example would be to set $\beta=\alpha/2$, in which case we reject if $\sum_{i=1}^{m}1\{p_{i}\leq\alpha/2\}/m\geq\frac{1}{2}$. Assume $P_{H_{0}}(p\leq\alpha)\leq\alpha$.

This is an approximate method. The ‘ideal’ and practically impossible method is to reject if $P(p\leq\beta|D)\geq\beta/\alpha$ . We show that this ‘ideal’ method has the correct type 1 error:

[TABLE]

where $A$ is the event that

[TABLE]

Assume by contradiction that $P_{H_{0}}(A)>\alpha$ . Then it must hold that

[TABLE]

which contradicts that $P_{H_{0}}(p\leq\beta)\leq\beta$. Hence it must hold that $P_{H_{0}}(A)\leq\alpha$. Because in practice $P(p\leq\beta|D)$ is unknown, we can estimate it by $\sum_{i=1}^{m}1\{p_{i}\leq\beta\}/m$ and reject if $\sum_{i=1}^{m}1\{p_{i}\leq\beta\}/m\geq\beta/\alpha$. Since $\sum_{i=1}^{m}1\{p_{i}\leq\beta\}/m\to P(p\leq\beta|D)$ as $m\to\infty$, it is easy to see the approximate method is asymptotically correct.

A.9.4 Method 4

Method 4 is to reject if $2\sum_{i=1}^{m}p_{i}/m\leq\alpha$. This is an approximation of the ‘ideal’ and practically impossible method of rejecting $H_{0}$ if $D$ is such that $E(p|D)\leq\alpha/2$. We assume it holds that $p\sim\text{Uniform}[0,1]$: if we prove the result under this assumption, it also follows under the assumption $p\sim\text{Uniform}[\frac{1}{B+1},\dots,\frac{B+1}{B+1}]$, since the latter distribution corresponds to a more conservative test. We now show that this ‘ideal’ method has the correct type 1 error rate. Note

[TABLE]

where $A$ is the event that

[TABLE]

Define the following family of distributions:

[TABLE]

where

[TABLE]

We verify that the family $M_{A}$ and the set $A$ satisfy three conditions:

Condition 1: For all $\mu_{D}\in M_{A}$ it holds that

[TABLE]

Condition 2: For all $\mu_{D}\in M_{A}$ it holds that

[TABLE]

by definition of $A$ .

Condition 3: For all $0\leq a\leq b\leq 1$

[TABLE]

We now define the $\nu_{A}$ to be an ‘average’ of the distributions in $M_{A}$ :

[TABLE]

It is easy to see $\nu_{A}$ satisfies condition 1:

[TABLE]

To see $\nu_{A}$ satisfies condition 2 note that

[TABLE]

To see $\nu_{A}$ satisfies condition 3 note that

[TABLE]

Here $E^{\prime}_{H_{0}}$ denotes expectation with respect to $D^{\prime}$ and $E_{H_{0}}$ with respect to $D$. Note condition 1 says that $\nu_{A}$ is a probability measure, condition 2 says its expectation is less than $\alpha/2$ and condition 3 that $\nu_{A}$ is dominated by the measure defined by the uniform density $1/P_{H_{0}}(A)$.

Thus, if $P_{H_{0}}(A)=\beta$ , then $\nu_{A}$ satisfies the three conditions above, with $\beta$ in the third condition. We now show that there is a maximum value $\beta^{\star}$ so that if $P_{H_{0}}(A)=\beta>\beta^{\star}$ , then it is impossible for any distribution $\nu$ to satisfy the three conditions above.

We first show $\beta^{\star}\geq\alpha.$ Assume that $P_{H_{0}}(A)=\alpha$ . If we let $\nu_{\alpha}$ be the uniform probability measure on $[0,\alpha]$ , then it is clear that the first two conditions are met: it is a valid probability distribution (condition 1), the expectation is exactly $\alpha/2$ (condition 2). The third condition is met since for $0\leq a\leq b\leq\alpha$ it holds that

[TABLE]

because by assumption $P_{H_{0}}(A)=\alpha$ .

We now need to show that if $\beta>\alpha$ there does not exist a distribution $\nu$ that satisfies the three conditions. To that end, note first that if $\nu$ satisfies the three conditions for $\beta$ , then it also satisfies the conditions for any $\beta^{\prime}$ such that $\beta^{\prime}<\beta$ (note $P_{H_{0}}(A)$ only appears in the third condition). So in particular such $\nu$ would have to satisfy the conditions also with $\beta=\alpha$ . However, to change the distribution $\nu_{\alpha}$ defined above, one cannot place more mass in the region $[0,\alpha]$ by condition 3, which says $\nu$ needs to be dominated by the measure defined by the uniform density $1/P_{H_{0}}(A)$ . On the other hand, if one removes mass from $[0,\alpha]$ then one automatically increases the mean of the distribution, which violates condition 2, since the mean of $\nu_{\alpha}=\alpha/2$ . We conclude that $\beta^{\star}=\alpha$ . The type 1 error of the ‘ideal’ method is thus at most $\alpha$ .

Since $\sum_{i=1}^{m}p_{i}/m\to E(p|D)$ as $m\to\infty$, it is easy to see the approximate method has asymptotically correct type 1 error rate. This method has the advantage that it results in a combined $p$-value, $2\sum_{i=1}^{m}p_{i}/m$, whereas the other methods only lead to rejection decisions. The $p$-value will be conservative if there is little randomness in the dataset. In the case there is no censoring, the $p$-value is a factor 2 bigger than necessary. However, the Bonferroni correction would result in a $p$-value that is a factor $m$ bigger than necessary (where $m$, the number of transformations used, may be much larger than $2$).

A.10 Tables

A.10.1 Type 1 error rates

A.10.2 Rejection rate under varying censoring regimes

A.11 Binary covariates

As a special case of independence testing we consider the case of a single binary covariate, i.e., $X\in\{0,1\}$. If one groups the data by covariate, then testing independence of $T$ and $X$ is equivalent to testing equality of the lifetime distributions of the two groups. This is known as two-sample testing on right-censored data. Popular approaches to this problem are the logrank test and various weighted logrank tests. optHSIC can be applied to this problem without any adjustments, while wHSIC can be improved in this case in two ways: first, the weights can be estimated even when the censoring distribution differs between the two groups; and second, there exists an alternative permutation strategy that, in our experiments, appears to control the type 1 error effectively even under dependent censoring. These adjustments are described in Section A.11.1 and Section A.11.2 respectively. We omit zHSIC here: it is fundamentally more limited, and a larger number of alternative methods is available in the two-sample setting.

A.11.1 wHSIC for two-sample testing

Let $P_{0}$ and $P_{1}$ denote the distribution of $T|X=0$ and $T|X=1$ respectively. Let the total sample be $D=\left((x_{i},z_{i},\delta_{i})\right)_{i=1}^{n}$ as before, and write $\left((z_{i}^{0},\delta^{0}_{i})\right)_{i=1}^{n_{0}}$ and $\left((z_{i}^{1},\delta^{1}_{i})\right)_{i=1}^{n_{1}}$ for the event times and indicators of individuals with covariate $X=0$ and $X=1$ respectively. We want to assess whether $P_{0}=P_{1}$. We again use the covariance kernel of Brownian motion. If all of the $n$ times were observed ($\delta=1$), we could measure the difference in empirical distributions between both groups by the MMD between the two distributions:

[TABLE]

Similar to Section 6.1, when some observations are censored, we might reweight the empirical distributions, and instead compare the weighted empirical distributions

[TABLE]

We propose that the weights $w_{i}$ are computed by the Kaplan–Meier weights within each group. The test statistic thus becomes:

[TABLE]

This statistic was also, independently, proposed by Matabuena (2019), and can be seen as a special case of wHSIC in the case of binary covariates. Under the hypothesis that $C \perp\!\!\!\perp X$, one can obtain $p$-values using a permutation test, resulting in the following algorithm. Section A.11.2 provides an alternative permutation strategy under dependent censoring that was proposed by Wang et al. (2010). It was proposed in the context of the logrank test, but can equally be used for other statistics.
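The construction in this subsection can be sketched in code. The sketch below assumes the usual squared-MMD form with Kaplan–Meier jump sizes as within-group weights and the Brownian covariance kernel $k(s,t)=\min(s,t)$; the exact normalisation of the statistic and the definition of the permutation $p$-value may differ from the paper's, and all function names are hypothetical.

```python
import numpy as np

def kaplan_meier_weights(z, delta):
    """Jump of the Kaplan-Meier estimator at each observation (0 if censored)."""
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = z.size
    # Sort by time; at ties, events come before censorings (a common convention).
    order = np.lexsort((-delta, z))
    d = delta[order]
    at_risk = n - np.arange(n)                    # n, n-1, ..., 1
    factors = (1.0 - 1.0 / at_risk) ** d          # survival factor of each point
    surv_before = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w = np.empty(n)
    w[order] = d * surv_before / at_risk
    return w

def brownian_kernel(s, t):
    """Covariance kernel of Brownian motion, k(s, t) = min(s, t)."""
    return np.minimum.outer(s, t)

def weighted_mmd2(z0, d0, z1, d1):
    """Squared MMD between the two Kaplan-Meier-weighted empirical distributions."""
    z0, z1 = np.asarray(z0, dtype=float), np.asarray(z1, dtype=float)
    w0, w1 = kaplan_meier_weights(z0, d0), kaplan_meier_weights(z1, d1)
    return (w0 @ brownian_kernel(z0, z0) @ w0
            + w1 @ brownian_kernel(z1, z1) @ w1
            - 2.0 * w0 @ brownian_kernel(z0, z1) @ w1)

def permutation_p_value(x, z, delta, n_perm=1999, seed=0):
    """Permute group labels; p = (1 + #{permuted >= observed}) / (n_perm + 1) (assumed form)."""
    rng = np.random.default_rng(seed)
    x, z, delta = np.asarray(x), np.asarray(z, dtype=float), np.asarray(delta)

    def stat(labels):
        g0, g1 = labels == 0, labels == 1
        return weighted_mmd2(z[g0], delta[g0], z[g1], delta[g1])

    observed = stat(x)
    permuted = np.array([stat(rng.permutation(x)) for _ in range(n_perm)])
    return (1 + np.sum(permuted >= observed)) / (n_perm + 1)

# Toy data, purely illustrative.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
z = np.array([1.2, 0.7, 3.1, 2.4, 4.0, 5.5, 3.8, 6.1])
delta = np.array([1, 0, 1, 1, 1, 1, 0, 1])
print(permutation_p_value(x, z, delta))
```

Note that the Kaplan–Meier weights within a group sum to less than one when the largest observation in that group is censored; whether and how to renormalise in that case is left open in this sketch.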

A.11.2 ipxHSIC

This subsection gives an overview of a test we name ipxHSIC, which uses the same statistic $\text{wHSIC}(D)$ defined in Section A.11.1 above, but a different permutation strategy that is robust against differences in the censoring distributions of both groups. The permutation strategy was proposed in Wang et al. (2010) to provide reliable $p$-values for the logrank statistic in the case of small or unequal sample sizes. In fact, Wang et al. (2010) propose two permutation strategies: the first one, which they call ‘ipz’ (section 2.1.1), permutes group membership, and the second, which they call ‘ipt’ (section 2.1.2), permutes survival times. These permutation strategies were proposed in the context of logrank tests, but can equally be applied to other statistics, such as wHSIC. The first strategy, which permutes the covariates, is referred to in their work as ‘ipz’ since the procedure first imputes several unobserved times, and then permutes the covariate, which in their work is denoted by $z$. We refer to it as ‘ipx’, as our covariate is denoted by $x$. The algorithm uses the Kaplan–Meier estimator to estimate three distributions: 1) $G^{0}$, the censoring distribution in group 0, based on the data observed in group 0; 2) $G^{1}$, the censoring distribution in group 1, based on the data observed in group 1; 3) the distribution of the lifetimes $F$, based on the pooled dataset containing both groups. With these estimates, a new dataset is constructed, consisting of $n$ observations, each consisting of a covariate, an event time, and two censoring times, one for each censoring distribution. This larger dataset is then permuted, and transformed back to a censored dataset. Wang et al. (2010) describe the algorithm in full detail. This method thus combines the wHSIC statistic with an alternative permutation strategy.
Because this method relies on explicitly estimating censoring distributions in each group, it is difficult to extend this to the continuous case, where for each covariate we only have one individual in the study with that exact covariate.

A.11.3 Numerical comparison of methods in the two-sample case

We generate data from four different distributions for each of $X$, $T$, and $C$ to compare the power and type 1 error of the proposed methods optHSIC, wHSIC, and ipxHSIC to those of the classic logrank test and a weighted logrank test proposed by Ditzhaus and Friedrich (2020). The classical logrank test is known to have low power against certain alternatives, such as crossing survival curves. A weighted logrank test assigns weights to data, giving the logrank test power against different alternatives. In Ditzhaus and Friedrich (2020) a combination of weights is proposed, so as to achieve power against a wider class of alternatives. In particular, Ditzhaus and Friedrich (2020) propose a combination of two sets of weights, corresponding to proportional and crossing hazards. As this section mostly serves to provide an example of our methods, we simulate fewer scenarios than in Section 7. In each scenario we let the sample size range from $n=20$ to $n=400$ in steps of $20$. To obtain $p$-values in the three HSIC-based methods as well as the weighted logrank test we use a permutation test with 1999 permutations. We reject the null hypothesis if the obtained $p$-value is less than $0.05$.

A.12 Example of data with binary covariates in which optHSIC does not perform well

Consider the following case. Group $X=0$ contains 1050 individuals. Group $X=1$ contains 50 individuals. Up to time $t=50$ , no events occur. At time $t=50$ , 1000 individuals of group $X=0$ are censored. There are now 50 individuals remaining in each of the groups. The $50$ individuals of group $X=0$ have event time $100+\text{Exp}(\text{mean}=2)$ and the $50$ individuals of group $X=1$ have event time $100+\text{Exp}(\text{mean}=1)$ . In this example we find the logrank test to have power of $89\%$ and optHSIC to have power of only $12\%$ .
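For concreteness, the dataset described above can be generated as follows. This is a minimal sketch: the variable names are hypothetical, and it assumes that the 100 individuals remaining after $t=50$ are all observed without further censoring, which the description suggests but does not state explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

n0_censored = 1000   # group X=0, censored at t=50
n0_rest = 50         # group X=0, events at 100 + Exp(mean=2)
n1 = 50              # group X=1, events at 100 + Exp(mean=1)

# Covariate: 1050 individuals in group 0 followed by 50 in group 1.
x = np.concatenate([np.zeros(n0_censored + n0_rest, dtype=int),
                    np.ones(n1, dtype=int)])

# Observed times: censoring at 50, then shifted exponential event times.
# In NumPy, `scale` is the mean of the exponential distribution.
z = np.concatenate([
    np.full(n0_censored, 50.0),
    100.0 + rng.exponential(scale=2.0, size=n0_rest),
    100.0 + rng.exponential(scale=1.0, size=n1),
])

# Event indicator: 0 for the censored individuals, 1 otherwise (assumed).
delta = np.concatenate([np.zeros(n0_censored, dtype=int),
                        np.ones(n0_rest + n1, dtype=int)])

censoring_rate = 1.0 - delta.mean()
print(f"n = {x.size}, censored fraction = {censoring_rate:.3f}")
```

The censored fraction is $1000/1100\approx 0.909$, so roughly $91\%$ of all individuals are censored, and all of them belong to group $X=0$.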

What happens is the following: At time $t=100$ there are 100 individuals at risk. The individuals of group $X=1$ are likely to have their event first, due to the higher rate in the corresponding exponential distribution. Because in group $X=0$ 1000 individuals have been censored, the optimal transport map has a high chance of choosing $\tilde{x}=0$ when $x_{i}=1$ . So while in the resulting dataset a slight bias will remain towards individuals in group $X=1$ having their event first, this bias is much less clear than before the transformation. (We thank a reviewer for proposing this scenario.)

There are several characteristics that make the difference in this example so large. Firstly, as mentioned before, optimal transport relies on the ability to choose a ‘similar covariate’. When covariates are binary it may happen that $\tilde{x}=0$ while $x_{i}=1$ . Secondly, in this case all the censoring happens in group $X=0$ , causing optimal transport to send mass from group $X=1$ to group $X=0$ . Furthermore, the censoring rate is high ( $91\%$ of all individuals). Lastly, before the censoring occurs there is no evidence of a difference in distribution.

A.12.1 Comments on two-sample simulations

The results show that the logrank test has little power in scenarios 2 and 3, and that the weighted logrank test has little power in scenario 3, even though large differences between the samples are present. The logrank test is designed to detect differences as in scenario 1, and the weighted logrank test is designed to detect differences as in scenarios 1 and 2, sacrificing a little power in scenario 1 compared to the logrank test. Scenario 3 is designed to defeat the weighted logrank test, since we constructed an extreme version of an early crossing survival curve, and the test does not contain weights for early crossing. The kernel methods are fully nonparametric, but do lose power in certain scenarios, most notably in scenario 2 and in the example provided in Section A.12. We believe optHSIC is not ideally suited to the case of binary covariates, since optimal transport relies on choosing a ‘similar’ covariate. Furthermore, while there are no fully nonparametric alternatives for independence testing with continuous covariates, there are more alternative two-sample tests. We thus believe the main value of optHSIC lies in the case of continuous covariates.

Bibliography

1. Schölkopf, B. and A. Smola (2001): Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Massachusetts: MIT Press, 1st edition.
2. Berrett, T. B. and R. J. Samworth (2019): “Nonparametric independence testing via mutual information,” Biometrika, 106, 547–566.
3. Cohen, S. (1999): Finding color and shape patterns in images, 1620, Stanford University, Department of Computer Science.
4. Cox, D. R. (1972): “Regression models and life-tables,” Journal of the Royal Statistical Society: Series B (Methodological), 34.2, 187–202.
5. Ditzhaus, M. and S. Friedrich (2020): “More powerful logrank permutation tests for two-sample survival data,” Journal of Statistical Computation and Simulation, 90, 2209–2227, URL https://doi.org/10.1080/00949655.2020.1773463.
6. Gretton, A., K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2012): “A kernel two-sample test,” Journal of Machine Learning Research, 12, 723–773.
7. Gretton, A., K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola (2008): “A kernel statistical test of independence,” Advances in Neural Information Processing Systems, 585–592.
8. Lip, S., L. E. Tan, P. Jeemon, L. Mc Callum, A. F. Dominiczak, and S. Padmanabhan (2019): “Diastolic blood pressure j-curve phenomenon in a tertiary-care hypertension clinic,” Hypertension, 74, 767–775.
