TL;DR
This paper introduces optHSIC, a nonparametric independence test for censored lifetimes using optimal transport to handle censoring, enabling more flexible detection of dependencies than traditional methods.
Contribution
The paper presents a novel kernel-based independence test that effectively manages right-censored data through optimal transport, extending the applicability of dependence testing in survival analysis.
Findings
optHSIC controls type 1 error under independent censoring
It has greater power than Cox regression against various alternatives
Effective even when censoring depends on covariates
Abstract
We propose a nonparametric test of independence, termed optHSIC, between a covariate and a right-censored lifetime. Because the presence of censoring creates a challenge in applying the standard permutation-based testing approaches, we use optimal transport to transform the censored dataset into an uncensored one, while preserving the relevant dependencies. We then apply a permutation test using the kernel-based dependence measure as a statistic to the transformed dataset. The type 1 error is proven to be correct in the case where censoring is independent of the covariate. Experiments indicate that optHSIC has power against a much wider class of alternatives than Cox proportional hazards regression and that it has the correct type 1 control even in the challenging cases where censoring strongly depends on the covariate.
| D. | X | |||
|---|---|---|---|---|
| 1 | ||||
| 2 | ||||
| 3 | ||||
| 4 | ||||
| 5 | ||||
| 6 | ||||
| 7 | ||||
| 8 |
| D.1 | 0.047 | 0.053 | 0.051 | 0.050 | 0.054 | 0.051 | 0.045 | 0.056 | 0.049 | 0.048 |
| D.2 | 0.049 | 0.051 | 0.057 | 0.044 | 0.053 | 0.050 | 0.047 | 0.049 | 0.052 | 0.048 |
| D.3 | 0.052 | 0.052 | 0.048 | 0.049 | 0.044 | 0.050 | 0.048 | 0.054 | 0.054 | 0.048 |
| D.4 | 0.050 | 0.054 | 0.049 | 0.054 | 0.054 | 0.054 | 0.050 | 0.053 | 0.050 | 0.046 |
| D.5 | 0.050 | 0.056 | 0.056 | 0.051 | 0.053 | 0.051 | 0.046 | 0.055 | 0.050 | 0.049 |
| D.6 | 0.050 | 0.049 | 0.048 | 0.047 | 0.048 | 0.057 | 0.050 | 0.050 | 0.048 | 0.050 |
| D.7 | 0.050 | 0.055 | 0.047 | 0.048 | 0.056 | 0.054 | 0.054 | 0.052 | 0.051 | 0.048 |
| D.8 | 0.041 | 0.050 | 0.051 | 0.048 | 0.056 | 0.049 | 0.054 | 0.050 | 0.054 | 0.052 |
| D. | X | ||
|---|---|---|---|
| 1 | |||
| 2 | |||
| 3 | |||
| 4 | |||
| 5 | |||
| 6 | |||
| 7 | |||
| 8 |
| 1 | 13 | 1 | |
|---|---|---|---|
| 2 | 22 | 0 | |
| 3 | 24 | 1 | |
| 4 | 45 | 1 | |
| 5 | 81 | 0 |
| D.1 | 0.048 | 0.050 | 0.051 | 0.052 | 0.052 | 0.049 | 0.047 | 0.054 | 0.051 | 0.049 |
| D.2 | 0.048 | 0.051 | 0.047 | 0.047 | 0.049 | 0.051 | 0.048 | 0.054 | 0.050 | 0.047 |
| D.3 | 0.243 | 0.461 | 0.630 | 0.774 | 0.860 | 0.909 | 0.953 | 0.970 | 0.985 | 0.991 |
| D.4 | 0.142 | 0.232 | 0.343 | 0.487 | 0.610 | 0.734 | 0.812 | 0.880 | 0.932 | 0.959 |
| D.5 | 0.075 | 0.116 | 0.142 | 0.168 | 0.210 | 0.243 | 0.278 | 0.316 | 0.357 | 0.397 |
| D.6 | 0.932 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| D.7 | 0.308 | 0.594 | 0.761 | 0.856 | 0.906 | 0.937 | 0.960 | 0.971 | 0.980 | 0.984 |
| D.8 | 0.078 | 0.122 | 0.152 | 0.211 | 0.264 | 0.307 | 0.355 | 0.388 | 0.439 | 0.467 |
| D.1 | 0.045 | 0.047 | 0.049 | 0.050 | 0.049 | 0.051 | 0.049 | 0.054 | 0.048 | 0.049 |
| D.2 | 0.050 | 0.047 | 0.048 | 0.047 | 0.050 | 0.048 | 0.047 | 0.045 | 0.048 | 0.049 |
| D.3 | 0.079 | 0.166 | 0.235 | 0.326 | 0.410 | 0.466 | 0.540 | 0.597 | 0.658 | 0.700 |
| D.4 | 0.161 | 0.204 | 0.232 | 0.270 | 0.309 | 0.350 | 0.394 | 0.456 | 0.506 | 0.549 |
| D.5 | 0.057 | 0.084 | 0.110 | 0.131 | 0.163 | 0.191 | 0.216 | 0.258 | 0.274 | 0.299 |
| D.6 | 0.267 | 0.672 | 0.900 | 0.971 | 0.990 | 0.998 | 0.999 | 0.999 | 1.000 | 1.000 |
| D.7 | 0.071 | 0.107 | 0.143 | 0.192 | 0.260 | 0.305 | 0.369 | 0.412 | 0.480 | 0.527 |
| D.8 | 0.057 | 0.078 | 0.081 | 0.095 | 0.120 | 0.142 | 0.153 | 0.162 | 0.177 | 0.190 |
| D.1 | 0.056 | 0.054 | 0.050 | 0.047 | 0.055 | 0.048 | 0.048 | 0.055 | 0.050 | 0.051 |
| D.2 | 0.055 | 0.056 | 0.055 | 0.050 | 0.055 | 0.051 | 0.049 | 0.050 | 0.052 | 0.050 |
| D.3 | 0.057 | 0.056 | 0.051 | 0.053 | 0.054 | 0.051 | 0.046 | 0.057 | 0.048 | 0.050 |
| D.4 | 0.057 | 0.061 | 0.050 | 0.051 | 0.054 | 0.058 | 0.049 | 0.053 | 0.051 | 0.051 |
| D.5 | 0.058 | 0.058 | 0.055 | 0.052 | 0.053 | 0.053 | 0.048 | 0.054 | 0.048 | 0.049 |
| D.6 | 0.053 | 0.051 | 0.051 | 0.050 | 0.046 | 0.048 | 0.057 | 0.053 | 0.054 | 0.055 |
| D.7 | 0.148 | 0.094 | 0.084 | 0.062 | 0.063 | 0.061 | 0.058 | 0.055 | 0.060 | 0.058 |
| D.8 | 0.143 | 0.086 | 0.074 | 0.064 | 0.067 | 0.058 | 0.059 | 0.058 | 0.061 | 0.060 |
| D. | X | ||
|---|---|---|---|
| 1 | |||
| 2 | |||
| 3 | |||
| 4 | |||
| 5 | |||
| 6 | |||
| 7 | |||
| 8 |
| D.1 | Cph | 0.243 | 0.422 | 0.565 | 0.705 | 0.753 |
| optHSIC | 0.229 | 0.382 | 0.501 | 0.634 | 0.699 | |
| wHSIC | 0.038 | 0.062 | 0.182 | 0.395 | 0.703 | |
| zHSIC | 0.066 | 0.168 | 0.329 | 0.525 | 0.701 | |
| D.2 | Cph | 0.180 | 0.267 | 0.268 | 0.225 | 0.108 |
| optHSIC | 0.087 | 0.171 | 0.258 | 0.378 | 0.686 | |
| D.3 | Cph | 0.073 | 0.056 | 0.107 | 0.223 | 0.288 |
| optHSIC | 0.242 | 0.177 | 0.399 | 0.886 | 0.968 | |
| D.4 | Cph | 0.187 | 0.091 | 0.064 | 0.046 | 0.039 |
| optHSIC | 0.346 | 0.224 | 0.275 | 0.509 | 0.779 | |
| wHSIC | 0.138 | 0.285 | 0.452 | 0.654 | 0.770 | |
| zHSIC | 0.105 | 0.172 | 0.274 | 0.410 | 0.759 | |
| D.5 | Cph | 0.315 | 0.487 | 0.610 | 0.705 | 0.836 |
| optHSIC | 0.268 | 0.439 | 0.546 | 0.629 | 0.775 | |
| wHSIC | 0.055 | 0.072 | 0.169 | 0.409 | 0.786 | |
| zHSIC | 0.083 | 0.229 | 0.362 | 0.605 | 0.760 | |
| D.6 | Cph | 0.461 | 0.732 | 0.834 | 0.916 | 0.939 |
| optHSIC | 0.396 | 0.681 | 0.801 | 0.876 | 0.952 | |
| D.7 | Cph | 0.055 | 0.068 | 0.077 | 0.078 | 0.107 |
| optHSIC | 0.043 | 0.100 | 0.134 | 0.289 | 0.669 | |
| D.8 | Cph | 0.162 | 0.313 | 0.431 | 0.517 | 0.572 |
| optHSIC | 0.164 | 0.335 | 0.498 | 0.619 | 0.916 |
| D. | % Observed | ||||
|---|---|---|---|---|---|
| 1 | 60 % | ||||
| 2 | 60 % | ||||
| 3 | (0.43, 1.39+ | 90 % | |||
| 4 | None | 65 % |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
**A kernel- and optimal transport- based test of independence between covariates and right-censored lifetimes
**
D. Rindt, D. Sejdinovic and D. Steinsaltz.
Department of Statistics, University of Oxford, UK
1 Abstract
We propose a nonparametric test of independence, termed optHSIC, between a covariate and a right-censored lifetime. Because the presence of censoring creates a challenge in applying the standard permutation-based testing approaches, we use optimal transport to transform the censored dataset into an uncensored one, while preserving the relevant dependencies. We then apply a permutation test using the kernel-based dependence measure as a statistic to the transformed dataset. The type 1 error is proven to be correct in the case where censoring is independent of the covariate. Experiments indicate that optHSIC has power against a much wider class of alternatives than Cox proportional hazards regression and that it has the correct type 1 control even in the challenging cases where censoring strongly depends on the covariate.
2 Introduction
We propose a nonparametric test of independence between a possibly multidimensional covariate and a right-censored lifetime. Existing approaches to this problem suffer from several limitations: if we cluster the continuous covariate into groups, and then test for equality of lifetime distributions among the groups, the results will depend on the arbitrary choice of boundaries between the groups, while the spread of covariates within groups reduces power. Alternatively, one might fit a (semi-)parametric regression model, and test whether the regression coefficient corresponding to the covariate differs significantly from zero. The most commonly used such method is the Cox proportional hazards (CPH) model, which makes two assumptions (Cox (1972)): first, the hazard function must factorize into a function of time and a function of the covariate (the proportional hazards or relative risk condition); second, the effect of a covariate on the logarithm of the hazard function must be linear. Although this is a flexible model, in some cases these assumptions are violated. More complicated hazards are found for example when studying the relationship between body mass index and mortality - Zajacova and Burgard (2012) reports - or -shaped hazards - or between diastolic blood pressure and various health outcomes (Lip et al. (2019)).
Since distance- and kernel-based approaches have been used successfully for independence testing on uncensored data (Székely and Rizzo (2009), Gretton et al. (2008)), it is natural to investigate whether these methods can be extended to the case of right-censored lifetimes. To this end we propose applying optimal transport to transform the censored dataset into an uncensored dataset in such a way that, 1) the new uncensored dataset preserves the dependencies of the original dataset, and 2) we can apply a standard permutation test to the new dataset with test statistic given by Distance Covariance (DCOV) (Székely and Rizzo (2009)) or, equivalently, the Hilbert–Schmidt Independence Criterion (HSIC) (Gretton et al. (2008)).
Progress in kernel-based independence testing for censored data is further motivated by the fact that in the simpler context of uncensored data the corresponding methods have been further developed into tests of conditional independence, mutual independence, and have been applied to causal inference (Zhang et al. (2011), Pfister et al. (2018)), and in particular detection of confounders. While the present work does not propose methods for testing conditional or mutual independence, we believe independence testing is a first step towards those ends. Additionally, since our method allows for multidimensional covariates, one can first test for a dependency based on the full multidimensional covariate, and then test whether the dependency remains when certain sub-dimensions are omitted from the covariate.
Section 3 overviews relevant concepts in survival analysis, distance- and kernel-based independence testing, and optimal transport. Section 4 proposes a transformation of the data based on optimal transport. Section 5 introduces our testing procedure named optHSIC. Although we have not yet been able to prove control of the type 1 error rate in full generality, we do show the type 1 error rate to be correct in the case where censoring is independent of the covariate. Furthermore we obtain very promising results in simulation studies, showing correct type 1 error control even under censoring that depends strongly on the covariate. Section 6 explores alternative kernel-based approaches under the additional assumption that censoring is independent of the covariate. These methods serve as benchmarks for the power performance of optHSIC. Section 7 compares the power and type 1 error of all tests and CPH regression in simulated data.
3 Background Material
3.1 Right-Censored Lifetimes
Let be a lifetime subject to right-censoring, so that we do not observe directly, but instead observe for some censoring time , as well as the indicator . We further observe a covariate vector , where is equipped with the Borel sigma algebra. In total, for an i.i.d. sample of size , we thus obtain the data .
The main goal of this paper is developing a test of H_{0}:X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T versus H_{1}:X\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T based on the sample . Throughout this paper we will make the following assumptions.
Assumption 1: We assume that conditional on , the random variables are mutually independent.
Assumption 2: We assume that C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T|X.
Denote and . We assume has a density . Let . We define the hazard rate of an individual with covariate to be The Cox proportional hazards model (CPH) assumes that the hazard rate can be written as for some baseline hazard and a vector . This model enables estimation of and testing the significance of the difference of entries of from zero. The CPH model is the most commonly used regression method in survival analysis.
A last important concept is that of the ‘individuals at risk at a time ’. By this we mean the set . We also use the notation for a multiset (a set with potentially repeated elements) with elements . We refer, for example, to the multiset , the multiset of ‘covariates at risk at time ’.
3.2 Independence testing using kernels
Kernel methods have been successfully used for nonparametric independence- and two-sample testing (Gretton et al. (2008), Gretton et al. (2012)). We now give some of the relevant background in kernel methods.
Definition 3.1**.**
(Reproducing Kernel Hilbert Space)(B. Schölkopf (2001)) Let be a non-empty set and a Hilbert space of functions endowed with dot product . Then is called a reproducing kernel Hilbert (RKHS) space if there exists a function with the following properties.
* satisfies the reproducing property*
[TABLE] 2. 2.
* spans , that is, where the bar denotes the completion of the space.*
Let together with a sigma-algebra be a measurable space and let be an RKHS on with reproducing kernel . Let be a probability measure on . If , then there exists an element such that for all (Gretton et al. (2012)), where the notation is defined to be . The element is called the mean embedding of in . Given a second distribution on , for which a mean embedding exists, we can measure the dissimilarity of and by the distance between their mean embeddings in :
[TABLE]
which is also called the Maximum Mean Discrepancy (). The name comes from the following equality (Gretton et al. (2012)),
[TABLE]
showing that MMD is an integral probability metric. Given a sample and the empirical distribution, , where denotes the Dirac measure at , the corresponding mean embedding is given by
Suppose now that together with some sigma algebra is a second measurable space, and let be an RKHS on with kernel . Let be a random variable in with law and similarly let be a random variable in with law . Finally let denote the joint distribution on equipped with the product sigma-algebra. We let denote the RKHS on with kernel
[TABLE]
In Gretton et al. (2008) it was proposed that the dependence of and could be quantified by the following measure:
Definition 3.2**.**
The Hilbert–Schmidt independence criterion (HSIC) of and is defined by
[TABLE]
where denotes the product measure of and .
Let , and be three mutually independent copies of the same random variable with law . Using the reproducing property and the definition of mean embeddings, it can be shown that
[TABLE]
Now assume we are given a sample of independent observations of the random pair . An empirical estimate of HSIC can be obtained by measuring the distance between the embedding of the empirical distribution of the data and the embedding of the product of the marginal empirical distributions. That is, we define by
[TABLE]
Using the reproducing property of the kernel and the definition of in terms of and , can be shown to equal
[TABLE]
While the biased defined above is the most commonly used estimator in the literature (Székely and Rizzo (2009), Gretton et al. (2008)), unbiased estimators of exist too (Song et al. (2012)). The bias of is , (Gretton et al. (2008)) and for appropriate choices of kernels, the permutation test with the biased statistic and the permutation test with the unbiased statistic are consistent tests - see section 3.2.1 for details. Both tests also have correct type 1 error rate. Because the biased is more commonly encountered in the literature and has a slightly easier analytic form, we use throughout our paper. The following algorithm shows how is commonly combined with a permutation test for independence testing.
3.2.1 Choice of kernel
Throughout this paper we assume the covariates take values in the Euclidean space for some . Note that in our case . We let both and be instances of the covariance kernel of Brownian motion. That is,
[TABLE]
and
[TABLE]
where denotes the Euclidean norm. See Sejdinovic et al. (2012) for a discussion of this kernel.
The reason for this choice is three-fold. Firstly, under this choice of kernels, coincides with Distance Covariance (DCOV) (Székely and Rizzo (2009)). The equivalence between and DCOV was proved in Sejdinovic et al. (2012). DCOV is a well studied measure of dependence and if are random variables with compact support, then if and only if X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}Y. Furthermore, the permutation test with test statistic DCOV is consistent in the sense that, for each distribution with compact support and such that X\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}Y, the probability that the permutation test rejects the null hypothesis converges to 1, as the sample size converges to infinity (Rindt et al. (2020)). Secondly, unlike other potential kernels, such as the Gaussian kernel, the covariance kernel of Brownian motion spares us the need to select a bandwidth. Thirdly, our test relies on optimal transport to transform the original data into a transformed dataset. The optimal transport procedure, discussed in the next section, requires a metric with respect to which similarity is preserved. It appears natural to measure the dependency in the transformed dataset using the same metric as the one underlying the transformation.
3.3 Optimal transport
Let and be multisets of length and with . Let be a distribution vector on , by which we mean: so that and . Let be a distribution vector on . Let be a metric on . Then the earth mover’s distance (EMD) between and is defined as
[TABLE]
subject to
[TABLE]
Let be a random variable taking values in with distribution , and a random variable taking values in with distribution . In the next section we will write ‘let be the joint distribution that solves the optimal transport problem between and ’. By that we mean for , where is the matrix that minimizes Equation (2). Note that is a valid joint distribution of and . Furthermore, for any with it holds that .
4 Data transformation based on optimal transport
4.1 Objective of the algorithm
To use HSIC for an independence test of and , one requires the explicit values for . Since for right-censored data may be known only in terms of a lower bound, there is no straightforward way to apply these methods to right-censored data. In this section we propose a transformation of the original, censored dataset, into a synthetic dataset consisting of observed events. The algorithm uses optimal transport and its goal is twofold: first, it should return a dataset to which we can apply a permutation test with test-statistic HSIC, and obtain correct -values under the null hypothesis; second, it should return a dataset in which the dependence between and is similar to the dependence in the original dataset. Indeed, the transformation of the data is of independent interest: we use the standard permutation test with test statistic HSIC/DCOV, but other statistics could be considered. We do believe that the main value of the transformation lies in how it can be straightforwardly combined with permutation testing on the transformed data and we expect it may be difficult to use the transformation for other purposes.
4.2 Definition of the transformation
To define the algorithm, we use the following notation. Let be a multiset where for . We define , to be the uniform distribution over the elements, where is the Dirac measure at . By we set to be the multiset that remains when one instance of is removed from , and by we set to be the multiset that consists of the elements of , with an added element . We further assume that we are given a sample so that : that is, we have distinct event times and they are labeled in increasing order. For a variable we use to change the value of to . In computing the optimal transport coupling as defined in Section 3.3, we use the Euclidean metric for . The proposed transformation is defined in Algorithm 2.
4.3 Comments on the transformation algorithm
We now give a verbal explanation of Algorithm 2.
Initialization: The input is simply where . We initialize an empty transformed dataset , to which we will add observations of the form for some in the following way. We will loop over the times from . At each time, the multiset lists the covariates at risk in the dataset and is initialized as the multiset containing all covariates. The multiset will list all covariates that have not been added to yet. Indeed, as is initialized empty, initially contains all covariates. The variable will count the number of observed events and is initialized 0.
First loop from : At time we distinguish two cases: if we leave and unchanged, and simply remove one instance of from . If we add to (to count the number of observed events). We also select a covariate from as follows: First a joint distribution of and is computed using optimal transport. This is a matrix of size . This distribution is then conditioned on the event that , yielding a distribution over . We sample from this distribution to obtain the covariate . The pair is then added to , and the covariate is removed from . There are now observations in and there are covariates left in . We also remove from .
When we finish the first loop: It is now the case that (note the previous loop went up to ) and the multiset contains covariates. In the second loop, that runs from to , each remaining covariate in is combined with the time and added to . After this loop, is empty, and contains all covariates , each associated with a time. The transformed dataset is returned.
Consider the special case in which there is no censoring and for . Then it is easy to verify that in the first loop at each step , so that the optimal transport algorithm returns a scaled identity matrix. Consequently, at step the covariate is equal to with probability 1, and the final output is . A nontrivial example is worked out in Section A.1.
4.4 Intuition behind the transformation
Before we prove properties of the proposed transformation, we briefly comment on the intuition behind the transformation. To this end, first consider a permutation test in the absence of censoring, when we simply observe . The permuted datasets can be generated as follows: loop through the events in order of time, and to each time, associate a covariate that you have not associated to any earlier time. Comparing the original dataset with the datasets obtained in this manner is justified for the following reason: Under the null hypothesis a sample can be generated by firstly sampling for i.i.d. and independently sampling for i.i.d., and secondly looping through the times in order, and associating to each time a covariate that has not yet been chosen at a previous time. The original dataset and the permuted datasets are thus equal in distribution: intuitively, the permutation test checks whether the dataset looks as if, at each time, a covariate is picked uniformly from those not chosen before.
It is not obvious how to translate this to censored data. Due to censoring, it may not be true that the -th event covariate is chosen uniformly from the covariates that have not had an observed event time before. For this reason survival analysis often compares the –th event covariate with the covariates at risk (not failed and not censored) just before the –th event. If the null hypothesis holds, then intuitively it holds that the -th event covariate is chosen uniformly from the covariates at risk at time . That is, if denotes the multiset of covariates at risk and is the event covariate, we would like to test if was chosen uniformly from .
To do so using a permutation test, our algorithm couples to a uniform choice from those that have not yet been chosen in the synthetic dataset: (see Algorithm 2). In this manner, when the null hypothesis is true, (intuitively) it holds that in the synthetic dataset the covariates are chosen uniformly from those not chosen before. But this last statement is exactly our intuition behind a permutation test: namely, we use a permutation test to see if the dataset looks as if, at each time, a covariate is picked uniformly from those not chosen before.
The intuition discussed thus far related to the workings of our transformation under the null hypothesis, and had at its core that is coupled to in Algorithm 2. However there are many couplings between these two distributions. The reason we choose the coupling that solves the optimal transport problem is the following: assume the alternative hypothesis holds and that given , certain covariates have a higher hazard rate than others. Then we would like this bias towards these covariates to be visible in . In other words we would like to be close to in Algorithm 2. That is precisely what the optimal transport coupling achieves.
Finally, including the remaining covariates in the transformed dataset at the last time ensures the permuted datasets correctly reflect all alternative choices that can be made when covariates are chosen at random from those not chosen before. The association of these remaining covariates to the last event time reflects they have not been selected at each of the earlier times, which may indicate they have had less risk of having an event up to that point.
The goal of the transformation is thus twofold: firstly, ensuring that under the null hypothesis a permutation test is appropriate on the transformed dataset, which motivates the coupling of and ; secondly, given the first goal, ensure covariates in associated to times for are as close as possible to the covariates associated to in , which motivates the chosen coupling to be the coupling that solves the optimal transport problem defined in Section 3.3.
An interesting alternative approach would be to compare with directly, without first transforming the data. We do not see, however, how to combine that approach with a permutation test, or with another procedure to test for significance.
5 Applying HSIC to the transformed dataset: optHSIC
We have thus far described how to transform the dataset. For the hypothesis test of H_{0}:X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T, we propose to apply a permutation test with test statistic DCOV/HSIC to the transformed dataset. This approach is summarized in Algorithm 3.
5.1 Computational cost of optHSIC
The algorithm optHSIC consists of two parts: First, the dataset is transformed. Second, a permutation test with test statistic HSIC/DCOV is performed. The computational cost of the transformation may be high: the solution of the earth mover’s distance has complexity (Shirdhonkar and Jacobs (2008)). Since the earth mover’s distance is computed for each such that , complexity of the transformation is excessively high. Luckily, fast approximations of the earth mover’s distance exist that are linear in time (e.g. Shirdhonkar and Jacobs (2008)), making the transformation . Also in the case where the covariates are 1-dimensional, the optimal transport problem has a simple solution (Cohen (1999)). Different algorithms and/or approximations also exist for the EMD with other metrics (Shirdhonkar and Jacobs (2008)). The second step, computation of HSIC is also an -operation for which large-scale approximations can be made (Zhang et al. (2018)). Furthermore, both the computation of HSIC and the permutation test computations can be easily parallelized. In our simulations we did not use approximations of the earth mover’s distance nor of HSIC, as for samples up to about 500 the optHSIC test can be performed in about a second on an ordinary PC.
5.2 Theoretical results on optHSIC
We say a test with -value has correct type 1 error rate if under the null hypothesis . The main theoretical result we obtained for optHSIC is that the type 1 error rate is correct when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. Unfortunately, we have not been able to prove other important results such as (asymptotically) correct type 1 error rate when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, or consistency (power converging to 1 for each alternative hypothesis). However, as Section 7 details, extensive simulations demonstrate that optHSIC achieves correct type 1 error rate also when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. Simulations further show that optHSIC is able to detect a wide range of dependencies between and without losing much power, relative to the CPH likelihood ratio test, even when the CPH model assumptions holds. Obtaining further theoretical results is an important future challenge. In the uncensored case a permutation test with covariance kernel of Brownian motion is consistent for distributions with compact support and has correct type 1 error rate (Rindt et al. (2020)). The difficulty of extending these proofs to optHSIC is that optHSIC requires sequential analysis of the optimal transport distributions, breaking independence between observations in the transformed dataset.
We first prove an auxiliary result: namely, although we propose to permute the transformed dataset, this is equivalent to permuting the original dataset, and then transforming the permuted datasets.
Lemma 5.1**.**
Let be independent uniform random permutations, and let be independent optHSIC transformations.
[TABLE]
Proof.
See Appendix. ∎
Note that this implies that in the definition of optHSIC we could have equally permuted first, and then applied the transformation to each of the permuted datasets. However this is computationally more expensive. Lemma 5.1 enables us to show that optHSIC has correct type 1 error rate when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X.
Theorem 5.1**.**
Let be an i.i.d sample. Assume that C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. Let be the -value resulting from applying optHSIC (Algorithm 3) to with permutations and level . If the null hypothesis holds, i.e. T\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, then and in particular it holds that
[TABLE]
Proof.
See Appendix. ∎
Table 1 summarizes our findings of the type 1 error of optHSIC.
5.3 Using multiple transformations
Algorithm 2 defines the transformation used in optHSIC. In line 7, a covariate is sampled from the multiset with distribution . The sampled element may differ for different iterations of the algorithm. Consequently, given a fixed dataset , the transformed dataset will look different for different iterations of the algorithm. The optHSIC algorithm proposes to use only one transformation, resulting in a single -value.
Say that instead of applying the optHSIC algorithm once, we repeat the optHSIC algorithm times, resulting in -values . An example of the distribution of -values for a fixed dataset is given in Figure 5.
Figure 5 shows that in the used dataset, where of observations is censored and , the variance in the obtained -values is not excessively high, and in particular all -values would have led to the decision to reject the null-hypothesis.
We also studied several techniques one may use to combine -values, typically at the cost of being more conservative when there is little variance among -values, like in the example above. This is discussed in Section A.9.
6 Alternative approaches when censoring is independent of the covariate
We are not aware of fully nonparametric methods to test independence between right-censored times and continuous covariates. When is independent of the challenge is mitigated, since in that case any statistic can be combined with a permutation test that permutes the covariates (see Theorem 6.2). Using that approach, this Section proposes some alternative tests in the case when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X that will serve as benchmarks for studying power performance of optHSIC in addition to the Cox proportional hazards likelihood ratio test. In Section 7 we compare the performance of these methods with optHSIC.
Before we propose alternative test statistics, we state formally why standard permutation tests can be used when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. We first state a Theorem on the correctness of the type 1 error rate of permutation tests without censoring. We include the proof following Berrett and Samworth (2019) in the Appendix for completeness.
Theorem 6.1**.**
Let be an i.i.d. sample of size from distribution on . Denote . Let be permutations sampled uniformly and independently from . Let be any statistic of the data. Let be the rank of the first coordinate in the vector:
[TABLE]
when ties are broken at random and where is the rank of the largest element and the rank of the smallest element. Define . Then under the null hypothesis H_{0}:X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T it holds that: . So in particular for
[TABLE]
Proof.
See Appendix. ∎
In survival analysis we apply the above, setting , where and .
Theorem 6.2**.**
Assume C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and T\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}C|X. Write , and let be sampled i.i.d. uniformly from . Write . Let be any statistic of the data. Let be the rank of the first coordinate in the vector:
[TABLE]
where ties are broken at random and is the rank of the largest element and is the rank of the smallest element. Define . Then under the null hypothesis H_{0}:X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T it holds that . So in particular for
[TABLE]
Proof.
See Appendix. ∎
Having shown that we can use a permutation test on any test statistic when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, the next two sections explore meaningful measures of dependency on right-censored data. Indeed such methods may lead to type 1 errors when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X.
6.1 wHSIC
HSIC relies on estimating the mean embedding of the joint distribution . The estimated embedding is then compared to the embedding of the product of marginal distributions. In the uncensored case the empirical distribution equals which has corresponding mean embedding .
Since we do not observe the ’s we could consider replacing the empirical distribution by a weighted version where we try to find weights such that for every measurable function such that the expectation exists. A natural idea is to give an observed point a weight of zero if it is censored, and a weight equal to the inverse probability of being uncensored otherwise. This can be motivated by the following lemma, that applies also if C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. We first define
[TABLE]
The following lemma proposes a weight function in terms of .
Lemma 6.1**.**
Let be an - measurable function such that and define by
[TABLE]
Then
Proof.
See Appendix. ∎
We would thus like to use weights and As we will be working under the assumption C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X we may estimate using Kaplan–Meier weights, which we define now. Assume that there are no ties in the data and that for . Then the Kaplan–Meier survival curve is given by:
[TABLE]
Kaplan–Meier weights are then defined by
[TABLE]
One can also estimate the survival probability of the censoring time using Kaplan–Meier weights (simply replace by in the above formula). The following lemma shows that the weights defined above, correspond up to a constant of , to the inverse of the estimated probability of being uncensored.
Lemma 6.2**.**
Let denote the Kaplan–Meier estimate of the survival probability of the censoring distribution. Then:
[TABLE]
Proof.
See Appendix. ∎
We use these weights to define the statistic wHSIC as the RKHS distance between the embedding of and the embedding of the product of the marginals, That is:
[TABLE]
Here K\bigl{(}(x,y),(x^{\prime},y^{\prime})\bigr{)}=k(x,x^{\prime})l(y,y^{\prime}). The following theorem shows how to compute efficiently.
Theorem 6.3**.**
(Computation of wHSIC) Given a dataset with a weight vector ,
[TABLE]
where and and , where , a diagonal matrix with .
Proof.
See Appendix. ∎
It is not difficult to see wHSIC has the same computational time as (uncensored) HSIC.
6.2 zHSIC
If C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, then any dependence between and must be due to dependence between and . Hence we may estimate to measure the strength of the dependence. This test is, indeed, expected to yield false rejections if C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X fails to hold, and to lack power when a large portion of the events is censored. We include this test mainly to see how the power of optHSIC and wHSIC compare with this approach.
7 Numerical evaluation of the methods
We generate data from various distributions of , and to compare the power and type 1 error rate of optHSIC, wHSIC, zHSIC and CPH. CPH stands for the Cox proportional hazards likelihood ratio test. In each scenario, we let the sample size range from to in intervals of . To obtain -values in the three HSIC-based methods we use a permutation test with 1999 permutations. We reject the null hypothesis if our obtained -value is less than . In the main paper, we present results of rejection rates under distributions that are chosen such that approximately of the observations are observed (). We investigate the rejection rates under varying censoring regimes in Section A.10.2 of the Appendix.
7.1 Type 1 error rate
We begin by investigating the type 1 error rate. The distributions in which we test the type 1 error rate are found in Table 2. In these distributions the null hypothesis holds, i.e. X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}T and we consider both cases where C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, as well as the case of multidimensional covariates. We estimate the rejection rates by sampling 5000 times from each distribution for each sample size and applying the different tests to the samples. The obtained rejection rates of optHSIC are found in Table 3. The type 1 error rate of the remaining methods is found in Section A.10.1. The most important finding is that optHSIC is found to have correct type 1 error rate of both when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. In contrast, wHSIC and zHSIC yield many false rejections when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, as expected, but have the correct type 1 error rate when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. Investigation of the -values of optHSIC furthermore showed that -values are distributed approximately according to under the null hypothesis, even when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X (see Figure 10 in Section A.10.1).
7.2 Comparison of power
To compare the power of the four methods we consider a number of distributions in which T\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. For each distribution we let the sample size range from to in steps of . At each sample size we take 1000 samples to estimate the rejection rate. In these distributions censoring is independent of the covariate such that the methods wHSIC and zHSIC do not have inflated rejection rate due to dependencies between and . Distributions with varying censoring rates and dependent censoring are investigated in Section A.10.2. Parameters are chosen so that that of the observations is uncensored ( has ), and rejection rates take a range of values (i.e. to exclude trivial distributions so that each method rejects with probability 1 for each sample size.).
7.2.1 Power for 1-dimensional covariates
In distributions 1-4 of Table 4 the covariate is 1-dimensional. Scatterplots of the samples and rejection rates are displayed in Figure 8. Note how in D.1 the CPH assumption holds, so the CPH method suits this example very well. We find however that the rejection rate of optHSIC is very similar to that of the CPH likelihood ratio test (first row, right of Figure 8). D.2 features a case in which hazard is highest in the middle, and lower for extreme values of the covariate. The CPH likelihood ratio test does not have power to detect this relationship, and optHSIC is the most powerful method. In D.3 and D.4 the hazard functions of different covariates cross each other. In this case, wHSIC is the top-performing method, but optHSIC is also able to detect the more complicated relationship, while CPH is not able to do so.
7.2.2 Power for multidimensional covariates
In D.5-8 of Table 4 the covariates are multidimensional. Figure 9 shows the rejection rates of the four methods, both as the dimension is fixed at , and the sample size increases (left column) and when the sample size is fixed at 200, and the dimension increases (right column). In D.5 and D.6 the CPH assumption holds. Again we see that the power of optHSIC is relatively close to the power of the CPH likelihood ratio test, which is specifically designed for these assumptions. This holds both when the dependence is on a single sub-dimension of the covariate (D.6) and when the dependence is on all covariates (D.5). In D.7 and D.8 a non-linear term is present in the hazard rate. Together with zHSIC, optHSIC is now the best performing method. Note that as dimension increases, the dependence in D.6-D.8 becomes harder to detect, whereas the dependence in D.5 becomes easier to detect since in the latter depends on all covariates.
We also investigated the power when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and as the censoring percentage varies from to . Distributions and rejection rates are presented in Section A.10.2. The main observations are that, although all methods lose power as the censoring rate increases, the relative performance of the four methods remains similar to the results presented in the main text.
We thus find that for continuous covariates optHSIC is able to detect a wider range of dependencies than the CPH likelihood ratio test, while not losing much power when the CPH assumptions hold. This is true both when the covariate is 1-dimensional and when the covariate is multidimensional, even when the dependence arises from a lower-dimensional subspace of the covariate. Importantly, unlike the methods wHSIC and zHSIC, the type 1 error rate of optHSIC is found to be of the correct level both when C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and when C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X.
7.3 Binary covariates
Lastly, we did experiments with binary covariates, in which case independence testing is equivalent to two-sample testing. For two-sample testing there are other alternative approaches, including Ditzhaus and Friedrich (2020) and Matabuena (2019). Furthermore, the wHSIC approach can be modified firstly to allow for the computation of weights within groups, and secondly a permutation test can be constructed that is valid also when censoring differs between groups. This permutation test was proposed by Wang et al. (2010). This special case of wHSIC coincides with the approach discussed in Matabuena (2019) for two-sample testing.
Our main finding is that optHSIC has a decent all-round performance for two-sample testing. However, especially when the distributions meet additional assumptions used in semi-parametric tests, these semi-parametric tests outperform optHSIC. We also note that, as optHSIC relies on choosing ‘similar covariates’ during the transformation of Algorithm 2, it is not ideally suited for the two-sample case, as there is a chance of choosing the opposite covariate.
We also provide a ‘failure mode’ of optHSIC (we thank a reviewer for this example): a scenario in which optHSIC performs much worse than the CPH test. In this example, covariates are binary (i.e. there are two groups), group sizes are very unequal, of observations is censored, censoring appears only in the large group, and survival in the two groups is identical until beyond the censoring has occurred. See A.12 for a detailed description.
A full presentation of the experiments and methods used for the case of binary covariates is given in Section A.11. We conclude that the main value of optHSIC lies in independence testing with continuous covariates.
8 Discussion
The main contribution of this paper is the proposal of optHSIC, combining a novel way of using optimal transport to transform right-censored datasets, with nonparametric permutation-based independence testing using HSIC/DCOV. We have shown optHSIC has power against a wider range of alternatives than the commonly used CPH model, while forfeiting little power when the CPH assumptions are satisfied exactly. Extensive numerical simulations suggest that the approach is well calibrated, yielding reliable -values even when censoring strongly depends on the covariate. Under the assumption that censoring does not depend on the covariate, we have proven correct type 1 error rate of optHSIC. Theoretical guarantees for the type 1 error under dependent censoring are a topic of future work. We furthermore proposed reweighting the original dataset, and measuring the distance between the resulting weighted mean embeddings. While these methods showed some promise, they do rely on the very strong assumption that and are independent. An interesting future challenge is to develop nonparametric tests for covariates and right-censored times for mutual independence testing and conditional independence testing, two methods used in causal inference, and which are relevant to many problems in biostatistics.
9 Supplementary materials and code
Supplementary materials contain a worked out example of the transformation proposed in Algorithm 2, proofs of all results and details on additional experiments. They also contain methods of combining -values from multiple transformation of the same dataset. Code to replicate the experiments and of the tests is available on www.github.com/davidrindt/opthsic.
10 Acknowledgment
We thank the reviewers and editor for their helpful comments.
Appendix A Appendix
A.1 Example of transformation
Let be the following dataset consisting of individuals:
We initialize to be an empty multiset and set and . We loop over the events .
At the first time it holds that . We compute the joint distribution that solves the optimal transport problem between and . Since it holds that , it follows that is the matrix:
[TABLE]
Conditioning on yields , corresponding to the first row of . Sampling from with distribution yields with probability 1. We update . We also replace and . We move to the next event time.
At we note that , so we only remove from and update , while leaving and unchanged.
At the third event it holds that and and . We couple a random variable and using optimal transport. The resulting distribution equals:
[TABLE]
We condition this distribution on . This corresponds to the first row of , and the resulting distribution over equals: . We now sample a point from this distribution and, suppose, it turns out to be , which has chance. We update . We also replace and before moving to the next event.
At it holds that and . We note and . We couple a random variable and using optimal transport. The resulting distribution equals:
[TABLE]
We condition on , resulting in . In this case our sample turns out to be , which happens with probability . Hence . We also replace and .
We have now finished the loop . Since and we add the two datapoints and to . The finalized transformed dataset equals
[TABLE]
A.2 Proof of Lemma 5.1
Let where is increasing, and assume for convenience there are no ties in . Denote by the number of observed events. We do not view as random in this section. Applying the optimal transport algorithm results in a random, transformed dataset, which we denote by . Note that the times and covariates in are not random, since they are determined by , but the way in which they are paired up in the transformation may be random. The same set of times and covariates is obtained in and for any . Denote the times in by and define a standard pairing ,. We will often use that are all permutations (possibly random) of . Finally, define , so that , which says that the -th observed event is the -th overall event. As a last piece of notation, we will use to denote a uniform random permutation, and to be a specific instance of a permutation. In particular we denote and . This corresponds to the covariates in the permuted dataset until just before the time of the -th observed event.
We prove the theorem by showing that the left- and right-hand sides of Lemma 5.1 are both equal in distribution to
[TABLE]
This is done in separate lemmas.
Lemma A.1**.**
[TABLE]
Proof.
By the above remarks we see that for some random permutation . (Note: The randomness in comes from the transformation , not from the dataset , which is fixed.) It suffices to show that
[TABLE]
This is easy to see by conditioning on . Let be arbitrary permutations. Then
[TABLE]
Since are independent uniform permutations, this is the same as
[TABLE]
∎
We now consider the effect of first permuting and then transforming the data.
Lemma A.2**.**
Let be a uniformly chosen permutation of and let be defined through optHSIC. It holds that
[TABLE]
Proof.
By the comments above, we can define a random permutation by . We wish to show that for all . To do so, we will condition on events of the form
[TABLE]
which determines the covariates in the permuted dataset up to (just before) the time of the -th observed event. We also condition on , fixing the covariates in the transformed dataset, up to the -th observed event. Note that this conditioning fixes the coupling defined in the optimal transport algorithm. Namely, we let and be the coupled random variables resulting from optimal transport between choosing uniformly from the covariates indexed by and choosing uniformly from the covariates indexed by respectively. Then, the transformation samples conditional on . Because is a uniformly chosen permutation, given the events we conditioned on so far, is uniformly chosen from the covariates indexed by . By the definition of the coupling, is thus uniform on the covariates indexed by . That is, is chosen uniformly from . Mathematically, for any so that the conditioning event has nonzero probability, it holds that
[TABLE]
Having shown that, conditioned on what happened in both the permuted dataset, and the synthetic dataset, the new synthetic covariate is chosen uniformly from those not chosen before, we aim to derive a recurrence relation so as to apply this result at each successive time. To this end note that
[TABLE]
where we use the previously established equality in the first equality. This allows us to compute
[TABLE]
Since the indices are added in uniform random order by definition of the transformation algorithm, this concludes the lemma. ∎
Lemma A.3**.**
[TABLE]
Proof.
The left hand side can be written as . The lemma above shows that the for have the correct distributions. We only need to show they and are a sequence of mutually independent permutations. But is determined completely by and , and is determined by . The proof follows since all these variables are mutually independent.∎
Lemma A.1 and A.3 together prove the theorem.
A.3 Proof of Theorem 5.1
The proof of Theorem 6.2 shows that, if C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, then
[TABLE]
is an exchangeable vector. In particular, if are independent identically distributed transformations of the data, then also
[TABLE]
is exchangeable. We let be the transformation of the data using the transformation of the data. By Lemma 5.1 the above vector is equal in distribution to
[TABLE]
implying that the latter is also exchangeable. For an arbitrary statistic ,
[TABLE]
is thus exchangeable too. In particular, the rank of the first entry is uniformly distributed on , which proves the theorem.
A.4 Proof of Theorem 6.1
Proof.
This proof is based on the proof of Lemma 3 of Berrett and Samworth (2019). Since H_{0}:X\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}Y implies that it is easy to see that , for any permutation . Writing , and for i.i.d. uniform permutations, we aim to show that, for any permutation of
[TABLE]
that is, that the random vector on the left is exchangeable. We observed above that the first entries are equal in distribution. It remains to show that the other entries of the right-hand side are uniform and independently chosen permutations of the first entry. Indeed, writing , we can rewrite the right-hand side as:
[TABLE]
So it remains to show that \bigl{(}\Pi^{\sigma_{j}}(\Pi^{\sigma_{0}})^{-1},1\leq j\leq B\bigr{)} are independent uniformly chosen permutations of . If , then and and the result is obvious. Now assume that for .
[TABLE]
It follows that the vector
[TABLE]
is indeed exchangeable. Letting denote any arbitrary function on data, it follows that:
[TABLE]
is also exchangeable. If we break ties at random, this implies that every ordering of the elements is equally likely. In particular, the rank of an individual element is uniformly distributed on , and the result follows.
∎
A.5 Proof of Theorem 6.2
Proof.
When we assume that C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X then, under the null hypothesis H_{0}:T\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, it follows that the pair (T,C)\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. As is –measurable, also (Z,D)\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. If we write , then Theorem 6.1 applies.
∎
A.6 Proof of Lemma 6.1
The following computation shows that for all functions . We denote the distribution of on by . As we are assuming independence of and given we can decompose .
[TABLE]
where the penultimate equality follows because .
A.7 Proof of Lemma 6.2
Estimating the survival of the censoring distribution amounts to replacing by in the Kaplan Meier Survival curve. This yields:
[TABLE]
Thus the probability of being uncensored by time equals:
[TABLE]
Note now that
[TABLE]
for points that are uncensored. That is, Kaplan–Meier weights equal a re-scaled inverse of the probability of being uncensored by that time.
A.8 Proof of Theorem 6.3
Proof.
The squared norm, written as the inner product with itself, can be expanded into three terms that we compute in turn. We denote by the entrywise product of the matrices and . Using the Hadamard product property where , , we have the following identities:
[TABLE]
[TABLE]
[TABLE]
As the entrywise product is symmetric in its arguments, we see that also
[TABLE]
Thus the weighted HSIC is
[TABLE]
with . In the standard HSIC case and, , so that is the standard (scaled) centering matrix. ∎
A.9 Using multiple transformations
We list 4 ways of combining -values.
- Method 1:
Use a Bonferroni correction and reject if for the smallest -value, denoted by , it holds that .
- Method 2:
Make the following (random) rejection decision: reject with probability , and accept otherwise.
- Method 3:
Fix and reject if . For example, reject if .
- Method 4:
Reject if . This has the advantage that it results in a quantity that can be used as a -value: .
Throughout this section, assume that the null hypothesis holds. Let be the -value resulting from sampling a dataset once, followed by running optHSIC once (so exactly one transformation and one permutation test on the transformed data). We aim to show that the methods 1 and 2 above have correct type 1 error under the assumption that for which we proved for C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and expect to be (approximately) true for C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. We aim to show that methods 3 and 4 have asymptotically (as the number of -values goes to infinity) correct type 1 error rate under the assumption that , which we proved for C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X and expect to be (approximately) true also for C\not\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X. See Table 1. While method 2 is less conservative, it is a random rejection decision which is less desirable. We can imagine Method 3 being not too conservative when .
A.9.1 Method 1
Assume it holds that (see comments at the start of the section). Let be the -values obtained from applying optHSIC times to , in ascending order. The Bonferroni correction procedure rejects if . This has the correct type 1 error probability because by the union bound under the null hypothesis
[TABLE]
A.9.2 Method 2
Assume it holds that (see comments at the start of the section). Given the -values , the second method makes a random rejection decision in the following way: Reject with probability , and accept otherwise. This has correct type 1 error because
[TABLE]
A.9.3 Method 3
Fix and reject if . An example would be to set in which case we reject if . Assume .
This is an approximate method. The ‘ideal’ and practically impossible method is to reject if . We show that this ‘ideal’ method has the correct type 1 error:
[TABLE]
where is the event that
[TABLE]
Assume by contradiction that . Then it must hold that
[TABLE]
which contradicts that . Hence it must hold that . Because in practice is unkown, we can estimate it by and reject if . Since as it is easy to see the approximate method is asymptotically correct.
A.9.4 Method 4
Method 4 is to reject if . This is an approximation of the ‘ideal’ and practically impossible method of rejecting if is such that . We assume it holds that : if we prove it under the assumption the result also follows under the assumption since the latter distribution corresponds to a more conservative test. We now show that this ‘ideal’ method has the correct type 1 error rate. Note
[TABLE]
where is the event that
[TABLE]
Define the following family of distributions:
[TABLE]
where
[TABLE]
We verify that the family and the set satisfy three conditions:
Condition 1: For all it holds that
[TABLE]
Condition 2: For all it holds that
[TABLE]
by definition of .
Condition 3: For all
[TABLE]
We now define the to be an ‘average’ of the distributions in :
[TABLE]
It is easy to see satisfies condition 1:
[TABLE]
To see satisfies condition 2 note that
[TABLE]
To see satisfies condition 3 note that
[TABLE]
Here denotes expectation with respect to and with respect to . Note condition 1 says that is a probability measure, condition says its expectation is less than and condition 3 that is dominated by the measure defined by the uniform density .
Thus, if , then satisfies the three conditions above, with in the third condition. We now show that there is a maximum value so that if , then it is impossible for any distribution to satisfy the three conditions above.
We first show Assume that . If we let be the uniform probability measure on , then it is clear that the first two conditions are met: it is a valid probability distribution (condition 1), the expectation is exactly (condition 2). The third condition is met since for it holds that
[TABLE]
because by assumption .
We now need to show that if there does not exist a distribution that satisfies the three conditions. To that end, note first that if satisfies the three conditions for , then it also satisfies the conditions for any such that (note only appears in the third condition). So in particular such would have to satisfy the conditions also with . However, to change the distribution defined above, one cannot place more mass in the region by condition 3, which says needs to be dominated by the measure defined by the uniform density . On the other hand, if one removes mass from then one automatically increases the mean of the distribution, which violates condition 2, since the mean of . We conclude that . The type 1 error of the ‘ideal’ method is thus at most .
Since as it is easy to see the approximate method has asymptotically correct type 1 error rate. This method has the advantage that it results in a combined -value: , whereas the other methods only lead to rejection decisions. The -value will be conservative if there is little randomness in the dataset. In the case there is no censoring, the -value is a factor 2 bigger than necessary. However, the Bonferroni correction would result in a -value that is a factor bigger than necessary (where , the number of transformations used, which may be much larger than ).
A.10 Tables
A.10.1 Type 1 error rates
A.10.2 Rejection rate under varying censoring regimes
A.11 Binary covariates
As a special case of independence testing we consider the case of a single binary covariate, i.e., . If one groups the data by covariate, then testing independence of and is equivalent to testing equality of lifetime distribution between the two groups. This is known as two-sample testing on right censored data. Popular approaches to this challenge are the logrank test and various weighted logrank tests. optHSIC can be applied to this problem without any adjustments, while wHSIC can be improved in this case in two ways: first, the weights can be estimated even when the censoring distribution differs between the two groups; and second, there exists an alternative permutation strategy that, experiments show, seems to control the type 1 error effectively even under dependent censoring. These adjustments are described in Section A.11.1 and Section A.11.2 respectively. We omit consideration of zHSIC, as it is fundamentally more limited, given the larger number of available methods.
A.11.1 wHSIC for two-sample testing
Let and denote the distribution of and respectively. Let the total sample be as before, and write and for the event times and indicators of individuals with covariate and respectively. We want to asses if . We again use the covariance kernel of Brownian motion. If all of the times were observed (), we could measure the difference in empirical distributions between both groups by the MMD between the two distributions:
[TABLE]
Similar to Section 6.1, when some observations are censored, we might reweight the empirical distributions, and instead compare the weighted empirical distributions
[TABLE]
We propose that the weights are computed by the Kaplan–Meier weights within each group. The test statistic thus becomes:
[TABLE]
This statistic was also, independently, proposed by Matabuena (2019), and can be seen as a special case of wHSIC in the case of binary covariates. Under the hypothesis that C\mathchoice{\mathrel{\hbox to0.0pt{\displaystyle\perp\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{\textstyle\perp\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptstyle\perp\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{\scriptscriptstyle\perp\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}X, one can obtain -values using a permutation test, resulting in the following algorithm. Section A.11.2 provides an alternative permuation strategy under dependent censoring, that was proposed by Wang et al. (2010). It was proposed in the context of the logrank test, but can equally be used for other statistics.
A.11.2 ipxHSIC
This subsection overviews a test we name ipxHSIC, which uses the same statistic defined in Section A.11.1 above, but a different permutation strategy that is robust against differences in the censoring distributions of both groups. The permutation strategy was proposed in Wang et al. (2010) to provide reliable -values for the logrank statistic in the case of small or unequal sample sizes. In fact Wang et al. (2010) propose two permutation strategies: the first one, which they call ‘ipz’ (section 2.1.1), permutes group membership and the second, which they call ‘ipt’(section 2.1.2), permutes survival times. These permutation strategies were proposed in the context of logrank tests - but can equally be applied to other statistics, such as wHSIC. The first strategy, which permutes the covariates, is referred to in their work as ‘ipz’ since the procedure first imputes several unobserved times, and then permutes the covariate, which in their work is denoted by . We refer to it as ‘ipx’, as our covariate is denoted by . The algorithm uses the Kaplan–Meier estimator to estimate three distributions: 1) , the censoring distribution in group 0, based on the data observed in group 0; 2) , the censoring distribution in group 1, based on the data observed in group 1; 3) the distribution of the lifetimes based on the pooled dataset containing both groups. With these estimates, a new dataset is constructed, consisting of observations, each consisting of a covariate, an event time, and two censoring times, one for each censoring distribution. This larger dataset is then permuted, and transformed back to a censored dataset. Wang et al. (2010) describe the algorithm in full detail. This method thus combines the wHSIC statistic with an alternative permutation strategy. Because this method relies on explicitly estimating censoring distributions in each group, it is difficult to extend this to the continuous case, where for each covariate we only have one individual in the study with that exact covariate.
A.11.3 Numerical comparison of methods in the two-sample case
We generate data from four different distributions for each of , , and to compare the power and type 1 error of the proposed methods optHSIC, wHSIC, ipxHSIC to the power and type 1 error of the classic logrank test and a weighted logrank test proposed by Ditzhaus and Friedrich (2020). The classical logrank test is known to have low power against certain alternatives, such as crossing survival curves. A weighted logrank test assigns weights to data, giving the logrank test power against different alternatives. In Ditzhaus and Friedrich (2020) a combination of weights is proposed, so as to achieve power against a wider class of alternatives. In particular Ditzhaus and Friedrich (2020) propose a combination of two sets of weights, corresponding to proportional and crossing hazards. As this section mostly serves to provide an example of our methods, we simulate fewer scenarios than in Section 7. In each scenario we let the values range from to in intervals of . To obtain -values in the three HSIC based methods as well as the weighted logrank test we use a permutation test with 1999 permutations. We reject the null hypothesis if our obtained -value is less than .
A.12 Example of data with binary covariates in which optHSIC does not perform well
Consider the following case. Group contains 1050 individuals. Group contains 50 individuals. Up to time , no events occur. At time , 1000 individuals of group are censored. There are now 50 individuals remaining in each of the groups. The individuals of group have event time and the individuals of group have event time . In this example we find the logrank test to have power of and optHSIC to have power of only .
What happens is the following: At time there are 100 individuals at risk. The individuals of group are likely to have their event first, due to the higher rate in the corresponding exponential distribution. Because in group 1000 individuals have been censored, the optimal transport map has a high chance of choosing when . So while in the resulting dataset a slight bias will remain towards individuals in group having their event first, this bias is much less clear than before the transformation. (We thank a reviewer for proposing this scenario.)
There are several characteristics that make the difference in this example so large. Firstly, as mentioned before, optimal transport relies on the ability to choose a ‘similar covariate’. When covariates are binary it may happen that while . Secondly, in this case all the censoring happens in group , causing optimal transport to send mass from group to group . Furthermore, the censoring rate is high ( of all individuals). Lastly, before the censoring occurs there is no evidence of a difference in distribution.
A.12.1 Comments on two-sample simulations
The results show that the logrank test and the weighted logrank test have little power in scenario 2 and 3 and scenario 3 respectively, even though large differences between the samples are present. The logrank is designed to detect differences as in scenario 1, and the weighted logrank is designed to detect differences as in scenario 1 and 2, sacrificing power slightly compared to the logrank test in the first. Scenario 3 is designed to defeat the weighted logrank test, since we constructed an extreme version of an early crossing survival curve, and the test does not contain weights for early crossing. The kernel methods are fully nonparametric, but do lose power in certain scenarios, most notably in Scenario 2 and the example provided. We believe optHSIC is not ideally suited to the case of binary covariates, since optimal transport relies on choosing a ‘similar’ covariate. Furthermore, while there are no fully nonparametric alternatives for independence testing for continuous covariates, there are more alternative two-sample tests. We thus believe the main value of optHSIC lies in the case of continuous covariates.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1B. Schölkopf (2001) B. Schölkopf, A. S. (2001): Learning With Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , Massachusetts: MIT press, 1 edition.
- 2Berrett and Samworth (2019) Berrett, T. B. and R. J. Samworth (2019): “Nonparametric independence testing via mutual information,” Biometrika , 106, 547–566.
- 3Cohen (1999) Cohen, S. (1999): Finding color and shape patterns in images , 1620, Stanford University, Department of Computer Science.
- 4Cox (1972) Cox, D. R. (1972): “Regression models and life‐tables,” Journal of the Royal Statistical Society: Series B (Methodological) , 34.2, 187–202.
- 5Ditzhaus and Friedrich (2020) Ditzhaus, M. and S. Friedrich (2020): “More powerful logrank permutation tests for two-sample survival data,” Journal of Statistical Computation and Simulation , 90, 2209–2227, URL https://doi.org/10.1080/00949655.2020.1773463 .
- 6Gretton et al. (2012) Gretton, A., K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola (2012): “A kernel two–sample test,” Journal of Machine Learning Research , 12, 723–773.
- 7Gretton et al. (2008) Gretton, A., K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola (2008): “A kernel statistical test of independence,” Advances in Neural Information Processing Systems , 585–592.
- 8Lip et al. (2019) Lip, S., L. E. Tan, P. Jeemon, L. Mc Callum, A. F. Dominiczak, and S. Padmanabhan (2019): “Diastolic blood pressure j-curve phenomenon in a tertiary-care hypertension clinic,” Hypertension , 74, 767–775.
