The Robust Kernel Association Test

Kara Martinez; Arnab Maity; Robert Yolken; Patrick Sullivan; Jung-Ying; Tzeng

arXiv:1901.09419·stat.ME·January 29, 2019

The Robust Kernel Association Test

Kara Martinez, Arnab Maity, Robert Yolken, Patrick Sullivan, Jung-Ying, Tzeng

PDF

Open Access

TL;DR

The paper introduces RobKAT, a flexible and robust kernel association test that handles various data distributions without assuming normality, improving detection of genetic associations in complex data.

Contribution

RobKAT is a novel, general kernel association test that extends SKAT and QRKM, offering robustness and flexibility in analyzing diverse data distributions.

Findings

01

RobKAT controls type I error across distributions.

02

RobKAT has comparable or greater power than SKAT.

03

RobKAT detects additional associations in clinical data.

Abstract

Testing the association between SNP effects and a response is a common task. Such tests are often carried out through kernel machine methods based on least squares, such as the Sequence Kernel Association Test (SKAT). However, these least squares procedures assume a normally distributed response, which is often violated. Other robust procedures such as the Quantile Regression Kernel Machine (QRKM) restrict choice of loss function and only allow inference on conditional quantiles. We propose a general and robust kernel association test with flexible choice of loss function, no distributional assumptions, and has SKAT and QRKM as special cases. We evaluate our proposed robust association test (RobKAT) across various data distributions through simulation study. When errors are normally distributed, RobKAT controls type I error and shows comparable power to SKAT. In all other distributional…

Figures21

Click any figure to enlarge with its caption.

Tables1

Table 1. Table 1: Simulation study for type I error of RobKAT (Huber loss), QRKM (LAD loss), and SKAT. The simulation includes IBS kernel, sample size n = 100 𝑛 100 n=100 , and linear h ( \bmath Z i ) = \bmath Z i T \bmath 1 5 ℎ \bmath subscript 𝑍 𝑖 \bmath superscript subscript 𝑍 𝑖 𝑇 \bmath subscript 1 5 h(\bmath{Z}_{i})=\bmath{Z}_{i}^{T}\bmath{1}_{5} . Estimates show proportion of 10 , 000 10 000 10,000 iterations with p-value below α 𝛼 \alpha cutoff.

Error	$α$	Loss Function ( $ρ$ )
		Huber	LAD	SKAT
$t_{3}$	$0.01$	0.0100	0.0083	0.0081
	$0.05$	0.0456	0.0458	0.0466
	$0.10$	0.0955	0.0943	0.0968
$χ_{1}^{2}$	$0.01$	0.0105	0.0113	0.0080
	$0.05$	0.0532	0.0507	0.0475
	$0.10$	0.0997	0.0999	0.0984
$N (0, 1)$	$0.01$	0.0104	0.0089	0.0105
	$0.05$	0.0473	0.0492	0.0455
	$0.10$	0.0914	0.0977	0.0947
$C a u c h y$	$0.01$	0.0120	0.0109	0.0032
	$0.05$	0.0488	0.0509	0.0319
	$0.10$	0.0973	0.1018	0.0841
$10 %$ Mix	$0.01$	0.0115	0.0094	0.0088
	$0.05$	0.0484	0.0476	0.0494
	$0.10$	0.0962	0.0944	0.1006
$30 %$ Mix	$0.01$	0.0108	0.0103	0.0096
	$0.05$	0.0484	0.0487	0.0489
	$0.10$	0.0996	0.0949	0.0986

Equations16

Y_{i} = \bmath X_{i}^{T} \bmath β + h (\bmath Z_{i}) + ϵ_{i},

Y_{i} = \bmath X_{i}^{T} \bmath β + h (\bmath Z_{i}) + ϵ_{i},

\min_{\bmath{\beta},h\in\mathcal{H}_{K}}\ \bigg{[}\sum_{i=1}^{N}\rho\{Y_{i}-\bmath{X}_{i}^{T}\bmath{\beta}-h(\bmath{Z}_{i})\}+\lambda\ ||h||^{2}_{\mathcal{H}_{K}}\bigg{]}

\min_{\bmath{\beta},h\in\mathcal{H}_{K}}\ \bigg{[}\sum_{i=1}^{N}\rho\{Y_{i}-\bmath{X}_{i}^{T}\bmath{\beta}-h(\bmath{Z}_{i})\}+\lambda\ ||h||^{2}_{\mathcal{H}_{K}}\bigg{]}

L(\bmath{\beta},\bmath{\gamma})=\sum_{i=1}^{n}\rho\bigg{(}\frac{e_{i}}{s}\bigg{)}=\sum_{i=1}^{n}\rho\bigg{(}\frac{Y_{i}-\bmath{X}_{i}^{T}\bmath{\beta}-\bmath{Z}_{i}^{T}\bmath{\gamma}}{s}\bigg{)}

L(\bmath{\beta},\bmath{\gamma})=\sum_{i=1}^{n}\rho\bigg{(}\frac{e_{i}}{s}\bigg{)}=\sum_{i=1}^{n}\rho\bigg{(}\frac{Y_{i}-\bmath{X}_{i}^{T}\bmath{\beta}-\bmath{Z}_{i}^{T}\bmath{\gamma}}{s}\bigg{)}

S(\bmath{\beta},\bmath{\gamma})=\sum_{i=1}^{n}\bmath{Z}_{i}\ \psi\bigg{(}\frac{e_{i}}{s}\bigg{)}=\sum_{i=1}^{n}\bmath{Z}_{i}\ \psi\bigg{(}\frac{Y_{i}-\bmath{X}_{i}^{T}\bmath{\beta}-\bmath{Z}_{i}^{T}\bmath{\gamma}}{s}\bigg{)}=0

S(\bmath{\beta},\bmath{\gamma})=\sum_{i=1}^{n}\bmath{Z}_{i}\ \psi\bigg{(}\frac{e_{i}}{s}\bigg{)}=\sum_{i=1}^{n}\bmath{Z}_{i}\ \psi\bigg{(}\frac{Y_{i}-\bmath{X}_{i}^{T}\bmath{\beta}-\bmath{Z}_{i}^{T}\bmath{\gamma}}{s}\bigg{)}=0

\frac{1}{n-p}\sum_{i=1}^{n}\psi^{2}\big{(}\frac{Y_{i}-\bmath{X}_{i}^{T}\widehat{\bmath{\beta}}}{s}\big{)}=E_{\Phi}(\psi^{2}(\bmath{\epsilon}))

\frac{1}{n-p}\sum_{i=1}^{n}\psi^{2}\big{(}\frac{Y_{i}-\bmath{X}_{i}^{T}\widehat{\bmath{\beta}}}{s}\big{)}=E_{\Phi}(\psi^{2}(\bmath{\epsilon}))

ρ (x) = {\frac{1}{2} x^{2} k (∣ x ∣ - \frac{k}{2}) ∣ x ∣ \leq k ∣ x ∣ > k

ρ (x) = {\frac{1}{2} x^{2} k (∣ x ∣ - \frac{k}{2}) ∣ x ∣ \leq k ∣ x ∣ > k

Y_{i} = \bmath X_{i}^{T} \bmath β + c * h (\bmath Z_{i}) + ϵ_{i}

Y_{i} = \bmath X_{i}^{T} \bmath β + c * h (\bmath Z_{i}) + ϵ_{i}

ψ (x) = ⎩ ⎨ ⎧ x a sign (x) a sign (x) \frac{r - ∣ x ∣}{r - b} 0 ∣ x ∣ \leq a a < ∣ x ∣ \leq b b < ∣ x ∣ \leq r r < ∣ x ∣

ψ (x) = ⎩ ⎨ ⎧ x a sign (x) a sign (x) \frac{r - ∣ x ∣}{r - b} 0 ∣ x ∣ \leq a a < ∣ x ∣ \leq b b < ∣ x ∣ \leq r r < ∣ x ∣

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenetic Associations and Epidemiology · Gene expression and cancer classification · RNA Research and Splicing

Full text

\pagerange

The Robust Kernel Association Test (RobKAT)–The Robust Kernel Association Test (RobKAT)

\artmonth000

000

The Robust Kernel Association Test (RobKAT)

Kara Martinez1

Arnab Maity1,∗

[email protected]

Robert H. Yolken5

Patrick F. Sullivan6

and Jung-Ying Tzeng1,2,3,4

1Department of Statistics

North Carolina State University

Raleigh

North Carolina

U.S.A.

2Bioinformatics Research Center

North Carolina State University

Raleigh

North Carolina

U.S.A.

3Institute of Epidemiology and Preventive Medicine

National Taiwan University

Taipei

Taiwan

4Department of Statistics

National Cheng-Kung University

Tainan City

Taiwan

5Stanley Neurovirology Laboratory

Johns Hopkins School of Medicine

USA.

6Department of Genetics

University of North Carolina at Chapel Hill

Chapel Hill

North Carolina

U.S.A

(Received MONTH YEAR. Revised MONTH YEAR. Accepted MONTH YEAR.; 0000)

Abstract

Testing the association between SNP effects and a response is a common task. Such tests are often carried out through kernel machine methods based on least squares, such as the Sequence Kernel Association Test (SKAT). However, these least squares procedures assume a normally distributed response, which is often violated. Other robust procedures such as the Quantile Regression Kernel Machine (QRKM) restrict choice of loss function and only allow inference on conditional quantiles. We propose a general and robust kernel association test with flexible choice of loss function, no distributional assumptions, and has SKAT and QRKM as special cases. We evaluate our proposed robust association test (RobKAT) across various data distributions through simulation study. When errors are normally distributed, RobKAT controls type I error and shows comparable power to SKAT. In all other distributional settings investigated, our robust test has similar or greater power than SKAT. Finally, we apply our robust kernel association test on data from the CATIE clinical trial to detect associations between selected genes on chromosome 6, including the Major Histocompatibility Complex (MHC) region, and neurotrophic herpesvirus antibody levels in schizophrenia patients. RobKAT detected significant association with four SNP-sets (HST1H2BJ, MHC, POM12L2, and SLC17A1), three of which were undetected by SKAT.

keywords:

Genetic association test; Kernel machine regression; Multi-marker hypothesis test; Robust regression; Schizophrenia; Semiparametric.

††volume: 00

1 Introduction

Detecting associations between a set of genetic markers, such as SNPs, and a continuous response is a common problem. These tasks lend themselves to multi-marker approaches, such as kernel methods, that analyze variants at the gene level and may gain power due to their lower multiple testing burden. Kernel methods are also appealing over burden-based multi-marker approaches, which treat the overall genetic marker effect as fixed and test for the mean genetic effect with various weighting schemes (Gauderman et al., 2007; Madsen and Browning, 2009; Li and Leal, 2008). Many of these burden-based methods are optimal in the case of additive loci with similar size and direction of effect, but kernel methods typically avoid these loci effect restrictions by viewing the overall genetic effect as random or the output of a nonparametric function.

Least squares kernel machines (LSKM), developed by Liu et al. (2007) and Kwee et al. (2008), are popular kernel methods that utilize a linear mixed model with normally distributed errors. Mixed model methods such as Restricted Maximum Likelihood (REML) may be applied for simultaneous estimation of parameters, and LSKM estimators can be shown to be the best linear unbiased predictors. Score tests can be used for testing the overall genetic marker effect, and due to distributional assumptions this score statistic follows a Chi-squared mixture under the null hypothesis (Liu et al., 2007). A type of LSKM that has been developed is the sequence kernel association test (SKAT), which incorporates weights into the kernel according to genetic marker minor allele frequencies (Wu et al., 2011). However, the assumption of normally distributed errors limits inference to the conditional mean response. In the case of a skewed or heavy-tailed response, the null distribution of the test statistic does not follow a Chi-squared mixture, leading to moderate to severe loss of power.

A variation on LSKMs, the quantile regression kernel machines (QRKM) utilizes quantile regression methods rather than least squares for parameter estimation. When testing the genetic marker effect, Kong et al. (2016) proposes a QRKM score-type statistic based on the subgradient of the check function rather than the traditional score test. Although QRKM is more robust to extreme observations, the method is restricted only to inference on the conditional quantile. While these quantiles offer more choice than in LSKMs, additional flexibility may be desired.

Although appropriate multi-marker association testing procedures are limited for non-normal responses, one such data set is the the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) study. The CATIE trial aimed to compare the effectiveness of atypical antipsychotics in unrelated patients suffering from schizophrenia and measured patient antibody levels on three neurotrophic herpesviruses. As discussed with additional detail in Section 4, these antibody levels are notably right-skewed. Even after a log-transformation, the data display a bimodal distribution. Violating the assumption of normality, tests such as LSKM and SKAT are not appropriate. Although QRKM is valid with such distributions and robust to outliers, QRKM restricts inference to the conditional quantile and limits the loss function used in the analysis. Thus, we propose a robust and general method to test for multi-marker association with this non-normally distributed response.

Our robust kernel association test accommodates any valid loss function and is robust to outliers, heavy tails, and skewed distributions. We achieve this robust framework by generalizing previous kernel methods and eliminating all distributional assumptions on the response. While LSKM and SKAT specify a square loss and QRKM specifies a ’check’ loss, we follow similar score function derivations with a general loss function subject to a few minor restrictions. The user-specified, general loss function may be taken from the robust statistics literature, rendering RobKAT methodology robust to non-normal responses. The RobKAT test statistic also is a similar quadratic form as previous methods and, as described in Section 2, simplifies to LSKM, SKAT, and QRKM in special cases. However, RobKAT’s general loss function also provides alternatives to LSKM’s conditional mean inference and QRKM’s conditional quantile inference. The null distribution of our test statistic and associated p-values can be calculated through a permutation test, as originally proposed for QRKM. However, since the permutation testing procedure is computationally inefficient, we instead apply a fast permutation testing method described by Zhan and Wu (2017) that estimates the null distribution of the test statistic through moment matching.

The rest of the article is organized as follows. Section 2 details model notation, derivation of our RobKAT statistic, and p-value computation. Section 3 then demonstrates the validity, Type I error control, and power of our test through simulation study comparing to normal-based methods. We show that our proposed testing procedure is robust to non-normal distributions, even providing a powerful and Type I error-controlled test when errors follow a Cauchy distribution. Although comparable to SKAT when the response is normal, RobKAT shows greater power in all non-normal cases that we explored. In Section 4, we apply our proposed testing procedure to detect the association between genes in chromosome 6, including the MHC region, and antibody levels using CATIE clinical trial data. We find four significantly associated gene sets using the skewed antibody response, three of which are not detected by SKAT. In Section 5, we conclude that RobKAT’s computational efficiency and robust nature make it a useful procedure even on the genome-wide scale.

2 Methodology

2.1 Model Specification and Testing Problem

Suppose we observe data $(Y_{i},\bmath{X}_{i},\bmath{Z_{i}}),i=1,\ldots,n$ , where $Y_{i}$ is a scalar response of interest, $\bmath{Z}_{i}$ denotes a $p\times 1$ vector of genetic covariates, and $\bmath{X}_{i}$ denotes a $q\times 1$ vector of other confounders observed for the $i$ th subject. We consider the model

[TABLE]

where $\bmath{\beta}$ is an unknown coefficient vector, $h(\cdot)$ is an unknown function quantifying the effect of the genetic covariates, and $\epsilon_{i},i=1,\ldots,n$ are independent mean zero random errors. Testing for the overall genetic marker effect corresponds to a hypothesis test for $H_{0}:h(\bmath{Z}_{i})=0$ .

One issue in kernel machine regression is that the potentially nonparametric function $h(\cdot)$ may be difficult to model. We instead assume that $h(\cdot)$ is an element of a reproducing kernel Hilbert space (RKHS), $\mathcal{H}_{K}$ , which can be uniquely defined by a positive semi-definite kernel function $k(\cdot,\cdot)$ (Shawe-Taylor and Cristianini, 2004; Wahba, 1990). Due to certain mathematical properties of RKHS, this structure allows us to represent $h(\bmath{Z}_{i})$ with images of the kernel function. Further, we may summarize this information in the positive semi-definite matrix $\mathbb{K}$ , whose $(i,j)$ element is defined by $k(\bmath{Z}_{i},\bmath{Z}_{j})$ . To obtain estimates of $h(\cdot)$ , researchers often minimize the penalized loss criterion

[TABLE]

We note that the nonparametric functional form of $h(\cdot)$ may be infinite-dimensional. Regarding the form of $\rho(\cdot)$ , the least squares kernel machine methods specify $\rho(t)=t^{2}/2$ in order to implement squared-loss.

A key observation is that estimates of $\widehat{\bmath{\beta}}$ and $\widehat{\bmath{h}}$ from Equation 1 match best linear unbiased estimators estimates obtained from the mixed effects model, $Y_{i}=\bmath{X}_{i}^{T}\bmath{\beta}+h_{i}+\epsilon_{i}$ where $\bmath{h}\sim N(0,\tau\mathbb{K})$ and $\epsilon_{i}\sim N(0,\sigma^{2})$ (Liu et al., 2007). Further, estimates for the remaining parameters, $\tau$ and $\sigma^{2}$ , can be computed using restricted maximum likelihood estimates. With the mixed effects model in mind, LSKM performs a hypothesis test for $H_{0}:\tau=0$ using a variance component score test. Leveraging the model’s normality assumptions, inference on the test statistic comes from a Chi-Squared mixture and a Satterthwaite approximation (Liu et al., 2007).

QRKM also uses the penalized minimization criterion but instead specifies a check function for the loss: $\rho(t)=\eta t\ \mathbb{I}_{\{t\geq 0\}}+(\eta-1)t\ \mathbb{I}_{\{t<0\}}$ where $\eta\in(0,1)$ is the conditional quantile of interest (Kong et al., 2016). When researchers are interested in association with the conditional median response, for example, $\eta=0.5$ and $\rho(t)=0.5|t|$ . Estimates for $\bmath{\beta}$ are the values that minimize $\sum_{i=1}^{N}\rho(Y_{i}-X_{i}^{T}\bmath{\beta})$ , Equation 1 under the null hypothesis, and can be computed using quantile regression packages such as quantreg in R. As with LSKM, the QRKM test statistic follows the variance component score test procedure. Since the check loss function is not differentiable, however, the derivation uses the subgradient, $\psi(t)=\eta\mathbb{I}_{\{t>0\}}+(\eta-1)\mathbb{I}_{\{t<0\}}+(\eta-B)\mathbb{I}_{\{t=0\}}$ , where $B$ is the outcome of a $Bern(\eta)$ random variable. Without using an assumption of normality, the null distribution of the QRKM test statistic and resulting p-value are computed using a permutation algorithm (Kong et al., 2016).

In RobKAT, we derive the variance component score statistic but with a general $\rho$ , subject only to a small number of conditions. In this way, LSKM and QRKM will be special cases of our robust and generalized method. Let $\rho$ be a differentiable function $\rho:\mathbb{R}\rightarrow\mathbb{R}^{+}$ such that

$\rho(0)=0$ 2. 2.

$\rho(x)$ is a convex function of $x$ 3. 3.

$\psi(x)=\frac{d}{dx}\rho(x)$ exists and is non-decreasing

The first assumption is taken from standard robust statistics procedures described in Maronna et al. (2006), and the last two are modifications to exclude redescending $\psi$ -functions. Violations of assumptions 2 and 3 can be accommodated, however performance of the resulting test may be suboptimal. The exclusion of these redescending $\psi$ -functions and such performance impairment is discussed further in section 5. Note that we do not make any assumptions on the distribution of $e_{i}$ .

2.2 Test Statistic

To motivate our method, we first consider the simpler case in which $h(\bmath{Z}_{i})=\bmath{Z}_{i}^{T}\bmath{\gamma}$ . Testing $H_{0}:h(\bmath{Z}_{i})=0$ in this case is equivalent to testing $H_{0}:\bmath{\gamma}=0$ . A score-type test follows from minimizing the criterion

[TABLE]

where $s$ is the scale parameter for the distribution of $e_{i}$ . Differentiating and setting equal to zero, the corresponding score function then yields the M-estimation equations

[TABLE]

Let $\widehat{\bmath{\beta}}$ be an estimate of $\bmath{\beta}$ under the null hypothesis $H_{0}:h(\bmath{Z}_{i})=\bmath{Z}_{i}^{T}\bmath{\gamma}=0$ , and let $\widehat{e_{i}}=Y_{i}+\bmath{X}_{i}^{T}\widehat{\bmath{\beta}}$ . Under $H_{0}$ , fitting $\widehat{\bmath{\beta}}$ and accordingly $S(\widehat{\bmath{\beta}},\bmath{\gamma})|_{\bmath{\gamma}=0}$ reduces to the simple task of fitting the linear regression model $Y_{i}=\bmath{X}_{i}^{T}\bmath{\beta}+e_{i}$ , which does not require knowledge of the form of $h(\bmath{Z}_{i})$ and can be done easily using standard robust regression methods. When obtaining an M-estimate of $s$ , minimizing $\sum_{i=1}^{n}\rho(\frac{e_{i}}{s})$ can ensure that regression estimates are invariant. However, we desire an estimate of $s$ that is also robust. One solution is to use Huber’s Proposal 2 robust estimate of scale, which solves

[TABLE]

where $\Phi$ denotes an expectation under the assumption of a standard normal error distribution (Huber, 1977; Schrader and Hettmansperger, 1980). Huber’s Proposal 2 estimate is preferred over other robust estimates of scale as it has desirable convergence properties under our assumption of convex $\rho$ (Schrader and Hettmansperger, 1980).

We then derive the test statistic $T=S(\widehat{\bmath{\beta}},\bmath{\gamma})^{T}S(\widehat{\bmath{\beta}},\bmath{\gamma})=\bmath{w}^{T}\mathbb{K}\bmath{w}$ The last expression uses the simplifications $\bmath{w}$ and $\mathbb{K}$ where the $i^{th}$ element of $\bmath{w}$ is defined by $w_{i}=\psi(\frac{\widehat{e_{i}}}{\widehat{s}})$ and the $(i,j)^{th}$ element of $\mathbb{K}$ is defined by $k(\bmath{Z}_{i},\bmath{Z}_{j})=\bmath{Z}_{i}^{T}\bmath{Z}_{j}$ .

This $\mathbb{K}$ can be generalized to any kernel function $k(.,.)$ . The effect of $k(.,.)$ is linked with the form of $h(\bmath{Z}_{i})$ . In the previous derivation, setting $h(\bmath{Z}_{i})=\bmath{Z}_{i}\bmath{\gamma}$ , a model with only main genetic-marker effects, corresponds to the linear kernel $\mathbb{K}$ with entries defined by the function $k(\bmath{Z}_{i},\bmath{Z}_{j})=\bmath{Z}_{i}^{T}\bmath{Z}_{j}$ . If instead for $\bmath{Z}_{i}\in\mathbb{R}^{p}$ we chose in our derivation $h(\bmath{Z}_{i})=\sum_{j<k}\bmath{Z}_{ij}\bmath{Z}_{ik}\bmath{\gamma}_{jk}+\sum_{j=1}^{p}\bmath{Z}_{ij}^{2}\bmath{\gamma}_{2j}$ , a model with quadratic main effects and all two-way interactions, then we would obtain a quadratic kernel whose entries are defined by $k(\bmath{Z}_{i},\bmath{Z}_{j})=(\bmath{Z}_{i}^{T}\bmath{Z}_{j})^{2}$ . The kernel function $k(.,.)$ generates a function space that contains $h(\bmath{Z}_{i})$ . This theoretical framework thus represents $h(\bmath{Z}_{i})$ by a linear combination $h(\bmath{Z}_{i})=\sum_{j}\alpha_{j}\ k(\bmath{Z}_{i},\bmath{Z}_{j})$ for some weights $\alpha_{j}$ , which gives a clear connection between choice of $k(.,.)$ and form of $h(\bmath{Z}_{i})$ . Since $\mathbb{K}$ is symmetric positive semi-definite, $k(.,.)$ is often thought of as a similarity metric between two individuals, which may be more intuitive to specify than the form of $h(\bmath{Z}_{i})$ itself. Further, any symmetric positive semi-definite matrix describing similarity may be used for $\mathbb{K}$ . When $\bmath{Z}_{i}$ represents genetic markers, for example, the IBS kernel is defined as $k(\bmath{Z}_{i},\bmath{Z}_{j})=\frac{1}{2p}\sum_{k=1}^{p}[2\mathbb{I}\{\bmath{Z}_{ik}=\bmath{Z}_{jk}\}+\mathbb{I}\{|\bmath{Z}_{ik}-\bmath{Z}_{jk}|=1\}]$ or the proportion of alleles shared between individuals $i$ and $j$ .

By instead defining the form of $h(\bmath{Z}_{i})$ through $k(.,.)$ , we generalize our test statistic to any valid kernel function. Thus, we produce a test statistic that is intuitively specified through a kernel function, $k(.,.)$ , and loss function, $\rho$ . Note that any $\rho$ function following the general assumptions listed above is easily incorporated into this test statistic. For example, the case in which $\rho(t)=\frac{1}{2}t^{2}$ leads to the usual LSKM test statistic and $\rho(t)=0.5|t|$ leads to the QRKM statistic (Liu et al., 2007; Kong et al., 2016). Another common choice of $\rho$ is from the family of Huber functions defined by

[TABLE]

where the value of $k$ may be chosen for a desired asymptotic variance (Huber, 1977; Maronna et al., 2006). For $-k<x<k$ , the function is quadratic as in least squares, and for $|x|>k$ the function is linear as in quantile regression. Thus, this family may be seen as an intermediate between least squares and quantile regression.

To calculate our general kernel machine test statistic, we first fit the model under the null hypothesis, $H_{0}:h(\bmath{Z}_{i})=0$ , by regressing $\bmath{Y}$ on $\bmath{X}$ and minimizing $\sum_{i=1}^{n}\rho(e_{i})$ with the iterated re-weighted least squares algorithm. Then we use the null model to compute $\widehat{s}$ , Huber’s Proposal 2 robust estimate of scale. We then compute the score-type statistic described above, $T=\bmath{w}^{t}\mathbb{K}\bmath{w}$ .

2.3 Null Distribution of the Testing Statistic

It is possible to obtain the null distribution for a statistic of the form $\bmath{w}^{T}\mathbb{K}\bmath{w}$ using permutations (Kong et al., 2016). However, there is a substantial computational burden associated with calculating a null distribution using permutations of data sets with large sample sizes. Slightly modifying our test statistic by centering $\mathbb{K}$ , we can avoid the computational burden in obtaining the null distribution of $T$ (Zhan and Wu, 2017). Following the fast algorithm developed by Zhan and Wu (2017), we instead consider $T=\bmath{w}^{T}\mathbb{P}\mathbb{K}\mathbb{P}\bmath{w}$ , where $\mathbb{P}=\mathbb{I}_{n}-\frac{1}{n}\bmath{1}\bmath{1}^{T}$ is the centering matrix. Because $\mathbb{K}$ is symmetric and positive semi-definite, there exists a matrix $\mathbb{K}^{\frac{1}{2}}$ such that $\mathbb{K}=\mathbb{K}^{\frac{1}{2}}(\mathbb{K}^{\frac{1}{2}})^{T}$ . Thus, $T=\bmath{w}^{T}\mathbb{P}\mathbb{K}\mathbb{P}\bmath{w}=tr(\bmath{w}^{T}\mathbb{P}\mathbb{K}\mathbb{P}\bmath{w})=tr(\mathbb{P}\mathbb{K}\mathbb{P}\bmath{w}\bmath{w}^{T})=tr\{\mathbb{P}\mathbb{K}^{\frac{1}{2}}(\mathbb{K}^{\frac{1}{2}})^{T}\mathbb{P}\bmath{w}\bmath{w}^{T}\}$ .

This form of the statistic can then be leveraged with previously established distribution approximation methods. The analytical expressions of the first three moments of $T=tr(\mathbb{X}^{T}\mathbb{X}\mathbb{Y}^{T}\mathbb{Y})$ have been previously derived by Kazi-Aoual et al. (1995). We may then utilize the fast techniques developed for RV-type statistics of the form $tr(\mathbb{P}\mathbb{A}\mathbb{A}^{T}\mathbb{P}\mathbb{B}\mathbb{B}^{T})$ , which approximate the null permutation distribution by moment matching to a Pearson type III distribution following Josse et al. (2008). Although it is possible to approximate the empirical null distribution with the normal, log-normal, or Edgeworth distributions, the Pearson type III approximation is both accurate and more efficient than these competitors (Josse et al., 2008). Following the method described by Zhan and Wu (2017), we use the centered test statistic, analytical moments, and approximated Pearson type III density to obtain a p-value for our test statistic. In the case of $\rho(t)=0.5|t|$ , this faster algorithm has been shown to be about $6,000$ times faster than the standard permutation test with comparable power and type I error (Zhan and Wu, 2017).

3 Simulation Study

We perform a simulation study to investigate the properties of our proposed method with the robust Huber $\psi$ -function when the data have various error distributions and the genetic effect takes various forms. We then compare the performance of our proposed method against that of SKAT, which is not robust and assumes a normally distributed error distribution. We also include the special case of our robust method where $\rho(t)=0.5|t|$ , which corresponds to the QRKM method with $\tau=0.5$ . We refer to the QRKM $\rho(t)=0.5|t|$ as least absolute deviation, LAD.

3.1 Simulation Study Design

We generate data from the model

[TABLE]

We generate elements of $\bmath{X}_{i}$ from $N(0,\mathbb{I}_{5})$ and set $\bmath{\beta}=\bmath{1}_{5}$ , and perform simulation studies with both the linear $h(\bmath{Z}_{i})=\bmath{Z}_{i}^{T}\bmath{1}_{5}$ and the nonlinear $h(\bmath{Z}_{i})=1+\sum_{k=1}^{p}\bmath{Z}_{ik}+2\sum_{k=2}^{p}\bmath{Z}_{i1}\bmath{Z}_{ik}$ , where $\bmath{Z}_{i}$ is a randomly sampled 9-SNP genotype of the SLC17A1 gene in the CATIE antibody study. We compare testing results from iid error terms, $\epsilon_{i}$ , generated from $t_{3}$ , $\chi^{2}_{1}$ , $N(0,1)$ , and $Cauchy(0,1)$ . Additionally, we compare the bimodal setting of a mixture of normal distributions, where $\epsilon_{i}=BW_{0}+(1-B)W_{10}-10(1-\theta)$ where $W_{X}\sim N(X,1)$ , $B\sim Bern(\theta)$ , and $\theta\in\{0.9,0.7\}$ . For all testing procedures, we use an IBS kernel and two sample sizes ( $n=100,300$ ). We use the Huber function defined by $k=1.345$ for $95\%$ efficiency of least squares when the errors are normally distributed.

To examine Type I error, we consider significance levels $\alpha=0.01$ , $\alpha=0.05$ , and $\alpha=0.1$ . We set $c=0$ for the null model under $H_{0}:h(\bmath{Z}_{i})=0$ and use $N=10,000$ Monte Carlo replications. As $c=0$ , the type I errors are identical for both linear and nonlinear $h(\bmath{Z}_{i})$ functions. To investigate the power of our testing procedure compared to SKAT, we test increasing values of $c$ and use $N=1,000$ Monte Carlo replications for each value. Figures 1-2 show the resulting power curves for tests with nominal level $0.05$ .

3.2 Simulation Results

As seen in Table 1, our testing method produces reasonable type I errors for each of the error distributions. Although SKAT produces similar type I errors for $t_{3}$ , $\chi^{2}_{1}$ , and $N(0,1)$ , the type I errors with Cauchy distributed $\epsilon_{i}$ are notably deflated. Our testing method also produces reasonable type I errors across sample sizes for both the $10\%$ and $30\%$ normal mixtures, which is comparable to the SKAT testing method.

For standard normal and $t_{3}$ distributed errors, all tests displayed similar power. For the $\chi_{1}^{2}$ skewed errors, our robust method has similar power to LAD and a slight power advantage over SKAT. For Cauchy distributed errors, however, SKAT loses a substantial amount of power, while the Huber $\psi$ -function method and LAD $\psi$ -function perform reasonably well. For the $30\%$ normal mixture distribution, the Huber $\psi$ -function performs similarly to SKAT, with the LAD $\psi$ -function has a notably greater power than the other two methods. For the 10% normal mixture distribution, the robust methods are more powerful than SKAT and similar to LAD. The association tests under the 30% bimodal mixture overall is less powerful than the association test under the 10% bimodal mixture.

These trends are similar between the linear and nonlinear form of $h(\bmath{Z}_{i})$ . However, the nonlinear modeling of the genetic markers required smaller effect size for maximal power. The previously described trends for power are also consistent when the sample size increases to $n=300$ (See Web Appendix C).

4 Data Analysis

Several genes in chromosome 6, including the Major Histocompatibility Complex (MHC), region have been found to have an established connection to schizophrenia, and previous GWAS analyses have identified multiple SNPs in this region associated with schizophrenia (Aberg et al., 2013; International Schizophrenia Consortium et al., 2009; Shi et al., 2009). As described in Lieberman et al. (2005), the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) study aimed to compare the effect of various atypical antipsychotics on unrelated patients’ schizophrenia symptoms. DNA samples were donated from 51% of the 1460 patients in the CATIE study, and genotyping was performed for 492,000 SNPs (Sullivan et al., 2008).

Genetic variation in chromosome 6 is also thought to contribute to immune function and response to infectious diseases, such as neurotropic herpesviruses. This class of infectious agent has been linked to cognitive impairments, and in the case of Herpes Simplex Virus Type 1 have shown significant association with aspects of the decreased cognition found in schizophrenia disorders (Yolken et al., 2011). At baseline in the CATIE trial, participants provided a blood sample from which investigators measured Ig class antibodies to three neurotrophic herpesviruses: Herpes Simplex Virus type 1 (HSV1), Herpes Simplex Virus type 2 (HSV2), and Cytomegalovirus (CMV) (Yolken et al., 2011).

We aim to apply our robust kernel machine methodology to determine genes near the MHC region with association to levels of three herpesvirus antibodies: HSV1, HSV2, and CMV. We investigate 12 genes on chromosome 6p22.1 including MHC region as was done previously in Maity et al. (2012) and Davenport et al. (2018). For each gene, we perform our proposed kernel association test with the Huber $\psi$ -function and three different kernels (linear, quadratic, IBS), controlling for age and sex. We use a Bonferroni multiple testing correction to an $\alpha=0.05$ significance level ( $\alpha^{*}=0.05/(12*3*3)=0.00046)$ as we are performing tests for 12 genes, 3 kernels, and 3 methods. For comparison, we also apply SKAT and LAD to analyze each of the 12 candidate genes with the same three kernel functions. The results for our analyses are summarized in Table 2.

For the HSV1 and CMV antibody responses, no significant associations were detected for any gene with any loss function. For the HSV2 antibody response and each kernel, both SKAT and RobKAT detected significant association with the MHC region. However, our robust kernel association test was able to detect significant associations to the POM121L2 and SLC17A1 genes, while SKAT and LAD did not detect any significant associations. LAD detected a significant association with the HIST1H2BJ gene, while the other $\psi$ -functions did not.

It is sensible that both SKAT and the proposed test detected the MHC region, which has the most established association with both schizophrenia and immune function. However, the data for each antibody response are skewed (see Supplementary Materials for data distributions). As our previous simulations suggest improved power in non-normally distributed data sets, perhaps our model detected a signal in the SLC17A1 and POM121L2 genes whereas SKAT failed. The association between POM121L2 and schizophrenia has been previously noted by Aberg et al. (2013). The power described in the simulation study represents a probability, so it is not unreasonable that LAD detected HIST1H2BJ gene when our robust test did not.

5 Discussion

We have developed the methodology of a general and robust kernel association test for the model $Y_{i}=\bmath{X}_{i}^{T}\bmath{\beta}+h(\bmath{Z}_{i})+\epsilon_{i}$ with the score-type test statistic $T=\bmath{w}^{t}\mathbb{K}\bmath{w}$ . Our kernel association test is general, and many previously established kernel association tests such as SKAT and QRKM are special cases when specific loss functions are chosen. RobKAT is flexible in that it places few assumptions on both the choice of kernel and the choice of loss function. The test also does not place any assumptions, including existence of moments, on the response distribution. We then implemented RobKAT with fast permutation testing techniques, which also provides the null distribution of our test statistic without assuming a response distribution. This permutation testing approach is key, as the LSKM Chi-squared mixture null distribution with our statistic decreases power and fails to control for type I error in small and moderate sample sizes (See Web Appendix F for simulation evidence).

We ran a simulation study with an IBS kernel, three different loss functions (Huber, LAD, SKAT), and various response distributions ( $t_{3},\chi^{2}_{1},N(0,1),Cauchy(0,1)$ ) to investigate the performance of our robust kernel association test. In all distributional settings, the type I error was controlled and the robust tests showed similar or greater power than SKAT. This pattern also held in a simulation study bimodal distributed responses. Furthermore, as no single robust loss function was universally superior across response distributions, there is a demonstrated benefit of allowing a variety of loss functions, which is a feature of RobKAT.

We were also able to apply our methodology to the CATIE data set to test for associations between levels of neurotropic herpesviruses and genes located in the MHC region of chromosome 6. Although the antibody level response was non-normal, our robust kernel association test was able to detect significant associations between HSV2 and the MHC region and the HIST1H2BJ, POM121L2, and SLC17A1 genes. Among these genes, SKAT only detected association between HSV2 levels and the MHC region, which is unsurprising given the observed trend from simulation that SKAT is less powerful than our robust test when the response is non-normal.

Simulation studies for linear and quadratic kernels were also performed and produced similar power and type I error trends among the various loss functions. For both the linear and nonlinear forms of $h(\bmath{Z}_{i})$ , the tests utilizing different kernels had similar power with the quadratic kernel producing the greatest power (See Web Appendices A-B for full simulation results).

Simulation studies for two additional $\psi$ -functions, Hampel and Bisquare, were also performed (See Appendix). The Hampel family of functions is defined by

[TABLE]

and setting $a=1.353,b=3.157,r=7.216$ in our simulations for $95\%$ efficiency (Hampel, 1986). Tukey’s Bisquare family of functions is defined by $\psi(x)=x\bigg{(}1-(\frac{x}{k})^{2}\bigg{)}^{2}I_{\{|x|\leq k\}}$ and setting $k=4.685$ for $95\%$ efficiency in our simulations (Maronna et al., 2006). These two $\psi$ -functions are called redescending functions, meaning that there exists a rejection point $c<\infty$ such that $\psi(x)=0$ for all $|x|\geq c$ (Huber, 1981).

Although these redescending functions violate our assumptions on $\psi$ , redescending functions may be attractive in data sets with extreme outliers in the distribution’s tails, as the outliers would likely receive no weight in model fitting. Some researchers such as Huber (1981) warn against the use of these redescending functions, arguing that the risks from sensitivity to incorrect $\psi$ scale parameter specification do not outweigh the slight improvement in asymptotic variance. For these reasons, simulation results that include the Hampel and Bisquare $\psi$ -functions are only included in the Supplementary Web Materials.

For the Cauchy and 30% mixture distributions, the redescending $\psi$ -functions had slightly more power than with the other $\psi$ -functions, likely due to the heavy tails and outliers in these distributions. For all other distributions, the power and type I error control were similar to that of the Huber $\psi$ -function. With increasing effect sizes in our simulation study, we observed an eventual decreasing power trend for the redescending $\psi$ -functions. With large effect size, we suspect that the plato in $\rho$ -function inherent from the redescending $\psi$ fails to detect strong signals.

Our algorithm is also computationally efficient through the utilization of fast kernel machine testing methods developed by Zhan and Wu (2017). Thus, not only is the RobKAT test general and robust, but our robust kernel association test is also scalable to moderate and large amounts of SNPs. For example, the average time to perform one RobKAT test with sample size 300, IBS kernel, and linear $h(\bmath{Z}_{i})$ is 0.17 seconds. Additionally, we have produced a user-friendly R package containing the RobKAT software. Thus, our robust kernel association test methodology is a robust generalization with fewer distributional and inferential restrictions than previously established methods, while remaining computationally efficient and user friendly.

Acknowledgements

Data used in the preparation of this article were obtained from the limited access datasets distributed from the NIH-supported “Clinical Antipsychotic Trials of Intervention Effectiveness in Alzheimer’s Disease” (CATIE-AD). This is a multisite, clinical trial of persons with Alzheimer’s Disease comparing the effectiveness of randomly assigned medication treatment. The study was supported by NIMH Contract #N01MH90001 to the University of North Carolina at Chapel Hill. The ClinicalTrials.gov identifier is NCT00015548. This work was partially supported by National Institutes of Health grant P01CA142538 (to JYT) and National Institutes of Health training grant T32GM081057, Biostatistics Training in the Omics Era (to KM).

Supplementary Materials

Supplementary simulation and data analysis results are available with this paper at the Biometrics website on Wiley Online Library.

Bibliography25

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Aberg et al. (2013) Aberg, K., Liu, Y., Bukszar, J., Mc Clay, J., Khachane, A., Andreassen, O., Blackwood, D., Corvin, A., Djurovic, S., Gurling, H., Ophoff, R., Pato, C., Pato, M., Riley, B., Webb, T., Kendler, K., O’Donovan, M., Craddock, N., Kirov, G., Owen, M., Rujescu, D., St Clair, D., Werge, T., and et Al (2013). A comprehensive family-based replication study of schizophrenia genes. JAMA Psychiatry 70, 573.
2Davenport et al. (2018) Davenport, C. A., Maity, A., Sullivan, P. F., and Tzeng, J.-Y. (2018). A Powerful Test for SNP Effects on Multivariate Binary Outcomes using Kernel Machine Regression. Statistics in Biosciences 10, 117–138.
3Gauderman et al. (2007) Gauderman, W. J., Murcray, C., Gilliland, F., and Conti, D. V. (2007). Testing association between disease and multiple SN Ps in a candidate gene. Genetic Epidemiology 31, 383–395.
4Hampel (1986) Hampel, F. R. (1986). Robust statistics : the approach based on influence functions . Wiley, New York.
5Huber (1977) Huber, P. J. (1977). Robust statistical procedures . Society for Industrial and Applied Mathematics, Philadelphia, PA, 1 edition.
6Huber (1981) Huber, P. J. (1981). Robust statistics . Wiley, Sommerset, NJ, 1 edition.
7International Schizophrenia Consortium et al. (2009) International Schizophrenia Consortium, I. S., Purcell, S. M., Wray, N. R., Stone, J. L., Visscher, P. M., O’Donovan, M. C., Sullivan, P. F., and Sklar, P. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–52.
8Josse et al. (2008) Josse, J., Pagès, J., and Husson, F. (2008). Testing the significance of the RV coefficient. Computational Statistics and Data Analysis .