Achieving GWAS with Homomorphic Encryption
Jun Jie Sim, Fook Mun Chan, Shibin Chen, Benjamin Hong Meng Tan, Khin, Mi Mi Aung

TL;DR
This paper presents a method using homomorphic encryption to perform genome-wide association studies securely, enabling privacy-preserving genetic analysis with practical performance on real datasets.
Contribution
It adapts the semi-parallel GWAS algorithm for homomorphic encryption, introducing matrix operations, approximations, and a cache module to improve efficiency and scalability.
Findings
Successfully performed GWAS with homomorphic encryption on real datasets
Achieved analysis time of under 25 minutes for 10,643 SNPs and 245 samples
Demonstrated feasibility of privacy-preserving genetic association studies
Abstract
One way of investigating how genes affect human traits would be with a genome-wide association study (GWAS). Genetic markers, known as single-nucleotide polymorphism (SNP), are used in GWAS. This raises privacy and security concerns as these genetic markers can be used to identify individuals uniquely. This problem is further exacerbated by a large number of SNPs needed, which produce reliable results at a higher risk of compromising the privacy of participants. We describe a method using homomorphic encryption (HE) to perform GWAS in a secure and private setting. This work is based on a proposed algorithm. Our solution mainly involves homomorphically encrypted matrix operations and suitable approximations that adapts the semi-parallel GWAS algorithm for HE. We leverage the complex space of the CKKS encryption scheme to increase the number of SNPs that can be packed within a…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4| Homomorphic Operation | No. Successive Operations |
|---|---|
| Plaintext Multiplication∗ | |
| Ciphertext Multiplication† | |
| Ciphertext Rotation |
| Process | Time Taken (min) | Memory (GB) |
|---|---|---|
| Preprocessing∗ | ||
| Context Generation | ||
| Encryption | ||
| Computations | ||
| Decryption |
| Process | Time Taken (min) | Memory (GB) |
|---|---|---|
| Preprocessing∗ | ||
| Context Generation | ||
| Encryption | ||
| Computations | ||
| Decryption |
| Error | No. of Different Entries | HEAAN Accuracy (%) |
|---|---|---|
| Process | Time Taken (min) | Memory (GB) |
|---|---|---|
| Preprocessing∗ | ||
| Context Generation | ||
| Encryption | ||
| Computations | ||
| Decryption | ||
| Total |
| Error | No. of Different Entries | SEAL Accuracy (%) |
|---|---|---|
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Achieving GWAS with Homomorphic Encryption
Jun Jie Sim
Institute of Infocomm Research, Singapore
Fook Mun Chan
Institute of Infocomm Research, Singapore
Shibin Chen
Institute of Infocomm Research, Singapore
Benjamin Hong Meng Tan
Institute of Infocomm Research, Singapore
Khin Mi Mi Aung
Institute of Infocomm Research, Singapore
Abstract
One way of investigating how genes affect human traits would be with a genome-wide association study (GWAS). Genetic markers, known as single-nucleotide polymorphism (SNP), are used in GWAS. This raises privacy and security concerns as these genetic markers can be used to identify individuals uniquely. This problem is further exacerbated by a large number of SNPs needed, which produce reliable results at a higher risk of compromising the privacy of participants.
We describe a method using homomorphic encryption (HE) to perform GWAS in a secure and private setting. This work is based on a proposed algorithm. Our solution mainly involves homomorphically encrypted matrix operations and suitable approximations that adapts the semi-parallel GWAS algorithm for HE. We leverage upon the complex space of the CKKS encryption scheme to increase the number of SNPs that can be packed within a ciphertext. We have also developed a cache module that manages ciphertexts, reducing the memory footprint.
We have implemented our solution over two HE open source libraries, HEAAN and SEAL. Our best implementation took minutes for a dataset with samples, over covariates and SNPs.
We demonstrate that it is possible to achieve GWAS with homomorphic encryption with suitable approximations.
Background
Genome-wide association study (GWAS) compares genetic variants, single-nucleotide polymorphisms (SNP), to see if these variants are associated with a particular trait. The model used in GWAS is essentially logistic regression, evaluated one SNP at a time, corrected with covariates like age, height and weight. The number of SNPs analyzed can easily grow up to million. It is estimated that it can take around hours for samples and million SNPs [1].
Some suggest that cloud computing could offer a cost-effective and scalable alternative that allows research to be done, given the exponential growth of genomic data and increasing computational complexity of genomic data analysis. However, privacy and security are primary concerns when considering these cloud-based solutions.
It was shown in by Lin et. al. [2] that as little as to SNPs could identify an individual uniquely. Homer et. al. [3] further demonstrated that even when DNA samples are mixed among other samples, individuals could be identified. In light of these discoveries, regulations concerning biological data are being updated [4]. The privacy and security of DNA-related data are now more important than ever.
Homomorphic Encryption (HE) is a form of encryption where functions, , can be evaluated on encrypted data , yielding ciphertexts that decrypt to . Putting it in the context of GWAS, genomic data can be homomorphically encrypted and sent to a computational server. The server then performs the GWAS computations on the encrypted data, before sending the encrypted outcome to the data owner for decryption. We argue that this would ensure the privacy and security of genomic data: Throughout the entire process, there is no instance where the server can access the data in its raw, unencrypted form, preserving the privacy of the data. Additionally, since the data is encrypted, no adversary would be able to make sense of the ciphertexts. The data is thus secured on the computational server.
Motivated by these concerns, the iDASH Privacy & Security Workshop [5] has organized several competitions on secure genomics analysis since . The aim of these competitions is to evaluate methods that provide data confidentiality during analysis in a cloud environment.
In this work, we provide a solution to Track of the iDASH 2018 competition – Secure Parallel Genome-Wide Association Studies using Homomorphic Encryption. The challenge of this task was to implement the semi-parallel GWAS algorithm proposed by Sikorska et. al. [6], which outperforms prior methods by about times, with HE. This task seeks to advance the practical boundaries of HE, a continuation from last year’s HE task which was to implement logistic regression with HE.
We propose a modification of the algorithm by Sikorska et. al. [6] for homomorphically encrypted matrices. We developed a caching system to minimize memory utilization while maximizing the use of available computational resources. Our solution also leverages on the complex space of the CKKS encoding to store the SNP matrix and this halved the computation time needed by doubling the number of SNPs processed each time.
Within the constraints of the competition, including a virtual machine with 16GB of memory and 200GB disk space, a security level of at least 128 bits and at most 24 hours of runtime, our solution reported a total computation time of 717.20 minutes. Our best implementation using a more efficient HE scheme, which was not available during the competition, achieved a runtime of minutes.
In the following section, we will first define some notations used in this paper. We will begin by describing the CKKS homomorphic encryption scheme that was used to implement the GWAS algorithm. We describe our methods for manipulating homomorphic matrices that are crucial to our solution. We start with our implementation of logistic regression with HE. Following that, we adapted the GWAS algorithm using suitable approximations to simplify the computations for HE, while preserving the accuracy of the model. We also detail some optimizations that were used to accelerate the runtime. Finally, we present our results and provide some discussion about our results.
Notation
Notation for HE
Let be a power-of-two integer and . For some integer , denote . We let be the security parameter where attacks on the cryptosystem require approximately bit operations. We use to represent sampling from a distribution over some set . Let denote the uniform distribution and denote the discrete Gaussian distribution with variance .
Notation for GWAS
The number of samples, covariates and SNPs are denoted as and respectively. Matrices are denoted in bold font uppercase letters. Let the covariates matrix be denoted as and the SNP matrix as . The rows of or represent the covariates or SNPs from one sample respectively. We denote the rows as . Vectors are denoted in bold font lowercase letters. Let the response vector be denoted as . The vector of weights from the logistic model is denoted as and the corresponding vector of probabilities is denoted as . The vector of SNP effects is denoted as . The transpose of a vector is denoted as . We let denote rounding up to the nearest power-of-two.
Methods
Homomorphic Encryption
HE was first proposed by Rivest et. al.[7] more than 40 years ago while the first construction was proposed by Gentry [8] only a decade ago. For this work, we adopt the HE scheme proposed by Cheon et. al.[9], referred to as CKKS, which enables computation over encrypted approximate numbers. As GWAS is a statistical function, the CKKS HE scheme is the prime candidate for efficient arithmetic.
Most HE schemes are based on “noisy” encryptions, which applies some “small” noise to mask messages in the encryption process. For HE, a noise budget is determined when the scheme is initialized and computing on ciphertexts depletes this pre-allocated budget. Once the noise budget is expended, decryption would return incorrect results. The CKKS scheme [9] treats encrypted numbers as having some initial precision, with the masking noise just smaller than the precision. However, subsequent operations on ciphertexts increase the size of noises and reduce the precision of the messages encrypted within. Thus, decrypted results are approximations of their true value.
The noise budget for the CKKS scheme is initialized with the parameter . For every multiplication, the noise budget is subtracted by the integer . The noise budget for a given ciphertext is denoted as . When the message is just encrypted, . When , the noise budget is said to be depleted.
We provide a brief description of the CKKS scheme and highly encourage interested readers to refer to [9] for the full details.
- •
KeyGen():
Let be the initial ciphertext modulus. Let denote the distribution that chooses a polynomial uniformly from , under the condition that it has exactly nonzero coefficients. Sample a secret , random and error . Set the secret key as , public key as where . Finally, sample , and set the evaluation key , where .
- •
Encrypt():
For , sample and . Let and output .
- •
Decrypt():
For , output
- •
Add():
For , compute and output .
- •
Mult():
For ciphertexts and , let , . Compute and output .
- •
Rescale():
For a ciphertext and an integer , output , where .
With the CKKS scheme, we are able to encode complex numbers into a single element in its message spaces, . This allows us to view a ciphertext as an encrypted array of fixed point numbers. Let ,
- •
Encode():
Output .
- •
Decode():
Output .
Informally, maps to the vector , where and for . This is then mapped to an element of with the inverse of the canonical embedding map. is straightforward, an element in is mapped to a -dimensional complex vector with the complex canonical embedding map and then the relevant entries of the vector is taken to be the vector of messages.
The ability to encode multiple numbers into one ciphertext allows us to reduce the number of ciphertexts used and compute more efficiently. We refer to each number encoded as a slot of the ciphertext. This offers a SIMD-like structure where the same computation on all numbers within a ciphertext can be done simultaneously. This means that adding or multiplying two ciphertexts together would be equivalent to adding or multiplying each slot simultaneously.
The ciphertext of the CKKS scheme can also be transformed into another ciphertext whose slots are a permutation of the original ciphertext.
- •
Rotate(): Outputs whose slots are rotated to the right by positions.
Homomorphic Matrix Operations
In this section, we describe our method of encoding matrices with HE. The batching property of the CKKS scheme allows us to treat ciphertexts as encrypted arrays. With this, we propose methods of encoding a matrix with ciphertexts.
Column-Packed (CP) Matrices.
This is our primary method of encoding a matrix. We encrypt each column of a matrix in one ciphertext and therefore a matrix will be represented by a vector of ciphertexts. This method of encoding a matrix was suggested by Halevi and Shoup in [10].
We require a function, Replicate that takes a vector of size and returns vectors , , , where for , is in all positions. This is shown in Figure 1.
We describe in Algorithm 1, a naive version of Replicate. The reader is advised to refer to [10] for details on implementing a faster and recursive variant.
We first define matrix-vector multiplication between a CP matrix and a vector in Algorithm 2. First, we invoke Replicate on the vector. Next, we multiply each column in the left-hand side matrix with its corresponding . Finally, sum up all ciphertexts and this will give the matrix-vector product.
Matrix multiplication between CP matrices is defined as an iterative process over CP-MatVecMult between the left-hand side matrix and the columns of the right-hand side matrix. This is described in Algorithm 3
Column-Compact-Packed (CCP) Matrices.
In the case where the entries of a matrix can fit within a single vector, we concatenate its columns and encrypt that in one ciphertext. For this type of matrix, we are mainly concerned with the function colSum which returns a vector whose entries are the sum of each column. We present the pseudocode in Algorithm 4. This is achieved by a series of rotations and additions. However, we do not rotate for all slots of the vector, but rather , where is the number of rows in the CCP matrix. We note here that the final sums are stored in every slots, starting from the first slot.
Row-Packed (RP) Matrices.
For this encoding, we encrypt rows of a matrix into a ciphertext, representing them with a vector of ciphertexts just like CP matrices. In this work, we only consider matrix-vector multiplication between an RP matrix and a vector. Multiplication of an RP matrix by a CP matrix is a lot like naive matrix multiplication.
To compute the multiplication of an RP matrix with a vector, we define the dot product between two vectors encoded in two ciphertexts in Algorithm 5. For that, we first multiply the ciphertexts together, which yields their component-wise products. Then, we apply rotations to obtain the dot product in every slot of the vector.
With DotProd, we apply it over the rows of the RP matrix with the vector, producing several ciphertexts that each contain the dot product between a row and said vector. Though a series of masks and additions, these separate ciphertexts are combined into the matrix-vector product between an RP matrix and a vector as shown in Algorithm 6.
Row-Expanded-Packed (REP) Matrices.
This method of encoding a matrix is similar to RP matrices, except that each entry is repeated times for some integer that is a power of two. As with RP matrices, REP matrices are represented by vectors of ciphertexts. By encoding a matrix in this manner, we reduce the number of homomorphic operations when multiplying with other matrices. For this paper, we only consider matrix products between CP and REP matrices.
First, we define a function, Duplicate in Algorithm 7. Suppose that a ciphertext has filled slots out of , Duplicate fills the remaining slots with repetitions of the slots. This is shown in Figure 2.
This can be realized using simple rotations and additions.
To compute matrix products between CP and REP matrices, we first apply Duplicate the columns of the CP matrix. Then, we multiply each column in the CP matrix with its corresponding row in the REP matrix. Finally, we sum all the ciphertexts and obtain the product of the matrices in a CCP matrix. This is shown in Algorithm 8.
Logistic Regression with Homomorphic Encryption
The first step in the GWAS algorithm is to solve a logistic model for its weights . There are several solutions [11, 12, 13, 14, 15] that solve a logistic model with HE, given that it was one of the challenges in the iDASH 2017 competition.
Logistic Regression.
Logistic regression estimates the parameters of a binary logistic model. Such models are used to predict the probability of an event occurring given some input features. These models assume that the logarithm of the odds ratio (log-odds) is a linear combination of the input features.
Let denote the probability of an event occurring. The assumption above can be written as
[TABLE]
Rearranging Equation 1, we get
[TABLE]
where and . This is known as the sigmoid function.
Logistic regression estimates the regression coefficients using maximum likelihood estimation (MLE). This likelihood is given as
[TABLE]
where denotes the rows of the covariates matrix . Often, MLE is performed with the log-likelihood
[TABLE]
Maximizing Equation 6 requires an iterative process. Our implementation in solving the logistic model applies the Newton-Raphson method [16]. This is because the Newton-Raphson method is known to converge quadratically [17] and we wish to solve the model with as little iterations as possible.
The Newton-Raphson method iterates over the following equation
[TABLE]
where and are given as
[TABLE]
is defined to be a by diagonal matrix whose entries are for . We remind the reader here that is a by binary response vector that contains the truth labels of each individual. represents the vector of probabilities that is computed for each individual with Equation 2 using of the particular iteration.
A careful derivation of Equations 1, 2, 3, 4, 5 and 6 can be found in [18].
However, there are two non-HE friendly aspects in this algorithm. Firstly, for each iteration, is re-computed with the iteration’s . This is computationally expensive with homomorphic encryption. Secondly, the sigmoid function Equation 2 contains the exponential function, which is not natively supported by HE schemes. Hence, we approximate the Hessian matrix and the sigmoid function in our implementation.
Hessian Matrix Approximation.
We use an approximation for all Hessian matrices as suggested by Böhning and Lindsay [19]. They proposed using
[TABLE]
as a lower bound approximation for all Hessian matrices in solving a logistic model with the Newton-Raphson method. This approximation is also used by Xie et. al. [20] in their distributed privacy preserving logistic regression. We chose to precompute with an open source matrix library Eigen [21]. We then encrypt as an input to the GWAS algorithm.
Sigmoid Function Approximation.
We use the approximation from Kim et. al. [12] who proposed polynomials of degree as approximations of the sigmoid function. We chose the polynomial of degree :
[TABLE]
Our Algorithm.
We described our algorithm for Logistic Regression with HE in Algorithm 9. We encrypt and as CP matrices and as a ciphertext. We initialize in a ciphertext by encrypting a vector of zeros. We first compute with Algorithm 2 and apply Equation 11 on to each slot in the ciphertext. Note here that is now a vector and is represented by one ciphertext. Instead of encrypting , we treat as encrypted as a RP matrix. We thus invoke RP matrix vector multiplication, Algorithm 6 with and . Finally, is updated with Equation 7.
In comparison with prior works that perform secure computation of logistic regression with HE [11, 12, 13, 14, 15], our method is the first to use the Newton-Raphson method. Gradient descent was chosen to in maximizing the log-likelihood, Equation 6, in other implementations.
In [13], a -bit gradient descent method was adopted, with the FV scheme [22]. Bootstrapping is required in this solution. [11] employed the CKKS scheme [9] with gradient descent. They shared two least squares approximations of the sigmoid function. The winning solution of iDASH 2018 [12] used a gradient descent variant - Nesterov Accelerated Gradient and introduced another approximation of the sigmoid function. [15] use bootstrapping to achieve logistic regression for datasets larger than any of the solutions published. A unique solution proposed in [14] attempts to approximate a closed form solution for logistic regression.
Semi-Parallel GWAS with Homomorphic Encryption
The semi-parallel GWAS algorithm proposed by Sikorska et. al.[6] rearranges linear model computations and leverages fast matrix operations to achieve some parallelization and thus better performance. A logistic model is first solved with the covariates matrix. Let be a temporary variable
[TABLE]
where is the weights of the logistic model, is the response vector and is the vector of probabilities from evaluating the sigmoid function Equation 2 with .
The SNP matrix is then orthogonalized with
[TABLE]
and is orthogonalized with
[TABLE]
The estimated SNP effect can then be computed with
[TABLE]
and the standard error can be computed with
[TABLE]
Division here denotes element-wise division between the vectors and .
The main obstacle for HE with the semi-parallel GWAS algorithm is matrix inversion. General matrix inversion is computationally expensive and inefficient in HE. This is mainly because integer division, which is used frequently in matrix inversion, cannot be efficiently implemented in HE. There are two instances where matrix inversion has to be computed. The first occurs in Equation 12 and the second occurs in the orthogonal transformations Equations 13 and 14. In the following paragraphs, we will describe our method for implementing the semi-parallel GWAS algorithm with HE. We will also describe some optimizations that reduce memory consumption and accelerate computations to qualify within the competition requirements.
Inverse of .
We exploit the nature of to compute its inverse with the Newton-Raphson method in HE. Recall that is a by diagonal matrix whose entries are for . Firstly, we represent the diagonal matrix by a vector containing the diagonal entries to reduce storage and computational complexity. Secondly, the inverse of a diagonal matrix is can be obtained by inverting the entries along the main diagonal. This means that can be computed by inverting the slots of . The entries of are given as , where . We claim an upper bound of on the slots of . The proof is as follows: the derivative of is for which gives a maximium. Substituting provides the upper bound of .
We used this information to set a good initial guess of in the Newton-Raphson method. This would reduce the number of iterations needed to obtain an accurate inverse. We describe this algorithm in Algorithm 10.
Modification of Orthogonal Transformations.
We propose modifications to Equations 13 and 14 as is too expensive to be computed in the encrypted domain.
We define a placeholder matrix as
[TABLE]
We proposed a modification, inspired by the Hessian approximation in Equation 10, to the orthogonal transformation of with
[TABLE]
and with
[TABLE]
The estimated SNP effect is now computed with
[TABLE]
and the standard error is
[TABLE]
Complex Space of CKKS ciphertext
For our first optimization, we exploit the scheme’s native support for complex numbers to pack two SNPs into a single complex number, putting one SNP in the real part and another in the imaginary part. This allows us to fit twice as many SNPs in a single ciphertext and cut the runtime by half.
However, in Equation 20 is more difficult to compute with this packing method. Simply squaring the ciphertext does not yield the correct output as slots now contain complex numbers; for some complex number ,
[TABLE]
Instead, we consider multiplying by its complex conjugate . We have
[TABLE]
Extracting the real parts of Equation 22 and Equation 23, we get
[TABLE]
Recall that is a CCP matrix which is represented by one ciphertext with each slot holding one complex numbers encoding two SNPs. Thus, we compute and . We assign
[TABLE]
and
[TABLE]
Optimizations with HEAAN
There are two optimizations that we used with the HEAAN library to reduce the parameters needed and to improve runtime.
For the first optimization, we rescale the ciphertext by a value that is smaller than after every plaintext multiplication. This means each plaintext multiplication is now “cheaper” than a ciphertext multiplication and hence the value of when initialized can be lowered.
The second optimization would be to perform only power-of-two rotations. A rotation by slots is a composition of power-of-two rotations in the HEAAN library. The required power-of-two rotations are the s of the binary decomposition of . Thus, it would be more efficient if we only perform rotations by a power-of-two. We illustrate this with an example. A rotation by slots would require power-of-two rotations as the binary decomposition of is . A rotation by slots would require power-of-two rotations as the binary decomposition of is . This reduces the number of rotations in our implementation.
Batching SNPs
As is too large to be stored in memory when encrypted, we propose to divide column-wise and process batches of SNPs. We show how to compute the maximum number of SNPs that can fit within a batch. Let be the number of SNPs in a batch. Consider , a matrix product between a by and a by matrix. By Algorithm 8, the result is a CCP matrix, whose ciphertext has to have enough slots for elements. For efficiency as described in the previous section, we round the size of each column to the nearest power-of-two and pad the columns with zeroes. Together with the complex space of the HEAAN ciphertext, the maximum number of SNPs that can be processed as a batch is given as
[TABLE]
Smart Cache Module
We consider the largest matrix in our implementation, which is a by matrix. There is an instance where will be stored as CCP matrix (See next section). This means that the ciphertext would need to have at least slots. Consequentially, is at least . This results in a large set of parameters for the HE scheme which translate to a large amount of memory usage.
The next step requires this CCP matrix to be first converted into a CP matrix. This implies that we need to manage ciphertexts where is the number of individuals. This further increase the memory footprint of the algorithm.
Furthermore, the virtual machine that the iDASH organizers provide only has GB RAM. As a result, we choose to move ciphertexts to the hard disk when they are not used for computations.
We designed a cache module that exploits the vectorized ciphertext structure of encrypted matrices. There are 4 threads on the VM provided, of which 2 is used for reading ciphertexts from the disk while 1 is used to write ciphertext into a file. The last thread is used for computation. A ciphertext will be pre-fetched into memory before it is needed for computation, replacing a ciphertext that is no longer needed.
Our Algorithm
We give a detailed walkthrough of our modified semi-parallel GWAS algorithm in Algorithm 11.
First, we perform logistic regression with , and as described in Algorithm 9. We use from logistic regression, together with from the previous iteration to compute and .
Next, compute the inverse of the slots elements in with inverseSlots. Note that is equivalent to multiplying the ciphertexts and . We then compute as given in Equation 19. At this point, we have and which are both vectors, stored in a ciphertext each.
We construct a temporary variable which is a CP matrix to facilitate computations. Here, we choose to encrypt as a REP matrix. The reason for encrypting differently is because multiplying a CP matrix by a RP matrix requires the RP matrix to be first converted into REP form. This process is very inefficient homomorphically and hence we decided to encrypt it directly as a REP matrix. Thus, the product of with is a CP-REP-MatMult as shown in Algorithm 8. At this point, we is a CCP matrix. We then convert into a CP matrix to compute .
As described earlier, we iterate over partial blocks of the SNP matrix, , divided column-wise. Next, compute its orthogonal transformation . We remind the reader here again that is computed with CP-REP-MatMult which produces a CCP matrix, . We compute, separately, the numerator, , and denominator, , of Equation 20 for each .
For , we multiply and slots-wise and duplicate the slots for as many columns in the CCP matrix . The vector-matrix product is now redefined as a ciphertext multiplication, followed by calling colSum over slots.
For , the computation is similar. After squaring the slots of the CCP matrix , we duplicate and perform a slot-wise multiplication. colSum of the resulting CCP matrix is exactly the second part of the vector-matrix product for - the accumulation sum over every slots.
We wish to highlight here that as stated in Q FAQ for the competition, it is acceptable to return and separately [23]. As such, we decrypt and concatenate all s and s respectively instead of performing a costly inversion of . Finally, we divide the two vectors element-wise to obtain the estimated SNP effect, .
Results
We used the provided dataset of users with covariates and SNPs.
The HE library used is the HEAAN library [24], commit id . The HE parameters used are , and . We observed that the HEAAN context based on these parameters utilizes about GB. The context can be thought of as the base memory needed for HE computations. Furthermore, we run on the output with for ciphertext-ciphertext multiplications and for ciphertext-plaintext multiplications to control noise growth. This gives us a security level of about bits based on the LWE estimator provided by Albrecht et. al. [25].
As described earlier, we require at least slots, where . We chose the minimum number of slots needed, slots and set to be . We are able to process a total of SNPs in each batch, based on Equation 27. This gives us a total of batches. We set for the number of iterations in HomLogisticRegression.
We have tabulated the number of sequential homomorphic computations of our modified GWAS algorithm in Table 1(a).
These numbers represent the circuit depth of the GWAS algorithm. We find that a comparison of the number of these computations is a better measure of evaluating a HE program, independent of HE library used.
We report the time taken and memory consumed on two servers: the VM provided by the iDASH organizers and our server.
The machine provided by the iDASH organizers is an Amazon T2 Xlarge or equivalent VM, which has 4 vCPU, 16GB memory, disk size around 200GB [23]. The results are shown in Table 2(a).
For our server, the CPU model used is Intel Xeon Platinum CPU at GHz with cores and the OS used is Arch Linux. The results are shown in Table 3(a).
We evaluated the accuracy of our results with two methods. The first method compares the vectors and , counting the number of entries that are not equal. However, since the CKKS scheme introduces some error upon decrypting, we are unable to get any identical entries. Instead, we opt to count the number of p-values for which our solution differs from the original algorithm by more than some error, . This is shown in Table 4.
The second method would be to plot a scatter diagram whose x-axis represent and y-axis represent . Ideally, if , the best fit line of the scatter plot should be . We compute the line of best fit with the numpy.polyfit function from python [26] and compared against the line . Our HEAAN based solution gives the line . The scatter plot is given in Figure 3.
We port our implementation to the SEAL library [27] which recently released a version of the CKKS scheme that does not require the modulus. We implemented this with cores on our machine. The parameters used are , and . The context generated in this instance is approximately GB. The results of this implementation is given in Table 5(a).
The accuracy of the SEAL implementation based on the first method is tabulated in Table 6.
Our SEAL based solution gives the line . The scatter plot for the results is given in Figure 4.
Discussion
In our submission, we miscalculated the security level, assuming that it fit the -bit requirements while it was actually about bits. This is due to the use of the modulus for the evaluation key, which is a quirk of the HEAAN library [24].
There is also a limit of 256 subjects with our implementation, due to our desire to pack the entire test dataset into a single ciphertext. For a larger number of subjects (up to 512), the matrix will need at least by slots, which means that has to be at least .
We are aware of the limitations in HEAAN, namely the modulus and slower homomorphic operations. However, it was the only publicly available HE library based on the CKKS scheme.
We can see that SEAL’s implementation of the CKKS scheme is superior in terms of runtime. This is because SEAL implemented an RNS-variant of the CKKS, which improves the speed of the algorithm by almost times. The security level of this implementation based on the LWE estimator is about bits.
However, we are unable to execute our GWAS algorithm with SEAL using . The set of parameters that supports the depth of the algorithm with appears to be too large and caused our server to run out of memory. Hence, for the implementation with SEAL, we reduced to . This reduces the depth of the algorithm and hence the parameters that were used. Consequentially, the accuracy of the results has decreased from to .
Conclusions
In this paper, we demonstrated an implementation of a semi-parallel GWAS algorithm for encrypted data. We employed suitable approximations in adapting the semi-parallel GWAS algorithm to be HE-friendly. Our solution shows that the model trained over encrypted data is comparable to one trained over unencrypted data. Memory constraints are shown to be of little concern with our implementation of a smart cache, which reduced memory consumption to fit within the limits imposed. This signifies another milestone for HE, showing that HE is mature enough to tackle more complex algorithms.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Estrada, K., Abuseiris, A., Grosveld, F.G., Uitterlinden, A.G., Knoch, T.A., Rivadeneira, F.: Grimp: a web- and grid-based tool for high-speed analysis of large-scale genome-wide association using imputed data. Bioinformatics (2009)
- 2[2] Lin, Z., Owen, A.B., Altman, R.B.: Genomic research and human subject privacy. Science 305 (5681), 183–183 (2004). doi: 10.1126/science.1095019 . http://science.sciencemag.org/content/305/5681/183.full.pdf
- 3[3] Homer, N., Szelinger, S., Redman, M., Duggan, D., Tembe, W., Muehling, J., Pearson, J.V., Stephan, D.A., Nelson, S.F., Craig, D.W.: Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. https://doi.org/10.1371/journal.pgen.1000167 · doi ↗
- 4[4] Rousseau, D.: Biomedical research: Changing the common rule by david rousseau. PLOS Genetics (2017)
- 5[5] i DASH Privacy & Security Workshop. Last Accessed 15 January 2018. http://www.humangenomeprivacy.org
- 6[6] Sikorska, K., Lesaffre, E., Groenen, P.F., Eilers, P.H.: Gwas on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies. BMC Bioinformatics (2013)
- 7[7] Rivest, R.L., Adleman, L., Dertouzos, M.L.: On data banks and privacy homomorphisms. Foundations of Secure Computation, Academia Press (1978)
- 8[8] Gentry, C.: Fully homomorphic encryption using ideal lattices. In: 41st ACM Symposium on Theory of Computing, pp. 169–178. ACM Press, ??? (2009)
