Secure Computation in Decentralized Data Markets

Fattaneh Bayatbabolghani; Bharath Ramsundar

arXiv:1907.01489·cs.CR·July 3, 2019

TL;DR

This paper presents efficient secure computation protocols for decentralized data markets, enabling privacy-preserving data analysis on sensitive datasets using secure multi-party computation techniques.

Contribution

It introduces novel secure protocols utilizing garbled circuits and homomorphic encryption tailored for decentralized data markets, demonstrating their applicability in healthcare.

Findings

1. Protocols support arbitrary computation
2. Efficient performance on healthcare datasets
3. Applicable to privacy-sensitive data analysis

Abstract

Decentralized data markets gather data from many contributors to create a joint data cooperative governed by market stakeholders. The ability to perform secure computation on decentralized data markets would allow for useful insights to be gained while respecting the privacy of data contributors. In this paper, we design secure protocols for such computation by utilizing secure multi-party computation techniques including garbled circuit evaluation and homomorphic encryption. Our proposed solutions are efficient and capable of performing arbitrary computation, but we report performance on two specific applications in the healthcare domain to emphasize the applicability of our methods to sensitive datasets.

Tables

Table 1: Execution time (ms) and communication (MB) of the LD test for GC.

$M$	$N$	garbling (ms)	evaluation (ms)	#gates	#non-XOR gates	Comm. (MB)
10	200	10.8	6.7	293550	81690	1.3
	400	14.1	7.7	313370	87150	1.4
	800	12.3	7.7	334510	92970	1.5
	1600	16.7	10.5	356970	99150	1.6
100	200	145.4	83.6	2935500	816900	13.1
	400	161.5	93.1	3133700	871500	14.0
	800	124.0	75.7	3345100	929700	14.9
	1600	132.9	80.2	3569700	991500	15.9
1000	200	1108.7	675.4	29355000	8169000	131.0
	400	1269.6	758.7	31337000	8715000	139.7
	800	1391.2	831.8	33451000	9297000	149.0
	1600	1371.4	897.6	35697000	9915000	158.9

Table 2: Execution time and space complexity of the LD test for FHE.

$M$	Execution	Space	Expected execution (batch)
10	54.2 s	18.34 MB	1840 ms
100	9.1 m	183.4 MB	0.18 s
1000	1.48 h	1.82 GB	1.84 s

Table 3: Execution time (ms) and communication (MB) of the LR test for GC.

range (bits)	garbling (ms)	evaluation (ms)	#gates	#non-XOR gates	Comm. (MB)
10	8.2	4.8	198909	106016	5.1
11	14.6	8.7	318717	193056	9.3
12	28.6	17.0	562429	371232	17.8

Equations

$$p=\frac{e^{X\cdot W+b}}{1+e^{X\cdot W+b}}$$

$$N_{A}=2N_{AA}+N_{Aa},\qquad N_{a}=2N_{aa}+N_{Aa}.$$

$$p_{AB}=\frac{N_{AB}}{N},$$

$$p_{AB}=p_{A}p_{B},$$

$$p_{AB}=p_{A}p_{B}+D_{AB},$$

$$\chi_{A,B}^{2}=\frac{2N\cdot D_{AB}^{2}}{p_{A}\cdot p_{a}\cdot p_{B}\cdot p_{b}}=\frac{2N\cdot(N\cdot N_{AB}-N_{A}\cdot N_{B})^{2}}{N_{A}\cdot N_{a}\cdot N_{B}\cdot N_{b}}$$

$$2N\cdot(N\cdot N_{AB}-N_{A}\cdot N_{B})^{2}>\chi_{A,B}^{2}\cdot N_{A}\cdot N_{a}\cdot N_{B}\cdot N_{b}.$$

Taxonomy

Topics: Cryptography and Data Security · Blockchain Technology Applications and Security · Privacy-Preserving Technologies in Data

Full text

Secure Computation in Decentralized Data Markets

Fattaneh Bayatbabolghani and Bharath Ramsundar

Computable

1 Introduction

One of the challenges of building a decentralized data market [6] is providing adequate protection for the privacy of data contributors. Data contributors might be unwilling to contribute sensitive information into a data market if they lack adequate protections for their data. Economic considerations may ease some of these worries, but for high-value datasets more powerful cryptographic tools may be necessary to secure user data.

In this paper, we introduce a scenario where different data contributors (makers) wish to share their data (listings) with buyers who wish to perform specific computations on the aggregated data. We assume makers are not comfortable sharing plaintext data. Therefore, our main goal in this work is to perform computation on protected, aggregated data. In practice, these listings are often not physically stored in a single database (datatrust), nor are they always owned by a single organization.

In previous work we introduced decentralized data markets [13, 6], which provide a powerful framework for constructing datasets with distributed ownership and control. We also introduced the maker/listing/datatrust terminology, which we reuse in this paper. In these scenarios, storing protected listings and performing computation on them is not straightforward. Who encrypts the listings? How is computation done on encrypted data? In this paper, we explore these questions on two healthcare-inspired examples: performing logistic regression on the Breast Cancer Wisconsin dataset [1], and performing the linkage disequilibrium test on GWAS data. We design secure solutions for both computations, but our designed protocols are general and can be used to perform arbitrary computation.

In the following sections, we first provide some background related to the computation of logistic regression [8], linkage disequilibrium [12], and other cryptographic tools [14, 17, 18]. Then we introduce our designed protocols, and at the end provide our experimental results.

2 Background

In this paper we study two sample computational problems: logistic regression (LR) and the linkage disequilibrium (LD) test performed on genome-wide association study (GWAS) data. We design protocols for performing these computations on encrypted data, based on two cryptographic techniques: Homomorphic Encryption (HE) and Garbled Circuit (GC). In the following, we briefly provide the needed background before moving to the design of our proposed protocols.

2.1 Logistic Regression

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. In this paper, our focus is on binary LR where LR is used to predict the relationship between independent variables and a dependent variable where the dependent variable is binary. We can divide the computation of LR into two categories: training and testing. In training, a model is trained based on training samples and parameters are computed. In testing, the trained model is applied on test cases. There are some standard open source tools to perform the training phase of LR such as TensorFlow [2], PyTorch [10], and SKLearn [11]. For the rest of this paper, we assume that we have access to a trained LR model and all its learned parameters and only focus on the implementation of LR for the testing phase.

For LR during the testing phase, if we have fixed dimension $n$ , precomputed parameters $W=(w_{1},\ldots,w_{n})$ and $b$ ( $W$ and $b$ are regression coefficients of a trained model), and a sample $X=(x_{1},\ldots,x_{n})$ , then we can compute probability $p$ as:

$$p=\frac{e^{X\cdot W+b}}{1+e^{X\cdot W+b}}$$

As the formula shows, once $e^{X\cdot W+b}$ is available, the rest of the probability follows from one addition and one division. We focus on computing this quantity on encrypted data in Section 4.
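As a plaintext point of reference for the testing-phase computation above, the following minimal Python sketch evaluates $p$ for one sample. It assumes the standard logistic form $e^{z}/(1+e^{z})$ with $z=X\cdot W+b$; the function name `lr_predict` and the example numbers are illustrative and not taken from the paper, and nothing here is encrypted.

```python
import math


def lr_predict(x, w, b):
    """Return p = e^z / (1 + e^z) for z = x . w + b (plaintext, no encryption)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same dimension n")
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    # Two algebraically equal forms, chosen so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Illustrative (made-up) parameters of a "trained" model and one sample.
W = [0.5, -0.25]
B = 0.0
print(lr_predict([1.0, 2.0], W, B))
```

The only non-linear step is the exponential, which is why the secure protocols concentrate on evaluating it privately.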

2.2 Genome-Wide Association and Linkage Studies

In sections 2.2.1-2.2.2, we provide more background about genomic data and LD test computation [15].

2.2.1 Genomic Background

DNA is a sequence of nucleotides $\{A,C,G,T\}$ . An individual’s collection of genes is called a genotype and the physically observable characteristics of an individual are called a phenotype. A genetic marker is defined as a gene or a DNA segment with a known locus (location) on a chromosome, which is typically used to help link an inherited disease with the responsible gene. Then a set of closely linked genetic markers found in one chromosome that tend to be inherited together is called a haplotype.

A single nucleotide polymorphism (SNP) represents a common type of a genetic variation among people in a single nucleotide that occurs at a specific locus in a genome. One of a number of alternative forms of a gene at a given locus is called an allele. The most common and least common alleles that occur in a given population are called major and minor alleles, respectively. We denote a major allele by a capital letter, e.g., $A$ , and a minor allele by the corresponding lowercase letter, e.g., $a$ . An individual inherits two alleles for each gene, one from each parent. If the two alleles are the same, the individual is homozygous for that gene and is heterozygous otherwise. Based on that information, we distinguish between the following categories: homozygous reference genotype, denoted as $AA$ ; heterozygous genotype, denoted as $Aa$ ; and homozygous variant genotype denoted as $aa$ . We refer to the two alleles inherited for a particular gene as a genotype.

Let $N$ denote the total number of collected alleles in a pool of genes. We then use $N_{A}$ and $N_{a}$ to denote the number of major and minor alleles in the observed population, respectively. Similarly, $N_{AA}$ , $N_{Aa}$ , and $N_{aa}$ denote the number of gene variants of the type $AA$ , $Aa$ , and $aa$ , respectively. They are used to compute values $N_{A}$ and $N_{a}$ as

$$N_{A}=2N_{AA}+N_{Aa},\qquad N_{a}=2N_{aa}+N_{Aa}.$$

In addition, an allele frequency is defined as the fraction of the alleles observed at a given locus in the population that are of this type. In other words, we define the major and minor allele frequencies $p_{A}$ and $p_{a}$ as $p_{A}=N_{A}/N$ and $p_{a}=N_{a}/N$ , respectively. A genotype frequency can be defined analogously.
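The counting rules above can be illustrated with a small Python sketch. It assumes a single biallelic locus and takes $N=N_{A}+N_{a}$, i.e., the total number of collected alleles; the function name `allele_stats` and the genotype counts are hypothetical.

```python
def allele_stats(n_AA, n_Aa, n_aa):
    """Return (N_A, N_a, p_A, p_a) from genotype counts at one locus."""
    n_A = 2 * n_AA + n_Aa  # each AA carries two A alleles, each Aa carries one
    n_a = 2 * n_aa + n_Aa  # each aa carries two a alleles, each Aa carries one
    n = n_A + n_a          # total number of collected alleles
    if n == 0:
        raise ValueError("no alleles observed")
    return n_A, n_a, n_A / n, n_a / n


# Hypothetical population of 100 individuals (200 alleles).
print(allele_stats(30, 50, 20))
```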

2.2.2 Linkage Disequilibrium

Linkage disequilibrium is an important notion in population genetics that occurs when genotypes at two different loci are not independent of each other. In other words, LD is the non-random association of pairs of alleles that often descend from a single ancestral chromosome. Consider two loci $A$ and $B$ with two alleles each ( $A$ , $a$ , $B$ , and $b$ ). There are 9 possible genotypes $AABB$ , $AABb$ , $AAbb$ , $AaBB$ , $AaBb$ , $Aabb$ , $aaBB$ , $aaBb$ , $aabb$ , and there are four haplotypes $AB$ , $Ab$ , $aB$ , $ab$ . Let us use $N_{AB},N_{Ab},N_{aB},$ and $N_{ab}$ as the number of instances of each of the four haplotypes in the observed population. Then, their population frequencies are computed as:

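$$p_{AB}=\frac{N_{AB}}{N},\quad p_{Ab}=\frac{N_{Ab}}{N},\quad p_{aB}=\frac{N_{aB}}{N},\quad p_{ab}=\frac{N_{ab}}{N}.$$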

When the alleles’ frequencies are independent (i.e., we have linkage equilibrium), we expect that:

$$p_{AB}=p_{A}p_{B},\quad p_{Ab}=p_{A}p_{b},\quad p_{aB}=p_{a}p_{B},\quad p_{ab}=p_{a}p_{b},$$

where, as before, $p_{A}=N_{A}/N$ , $p_{a}=N_{a}/N$ and similarly $p_{B}=N_{B}/N$ , $p_{b}=N_{b}/N$ , but now $N_{A}=N_{AB}+N_{Ab}$ , $N_{a}=N_{aB}+N_{ab}$ , $N_{B}=N_{AB}+N_{aB}$ , $N_{b}=N_{Ab}+N_{ab}$ . However, if the alleles are in LD, the formulas become:

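$$p_{AB}=p_{A}p_{B}+D_{AB},\quad p_{Ab}=p_{A}p_{b}-D_{AB},\quad p_{aB}=p_{a}p_{B}-D_{AB},\quad p_{ab}=p_{a}p_{b}+D_{AB}.$$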

The parameter $D_{AB}$ is called the coefficient of LD and can be computed as $D_{AB}=p_{AB}-p_{A}p_{B}$ .

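The chi-square statistic for the hypothesis $H_{0}$ of no disequilibrium (i.e., $D_{AB}=0$ ) is computed as:

$$\chi_{A,B}^{2}=\frac{2N\cdot D_{AB}^{2}}{p_{A}\cdot p_{a}\cdot p_{B}\cdot p_{b}}=\frac{2N\cdot(N\cdot N_{AB}-N_{A}\cdot N_{B})^{2}}{N_{A}\cdot N_{a}\cdot N_{B}\cdot N_{b}}$$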
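The step from the frequency form of the chi-square statistic to the count form is not spelled out in the text. A short derivation, using only the definitions $p_{AB}=N_{AB}/N$, $p_{A}=N_{A}/N$ (and likewise for $a$, $B$, $b$) and $D_{AB}=p_{AB}-p_{A}p_{B}$ given above, is:

```latex
\begin{aligned}
D_{AB} &= \frac{N_{AB}}{N} - \frac{N_{A} N_{B}}{N^{2}}
        = \frac{N \cdot N_{AB} - N_{A} \cdot N_{B}}{N^{2}}, \\
p_{A}\, p_{a}\, p_{B}\, p_{b} &= \frac{N_{A} \cdot N_{a} \cdot N_{B} \cdot N_{b}}{N^{4}}, \\
\frac{2N \cdot D_{AB}^{2}}{p_{A}\, p_{a}\, p_{B}\, p_{b}}
  &= \frac{2N \cdot (N \cdot N_{AB} - N_{A} \cdot N_{B})^{2} / N^{4}}
          {N_{A} \cdot N_{a} \cdot N_{B} \cdot N_{b} / N^{4}}
   = \frac{2N \cdot (N \cdot N_{AB} - N_{A} \cdot N_{B})^{2}}
          {N_{A} \cdot N_{a} \cdot N_{B} \cdot N_{b}}.
\end{aligned}
```

The factors of $N^{4}$ cancel, so the statistic depends only on integer counts.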

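$H_{0}$ is rejected (i.e., LD is present) if $\chi_{A,B}^{2}$ exceeds a particular threshold or, equivalently, if the following division-free inequality holds, with $\chi_{A,B}^{2}$ on the right-hand side standing for that threshold value:

$$2N\cdot(N\cdot N_{AB}-N_{A}\cdot N_{B})^{2}>\chi_{A,B}^{2}\cdot N_{A}\cdot N_{a}\cdot N_{B}\cdot N_{b}.$$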
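The LD test can be sketched in plaintext Python as follows. This is only an illustration under stated assumptions: the function names, the haplotype counts, and the threshold 3.841 (the usual 5% critical value of a chi-square distribution with one degree of freedom) are not taken from the paper. The second function uses the rearranged comparison, which needs only additions, multiplications, and one comparison.

```python
def ld_chi_square(n_AB, n_Ab, n_aB, n_ab):
    """Chi-square statistic for LD, computed from the four haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_a = n_AB + n_Ab, n_aB + n_ab
    n_B, n_b = n_AB + n_aB, n_Ab + n_ab
    denominator = n_A * n_a * n_B * n_b
    if denominator == 0:
        raise ValueError("statistic undefined when an allele count is zero")
    return 2 * n * (n * n_AB - n_A * n_B) ** 2 / denominator


def ld_present(n_AB, n_Ab, n_aB, n_ab, threshold):
    """Division-free form of the test: True means H0 is rejected."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_a = n_AB + n_Ab, n_aB + n_ab
    n_B, n_b = n_AB + n_aB, n_Ab + n_ab
    lhs = 2 * n * (n * n_AB - n_A * n_B) ** 2
    rhs = threshold * n_A * n_a * n_B * n_b
    return lhs > rhs


THRESHOLD = 3.841  # assumed significance threshold, for illustration only
print(ld_chi_square(40, 10, 10, 40), ld_present(40, 10, 10, 40, THRESHOLD))
print(ld_chi_square(25, 25, 25, 25), ld_present(25, 25, 25, 25, THRESHOLD))
```

Avoiding the division matters for secure evaluation, since division is typically far more costly than multiplication in both HE and Boolean circuits.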

2.3 Cryptographic Tools

We design our secure computation protocols using Homomorphic Encryption (HE) and Garbled Circuit (GC). Note that any HE scheme can be used, including additive HE (e.g., Paillier encryption) and fully HE (e.g., lattice-based cryptography), but here we are more interested in exploring fully HE. In the following, we describe HE and GC briefly and then focus on the details of the proposed solution.

2.3.1 Homomorphic Encryption

HE is a type of encryption that allows computation to be performed on encrypted data without revealing any information about the original data. Here, we use public-key HE. Such a scheme is defined by three algorithms ( $\sf Gen$ , $\sf Enc$ , $\sf Dec$ ), where $\sf Gen$ is a key generation algorithm that on input of a security parameter $1^{\kappa}$ produces a public-private key pair $(pk,sk)$ ; $\sf Enc$ is an encryption algorithm that on input of a public key $pk$ and message $m$ produces ciphertext $c$ ; and $\sf Dec$ is a decryption algorithm that on input of a private key $sk$ and ciphertext $c$ produces decrypted message $m$ or the special symbol $\perp$ that indicates failure. For conciseness, we use notation ${\sf Enc}_{pk}(m)$ or ${\sf Enc}(m)$ and ${\sf Dec}_{sk}(c)$ or ${\sf Dec}(c)$ in place of ${\sf Enc}(pk,m)$ and ${\sf Dec}(sk,c)$ , respectively. A semantically secure encryption scheme guarantees that no information about the encrypted message can be learned from its ciphertext with more than a negligible (in $\kappa$ ) probability.
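To make the ( $\sf Gen$ , $\sf Enc$ , $\sf Dec$ ) interface concrete, here is a toy Python sketch of textbook Paillier encryption, the additive HE scheme mentioned as an example above. It is not the scheme used in the experiments (those rely on SEAL), the primes are hard-coded and far too small to be secure, and the helper names `add_ciphertexts` and `mul_plain` are illustrative.

```python
import math
import secrets


def gen(p=1009, q=1013):
    """Gen: return (pk, sk). Toy primes only; real keys use large random primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+); valid for g = n + 1
    return (n, n + 1), (n, lam, mu)


def enc(pk, m):
    """Enc: c = g^m * r^n mod n^2 for a fresh random r coprime to n."""
    n, g = pk
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def dec(sk, c):
    """Dec: m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    n, lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_ciphertexts(pk, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    n, _ = pk
    return (c1 * c2) % (n * n)


def mul_plain(pk, c, k):
    """Raising a ciphertext to a public constant k multiplies the plaintext by k."""
    n, _ = pk
    return pow(c, k, n * n)


pk, sk = gen()
total = add_ciphertexts(pk, enc(pk, 5), enc(pk, 7))
print(dec(sk, total))
```

Multiplying two encrypted values together is not possible in this scheme without interaction, which is the limitation that fully HE removes.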

Note that, in secure computation based on HE, the complexity of a protocol is measured in terms of non-free (expensive) operations. For example, in additive HE, addition is a free operation and multiplication is counted as an expensive operation. Therefore, to optimize a solution we need to minimize non-free operations. We can also state the complexity of a protocol based on HE in terms of the communication and computation complexities of its non-free operations. Fully HE supports arbitrary computation and is a more powerful tool, but it requires choosing a noise budget large enough for the sequential multiplications in a computation, which affects performance. Since we use the SEAL library for implementation, more information about fully HE can be found in [14].

2.3.2 Garbled Circuit

The use of GC allows two parties $P_{1}$ and $P_{2}$ to securely evaluate a Boolean circuit of their choice. That is, given an arbitrary function $f(x_{1},x_{2})$ that depends on private inputs $x_{1}$ and $x_{2}$ of $P_{1}$ and $P_{2}$ , respectively, the parties first represent it as a Boolean circuit. One party, say $P_{1}$ , acts as a circuit generator and creates a garbled representation of the circuit by associating both values of each binary wire with random labels. The other party, say $P_{2}$ , acts as a circuit evaluator and evaluates the circuit in its garbled representation without knowing the meaning of the labels that it handles during the evaluation. The output labels can be mapped to their meaning and revealed to either or both parties.
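The generator/evaluator split can be illustrated on a single AND gate. The Python sketch below is a textbook-style toy, not the JustGarble construction and without free-XOR or half-gates: each table row hides an output label under a SHA-256 pad derived from one pair of input labels, and an all-zero tag (an assumption of this sketch) lets the evaluator recognise the row it can open. The evaluator is simply handed its input labels here; in a two-party run those would be obtained via oblivious transfer.

```python
import hashlib
import secrets

LABEL_LEN = 16
TAG = bytes(16)  # all-zero tag marks a correctly decrypted row (toy convention)


def _pad(label_a, label_b):
    """Derive a 32-byte one-time pad from a pair of input labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(x, y):
    return bytes(i ^ j for i, j in zip(x, y))


def garble_and_gate():
    """Generator: pick two random labels per wire and build the garbled table."""
    labels = {
        wire: [secrets.token_bytes(LABEL_LEN) for _ in range(2)]
        for wire in ("a", "b", "out")
    }
    table = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][bit_a & bit_b]
            pad = _pad(labels["a"][bit_a], labels["b"][bit_b])
            table.append(_xor(pad, out_label + TAG))
    secrets.SystemRandom().shuffle(table)  # hide which row encodes which inputs
    return labels, table


def evaluate(table, label_a, label_b):
    """Evaluator: holds one label per input wire and opens exactly one row."""
    pad = _pad(label_a, label_b)
    for row in table:
        plain = _xor(pad, row)
        if plain[LABEL_LEN:] == TAG:
            return plain[:LABEL_LEN]
    raise ValueError("no row could be decrypted with these labels")


labels, table = garble_and_gate()
result = evaluate(table, labels["a"][1], labels["b"][1])
print(result == labels["out"][1])
```

The evaluator learns an output label but not which bit it encodes unless the generator reveals the mapping for the output wire.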

The fastest currently available approach for circuit generation and evaluation we are aware of is by Bellare et al. [5]. It is compatible with earlier optimizations, most notably the “free XOR” gate technique [9] that allows XOR gates to be processed without cryptographic operations or communication, resulting in virtually no overhead for such gates. A recent half-gates optimization [19] can also be applied to this construction to reduce communication associated with garbled gates. In addition, there are some recent works on GC compilers (e.g., [16, 7]) which are designed based on [5].

An important component of garbled circuit evaluation is 1-out-of-2 Oblivious Transfer (OT). It allows the circuit evaluator to obtain wire labels corresponding to its inputs. In particular, in OT the sender (i.e., circuit generator in our case) possesses two strings $s_{0}$ and $s_{1}$ and the receiver (circuit evaluator) has a bit $\sigma$ . OT allows the receiver to obtain string $s_{\sigma}$ without learning the other string, while the sender learns nothing about $\sigma$ .

Note that, in the two-party setting based on GC, the complexity of an operation is measured in the number of non-free (i.e., non-XOR) Boolean gates because of the free-XOR optimization. Also, some computations, such as shift operations, do not require any gates and are entirely free. Therefore, to obtain an optimized solution, we need to minimize the number of non-XOR gates by relying on free operations wherever possible. Accordingly, we report the complexity of a designed protocol in terms of the number of non-free gates.

3 Designed Protocols

In both of the following protocols, we assume we have access to a Crypto Service Provider (CSP), a trusted third party with access to implementations of cryptographic standards and algorithms. (Such a CSP could possibly be added as a participant in future versions of the Computable protocol [6].) We also assume the presence of a datatrust (DT), makers $o_{i}$ where $i=1,\ldots,n$ , and buyers $s_{j}$ where $j=1,\ldots,m$ . At the end of protocol execution, each $s_{j}$ learns the result of a secure computation.

3.1 Homomorphic Encryption Protocol

In HE, we have access to its three main algorithms ( $\sf Gen$ , $\sf Enc$ , $\sf Dec$ ). In this section, we use the fully HE (FHE) schemes of Brakerski/Fan-Vercauteren (BFV) and Cheon-Kim-Kim-Song (CKKS) as implemented by the SEAL library [14]. We introduce our solution in Protocol 1 and the associated Figure 1.

3.2 Garbled Circuit Protocol

Next, we describe the details of the proposed solution based on GC. We have the same architecture as in Protocol 1, but instead of HE, GC is used as the underlying cryptographic tool. In this setting, the CSP needs to have enough computational power and storage to perform the garbling process. We introduce our solution in Protocol 2 and associated Figure 2.

4 Experimental Results

In this section we evaluate the performance of our solution. The garbled circuit implementations were written in C and used the JustGarble library [5, 4] for circuit garbling and evaluation. Our code supports the half-gates optimization [19]. The FHE implementations used the SEAL library [14]. All GC computations were run on a 3.3 GHz machine and all HE computations on a 2.7 GHz machine; each experiment was repeated 10 times and the mean values are reported.

4.1 Linkage Disequilibrium Results

The results of the GC protocol for the LD test are reported in Table 1. For the LD test, we vary the value of $N$ to demonstrate how this variable affects the performance of the computation. We also vary the number $M$ of SNPs or alleles for which each test is run, with all $M$ instances of each test being executed at the same time.

In addition, we implemented the LD test with the Brakerski/Fan-Vercauteren (BFV) scheme using the SEAL library [14]. Running the LD test takes $54.2$ seconds when $M=10$ , and the runtime grows linearly with $M$ . Further details about the execution time are provided in Table 2. The SEAL library provides the facility to run HE operations in a batch. The LD test is very amenable to batch computation, and our execution becomes about 3000 times faster when all independent operations are run in a batch. In our experiment, we set the polynomial modulus degree to 8192 and the coefficient modulus to 128, and we report the upper bound of the space complexity in Table 2. Note that in FHE, $M$ is the only LD test parameter that matters in the experiments, because the selected FHE parameters already accommodate the range of values of $N$ .

4.2 Logistic Regression Results

For the LR test, the computation becomes more complicated, since the exponentiation operation is not supported by the standard SEAL and JustGarble libraries. One potential solution to implement this operation is by using a private lookup table [3]. In this approach we precompute the values of the exponential function for the desired precision and the range of input values and use private lookup to select the output based on private input.

Consider an exponentiation function ( $\sf Exp$ ) that needs to be evaluated on private input $a$ and in our case, it is defined over fixed-point arithmetic. Let the value of $a$ be in the range $[a_{min},a_{max}]$ with $N$ denoting the number of the elements in the range. Then the approach consists of precomputing the function on all possible inputs and storing the result in an array $Z=(z_{0},{\ldots},z_{N-1})$ . Consequently, evaluation of the function on private $a$ corresponds to privately retrieving the needed element of the array $Z$ using $a$ to determine the index. This procedure is formalized in the protocol $\sf Exp$ below. For further details, see reference [3].

$[b]\leftarrow{\sf Exp}([a],Z=\{z_{i}\}_{i=0}^{N-1})$

1. Compute $[b]\leftarrow{\sf Lookup}(\langle z_{0},{\ldots},z_{N-1}\rangle,[a])$ .
2. Return $[b]$ .
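The table-based idea behind $\sf Exp$ can be mimicked in plaintext Python. The sketch below assumes a fixed-point encoding with `frac_bits` fractional bits and an input range given as encoded integers; these parameters and all function names are illustrative. The selection is written in multiplexer style, touching every table entry so that the access pattern does not depend on the index; in a real protocol the equality bits would be computed on protected values.

```python
import math


def build_exp_table(a_min, a_max, frac_bits):
    """Precompute Z: fixed-point exp() for every encoded input in [a_min, a_max]."""
    scale = 1 << frac_bits
    return [round(math.exp(a / scale) * scale) for a in range(a_min, a_max + 1)]


def mux_lookup(table, index):
    """Select table[index] by scanning all entries, as a multiplexer would."""
    if not 0 <= index < len(table):
        raise ValueError("index outside the precomputed range")
    result = 0
    for i, z in enumerate(table):
        selector = int(i == index)  # 1 for the wanted entry, 0 for all others
        result += selector * z
    return result


def exp_fixed(a, a_min, table):
    """Fixed-point exp of encoded input a via lookup; the index is a - a_min."""
    return mux_lookup(table, a - a_min)


FRAC_BITS = 8                # illustrative precision
A_MIN, A_MAX = -512, 512     # encodes the real interval [-2, 2]
Z = build_exp_table(A_MIN, A_MAX, FRAC_BITS)
print(exp_fixed(256, A_MIN, Z) / (1 << FRAC_BITS))  # approximates e
```

The table has one entry per representable input, so its size, and with it the cost of the multiplexer, doubles with every extra bit of input range.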

This approach can be implemented by using a multiplexer. However, it does not work well for FHE because its performance depends directly on the range of input values. For a larger range, the multiplexer requires more sequential multiplications and therefore a larger noise budget, making the solution less efficient. The private lookup table is nevertheless a reasonable solution for GC-based protocols. In Table 3, we report the performance of LR on GC (testing phase) on the Breast Cancer Wisconsin dataset [1], where the size of inputs is 16 bits and we vary the range of input values (in bits) for the exponentiation operation. This dataset is a binary classification dataset with 30 dimensions and 569 sample data points.

5 Conclusions and Future Directions

In this paper, we design secure, efficient, and general protocols based on homomorphic encryption and garbled circuits to perform computation on sensitive encrypted data in a decentralized data market. We use examples from healthcare to emphasize the applicability of our protocol to sensitive datasets. The designed protocols are general and can be used for arbitrary computation, but we report performance only on our examples of linkage disequilibrium and logistic regression. To the best of our knowledge, our designed protocols are efficiently constructed. Our architecture is especially efficient for the garbled circuit protocol due to the fact that we eliminate oblivious transfer, the most computationally expensive part of GC. Our designed solutions are comparable and competitive with existing protocols including [15].

In addition, our proposed solutions are theoretically scalable to larger volumes of inputs, but achieving sufficient efficiency is challenging. More specifically, for larger inputs we may need a larger noise budget in the HE protocol (the operations to be performed largely determine the noise budget) to be able to do all computations with enough precision, and that may make the solution inefficient in practice. Also, the performance of the private lookup table in the GC protocol depends directly on the size of the table; therefore, using the designed protocols for larger inputs in practice is not as straightforward as in theory.

In the current work, the security of our design relies on the existence of an independent crypto service provider (CSP). The CSP is responsible for generating the public-private key pair in the HE scheme, and generates security parameters and garbles circuits in the GC scheme. In practice though, for many applications, we do not have access to such a trusted third party capable of acting as a CSP. As a future direction, we are working on a solution to eliminate the CSP and handle its role by performing a secure multi-party computation between the makers themselves. This approach may add some overhead to the protocols but it will make our design more broadly applicable for real-world use cases.

Another major limitation of the current system is that each new computation requires a custom software implementation. For our experiments, we had to create custom code for both logistic regression and LD testing. Performing this implementation was nontrivial, and the computation of the exponent for logistic regression required some ingenuity. The construction of a more flexible software framework which can allow for broader classes of computation to be easily implemented is left to future work.

Bibliography

[1] Breast Cancer Wisconsin dataset. Available at: UCI Machine Learning Repository.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[3] F. Bayatbabolghani, M. Blanton, M. Aliasgari, and M. Goodrich. Secure fingerprint alignment and matching protocols. arXiv preprint arXiv:1702.03379, 2017.
[4] M. Bellare, V. Hoang, S. Keelveedhi, and P. Rogaway. The JustGarble library. http://cseweb.ucsd.edu/groups/justgarble/.
[5] M. Bellare, V. Hoang, S. Keelveedhi, and P. Rogaway. Efficient garbling from a fixed-key blockcipher. In IEEE Symposium on Security and Privacy, pages 478–492, 2013.
[6] R. Chen, B. Ramsundar, and R. Robbins. Fair value and decentralized governance of data. https://github.com/computablelabs/computable/blob/master/whitepaper/computable_whitepaper.pdf, 2019.
[7] A. Groce, A. Ledger, A. J. Malozemoff, and A. Yerukhimovich. CompGC: Efficient offline/online semi-honest two-party computation. IACR Cryptology ePrint Archive, 2016:458, 2016.
[8] D. G. Kleinbaum, K. Dietz, M. Gail, and M. Klein. Logistic regression. Springer, 2002.