Drynx: Decentralized, Secure, Verifiable System for Statistical Queries and Machine Learning on Distributed Datasets

David Froelicher, Juan R. Troncoso-Pastoriza, Joao Sa Sousa, and Jean-Pierre Hubaux

arXiv:1902.03785·cs.CR·February 28, 2020


TL;DR

Drynx is a decentralized system that allows privacy-preserving statistical analysis and machine learning on sensitive distributed datasets, ensuring data confidentiality, correctness, and auditability without requiring trust in any single entity.

Contribution

It introduces a modular, efficient framework combining cryptographic techniques for secure, verifiable, and privacy-preserving data analysis on distributed datasets.

Findings

01. Training logistic regression on large distributed data in under 2 seconds

02. Verification of query correctness completed in less than 22 seconds

03. Supports secure computation of various statistical measures and machine learning models



Tables

Table 1. TABLE I: Example set of encoding instantiations. All $DPs$' encodings ($\rho$) are then aggregated such that $Q$ computes $\pi$ at the end.

Operation ($f$)	$\pi$ (on $N$ $DPs$)	$\rho$ ($\mathbf{V_{i}} = [v_{i,1}, \dots, v_{i,d}]$, $c_{i}$)
sum	$\sum_{i=1}^{N} v_{i,1}$	([$\sum_{j=1}^{c_{i}} h_{j}$], $c_{i}$)
mean	$\frac{\sum_{i=1}^{N} v_{i,1}}{\sum_{i=1}^{N} c_{i}}$	([$\sum_{j=1}^{c_{i}} h_{j}$], $c_{i}$)
variance	$\sigma^{2} = \frac{\sum_{i=1}^{N} v_{i,2}}{\sum_{i=1}^{N} c_{i}} - {(\frac{\sum_{i=1}^{N} v_{i,1}}{\sum_{i=1}^{N} c_{i}})}^{2}$	([$\sum_{j=1}^{c_{i}} h_{j}$, $\sum_{j=1}^{c_{i}} h_{j}^{2}$], $c_{i}$)
std. dev.	$\sigma = \sqrt{\sigma^{2}}$	([$\sum_{j=1}^{c_{i}} h_{j}$, $\sum_{j=1}^{c_{i}} h_{j}^{2}$], $c_{i}$)
$AND/OR$	$\sum_{i=1}^{N} v_{i,1} \overset{?}{=} 0$	([$R_{j}$], $c_{i}$) or ([$b_{j}$], $c_{i}$)
min/max	$l/rm_{\neq 0}(\sum_{i=1}^{N} v_{i,1}, \dots, \sum_{i=1}^{N} v_{i,d})$	([$R_{j,1}$, …, $R_{j,d}$], $c_{i}$) or ([$b_{j,1}$, …, $b_{j,d}$], $c_{i}$)
frequ. count	$\sum_{i=1}^{N} v_{i,1}, \dots, \sum_{i=1}^{N} v_{i,d}$	([$fc_{j,1}$, …, $fc_{j,d}$], $c_{i}$)
set int/un	$\sum_{i=1}^{N} v_{i,1}, \dots, \sum_{i=1}^{N} v_{i,d}$	([$R_{j,1}$, …, $R_{j,d}$], $c_{i}$) or ([$b_{j,1}$, …, $b_{j,d}$], $c_{i}$)
cosim	$s(\phi, \bar{\phi}) = \frac{\sum_{i=1}^{N} v_{i,1}}{\sqrt{\sum_{i=1}^{N} v_{i,2}} \sqrt{\sum_{i=1}^{N} v_{i,3}}}$	([$\sum_{j=1}^{c_{i}} \phi_{j} \bar{\phi}_{j}$, $\sum_{j=1}^{c_{i}} \phi_{j}^{2}$, $\sum_{j=1}^{c_{i}} \bar{\phi}_{j}^{2}$], $c_{i}$)
$R^{2}$	$1 - \frac{\sum_{i=1}^{N} v_{i,3}}{\sigma^{2}}$	([$\sum_{j=1}^{c_{i}} y_{j}$, $\sum_{j=1}^{c_{i}} y_{j}^{2}$, $\sum_{j=1}^{c_{i}} {(y_{j} - \hat{y_{j}})}^{2}$], $c_{i}$)

Table 2. TABLE III: Table of Recurrent Symbols.

Symbol	Description
$HDS$, $PDS$	Hospitals & Patients Data Sharing
$\mathcal{G}$, $B$, $p$	Elliptic curve; base point on $\mathcal{G}$, prime
$E_{\Omega}(m) = (C_{1}, C_{2}) = (rB, mB + r\Omega)$	ElG encrypt. of $m$ under key $\Omega$, nonce $r$
$K$	$CNs$ pub. coll. key
($k_{i}$, $K_{i}$)	$CNs$ $CN_{i}$ priv., pub. key
$A$, $A_{1}$, $A_{2}$, $Y_{i}$, $y_{i}$	ZKPs pub. (uppercase), discrete log.
$Q$, $DP$, $N$	Querier, Data Provider, $\#DP$
$CN$, $VN$	Computing & Verifying Node
$f_{h}$	Threshold of honest $VNs$
$\pi$, $\rho$, $\bar{r_{i}}$	linear combi., encoding, records
$\mathbf{V_{i}} = [v_{i,1}, \dots, v_{i,l}]$, $c_{i}$	vector, count
$CTA$, $CTO$, $CTKS$	Coll. Tree Aggr., Obfusc., Key Switch.
$w_{i,1}, w_{i,2}$	$CN_{i}$'s contribution in $CTKS$
$\alpha_{i}$, $s_{i}$	$CN_{i}$ secret random nonce
$CDP$, ($\epsilon$, $\delta$, $\theta$)	Coll. Diff. Privacy & params.
$[b_{l}, b_{u}]$, $[0, u^{l})$	Range, default range
$x_{i}$, ($A_{i,j}$, $Z_{i}$, $H$, $V_{i,j}$, $a_{i,j}$)	Range proof priv., pub. values
$T$, $T_{sub}$	Proofs and sub-proofs verif. thresh.
$N_{da}, N_{i}, D$	Tot. & $DP_{i}$ $\#$ records, dataset dim.
$p_{ver}$, $p_{ver_{sub}}$	proof, sub-proof
$P_{f_{h}}$	prob. of $f_{h}$ $VNs$ verif.

Equations

A_{1} y_{1} + A_{2} y_{2} = A,

f(\bar{r}) \equiv \pi(\{\rho(\bar{r_{i}})\}_{i=1}^{N}),

\rho(\bar{r_{i}}) \equiv (V_{i}, c_{i}),

f_{k}(\bar{r}) \equiv \pi(\{\rho(\bar{r_{i}}, f_{k-1}(\bar{r}))\}_{i=1}^{N}).

P_{f_{h}} = \sum_{i=f_{h}}^{N_{VN}} \binom{N_{VN}}{i} p^{i} (1-p)^{N_{VN}-i},

\begin{bmatrix} n & \sum x_{\mu,1} & \dots & \sum x_{\mu,D} \\ \sum x_{\mu,1} & \sum x_{\mu,1}^{2} & \dots & \sum x_{\mu,1} x_{\mu,D} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_{\mu,D} & \sum x_{\mu,1} x_{\mu,D} & \dots & \sum x_{\mu,D}^{2} \end{bmatrix} \begin{bmatrix} c_{0} \\ c_{1} \\ \vdots \\ c_{D} \end{bmatrix} \approx \begin{bmatrix} \sum y_{\mu} \\ \sum y_{\mu} x_{\mu,1} \\ \vdots \\ \sum y_{\mu} x_{\mu,D} \end{bmatrix}

J(\theta)=\frac{1}{N_{da}}\sum_{\mu=1}^{N_{da}}\Big[-y^{(\mu)}\log(h_{\theta}(x^{(\mu)}))-(1-y^{(\mu)})\log(1-h_{\theta}(x^{(\mu)}))\Big]+lr_{\theta},

J_{a}(\theta)=\Big[\frac{1}{N_{da}}\sum_{\tau=1}^{k}\sum_{r_{1},...,r_{\tau}=0}^{D}a_{\tau}(\theta_{r_{\tau}}\cdots\theta_{r_{\tau}})A_{\tau,r_{1},...,r_{\tau}}-a_{0}\Big]+LR_{\theta},

A_{\tau, r_{1}, \dots, r_{\tau}} = \sum_{\mu=1}^{N_{da}} a_{\tau, r_{1}, \dots, r_{\tau}}^{(\mu)} = \sum_{\mu=1}^{N_{da}} (y^{(\mu)} - y^{(\mu)} (-1)^{\tau} - 1) (x_{r_{1}}^{(\mu)} \dots x_{r_{\tau}}^{(\mu)}),

\begin{aligned}
P_{n} &= P(\sum_{i=1}^{n} R_{i} = 0 \mid \sum_{i=1}^{n-1} R_{i} = 0) \cdot P(\sum_{i=1}^{n-1} R_{i} = 0) + \sum_{a=1}^{\#G-1} P(\sum_{i=1}^{n} R_{i} = 0 \mid \sum_{i=1}^{n-1} R_{i} = a) \cdot P(\sum_{i=1}^{n-1} R_{i} = a) \\
&= \sum_{a=1}^{\#G-1} P(\sum_{i=1}^{n} R_{i} = 0 \mid \sum_{i=1}^{n-1} R_{i} = a) \cdot P(\sum_{i=1}^{n-1} R_{i} = a) \\
&= P(R_{n} = -a) \cdot \sum_{a=1}^{\#G-1} P(\sum_{i=1}^{n-1} R_{i} = a) \\
&= \frac{1}{\#G-1} \cdot \sum_{a=1}^{\#G-1} P(\sum_{i=1}^{n-1} R_{i} = a) = \frac{1}{\#G-1} \cdot (1 - P_{n-1}).
\end{aligned}


Code & Models

Repositories

ldsec/drynx (Official)


Full text


Drynx: Decentralized, Secure, Verifiable System for Statistical Queries and Machine Learning on Distributed Datasets

David Froelicher, Juan R. Troncoso-Pastoriza, Joao Sa Sousa, and Jean-Pierre Hubaux

This work was partially supported by the grant #2017-201 of the Strategic Focal Area “Personalized Health and Related Technologies (PHRT)” of the ETH Domain. D. Froelicher is with the Laboratory for Data Security and DeDiS Laboratory, Ecole Polytechnique Federale de Lausanne, 1015 Lausanne, Switzerland, e-mail: [email protected]. Joao Sa Sousa, Juan R. Troncoso-Pastoriza and Jean-Pierre Hubaux are with the Laboratory for Data Security, Ecole Polytechnique Federale de Lausanne, 1015 Lausanne, Switzerland, e-mail: [email protected].

Abstract

Data sharing has become of primary importance in many domains such as big-data analytics, economics and medical research, but remains difficult to achieve when the data are sensitive. In fact, sharing personal information requires individuals’ unconditional consent or is often simply forbidden for privacy and security reasons. In this paper, we propose Drynx, a decentralized system for privacy-conscious statistical analysis on distributed datasets. Drynx relies on a set of computing nodes to enable the computation of statistics such as standard deviation or extrema, and the training and evaluation of machine-learning models on sensitive and distributed data. To ensure data confidentiality and the privacy of the data providers, Drynx combines interactive protocols, homomorphic encryption, zero-knowledge proofs of correctness, and differential privacy. It enables an efficient and decentralized verification of the input data and of all the system’s computations, and thus provides auditability in a strong adversarial model in which no entity has to be individually trusted. Drynx is highly modular, dynamic and parallelizable. Our evaluation shows that it enables the training of a logistic regression model on a dataset (12 features and 600,000 records) distributed among 12 data providers in less than 2 seconds. The computations are distributed among 6 computing nodes, and Drynx enables the verification of the query execution’s correctness in less than 22 seconds.

Index Terms:

decentralized system, distributed datasets, privacy, statistics, machine learning, homomorphic encryption, zero-knowledge proofs, differential privacy.

I Introduction

To produce meaningful results, statistical and machine-learning analyses often demand large amounts of data. Although data storage and computation costs have dropped over the years, notably due to low-cost and powerful cloud-computing solutions, the sharing of these data is still cumbersome. Massive amounts of data are generated daily to track individuals’ actions, health, shopping habits, interests, political and religious views [1], but privacy concerns and ethical/legal constraints often prohibit or discourage the sharing of personal and sensitive data. In Europe, the new data-protection regulation, General Data Protection Regulation (GDPR) [2], effective since May 2018, requires that (a) the collection and use of personal data can only be done with the consent of the subject and (b) that the data have to be anonymized or encrypted before being shared. This leads to a conundrum, especially in domains such as demography, finance and health, where data have to be shared, e.g., for enabling research, but they also need to be protected to ensure individuals’ fundamental right to privacy. Cross-border data sharing is even more challenging, as the legislations among countries can be heterogeneous, forcing companies to geographically adapt their own privacy measures.

Multiple examples show that even when data can be shared, a centralization of the data can have serious consequences, affecting hundreds of millions of individuals [3, 4]; this was the case with the Equifax breach [4], in which personal information (including social-security numbers and credit-card information) of more than 143 million consumers (about 40% of the US population) was compromised. Centralized solutions are subject to multiple threats as the central database, which stores data from multiple mutually-untrusted sources, constitutes a high-value target for possible attackers and a single point of failure.

Existing solutions for secure databases [5, 6, 7, 8, 9] usually add a cryptographic layer on top of the query engine or focus exclusively on the data-release privacy, e.g., by using differential privacy. However, most of these solutions have a significant performance overhead or are still fully centralized; hence, they either have a single point of failure or do not protect the data during the query execution.

In this context, decentralized data-sharing systems [10, 11, 12, 13, 14, 15] have raised considerable interest and are key enablers for privacy-conscious big-data analysis. By distributing the storage and the computation, thus avoiding single points of failure, these systems enable data sharing and minimize the risks incurred by centralized solutions. Nevertheless, many of these systems rely on honest-but-curious or trusted third-party assumptions that might not provide sufficient guarantees when the data to be shared are highly sensitive, valuable, influential or private. Other solutions with stronger threat models, e.g., UnLynx [16], are limited in the computations they support, e.g., sum only. Moreover, none of these solutions considers the possibility that both computing entities and data providers can be malicious.

Improving upon and using some techniques introduced in UnLynx, we propose Drynx, an operational, decentralized and secure system that enables queriers to compute statistical functions and to train and evaluate machine-learning models on data hosted at different sources, i.e., on distributed datasets. Drynx ensures data confidentiality, data providers’ ( $DPs$ ) privacy and protects individuals’ data from potential inferences stemming from the release of end results, i.e., it ensures differential privacy. It also provides computation correctness. Finally, it ensures that strong outliers, either maliciously or erroneously input by $DPs$ , cannot influence the results beyond a certain limit, and we denote this by results robustness. These guarantees are ensured in a strong adversarial model where no entity has to be individually trusted and a fraction of the system’s entities can be malicious. Drynx relies on interactive protocols, homomorphic encryption, zero-knowledge proofs of correctness and distributed differential privacy. It is scalable, dynamic and modular: Any entity can leave or join the system at any time and Drynx offers security features or properties that can be enforced depending on the application, e.g., differential privacy.

In this paper, we make the following contributions:

• We propose Drynx, an efficient, modular and parallel system that enables privacy-preserving statistical queries and the training and evaluation of machine-learning regression models on distributed datasets.

• We present a system that provides data confidentiality and individuals’ privacy, even in the presence of a strong adversary. It ensures the correctness of the computations, protects data providers’ privacy and guarantees robustness of query results.

• We propose techniques that enable full and lightweight auditability of query execution. Drynx relies on a new efficient distributed solution for storing and verifying proofs of query validity, computation correctness, and input data ranges. We exemplify and evaluate the implementation of this solution by using a blockchain.

• We propose and implement an efficient, modular and multi-functionality query-execution pipeline by

– introducing Collective Tree Obfuscation, a new distributed protocol that enables a collective and verifiable obfuscation of encrypted data;

– presenting multiple data-encoding techniques that enable distributed computations of advanced statistics on homomorphically encrypted data. We propose new encodings, and improvements and adaptations of previously introduced private-aggregation encodings to our framework and security model;

– adapting an existing zero-knowledge scheme for input-range validation to our security model;

– proposing a new construction of the Key Switching protocol introduced in UnLynx [16], improving both its performance and capabilities.

To the best of our knowledge, Drynx is the only operational system that provides the aforementioned security and privacy guarantees. The Drynx implementation is fully available at www.github.com/ldsec/drynx.

II Related Work

Centralized systems for privacy-preserving data sharing [8, 17, 18, 19] and trusted-hardware based solutions [20] usually require one entity, i.e., a central entity or a hardware provider, to be trusted, which constitutes a single point of failure. Even though these systems can be more efficient than their decentralized counterparts, they often require a centralization or outsourcing of the data storage, which goes against regulations or is cumbersome to achieve [21] and can be inappropriate for sensitive data. In Drynx, we avoid these issues by decentralizing data-storage, computation and correctness verification, thus efficiently distributing trust.

In order to execute queries and compute statistics on distributed datasets, multiple decentralized solutions [10, 12, 14, 22, 23, 24, 25] rely on techniques that have a high expressive power, such as secret sharing and garbled circuits. These solutions are often flexible in the computations they offer but usually assume (a) honest-but-curious computing parties and (b) no collusion or a 2-party model. Furthermore, they do not provide a way to check the computations undertaken in the system. Although they might efficiently distribute trust, their strong honesty assumptions are risky when the data or the computed statistics are highly sensitive. Bater et al. [10] enable the evaluation of various SQL queries on datasets hosted by a set of distrustful data providers, but both the data providers and the computing entity are trusted to follow the protocol. Corrigan-Gibbs and Boneh [26] propose Prio, a system that ensures privacy as long as one computing entity out of $n$ is honest, but it only guarantees end results robustness in the case where the involved parties are all honest-but-curious. Moreover, Prio does not protect against DPs colluding among themselves or with the computing nodes. In Drynx, no entity has to be individually trusted in order to provide both privacy and robustness.

Systems relying on homomorphic encryption [11, 13, 16, 27, 28, 29] are often limited in the functionalities they offer (e.g., sum only). They present high-performance overhead in comparison with their less secure counterparts or still rely on honest-but-curious parties. In our previous work, we presented UnLynx [16], a decentralized system that enables the computation of (only) sums on distributed datasets and ensures $DPs$ ’ privacy and data confidentiality. UnLynx assumes $DPs$ to be honest-but-curious and, unlike Drynx, it does not ensure end results robustness. Moreover, UnLynx does not provide a practical solution for auditability. In this work, we show how to overcome these limitations and provide a system that enables secure computations of multiple operations in a stronger threat model.

There are multiple solutions proposed for the problem of training machine-learning models on distributed data in a privacy-preserving way [13, 27, 30, 31, 32, 33, 34, 35, 36]. Mohassel and Zhang [30] propose a two-party solution, SecureML; it enables the training of specific models, e.g., linear regression. Boura et al. [31] present a solution that relies on a novel and more flexible approximation of the logistic regression function but assumes honest-but-curious parties. Nikolaenko et al. [27] and Juvekar et al. [32] combine homomorphic encryption and garbled circuits to perform private ridge-regression and neural-network inference, respectively. Aono et al. [33] and Kim et al. [13] rely on homomorphic encryption to train an approximated logistic regression function. Zheng et al. [36] combine homomorphic encryption and distributed convex optimization, in their system called Helen, in order to collaboratively train linear models. Recently, multiple solutions based on federated learning (relying on differential privacy and edge computing) have been proposed [24, 37, 38, 39, 40, 41, 42]. These solutions aim at protecting the resulting model from inference attacks [43, 44]. Some of these works [37, 39] assume a trusted party that holds the data, trains the machine-learning model, and performs the noise addition to achieve differential privacy guarantees. Other works [24, 29, 38, 45, 46] propose solutions for distributed settings in which the parties exchange differentially private model parameters with the help of an untrusted server that trains a collective global model. These approaches are computationally efficient but usually require very high privacy budgets to obtain a useful collective model (due to the noise addition); hence it is unclear what privacy protection they achieve in practice [47]. 
To this end, some works attempt to obtain more useful models in the distributed setting by combining differential privacy with homomorphic encryption [40, 41] or multi-party computation techniques [42]. However, most of these solutions are specifically tailored, parameterized and optimized for a given operation, e.g. gradient descent, and would require a redesign if used for different operations. Finally, they assume a weaker threat model with honest-but-curious computing parties and, unlike Drynx, they do not enable verification of computation correctness and results robustness.

III Background

We introduce Drynx’s main components and two exemplifying use cases. We describe the cryptographic tools that we use to distribute trust and workload. We present the blockchains that we use to implement our solution to ensure Drynx’s correctness and auditability. Finally, we introduce the notion of differential privacy and verifiable shuffle, which are at the core of our solution to ensure individuals’ privacy.

III-A Use Cases

We illustrate Drynx’s utility in the medical sector, as it is a paradigmatic example where privacy is paramount and data sharing is needed. Recently, multiple initiatives have emerged to realize the promise of personalized medicine and to address the challenges posed by the increasing digitalization of medical data [48, 49, 50]. In this context, the ability to share highly sensitive medical data while protecting patients’ privacy is becoming of primary importance. We illustrate the possible use of Drynx in two specific settings that cover most medical data sharing scenarios: (1) Hospital Data Sharing ( $HDS$ ), where multiple hospitals enable statistical computations and the training of machine-learning models across their datasets of patients (e.g., [50, 51]), and (2) Personal Data Sharing ( $PDS$ ), where a medical institute runs studies, e.g., on heart issues, by directly computing on data collected from people’s wearables (e.g., [52, 53]).

III-B ElGamal Homomorphic Encryption

Drynx requires an additively homomorphic cryptosystem; we choose to rely on the Elliptic Curve ElGamal (ECEG) [54], which enables an efficient use of zero-knowledge proofs for correctness [55]. However, Drynx’s functionality is not bound to this choice and can be achieved with other cryptosystems. ECEG relies on the difficulty of computing a discrete logarithm in a finite field; in this case, an Elliptic Curve subgroup of $\mathbb{Z}_{p}$ , with $p$ a big prime. The encryption of a message $m\in\mathbb{Z}_{p}$ is $E_{\Omega}(m)=(rB,\ mB+r\Omega)$ , where $r$ is a uniformly-random nonce in $\mathbb{Z}_{p}$ , $B$ is a base point on an elliptic curve $\mathcal{G}$ and $\Omega$ a public key. The table of symbols is presented in Appendix A. The additive homomorphic property states that $E_{\Omega}(\alpha m_{1}+\beta m_{2})=\alpha E_{\Omega}(m_{1})+\beta E_{\Omega}(m_{2})$ for any messages $m_{1}$ and $m_{2}$ and for any scalars $\alpha$ and $\beta$ . In order to decrypt a ciphertext $(rB,\ mB+r\Omega)$ , the holder of the corresponding private key $\omega$ ( $\Omega=\omega B$ ) multiplies $rB$ and $\omega$ , yielding $\omega(rB)=r\Omega$ , and subtracts this point from $mB+r\Omega$ . The result $mB$ is then mapped back to $m$ , e.g., by using a hash table. Drynx relies on fixed-point representation to encrypt floating values.
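The encryption, homomorphic combination and table-based decryption described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Drynx's implementation: for brevity it replaces the elliptic-curve group with the order-1019 subgroup of $\mathbb{Z}_{2039}^{*}$ written multiplicatively (so $rB$ becomes `G**r` and point addition becomes multiplication), the parameters are far too small to be secure, and all function names are hypothetical.

```python
import secrets

# Toy parameters (assumption, NOT secure): P = 2*Q + 1 is a safe prime and
# G = 4 generates the subgroup of prime order Q. This multiplicative group
# stands in for the elliptic curve; "m*B + r*Omega" becomes G^m * pk^r.
P, Q, G = 2039, 1019, 4

# Lookup table mapping G^m back to m (the "hash table" used for decryption).
DLOG = {pow(G, m, P): m for m in range(Q)}

def keygen():
    """Return (private key omega, public key Omega = G^omega)."""
    sk = secrets.randbelow(Q - 1) + 1
    return sk, pow(G, sk, P)

def encrypt(pk, m):
    """E(m) = (G^r, G^m * pk^r) with a fresh random nonce r."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), (pow(G, m % Q, P) * pow(pk, r, P)) % P

def add(c, d):
    """Homomorphic addition: E(m1) + E(m2) encrypts m1 + m2."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P

def scale(c, alpha):
    """Homomorphic scalar multiplication: alpha * E(m) encrypts alpha * m."""
    return pow(c[0], alpha, P), pow(c[1], alpha, P)

def decrypt(sk, c):
    """Remove the mask c1^sk from c2, then map G^m back to m via the table."""
    gm = (c[1] * pow(c[0], Q - sk, P)) % P  # c1^(Q - sk) equals c1^(-sk)
    return DLOG[gm]

# Usage: check E(3*5 + 2*7) = 3*E(5) + 2*E(7).
sk, pk = keygen()
combined = add(scale(encrypt(pk, 5), 3), scale(encrypt(pk, 7), 2))
print(decrypt(sk, combined))  # 29
```

Messages live modulo the group order, so results wrap around once they exceed it; with realistic parameters the lookup table would cover only the range of values expected by the application.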

III-C Zero-Knowledge Proofs

Universally-verifiable zero-knowledge proofs (ZKPs) can be used to ensure computation integrity and to prove that encrypted data are within given ranges. In Drynx, we choose to verify computation integrity by using the proofs for general statements about discrete logarithms, introduced by Camenisch and Stadler [55]. These proofs enable a verifier to check that the prover knows the discrete logarithms $y_{1}$ and $y_{2}$ of the public values $Y_{1}=y_{1}B$ and $Y_{2}=y_{2}B$ and that they satisfy a linear equation

$A_{1}y_{1}+A_{2}y_{2}=A,$

where $A$ , $A_{1}$ , $A_{2}$ are public points on $\mathcal{G}$ . This is done without revealing any information about $y_{1}$ or $y_{2}$ .

The input-range validation is done by relying on the proofs proposed by Camenisch and Chaabouni [56], with which we can prove that a secret message $m$ lies in a given range $[0,u^{l})$ with $u$ and $l$ integers, without disclosing $m$ . The prover writes the base-u decomposition of its secret value $m$ and commits to the u-ary digits by using the verifier signatures on these digits. The $l$ created commitments prove to the verifier that $m\in[0,u^{l})$ . We present this proof, adapted to our framework, in Algorithm 1. Finally, both proofs can be made non-interactive through the Fiat-Shamir heuristic [57].
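To make the discrete-logarithm statement concrete, the following sketch proves knowledge of $y_{1}$ and $y_{2}$ such that $Y_{1}=y_{1}B$ , $Y_{2}=y_{2}B$ and $A_{1}y_{1}+A_{2}y_{2}=A$ , made non-interactive with a Fiat-Shamir hash. It is a generic Schnorr-style sigma protocol written under assumptions, not the exact Camenisch-Stadler construction or Drynx's code: the elliptic curve is replaced by a tiny, insecure multiplicative group (so $y_{1}B$ becomes `G**y1`), and the names `prove`, `verify` and `challenge` are hypothetical. The range proof of [56] is not covered.

```python
import hashlib
import secrets

# Toy group (assumption, NOT secure): subgroup of prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4

def challenge(*values):
    """Fiat-Shamir: derive the challenge by hashing the public transcript."""
    data = ",".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q

def prove(y1, y2, A1, A2):
    """Prover side. A1 and A2 must be public elements of the order-Q subgroup."""
    Y1, Y2 = pow(G, y1, P), pow(G, y2, P)
    A = pow(A1, y1, P) * pow(A2, y2, P) % P      # the linear relation
    w1, w2 = secrets.randbelow(Q), secrets.randbelow(Q)
    T1, T2 = pow(G, w1, P), pow(G, w2, P)        # commitments for Y1, Y2
    TA = pow(A1, w1, P) * pow(A2, w2, P) % P     # commitment for A
    c = challenge(A1, A2, A, Y1, Y2, T1, T2, TA)
    z1, z2 = (w1 + c * y1) % Q, (w2 + c * y2) % Q
    return (Y1, Y2, A), (T1, T2, TA, z1, z2)

def verify(A1, A2, statement, proof):
    """Verifier side: recompute the challenge and check all three relations."""
    Y1, Y2, A = statement
    T1, T2, TA, z1, z2 = proof
    c = challenge(A1, A2, A, Y1, Y2, T1, T2, TA)
    return (pow(G, z1, P) == T1 * pow(Y1, c, P) % P
            and pow(G, z2, P) == T2 * pow(Y2, c, P) % P
            and pow(A1, z1, P) * pow(A2, z2, P) % P == TA * pow(A, c, P) % P)

# Usage with arbitrary public elements A1, A2 and secrets y1, y2.
A1, A2 = pow(G, 5, P), pow(G, 11, P)
statement, proof = prove(123, 456, A1, A2)
print(verify(A1, A2, statement, proof))  # True
```

The same responses `z1` and `z2` are checked against both the public values and the linear relation, which is what ties the discrete logarithms of $Y_{1}$ and $Y_{2}$ to the equation.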

III-D Interactive Protocols

Interactive protocols can be used to distribute the computations and the trust among multiple computing nodes $CNs$ . In Drynx, each $CN_{i}$ possesses a private-public key pair ( $k_{i}$ , $K_{i}$ ) where $k_{i}$ is a uniformly-random scalar in $\mathbb{Z}_{p}$ and $K_{i}=k_{i}B$ is a point on $\mathcal{G}$ . The $CNs$ ’ collective public key is $K=\textstyle\sum_{i=1}^{\#CN}K_{i}$ . The corresponding secret key $k=\textstyle\sum_{i=1}^{\scriptstyle\#CN}k_{i}$ is never reconstructed such that a message encrypted by using $K$ can be decrypted only with the participation of all $CNs$ . An attacker would have to compromise all $CNs$ in order to decrypt a message. As shown in Section V, to produce the intended results, Drynx protocols require the participation of all the $CNs$ .
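The effect of a collective key can be illustrated with a short sketch: the public keys are combined into $K$ , the secret key is never assembled, and a ciphertext under $K$ only opens when every $CN$ contributes. The code is a simplified illustration under assumptions: it uses a tiny, insecure multiplicative group in place of the elliptic curve, it shows a plain joint decryption (Drynx itself relies on dedicated protocols such as key switching, described later), and all function names are hypothetical.

```python
import secrets

# Toy group (assumption, NOT secure): subgroup of prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4
DLOG = {pow(G, m, P): m for m in range(Q)}  # maps G^m back to m

def cn_keygen():
    """Key pair (k_i, K_i = G^k_i) of one computing node."""
    k = secrets.randbelow(Q - 1) + 1
    return k, pow(G, k, P)

def collective_key(public_keys):
    """K = 'sum' of all K_i (a product in this multiplicative toy group)."""
    K = 1
    for K_i in public_keys:
        K = K * K_i % P
    return K

def encrypt(K, m):
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), pow(G, m % Q, P) * pow(K, r, P) % P

def partial_decrypt(k_i, c1):
    """CN_i's contribution: the inverse of its share of the mask, c1^(-k_i)."""
    return pow(c1, Q - k_i, P)

def combine(ciphertext, shares):
    """Multiply c2 by all contributions; correct only if no share is missing."""
    gm = ciphertext[1]
    for s in shares:
        gm = gm * s % P
    return DLOG[gm]

# Usage: three CNs, one ciphertext under the collective key.
keys = [cn_keygen() for _ in range(3)]
K = collective_key([pub for _, pub in keys])
ct = encrypt(K, 42)
shares = [partial_decrypt(k, ct[0]) for k, _ in keys]
print(combine(ct, shares))  # 42
```

Dropping any one share leaves a residual mask on the plaintext, which is the reason an attacker would need to compromise all $CNs$ .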

III-E Blockchains

A blockchain is usually a public, append-only data structure that is maintained in a distributed manner by a set of nodes and serves as an immutable ledger [58, 59]. Its main applications are in cryptocurrencies [59, 60], but it is also used in other domains, e.g., health care [61]. Data are bundled into blocks that are validated through the consensus [62, 63] of the maintaining nodes. Each block contains a pointer (i.e., a cryptographic hash) to the previous valid block, a timestamp, a nonce, and application-specific data. The chain of these blocks forms the blockchain.
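The block structure described above (hash pointer, timestamp, nonce, application data) can be sketched as follows. This is a minimal single-process illustration with hypothetical names; it deliberately omits consensus among the maintaining nodes and is not the blockchain implementation used by Drynx.

```python
import hashlib
import json
import time

GENESIS_POINTER = "0" * 64  # placeholder pointer for the first block

def block_hash(block):
    """Cryptographic hash of a block's canonical JSON serialization."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data, nonce=0):
    """Append a block that points to the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else GENESIS_POINTER
    chain.append({"prev": prev, "timestamp": time.time(),
                  "nonce": nonce, "data": data})

def is_valid(chain):
    """Check that every block points to the hash of its predecessor."""
    if chain and chain[0]["prev"] != GENESIS_POINTER:
        return False
    return all(chain[i]["prev"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

# Usage: store verification results, then tamper with an old block.
chain = []
append_block(chain, {"proof": "range", "verified": True})
append_block(chain, {"proof": "aggregation", "verified": True})
print(is_valid(chain))               # True
chain[0]["data"]["verified"] = False
print(is_valid(chain))               # False
```

Modifying an earlier block changes its hash and breaks the pointer stored in its successor, which is what makes stored data tamper-evident once later blocks exist.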

III-F Differential Privacy

Differential privacy is an approach for the privacy-preserving reporting of results on statistical datasets, introduced by Dwork [64]. This approach guarantees that a given randomized statistic, $\mathcal{M}(DS)=R$ , computed on a dataset $DS$ , behaves similarly when computed on a neighbor dataset $DS^{\prime}$ that differs from $DS$ in exactly one element. More formally, ( $\epsilon$ , $\delta$ )-differential privacy [65] is defined by $\Pr\left[\mathcal{M}(DS)=R\right]\leq\exp(\epsilon)\cdot\Pr\left[\mathcal{M}(DS^{\prime})=R\right]+\delta$ , where $\epsilon$ and $\delta$ are privacy parameters: the closer to 0 they are, the higher the privacy level is. ( $\epsilon$ , $\delta$ )-differential privacy is often achieved by adding noise to the output of a function $f(DS)$ . This noise can be drawn from the Laplace distribution with mean 0 and scale $\frac{\Delta f}{\epsilon}$ , where $\Delta f$ , the sensitivity of the original real-valued function $f$ , is defined by $\Delta f=\max_{DS,DS^{\prime}}{||f(DS)-f(DS^{\prime})||_{1}}$ . Other mechanisms, e.g., relying on a Gaussian distribution, have also been proposed [66, 67].
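As a small illustration of the Laplace mechanism just described, the sketch below perturbs a query result with noise of scale $\frac{\Delta f}{\epsilon}$ . It shows the generic textbook mechanism with hypothetical function names, run by a single party on plaintext; it is not the collective, encrypted noise-selection procedure that Drynx builds on top of it.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale): the difference of two i.i.d. Exp(1)
    variables follows a Laplace(0, 1) distribution."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Usage: a counting query has sensitivity 1, since adding or removing
# one record changes the count by at most 1.
rng = random.Random(0)
noisy_count = laplace_mechanism(true_value=100, sensitivity=1.0,
                                epsilon=0.5, rng=rng)
print(noisy_count)
```

A smaller $\epsilon$ yields a larger noise scale, which is the privacy/utility trade-off controlled by the privacy parameters.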

III-G Verifiable Shuffles

To randomly select a value from a public list of noise values and to ensure differential privacy, we rely on a verifiable shuffle [68, 69, 70, 71]. We implemented and use the verifiable shuffle of ElGamal pairs described by Neff [69]. This protocol takes as input a list of $\chi$ ElGamal pairs $(C_{1,i},\ C_{2,i})$ and outputs pairs $(\bar{C}_{1,i},\ \bar{C}_{2,i})$ such that for all $1\leq i\leq\chi$ , $(\bar{C}_{1,i},\bar{C}_{2,i})=(C_{1,\upsilon(i)}+r^{\prime\prime}_{\upsilon(i)}B,C_{2,\upsilon(i)}+r^{\prime\prime}_{\upsilon(i)}\Omega)$ , where $r^{\prime\prime}_{\upsilon(i)}$ is a re-randomization factor, $\upsilon$ is a permutation and $\Omega$ is a public key. The permutation $\upsilon$ changes the order of the ElGamal pairs, and $r^{\prime\prime}_{\upsilon(i)}$ modifies the ciphertext encrypting a message $m$ in such a way that its decryption still outputs $m$ . As a result, an adversary who knows neither the decryption key nor $\upsilon$ and the $r^{\prime\prime}_{\upsilon(i)}$ is unable to link any ciphertext $(\bar{C}_{1,i},\bar{C}_{2,i})$ back to a ciphertext $(C_{1,i},\ C_{2,i})$ . Neff provides a method for proving that such a shuffle is done correctly, i.e., that there exist a permutation $\upsilon$ and re-randomization factors $r^{\prime\prime}_{i,j}$ such that $output=\text{SHUFFLE}_{\upsilon,r^{\prime\prime}_{i,j}}(input)$ , without revealing anything about $\upsilon$ or $r^{\prime\prime}_{i,j}$ . This is achieved by using honest-verifier zero-knowledge proofs, introduced by Neff [68, 69].
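The shuffle operation itself (permutation plus re-randomization) can be sketched as below. The example covers only the transformation of the ElGamal pairs; Neff's zero-knowledge proof of correctness is omitted. It is written under assumptions: a tiny, insecure multiplicative group replaces the elliptic curve, and the function names are hypothetical.

```python
import secrets

# Toy group (assumption, NOT secure): subgroup of prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4
DLOG = {pow(G, m, P): m for m in range(Q)}  # maps G^m back to m

def encrypt(pk, m):
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), pow(G, m % Q, P) * pow(pk, r, P) % P

def decrypt(sk, c):
    return DLOG[c[1] * pow(c[0], Q - sk, P) % P]

def shuffle(pairs, pk):
    """Permute the pairs and re-randomize each one with a fresh factor r''.
    Adding an encryption of zero (G^r'', pk^r'') leaves the plaintext intact."""
    order = list(range(len(pairs)))
    secrets.SystemRandom().shuffle(order)      # secret permutation upsilon
    shuffled = []
    for i in order:
        c1, c2 = pairs[i]
        r2 = secrets.randbelow(Q - 1) + 1      # re-randomization factor
        shuffled.append((c1 * pow(G, r2, P) % P, c2 * pow(pk, r2, P) % P))
    return shuffled

# Usage: shuffle an encrypted list of public noise values.
sk = secrets.randbelow(Q - 1) + 1
pk = pow(G, sk, P)
noise_values = [0, 1, 2, 3, 5, 8]
encrypted = [encrypt(pk, v) for v in noise_values]
output = shuffle(encrypted, pk)
print(sorted(decrypt(sk, c) for c in output) == sorted(noise_values))  # True
```

The output ciphertexts decrypt to the same multiset of values, but none of them is bit-identical to an input ciphertext, so positions cannot be matched by inspection.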

IV System Overview

In this section, we describe the system and threat models, before presenting Drynx’s functionality and security requirements.

IV-A System Model

The system model is represented in Figure 1. For simplicity, we describe here the logical roles in Drynx, and in Section VIII we discuss the fact that a physical node can simultaneously play multiple roles.

A querier $Q$ can execute a statistical query and the training and evaluation of a machine-learning model on distributed datasets held by $DPs$ . The $CNs$ collectively handle the computations in the system; i.e., from $Q$ ’s perspective, they emulate a central server and provide answers to her queries. The verifying nodes’ ( $VNs$ ) role is to provide auditability; they collectively verify the query execution and immutably store the corresponding proofs. They enable an auditor, e.g., $Q$ or an external entity, to easily verify (audit) the correctness of the query execution.

In Drynx’s typical workflow, the query is defined by the querier $Q$ and is then broadcast to the $CNs$ and $DPs$ . The $DPs$ answer with their encrypted responses that are then collectively aggregated and processed by the $CNs$ , before the result is sent to $Q$ . We assume that the used data formats are sufficiently homogeneous among different $DPs$ and that the $DPs$ are able to interpret the queries, e.g., there is a common ontology of attributes and the query-language is agreed-on during system setup.

An exemplifying instantiation of this system model in the $HDS$ scenario (Section III-A) would feature the $CNs$ as universities that want to enable researchers ( $Q$ ) to compute on data held by multiple hospitals ( $DPs$ ). $VNs$ can be independent or governmental institutions ensuring that data protection regulations are respected.

We assume that the system’s topology and public information, e.g., public keys, are known by all entities. Authentication and authorization are out of scope of this paper and we briefly discuss them in Section VIII.

IV-B Threat Model

We assume a strong threat model:

•

Queriers. They are considered malicious as they can try to infer information about the $DPs$ from the queries’ end results or by colluding with other entities in the system.

•

Computing Nodes. We consider an Anytrust model [72], which means that all Drynx’s security and privacy guarantees (Section IV-D) are ensured, as long as at least one of the $CNs$ is honest-but-curious (or plain honest).

•

Data Providers. The $DPs$ are considered malicious as they can try to produce an incorrect answer to a query in order to bias the final results. They can also collude with other nodes to infer information about other $DPs$ or about a query’s end results.

•

Verifying Nodes. We assume that a threshold number of the $VNs$ is honest. This threshold, e.g., $f_{h}=2f+1$ out of $f_{t}=3f+1$ , where $f_{t}$ is the number of $VNs$ , is defined depending on the consensus algorithm [62, 63] that is used to ensure a correct and immutable storage of the proofs’ verification results.

IV-C Functional Requirements

Drynx enables the computation on distributed datasets of any operation in the family of encodable operations. An encodable operation can be separated in two parts: the $DPs$ ’ local computations and the collective aggregation. In the collective part, the computations are executed on encrypted data and are thus limited by the homomorphism in the used cryptographic scheme, e.g., additions and/or multiplications. $DPs$ ’ computations are executed locally and are therefore not limited.

Definition 1.

An encodable operation $f$ computed among $N$ $DPs$ is defined by:

$f(\bar{r})=\pi\big(\rho(\bar{r_{1}}),...,\rho(\bar{r_{N}})\big)$

in which the encoding $\rho$ is defined by

$\rho(\bar{r_{i}})=(\mathbf{V_{i}},c_{i})$

where $\mathbf{V_{i}}=[v_{i,1},...,v_{i,d}]$ is a vector of $d$ values computed on a set of $c_{i}=|\bar{r_{i}}|$ records, where $|.|$ stands for cardinality. $\bar{r}$ is the set of all distributed datasets’ records, $\bar{r_{i}}$ is the set of records that belong to $DP_{i}$ , and $\pi$ is a polynomial combination of the outputs of the encodings $\rho$ . The encodings are defined as locally computed functions on the subsets ( $\bar{r_{i}}$ ) of each $DP_{i}$ . It is also possible to express an encodable operation as a recursive function:

[TABLE]

In Drynx, for any specific operation $f$ , each $DP_{i}$ creates an encoding $\rho$ computed on its set of records $\bar{r_{i}}$ . Then, $\pi$ is executed in two parts: the $CNs$ first aggregate the outputs of all $DPs$ ’ encodings ( $\sum_{i=1}^{N}\{\rho(\bar{r_{i}})\}$ ) and, if needed, the querier post-processes $\pi$ on the aggregated result (e.g., if $\pi$ involves information-preserving operations not executable by the $CNs$ under homomorphic encryption).

We give here an instantiation of Definition 1 that enables the computation of the average, and in Section VII we show how an encoding can be instantiated to enable the computation of: sum, count, frequency count, average, variance, standard deviation, cosine similarity, min/max, AND/OR and set intersection/union, and the training and evaluation of linear and logistic regression models.

For example, if $Q$ wants to compute the average ( $f$ ) heart rate over multiple patients across hospitals ( $HDS$ (Section III-A)), each hospital ( $DP_{i}$ ) answers with the encoding of its (encrypted) local sum of each patient’s heart rate ( $h$ ): $\rho(\bar{r_{i}})\equiv([\sum_{j=1}^{c_{i}}h_{i,j}],c_{i})$ . These encodings are then (homomorphically) added across all hospitals, and $Q$ can (decrypt and) compute the global average by using $\pi=\sum_{i=1}^{N}v_{i,1}/\sum_{i=1}^{N}c_{i}$ . We remark here that whereas $\rho$ and $\pi$ are application dependent, the workflow is common to all the possible operations.
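The split between local encoding, collective aggregation and post-processing can be sketched in plaintext as follows. Encryption is omitted, and the function names and the heart-rate values are illustrative assumptions.

```python
def rho(records):
    """Local encoding at one DP: ([sum of values], number of records)."""
    return [sum(records)], len(records)

def aggregate(encodings):
    """What the CNs compute: component-wise sums (under encryption in Drynx)."""
    d = len(encodings[0][0])
    v = [sum(enc[0][j] for enc in encodings) for j in range(d)]
    c = sum(enc[1] for enc in encodings)
    return v, c

def pi_average(v, c):
    """Post-processing by the querier after decryption."""
    return v[0] / c

# Hypothetical heart-rate records held by three hospitals.
hospitals = [[60, 70, 80], [90], [65, 75]]
v, c = aggregate([rho(r) for r in hospitals])
average = pi_average(v, c)
```

Only `rho` and `pi_average` change from one operation to another; `aggregate` is the common part of the workflow.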

Finally, in Drynx, an auditor can efficiently audit a query execution. Moreover, the proofs required for auditability are produced such that their creation does not affect the query runtime.

IV-D Security Requirements

Drynx must ensure:

•

Data confidentiality. The data input by the $DPs$ have to remain confidential at any time. Only $Q$ is able to see the query answer.

•

$DPs$ ’ privacy. No entity is able to infer information about one single $DP$ or about any individual storing his data in a $DP$ ’s database.

•

Query Execution Correctness. We consider the query execution to be correct when both results robustness and computation correctness requirements are met:

–

Results robustness. The query results are protected against strong outliers, either maliciously or erroneously input by the $DPs$ .

–

Computation correctness. Any computation undertaken by the $CNs$ is correctly executed.

V Drynx Design

To overcome the limitations in existing works and meet the requirements presented in the previous section, we propose a novel system model in which we enable query auditability by introducing $VNs$ . Additionally, Drynx provides multiple functionalities in a stronger threat model by relying on $DPs$ that encode locally computed results proven to be within a certain range. It limits the trust in $DPs$ by controlling that their results are in these pre-defined ranges. We propose a system that remains generic and practical while operating in a threat model stronger than existing works. We discuss now the design of this system.

In Drynx’s Security Design (Section V-A), we show how we build Drynx to meet all its security requirements:

•

In Section V-A1, we introduce a simple query-execution pipeline enabling Drynx’s functionalities and protecting data confidentiality.

•

In Section V-A2, we build upon the previously introduced query-execution pipeline and explain how to ensure $DPs$ ’ privacy by introducing the new concept of a neutral encoding. This enables a $DP$ to privately choose whether to answer a query. We also explain how Drynx handles bit-wise operations and maintains $DPs$ ’ privacy. Finally, we introduce distributed differential privacy that is used to ensure that no entity infers information about a single $DP$ or individual from the query end results.

•

In Section V-A3, we show how we provide auditability in an efficient way by relying on a set of $VNs$ . We describe how Drynx ensures results robustness by leveraging range proofs and how all Drynx’s computations can be verified by relying on proofs of correctness.

In Drynx’s Optimized Design (Section V-B), we discuss how to optimize Drynx’s performance:

•

In Section V-B1, we present Drynx’s full query-execution pipeline. We show how multiple parts of the query execution and verification can be run concurrently, thus optimizing Drynx’s runtime.

•

In Section V-B2, we introduce a tradeoff between security and performance by enabling a probabilistic verification of the query execution.

V-A Drynx Security Design

We present Drynx’s core security architecture.

V-A1 Data Confidentiality

First, we introduce a confidential distributed data-sharing system (Figure 2) that can run the same operations as Drynx, but only meets one of the security requirements: data confidentiality.

We describe the query execution protocol, and sketch the proof of confidentiality for this system. Afterwards, we describe how to enhance this construction to meet Drynx’s other security requirements without breaking data confidentiality.

1. Initialization. Each $CN_{i},DP_{i}$ and $Q_{i}$ generates its own private-public key pair $(k_{i},K_{i})$ . The $CNs$ ’ public keys are then summed up in order to create $K$ , the $CNs$ ’ public collective key that is used to encrypt all the processed data.

2. Query. $Q$ formulates the query that is broadcast in clear through the $CNs$ to the $DPs$ . Although the querier could directly communicate with the $DPs$ , our choice simplifies the communication scheme and the synchronisation inside the system, as the $CNs$ have to know the query and receive the $DPs$ ’ inputs to perform the computations in the remaining steps. The query defines the operation, the attributes on which the operation is computed, the participating $DPs$ and (optionally) the filtering conditions. Drynx works independently of the query language. We illustrate its use with a SQL-like query to compute the average heart rate among patients for which data are held by $n$ $DPs$ :

SELECT average $heart\_rate$ ON $DP_{1},...,DP_{n}$ WHERE $patient\_state$ = 'hypertensive'

3. Retrieval & Encoding. The $DPs$ compute their local answer by following $\rho$ , which is defined in the operation encoding (Definition 1). For this purpose, they first locally retrieve the corresponding data.

4. Encryption. The $DPs$ encrypt their encoded answer under $K$ and send the corresponding ciphertexts back to the $CNs$ .

5. Collective Tree Aggregation ( $CTA$ ). The $CNs$ collectively aggregate all $DPs$ ’ responses by executing a $CTA$ protocol relying on the Collective Aggregation protocol defined in UnLynx [16]. The $CNs$ are organized into a tree structure such that each $CN$ waits to receive the aggregation results from its children and sums them up before passing the result on to its own parent.

6. Collective Tree Key Switching ( $CTKS$ ). The $CNs$ collectively convert the aggregated result, encrypted under $K$ , to the same result encrypted under $Q$ ’s public key $K^{\prime}$ , without ever decrypting. This protocol (Protocol 6) is a new construction of the Key Switching proposed in UnLynx [16]. Conceptually, each $CN$ partially decrypts $m$ (i.e., the term $-(C_{1})k_{i}$ in the computation in step 2) and re-encrypts it with $Q$ ’s public key $K^{\prime}$ (i.e., the term $+\alpha_{i}K^{\prime}$ in step 2).

[TABLE]

We improve the efficiency of $CTKS$ by changing the way the ciphertexts are transformed and by organizing the $CNs$ in a tree structure, thus reducing its execution time. In this structure, multiple $CNs$ can perform their local operations (3 scalar multiplications and 1 addition) in parallel, and the $CTA$ requires $\#CN-1$ aggregations and communications between the nodes. We show the computational complexity of all Drynx protocols in Table II.

7. Decryption. $Q$ decrypts and decodes the query results.
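The pipeline above can be traced end to end with a minimal sketch. It assumes a toy multiplicative group in place of the elliptic curve (so point additions become modular products), runs the $CNs$ in a plain loop rather than a tree, and uses hypothetical function names; it illustrates the algebra only and produces no proofs.

```python
import secrets

P, Q, G = 2039, 1019, 4  # toy group (assumption): order-Q subgroup of Z_P^*

def rand_scalar():
    return secrets.randbelow(Q - 1) + 1

def encrypt(pub, m):
    r = rand_scalar()
    return pow(G, r, P), (pow(G, m, P) * pow(pub, r, P)) % P

def add(ct_a, ct_b):
    """Homomorphic addition (component-wise product in this notation)."""
    return (ct_a[0] * ct_b[0]) % P, (ct_a[1] * ct_b[1]) % P

def key_switch(ct, cn_privs, querier_pub):
    """Each CN removes its key share (-C1 * k_i) and adds alpha_i * K'."""
    c1, c2 = ct
    new_c1, new_c2 = 1, c2
    for k_i in cn_privs:
        alpha = rand_scalar()
        new_c1 = (new_c1 * pow(G, alpha, P)) % P
        contrib = (pow(c1, Q - k_i, P) * pow(querier_pub, alpha, P)) % P
        new_c2 = (new_c2 * contrib) % P
    return new_c1, new_c2

def decrypt(priv, ct, max_m=1000):
    gm = (ct[1] * pow(ct[0], Q - priv, P)) % P
    for m in range(max_m + 1):  # brute-force the small discrete log
        if pow(G, m, P) == gm:
            return m
    raise ValueError("plaintext out of range")

def run_query(dp_values, n_cns=3):
    cn_privs = [rand_scalar() for _ in range(n_cns)]
    collective_pub = 1
    for k_i in cn_privs:  # K is the sum of the CNs' public keys
        collective_pub = (collective_pub * pow(G, k_i, P)) % P
    q_priv = rand_scalar()
    q_pub = pow(G, q_priv, P)
    agg = encrypt(collective_pub, 0)
    for v in dp_values:  # CTA: aggregate the encrypted responses
        agg = add(agg, encrypt(collective_pub, v))
    return decrypt(q_priv, key_switch(agg, cn_privs, q_pub))
```

For instance, `run_query([120, 85, 97])` returns the sum 302 without any single party ever holding a key that decrypts the individual responses.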

Security Arguments. We show that, as long as one $CN$ is honest, an adversary who controls the remaining $CNs$ , $DPs$ and $Q$ cannot break data confidentiality. Without loss of generality, we assume that at least one $DP$ is honest, as only in this case there is data to protect from the adversary. We sketch the proof by relying on the real/ideal simulation paradigm [73] and show that an adversary cannot distinguish a “real” world experiment, in which the adversary is given “real” data (sent by honest $DPs$ ), and an “ideal” world experiment, in which the adversary is given data (e.g., random) generated by a simulator. It can be shown that the $DPs$ send encrypted data that are never decrypted before being aggregated and re-encrypted ( $CTKS$ ) under $Q$ ’s public key. Therefore, due to the cryptosystem’s semantic security, the adversary cannot distinguish between a simulation and a real experiment. It can be seen that data confidentiality is thus ensured during end-to-end query execution:

In Retrieval & Encoding, the $DPs$ operate only on their local data and no external data is seen by any malicious party. In Encryption, the $DPs$ encrypt their responses with $K$ and these responses are aggregated, still under encryption, in $CTA$ . The (summed) ciphertexts cannot be decrypted unless all $CNs$ collude, which is not possible as they follow an Anytrust model. Finally, in $CTKS$ (Protocol 6), a ciphertext is switched from $K$ to $Q$ ’s public key such that $Q$ can decrypt:

•

In $CTKS$ steps 1-3, the ciphertext is encrypted under $K$ and thus cannot be decrypted without the collusion of all $CNs$ .

•

In $CTKS$ step 4, the ciphertext is always $\textstyle(\tilde{C}_{1},\tilde{C}_{2})=(\tilde{r}B,mB+\tilde{r}K^{\prime})$ , where $\tilde{r}=\textstyle\sum_{i=0}^{t}\alpha_{i}$ and $0\leq t\leq\#CN$ , and can only be decrypted if the $t$ $CNs$ collude with $Q$ , who is the intended recipient of the message.

V-A2 DPs’ Privacy

Drynx protects $DPs$ ’ and individuals’ privacy by ensuring that (a) each $DP$ can privately decide whether to answer a query, (b) only the result of the operation, as defined by the operation encoding, is disclosed to $Q$ , and (c) no entity can infer information about a single $DP$ or individual.

Neutral Response

If a $DP$ determines that a query can jeopardize its privacy, it can choose to not respond, or answer with a neutral response, thus hiding its refusal to participate in the query without distorting the query results. For this purpose, we define a neutral response:

Definition 2.

A $DP_{i}$ sends a neutral response by defining its response encoding (Definition 1) by $\rho(\bar{r_{i}})\equiv(\mathbf{O},0)$ , where $\mathbf{O}$ is the neutral vector such that $\mathbf{W}+\mathbf{O}=\mathbf{W}$ with $\mathbf{W}$ being any encoding vector; $c_{i}=0$ as $DP_{i}$ computes on 0 records.

In Section VII, we describe how a neutral response can be generated for each listed encoding.

Security Arguments. A $DP$ not answering a query would suggest (leak) to other entities that this query is too sensitive for it. $DPs$ ’ responses are always encrypted and, due to the indistinguishability property of the underlying cryptosystem, a neutral response is indistinguishable from a non-neutral one, thus effectively hiding the $DP$ ’s refusal.
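A short plaintext sketch of Definition 2 for a sum-based encoding with $d=1$ follows; the values are illustrative assumptions. In Drynx the encodings are encrypted, so the neutral one cannot be told apart from the others.

```python
def aggregate(encodings):
    """Component-wise sum of the vectors and of the record counts."""
    d = len(encodings[0][0])
    return ([sum(v[j] for v, _ in encodings) for j in range(d)],
            sum(c for _, c in encodings))

NEUTRAL = ([0], 0)  # (O, 0): neutral vector and zero records

answers = [([210], 3), ([90], 1)]
with_refusal = answers + [NEUTRAL]  # a third DP declines privately
```

Both `aggregate(answers)` and `aggregate(with_refusal)` give `([300], 4)`, so the refusal leaves the final result unchanged.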

Privacy-Preserving Bit-wise Operations

In Drynx, $DPs$ ’ responses are summed through the available additive homomorphism; if these responses are binary, the result of the sum can leak to $Q$ more than the operation result. For example, when an OR operation is executed over a set of $DPs$ , $Q$ should only know if the answer is $true$ (1) or $false$ (0). Nevertheless, if the $DPs$ ’ responses are naively summed, $Q$ gets the number of $DPs$ that answered ‘1’ and ‘0’. To overcome this issue, we propose the Collective Tree Obfuscation ( $CTO$ ) protocol, detailed in Protocol V-A2. For bit-wise operations, $CTO$ is run between steps $CTA$ and $CTKS$ of the query execution. In $CTO$ , the $CNs$ collectively obfuscate a ciphertext by multiplying it with a random secret.

$CTO$ enables privacy-preserving bit-wise operations in Drynx as a ‘1’ is obfuscated to a random value whereas ‘0’ is preserved. To know the result of the operation, $Q$ only checks if the final value is ‘0’ or not.

[TABLE]

Security Arguments. Protocol V-A2 does not hinder the confidentiality of $m$ and indeed obliviously and statistically obfuscates $m$ . The confidentiality relies on the cryptosystem’s semantic security, as $m$ remains encrypted during the whole protocol execution. A multiplicative blinding of $m$ in $\mathbb{Z}_{p}$ is defined by $s\cdot m$ , where $s$ is a secret scalar value in $\mathbb{Z}_{p}$ . The output of the $CTO$ protocol is the encryption of $(\textstyle\sum s_{i})\cdot m$ . We can rewrite $(\textstyle\sum s_{i})\cdot m$ by separating the contributions of the honest $CNs$ $h$ (at least one $CN$ due to our Anytrust model assumption) and the malicious $CNs$ $e$ : $(\textstyle\sum_{i\in h}s_{i}+\textstyle\sum_{i\in e}s_{i})\cdot m=(\textstyle\sum_{i\in h}s_{i})\cdot m+(\textstyle\sum_{i\in e}s_{i})\cdot m$ . Even if an adversary knows $(\textstyle\sum_{i\in e}s_{i})\cdot m$ , the other term $(\textstyle\sum_{i\in h}s_{i})\cdot m$ ensures a multiplicative blinding of $m$ in $\mathbb{Z}_{p}$ .
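The effect of the obfuscation on an encrypted sum can be sketched as follows, again in a toy multiplicative group (an assumption) where the scalar multiplication $s_{i}C$ becomes an exponentiation. Because the toy group is tiny, the sketch resamples the secrets if their sum is a multiple of the group order, an event that is negligible at realistic sizes. Function names are hypothetical.

```python
import secrets

P, Q, G = 2039, 1019, 4  # toy group (assumption)

def encrypt(pub, m):
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), (pow(G, m, P) * pow(pub, r, P)) % P

def draw_secrets(n):
    """One secret s_i per CN, with a non-degenerate overall blinding factor."""
    while True:
        s = [secrets.randbelow(Q - 1) + 1 for _ in range(n)]
        if sum(s) % Q != 0:
            return s

def obfuscate(ct, cn_secrets):
    """Combine the CNs' contributions s_i * C into an encryption of (sum s_i) * m."""
    c1, c2 = 1, 1
    for s in cn_secrets:
        c1 = (c1 * pow(ct[0], s, P)) % P
        c2 = (c2 * pow(ct[1], s, P)) % P
    return c1, c2

def decrypts_to_zero(priv, ct):
    """The querier only tests whether the plaintext is 0."""
    return (ct[1] * pow(ct[0], Q - priv, P)) % P == 1
```

An encrypted 0 still decrypts to 0 after obfuscation, whereas an encrypted count such as 3 decrypts to a blinded non-zero value, so $Q$ learns the OR result but not how many $DPs$ answered ‘1’.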

Distributed Differential Privacy

Drynx relies on the Collective Differential Privacy ( $CDP$ ) protocol, introduced in UnLynx [16], to ensure differential privacy and to prevent information inference about some $DPs$ and/or individuals from the query results. For completeness, we briefly present the $CDP$ (Protocol V-A2) and refer to [16] for more details. The choice of parameters depends on the application’s privacy policy and is out of the scope of this paper.

[TABLE]

Security Arguments. We observe that the list of noise values is verifiably generated from the differential privacy parameters and that all the $CNs$ privately shuffle the values. This protocol’s security is analyzed in detail in UnLynx [16].

V-A3 Query Execution Correctness

We first describe how Drynx provides auditability by enabling an efficient verification of the query execution correctness. The latter is achieved by guaranteeing results robustness and computation correctness. The first is ensured by limiting the $DPs$ ’ values to be in a specific range (by means of range proofs) and the second by using ZKPs for all the $CNs$ computations.

Auditability

To provide an efficient solution for the query verification, Drynx relies on a set of $VNs$ that verify the query correctness in parallel to its execution and without affecting its runtime. After each operation, $Q$ , the $CNs$ and $DPs$ create proofs of correct computations or value range that they sign with their private key (to provide authentication). Their signed proofs are sent to all the $VNs$ . This enables an efficient query execution as the proof creation and verification are executed independently from it.

In order to implement this solution, we can rely on the distributed architecture of the $VNs$ and can provide integrity and immutability by using a blockchain, i.e., the proof blockchain. This enables the public and immutable storage of both the query and its verification results. Moreover, it enables an efficient and lightweight verification of the query correctness. An auditor, e.g., $Q$ , has only to request the block corresponding to the query, to verify the $VNs$ ’ signatures and to check the query verification results. We detail this in Protocol V-A3 and show an example of the proof blockchain in Figure 3.

[TABLE]

Security Arguments. If an entity trusts a threshold $f_{h}$ of the $VNs$ , it can verify the correct execution of the query by checking the corresponding block in the proof blockchain. The verifier can check that $f_{h}$ nodes agree on the correctness of the proofs. A block is created for every query, even if the proofs are wrong, thus enabling any entity to determine which parties were involved in incorrectly computed queries. Otherwise, as all the proofs are universally verifiable and stored by all $VNs$ , an auditor, not trusting $f_{h}$ of the $VNs$ , can request the proofs from a subset of them and check the proofs by itself.

Results Robustness

If the querier defines a query with range boundaries on the $DPs$ ’ values, the $DPs$ are requested to create proofs of range by following the algorithm detailed in Algorithm 1. This algorithm is built by adapting the $[0,u^{l})$ -range proof scheme proposed by Camenisch et al. [56] to the Anytrust model. In this algorithm, the prover, i.e., $DP$ , writes its secret value $m$ in base- $u$ and commits to the $u$ -ary digits by using the $CN_{i}$ s’ signatures on these digits ( $A_{i,b}$ in Algorithm 1). The $l$ created commitments complete the proof. To adapt this algorithm to the Anytrust model, the $DP$ must compute multiple proof elements, i.e., $c$ , $V_{i,j}$ , $a_{i,j}$ , by combining all $CNs$ ’ signatures, i.e., $Z_{i}$ , $A_{i,b}$ . This ensures that the $DP$ uses at least one $CN$ ’s signature for which it does not know the underlying secret. The same transformation in [56] can be applied to generalize the proof to any range $[b_{l},b_{u})$ .
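The plaintext arithmetic behind the range proof can be sketched as follows; no commitments or signatures are computed. The reduction of $[b_{l},b_{u})$ to two $[0,u^{l})$ statements is the standard one and is assumed here to match the transformation referred to above; the function names are hypothetical.

```python
def to_base_u(m, u, l):
    """Write m as l digits in base u (least-significant first); needs 0 <= m < u**l."""
    if not 0 <= m < u ** l:
        raise ValueError("value outside [0, u^l)")
    digits = []
    for _ in range(l):
        m, d = divmod(m, u)
        digits.append(d)
    return digits

def from_base_u(digits, u):
    return sum(d * u ** j for j, d in enumerate(digits))

def arbitrary_range_witnesses(m, b_l, b_u, u, l):
    """Show m in [b_l, b_u) via m - b_l and m - b_u + u**l, both in [0, u**l)."""
    if b_u - b_l > u ** l:
        raise ValueError("range wider than u^l")
    return to_base_u(m - b_l, u, l), to_base_u(m - b_u + u ** l, u, l)
```

A value outside $[b_{l},b_{u})$ makes one of the two shifted values fall outside $[0,u^{l})$, so the digit decomposition (and hence the proof) cannot be completed.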

Security Arguments. Both the correctness and the zero-knowledge property of the range proof are proven by Camenisch et al. [56].

These proofs are universally verifiable and sound in the Anytrust model. The latter comes from the fact that the elements depending on the $CNs$ ’ secrets $x_{i}$ are computed as a combination of all their public signatures. As at least one $CN_{i}$ is honest-but-curious, one of the $x_{i}$ is unknown (not revealed) to the $DP$ (prover).

Computation Correctness

In order to ensure the correctness of the query execution, each computation executed by a $CN$ has to be proven correct.

•

Collective Tree Aggregation. The $CNs$ publish the input ciphertexts to be aggregated together with the resulting ciphertexts; these inputs and outputs constitute the ZKP.

•

Collective Tree Obfuscation. The $CNs$ produce an obfuscation proof by relying on Expression (1) in Section III-C. Each $CN_{i}$ multiplies $C$ by $s_{i}$ to obtain the obfuscated ciphertext $(C^{\prime}_{1},C^{\prime}_{2})$ with (a) $C^{\prime}_{1}=s_{i}C_{1}$ and (b) $C^{\prime}_{2}=s_{i}C_{2}$ . For both equations, $y_{1}=s_{i}$ is the discrete logarithm; we have the public values $A=C^{\prime}_{1}$ , $A_{1}=C_{1}$ for (a) and $A=C^{\prime}_{2}$ , $A_{1}=C_{2}$ for (b), which constitute the proof.

•

Collective Differential Privacy. In this protocol, each $CN$ sequentially executes a Neff shuffle and produces the corresponding ZKP of correctness described in Section III-G. This proof basically contains the input and output lists, the public key encrypting the ciphertexts, and commitment values.

•

Collective Tree Key Switching. The $CNs$ create the ZKP by applying Equation (1) in Section III-C, in which we have $y_{1}=k_{i}$ , $y_{2}=\alpha_{i}$ , the discrete logarithms of $k_{i}B=K_{i}$ and $\alpha_{i}B$ , respectively. All points $K_{i}$ , $\alpha_{i}B$ , $A=w_{i,2}$ , $A_{1}=-rB$ and $A_{2}=K^{\prime}$ are made public and do not leak any information about the underlying secrets.

Security Arguments. We rely on proofs that are universally verifiable and zero-knowledge. They do not affect data confidentiality beyond what can be inferred from the proven facts themselves.
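As an illustration of the kind of statement proven for the obfuscation step, the sketch below uses a Chaum-Pedersen-style proof made non-interactive with a hash-derived challenge. This is a substituted, well-known technique and may differ from Expression (1) of Section III-C; the toy group and all function names are assumptions.

```python
import hashlib
import secrets

P, Q, G = 2039, 1019, 4  # toy group (assumption)

def challenge(*elems):
    data = ",".join(str(e) for e in elems).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q

def prove_same_scalar(c1, c2, s):
    """Prove that the outputs (c1^s, c2^s) share the same secret exponent s."""
    o1, o2 = pow(c1, s, P), pow(c2, s, P)
    w = secrets.randbelow(Q - 1) + 1          # one-time nonce
    t1, t2 = pow(c1, w, P), pow(c2, w, P)     # commitments
    e = challenge(c1, c2, o1, o2, t1, t2)
    z = (w + e * s) % Q                       # response
    return (o1, o2), (t1, t2, z)

def verify_same_scalar(c1, c2, outputs, proof):
    o1, o2 = outputs
    t1, t2, z = proof
    e = challenge(c1, c2, o1, o2, t1, t2)
    return (pow(c1, z, P) == (t1 * pow(o1, e, P)) % P and
            pow(c2, z, P) == (t2 * pow(o2, e, P)) % P)
```

Anyone holding the public inputs, outputs and proof can run `verify_same_scalar`, which mirrors the universal verifiability required of the proofs sent to the $VNs$.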

V-B Drynx Optimized Design

We present Drynx’s final query execution pipeline, before describing how the query verification’s performance can be optimized.

V-B1 Full Query Execution Pipeline

We show Drynx’s full pipeline in Figure 4. Query execution and verification are executed concurrently and multiple steps of the query execution can be executed in parallel. The $CNs$ aggregate each $DP$ ’s response in $CTA$ , as soon as they receive it. The noise generated from the $CDP$ has to be added after all the results have been aggregated. However, if the differential privacy parameters are predefined, this protocol can be executed independently from the other steps or even pre-computed.

V-B2 Probabilistic Query Verification

To improve the performance of the query verification, we enable a probabilistic verification of the proofs by the $VNs$ . We show that this strategy still enables a verifier to detect a misbehaving entity with a high probability, yet considerably improves performance (see Section IX). A proof for a specific operation (e.g., $CTKS$ for a set of ciphertexts $S$ ) can have multiple sub-proofs (e.g., $CTKS$ for one ciphertext $C\in S$ ). One proof is considered incorrect if one or more of the sub-proofs is incorrect. We introduce the two thresholds $T$ and $T_{sub}$ that define the probability of verifying a single proof and a sub-proof, respectively. We modify the $VNs$ ’ operations in step 2 of the Query Execution described in Protocol V-A3, by adding this probabilistic verification based on $T$ and $T_{sub}$ . Each $VN$ stores all the proofs it receives. It then generates a random value $r\in[0,1]$ ; if $r<T$ , it starts the probabilistic verification of the sub-proofs. For each sub-proof, the same method is applied, using $T_{sub}$ .

Security Arguments. The probabilistic verification does not necessarily compromise the security level of the system, given that the verification of each proof is redundantly done by each $VN$ . A proof is verified with a probability $p_{ver}=1-(1-T)^{N_{VN}}$ , where $N_{VN}$ is the number of $VNs$ , and a sub-proof with a probability $p_{ver_{sub}}=1-((1-T)+T(1-T_{sub}))^{N_{VN}}$ . The probability that a proof or a sub-proof is verified by at least $f_{h}$ nodes is

$P_{f_{h}}=\sum_{k=f_{h}}^{N_{VN}}\binom{N_{VN}}{k}\,p^{k}(1-p)^{N_{VN}-k}$

where $p$ is either $p_{ver}$ (for a proof) or $p_{ver_{sub}}$ (for a sub-proof). For example, if $N_{VN}=7$ , $T=1$ and $T_{sub}=0.3$ , all the proofs are at least partially verified and each sub-proof is verified by at least $f_{h}=5$ $VNs$ with $P_{f_{h}}=98.48\%$ . Each sub-proof is thus verified by at least $f_{h}$ of the $VNs$ with a high probability. Due to the honesty assumption, a sub-proof is verified by at least one honest $VN$ with a high probability. Moreover, the thresholds $T$ and $T_{sub}$ can be set to arbitrarily reduce the probability that one sub-proof is not verified by at least one honest node. Therefore, if all the $VNs$ that participated in the verification agree on the result, the auditor knows the proof is correct; otherwise, it can either choose to only trust some of the $VNs$ or fetch all proofs and verify them itself, as all the proofs are universally verifiable. For example, an auditor can choose to verify only the proofs that were not checked by any of the $VNs$ she trusts.
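The numbers in this example can be checked with a few lines of code. A binomial tail evaluated at $p_{ver_{sub}}$ reproduces the quoted 98.48%; treating that as the intended expression for $P_{f_{h}}$ is an assumption checked only against this one data point. Function names are hypothetical.

```python
from math import comb

def p_ver(T, n_vn):
    """Probability that a proof is verified, as given in the text."""
    return 1 - (1 - T) ** n_vn

def p_ver_sub(T, T_sub, n_vn):
    """Probability that a sub-proof is verified, as given in the text."""
    return 1 - ((1 - T) + T * (1 - T_sub)) ** n_vn

def p_at_least(p, f_h, n_vn):
    """Binomial tail: probability of at least f_h successes out of n_vn."""
    return sum(comb(n_vn, k) * p ** k * (1 - p) ** (n_vn - k)
               for k in range(f_h, n_vn + 1))

p_sub = p_ver_sub(T=1.0, T_sub=0.3, n_vn=7)
P_fh = p_at_least(p_sub, f_h=5, n_vn=7)
```

With these parameters `p_sub` is about 0.9176 and `P_fh` is about 0.9848, matching the figure above.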

VI Security Analysis

We employed only existing, peer-reviewed cryptographic schemes and discussed the composability of the security of the different blocks in previous sections. We corroborate these arguments with a brief summary of the security analysis.

•

Data confidentiality. In Section V-A1, we sketched the proof for confidentiality in our simplified system and discussed in Section V-A how further design choices do not hinder confidentiality. In summary, data confidentiality is ensured as the data are always encrypted and no operation, e.g., $ZKP$ creation, affects it.

•

$DPs$ ’ privacy. $DPs$ can privately decide whether to answer a query, and differential privacy is ensured for the $DPs$ and individuals, which protects them from potential inferences stemming from the release of end results. The latter is ensured in Drynx by blindly adding noise, sampled from a specific distribution, to the query end results. As described in Section V-A2, this noise can be verified to be from a specific distribution (e.g., Laplacian) and no entity knows which noise value is added.

•

Results robustness. This is ensured as all $DPs$ ’ values can be verified to be within a certain range and all $CNs$ ’ computations must be proven correct, as depicted in Section V-A3. By enforcing the generation of range proofs by $DPs$ , we protect against strong outliers, maliciously or erroneously input, which can significantly distort the query results. $DPs$ can still input incorrect values, but their influence on the final result is limited. We give an intuition on how robust a computation is against such behavior in Section IX-B.

•

Computation correctness. The proofs of correct computations (Section V-A3) ensure that the $DPs$ ’ answers are correctly aggregated ( $CTA$ ) and that the remaining steps ( $CTO$ , $CTKS$ , $CDP$ ) are correctly executed.

VII Encodings

We present a set of statistical computations that can be executed in Drynx. We then explain how to instantiate encodings (Definition 1) for the training of both linear and logistic regression machine-learning models. We adapt the logistic regression solution, proposed by Aono et al. [33], to our framework, thus enabling $Q$ to train this model in a verifiable and privacy-preserving way, even in the presence of a strong adversary. Some of the encodings are adapted from the Corrigan-Gibbs and Boneh [26] system and improved upon.

Numerical Statistics. Table I lists a set of simple statistics that can be performed with Drynx. The sum, mean, variance, std. deviation, cosine similarity (cosim) and R2 operations are executed by requiring the $DPs$ to send the result of their local and partial statistic computation. As an example, for variance, each $DP_{i}$ locally computes the sum of the values (records) $h_{j}$ that match the query ( $\scriptstyle\sum_{j=1}^{c_{i}}h_{j}$ ), where $c_{i}$ is $DP_{i}$ ’s dataset cardinality, and the sum of the squares of those same values ( $\scriptstyle\sum_{j=1}^{c_{i}}h_{j}^{2}$ ), and generates $\rho(\bar{r_{i}})=\scriptstyle([\sum_{j=1}^{c_{i}}h_{j},\scriptstyle\sum_{j=1}^{c_{i}}h_{j}^{2}],c_{i})$ . These values are independently aggregated among all $DPs$ and the overall variance is computed by $Q$ , after decryption, using the corresponding $\pi$ (defined in Table I). For the frequency count, $DPs$ are expected to send the vector $\mathbf{V_{i}}$ filled with the number of occurrences ( $fc$ ) for specific values. The cosine similarity is computed between two vectors $\phi$ and $\boldsymbol{\bar{\phi}}$ , where each $DP_{i}$ holds a subset of the coefficients of each vector.
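The variance encoding can be sketched in plaintext as follows. The post-processing assumes the population-variance formula $\sum h^{2}/c-(\sum h/c)^{2}$, which may differ in form from the $\pi$ listed in Table I; names and data are illustrative.

```python
def rho_variance(values):
    """Local encoding: ([sum, sum of squares], number of records)."""
    return [sum(values), sum(h * h for h in values)], len(values)

def aggregate(encodings):
    d = len(encodings[0][0])
    return ([sum(v[j] for v, _ in encodings) for j in range(d)],
            sum(c for _, c in encodings))

def pi_variance(v, c):
    """Population variance from the aggregated sums (assumed form of pi)."""
    mean = v[0] / c
    return v[1] / c - mean * mean

dps = [[2, 4, 4], [4, 5], [5, 7, 9]]
v, c = aggregate([rho_variance(vals) for vals in dps])
variance = pi_variance(v, c)
```

Here the aggregated encoding is `([40, 232], 8)` and the resulting variance is 4, the same value obtained by pooling all eight records.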

Bit-Wise Statistics. As depicted in Table I, bit-wise operations can be executed in two ways: each $DP_{i}$ either (1) sends a random encrypted integer $R$ or (2) sends an encrypted bit $b$ . For (1), in the OR (resp. AND) case, each $DP_{i}$ is requested to send an encrypted integer $E_{K}(R_{i})$ , where $R_{i}=0$ if the input is $0$ (resp. $1$ ), and a random positive integer otherwise. The OR (resp. AND) expression is $true$ (resp. $false$ ) if the sum $\textstyle\sum R_{i}>0$ . $Q$ obtains the final result by testing if the output is 0 or not. The result of this operation can be erroneous if $\textstyle\sum R_{i}\equiv 0\ mod(\#G)$ , or in other words, if the order $\#G$ of the Elliptic Curve subgroup divides the sum of all $DPs$ ’ random values. This happens only with a probability smaller than ${1}/{(\#G-1)}$ (proof in Appendix B). This probability is close to 0 as $\#G$ is much bigger than the decryptable plaintext values, and can be further reduced by repeating the query. Alternatively, in (2) each $DP_{i}$ has to send an encrypted value $b_{i,j}=0$ or $b_{i,j}=1$ . This eliminates the error probability but requires more computations and proofs of correctness, as the $DPs$ have to prove that their values are in $\{0,1\}$ , and a $CTO$ protocol (Section V-A2) has to be executed to preserve privacy. The min (resp. max) is computed by applying the OR operation element-wise among vectors $\mathbf{V_{i}}$ . Each $DP_{i}$ computes its local min (resp. max) $m_{DP_{i}}$ in a specified range, e.g., [0:100], which is represented by $\mathbf{V_{i}}=[b_{i,0},...,b_{i,100}]$ . Each $b_{i,j}>m_{DP_{i}}$ (resp. $b_{i,j}<m_{DP_{i}}$ ) is encoded with a ‘1’ (or random) and a ‘0’ otherwise. The min (resp. max) across all $DPs$ corresponds to the leftmost (resp. rightmost) position with a ‘1’ in the vector resulting from the OR operation. Similarly, the set intersection (resp. union) is computed by using the AND (resp. OR) operation element-wise on the vectors $\mathbf{V_{i}}$ .
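Variant (1) of the OR and AND encodings can be sketched in plaintext as follows. The modulus is the order of the Ed25519 prime-order subgroup, used here only as an example of a large group order; it is an assumption and not necessarily the curve used by Drynx. Function names are hypothetical.

```python
import secrets

# Example of a large prime group order (assumption, for illustration only).
GROUP_ORDER = 2**252 + 27742317777372353535851937790883648493

def random_positive():
    return secrets.randbelow(GROUP_ORDER - 1) + 1

def encode_or(bit):
    """R_i = 0 if the input is 0, a random positive integer otherwise."""
    return 0 if bit == 0 else random_positive()

def encode_and(bit):
    """R_i = 0 if the input is 1, a random positive integer otherwise."""
    return 0 if bit == 1 else random_positive()

def collective_or(bits):
    """True unless the (homomorphically computed) sum is 0 modulo the order."""
    return sum(encode_or(b) for b in bits) % GROUP_ORDER != 0

def collective_and(bits):
    """True only if every DP contributed 0, i.e. every input bit was 1."""
    return sum(encode_and(b) for b in bits) % GROUP_ORDER == 0
```

The querier sees only whether the sum is zero, so the number of $DPs$ that answered ‘1’ is hidden behind the random values.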

Regression Models.

Linear Regressions. We assume a dataset distributed over the $DPs$ with $D$ features $x_{1},...,x_{D}$ and a label value $y$ such that $y\approx c_{0}+c_{1}\times x_{1}+c_{2}\times x_{2}+...+c_{D}\times x_{D}$ . Drynx computes the least-squares linear fit over all the $DPs$ by building a system of $D+1$ equations that $Q$ can use in order to compute the linear regression coefficients $c_{0},c_{1},c_{2},...,c_{D}$ :

[TABLE]

where all the sums are between $\textstyle\mu=1$ and $\textstyle\mu=\textstyle\sum_{i=1}^{N}c_{i}$ . Each $DP_{i}$ sends $\textstyle\sum_{j=1}^{c_{i}}x_{j,\eta}$ , $\textstyle\sum_{j=1}^{c_{i}}x_{j,\eta}x_{j,\zeta}$ , $\textstyle\sum_{j=1}^{c_{i}}y_{j}$ , $\textstyle\sum_{j=1}^{c_{i}}y_{j}x_{j,\eta}$ , $\forall\eta,\zeta\in\{1,2,...,D\}$ , $\eta\neq\zeta$ .
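For a single feature ( $D=1$ ), the aggregated sums determine the least-squares fit in closed form. The sketch below works in plaintext with illustrative data; function names are hypothetical, and encryption and proofs are omitted.

```python
def rho_linreg(records):
    """Local encoding for D = 1: ([sum x, sum x^2, sum y, sum x*y], count)."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    return ([sum(xs), sum(x * x for x in xs), sum(ys),
             sum(x * y for x, y in records)], len(records))

def aggregate(encodings):
    d = len(encodings[0][0])
    return ([sum(v[j] for v, _ in encodings) for j in range(d)],
            sum(c for _, c in encodings))

def pi_linreg(v, n):
    """Solve the two normal equations for (c0, c1)."""
    sx, sxx, sy, sxy = v
    c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c0 = (sy - c1 * sx) / n
    return c0, c1

dps = [[(0, 1), (1, 3)], [(2, 5), (3, 7), (4, 9)]]  # points on y = 2x + 1
v, n = aggregate([rho_linreg(r) for r in dps])
c0, c1 = pi_linreg(v, n)
```

The aggregated sums are `[10, 30, 25, 70]` over 5 records, which yields $c_{0}=1$ and $c_{1}=2$ without any $DP$ revealing its individual points.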

Logistic Regressions. We consider again a dataset of $N_{da}$ records (distributed among the $DPs$ ) with a dimension $D$ where each record $x^{(\mu)}=(1,x^{(\mu)}_{1},\cdots,x^{(\mu)}_{D})\in R^{D}$ consists of $D$ features and an offset term of 1, and is associated with a label $y^{(\mu)}\in\{0,1\}$ . The original logistic regression cost function is

[TABLE]

where $h_{\theta}(x)={1}/{(1+\exp(\sum_{\eta=0}^{D}\theta_{\eta}x_{\eta}))}$ and $lr_{\theta}=\frac{\lambda}{2N_{da}}\sum_{\eta=1}^{D}\theta_{\eta}^{2}$, where $\lambda$ is the L2-regularization parameter. $J(\theta)$ can be approximated by a linear function

[TABLE]

by using the fact that $\log(\frac{1}{1+\exp(x)})\approx\sum_{\tau=0}^{k}a_{\tau}x^{\tau}$, where $a_{0},a_{1},...,a_{k}$ can be chosen as the $k+1$ first coefficients of the Taylor expansion of $\log(\frac{1}{1+\exp(x)})$, or as the coefficients of the quadratic approximation that minimizes the area between the original function and its approximation. The $A_{\tau,r_{1},\cdots,r_{\tau}}$ coefficients are defined by

[TABLE]

where the $a^{(\mu)}_{\tau,r_{1},\cdots,r_{\tau}}$ are computed and encrypted by the $DPs$ before being collectively aggregated by the $CNs$ .
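To give a feel for the polynomial approximation, the sketch below compares $\log(\frac{1}{1+\exp(x)})$ with its degree-2 Taylor expansion around 0, whose coefficients are $a_{0}=-\ln 2$, $a_{1}=-1/2$, $a_{2}=-1/8$. The choice $k=2$ and the function names are assumptions made for illustration; the coefficients that minimize the area between the two curves would differ.

```python
import math

# Degree-2 Taylor coefficients of log(1 / (1 + exp(x))) around x = 0.
TAYLOR_COEFFS = [-math.log(2), -0.5, -0.125]


def exact(x: float) -> float:
    """The original function log(1 / (1 + exp(x)))."""
    return math.log(1.0 / (1.0 + math.exp(x)))


def approx(x: float, coeffs=TAYLOR_COEFFS) -> float:
    """Polynomial approximation sum_tau a_tau * x**tau."""
    return sum(a * x**tau for tau, a in enumerate(coeffs))


if __name__ == "__main__":
    for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(f"x={x:+.1f} exact={exact(x):.6f} approx={approx(x):.6f}")
```

The gap grows roughly like $x^{4}/192$ near 0, so the approximation is best when the values of $\sum_{\eta}\theta_{\eta}x_{\eta}$ stay small, for instance after feature normalization.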

Neutral Response. A neutral response for AND and set intersection is $O=[1,...,1]$, and $O=[0,...,0]$ for other operations.

Optimized and Iterative Encoding. Drynx can also be used in order to execute iterative processes, e.g., a k-means algorithm. In this case, each iteration can simply be mapped to a query sent to the system. An iterative process can also be used in order to optimize existing encodings, such as the min and max. In their basic versions, these encodings rely on a $d$-bit vector in which each bit represents a value in a predefined range of size $d=|b_{u}-b_{l}|$. This means that each $DP$ sends $d$ ciphertexts. This process can be optimized by using a binary-search iterative process as depicted in Protocol VII. In the Range Reduction step, each query only requires one ciphertext per $DP$ and reduces by half the range of possible answers. This step is repeated until this range is reduced to a predefined size $EL$. It must be noted that the execution of other iterative processes would work in a similar way. For example, for a k-means algorithm [74], $Q$ performs one iteration by executing one query that includes the centroids in clear; the $DPs$ then assign their points to the closest centroid before aggregating their points by cluster; then, the same operation is repeated among all $DPs$ by using Drynx’s typical query workflow, and $Q$ computes the new centroids. As in Protocol VII and as described below, this algorithm leaks the intermediate results. We do not address the problem of hiding the intermediate results, e.g., by using differential privacy, in this work.
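One k-means iteration mapped to a single query can be sketched in plaintext as follows. This is an illustrative sketch only: the function names are hypothetical, the per-cluster sums and counts would be encrypted and collectively aggregated in Drynx, and keeping the previous centroid for an empty cluster is an assumption of this example.

```python
def dp_local_step(points, centroids):
    """One DP assigns its points to the closest centroid and aggregates them by cluster."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        closest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        counts[closest] += 1
        for t in range(dim):
            sums[closest][t] += p[t]
    return sums, counts


def querier_update(responses, centroids):
    """Q combines the aggregated per-cluster sums and counts into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        total = sum(counts[j] for _, counts in responses)
        if total == 0:
            new_centroids.append(tuple(centroids[j]))  # empty cluster: keep old centroid
            continue
        new_centroids.append(
            tuple(sum(sums[j][t] for sums, _ in responses) / total for t in range(dim))
        )
    return new_centroids


if __name__ == "__main__":
    centroids = [(0.0,), (10.0,)]
    dp1 = [(0.0,), (1.0,), (10.0,)]
    dp2 = [(2.0,), (11.0,), (12.0,)]
    responses = [dp_local_step(dp, centroids) for dp in (dp1, dp2)]
    print(querier_update(responses, centroids))
```

Each call to `querier_update` corresponds to one query; the centroids it returns are the intermediate results that, as noted in the text, are revealed at every iteration.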

[TABLE]

Security Arguments. For all encodings and in each query, $Q$ learns the elements of $\mathbf{V}$ (aggregated over all $DPs$) and the (approximate) number of samples considered $c$, as defined by the encoding.

For the iterative process, in the Range Reduction step, the $DPs$’ answers remain confidential, but the range is sent in clear in each query and is thus revealed to other entities. $Q$ controls the size of the range of possible values that is leaked by defining an entropy limit $EL$. In the final step, the max query is privately executed on the remaining range. This provides a tradeoff between performance and privacy (that we analyze in Section IX). The number of ciphertexts is lowered to $n=g+\lceil\frac{d}{2^{g}}\rceil,\ g=\lfloor \log_{2}(\frac{d}{EL})\rfloor$, which reduces the amount of computations and proofs by a factor $\frac{d}{n}$. For example, if $Q$ wants to know the $DPs$’ minimum value in $[0,1000)$ with $EL=100$, the workload is reduced by a factor of $7.8$ and the query leaks a range of 100 possible minimum values.
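The ciphertext count can be checked numerically with a short sketch of the formula above. The function name is hypothetical and the sketch assumes $d \geq EL$.

```python
import math


def ciphertexts_per_dp(d: int, el: int):
    """Return (g, n, d / n) for a range of size d and an entropy limit el (assumes d >= el)."""
    g = math.floor(math.log2(d / el))  # number of Range Reduction queries
    n = g + math.ceil(d / 2**g)        # one ciphertext per reduction, plus the final vector
    return g, n, d / n


if __name__ == "__main__":
    g, n, factor = ciphertexts_per_dp(1000, 100)
    print(g, n, factor)
```

For $d=1000$ and $EL=100$ this gives $g=3$ and $n=3+125=128$, hence a reduction factor of $1000/128\approx 7.8$, matching the example in the text.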

VIII Discussion and Extensions

We illustrate multiple extensions for Drynx by relying on our use cases, $HDS$ and $PDS$ (Section III-A).

Modularity. Drynx is highly modular and some of its security features can be enabled or disabled, depending on the application. For example, if results robustness is not required, input-range validation can be omitted without hindering Drynx’s execution and the remaining security guarantees are preserved. The same applies for $DPs$ ’ privacy features, e.g., differential privacy.

For example, in $HDS$ , each hospital (or $DP$ ) locally executes the query on multiple patient records and the range proofs can be omitted if the range of possible values is too broad or if the hospital is trusted to input correct values. Otherwise, the range boundaries have to be set accordingly. In this case, the querier has to use her knowledge on the attributes involved (e.g., age is between 0 and 150) and the information she has on the $DPs$ ’ data (e.g., $DPs$ have a maximum of $X$ data samples) to define the ranges. In $PDS$ , the ranges for the input values can be used to enforce tighter bounds (e.g., heart rate can only take values in [40,100] beats-per-minute) as each $DP$ has one data record.

Drynx also enables the collective protection of data at rest by having $DPs$ locally encrypt their data with the $CNs$’ collective key $K$. This limits the flexibility of the system as $DPs$ are then required to pre-compute all necessary inputs (e.g., the square of the values to enable the computation of the variance) and the range proofs before entering the encrypted data in their databases. It also requires a fixed set of $CNs$, as only they can operate with that pre-encrypted data.

As mentioned before, Drynx’s primary goal is to guarantee $DPs$ ’ privacy and still enable the queriers to obtain the results of computations performed over multiple databases. For this, Drynx enables optional security and privacy features, such as differential privacy. These features can be enabled or disabled depending on the application requirements, hence enabling multiple trade-offs between security and privacy, performance and accuracy (see below).

Collusion Resistance. Each participant can play multiple roles without hindering Drynx’s security. For example, in $HDS$, a hospital can be a $DP$ and also play the role of a $CN$, to ensure its data confidentiality without having to trust any other hospital. It can also be a $VN$ and thus take part in the verification process.

Availability. Drynx’s privacy and security guarantees hold even in the case where multiple $CNs$ or $DPs$ become unavailable. Any entity can leave or join the system without hindering Drynx’s operation, as long as they are not involved in a query under execution. In the event of a $CN$ becoming unresponsive during the query execution, the $CTA$ and $CTKS$ steps cannot be finalized, as they both require the participation of all $CNs$ . Therefore, in this case, the process is stopped and $Q$ can request the same query by choosing another set of $CNs$ , e.g., by excluding the faulty $CN$ (s). An unresponsive $DP$ only reduces the number of responses included in the statistic being computed and does not disrupt Drynx’s process. Standard mechanisms, e.g., limiting the rate at which queries are accepted, can be implemented in Drynx to avoid DDoS attacks.

Accuracy. There are several aspects that can influence output precision in Drynx. (a) We first remark that the $DPs$’ inputs to the system have to be approximated by a fixed-point representation if they are floating-point values, as explained in Section III-B.

(b) Drynx’s encodings and query executions do not intrinsically hinder the accuracy of the computed results, as all operations are exact, as long as the target function is exactly encodable. It is worth noting, however, that the encoding for the logistic regression training is built from an approximation of the original cost function.

Additionally, (c) the $DPs$ can privately decide whether to answer a query; this choice can influence the final result. However, the number of samples considered in the computation, i.e., $c_{i}$ in Definition 1, is always sent to $Q$, who can then observe if this number changed since her last query. It also enables her to make an informed decision on the statistical significance of the results and to accept them or not.

(d) Drynx can guarantee differential privacy by adding noise to the final result. In this case, Drynx returns approximate results, and the accuracy loss depends on the chosen privacy parameters and the executed operation. The choice of these parameters and the perturbation introduced in the results is thus orthogonal to this work.

Finally, (e) malicious $DPs$ can try to distort the query result by inputting erroneous values. Drynx limits malicious $DPs$’ influence on the final result by enabling the querier to restrict the range of possible inputs. This bounds the perturbation that some $DPs$ can generate on the results. If the inputs were not bounded, one malicious $DP$ could completely distort the final result by inputting extreme values. It is difficult to provide hard numbers for the accuracy of Drynx in the presence of malicious $DPs$, as it depends on many parameters such as the executed operation, the chosen input ranges, the number of $DPs$ and data records. Nonetheless, in Section IX we show how the use of ranges limits the influence of malicious $DPs$ in two examples.

Authentication/Authorization. Authentication and authorization fall out of the scope of this paper, but for the sake of completeness we briefly mention here that Drynx can integrate off-the-shelf solutions based on federated or distributed architectures [75, 76, 77].

IX Performance Evaluation

We discuss our experimental setup and evaluate Drynx’s performance. We show that it scales almost (in some cases better than) linearly with the number of $CNs$ , $VNs$ and $DPs$ , and we compare Drynx against existing solutions. We also discuss multiple security, privacy and performance tradeoffs.

IX-A System Implementation

We implemented Drynx in Go [78], and our full code is publicly available [79]. We relied on Go’s native crypto-library and on public advanced crypto-libraries [80]. For the implementation of the proofs’ storage and verification, we use a skipchain [81], which is made of blockchain-like blocks that, to enable clients to efficiently navigate arbitrarily on the chain, also contain back-and-forward pointers to older and future blocks. We rely on a (private) permissioned blockchain [82], as in our examples $HDS$ and $PDS$ (Section III-A), the participants, i.e., researchers, patients or hospitals, have to be known and authorized. However, Drynx works independently of the blockchain type, and a permission-less blockchain can also be used in a less restrictive scenario. Drynx works independently of the used Elliptic Curve; we tested it on the Ed25519 [83] and bn256 Elliptic Curves [84]. Both curves provide 128-bit security, and we used bn256 by default as it enables pairing operations (required for range proofs). Our prototype is built as a modular library of protocols that can be combined in multiple ways. The communication between different participants relies on TCP with authenticated channels (through TLS).

IX-B System Evaluation

We used Mininet [85] to simulate a realistic virtual network between the nodes; we restricted the bandwidth of all connections between nodes to 100Mbps and imposed a latency of 20ms on all communication links. We evenly distributed the $CNs$ , $DPs$ , $VNs$ and $Q$ on a set of 13 machines that have two Intel Xeon E5-2680 v3 CPUs with a 2.5GHz frequency that supports 24 threads on 12 cores and 256GB RAM.

We begin our evaluation by studying how the different steps in Drynx’s pipeline can be executed in parallel. We then show that Drynx’s runtime only slightly increases when the number of records per $DP$ grows (and the number of $DPs$ remains constant).

In our default setup, we consider 6 CNs and 7 VNs. We set the proof verification thresholds $T=1.0$ and $T_{sub}=0.3$ and show, in Section IX-B1, the effect of these thresholds on Drynx’s execution time. The joint use of these thresholds ensures that all the proofs are at least partially verified and that each sub-proof is verified by $f_{h}$ $VNs$ with a probability of $98.5\%$. We show Drynx’s runtime without the $CDP$ protocol, as $CDP$ can be pre-computed or run in parallel with other steps. We notice that the $CDP$’s runtime depends on the number of $CNs$ and on the size $\tilde{l}$ of the list of noise values. This creates a tradeoff between privacy and performance: a greater $\tilde{l}$ provides a higher privacy level, as it reduces $\delta={1}/{\tilde{l}}$, but also increases the time to generate and shuffle the list of noise values. With a Laplacian distribution and $\tilde{l}=100$, $CDP$’s runtime is 2.9 seconds with an overhead of 8.1 seconds for the proof verification.

IX-B1 Drynx Evaluation

Parallel Execution. Figure 5(a) shows the runtime for training a logistic regression model. We use a randomly-generated dataset of 12 floating-point features and 600,000 records split among 12 $DPs$. We remark that the operations are verified in parallel to the query execution; this parallelization enables $Q$ to obtain the query results as soon as they are computed (denoted by the query execution dashed line). At the end of the verification process, an auditor can check the query by verifying the signature and the query-proofs map of the corresponding block in the proofs blockchain, which in this case takes 0.4 seconds. The blocks’ sizes are small as they only contain the query and the corresponding query-proofs map; in this example one block is 56kB.

Scaling. We show how Drynx’s execution time evolves with an increasing number of data records (Figure 5(b)), $CNs$ and $DPs$ (Figure 5(c)) and $VNs$ (Figure 5(d)). Inspired by $HDS$ and $PDS$, we simulate the computation of the heart-rate variance (values in $[0,256)$) over a set of distributed patients. In Figure 5(b), we observe that Drynx scales better with (a) the number of records per $DP$ (and a fixed number of $DPs$) than with (b) the number of $DPs$; case (a) represents $HDS$, where a $DP$ is a hospital with a database of multiple patients, whereas case (b) represents $PDS$, where each patient is a $DP$ ($\#DPs=\#records$). This is because (a) enables the $DPs$ to locally pre-aggregate their data, thus reducing the amount of proofs and computations. For Figures 5(c) and 5(d) and for the remaining part of the evaluation, we set the number of $DPs$ to 10 per $CN$. In $HDS$, this could correspond to a use case in which some $DPs$ are hospitals and the others are independent doctors sharing their data. We observe that Drynx’s runtime increases with the number of $DPs$, $CNs$, and $VNs$. However, an increasing number of $CNs$ and $VNs$ also means a higher security level, as the trust is distributed among more entities.

Operations. Figure 5(e) shows Drynx’s runtime for all the operations with a large integer range of $[0,2^{20}]$ for each of the $DPs$’ inputs (the size of the $DPs$’ inputs is shown below each operation). We observe that for all operations, the query execution time is always below 1.5 seconds, and that the overhead incurred by the proofs verification increases with the size of the $DPs$’ inputs. This is expected: the larger the $DPs$’ inputs become, the more ciphertexts there are for the system to process, and the more proofs there are to verify. We also observe that bit-wise operations take more time when the $DPs$ opt to send a bit value that is then obfuscated (using the $CTO$ protocol).

Verification Thresholds. In Figure 5(f), we show how the different thresholds on the proofs verification affect Drynx’s performance with a variance query. It can be seen that sending the proofs (communication time is denoted by a dashed line) is the most time consuming part, and that reducing the thresholds reduces the verification time. For example, by having $T=1$ and $T_{sub}=0.2$ , we effectively reduce the verification workload by a factor close to 0.8, and a sub-proof is still verified by $f_{h}=5$ of the $VNs$ with a high probability ( $83.48\%$ ).

Malicious DPs. By enforcing $DPs$’ values to be within a specific range, Drynx limits the influence of malicious $DPs$ on the computed statistic. We illustrate this in a simple and realistic example (using $PDS$ from Section III-A) by computing the average heart rate over a dataset of 8922 hypertensive patients [86]. The real heart-rate values are limited to be between 40 bpm (beats per minute) and 100 bpm and, as presented by Lorgis et al. [86], the average value obtained among honest $DPs$ is $a_{h}=$ 70 bpm with a 95% confidence interval of $\pm 6$ bpm. Each patient ($DP_{i}$) must send $(\mathbf{V_{i}},c_{i})=([heart\_rate],count)$ (Definition 1), in which $heart\_rate$ has to be in $[40,100]$ and $count$ in $[0,1]$. In order to maximize the result’s distortion, a malicious $DP$ can send an extreme value that is within the range bounds. We assume that all malicious $DPs$ collude and send the same value $heart\_rate=e$, and that the computed average is given by $a_{m}=(h\cdot a_{h}+e\cdot d)/(h+c)$, where $h$ and $d$ are the numbers of honest and dishonest $DPs$, and $c$ is the sum of $c_{i}$ sent by malicious $DPs$. The relative error is $|1-(a_{m}/{a_{h}})|$. We remark that a malicious $DP$ can maximize this error with a valid input by sending $([100],0)$. In Figure 5(g), we observe that with 1% of malicious $DPs$ for the range [40,100], the highest relative error is $1.44\%$. This error corresponds to 1 bpm, still in the 95% confidence interval. We observe similar results when the cosine similarity is computed in the same settings. For this example, we also present the worst-case scenario in which the cosine similarity computed on the honest $DPs$ is 1 and the malicious $DPs$ input extreme values from the range of accepted values to reduce the similarity. As shown in Figure 5(g), these numbers highly depend on the chosen bounds. Even if many other factors influence this error (e.g., the computed operation and the distribution of the values), it shows that Drynx can limit the power of malicious $DPs$.
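The $1.44\%$ figure can be reproduced with a few lines that evaluate the expressions for $a_{m}$ and the relative error given above. The function name and the choice of 99 honest and 1 malicious $DP$ (i.e., 1% malicious) are assumptions made for this illustration.

```python
def relative_error(h: int, d: int, a_h: float, e: float, c: int) -> float:
    """Relative error |1 - a_m / a_h| with a_m = (h * a_h + e * d) / (h + c)."""
    a_m = (h * a_h + e * d) / (h + c)
    return abs(1 - a_m / a_h)


if __name__ == "__main__":
    # 1% malicious DPs sending ([100], 0): extreme value with a count of 0.
    worst = relative_error(h=99, d=1, a_h=70, e=100, c=0)
    # Same extreme value, but with an honest count of 1.
    counted = relative_error(h=99, d=1, a_h=70, e=100, c=1)
    print(f"{100 * worst:.2f}% vs {100 * counted:.2f}%")
```

Sending a count of 0 together with the maximum value yields the larger distortion because the malicious value is added to the numerator without enlarging the denominator.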

Iterative Queries. Figure 5(h) depicts how Drynx’s runtime can be reduced by using multiple queries to execute a min/max operation in a binary-search style. This represents a tradeoff between privacy and performance, as each iterative query is sent in clear, leaking the interval where the min/max value is. We assume that $Q$ sets the entropy limit $EL=100$ , in other words, another entity in the system can learn that the min/max is in an interval of at least 100 values. The precise value is kept private. We observe that the execution time is not improved when the range is small, but is greatly reduced when the range grows, reaching an execution time reduction of almost $96\%$ at a range size of 100,000.

Communication. Figure 5(i) depicts Drynx’s runtime evolution with respect to both the communication delay and bandwidth capacity with a heart-rate variance query. We remark that when the latter is reduced by a factor of 100, the runtime increases by a factor of 2 or 3. This shows that our system is more sensitive to communication delay than to bandwidth capacity.

Bandwidth. In Table II, we present the computation and bandwidth complexities for 1 ciphertext (i.e., 2 points (2p) on the Elliptic Curve, 2p = 64 bytes) per $DP$ . We use $DP$ , $VN$ , and $CN$ as the numbers of corresponding entities in the system. $s$ is the size of the Schnorr signature [57] ( $s=96$ bytes), $h$ is the hash size ( $h=32$ bytes), $l$ comes from the range $[0,u^{l})$ for the range proofs ( $u^{l}=16^{2},l=2$ ), $pap$ is a pairing point’s size ( $pap=384$ bytes) and $n$ is the number of values that are used in the $CDP$ ( $n=100$ ). We do not include the computational complexity for the local computations executed by the $DPs$ and $CNs$ . We refer to Neff’s work [69] for the complexity of the verifiable shuffle ( $VS$ ). We observe that when the number of $CNs$ and $VNs$ increases, the computational, bandwidth and storage costs increase for all the steps. As having more $CNs$ or $VNs$ improves the security and the distribution of the workload in the system, it creates a tradeoff between security, efficiency, and scalability.

IX-B2 Comparison with Existing Works

We supplement the related work’s overview, described in Section II, by presenting here a qualitative and quantitative comparison with multiple systems that are Drynx’s closest related works. We compare Drynx against SMCQL [10], UnLynx [16], Prio [26], Boura et al. [31], Aono et al. [33], Kim et al. [13] and Gazelle [32]. In Table 6(a), we show that Drynx provides several functionalities in a strong threat model and achieves results that can rival other secure and dedicated approaches, notably in the training of logistic regression models as depicted in Figure 6(a). Drynx performs as well as or better than its two closest related works, UnLynx and Prio, and provides better security guarantees.

We observe that solutions based exclusively on secret sharing and garbled circuits, namely SMCQL [10], Prio [26] and Boura et al. [31], offer multiple or advanced functionalities but fail to provide proofs of correct executions. Systems solely based on homomorphic encryption (HE), namely UnLynx [16], Aono et al. [33], Helen [36] and Kim et al. [13], are limited in the functionalities they offer. Furthermore, Aono et al. [33] and Kim et al. [13] rely on data centralization. Gazelle [32] combines HE and garbled circuits and enables complex evaluations of neural networks, but does not protect $DPs$’ privacy or provide computation correctness. In contrast, Drynx enables multiple operations while distributing trust, computations, and data storage, and it provides strict security guarantees in a stronger adversarial model.

We quantitatively compare Drynx to UnLynx [16] and Prio [26], which are, to the best of our knowledge, the closest prior works. Drynx’s query execution time for the sum is faster than UnLynx’s, as we improved the $CTKS$ protocol by enabling its execution in a tree fashion, thus reducing its execution complexity from $O(\#CN)$ to $O(\log(\#CN))$. Unlike UnLynx, Drynx enables the verification of $DPs$’ value ranges, which, for the computation of a sum, adds an overhead of only 0.6 seconds (out of a total time of 2 seconds, as depicted in Figure 5(e)). Moreover, Drynx enables a faster and more scalable verification of proofs by an auditor. After the proofs are verified and the results stored in the proof blockchain, an auditor can simply request and verify the corresponding block, which in this case takes approximately 0.4s. In UnLynx, an auditor has to request the proofs from each entity and verify them by itself, which takes 1.4s.

Prio [26] relies on secret-shared non-interactive proofs that are created by the $DPs$ to prove the correctness of their inputs to the system and that are collectively verified by the $CNs$. Even though both systems have similar functionalities, Prio provides input-range verification and computation correctness only when all the $CNs$ are honest-but-curious. We adapted the Corrigan-Gibbs prototype implementation [91] of Prio to a deployment environment similar to Drynx’s so that both use the same communication settings, thus enabling a fair comparison. In Figure 6(b), we compare Prio’s runtime in an illustrative example by using the min operation on the range $[0,1000)$ with an increasing number of $CNs$ and $DPs$, against multiple settings of Drynx. This figure shows that Drynx significantly outperforms Prio when computing min without using obfuscation ($CTO$), in which case it accepts a small probability of error (${1}/{(\#\mathcal{G})}$) and avoids the need for range proofs. If we use obfuscation, Drynx scales similarly to Prio, but it must be noted that Drynx performs its operations in a stronger threat model. When used in Prio’s threat model (delimited by a black line), Drynx is about two times faster. This is because each range proof can be sent to and verified by a single $VN$, as all $VNs$ are considered honest-but-curious under Prio’s threat model.

X Conclusion

We have proposed Drynx, a novel system that enables a querier to compute statistics and train machine-learning models on distributed datasets in a strong adversarial model where no entity is individually trusted. Drynx provides query-execution auditability and ensures the end-to-end confidentiality of the data. It protects the privacy of the data providers and relies on an immutable and distributed ledger to provide efficient correctness verification and proofs storage. Drynx is highly modular, offering configurable tradeoffs between security, privacy, and efficiency. Finally, Drynx enables privacy-preserving computations of widely-used statistics on sensitive and distributed data, thus offering features that are absolutely needed in crucial areas such as user-behavior analysis or research for personalized medicine.

Acknowledgment

The authors would like to thank Henry Corrigan-Gibbs and all members of the Laboratory for Data Security at EPFL for their helpful feedback and their support.

Appendix A Table of Symbols

Appendix B Error Probability

In Section VII, we noted that the result of bit-wise operations, when $DPs$ are requested to answer with random values $R_{i}$, can be erroneous with a probability smaller than ${1}/{(\#G-1)}$. We prove this result here and provide an expression for the probability of error $P_{n}$, where $n$ is the number of $DPs$.

[TABLE]

We have $P_{n}=\frac{1}{\#G-1}\cdot(1-P_{n-1})\leq\frac{1}{\#G-1}$ and $P_{n}=\sum_{i=2}^{n}(-1)^{i}\cdot\left(\frac{1}{\#G-1}\right)^{i-1}$.
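The agreement between the recurrence and the closed-form sum can be checked with exact rational arithmetic. This is a sanity-check sketch: the function names are hypothetical, the group order used in the example is a toy value, and the base case $P_{1}=0$ is inferred from the closed form (whose sum is empty for $n=1$) rather than stated explicitly in the text.

```python
from fractions import Fraction


def p_recursive(n: int, group_order: int) -> Fraction:
    """P_n = (1 / (#G - 1)) * (1 - P_{n-1}), starting from the inferred P_1 = 0."""
    q = Fraction(1, group_order - 1)
    p = Fraction(0)
    for _ in range(2, n + 1):
        p = q * (1 - p)
    return p


def p_closed(n: int, group_order: int) -> Fraction:
    """P_n = sum_{i=2}^{n} (-1)^i * (1 / (#G - 1))^(i - 1)."""
    q = Fraction(1, group_order - 1)
    return sum(((-1) ** i) * q ** (i - 1) for i in range(2, n + 1))


if __name__ == "__main__":
    order = 11  # toy group order, chosen only to keep the fractions readable
    for n in range(1, 6):
        print(n, p_recursive(n, order), p_closed(n, order))
```

With a realistic group order the bound ${1}/{(\#G-1)}$ is negligible; the toy order only makes the alternating terms visible.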

Bibliography


[1] “Big Data Privacy is a Bigger Issue Than You Think,” https://www.techrepublic.com/article/big-data-privacy-is-a-bigger-issue-than-you-think (25.06.2018).
[2] “GDPR,” https://www.eugdpr.org (25.07.2018).
[3] “A new data breach may have exposed … every American adult,” https://tinyurl.com/ydz7jpdk (4.02.2019).
[4] “Equifax Breach,” https://tinyurl.com/y9h4pgsk (4.02.2019).
[5] V. Bindschaedler, R. Shokri, and C. A. Gunter, “Plausible deniability for privacy-preserving data synthesis,” VLDB, vol. 10, no. 5, 2017.
[6] X. Hu, M. Yuan, J. Yao, Y. Deng, L. Chen, Q. Yang, H. Guan, and J. Zeng, “Differential Privacy in Telco Big Data Platform,” VLDB, vol. 8, no. 12, 2015.
[7] N. Johnson, J. P. Near, and D. Song, “Towards Practical Differential Privacy for SQL Queries,” VLDB, vol. 11, no. 5, 2018.
[8] R. A. Popa, C. Redfield, N. Zeldovich, and H. Balakrishnan, “CryptDB: protecting confidentiality with encrypted query processing,” in SOSP. ACM, 2011.
