# Drynx: Decentralized, Secure, Verifiable System for Statistical Queries and Machine Learning on Distributed Datasets

**Authors:** David Froelicher, Juan R. Troncoso-Pastoriza, Joao Sa Sousa, and Jean-Pierre Hubaux

arXiv: 1902.03785 · 2020-02-28

## TL;DR

Drynx is a decentralized system that allows privacy-preserving statistical analysis and machine learning on sensitive distributed datasets, ensuring data confidentiality, correctness, and auditability without requiring trust in any single entity.

## Contribution

Drynx introduces a modular, efficient framework that combines interactive protocols, homomorphic encryption, zero-knowledge proofs of correctness, and differential privacy to enable secure, verifiable, and privacy-preserving analysis of distributed datasets.

## Key findings

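- Trains a logistic regression model on a dataset of 12 features and 600,000 records, distributed among 12 data providers, in less than 2 seconds
- Verifies the correctness of the query execution in less than 22 seconds, with the computations distributed among 6 computing nodes
- Supports the secure computation of statistics such as standard deviation and extrema, as well as the training and evaluation of machine-learning models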

## Abstract

Data sharing has become of primary importance in many domains such as big-data analytics, economics and medical research, but remains difficult to achieve when the data are sensitive. In fact, sharing personal information requires individuals' unconditional consent or is often simply forbidden for privacy and security reasons. In this paper, we propose Drynx, a decentralized system for privacy-conscious statistical analysis on distributed datasets. Drynx relies on a set of computing nodes to enable the computation of statistics such as standard deviation or extrema, and the training and evaluation of machine-learning models on sensitive and distributed data. To ensure data confidentiality and the privacy of the data providers, Drynx combines interactive protocols, homomorphic encryption, zero-knowledge proofs of correctness, and differential privacy. It enables an efficient and decentralized verification of the input data and of all the system's computations, and thus provides auditability in a strong adversarial model in which no entity has to be individually trusted. Drynx is highly modular, dynamic and parallelizable. Our evaluation shows that it enables the training of a logistic regression model on a dataset (12 features and 600,000 records) distributed among 12 data providers in less than 2 seconds. The computations are distributed among 6 computing nodes, and Drynx enables the verification of the query execution's correctness in less than 22 seconds.
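To make the role of homomorphic encryption in the abstract more concrete, the following is a minimal sketch of additively homomorphic aggregation under a key shared among several computing nodes. It is an illustration under stated assumptions, not the construction from the paper: the scheme (exponential ElGamal over a tiny multiplicative group), the parameters `P`, `G`, `Q`, and all function names are hypothetical and chosen only so the example runs with the standard library. The parameters are far too small to be secure, and the sketch omits the zero-knowledge proofs and differential privacy that the abstract also mentions.

```python
import secrets

# Toy parameters (NOT secure, assumed for illustration only):
# P = 2^127 - 1 is prime, and G = 2 has prime order Q = 127 modulo P
# because 2^127 = P + 1, which is congruent to 1 modulo P.
P = 2**127 - 1
G = 2
Q = 127


def keygen():
    """Return a (secret share, public share) pair for one computing node."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def encrypt(public_key, m):
    """Exponential ElGamal: m sits in the exponent, so ciphertexts add."""
    r = secrets.randbelow(Q - 1) + 1
    return pow(G, r, P), (pow(G, m, P) * pow(public_key, r, P)) % P


def add(c, d):
    """Component-wise product of ciphertexts encrypts the sum of plaintexts."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P


def partial_decrypt(secret_share, c):
    """One node's contribution: the first component raised to its share."""
    return pow(c[0], secret_share, P)


def combine(c, partials):
    """Remove every node's mask, then find m by searching the small range."""
    mask = 1
    for part in partials:
        mask = (mask * part) % P
    g_m = (c[1] * pow(mask, -1, P)) % P  # modular inverse needs Python 3.8+
    for m in range(Q):
        if pow(G, m, P) == g_m:
            return m
    raise ValueError("plaintext outside the toy message range")


def aggregate(values, n_nodes=3):
    """Encrypt each provider's value, add the ciphertexts, jointly decrypt."""
    shares = [keygen() for _ in range(n_nodes)]
    collective_key = 1
    for _, pub in shares:
        collective_key = (collective_key * pub) % P
    total = encrypt(collective_key, 0)
    for v in values:
        total = add(total, encrypt(collective_key, v))
    return combine(total, [partial_decrypt(x, total) for x, _ in shares])
```

In this sketch, calling `aggregate([5, 17, 30])` yields the sum of the three inputs without any individual value being decrypted, and decryption succeeds only when every node contributes its partial result. The sum must stay below `Q` for the toy group to represent it.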

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.03785/full.md

## Figures

24 figures with captions in the complete paper: https://tomesphere.com/paper/1902.03785/full.md

## References

91 references — full list in the complete paper: https://tomesphere.com/paper/1902.03785/full.md

---
Source: https://tomesphere.com/paper/1902.03785