Private Shotgun DNA Sequencing: A Structured Approach

Ali Gholami; Mohammad Ali Maddah-Ali; Seyed Abolfazl Motahari

arXiv:1904.00800·cs.IT·April 4, 2019

TL;DR

This paper introduces a structured DNA sequencing method that enhances privacy by separating sequencing and assembly processes and adding non-target fragments, enabling privacy with a single sequencing machine.

Contribution

It proposes a novel layered approach to DNA sequencing that ensures privacy using only one machine, unlike previous methods requiring multiple non-colluding machines.

Findings

01. Ensures privacy through layered sequencing and fragment addition.

02. Achieves privacy with a single sequencing machine.

03. Uses coverage depths proportional to powers of two.

Equations

$$\hat{\mathbf{X}} = \phi(\mathbf{Y}, \mathcal{R}),$$

$$P(\hat{\mathbf{x}}_{n} \neq \mathbf{x}_{n}) \leq \epsilon, \quad \forall n \in [N].$$

$$\frac{I(X_{m,n}, m \in [0:M-1], n \in [N]; \mathcal{R})}{MN} \leq \beta.$$

$$\alpha_{m,n} = 2^{m}\alpha_{0},$$

$$\tilde{\alpha}_{k,n} = 2^{k}\alpha_{0}.$$

$$\frac{\alpha_{0}}{2^{M+1}-2} \geq \frac{8\eta(1-\eta)}{(1-2\eta)^{2}} \ln\left(\frac{1}{\epsilon}\right).$$

$$M \geq \frac{1}{\beta}.$$

$$G_{n} = \sum_{m=0}^{M-1} 2^{m} X_{m,n} + Z_{n},$$

$$\sigma^{2} \triangleq \frac{2^{M+1}-2}{\alpha_{0}} \frac{\eta(1-\eta)}{(1-2\eta)^{2}}.$$

$$\sum_{m=0}^{M-1} \sum_{i=1}^{2^{m}\alpha_{0}} (\tilde{X}_{m,n,i} + \tilde{Y}_{m,n,i}),$$

$$\tilde{X}_{m,n,i}$$

$$\tilde{Y}_{m,n,i}$$

$$G_{n}$$

$$- \sum_{k=0}^{M-1} y_{k,n} - 2^{M+1}\frac{\eta}{1-2\eta}.$$

$$E(\tilde{X}_{m,n,i} \mid X_{m,n}) = (1-2\eta) X_{m,n} + \eta$$

$$\mathrm{Var}(\tilde{X}_{m,n,i} \mid X_{m,n}) = E(\tilde{X}_{m,n,i}^{2} \mid X_{m,n}) - (E(\tilde{X}_{m,n,i} \mid X_{m,n}))^{2} = (X_{m,n}^{2}(1-\eta) + (1-X_{m,n})^{2}\eta) - ((1-2\eta) X_{m,n} + \eta)^{2} = \eta(1-\eta),$$

$$\tilde{X}_{m,n,i} = (1-2\eta) X_{m,n} + \eta + Z_{m,n,i},$$

$$\frac{1}{\alpha_{0}(1-2\eta)} \sum_{i=1}^{2^{m}\alpha_{0}} \tilde{X}_{m,n,i} = 2^{m} X_{m,n} + \frac{2^{m}\eta}{1-2\eta} + \frac{\sum_{i=1}^{2^{m}\alpha_{0}} Z_{m,n,i}}{\alpha_{0}(1-2\eta)}.$$

$$\frac{\sum_{i=1}^{\alpha_{0}} Z_{m,n,i}}{\alpha_{0}(1-2\eta)} = \frac{1}{\sqrt{\alpha_{0}}(1-2\eta)} \frac{\sum_{i=1}^{\alpha_{0}} Z_{m,n,i}}{\sqrt{\alpha_{0}}}$$

$$\tilde{Y}_{k,n,i}$$

$$q_{n} \triangleq \frac{1}{\alpha_{0}(1-2\eta)} \left(\sum_{m=0}^{M-1} \sum_{i=1}^{2^{m}\alpha_{0}} (\tilde{X}_{m,n,i} + \tilde{Y}_{m,n,i})\right)$$

$$q_{n} = \sum_{m=0}^{M-1} 2^{m} (X_{m,n} + Y_{m,n}) + \tilde{Z}_{n},$$

$$P(\text{error}) \leq Q\left(\frac{d_{\text{min}}}{2\sigma}\right).$$

$$P(\text{error}) \leq Q\left(\frac{1}{2\sigma}\right) \leq \exp\left(\frac{-1}{8\sigma^{2}}\right),$$

$$P(\mathbf{X} \mid \mathcal{R})$$

$$P(\mathbf{X}) = \prod_{n=1}^{N} P(\mathbf{x}_{n}).$$

$$I(\mathbf{X}; \mathcal{R}) = \sum_{n=1}^{N} I(\mathbf{x}_{n}; q_{n}).$$

$$\frac{I(\mathbf{x}_{n}; q_{n})}{M} \leq \beta.$$

$$Z_{n} \triangleq \sum_{m=0}^{M-1} 2^{m} (X_{m,n} + Y_{m,n})$$

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Cryptography and Data Security · Oral and gingival health research

Full text

Private Shotgun DNA Sequencing: A Structured Approach

Ali Gholami, Mohammad Ali Maddah-Ali, and Seyed Abolfazl Motahari

Ali Gholami is with the Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran (email: [email protected]). Mohammad Ali Maddah-Ali is with Nokia Bell Labs, Holmdel, New Jersey, USA (email: [email protected]). Seyed Abolfazl Motahari is with the Department of Computer Engineering, Sharif University of Technology, Tehran, Iran (email: [email protected]).

Abstract

DNA sequencing has faced huge demand since it was first introduced as a service to the public. This service is often offloaded to sequencing companies, which then have full knowledge of individuals’ sequences, a major violation of privacy. To address this challenge, we propose a solution based on separating the process of reading the fragments of sequences, which is done at a sequencing machine, from assembling the reads, which is done at a trusted local data collector. To confuse the sequencer, in a pooled sequencing scenario in which multiple sequences are sequenced simultaneously, for each target individual we add fragments of one non-target individual whose DNA sequence is known at the data collector. Then the coverage depths of the individuals, defined as the number of DNA fragments per DNA site, are selected proportional to powers of two. This layered, structured solution allows us to ensure privacy using only one sequencing machine, in contrast to our previous solution, which relied on the existence of multiple non-colluding sequencing machines.

Index Terms:

DNA sequencing, shotgun sequencing, privacy.

I Introduction

The functions of the human body are encoded in DNA. The way cells evolve and form different tissues and limbs is highly correlated with the information stored in the genome. The human genome is a sequence of nucleotides chosen from the four-member set $\{A,C,G,T\}$ . The sequences of human genomes are very similar, more than 98 percent alike. The variations among human genomes are mostly due to Single Nucleotide Polymorphisms (SNPs). In fact, an individual’s genome can be uniquely characterized by its SNPs; determining them is called genotyping.

Having access to the genome sequence can benefit individuals in health care, in both diagnostic and therapeutic decision-making [1], [2], [3]. As a result, the use of genetic testing services, as well as the number of genetic testing providers, has risen massively in the past decade [4], [5]. As genomic data becomes a leading part of health care procedures, concerns about the privacy and confidentiality of this data have grown similarly [6], [7], [8]. Disclosed data can be used maliciously, for example by insurance companies to increase rates for particular diseases and drugs. Moreover, the disclosure of this information also endangers the information of relatives, due to the inherited similarities between family members [9], [10]. Thus, on the one hand access to genomic data is useful in curing diseases, and on the other hand its disclosure violates the privacy of individuals [11, 12, 13, 14]. Many papers address privacy in data exploration for genomic data. Some use the concept of k-anonymity to provide data privacy, some use differential privacy, and others provide solutions based on cryptographic methods [15, 16, 17, 18, 19, 20, 21, 22, 23]. The objective in all those papers is to make sure that no one’s data is revealed in a published data set when data is shared for research purposes. In this paper, we look at the issue of privacy in a different way. Privacy is violated at the very beginning of the sequencing process, because the sequencing company has access to the sequence. Therefore, before we even disclose our data, the company knows our sequence.

The most popular method for sequencing the whole genome is shotgun DNA sequencing [24], [25], [26]. In this method, the genome is broken into multiple fragments of various lengths. A sequencing machine then reads these fragments and assembles the reads to build the whole sequence. The available assembly algorithms make the sequencing procedure both cost-effective and time-effective: with existing sequencing machines, it takes just a couple of days and costs less than 1000 dollars to sequence a genome. To further reduce cost and time, pooled sequencing can be used [27], [28], [29]. In this methodology, rather than sequencing one individual, the genomes of a set of individuals are pooled together and sent to the sequencer. This reduces the cost compared with sequencing each individual's genome separately. Also, as will be seen later, pooled sequencing helps us satisfy the privacy constraint.

Taking a close look at the sequencing procedure, we realize that the sequencing process is itself a source of leakage of the sequence information. In this paper we introduce a scheme in which sequencing is possible while this kind of leakage is prevented, and we guarantee this privacy mathematically. In fact, we sequence the genomes of a set of individuals using a sequencing machine, while limiting the knowledge received by the sequencer as desired. We first note that the sequencing process consists of two phases. The first is the reading phase, in which the sequencer reads the received fragments, i.e., determines the sequence of nucleotides in each fragment. The second is the processing phase, in which a machine called the data collector uses the received reads to assemble the sequence of each individual. We aim at separating the two phases to provide privacy. In fact, we introduce a methodology in which the sequencer is unable to perform the processing phase while the data collector is able to. In other words, the reading phase, which needs high-tech machines, is outsourced, and the processing phase, which is computational, is done on a trusted local machine. To separate the two phases, we should make sure the data collector has more information than the sequencer. One of the ideas used in that direction is to include a set of individuals whose genome sequences are known a priori to the data collector and unknown to the sequencer. The other idea is to use finite field addition. Briefly, if we have two independent binary random variables and one of them has a uniform distribution, their sum in the binary field reveals no new information about the non-uniform random variable; i.e., observing the output of this sum does not change the distribution of the random variable relative to its prior distribution. With these two ideas, we limit the information leakage at the sequencer as desired, while letting the data collector reconstruct the sequences.
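The finite field addition idea can be checked numerically. The following is a minimal sketch, not part of the proposed scheme itself: the function name `xor_leakage` and the example probabilities are illustrative assumptions. It computes the mutual information between a binary variable $X$ and the binary-field sum $X\oplus Y$ for an independent binary $Y$.

```python
from math import log2

def xor_leakage(p_x, p_y):
    """Mutual information I(X; X xor Y) in bits.

    X ~ Bernoulli(p_x) and Y ~ Bernoulli(p_y) are assumed independent.
    """
    p_x_marginal = {0: 1 - p_x, 1: p_x}
    p_y_marginal = {0: 1 - p_y, 1: p_y}

    # Joint distribution of (X, S) with S = X xor Y.
    joint = {}
    for x, px in p_x_marginal.items():
        for y, py in p_y_marginal.items():
            s = x ^ y
            joint[(x, s)] = joint.get((x, s), 0.0) + px * py

    # Marginal distribution of the observed sum S.
    p_s = {s: joint.get((0, s), 0.0) + joint.get((1, s), 0.0) for s in (0, 1)}

    return sum(
        p * log2(p / (p_x_marginal[x] * p_s[s]))
        for (x, s), p in joint.items()
        if p > 0
    )

# Uniform Y hides X completely; a biased Y does not.
print(xor_leakage(0.9, 0.5))
print(xor_leakage(0.9, 0.8))
```

With a uniform $Y$ the leakage is zero (up to floating-point rounding) for any prior on $X$, whereas a biased $Y$ leaves a strictly positive leakage.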

This problem is conceptually connected to the Shamir secret sharing scheme [30]. In this scheme, a secret is partitioned into multiple parts, and each part is stored in a database. The partition is done in such a way that the secret can be reconstructed from a subset of the databases. In fact, there is a threshold such that any subset containing at least that many databases can reconstruct the secret, while any subset containing fewer databases receives no information about the secret [31]. Many works build on this scheme [32], [33].

The rest of the paper is organized as follows. The problem setting is provided in Section II. In Section III, an achievable scheme is introduced with the corresponding results. In Section IV, a generalized version of the scheme is introduced with the resulting theorems, and Section V concludes the paper and introduces some future steps.

II Problem Setting

We propose an architecture in which there is a trusted data collector and a sequencing machine (i.e., sequencer). Also, there is a set of individuals who want their genomes to be sequenced privately, without leaking the sequence data to the sequencer. There are $M\in\mathbb{N}$ individuals in this set, labeled from $0$ to $M-1$ . The data collector gathers the genomes of the individuals in the set, pools their fragments together (each genome is sheared into fragments of various sizes), and sends this pool to the sequencer. Then, the sequencer reads these fragments (reading phase) and reports the resulting reads to the data collector. Finally, the data collector uses the set of reads to assemble the sequences of all individuals (processing phase) and reports the results to them. To provide privacy, unlike conventional methods, we aim at separating the reading phase from the processing phase: the sequencer performs the reading phase and the data collector performs the processing phase. Our privacy objective is to guarantee that the processing phase cannot be done at the sequencer.

To separate the two phases, we should create an information gap between the sequencer and the data collector. To do this, we use another set of individuals whose sequences are known beforehand to the data collector but unknown to the sequencer. The genomes of this set of individuals are also collected by the data collector and their fragments are added to the pool. This set is of size $K\in\mathbb{N}$ ; its individuals are labeled from $0$ to $K-1$ and are called known individuals. The individuals in the previous set, whose genomes we aim to sequence, are called unknown individuals.

We referred to SNPs earlier as the main source of difference between human genomes. Although there are four types of nucleotides, only two of them can occur in every SNP position across all individuals, and this binary set in every position is known a priori for the population. Also, for each SNP position, the allele occurring more frequently in the population is called the major allele and the one occurring less frequently is called the minor allele. Considering this, the sequence of every individual can be characterized by a vector in $\{0,1\}^{N}$ where $N\in\mathbb{N}$ is the total number of SNPs and $0$ and $1$ represent the minor and major alleles, respectively. Moreover, we define the matrix $\mathbf{X}$ whose entry in row $m$ and column $n$ is the random variable $X_{m,n}\in\{0,1\}$ , which indicates the allele of unknown individual $m$ in SNP position $n$ . Similarly, the matrix $\mathbf{Y}$ is defined for the known individuals. Keep in mind that the entries in $\mathbf{X}$ are unknown to both the sequencer and the data collector, whereas the entries in $\mathbf{Y}$ are unknown to the sequencer and known to the data collector, leading to an information gap between the two.

Let $\mathcal{F}_{m,n}$ and $\tilde{\mathcal{F}}_{k,n}$ denote the set of fragments containing SNP position $n\in[N]$ for the unknown individual $m$ and known individual $k$ respectively. The data collector sends the set of fragments $\bigcup\limits_{m=0}^{M-1}\bigcup\limits_{n=1}^{N}\mathcal{F}_{m,n}\cup\bigcup\limits_{k=0}^{K-1}\bigcup\limits_{n=1}^{N}\tilde{\mathcal{F}}_{k,n}$ to the sequencer (see Fig. 1). Let us define the random variables $\alpha_{m,n}\triangleq|\mathcal{F}_{m,n}|$ and $\tilde{\alpha}_{k,n}\triangleq|\tilde{\mathcal{F}}_{k,n}|$ as the coverage depth for SNP position $n$ for the unknown individual $m$ and known individual $k$ , respectively. Note that in the sequencing process, a number of copies of each individual's genome are provided to the data collector, so for most regions in the genome of one individual, there are multiple fragments containing the region. The sequencer reads each SNP with some probability of error. As will be seen later, to lower the effect of the reading error caused by the sequencer, we should increase the coverage depth. The set of reads sent to the data collector by the sequencer is denoted by $\mathcal{R}$ .

Sequencers make errors in reading bases. The probability of error in reading a SNP in a fragment is assumed to be constant across all sequences and for all SNPs and is denoted by $\eta\in(0,1)$ . More precisely, in the sequencer, for a fragment of an individual, and in a SNP, the probability that a $1$ is read as $0$ or vice versa is $\eta$ , independent of the individual, the fragment, and the SNP.

Having $\mathbf{Y}$ as side information, the data collector maps $\mathcal{R}$ to the matrix $\hat{\mathbf{X}}\in\{0,1\}^{M\times N}$ using a function $\phi$ , i.e.

$$\hat{\mathbf{X}}=\phi(\mathbf{Y},\mathcal{R}),$$

where $\hat{\mathbf{X}}$ refers to an estimate of the matrix of SNPs for unknown individuals $\left(\mathbf{X}\right)$ .

The proposed scheme should be such that the following two conditions are satisfied:

• Reconstruction Condition: Let $\mathbf{x}_{n}$ and $\hat{\mathbf{x}}_{n}$ denote the column $n$ of the matrix $\mathbf{X}$ and $\hat{\mathbf{X}}$ respectively. The reconstruction condition requires that the inequality below holds for any given $\epsilon\in(0,1)$ :

$$P(\hat{\mathbf{x}}_{n}\neq\mathbf{x}_{n})\leq\epsilon,\quad\forall n\in[N]. \tag{1}$$

$\epsilon$ is referred to as the accuracy level and is a design parameter.

• Privacy Condition: For privacy to hold, we want the distribution of $X_{m,n},m\in[0:M-1],n\in[N]$ to remain almost the same before and after reading the fragments. To be precise, the privacy condition requires that the following inequality holds for any given $\beta\in(0,1)$ :

$$\frac{I(X_{m,n},m\in[0:M-1],n\in[N];\mathcal{R})}{MN}\leq\beta. \tag{2}$$

$\beta$ is referred to as the privacy level and is a design parameter.

In the following section we will introduce a proposed scheme that satisfies the two conditions simultaneously.

III Structured Achievable Scheme with Constant Coverage Depth

In this section, we propose a scheme to satisfy both the reconstruction (1) and privacy (2) constraints. We have two assumptions in our scheme:

• Assumption 1: Every fragment is short enough to contain no more than one SNP.

• Assumption 2: Every fragment is long enough to be correctly mapped to the reference genome, i.e., we can identify exactly which region of the genome sequence it came from.

These two assumptions are realistic. There are approximately 3.3 million SNPs in the human genome. Compared with the roughly 3 billion base pairs of the whole genome, this means that the average distance between two SNPs is roughly 1000 base pairs [34]. Moreover, using short read alignment algorithms like Bowtie [35], it is possible to map reads whose length is on the order of a couple of hundred base pairs. Thus, by using such algorithms and choosing fragment lengths of a few hundred base pairs, both assumptions are valid simultaneously.

In the proposed achievable scheme, we focus on the case where $S=M$ . In cases where $M$ is greater than $S$ , we partition the set of individuals into groups of size $S$ and use this scheme for each group separately. In this paper, we propose a specific assignment scheme for the coverage depth parameters. In the proposed solution, named the structured scheme, for all $m\in[0:M-1]$ , $k\in[0:K-1]$ , and $n\in[N]$ we have

$$\alpha_{m,n}=2^{m}\alpha_{0}, \tag{3}$$

and

$$\tilde{\alpha}_{k,n}=2^{k}\alpha_{0}, \tag{4}$$

where $\alpha_{0}\in\mathbb{N}$ . Also, entries in $\mathbf{X}$ have prior probabilities following the major allele frequencies and entries in $\mathbf{Y}$ have uniform prior probabilities.
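As a small illustration of this assignment, the sketch below lists the coverage depths of a group and counts the fragments pooled per SNP position. The helper names `coverage_depths` and `fragments_per_snp` are hypothetical and only serve this example.

```python
def coverage_depths(num_individuals, alpha0):
    """Coverage depth 2^m * alpha0 for individuals m = 0, ..., num_individuals - 1."""
    return [(2 ** m) * alpha0 for m in range(num_individuals)]

def fragments_per_snp(num_unknown, num_known, alpha0):
    """Total number of fragments covering one SNP position in the pool."""
    return sum(coverage_depths(num_unknown, alpha0)) + sum(
        coverage_depths(num_known, alpha0)
    )

print(coverage_depths(3, 10))       # [10, 20, 40]
print(fragments_per_snp(3, 3, 10))  # 140
```

When the number of known individuals equals $M$, the total per SNP position is $(2^{M+1}-2)\alpha_{0}$, so the sequencing cost of the structured scheme grows exponentially with the group size.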

Keeping the coverage depth variables exactly as introduced in the above equations is practically impossible; they are actually random variables. Analyzing the random case is rather complicated. To have a better understanding of the problem and make the analysis tractable, in this section we consider the constant case, and later, in Section IV, we generalize the results to the case of random coverage depths.

First, we introduce the main results. Then we derive the mathematical models at the data collector and the sequencer in Subsections III-A and III-B, respectively. We rely on these models to prove the main results in Subsections III-C and III-D. Finally, we discuss the results in Subsection III-E.

The following theorem provides a sufficient condition for the reconstruction condition to hold.

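Theorem 1.

In the structured scheme with constant coverage depth and reading probability of error of $\eta$ , the reconstruction condition (1) is satisfied if

$$\frac{\alpha_{0}}{2^{M+1}-2}\geq\frac{8\eta(1-\eta)}{(1-2\eta)^{2}}\ln\left(\frac{1}{\epsilon}\right). \tag{5}$$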

The following theorem provides a sufficient condition for the privacy condition to hold.

Theorem 2.

In the structured scheme with constant coverage depth, the privacy condition (2) is satisfied if

$$M\geq\frac{1}{\beta}. \tag{6}$$

The main message of these results is that we can choose the parameters of the proposed scheme such that both conditions are satisfied, simultaneously. In other words, these theorems confirm that the separation of the reading phase and the processing phase together with adding known individuals and by adjusting coverage depths, offers enough flexibility to satisfy both conditions at the same time; based on (6), $M$ is chosen, and using (5), $\alpha_{0}$ is set.

Example 1.

If we assume the values $\eta=\epsilon=\beta=0.01$ , then based on Theorem 2, we can have $M\geq 100$ and based on Theorem 1 for $M=100$ , $\alpha_{0}\simeq 9.6\times 10^{29}$ (or greater). Also for $\eta=\epsilon=\beta=0.1$ , we get $M=10$ and $\alpha_{0}\simeq 5300$ . For another example if we assume the values $\eta=0.01$ , $\epsilon=0.001$ , and $\beta=0.1$ , we will have $M=10$ and $\alpha_{0}\simeq 1166$ . ∎
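The numbers in Example 1 can be reproduced with a short calculation. The sketch below is illustrative only: the function names `min_group_size` and `min_alpha0` are our own, and the small tolerance in the ceiling is an implementation assumption that guards against floating-point rounding of $1/\beta$.

```python
import math

def min_group_size(beta):
    """Smallest integer M with M >= 1/beta (Theorem 2)."""
    return math.ceil(1.0 / beta - 1e-9)

def min_alpha0(M, eta, eps):
    """Smallest real alpha0 satisfying the condition of Theorem 1."""
    noise_factor = 8 * eta * (1 - eta) / (1 - 2 * eta) ** 2
    return (2 ** (M + 1) - 2) * noise_factor * math.log(1 / eps)

for eta, eps, beta in [(0.01, 0.01, 0.01), (0.1, 0.1, 0.1), (0.01, 0.001, 0.1)]:
    M = min_group_size(beta)
    print(f"eta={eta}, eps={eps}, beta={beta}: M={M}, alpha0>={min_alpha0(M, eta, eps):.4g}")
```

The three parameter sets give $M=100$ with $\alpha_{0}$ close to $9.6\times 10^{29}$, $M=10$ with $\alpha_{0}$ close to $5300$, and $M=10$ with $\alpha_{0}$ close to $1166$, in agreement with the example.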

III-A Mathematical Model in Data Collector in the Structured Scheme

For any SNP position $n\in[N]$ , the data collector should be able to estimate the vector $\textbf{x}_{n}=\left[X_{0,n},X_{1,n},\cdots,X_{M-1,n}\right]$ .

In this subsection, we seek the model that the data collector observes in SNP position $n$ . We will show that the data collector receives $G_{n}$ as

$$G_{n}=\sum_{m=0}^{M-1}2^{m}X_{m,n}+Z_{n}, \tag{7}$$

in which $Z_{n}\sim\mathcal{N}\left(0,\sigma^{2}\right)$ where

$$\sigma^{2}\triangleq\frac{2^{M+1}-2}{\alpha_{0}}\frac{\eta(1-\eta)}{(1-2\eta)^{2}}. \tag{8}$$

To obtain this model, we should keep in mind that the fragments have no tags, so neither the data collector nor the sequencer knows which individual each fragment belongs to. Therefore, when the data collector receives the read fragments from the sequencer, the only information it gets is the number of major (or minor) alleles in every position $n\in[1:N]$ . Consequently, the data collector receives the following summation

$$\sum_{m=0}^{M-1}\sum_{i=1}^{2^{m}\alpha_{0}}\left(\tilde{X}_{m,n,i}+\tilde{Y}_{m,n,i}\right), \tag{9}$$

in which $\tilde{X}_{m,n,i}$ and $\tilde{Y}_{m,n,i}$ are noisy versions of $X_{m,n}$ and $Y_{m,n}$ respectively, due to the reading error caused by the sequencer. Also, recall that the data collector knows the sequences of the known individuals a priori, i.e., it knows the values of all $Y_{k,n}$ . Let us assume these values are $Y_{k,n}=y_{k,n}$ . Therefore we have

[TABLE]

Note that the $i$ index refers to the read number. After scaling (9) and subtracting $\sum_{k=0}^{M-1}y_{k,n}$ and $2^{M+1}\frac{\eta}{1-2\eta}$ , (9) can be written as

[TABLE]

Note that subtracting $\sum_{k=0}^{M-1}y_{k,n}$ in the above equation is possible because full knowledge of the matrix $\mathbf{Y}$ is available at the data collector.

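Next, we derive the mean and variance of the random variable $\tilde{X}_{m,n,i}$ conditioned on $X_{m,n}$ . Based on (12) we have

$$\begin{aligned}\mathbb{E}(\tilde{X}_{m,n,i}\mid X_{m,n})&=(1-2\eta)X_{m,n}+\eta,\\ \textrm{Var}(\tilde{X}_{m,n,i}\mid X_{m,n})&=\mathbb{E}(\tilde{X}_{m,n,i}^{2}\mid X_{m,n})-\left(\mathbb{E}(\tilde{X}_{m,n,i}\mid X_{m,n})\right)^{2}\\ &=\left(X_{m,n}^{2}(1-\eta)+(1-X_{m,n})^{2}\eta\right)-\left((1-2\eta)X_{m,n}+\eta\right)^{2}\\ &=\eta(1-\eta),\end{aligned}$$

in which the last equality is valid for both possible values of $X_{m,n}$ , i.e., $0$ and $1$ . Using the MMSE estimate and the orthogonality principle, we can write

$$\tilde{X}_{m,n,i}=(1-2\eta)X_{m,n}+\eta+Z_{m,n,i},$$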

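where $Z_{m,n,i}$ is a random variable with $\mathbb{E}(Z_{m,n,i})=0$ and $\textrm{Var}(Z_{m,n,i})=\eta(1-\eta)$ . Also $Z_{m,n,i}$ and $X_{m,n}$ are uncorrelated. Consequently

$$\frac{1}{\alpha_{0}(1-2\eta)}\sum_{i=1}^{2^{m}\alpha_{0}}\tilde{X}_{m,n,i}=2^{m}X_{m,n}+\frac{2^{m}\eta}{1-2\eta}+\frac{\sum_{i=1}^{2^{m}\alpha_{0}}Z_{m,n,i}}{\alpha_{0}(1-2\eta)}. \tag{20}$$

Based on the central limit theorem, $\frac{\sum_{i=1}^{\alpha_{0}}Z_{m,n,i}}{\sqrt{\alpha_{0}}}$ converges in distribution to a normal distribution with zero mean and variance $\eta(1-\eta)$ . Thus

$$\frac{\sum_{i=1}^{\alpha_{0}}Z_{m,n,i}}{\alpha_{0}(1-2\eta)}=\frac{1}{\sqrt{\alpha_{0}}(1-2\eta)}\frac{\sum_{i=1}^{\alpha_{0}}Z_{m,n,i}}{\sqrt{\alpha_{0}}}$$

converges in distribution to a normal distribution with zero mean and variance $\frac{\eta(1-\eta)}{\alpha_{0}(1-2\eta)^{2}}$ . Thus, the last term in the right-hand side of (20), which sums over $2^{m}\alpha_{0}$ reads, converges to a normal distribution with zero mean and variance $\frac{2^{m}\eta(1-\eta)}{\alpha_{0}(1-2\eta)^{2}}$ . Similarly, using (15), we reach an analogous equation.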

Consequently, using (20), we can rewrite (III-A) as (7).
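The processing at the data collector can be illustrated with a small simulation for a single SNP position. This is a sketch under stated assumptions rather than the exact procedure of the paper: the function names are hypothetical, the known bits are subtracted with weights $2^{k}$, and the bias is removed as $(2^{M+1}-2)\frac{\eta}{1-2\eta}$, which is what the per-read mean $(1-2\eta)X_{m,n}+\eta$ gives when summed over all pooled reads.

```python
import random

def simulate_reads(bit, num_reads, eta, rng):
    """Number of reads reporting allele 1 when each read is flipped with probability eta."""
    p_one = 1 - eta if bit == 1 else eta
    return sum(rng.random() < p_one for _ in range(num_reads))

def pooled_count(x, y, alpha0, eta, rng):
    """Untagged pool at one SNP: total count of 1-reads over all individuals."""
    total = 0
    for m, (x_m, y_m) in enumerate(zip(x, y)):
        depth = (2 ** m) * alpha0  # structured coverage depth
        total += simulate_reads(x_m, depth, eta, rng)
        total += simulate_reads(y_m, depth, eta, rng)
    return total

def decode(count, y, alpha0, eta):
    """Estimate the unknown bits from the pooled count and the known bits y."""
    M = len(y)
    scaled = count / (alpha0 * (1 - 2 * eta))
    known = sum((2 ** k) * y_k for k, y_k in enumerate(y))  # assumed weighting
    bias = (2 ** (M + 1) - 2) * eta / (1 - 2 * eta)         # assumed exact bias
    g = scaled - known - bias
    value = min(max(round(g), 0), 2 ** M - 1)  # nearest valid integer
    return [(value >> m) & 1 for m in range(M)]

rng = random.Random(0)
x = [1, 0, 1, 1]                      # unknown individuals' alleles
y = [rng.randint(0, 1) for _ in x]    # known individuals' alleles (uniform)
count = pooled_count(x, y, alpha0=1000, eta=0.05, rng=rng)
x_hat = decode(count, y, alpha0=1000, eta=0.05)
print(x, x_hat)
```

With these parameters the noise standard deviation $\sigma$ is far below the decision margin of $\frac{1}{2}$, so rounding recovers $\mathbf{x}_{n}$ with overwhelming probability.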

III-B Mathematical Model in Sequencer in Structured Scheme

Similar to the previous subsection, the sequencer receives the summation in (9). The difference from the previous subsection is that all individuals are unknown from the sequencer’s viewpoint. Therefore,

[TABLE]

Yet, $\tilde{X}_{m,n,i}$ follows (12).

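Scaling the summation in (9), the sequencer receives $q_{n}$ defined as

$$q_{n}\triangleq\frac{1}{\alpha_{0}(1-2\eta)}\left(\sum_{m=0}^{M-1}\sum_{i=1}^{2^{m}\alpha_{0}}\left(\tilde{X}_{m,n,i}+\tilde{Y}_{m,n,i}\right)\right).$$

Taking similar steps as in the previous subsection, $q_{n}$ is written as

$$q_{n}=\sum_{m=0}^{M-1}2^{m}\left(X_{m,n}+Y_{m,n}\right)+\tilde{Z}_{n}, \tag{26}$$

where $\tilde{Z}_{n}\sim\mathcal{N}\left(0,\sigma^{2}\right)$ in which $\sigma^{2}$ is defined in (8).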

III-C Proof of Theorem 1

Having reached the mathematical model at the data collector in (7), we provide the proof of Theorem 1.

Proof.

Note that the value of the summation $\sum_{m=0}^{M-1}2^{m}X_{m,n}$ uniquely corresponds to one $\textbf{x}_{n}$ (in its binary representation, each bit corresponds to $X_{m,n}$ for a different value of $m$ ). Therefore, our objective is to find the summation above. The probability of error in estimating the summation, based on (7), is upper bounded by

$$P(\text{error})\leq Q\left(\frac{d_{\text{min}}}{2\sigma}\right).$$

Here $d_{\text{min}}=1$ , because the $X_{m,n}$ are chosen from the set $\{0,1\}$ and the summation is therefore integer valued. Thus

$$P(\text{error})\leq Q\left(\frac{1}{2\sigma}\right)\leq\exp\left(\frac{-1}{8\sigma^{2}}\right),$$

in which $\sigma^{2}$ is defined in (8). Requiring the right-hand side to be at most $\epsilon$ gives the condition of Theorem 1.

∎
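The two bounds used in this proof can be evaluated numerically. The following sketch is illustrative: the function names are hypothetical, and the $Q$-function is expressed through the complementary error function as $Q(x)=\frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def sigma_squared(M, alpha0, eta):
    """Noise variance of the model at the data collector, as defined in (8)."""
    return (2 ** (M + 1) - 2) / alpha0 * eta * (1 - eta) / (1 - 2 * eta) ** 2

def error_bounds(M, alpha0, eta):
    """Return (Q(1 / (2 sigma)), exp(-1 / (8 sigma^2)))."""
    sigma = math.sqrt(sigma_squared(M, alpha0, eta))
    return q_function(1 / (2 * sigma)), math.exp(-1 / (8 * sigma ** 2))

# Second parameter set of Example 1: eta = eps = 0.1, M = 10, alpha0 = 5300.
q_bound, exp_bound = error_bounds(10, 5300, 0.1)
print(q_bound, exp_bound)
```

For this parameter set the exponential bound sits essentially at $\epsilon=0.1$, while the $Q$-function bound is noticeably smaller, which suggests that the sufficient condition of Theorem 1 is conservative.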

III-D Proof of Theorem 2

Using the mathematical model in (26), we are ready to provide the proof of Theorem 2.

Proof.

For the sequencer, $\mathcal{R}$ is equivalent to $q_{n}$ , $\forall n\in[N]$ , because each fragment contains just one SNP, the fragments are grouped by the SNP position they contain, and for the group containing SNP position $n$ the information is summarized in $q_{n}$ . Thus we have

[TABLE]

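Recall that $\mathbf{x}_{n},$ $\forall n\in[N]$ denotes the column $n$ of $\mathbf{X}$ . Due to independence of entries in $\mathbf{X}$ , we have

$$P(\mathbf{X})=\prod_{n=1}^{N}P(\mathbf{x}_{n}).$$

Based on the last two equalities

$$I(\mathbf{X};\mathcal{R})=\sum_{n=1}^{N}I(\mathbf{x}_{n};q_{n}).$$

Thus, for privacy condition (2) to be satisfied, it is sufficient for every $n\in[N]$ to have

$$\frac{I(\mathbf{x}_{n};q_{n})}{M}\leq\beta.$$

To begin, we define $Z_{n}$ as

$$Z_{n}\triangleq\sum_{m=0}^{M-1}2^{m}\left(X_{m,n}+Y_{m,n}\right).$$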

It is concluded that the following Markov chain holds,

[TABLE]

Thus we have

[TABLE]

In what follows, we seek for $I(\mathbf{x}_{n};Z_{n})$ . We have

[TABLE]

We expand $Z_{n}$ in binary formation

[TABLE]

Consequently, the following equations hold

[TABLE]

in which in equation $i$ , $B_{i+1}$ is the carry over of the left-hand summation in binary field. Equivalently we have

[TABLE]

(37) yields

[TABLE]

We expand the right hand side of the above equality as

[TABLE]

Based on (41) and the fact that entries of $\mathbf{Y}$ have uniform prior probabilities, $b_{0,n}$ has uniform distribution, so $H(b_{0,n})=1$ . For $H(b_{1,n}|b_{0,n})$ we have

[TABLE]

which also results in $1$ . Note that $(a)$ follows from the fact that $B_{1,n}$ is a sufficient statistic for $b_{1,n}$ , and $(b)$ follows from (42). Similarly, all the terms in (III-D) are equal to $1$ except the last term. Therefore,

[TABLE]

Based on (33), for the second term in the right hand side of (III-D) we have

[TABLE]

Using the last two equalities and (III-D), we have

[TABLE]

The proof is complete.

∎

III-E Discussion

As seen from Theorem 1, the minimum $\alpha_{0}$ needed to satisfy the reconstruction condition grows exponentially with $M$ . $\alpha_{0}$ is a noise-resistance parameter: as it becomes larger, the fraction of fragments containing false reads concentrates around the probability of error in the reading phase ( $\eta$ ); that is why increasing $\alpha_{0}$ helps to eliminate the noise term in (7).

Taking a deeper look at the proof of Theorem 2, we realize that we have created binary field addition in our scheme, as desired. The bits $b_{i,n}$ that derive from (41) to (43) are the result of binary field addition. The addition is of two random variables, one of which, $Y_{i,n}$ , has a uniform distribution, while the other, $X_{i,n}$ , follows the distribution of SNP position $n$ . If the value of $b_{i,n}$ is given alone, it reveals no new information about $X_{i,n}$ . Thus these bits alone do not leak any information. We have created this kind of addition by adjusting the coverage depth values. From (7) it is concluded that the only bit leaking information in position $n$ is $B_{M,n}$ , which means the binary field addition scheme does not work perfectly; however, the problem addressed in this paper has limitations that we must adapt to. Interestingly, the maximum entropy of this bit is $1$ , and this upper bound on the information leakage is independent of $M$ . As a result, the average information leakage per bit is at most $\frac{1}{M}$ . Therefore, this average decreases as $M$ increases, so we can adjust $M$ to reach the desired $\beta$ . Note that, based on our simulations, $I(\mathbf{x}_{n};Z_{n})=H(B_{M,n}|b_{M-1,n}\cdots b_{0,n})$ is an increasing function of $M$ (see Figure 3) and tends to a limiting value. So by increasing $M$ , the information leakage per bit decreases at the rate of $\frac{1}{M}$ , and no faster.
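The bound of one bit on the leakage can be checked by exact enumeration for small $M$. The sketch below is our own illustration, not the simulation behind Figure 3: the function name `leakage_bits` is hypothetical, all SNP positions of the unknown individuals are assumed to share one major allele frequency `p_major`, and the reading noise is ignored, so the quantity computed is $I(\mathbf{x}_{n};Z_{n})$ with $Z_{n}=\sum_{m=0}^{M-1}2^{m}(X_{m,n}+Y_{m,n})$.

```python
from math import log2

def leakage_bits(M, p_major):
    """Exact I(x_n; Z_n) in bits for uniform Y_m and X_m ~ Bernoulli(p_major)."""
    size = 2 ** M

    # Distribution of A = sum_m 2^m X_m, which is in one-to-one
    # correspondence with the vector x_n.
    p_a = [1.0] * size
    for a in range(size):
        for m in range(M):
            bit = (a >> m) & 1
            p_a[a] *= p_major if bit else 1 - p_major

    # B = sum_m 2^m Y_m is uniform on {0, ..., 2^M - 1}; Z = A + B.
    p_z = [0.0] * (2 * size - 1)
    for a in range(size):
        for b in range(size):
            p_z[a + b] += p_a[a] / size

    h_z = -sum(p * log2(p) for p in p_z if p > 0)
    # I(A; Z) = H(Z) - H(Z | A) = H(Z) - H(B) = H(Z) - M.
    return h_z - M

for M in range(1, 7):
    total = leakage_bits(M, 0.7)
    print(M, round(total, 4), round(total / M, 4))
```

In this noiseless model the total leakage never exceeds one bit, so the per-bit leakage is at most $\frac{1}{M}$, which matches the argument above.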

IV Structured Achievable Scheme with Random Coverage Depth

In the previous section, we analyzed the problem for constant coverage depth; however, this is not a practical case because we do not have exact control over the number of fragments. In this section, we consider a more general case in which the coverage depth parameters are random variables. We assume them to be binomial random variables and approximate them with normal distributions. Therefore, for all $n\in[N]$ and $m\in[0:M-1]$ , we have

[TABLE]

Similarly, for all $n\in[N]$ and $k\in[0:K-1]$ , we have

[TABLE]

Due to the fact that coverage depths mostly have large values, we assumed that $\alpha_{0}\in\mathbb{N}$ .

As in the previous section, we first state the results. After that, the mathematical model and the estimation rule are introduced in Subsections IV-A and IV-B. Then, the proof of Theorem 3 is provided in Subsection IV-C. Finally, we discuss the results in Subsection IV-D.

The following theorem provides a sufficient condition to satisfy the reconstruction condition.

Theorem 3.

In the all-but-one scheme, the reconstruction condition (1) is satisfied if:

[TABLE]

where

[TABLE]

and

[TABLE]

*Remark 1:*

For the privacy condition, Theorem 2 is valid here as well. This is discussed in Remark 2 of Subsection IV-D.

IV-A Mathematical Model in Data Collector in the Structured Scheme

In this subsection, we show that the information received by the data collector is the value $G_{n}$, which can be written as

[TABLE]

where $\delta_{m,n}$ and $\tilde{\delta}_{k,n}$ are normal random variables with zero mean and variances $2^{m}\sigma_{1}^{2}$ and $2^{k}\sigma_{1}^{2}$, respectively, where

[TABLE]

Also, $Z_{n}\sim\mathcal{N}(0,\sigma^{2})$ where

[TABLE]

In the pooled sequencing scenario, the sequencer will receive $G_{n}$ , which is defined as

[TABLE]

Consider the random variable $\sum_{i=1}^{\alpha_{m,n}}\tilde{X}_{m,n,i}$ conditioned on $X_{m,n}$ . We have

[TABLE]

Clearly, the random variables $\sum_{i=1}^{2^{m}\alpha_{0}}\tilde{X}_{m,n,i}$ and $\sum_{i=2^{m}\alpha_{0}+1}^{\alpha_{m,n}}\tilde{X}_{m,n,i}$ are independent conditioned on $X_{m,n}$. Also, the distribution of $\sum_{i=2^{m}\alpha_{0}+1}^{\alpha_{m,n}}\tilde{X}_{m,n,i}$ is the same as that of $\sum_{i=1}^{\alpha_{m,n}-2^{m}\alpha_{0}}\tilde{X}_{m,n,i}$, both conditioned on $X_{m,n}$.

We define

[TABLE]

Thus we have $\mathbb{E}\left(\xi_{m,n}\right)=0$ and

[TABLE]

Similar to the steps taken in Subsection III-A, and as a result of the central limit theorem and the orthogonality principle, we have

[TABLE]

For the second term in (IV-A) we have

[TABLE]

Using the law of total variance we have

[TABLE]

where $(a)$ results from the fact that $\mathbb{E}\left(\xi_{m,n}\right)=0$ . Based on (IV-A), (IV-A), (IV-A), (IV-A) and due to the MMSE rule and the orthogonality theorem we have

[TABLE]

where $Z_{m,n}\sim\mathcal{N}\left(0,\frac{2^{m}\eta(1-\eta)}{\alpha_{0}(1-2\eta)^{2}}\right)$ . Using the same steps, for the data collector we have

[TABLE]

where $\tilde{Z}_{k,n}\sim\mathcal{N}\left(0,\frac{2^{k}\eta(1-\eta)}{\alpha_{0}(1-2\eta)^{2}}\right)$ .

Therefore using (65) and (66), (IV-A) can be written as

[TABLE]

where

[TABLE]

Thus

[TABLE]

The fraction $\frac{\alpha_{m,n}}{\alpha_{0}}$ can be written as

[TABLE]

where

[TABLE]

Therefore $\textrm{Var}(\zeta_{m,n})=\textrm{Var}\left(\alpha_{m,n}\right)$ and for $\delta_{m,n}\triangleq\frac{\zeta_{m,n}}{\alpha_{0}}$ we have

[TABLE]

Also, $\tilde{\delta}_{k,n}$ is defined similarly. Using (67), (70), and (72), (IV-A) follows from (IV-A).

IV-B Estimation Rule

For any SNP position $n\in[N]$ , the objective for the data collector is to estimate the vector $\mathbf{x}_{n}=\left[X_{1,n},X_{2,n},\cdots,X_{M,n}\right]^{T}$ . We define the extended vector $\tilde{\mathbf{x}}_{n}\triangleq\left[X_{1,n},\cdots,X_{M,n},y_{1,n},\cdots,y_{K,n}\right]^{T}$ , where the last $K$ entries are known to the data collector. Therefore, for the data collector, estimating $\tilde{\mathbf{x}}_{n}$ is equivalent to estimating $\mathbf{x}_{n}$ .

In this subsection, our objective is to find the rule that the data collector should use to estimate $\tilde{\mathbf{x}}_{n}$. Using the ML rule, the estimate $\hat{\tilde{\mathbf{x}}}_{n}$ is obtained by

[TABLE]

Let

[TABLE]

Based on (IV-A),

[TABLE]

Therefore,

[TABLE]
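As a rough illustration of an ML rule of this kind, consider a simplified model in which the data collector observes the weighted sum $\sum_{m}2^{m}(X_{m,n}+y_{m,n})$ corrupted by zero-mean Gaussian noise of fixed variance. This is an assumption made for the sketch and is simpler than the model of Subsection IV-A, where the noise also carries the coverage depth fluctuations. Under this assumption, ML estimation reduces to choosing the nearest admissible integer after removing the known contribution. The names `observe` and `decode_position` are hypothetical.

```python
import random


def observe(target_bits, known_bits, noise_std, rng):
    """Noisy weighted sum of target bits x_m and known bits y_m (toy model)."""
    clean = sum((1 << m) * (x + y)
                for m, (x, y) in enumerate(zip(target_bits, known_bits)))
    return clean + rng.gauss(0.0, noise_std)


def decode_position(g, known_bits):
    """Nearest-candidate (ML under fixed-variance Gaussian noise) decoding.

    Subtracts the known contribution, rounds to the closest integer in
    [0, 2^M - 1], and reads the target bits off its binary expansion.
    """
    num_bits = len(known_bits)
    known_part = sum((1 << m) * y for m, y in enumerate(known_bits))
    value = round(g - known_part)
    value = max(0, min((1 << num_bits) - 1, value))
    return [(value >> m) & 1 for m in range(num_bits)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1, 0, 1, 1]  # unknown target bits, least significant weight first
    y = [0, 1, 1, 0]  # bits of the known sequences
    g = observe(x, y, noise_std=0.1, rng=rng)
    print(decode_position(g, y) == x)
```

Because the known bits are subtracted before rounding, the remaining integer has a unique binary expansion, so the target bits are recovered whenever the noise magnitude stays below one half.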

IV-C Proof of Theorem 3

Based on the mathematical model and the estimation rule presented in the previous subsections, we are ready to provide the proof of Theorem 3.

Proof.

Similar to the proof presented in Subsection III-C, and based on the estimation rule in (76), our estimation resembles decoding over an AWGN channel. In other words, if we estimate $\sum_{m=0}^{M-1}2^{m}(X_{m,n}+y_{m,n})$, then $\hat{\tilde{\mathbf{x}}}_{n}$ follows accordingly. Thus, for the probability of error we have

[TABLE]

Requiring the right-hand side to be less than $\epsilon$ yields

[TABLE]

Rewriting the left-hand side by substituting $\sigma_{1}^{2}$ yields

[TABLE]

For (78) to hold, it is sufficient that both terms on the right-hand side of the above equality be less than $\frac{1}{16\ln\left(\frac{1}{\epsilon}\right)}$. From the first inequality we have

[TABLE]

From the second one we reach

[TABLE]

As both inequalities above should hold, the theorem is proven.

∎
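The role of the threshold $\frac{1}{16\ln\left(\frac{1}{\epsilon}\right)}$ can be checked numerically. The sketch below assumes that an error occurs when zero-mean Gaussian noise moves the estimate by more than half the spacing between unit-spaced candidates; this is an illustrative reading of the AWGN analogy, not a restatement of the equations of the proof. Under this assumption, if each of the two variance terms meets the threshold, the total variance is at most $\frac{1}{8\ln\left(\frac{1}{\epsilon}\right)}$ and the error probability stays below $\epsilon$. The function names are hypothetical.

```python
import math


def variance_budget(eps):
    """Per-term variance budget 1 / (16 ln(1/eps)) quoted in the proof."""
    return 1.0 / (16.0 * math.log(1.0 / eps))


def rounding_error_probability(total_variance):
    """P(|Z| > 1/2) for Z ~ N(0, total_variance), via erfc."""
    return math.erfc(0.5 / math.sqrt(2.0 * total_variance))


if __name__ == "__main__":
    for eps in (1e-2, 1e-3, 1e-6):
        # Worst case allowed by the sufficient condition: both terms at budget.
        total = 2.0 * variance_budget(eps)
        print(eps, rounding_error_probability(total))
```

The check uses the standard bound $Q(x)\leq\frac{1}{2}e^{-x^{2}/2}$, which gives $P(|Z|>\frac{1}{2})\leq e^{-1/(8\sigma^{2})}$ for $Z\sim\mathcal{N}(0,\sigma^{2})$.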

IV-D Discussion

First of all, if we put $\sigma_{\alpha}=0$ in Theorem 3, the result resembles that of Theorem 1, as expected. Also, as seen from Theorem 3, as $M$ increases and $\epsilon$ decreases, $e_{1}$ grows much faster (quadratically) than $e_{2}$. So, for small enough $\sigma_{\alpha}$, $e_{1}$ is likely the larger value.

*Remark 2:*

In this remark, we show that Theorem 2 also holds when the coverage depth is random. Similar to the steps taken in Subsection IV-A, the sequencer receives $q_{n}$ at SNP position $n\in[N]$ such that

[TABLE]

where $\delta_{m,n}$ and $\tilde{\delta}_{k,n}$ are normal random variables with zero mean and variance $\sigma_{1}^{2}$ . Also, $\tilde{Z}_{n}\sim\mathcal{N}(0,\sigma^{2})$ where $\sigma^{2}$ is defined in (57). We can write $q_{n}$ as

[TABLE]

where $\hat{Z}_{n}\sim\mathcal{N}\left(0,(2^{M+1}-2)\sigma_{1}^{2}+\sigma^{2}\right)$ .

From (83), the Markov chain in the proof of Theorem 2 is valid here as well. Continuing with the same steps, we conclude that Theorem 2 holds here. Therefore, all the discussions for that scenario remain valid here.

*Remark 3:*

All the results derived so far are for the case of haploid cells, in which there is one set of chromosomes. In diploid cells, however, each cell carries two sets of chromosomes, which means that every position in the genome is covered by two chromosomes. To extend our results to diploid cells, we can treat each individual as carrying the chromosomes of two haploid individuals. Hence all the results carry over to diploid cells if we replace $M$ with $2M$ in the $M$-individual scheme.

V Conclusion and Future Steps

In this paper, we introduced the problem of privacy in the process of DNA sequencing. Privacy has previously been studied for genomic data sets, but that notion of privacy is very different from ours. We seek privacy in the sequencing process itself: the DNA sequence is hidden from the sequencing machine, while it can still be reconstructed at a trusted local processor. Previous approaches, in brief, were concerned with how to release genomic data in such a way that the information of no single individual is compromised, so one can see how our approach differs.

In this paper, we aimed to define the problem of privacy in DNA sequencing theoretically and to introduce an achievable scheme that satisfies our constraints when its parameters are adjusted correctly. We used non-colluding sequencers and distributed the genome data between them. Also, we used the idea of pooled sequencing and combined the real data with known sequences. By setting the number of known sequences and the coverage depths of the sequences, we can satisfy the constraints.

As this is the first paper on this problem, much remains to be done in future work. For instance, the case in which a set of sequencers collaborate could be considered, or the case in which fragments are not limited to containing just one SNP. Also, the lower bounds in the theorems of this paper can be improved. Finally, we hope that this paper paves the way towards privacy in the process of sequencing.

Bibliography


[1] K. A. Phillips, J. R. Trosman, R. K. Kelley, M. J. Pletcher, M. P. Douglas, and C. B. Weldon, “Genomic sequencing: assessing the health care system, policy, and big-data implications,” Health Affairs, vol. 33, no. 7, pp. 1246–1253, 2014.
[2] C. G. Van El, M. C. Cornel, P. Borry, R. J. Hastings, F. Fellmann, S. V. Hodgson, H. C. Howard, A. Cambon-Thomsen, B. M. Knoppers, H. Meijers-Heijboer, et al., “Whole-genome sequencing in health care,” European Journal of Human Genetics, vol. 21, no. 6, p. 580, 2013.
[3] European Society of Human Genetics, Whole genome sequencing helps diagnosis and reduces healthcare costs for newborns in intensive care, 2018. https://medicalxpress.com/news/2018-06-genome-sequencing-diagnosis-healthcare-newborns.html
[4] S. D. Grosse and M. J. Khoury, “What is the clinical utility of genetic testing?,” Genetics in Medicine, vol. 8, no. 7, p. 448, 2006.
[5] American Society of Clinical Oncology et al., “American Society of Clinical Oncology policy statement update: genetic testing for cancer susceptibility,” Journal of Clinical Oncology, vol. 21, no. 12, p. 2397, 2003.
[6] M. R. Anderlik and M. A. Rothstein, “Privacy and confidentiality of genetic information: what rules for the new science?,” Annual Review of Genomics and Human Genetics, vol. 2, no. 1, pp. 401–433, 2001.
[7] M. White, Why you should be scared of someone stealing your genome, 2013. https://psmag.com/environment/why-you-should-be-scared-of-someone-stealing-your-genome-58082
[8] C. Heeney, N. Hawkins, J. de Vries, P. Boddington, and J. Kaye, “Assessing the privacy risks of data sharing in genomics,” Public Health Genomics, vol. 14, no. 1, pp. 17–25, 2011.
