Impact of Prior Knowledge and Data Correlation on Privacy Leakage: A Unified Analysis

Yanan Li, Xuebin Ren, Shusen Yang, and Xinyu Yang

arXiv:1906.02606·cs.LG·June 7, 2019


TL;DR

This paper introduces a unified framework analyzing how data correlation, prior knowledge, and query sensitivity influence privacy leakage, proposing a new privacy measure called prior differential privacy (PDP) and deriving mathematical expressions for various data models.

Contribution

It proposes the prior differential privacy (PDP) definition and provides a unified analysis of privacy leakage considering data correlation, prior knowledge, and query sensitivity.

Findings

1. Positive, negative, and hybrid correlations affect privacy leakage differently.
2. Closed-form expression of privacy leakage for continuous data is derived.
3. The analysis applies to general linear queries like count, sum, mean, and histogram.


Tables

Table I: Notations and meanings

Notation	Description
$\mathbf{x}$	A database instance $\{x_{1}, x_{2}, \dots, x_{n}\}$.
$\mathcal{U}, \mathcal{K}$	The index sets of unknown/known tuples.
$\mathbf{x}_{\mathcal{U}}, \mathbf{x}_{\mathcal{K}}$	The instances of unknown/known tuples.
$s, s_{\mathcal{K}}, s_{\mathcal{U}}$	The sums of the instances $\mathbf{x}$, $\mathbf{x}_{\mathcal{K}}$, and $\mathbf{x}_{\mathcal{U}}$.
$x_{i}, x_{i}^{\prime}$	Two different values of tuple $i$.
$\mathbf{x}_{-i}$	The database $\mathbf{x}$ with $x_{i}$ eliminated.
$\mathbf{x}^{\prime}$	The database $\mathbf{x}$ with $x_{i}$ replaced with $x_{i}^{\prime}$.
$\mathcal{A}_{i,\mathcal{K}}$	An adversary with prior knowledge $\mathbf{x}_{\mathcal{K}}$ to attack $x_{i}$.
$l_{\mathcal{A}_{i,\mathcal{K}}}$	The privacy leakage caused by the adversary $\mathcal{A}_{i,\mathcal{K}}$.
$r \in \mathcal{R}$	The random output (noisy answer) generated by $\mathcal{M}$.
$\mathcal{M}$	A randomized mechanism over $\mathbf{x}$.
$\theta \in \Theta$	All possible distributions of $\mathbf{x}$.
$LS_{i}(f)$	The local sensitivity of a query function on tuple $i$.
$GS(f)$	The global sensitivity of a query function on $\mathbf{x}$.

Table II: Four joint distributions

(a) Positive correlation

	$x_{1}=0$	$x_{1}=1$
$x_{2}=0$	0.3	0.2
$x_{2}=1$	0.2	0.3

(b) Negative correlation

	$x_{1}=0$	$x_{1}=1$
$x_{2}=0$	0.2	0.3
$x_{2}=1$	0.3	0.2

(c) Perfect correlation

	$x_{1}=0$	$x_{1}=1$
$x_{2}=0$	0.5	0
$x_{2}=1$	0	0.5

(d) Perfect correlation

	$x_{1}=0$	$x_{1}=1$
$x_{2}=0$	0.5	0
$x_{2}=5$	0	0.5

Equations

$$\mathrm{DP}(\mathcal{M}) = \sup_{i,\,\mathbf{x}_{-i},\,x_{i},\,x_{i}^{\prime},\,S} \log \frac{\Pr(r \in S \mid x_{i}, \mathbf{x}_{-i})}{\Pr(r \in S \mid x_{i}^{\prime}, \mathbf{x}_{-i})} \leq \varepsilon.$$

$$p(z) = \frac{1}{2\lambda} \exp(-|z|/\lambda),$$

$$p(r \mid \mathbf{x}) = \frac{1}{2\lambda} \exp(-|r - f(\mathbf{x})|/\lambda).$$

$$l_{\mathcal{A}_{i,\mathcal{K}}}(\theta) = \sup_{x_{i},\,x_{i}^{\prime},\,r} \log \frac{\Pr(r \in S \mid x_{i}, \mathbf{x}_{\mathcal{K}})}{\Pr(r \in S \mid x_{i}^{\prime}, \mathbf{x}_{\mathcal{K}})}.$$

$$\sup_{i,\,\mathcal{K},\,\theta} l_{\mathcal{A}_{i,\mathcal{K}}}(\theta) \leq \varepsilon.$$

$$\sup_{x_{i},\,x_{i}^{\prime},\,r} \left( \log \frac{\Pr(x_{i} \mid r, \mathbf{x}_{\mathcal{K}})}{\Pr(x_{i}^{\prime} \mid r, \mathbf{x}_{\mathcal{K}})} - \log \frac{\Pr(x_{i} \mid \mathbf{x}_{\mathcal{K}})}{\Pr(x_{i}^{\prime} \mid \mathbf{x}_{\mathcal{K}})} \right).$$

$$\Pr(r \in S \mid x_{i}, \mathbf{x}_{\mathcal{K}^{\prime}}) = \sum_{x_{j}} \Pr(x_{j}) \Pr(r \in S \mid x_{i}, \mathbf{x}_{\mathcal{K}}).$$

$$\frac{\Pr(r \in S \mid x_{i}, \mathbf{x}_{\mathcal{K}})}{\Pr(r \in S \mid x_{i}^{\prime}, \mathbf{x}_{\mathcal{K}})} = \frac{\sum_{\mathbf{x}_{\mathcal{U}}} \Pr(\mathbf{x}_{\mathcal{U}} \mid x_{i}, \mathbf{x}_{\mathcal{K}}) \Pr(r \mid x_{i}, \mathbf{x}_{-i})}{\sum_{\mathbf{x}_{\mathcal{U}}} \Pr(\mathbf{x}_{\mathcal{U}} \mid x_{i}^{\prime}, \mathbf{x}_{\mathcal{K}}) \Pr(r \mid x_{i}^{\prime}, \mathbf{x}_{-i})} \leq \sup_{\mathbf{x}_{\mathcal{U}},\,\mathbf{x}_{\mathcal{U}}^{\prime}} \frac{\Pr(r \mid x_{i}, \mathbf{x}_{\mathcal{U}}, \mathbf{x}_{\mathcal{K}})}{\Pr(r \mid x_{i}^{\prime}, \mathbf{x}_{\mathcal{U}}^{\prime}, \mathbf{x}_{\mathcal{K}})} = \sup_{\mathbf{x}_{i \cup \mathcal{U}},\,\mathbf{x}_{i \cup \mathcal{U}}^{\prime}} \frac{\Pr(r \mid \mathbf{x}_{i \cup \mathcal{U}}, \mathbf{x}_{\mathcal{K}})}{\Pr(r \mid \mathbf{x}_{i \cup \mathcal{U}}^{\prime}, \mathbf{x}_{\mathcal{K}})}.$$

$$l_{\mathcal{A}_{1,\{2\}}}(a) = \sup_{r} \log \frac{\exp(-|r - 1|)}{\exp(-|r - 2|)} = 1.$$

$$l_{\mathcal{A}_{1,\emptyset}}(a) = \sup_{r} \log \frac{\sum_{x_{2}} \Pr(x_{2} \mid x_{1} = 0) \exp(-|r - (0 + x_{2})|)}{\sum_{x_{2}} \Pr(x_{2} \mid x_{1} = 1) \exp(-|r - (1 + x_{2})|)} \approx 1.19.$$

$$l_{\mathcal{A}_{1,\emptyset}}(a) > l_{\mathcal{A}_{1,\{2\}}}(a),$$

$$l_{\mathcal{A}_{1,\emptyset}}(b) < l_{\mathcal{A}_{1,\{2\}}}(b).$$

$$l_{\mathcal{A}_{1,\emptyset}}(a^{\prime}) > l_{\mathcal{A}_{1,\emptyset}}(a),$$

$$l_{\mathcal{A}_{1,\emptyset}}(b^{\prime}) < l_{\mathcal{A}_{1,\emptyset}}(b).$$

$$l_{\mathcal{A}_{1,\emptyset}}(c) < l_{\mathcal{A}_{1,\emptyset}}(d).$$

$$IC_{j,\mathcal{K}^{\prime}}(x_{i,1}, x_{i,2}) = \log \frac{\sum_{x_{j}} \Pr(x_{j} \mid x_{i,1}, \mathbf{x}_{\mathcal{K}^{\prime}})\, e^{-x_{j}/\lambda}}{\sum_{x_{j}} \Pr(x_{j} \mid x_{i,2}, \mathbf{x}_{\mathcal{K}^{\prime}})\, e^{-x_{j}/\lambda}}.$$

$$\Gamma_{ij,\mathcal{K}^{\prime}} = \{ IC_{j,\mathcal{K}^{\prime}}(x_{i,m}, x_{i,n}) \mid \forall x_{i,m}, x_{i,n} \in dom(x_{i}),\ m < n \}.$$

$$l_{\mathcal{A}_{i,\mathcal{K}^{\prime}}} = \left| l_{\mathcal{A}_{i,\mathcal{K}}} + IC_{ij,\mathcal{K}^{\prime}} \right|,$$

$$IC_{ij,\mathcal{K}^{\prime}} = \arg\max_{\gamma \in \Gamma_{ij,\mathcal{K}^{\prime}}} \left| l_{\mathcal{A}_{i,\mathcal{K}}} + \gamma \right|$$

$$IC_{j,\mathcal{K}^{\prime}}(x_{i,m}, x_{i,n}) = IR_{j,\mathcal{K}^{\prime}}(x_{i,m}, x_{i,n}) \cdot \frac{LS_{j}(f)}{\lambda},$$

$$IR_{j,\mathcal{K}^{\prime}}(x_{i,m}, x_{i,n}) = IC_{j,\mathcal{K}^{\prime}}(x_{i,m}, x_{i,n}) / (LS_{j}(f)/\lambda)$$

$$l_{\mathcal{A}_{i,[n]\setminus\{i\}}} = \sup_{r,\,x_{i},\,x_{i}^{\prime}} \log \frac{\Pr(r \mid x_{i}, \mathbf{x}_{-i})}{\Pr(r \mid x_{i}^{\prime}, \mathbf{x}_{-i})} = \sup_{r,\,x_{i},\,x_{i}^{\prime}} \log \frac{\exp(-|r - f(\mathbf{x})|/\lambda)}{\exp(-|r - f(\mathbf{x}^{\prime})|/\lambda)} \leq \sup_{x_{i},\,x_{i}^{\prime}} |f(\mathbf{x}) - f(\mathbf{x}^{\prime})|/\lambda = LS_{i}(f)/\lambda.$$

$$l_{\mathcal{A}_{i,\mathcal{K}}} = \Big| \cdots \Big| \big| LS_{i}(f)/\lambda + IC_{ij_{1},[n]\setminus\{i,j_{1}\}} \big| + IC_{ij_{2},[n]\setminus\{i,j_{1},j_{2}\}} \Big| + \cdots + IC_{ij_{k},[n]\setminus\{i,j_{1},\dots,j_{k}\}} \Big|,$$

$$l_{\mathcal{A}_{i,\mathcal{K}}} = \Big| \cdots \Big| \frac{LS_{i}(f)}{GS(f)} + IR_{ij_{1},[n]\setminus\{i,j_{1}\}} \frac{LS_{j_{1}}(f)}{GS(f)} \Big| + \cdots + IR_{ij_{k},[n]\setminus\{i,j_{1},\dots,j_{k}\}} \frac{LS_{j_{k}}(f)}{GS(f)} \Big| \frac{GS(f)}{\lambda}.$$

$$l_{\mathcal{A}_{i,\mathcal{K}}} \leq (k + 1)\, GS(f)/\lambda.$$

$$\sum_{k=1}^{n-1} (t_{k} + t_{k} \log t_{k}) \leq \sum_{k=1}^{n-1} t_{k} + \sum_{k=1}^{n-1} t_{k} \sum_{k=1}^{n-1} \log t_{k} \leq n^{2} 2^{n-1} (1 + 2n^{2}) \leq 3 n^{4} 2^{n-1}.$$

$$l_{\mathcal{A}_{i,\mathcal{K}}}(\theta) = \sup_{|x_{i} - x_{i}^{\prime}| \leq M,\ r} \log \frac{\Pr(r \in S \mid x_{i}, \mathbf{x}_{\mathcal{K}})}{\Pr(r \in S \mid x_{i}^{\prime}, \mathbf{x}_{\mathcal{K}})}.$$

$$\Pr(r \mid x_{i}, \mathbf{x}_{\mathcal{K}}) = \int_{x_{j}} \Pr(x_{j} \mid x_{i}, \mathbf{x}_{\mathcal{K}}) \Pr(r \mid x_{j}, x_{i}, \mathbf{x}_{\mathcal{K}})\, dx_{j}.$$

$$f(\mathbf{x}) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \mu)^{\prime} \Sigma^{-1} (\mathbf{x} - \mu)\right),$$


This work is supported in part by the National Natural Science Foundation of China under Grants 61572398, 61772410, 61802298, and U1811461; the Fundamental Research Funds for the Central Universities under Grant xjj2018237; the China Postdoctoral Science Foundation under Grant 2017M623177; the China 1000 Young Talents Program; and the Young Talent Support Plan of Xi’an Jiaotong University. (Corresponding author: Shusen Yang.) Y. Li is with the National Engineering Laboratory for Big Data Analytics (NEL-BDA), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, and also with the School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China. X. Ren and X. Yang are with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, and also with the National Engineering Laboratory for Big Data Analytics (NEL-BDA), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China (e-mails: {xuebinren, yxyphd}@mail.xjtu.edu.cn). S. Yang is with the National Engineering Laboratory for Big Data Analytics (NEL-BDA), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China, and also with the Ministry of Education Key Lab for Intelligent Networks and Network Security (MOE KLINNS Lab), Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China.

Abstract

It has been widely understood that differential privacy (DP) can guarantee rigorous privacy against adversaries with arbitrary prior knowledge. However, recent studies demonstrate that this may not be true for correlated data, and indicate that three factors could influence privacy leakage: the data correlation pattern, prior knowledge of adversaries, and sensitivity of the query function. This poses a fundamental problem: what is the mathematical relationship between the three factors and privacy leakage? In this paper, we present a unified analysis of this problem. A new privacy definition, named prior differential privacy (PDP), is proposed to evaluate privacy leakage considering the exact prior knowledge possessed by the adversary. We use two models, the weighted hierarchical graph (WHG) and the multivariate Gaussian model to analyze discrete and continuous data, respectively. We demonstrate that positive, negative, and hybrid correlations have distinct impacts on privacy leakage. Considering general correlations, a closed-form expression of privacy leakage is derived for continuous data, and a chain rule is presented for discrete data. Our results are valid for general linear queries, including count, sum, mean, and histogram. Numerical experiments are presented to verify our theoretical analysis.

Index Terms:

privacy leakage, correlated data, prior knowledge.

I Introduction

Leakage of private information could lead to serious consequences (e.g., threats to financial security and personal safety), and privacy protection has been extensively studied for several decades [1, 2]. In today’s big data era, privacy issues have been attracting increasing attention from both society and academia [3, 4, 5, 6]. Differential privacy (DP) [7, 8, 9] has become the de facto standard for privacy definitions because it can provide a rigorous mathematical proof of privacy guarantees.

In practice, adversaries may be able to acquire prior knowledge (i.e., partial data records), due to database attacks [10], privacy incidents [11], and obligations to release [12]. It is commonly believed that differentially private algorithms are invulnerable to adversaries with arbitrary prior knowledge because any given privacy level can be guaranteed, even when the adversary has knowledge of all data records except certain ones (i.e., the adversary with the strongest prior knowledge). However, this is true only if all data records are independent. It has been shown that the adversary’s prior knowledge can have significant impacts on privacy leakage when data records are correlated [13, 14].

The following example demonstrates how privacy leakage can be affected by correlations and the adversaries’ prior knowledge.

Example 1. Fig. 1 shows a scenario in which an adversary attempts to infer some sensitive information about a database. As shown, the database $\mathbf{x}$, consisting of two attributes $s_{1}\in\{0,1\}$ and $s_{2}\in\{0,1\}$, publishes noisy statistics (via the Laplace mechanism of differential privacy) for privacy-preserving data mining. The adversary may acquire some prior knowledge about the database, i.e., the exact value of $s_{2}$ and the data correlation $Pr(s_{1},s_{2})$, from some public source (e.g., the Internet). After observing the noisy statistic $r=f(s_{1},s_{2})+noise$, the adversary tries to infer the private value of $s_{1}$ based on all available information. Assume that the noisy statistic is $r=s_{1}+s_{2}+noise=2$, that the prior knowledge is $s_{2}=1$, and that the adversary’s initial belief about $s_{1}$ before inference is $Pr(s_{1}=1)=Pr(s_{1}=0)=0.5$. The privacy information gain obtained by the adversary in the inference process is summarized in Fig. 2. We use the following three special cases to show the impacts of the correlations and prior knowledge on privacy leakage.

Case 1 (Positive Correlation): $s_{1}$ and $s_{2}$ are perfectly positively correlated with coefficient $1$, i.e., $s_{1}=s_{2}$. Without prior knowledge, the adversary will infer $s_{1}+s_{2}=2$ from the observation with high confidence according to the characteristics of the Laplace mechanism. Combined with the correlation $s_{1}=s_{2}$, he will infer $s_{1}=1$ with high confidence. With the prior knowledge, e.g., $s_{2}=1$, the adversary can ascertain that $s_{1}=1$ from the correlation $s_{1}=s_{2}$.

Case 2 (Negative Correlation): $s_{1}$ and $s_{2}$ are perfectly negatively correlated with coefficient $-1$, i.e., $s_{1}+s_{2}=1$. Without prior knowledge, the adversary can infer no additional information about $s_{1}$ through $r=2$ due to the negative correlation. However, with the prior knowledge $s_{2}=1$, the adversary can claim that $s_{1}=0$. In addition, $r=2$ provides no additional information.

Case 3 (No Correlation): $s_{1}$ and $s_{2}$ are independent. Without prior knowledge, the adversary can infer that $s_{1}=1$ with relatively higher probability than $s_{1}=0$ from the observation $r=2$ . However, with the additional prior knowledge of $s_{2}=1$ , the adversary obtains no more confidence about $s_{1}$ because there is no correlation between $s_{1}$ and $s_{2}$ , i.e., a stronger adversary with extra prior knowledge achieves no privacy gain compared with a weaker adversary.
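The three cases can be checked with a small Bayesian computation. The following is only a sketch under stated assumptions: the noise scale `lam = 1.0` is not given in the example and is chosen here for illustration, the independent case is modeled with a uniform joint distribution, and the names `laplace_pdf`, `posterior_s1_is_1`, and `CASES` are hypothetical helpers. The prior on $s_{1}$ is $0.5$ in every case, as in the example.

```python
import math


def laplace_pdf(z, lam):
    """Density of Laplace noise with scale lam evaluated at z."""
    return math.exp(-abs(z) / lam) / (2.0 * lam)


def posterior_s1_is_1(joint, r, lam=1.0, known_s2=None):
    """Posterior Pr(s1 = 1 | r) for the sum query r = s1 + s2 + noise.

    joint[s1][s2] is the joint probability Pr(s1, s2). If known_s2 is given,
    the adversary also conditions on the exact value of s2 (prior knowledge).
    """
    s2_values = (0, 1) if known_s2 is None else (known_s2,)
    weights = []
    for s1 in (0, 1):
        weights.append(
            sum(joint[s1][s2] * laplace_pdf(r - (s1 + s2), lam) for s2 in s2_values)
        )
    return weights[1] / (weights[0] + weights[1])


# Joint distributions for the three special cases (rows: s1, columns: s2).
CASES = {
    "positive": [[0.5, 0.0], [0.0, 0.5]],        # s1 = s2
    "negative": [[0.0, 0.5], [0.5, 0.0]],        # s1 + s2 = 1
    "independent": [[0.25, 0.25], [0.25, 0.25]],
}

# For each case: (posterior without prior knowledge, posterior knowing s2 = 1).
results = {
    name: (
        posterior_s1_is_1(joint, r=2.0),
        posterior_s1_is_1(joint, r=2.0, known_s2=1),
    )
    for name, joint in CASES.items()
}

for name, (weak, strong) in results.items():
    print(f"{name:12s} without s2: {weak:.4f}   with s2 = 1: {strong:.4f}")
```

Under these assumptions, the positive case moves from a high-confidence guess to certainty once $s_{2}$ is known, the negative case moves from an uninformative posterior of $0.5$ to certainty that $s_{1}=0$, and the independent case gives the same posterior with and without $s_{2}$, in line with the three cases above.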

The above special correlation cases show that an adversary with certain prior knowledge can obtain different privacy gains under different types of correlations. For general correlation cases, i.e., when correlations are weakly positive or weakly negative (cases with red backgrounds in Fig. 2), the adversary can also infer additional information through the published results. Meanwhile, when correlations are perfectly positive or negative, adversaries with different prior knowledge can also gain different privacy information.

As demonstrated in the above examples, prior knowledge can be utilized by adversaries to infer sensitive information, leading to serious threats to various privacy-preserving scenarios, such as data publishing [15, 16, 17, 18], continuous data release [19, 20, 21, 22], location-based services [23, 24, 25], and social networks [26, 27]. To achieve efficient privacy protection for correlated data, it is essential to conduct rigorous theoretical studies to understand the analytical relationship between prior knowledge and privacy leakage, which is the main goal of this paper.

There have been several research efforts devoted to this fundamental problem. The sequential composition theorem [7] of DP states that correlated data cause a linear increase in privacy leakage if the correlated data are simply treated as a whole. However, this approach does not sufficiently exploit the correlation and leads to low utility for weakly correlated data. Therefore, many works [16, 19, 28, 25, 29] have focused on exploiting correlations to achieve high utility without sacrificing the privacy guarantee. However, these works do not consider adversaries with different prior knowledge, which has significant impacts on privacy leakage. Specifically, it has been demonstrated that without assumptions about the adversaries’ prior knowledge, no privacy guarantee can be achieved [13, 30]. To measure the impacts of prior knowledge, Pufferfish privacy [12] and Blowfish privacy [31] formally model prior knowledge in their mathematical privacy definitions. However, neither work [12, 31] provides an analytical characterization of the impact of correlation and prior knowledge on privacy leakage.

The state-of-the-art research, Bayesian differential privacy (BDP) [32], explicitly describes the relationship between privacy leakage and prior knowledge for a special case, i.e., when data are positively correlated. However, under different types of correlations, the maximal influence of one tuple on the query result, i.e., the sensitivity, is different, which leads to different privacy leakage. Therefore, it is necessary to discuss privacy leakage under all types of correlations, with coefficients ranging from $-1$ to $1$ (including negative, independent, and positive correlations). As BDP is based on a Laplacian matrix that can only model positive correlations for sum queries, the analytical method and conclusions in [32] cannot be generalized to negative correlations or hybrid correlations (i.e., where positive and negative correlations coexist).

In summary, the analytical relationship between prior knowledge and privacy leakage under general correlations remains unclear. To address this problem, this paper presents the first unified analysis that considers positive, negative, and hybrid data correlations. Our contributions are as follows:

1. We propose the definition of prior differential privacy (PDP) to measure privacy leakage caused by an adversary with any prior knowledge under general correlations. Based on PDP, we present a unified formulation (Theorem 2) to measure and discuss (Theorem 3) the impact of privacy leakage under varied prior knowledge and data correlations. Both the formulation and discussion can help us better understand the impact of prior knowledge and data correlation on privacy leakage.

2. We analyze privacy leakage for both discrete and continuous data. For discrete data, we propose a graph model to present the structure of the adversaries’ prior knowledge, and a chain rule (Theorem 5) to compute the privacy leakage. For continuous data, instead of a Markov random field, we adopt the multivariate Gaussian model to present general data correlations and derive a closed-form expression to compute privacy leakage (Theorem 6). Our analytic method is based on the theory of Bayesian inference. The analytical results can guide us in designing more efficient mechanisms with better utility-privacy tradeoffs.

3. We demonstrate that the analytic results can be applied to general linear queries, including count, sum, mean, and histogram. Extensive numerical simulation results verify our theoretical analysis.

The remainder of this paper is organized as follows. Section II introduces the related work. Section III introduces the notations and presents some preliminary knowledge. In Section IV, we propose a new definition, PDP, to analyze the impacts of prior knowledge, and we illustrate that three factors can impact privacy leakage. Section V and Section VI present the theoretical analysis of privacy leakage for discrete data and continuous data, respectively. Numerical experiments are presented in Section VII, and we conclude this paper in Section VIII.

II Related Work

II-A Data Correlation

Many studies [13, 28, 25, 29] have demonstrated that DP may not guarantee its expected privacy when data are correlated. There are two plausible solutions to protecting the privacy of correlated data records. One is to achieve DP on each data record independently. However, the composition theorem [7] of DP has demonstrated that the privacy guarantee degrades with the number of correlated records. Another is to take the data records as a whole [33, 27, 34]. However, when the number of records is large, or the correlation is weak, the utility will still be low.

Therefore, it is crucial to accurately measure the data correlations to achieve more efficient privacy protection. Considerable work has been done from different perspectives. For general correlations, some work replaces the global sensitivity with new correlation-based parameters, such as correlated sensitivity [35] and correlated degree [36]. For example, in [35], a correlation coefficient matrix was utilized to describe the correlation of a series, and the correlation coefficient was considered as the weight to compute the global sensitivity. By utilizing inter- and intra-coupling, [36] proposed behavior functions to model the degree of correlation. For temporal correlations, most of the research work has focused on saving the privacy budget consumption in time-series data [37, 19, 24, 22]. For example, Dwork [37] proposed a cascade buffer counter algorithm to adaptively update the output result on a $\{0,1\}$ data stream. Fan [19] adopted a PID controller-based sampling strategy to adaptively inject Laplace noise into time-series data to improve the utility. For spatial correlations, the main idea is to group and perturb the statistics over correlated regions to avoid noise overdose [23, 38]. As a typical example, Wang [23] proposed dynamically grouping the sparse regions with similar trends and adding the same noise to reduce errors. In addition, for attribute correlations in multiattribute datasets, the fundamental idea is to reduce the dimensionality by identifying the attribute correlations [39, 40]. For example, Zhang et al. [39] constructed a Bayesian network to model the attribute correlation in high-dimensional data and then synthesized a privacy-preserving dataset in an ad hoc way. However, all these works assumed that adversaries have fixed prior knowledge, and thus may not achieve the optimal tradeoff against adversaries with different prior knowledge. In this paper, we consider both data correlations and flexible prior knowledge.

II-B Prior Knowledge

Prior knowledge can influence privacy leakage when the data are correlated [28, 32, 41], and it has been considered in different lines of research in terms of privacy definitions and the design of privacy-preserving mechanisms. For example, the Pufferfish framework [12], aiming to help domain experts customize privacy definitions, theoretically has the potential to include all kinds of adversaries. The subsequent work of Blowfish privacy [31] developed mechanisms that permit more utility by specifying secrets about individuals and constraints about the data. In [42], a Wasserstein mechanism was proposed to fulfill the Pufferfish framework. In addition, [41] studied privacy leakage caused by the weakest adversary, and proposed the identity differential privacy (IDP) model. [43] exploited the structural characteristics of databases and the prior knowledge of domain experts to improve utility. However, no theoretical analysis of the relationship between prior knowledge and privacy leakage has been formulated in any of these works. In some research [44, 45], privacy leakage was guaranteed by limiting the difference between prior knowledge and posterior knowledge. However, in these works, the adversaries’ prior knowledge was limited to the probability distribution of the database, and the possibility that partial data records may be compromised by specific adversaries was not considered. Instead, [32, 28] separated the adversary’s specific prior knowledge of partial tuples from the public knowledge of data correlations, which are derived from data distributions. Based on that, Yang et al. [32] adopted a Gaussian correlation model to study the impact of prior knowledge and demonstrated that the weakest adversary could cause the highest privacy leakage. Similar conclusions can be found in [28], which further identifies the maximally correlated group of data tuples to improve the utility. Nonetheless, the limitation is that their Laplacian-matrix-based Markov random field model can only be applied to analyze positive correlations on sum queries for continuous data or binary discrete data.

In this paper, we formally derive a formulation to present a unified analysis of the impact of data correlation and prior knowledge on privacy leakage, considering general linear queries on both discrete and continuous data.

III Preliminaries

We describe the notations and concepts in Subsection III-A, and introduce some knowledge of DP that will be used in our analysis in Subsection III-B.

III-A Notations

A database with $n$ tuples (attributes in a table or nodes in a graph), indexed by the set $[n]=\{1,2,\cdots,n\}$, aims to release the result of a certain query function $s=f(\mathbf{x})$ on an instance of the database, $\mathbf{x}=\{x_{1},x_{2},\cdots,x_{n}\}$. It should be noted that, in accordance with [32, 29, 28], we use the term “tuple” to denote an attribute instead of a record in a database. To protect the privacy of all tuples of an instance, the database returns the noisy answer $r=\mathcal{M}(f(\mathbf{x}))$ by adding random noise drawn from a distribution. Hence, the outputs follow a probability distribution $\Pr(\mathcal{M}(f(\mathbf{x}))\in S)$ over output sets $S$, or equivalently a conditional distribution $\Pr(r\in S|f(\mathbf{x})=s)$. We use a set $\Theta$ to capture the adversary’s beliefs about the data correlation. We do not guarantee privacy against adversaries whose beliefs lie outside $\Theta$, because no privacy guarantee is feasible under arbitrary distributions [11]. The main notations are listed in Table I.

III-A1 Adversary and Prior Knowledge

We denote by $\mathcal{A}_{i,\mathcal{K}}$ an adversary who attempts to infer the information of tuple $x_{i}$ under the assumption that he knows the values of $\mathbf{x}_{\mathcal{K}}$. We call $x_{i}$ the attack object and $\mathbf{x}_{\mathcal{K}}$ the prior knowledge, where $\mathcal{K}\subseteq[n]\setminus\{i\}$ and $[n]=\{1,2,\cdots,n\}$. Let $\mathcal{U}$ denote the index set of unknown tuples; then $[n]=\mathcal{K}\cup\{i\}\cup\mathcal{U}$ and the dataset is $\mathbf{x}=\{\mathbf{x}_{\mathcal{K}},x_{i},\mathbf{x}_{\mathcal{U}}\}$. An adversary $\mathcal{A}_{i,\mathcal{K}}$ is called the strongest adversary when $\mathcal{K}=[n]\setminus\{i\}$ and the weakest adversary when $\mathcal{K}=\emptyset$. $\mathcal{A}_{i,\mathcal{K}^{\prime}}$ is called an ancestor of $\mathcal{A}_{i,\mathcal{K}}$ if $\mathcal{K}^{\prime}$ is a subset of $\mathcal{K}$ that differs from it by only one tuple, i.e., $\mathcal{K}^{\prime}=\mathcal{K}\backslash\{j\}$. More tuples in $\mathbf{x}_{\mathcal{K}}$ mean that the adversary has stronger prior knowledge.
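To make the adversary notation concrete, the sketch below enumerates every prior-knowledge set $\mathcal{K}$ for a fixed attack object $i$, together with the corresponding unknown set $\mathcal{U}$ and the ancestors of a given adversary. The function names `prior_knowledge_sets`, `unknown_set`, and `ancestors` are hypothetical and introduced only for illustration.

```python
from itertools import combinations


def prior_knowledge_sets(n, i):
    """Yield every K that is a subset of [n] minus {i}, from weakest to strongest."""
    others = [j for j in range(1, n + 1) if j != i]
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            yield frozenset(subset)


def unknown_set(n, i, K):
    """Return U such that [n] is the disjoint union of K, {i}, and U."""
    return frozenset(range(1, n + 1)) - K - {i}


def ancestors(K):
    """Return all K' obtained from K by removing exactly one tuple j."""
    return [K - {j} for j in sorted(K)]


if __name__ == "__main__":
    n, i = 4, 1
    all_sets = list(prior_knowledge_sets(n, i))
    print("number of adversaries attacking x_1:", len(all_sets))
    print("weakest adversary K =", set(all_sets[0]))
    print("strongest adversary K =", set(all_sets[-1]))
    print("ancestors of K = {2, 3}:", [set(a) for a in ancestors(frozenset({2, 3}))])
```

For $n$ tuples there are $2^{n-1}$ adversaries per attack object, which is why the later analysis organizes them by the ancestor relation rather than treating each one separately.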

III-A2 Correlation

To measure data correlations, we adopt the Pearson correlation coefficient, which can identify linear correlations. More importantly, it can be used to distinguish positive correlations and negative correlations. In joint distribution $\theta$ , let $\rho_{ij,\mathcal{K}}$ denote the correlation coefficient of $x_{i}$ and $x_{j}$ under the condition $\mathbf{x}_{\mathcal{K}}$ . In this paper, $\rho_{ij,\mathcal{K}}$ plays an important role in the analysis of how prior knowledge affects privacy leakage.

III-A3 Linear Query

A linear query function can be represented as $f(\mathbf{x})=\sum_{i}a_{i}x_{i}$, where $x_{i},x_{j}\in\mathbf{x}$ are correlated with the Pearson correlation coefficient $\rho_{ij}$. The linear query function can be transformed into a sum query $f(\mathbf{y})=\sum_{i}y_{i}$ on a new database $\mathbf{y}$ by letting $y_{i}=a_{i}x_{i}$, where $y_{i}\in\mathbf{y}$. Then, the correlation coefficient of $y_{i}$ and $y_{j}$ is $\rho_{ij}^{\prime}=sign(a_{i}a_{j})\rho_{ij}$. Because our new privacy definition PDP (discussed in Subsection IV-A) and our models can deal with general correlations, we focus our analysis on the sum query without loss of generality, and the conclusions can be straightforwardly extended to general linear queries.
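The sign rule for the transformed correlation can be checked numerically. The following sketch uses synthetic data and illustrative coefficients (`a1 = 2.0`, `a2 = -3.0`), neither of which comes from the paper; `pearson` and `sign` are hypothetical helper names.

```python
import random


def pearson(u, v):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


rng = random.Random(0)

# Synthetic, positively correlated tuples x1 and x2 (illustrative data only).
x1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x2 = [0.6 * a + 0.8 * rng.gauss(0.0, 1.0) for a in x1]

# Linear query f(x) = a1 * x1 + a2 * x2, rewritten as a sum query over y_i = a_i * x_i.
a1, a2 = 2.0, -3.0
y1 = [a1 * v for v in x1]
y2 = [a2 * v for v in x2]

rho_x = pearson(x1, x2)
rho_y = pearson(y1, y2)

print("rho(x1, x2)            =", round(rho_x, 4))
print("rho(y1, y2)            =", round(rho_y, 4))
print("sign(a1 * a2) * rho_x  =", round(sign(a1 * a2) * rho_x, 4))
```

Because the coefficients have opposite signs here, a positive correlation between $x_{1}$ and $x_{2}$ becomes a negative correlation of the same magnitude between $y_{1}$ and $y_{2}$, so a model restricted to positive correlations would not cover this query.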

III-B Differential Privacy

Definition 1.

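(Differential Privacy [9]). A randomized mechanism $\mathcal{M}$ satisfies $\varepsilon$-differential privacy ($\varepsilon$-DP) if, for any $S\subseteq Range\{\mathcal{M}\}$ and any two different values $x_{i},x_{i}^{\prime}$ of a tuple $i$,

$$\mathrm{DP}(\mathcal{M}) = \sup_{i,\,\mathbf{x}_{-i},\,x_{i},\,x_{i}^{\prime},\,S} \log \frac{\Pr(r \in S \mid x_{i}, \mathbf{x}_{-i})}{\Pr(r \in S \mid x_{i}^{\prime}, \mathbf{x}_{-i})} \leq \varepsilon.$$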

Here, $\varepsilon>0$ is the distinguishable bound of all outputs on neighboring datasets $\mathbf{x}$ and $\mathbf{x}^{\prime}$ , where $\mathbf{x}^{\prime}$ is the database $\mathbf{x}$ with $x_{i}$ replaced with $x_{i}^{\prime}$ . A larger $\varepsilon$ corresponds to easier distinguishability of $x_{i}$ and $x_{i}^{\prime}$ , which means more privacy leakage.

For numerical data, a Laplace mechanism [9] can be used to achieve $\varepsilon$ -DP, by adding carefully calibrated noise to the query results. In particular, we draw noise from Laplace distribution $Lap(\lambda)$ with the probability density function

$$p(z) = \frac{1}{2\lambda} \exp(-|z|/\lambda),$$

in which $\lambda={GS(f)}/{\varepsilon}$ . Here, $GS(f)=\sup_{\mathbf{x},\mathbf{x}^{\prime}}\|f(\mathbf{x})-f(\mathbf{x}^{\prime})\|_{1}$ is the global sensitivity of query $f(\cdot)$ , and $LS_{i}(f)=\sup_{\mathbf{x}^{\prime}}\|f(\mathbf{x})-f(\mathbf{x}^{\prime})\|_{1}$ is the local sensitivity of $f$ . Since $r=f(\mathbf{x})+z$ , the probability density function of the output can be represented as

[TABLE]
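As a concrete illustration of the Laplace mechanism for a sum query, the sketch below draws noise with scale $\lambda=GS(f)/\varepsilon$ and evaluates the output density on two neighboring databases. The database values, the seed, and all function names are assumptions made for the example; the sampler uses the standard fact that the difference of two independent exponential variables with mean $\lambda$ follows $Lap(\lambda)$.

```python
import math
import random

def laplace_pdf(z, lam):
    """Density of Lap(lam) at z: exp(-|z|/lam) / (2*lam)."""
    return math.exp(-abs(z) / lam) / (2.0 * lam)

def laplace_noise(lam, rng):
    """Draw one Lap(lam) sample as the difference of two exponentials with mean lam."""
    return rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)

def laplace_mechanism(x, epsilon, global_sensitivity, rng):
    """Perturb the sum query f(x) = sum(x) with noise of scale GS(f)/epsilon."""
    lam = global_sensitivity / epsilon
    return sum(x) + laplace_noise(lam, rng)

def output_pdf(r, x, lam):
    """Density of the output r = f(x) + z for the sum query."""
    return laplace_pdf(r - sum(x), lam)

rng = random.Random(0)      # fixed seed, only for reproducibility
x = [1, 0, 1, 1]            # hypothetical binary database
x_prime = [0, 0, 1, 1]      # neighboring database: x_1 replaced
epsilon, gs = 0.5, 1.0      # GS(f) = 1 for a sum over {0,1}-valued tuples
lam = gs / epsilon
r = laplace_mechanism(x, epsilon, gs, rng)
ratio = output_pdf(r, x, lam) / output_pdf(r, x_prime, lam)
```

For any output `r`, the density ratio stays within $[e^{-\varepsilon},e^{\varepsilon}]$, which is the $\varepsilon$-DP guarantee for these two neighbors.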

IV Prior Differential Privacy

To compute the privacy leakage when considering adversaries with different prior knowledge and databases with different joint distributions, we propose a new definition in Subsection IV-A. Furthermore, we illustrate that three factors can affect privacy leakage through three numerical examples in Subsection IV-B.

IV-A Prior Differential Privacy

To evaluate privacy leakage when adversaries have different prior knowledge, the definition BDP was proposed in [32] based on the Bayesian inference method [11, 46, 14]. However, BDP can only be applied to positive correlations. To overcome this drawback, we propose a definition named Prior Differential Privacy (PDP), which can be applied to databases with general correlations.

Definition 2.

(Prior Differential Privacy) Let $\mathbf{x}$ be a database instance with $n$ tuples, and let $\mathcal{A}_{i,\mathcal{K}}$ be an adversary with the attack object $x_{i}$ and prior knowledge $\mathbf{x}_{\mathcal{K}},\mathcal{K}\subseteq[n]\setminus\{i\}$ . The joint distribution of $\mathbf{x}$ is denoted as $\theta,\theta\in\Theta$ , where $\Theta$ is a set of distributions. $\mathcal{M}=\mathrm{\Pr}(r\in S|\mathbf{x})$ is a randomized perturbation mechanism, and $S$ is the output space. The privacy leakage of $\mathcal{M}$ w.r.t. $\mathcal{A}_{i,\mathcal{K}}$ is the maximum of the logarithmic probability ratio over all pairs of different values $x_{i}$ , $x_{i}^{\prime}$ , and any output $r\in S$ :

[TABLE]

We say $\mathcal{M}$ satisfies $\varepsilon$ -PDP if Eq. (2) holds for any $i\in[n]$ , $\mathcal{K}\subseteq[n]\backslash\{i\}$ , $\theta\in\Theta$ . That is,

[TABLE]

In Definition 2, $l_{\mathcal{A}_{i,\mathcal{K}}}(\theta,\mathcal{M})$ is the privacy leakage caused by the adversary $\mathcal{A}_{i,\mathcal{K}}$ under the distribution $\theta$ , which represents the data correlation. $\varepsilon$ is the maximal privacy leakage over all adversaries and all distributions in the public set $\Theta$ . Compared with BDP, which only considers a single distribution, PDP considers a set of distributions $\Theta$ . Thus, PDP is more reasonable because the set $\Theta$ can reflect the cognitive diversity of the aggregator and the adversaries.

PDP is consistent with Bayesian inference: Eq. (2) can be written as

[TABLE]

Eq. (3) denotes the information gain achieved by the adversary $\mathcal{A}_{i,\mathcal{K}}$ after observing the published result $r$ . In addition, PDP guarantees that the maximal information gain of any possible adversary is no larger than $\varepsilon$ . The next theorem shows that prior knowledge impacts privacy leakage only when the database is correlated.

Theorem 1.

Prior knowledge has no impact on privacy leakage when tuples in the database are mutually independent.

Proof.

For an adversary $\mathcal{A}_{i,\mathcal{K}}$ and its ancestor $\mathcal{A}_{i,\mathcal{K}^{\prime}},\mathcal{K}^{\prime}=\mathcal{K}\backslash\{j\}$ , we get

[TABLE]

The last equality holds when the data tuples are independent, i.e., $\Pr(x_{j}|x_{i},\mathbf{x}_{\mathcal{K}^{\prime}})=\Pr(x_{j})$ . Similarly, $\Pr(r\in S|x_{i}^{\prime},\mathbf{x}_{\mathcal{K}^{\prime}})=\sum_{x_{j}}\Pr(x_{j})\Pr(r\in S|x_{i}^{\prime},\mathbf{x}_{\mathcal{K}}).$ If $l_{\mathcal{A}_{i,\mathcal{K}}}(\theta,\mathcal{M})\leq\varepsilon$ , according to PDP, we have $\frac{\Pr(r\in S|x_{i},\mathbf{x}_{\mathcal{K}})}{\Pr(r\in S|x_{i}^{\prime},\mathbf{x}_{\mathcal{K}})}\in[e^{-\varepsilon},e^{\varepsilon}]$ for every value of $x_{j}$ . Multiplying the numerator and denominator of this fraction by $\Pr(x_{j})$ and summing each with respect to $x_{j}$ preserves these bounds, so we obtain $l_{\mathcal{A}_{i,\mathcal{K}^{\prime}}}(\theta,\mathcal{M})\leq\varepsilon$ by the definition of PDP. Therefore, different prior knowledge $\mathcal{K}$ and $\mathcal{K}^{\prime}$ have the same privacy leakage. ∎

Theorem 1 is also consistent with Eq. (3). If tuples are independent, then $\mathbf{x}_{\mathcal{K}}$ can be omitted in Eq. (3). Therefore, the prior knowledge has no impact on the privacy leakage when tuples are independent.

Remark 1. It is worth noting that DP and PDP are also consistent in nature. Both reflect the maximal distinguishability between the distributions of the perturbed output calculated on two neighboring datasets. In this paper, neighboring datasets are obtained by modifying one record in the dataset. DP and PDP differ only in the form of the neighboring datasets. In DP, the neighboring datasets are $\{x_{i},\mathbf{x_{-i}}\}$ and $\{x_{i}^{\prime},\mathbf{x_{-i}}\}$ . However, in PDP, the neighboring datasets are $\{x_{i},\mathbf{x}_{\mathcal{K}}\}$ and $\{x_{i}^{\prime},\mathbf{x}_{\mathcal{K}}\}$ . For any given $\theta$ and $\mathbf{x}_{\mathcal{K}}$ , we have

[TABLE]

The last equality in Eq. (4) applies DP on datasets differing in at most $|\mathcal{U}|+1$ tuples, where $\mathcal{U}=[n]\backslash(\mathcal{K}\cup\{i\})$ indexes the tuples unknown to the adversary. The inequality in Eq. (4) holds because $\frac{\sum_{i}a_{i}c_{i}}{\sum_{i}b_{i}d_{i}}\leq\max_{i,j}\frac{c_{i}}{d_{j}}$ , if all parameters are nonnegative and $\{a_{i}\}$ and $\{b_{i}\}$ lie on the probability simplex. Taking the logarithm and supremum of Eq. (4) over $r$ , we obtain $l_{\mathcal{A}_{i,\mathcal{K}}}(\theta,\mathcal{M})\leq DP(\mathcal{M})$ . Therefore, DP provides an upper bound on the privacy leakage under PDP, i.e., we achieve a better trade-off between privacy and utility by adopting PDP than by adopting DP.
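The ratio inequality used in Remark 1 follows because a convex combination of the $c_{i}$ is at most $\max_{i}c_{i}$ and a convex combination of the $d_{j}$ is at least $\min_{j}d_{j}$ . The sketch below checks this on random inputs; the helper names, the seed, and the value ranges are illustrative assumptions, and the $d_{j}$ are kept strictly positive so that the ratios are defined.

```python
import random

def random_simplex(n, rng):
    """Return n nonnegative weights that sum to 1."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [v / total for v in w]

def ratio_bound_holds(a, b, c, d):
    """Check sum(a_i c_i) / sum(b_i d_i) <= max over (i, j) of c_i / d_j."""
    lhs = sum(ai * ci for ai, ci in zip(a, c)) / sum(bi * di for bi, di in zip(b, d))
    rhs = max(c) / min(d)  # the largest c_i over the smallest d_j
    return lhs <= rhs + 1e-12

rng = random.Random(1)
all_hold = all(
    ratio_bound_holds(
        random_simplex(5, rng),
        random_simplex(5, rng),
        [rng.uniform(0.1, 2.0) for _ in range(5)],
        [rng.uniform(0.1, 2.0) for _ in range(5)],
    )
    for _ in range(1000)
)
```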

IV-B Influence Factors

In this section, we demonstrate that three factors, prior knowledge $\mathbf{x}_{\mathcal{K}}$ , joint distribution $\theta$ , and local sensitivity $LS_{i}(f)$ , impact privacy leakage through numerical Example 2 to Example 4.

As shown in Table II, there are four joint distributions of database $\mathbf{x}=\{x_{1},x_{2}\}$ . The first three distributions have the same domain $x_{1},x_{2}\in\{0,1\}$ with different correlations; the third and fourth distributions have the same correlation, but $x_{2}$ has a different domain. We consider a sum query $f(\mathbf{x})=x_{1}+x_{2}$ and set the Laplace mechanism scale to $\lambda=1$ for simplicity. Denote $l_{\mathcal{A}_{1,\mathcal{K}}}(a)$ as the privacy leakage caused by the adversary $\mathcal{A}_{1,\mathcal{K}}$ when the distribution of $\mathbf{x}$ is that of Table III(a).

Example 2 (Prior Knowledge). Two adversaries, $\mathcal{A}_{1,\{2\}}$ and $\mathcal{A}_{1,\emptyset}$ , attempt to infer whether $x_{1}=0$ or $x_{1}=1$ . $\mathcal{A}_{1,\{2\}}$ knows the value of $x_{2}$ (e.g., $x_{2}=1$ ), and $\mathcal{A}_{1,\emptyset}$ knows nothing about $x_{2}$ . Based on the definition of PDP, we calculate $l_{\mathcal{A}_{1,\{2\}}}(a)$ and $l_{\mathcal{A}_{1,\emptyset}}(a)$ . For $\mathcal{A}_{1,\{2\}}$ and $x_{2}=1$ , we get

[TABLE]

When $\mathcal{A}_{1,\emptyset}$ knows nothing about $x_{2}$ , according to Eq. (2), we have

[TABLE]

The exponential terms come from the Laplace density evaluated for the given $x_{1},x_{2}$ . Similarly, $l_{\mathcal{A}_{1,\emptyset}}(b)\approx 0.82.$ Therefore,

[TABLE]

Example 2 shows that prior knowledge has significant influence when the correlations are different. More importantly, it answers the two problems extended from Example 1. In addition, we note that the privacy leakage of DP is 2 if we simply regard correlated tuples $x_{1}$ , $x_{2}$ as a whole. Therefore, we achieve stricter privacy protection than DP under the same noise mechanism. In other words, we can introduce less noise to obtain the same privacy level.

Example 3 (Correlation). An adversary $\mathcal{A}_{1,\emptyset}$ attempts to infer the value of $x_{1}$ with no prior knowledge about $x_{2}$ . To show the impacts of the correlations, we modify $0.3\rightarrow 0.49,0.2\rightarrow 0.01$ in Tables III(a) and III(b) to obtain two distributions (a’) and (b’), in which $x_{1}$ and $x_{2}$ have stronger correlation. Computations of $l_{\mathcal{A}_{1,\emptyset}}(a^{\prime})$ and $l_{\mathcal{A}_{1,\emptyset}}(b^{\prime})$ are similar to $l_{\mathcal{A}_{1,\emptyset}}(a)$ in Example 2. According to Eq. (2), we obtain $l_{\mathcal{A}_{1,\emptyset}}(a^{\prime})=1.95$ , $l_{\mathcal{A}_{1,\emptyset}}(b^{\prime})=0.05$ . Therefore,

[TABLE]

Example 3 demonstrates that different correlations have a significant influence on privacy leakage. In particular, Eq. (7) shows that the adversary can infer more information about $x_{1}$ through a stronger positive correlation, and Eq. (8) shows the opposite result when the correlation is negative.

Example 4 (Local Sensitivity). An adversary $\mathcal{A}_{1,\emptyset}$ , with no prior knowledge of $x_{2}$ , attempts to infer $x_{1}$ . The difference in distributions Tables III(c) and III(d) is the domain of $x_{2}$ . Based on PDP and the similarity of computations of $l_{\mathcal{A}_{1,\emptyset}}(a)$ in Example 2, we have $l_{\mathcal{A}_{1,\emptyset}}(c)=2$ , and $l_{\mathcal{A}_{1,\emptyset}}(d)=6$ . Therefore,

[TABLE]

For the sum query on $\mathbf{x}$ , the local sensitivity of $x_{i}$ is the range of its own domain. In distribution Table III(c), $LS_{2}(f)/LS_{1}(f)=1$ . In distribution Table III(d), $LS_{2}(f)/LS_{1}(f)=5$ . Example 4 shows that the local sensitivity impacts privacy leakage and that a larger sensitivity ratio can lead to higher privacy leakage.

Examples 2-4 demonstrate that three factors impact privacy leakage, and show how to compute the privacy leakage for a database composed of two tuples. In the following sections, we will extend the numerical results to analytical results for both discrete and continuous data.
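The computations in Examples 3 and 4 can be reproduced with a short script. The sketch below is an illustration under stated assumptions: the table entries are not reproduced in this section, so the joint distributions are assumed values chosen to be consistent with the text (0.49 on equal values and 0.01 on unequal values for (a'), the reverse for (b'), and perfectly correlated tuples with $x_{2}\in\{0,1\}$ and $x_{2}\in\{0,5\}$ for (c) and (d)). The supremum over $r$ is approximated on a finite grid, and $\lambda=1$ as in the examples.

```python
import math

def laplace_pdf(z, lam):
    """Density of Lap(lam) at z."""
    return math.exp(-abs(z) / lam) / (2.0 * lam)

def leakage_no_prior(joint, lam=1.0, step=0.05, margin=10.0):
    """PDP leakage of the adversary A_{1, emptyset} for the sum query x1 + x2.

    joint maps (x1, x2) -> probability. The supremum over outputs r is
    approximated on a grid that extends `margin` beyond the possible sums.
    """
    # Conditional distributions Pr(x2 | x1) derived from the joint distribution.
    rows = {}
    for (x1, x2), p in joint.items():
        rows.setdefault(x1, {})[x2] = p
    cond = {}
    for x1, row in rows.items():
        total = sum(row.values())
        cond[x1] = {x2: p / total for x2, p in row.items()}

    def density(r, x1):
        # Pr(r | x1) = sum over x2 of Pr(x2 | x1) * Lap(r - x1 - x2; lam)
        return sum(p * laplace_pdf(r - x1 - x2, lam) for x2, p in cond[x1].items())

    sums = [x1 + x2 for (x1, x2) in joint]
    lo, hi = min(sums) - margin, max(sums) + margin
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return max(
        math.log(density(r, a) / density(r, b))
        for a in cond for b in cond if a != b
        for r in grid
    )

# Assumed joint distributions (illustrative; see the lead-in above).
a_prime = {(0, 0): 0.49, (1, 1): 0.49, (0, 1): 0.01, (1, 0): 0.01}
b_prime = {(0, 1): 0.49, (1, 0): 0.49, (0, 0): 0.01, (1, 1): 0.01}
dist_c = {(0, 0): 0.5, (1, 1): 0.5}
dist_d = {(0, 0): 0.5, (1, 5): 0.5}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

results = {
    "a_prime": leakage_no_prior(a_prime),
    "b_prime": leakage_no_prior(b_prime),
    "c": leakage_no_prior(dist_c),
    "d": leakage_no_prior(dist_d),
    "independent": leakage_no_prior(independent),
}
```

Under these assumptions the script gives roughly 1.95 and 0.05 for (a') and (b'), and 2 and 6 for (c) and (d), in line with the values reported in the examples. The independent distribution yields $LS_{1}(f)/\lambda=1$ , the same leakage as an adversary who knows $x_{2}$ , as Theorem 1 predicts.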

V General Relationship Analysis and Privacy Leakage Computation for Discrete Data

In this section, we analyze privacy leakage with respect to the three factors when data are discrete. Subsection V-A presents a weighted hierarchical graph (WHG) to model all adversaries with various prior knowledge. Subsection V-B discusses how to calculate the weight of edges in the WHG. Subsection V-C formulates a chain rule to represent the privacy leakage for an adversary with arbitrary prior knowledge. Subsection V-D presents a full-space-searching algorithm to compute the privacy leakage, and a fast-searching algorithm to improve the search efficiency in practice.

V-A Weighted Hierarchical Graph

A hierarchical graph is used to represent adversaries with various prior knowledge. Each node $(i,\mathcal{K})$ denotes an adversary, in which tuple $i$ is the attack object, and the tuple set $\mathcal{K}$ denotes the prior knowledge. For a database with $n$ tuples, there are $n$ layers in the graph. From the bottom to the top, the prior knowledge $\mathcal{K}$ decreases by one tuple for each layer, until $\mathcal{K}=\emptyset$ . To compute the privacy leakage of adversaries, we further construct a weighted hierarchical graph (WHG) by assigning weights to the edges in the graph. We first define the value of a node as the privacy leakage caused by the corresponding adversary. In addition, the edge connecting two nodes denotes the privacy leakage difference between two adversaries with neighboring prior knowledge sets, i.e., $|\mathcal{K}|-|\mathcal{K}^{\prime}|=1$ . Then, the process of analyzing the privacy leakage is as follows. First, we construct the hierarchical graph for all possible adversaries for a given database. Second, we compute the values of all edges in the graph to obtain the WHG (discussed in Subsection V-B). Third, we compute the values of nodes in the first layer by PDP. Fourth, we obtain all nodes’ values by applying a chain rule (Theorem 5). Finally, the privacy leakage is obtained by choosing the maximal node value.

For example, we can obtain a WHG consisting of three layers and twelve nodes for a simple database with three tuples, as shown in Fig. 3. Based on the node $(2,\{1,3\})$ and edges $e_{23,1}$ and $e_{21,3}$ , we obtain the privacy leakage for nodes $(2,\{1\})$ and $(2,\{3\})$ . Similarly, we can obtain the other four nodes in the second layer. For the node $(2,\emptyset)$ , we compute two values based on $(2,\{1\}),e_{21,\emptyset}$ and $(2,\{3\}),e_{23,\emptyset}$ , and choose the minimum as its privacy leakage. Similarly, we obtain the other two nodes in the third layer. The privacy leakage is then the maximal node value in the graph.
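The structure of the hierarchical graph can be enumerated directly. The sketch below is illustrative only (the function names `whg_layers` and `whg_edges` are assumptions): it lists the nodes $(i,\mathcal{K})$ layer by layer and the edges between adjacent layers, without assigning any weights.

```python
from itertools import combinations

def whg_layers(n):
    """Return the layers of the hierarchical graph for tuples 1..n.

    Layer 1 (bottom) holds the strongest adversaries, who know all other tuples;
    each later layer drops one tuple from the prior knowledge set K.
    """
    layers = []
    for size in range(n - 1, -1, -1):  # |K| = n-1, n-2, ..., 0
        layer = []
        for i in range(1, n + 1):
            others = [t for t in range(1, n + 1) if t != i]
            for K in combinations(others, size):
                layer.append((i, frozenset(K)))
        layers.append(layer)
    return layers

def whg_edges(layers):
    """Edges join a node (i, K) to its ancestor (i, K minus {j}) one layer up."""
    return [
        ((i, K), (i, K - {j}))
        for layer in layers[:-1]
        for (i, K) in layer
        for j in K
    ]

layers = whg_layers(3)
node_count = sum(len(layer) for layer in layers)
edges = whg_edges(layers)
```

For three tuples this yields layers of 3, 6, and 3 nodes (twelve in total, matching $n2^{n-1}$ ) and twelve edges; the node $(2,\emptyset)$ has two incoming edges, from $(2,\{1\})$ and $(2,\{3\})$ .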

In the above process, one key problem is to compute the edge value. Therefore, we propose a formula to address the problem.

V-B Impacts of Correlations and Prior Knowledge

In this section, we mainly deduce the formula to compute the edge value, which represents the impact of privacy leakage caused by different prior knowledge. Meanwhile, we show that the edge value is closely related to data correlation.

Note that the edge value shows the gain of privacy leakage when one tuple is removed from the prior knowledge. If the edge value is positive, then the ancestor, a weaker adversary, can cause more privacy leakage. If the edge value is negative, then the descendant, a stronger adversary, can cause more privacy leakage.

Given $\mathbf{x}_{\mathcal{K}^{\prime}}$ , $\Pr(x_{i},x_{j}|\mathcal{K}^{\prime})$ denotes the conditional distribution derived from the joint distribution $\theta$ , and $\rho_{ij,\mathcal{K}^{\prime}}$ is the corresponding conditional correlation coefficient. The domain of tuple $x_{i}$ is $\{x_{i,1},x_{i,2},\cdots,x_{i,s}\}$ , in which $s$ is the domain size of $x_{i}$ . Based on $\Pr(x_{i},x_{j}|\mathcal{K}^{\prime})$ , the impact of $x_{i}$ on $x_{j}$ , under two different values $x_{i,1},x_{i,2}$ of $x_{i}$ , can be denoted as

[TABLE]

Then, impacts of $x_{i}$ on $x_{j}$ , under all possible pairs $x_{i,m},x_{i,n}$ , can be denoted as a set

[TABLE]

Next, a theorem shows how to compute the edge value.

Theorem 2.

Assume the privacy leakage of an adversary $\mathcal{A}_{i,\mathcal{K}}$ is $l_{\mathcal{A}_{i,\mathcal{K}}}$ , then the privacy leakage of its ancestor $\mathcal{A}_{i,\mathcal{K}^{\prime}}(\mathcal{K}^{\prime}=\mathcal{K}\backslash\{j\})$ is

[TABLE]

where

[TABLE]

is the value of the edge connecting two nodes $(i,\mathcal{K})$ and $(i,\mathcal{K}^{\prime})$ in the WHG.

Proof.

See Appendix A. ∎

Theorem 2 quantifies, through $IC_{ij,\mathcal{K}^{\prime}}$ , the difference in privacy leakage between two adversaries whose prior knowledge differs by one tuple under general correlation. According to Theorem 2, the value $IC_{ij,\mathcal{K}^{\prime}}$ is the element in the set $\Gamma_{ij,\mathcal{K}^{\prime}}$ that maximizes the privacy leakage of node $(i,\mathcal{K}^{\prime})$ . That is, $IC_{ij,\mathcal{K}^{\prime}}$ represents the maximal impact of $x_{i}$ on $x_{j}$ under the conditional distribution $\Pr(x_{i},x_{j}|\mathbf{x}_{\mathcal{K}})$ .

To show the relationship between $IC_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ and the three factors described in Subsection IV-B, we rewrite the $IC_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ as

[TABLE]

where

[TABLE]

is called the increment ratio and denotes the impact caused by correlations. $IC_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ represents the variation of privacy leakage when prior knowledge decreases. Therefore, the two components $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ and ${LS_{j}(f)}/{\lambda}$ in Eq. (12) represent the impact of correlation and local sensitivity, respectively.

Now, we give the relationship between $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ and conditional correlation coefficient $\rho_{ij,\mathcal{K}^{\prime}}$ .

Theorem 3.

(1) For a database $\mathbf{x}$ with all possible joint distributions, $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})\in[-1,1]$ . (2) Under the assumption that the domain size of $x_{i}$ and $x_{j}$ are two, then $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})$ has the following relationship with $\rho_{ij,\mathcal{K}^{\prime}}$ :

1. if $\rho_{ij,\mathcal{K}^{\prime}}>0$ , then $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})\in(0,1]$ ;

2. if $\rho_{ij,\mathcal{K}^{\prime}}=0$ , then $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})=0$ ;

3. if $\rho_{ij,\mathcal{K}^{\prime}}<0$ , then $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})\in[-1,0)$ .

(3) Under the assumption that the domain size of $x_{i}$ is two, the domain size of $x_{j}$ is greater than two, and $\lambda>GS(f)$ , the results in Case (2) still apply.

Proof.

See Appendix B. ∎

Case (1) in Theorem 3 shows that $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ has the same bound as the correlation coefficient. Case (2) shows the relationship between the edge value and the data correlations, and extends the results of Examples 2-4 to general cases. Case (3) shows that similar results hold for a general $x_{j}$ with a larger domain, as long as $\lambda>GS(f)$ . Since the privacy budget $\varepsilon=GS(f)/\lambda$ in DP is commonly set as $\varepsilon<1$ , the condition $\lambda>GS(f)$ is usually true. Theorem 3 shows the impacts of the correlations and prior knowledge on the privacy leakage of the aggregation of two correlated tuples, which correspond to the different cases in Fig. 2.

Combining Theorem 3 with Eq. (11) and Eq. (12), we note that the weaker adversary causes higher privacy leakage when the tuples are positively correlated because more unknown tuples with positive correlations mean a greater sensitivity of the query result. However, when tuples are negatively correlated, the weaker adversary does not necessarily cause less privacy leakage because more unknown tuples with negative correlations do not always mean smaller sensitivity or less privacy leakage.

What about when the domain size of $x_{i}$ is greater than two? Do the results in Theorem 3 still hold? Unfortunately, the answer is negative. Let the domain size of $x_{i}$ be $s$ ; then the number of values $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ is $\binom{s}{2}$ . We cannot guarantee that all these $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ satisfy Theorem 3. Instead, we have the following analytical results.

1. If $\rho_{ij,\mathcal{K}^{\prime}}>0$ , at least one $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})\in(0,1]$ ;

2. if $x_{i}$ and $x_{j}$ are independent, all $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})=0$ ;

3. if $\rho_{ij,\mathcal{K}^{\prime}}<0$ , at least one $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})\in[-1,0)$ .

Therefore, combining the above analytical results with Eqs. (12) and (11), we also conclude that the weaker adversary can cause higher privacy leakage when $x_{i}$ and $x_{j}$ are positively correlated. The prior knowledge has no impact on privacy leakage when $x_{i}$ and $x_{j}$ are independent, which also corresponds to Theorem 1. However, we cannot derive a deterministic relationship between the privacy leakage and prior knowledge if the tuples are negatively correlated. In this situation, we have to use Eq. (11) to determine their relationship.

V-C Privacy Leakage Formulation

In this subsection, we introduce how to compute the node value, which represents the privacy leakage caused by the adversary with prior knowledge in the WHG. As mentioned in Subsection V-A, the computation relies on two steps. One step computes the node values in the first layer; the other is a chain rule. We first present how to compute the node values in the first layer.

Theorem 4.

For a database $\mathbf{x}$ which has $n$ tuples and follows the joint distribution $\theta$ , the values of the nodes in the first layer are $l_{\mathcal{A}_{i,[n]\backslash\{i\}}}={LS_{i}(f)}/{\lambda},i\in[n]$ , where $LS_{i}(f)$ is the local sensitivity, and $\lambda$ is the parameter in the Laplace mechanism.

Proof.

Based on the definition of PDP, $\forall i\in[n]$ , we have

[TABLE]

∎
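Theorem 4 can be illustrated numerically: when every tuple except $x_{i}$ is known, the two candidate outputs are Laplace densities that differ only by a shift, so the log-ratio cannot exceed the shift divided by $\lambda$ . The sketch below is an illustration with assumed names and a finite grid in place of the supremum over $r$ ; `delta` stands for the largest gap $|x_{i}-x_{i}^{\prime}|$ , which is the local sensitivity for a sum query.

```python
import math

def laplace_log_pdf(z, lam):
    """Log-density of Lap(lam) at z."""
    return -abs(z) / lam - math.log(2.0 * lam)

def strongest_adversary_leakage(delta, lam, known_sum=0.0, step=0.01, margin=20.0):
    """Approximate sup_r |log p(r | x_i, x_K) - log p(r | x_i', x_K)| with K = [n] minus {i}.

    known_sum is the (fixed) contribution of all known tuples plus x_i to the sum
    query; the alternative value x_i' shifts the query result by delta.
    """
    lo = known_sum - margin
    hi = known_sum + delta + margin
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return max(
        abs(laplace_log_pdf(r - known_sum, lam)
            - laplace_log_pdf(r - known_sum - delta, lam))
        for r in grid
    )

leak_a = strongest_adversary_leakage(delta=1.0, lam=2.0)
leak_b = strongest_adversary_leakage(delta=1.0, lam=2.0, known_sum=7.3)
```

Both calls return $LS_{i}(f)/\lambda=0.5$ regardless of the values of the known tuples, which is why the joint distribution plays no role for the strongest adversary.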

Theorem 4 demonstrates that the joint distribution, which represents the correlation, has no impact on privacy leakage when the adversary has the strongest prior knowledge. On the basis of Theorem 4, we deduce the values of the nodes in the second layer through Theorem 2. Similarly, we can obtain the values of the nodes in layer $k+1$ by layer $k$ according to Theorem 2. Finally, we can obtain all nodes’ values. In particular, the following theorem presents a solution to computing the privacy leakage of a certain node in WHG.

Theorem 5.

(Chain Rule) For a node $(i,\mathcal{K})$ in the layer $k+1$ , there exists a path from the bottom node $[n]$ to the node $(i,\mathcal{K}),\mathcal{K}=[n]\backslash\{i,j_{1},\cdots,j_{k}\}$ . From layer 1 to layer $k+1$ , $(i,[n]\backslash\{i\})$ , $(i,[n]\backslash\{i,j_{1}\}),\cdots,(i,[n]\backslash\{i,j_{1},\cdots,j_{k}\})$ are all the nodes in the path. Then, the privacy leakage of the node $(i,\mathcal{K})$ corresponding to this path is

[TABLE]

where $|\ldots||$ denotes $k$ -fold absolute value operation, and $k$ is the length of the path.

Proof.

The result can be obtained by using Theorem 2. In a path from the bottom to the top, there are $k+1$ nodes and $k$ edges, each of which consists of two nodes in the adjacent layers. The chain rule can be obtained by applying Theorem 2 on all $k$ edges in a path. ∎

Theorem 5 shows the computational process for a path from the bottom node to the given node. If there exist multiple paths, we should compute the value of each path by using Eq. (14), and then choose the minimum as the node’s value. Three factors can impact privacy leakage. The first is the length of the path in Eq. (14), which represents the amount of prior knowledge. To highlight the other two factors, according to Eq. (12), we rewrite Eq. (14) as follows

[TABLE]

According to Eq. (15), we can see that PDP is superior to group differential privacy in terms of calculating an accurate privacy leakage for the adversary with specific prior knowledge. Particularly, according to Theorem 3, we have $IR_{ij,\mathcal{K}}\in[-1,1],\forall\mathcal{K}\subseteq[n]\backslash\{i\}$ . By setting all $IR_{ij,\mathcal{K}}=1$ in Eq. (15), we have

[TABLE]

Eqs. (16) and (17) show that the privacy leakage under PDP is more accurate than that under group differential privacy, which is simply derived from the sequential composition theorem. In addition, when the edge values in the WHG are all greater than zero or all less than zero, we can deduce some special results in the following corollary.

Corollary 1.

1. When all $IC_{ij,\mathcal{K}}=1$ , the PDP degrades to group differential privacy.

2. When all $IC_{ij,\mathcal{K}}>0$ , the maximal privacy leakage is obtained at the top layer, i.e., the weakest adversary causes the highest privacy leakage.

3. When all $IC_{ij,\mathcal{K}}<0$ , the maximal privacy leakage is obtained in the bottom layer, i.e., the strongest adversary causes the highest privacy leakage.

Case 1) can be derived from Eq. (16) directly. Additionally, it is easy to prove Cases 2) and 3) by using Eq. (15) and summing the nodes’ values in layer order.

Based on Corollary 1, we can easily compute the privacy leakage for these special cases. For example, in Case 2), the privacy leakage increases with the layer number. However, in general cases, when WHG has both positive and negative edges, we have to traverse the whole WHG to compute the privacy leakage.

V-D Algorithms for Computing Privacy Leakage

For a given database $\mathbf{x}$ with $n$ tuples, the number of edges is no fewer than the number of nodes $n2^{n-1}$ . Therefore, it is intractable to traverse the WHG when the number of tuples is large. We first use the full-space-searching algorithm to compute the least upper bound of privacy leakage and then propose a heuristic fast-searching algorithm to reduce the calculation time by limiting the searching space.

In the full-space-searching algorithm, we first initialize the value of the nodes in the first layer by Theorem 4 (line 1). Then, we generate nodes in layer $k+1$ by using the chain rule (Theorem 5, Eq. (14)) based on the edge values (Eq. (11)) between layers $k$ and $k+1$ (lines 3-10). Note that for a given node in layer $k+1$ , there may exist multiple paths from the nodes in layer $k$ to the given node. As mentioned previously, we need to retain the minimal value computed from multiple paths as the node value (line 8). Finally, we obtain the maximal privacy leakage of all nodes in the WHG (line 11).

Proposition 1.

The time complexity of the full-space-searching algorithm is $O(n^{4}2^{n-1})$ .

Proof.

There are two steps to obtain the value of the nodes in layer $k+1$ from the value of the nodes in layer $k$ . The first step is to obtain new nodes in layer $k+1$ by removing one tuple from the prior knowledge of the nodes in layer $k$ . The second step is to sort and remove the repeated nodes with the same attack tuple and prior knowledge in layer $k+1$ . There are $n\binom{n-1}{n-k}$ nodes in layer $k$ , so the number of nodes after the first step would be $n\binom{n-1}{n-k}(n-k)$ , denoted as $t_{k}$ . The time complexity of the second step is $\Theta(t_{k}\log t_{k})$ . We note that $\sum_{k=1}^{n-1}t_{k}\leq n^{2}2^{n-1}$ . Summing from $k=1$ to $n-1$ , the time complexity of the algorithm is

[TABLE]

∎

As we can see, considerable time is required to generate new nodes and to remove repeated nodes in the full-space-searching algorithm. In addition, the time complexity grows exponentially with the number of tuples $n$ . To reduce the computational time complexity, a fast-searching algorithm is proposed that searches a subspace of the original full space at a small cost in accuracy. Specifically, we only use the $n$ largest nodes in layer $k$ to generate layer $k+1$ .
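The two search strategies can be sketched in a few lines. This is not the paper's algorithm listing but a rough illustration under explicit assumptions: the Theorem 2 update is assumed to take the form $l^{\prime}=|l+IC|$ (consistent with the nested absolute values described for the chain rule), the edge values are supplied by a hypothetical callable `edge_value`, and the fast variant is obtained by passing `keep_top=n`.

```python
def search_privacy_leakage(n, first_layer_value, edge_value, keep_top=None):
    """Layer-by-layer search over the WHG for tuples 0..n-1.

    first_layer_value(i)      -> LS_i(f) / lambda (Theorem 4)
    edge_value(i, j, K_prime) -> IC for dropping tuple j (hypothetical callable)
    keep_top                  -> if set, expand only that many largest nodes per layer
    ASSUMPTION: the Theorem 2 update has the form l' = |l + IC|.
    """
    everyone = frozenset(range(n))
    layer = {(i, everyone - {i}): first_layer_value(i) for i in range(n)}
    best = max(layer.values())
    for _ in range(n - 1):
        expand = sorted(layer.items(), key=lambda kv: kv[1], reverse=True)
        if keep_top is not None:
            expand = expand[:keep_top]  # fast search: keep only the largest nodes
        next_layer = {}
        for (i, K), value in expand:
            for j in K:
                K_prime = K - {j}
                candidate = abs(value + edge_value(i, j, K_prime))
                key = (i, K_prime)
                # Several paths can reach the same node; keep the minimal value.
                next_layer[key] = min(candidate, next_layer.get(key, candidate))
        layer = next_layer
        best = max(best, max(layer.values()))
    return best

# Illustrative edge values: constant across the graph (not derived from data).
full_positive = search_privacy_leakage(3, lambda i: 1.0, lambda i, j, K: 1.0)
fast_positive = search_privacy_leakage(3, lambda i: 1.0, lambda i, j, K: 1.0, keep_top=3)
full_negative = search_privacy_leakage(3, lambda i: 1.0, lambda i, j, K: -0.25)
```

With all edge values equal to $LS/\lambda=1$ , the maximum is reached at the top layer and equals $n\cdot LS/\lambda=3$ , the group differential privacy bound. With all edge values negative, the maximum stays at the bottom layer, as in Corollary 1. The dictionary lookup replaces the sort-and-deduplicate step, so the sketch does not reproduce the stated complexity bounds exactly.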

Proposition 2.

The time complexity of the fast-searching algorithm is $O(n^{4})$ .

Proof.

According to the fast-searching algorithm, at most $n$ nodes in layer $k$ are expanded. After the subtraction operation, there are at most $n(n-k)$ nodes, denoted as $t_{k}$ . The rest of this proof is the same as that of Proposition 1. ∎

VI Gaussian Model-based Analysis for Continuous Data

In this section, we further discuss the impacts of correlation and prior knowledge for the continuous-valued data. In Subsection VI-A, we first explain why the WHG is not suitable for the continuous-valued database. Then, we introduce some properties of the multivariate Gaussian distribution. In Subsection VI-B, we identify an explicit formula to compute the privacy leakage of the multivariate Gaussian model.

VI-A Multivariate Gaussian Model

The continuous case must be treated separately from the discrete case because the computation method used in Section V is no longer applicable. In Section V, we investigate how correlation and prior knowledge can impact privacy leakage. Based on the proposed WHG, we deduce the chain rule to compute privacy leakage. One crucial step is to compute the edge value in the WHG. For discrete-valued data, this can be done by using Eqs. (9) and (11), which requires enumerating all the different pairs of values $x_{i,m}$ and $x_{i,n}$ in the domain. Obviously, this is impossible for continuous-valued tuples with an unbounded domain. To deal with this issue, we should specify the joint distribution. Therefore, although the analytical results in Section V still hold for continuous-valued data, the edge values cannot be computed directly as they are for discrete data.

For continuous data, a common solution is to accurately identify the global sensitivity by bounding the range (i.e., domain) of the tuples [12, 32]. Otherwise, the privacy leakage would be overestimated, and the unboundedness would destroy the utility of privacy-preserving results. Therefore, by bounding the range of $x_{i}$ as $|x_{i}-x_{i}^{\prime}|\leq M$ , Eq. (2) becomes

[TABLE]

In contrast to the summation used to compute probabilities in Section V, we use integration to compute probabilities for continuous data. That is

[TABLE]

Here, we choose the multivariate Gaussian distribution (denoted as MGD) to describe the database $\mathbf{x}$ since many continuous data sets can be modeled well by the MGD. For a database $\mathbf{x}$ with $n$ tuples, $\mathbf{x}=\{x_{1},\cdots,x_{n}\}$ , $\boldsymbol{\mu}$ is the expectation vector, and $\boldsymbol{\Sigma}=(\rho_{ij})_{n\times n}$ is the covariance matrix. If $\rho_{ij}>0$ , $x_{i}$ and $x_{j}$ are positively correlated. If $\rho_{ij}<0$ , $x_{i}$ and $x_{j}$ are negatively correlated. If $\rho_{ij}=0$ , $x_{i}$ and $x_{j}$ are independent. $\mathbf{x}$ follows the MGD if the density function of $\mathbf{x}$ is

[TABLE]

which is denoted as $\mathbf{x}\sim N_{n}(\boldsymbol{\mu},\mathbf{\Sigma})$ . If $\mathbf{x}$ is partitioned into blocks $\{\mathbf{x}_{1},\mathbf{x}_{2}\}$ , then $\boldsymbol{\mu},\mathbf{\Sigma}$ can be written as

[TABLE]

The following lemma shows the properties of the MGD.

Lemma 1.

[47] Suppose the $n$ -dimensional variable $\mathbf{x}=\{\mathbf{x}_{1},\mathbf{x}_{2}\}$ follows the multivariate Gaussian distribution $N_{n}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ , with $\mathbf{x}_{1}\in\mathbb{R}^{p},\mathbf{x}_{2}\in\mathbb{R}^{n-p}$ .

1. The distribution of $\mathbf{x}_{1}$ given $\mathbf{x}_{2}$ follows the $p$ -dimensional Gaussian distribution $N_{p}(\boldsymbol{\mu}_{1|2},\mathbf{\Sigma}_{1|2}),\mathbf{\Sigma}_{22}\succ 0$ , with

[TABLE]

2. For any nonzero vector $\mathbf{a}\in\mathbb{R}^{n}$ ,

[TABLE]
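For intuition, the two Gaussian properties can be written out for the simplest case. The sketch below is an illustration, not a restatement of the lemma's displayed formulas: it uses the standard conditional-Gaussian result specialized to a single conditioning coordinate (so that $\mathbf{\Sigma}_{22}^{-1}$ is a scalar division) and the standard mean and variance of a linear combination. The function names are assumptions.

```python
def condition_on_one(mu, cov, i, x_i):
    """Distribution of the other coordinates of N(mu, cov) given coordinate i equals x_i.

    Scalar-conditioning special case of the conditional Gaussian formulas:
      mean_j = mu_j + cov[j][i] / cov[i][i] * (x_i - mu_i)
      cov_jk = cov[j][k] - cov[j][i] * cov[i][k] / cov[i][i]
    Returns (conditional mean vector, conditional covariance matrix).
    """
    rest = [j for j in range(len(mu)) if j != i]
    mean = [mu[j] + cov[j][i] / cov[i][i] * (x_i - mu[i]) for j in rest]
    cond_cov = [
        [cov[j][k] - cov[j][i] * cov[i][k] / cov[i][i] for k in rest]
        for j in rest
    ]
    return mean, cond_cov

def linear_form(a, mu, cov):
    """Mean and variance of a^T x for x ~ N(mu, cov)."""
    mean = sum(ai * mi for ai, mi in zip(a, mu))
    var = sum(a[j] * cov[j][k] * a[k] for j in range(len(a)) for k in range(len(a)))
    return mean, var

mu = [0.0, 0.0]
cov = [[1.0, 0.5], [0.5, 1.0]]
cond_mean, cond_cov = condition_on_one(mu, cov, 0, 2.0)  # x_2 given x_1 = 2
sum_mean, sum_var = linear_form([1.0, 1.0], mu, cov)     # distribution of x_1 + x_2
```

With unit variances and covariance 0.5, observing $x_{1}=2$ shifts the mean of $x_{2}$ to 1 and reduces its variance to 0.75, while the unconditioned sum $x_{1}+x_{2}$ has variance 3.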

VI-B Privacy Leakage Computation

As we can see from Eq. (19), the key to computing privacy leakage is computing the conditional probability when $\mathbf{x}$ follows the MGD. Let $\mathbf{x}_{2}=\{x_{i},\mathbf{x}_{\mathcal{K}}\}$ , $\mathbf{x}_{1}=\mathbf{x}\backslash\{x_{i},\mathbf{x}_{\mathcal{K}}\}=\mathbf{x}_{\mathcal{U}}$ , we have

[TABLE]

in which $\mathrm{Pr}(\mathbf{x}_{1}|\mathbf{x}_{2})$ can be obtained by Lemma 1, and $\mathrm{Pr}(r|\mathbf{x})$ can be calculated according to the Laplace mechanism. In addition, according to Eq. (19), when the attack object $x_{i}$ is replaced with $x_{i}^{\prime}$ , we have to compute

[TABLE]

where $\mathbf{x}_{2}^{\prime}=\{x_{i}^{\prime},\mathbf{x}_{\mathcal{K}}\}$ , $\mathbf{x}^{\prime}=\{x_{i}^{\prime},\mathbf{x}_{\mathcal{K}},\mathbf{x}_{\mathcal{U}}\}$ .

However, it is difficult to directly calculate the above probability formula under the assumption that $|x_{i}-x_{i}^{\prime}|\leq M$ for all possible pairs of $x_{i}$ , and $x_{i}^{\prime}$ . Fortunately, the next lemma shows that we can combine the $\mathrm{Pr}(\mathbf{x}_{1}|\mathbf{x}_{2}^{\prime})\mathrm{Pr}(r|\mathbf{x}^{\prime})$ into a uniform expression, which is useful for computation.

Lemma 2.

[32]

Let $G(x;b)$ be a function on $x\in\mathbb{R}$ , with parameter $b>0$ ,

[TABLE]

where $\Phi(x)$ is the cumulative distribution function of a standard Gaussian distribution. Then, $\frac{\partial\log G(x;b)}{\partial x}$ is monotonically decreasing with respect to $x$ and

[TABLE]

Based on Lemmas 1 and 2, we propose the next formula to compute the privacy leakage of an adversary $(i,\mathcal{K})$ directly.

Theorem 6.

Suppose $\mathbf{x}$ follows the multivariate Gaussian distribution $N_{n}(\boldsymbol{\mu,\Sigma})$ , and $f(\mathbf{x})=\sum_{i\in[n]}x_{i}$ is a general sum query on $\mathbf{x}$ . Let $\mathcal{M}$ be the Laplace mechanism with the perturbed output $r=\mathcal{M}(\mathbf{x})=f(\mathbf{x})+z$ , where $z\sim Lap(\lambda)$ . Given an adversary $\mathcal{A}_{i,\mathcal{K}}$ with $\theta\in\Theta$ and $|x_{i}-x_{i}^{\prime}|\leq M$ , the privacy leakage can be represented as

[TABLE]

where $\mu_{0i}$ is the coefficient of $x_{i}$ in the expansion of $\mu_{0}$ in Eq. (37).

Proof.

See Appendix C. ∎

Theorem 6 shows the impacts of the correlation and prior knowledge on privacy leakage for continuous data under the multivariate Gaussian distribution. For the given $M$ and $\lambda$ , we can see that the privacy leakage is determined by $\mu_{0i}$ , which is related to the covariance matrix $\mathbf{\Sigma}$ and prior knowledge $\mathcal{K}$ . $\mu_{0i}$ is the coefficient of $x_{i}$ in $\mu_{0}$ , which can be obtained from Eqs. (20) and (22). The details can be found in the proof of Theorem 6 (Appendix C). In the analysis of discrete-valued data without a concrete expression of the data distribution, the chain rule is proposed to compute the privacy leakage. However, for continuous data, based on the assumption of the MGD, we can compute the privacy leakage of an adversary directly without considering every two adjacent adversaries. For a special case that considers the weakest adversary, the privacy leakage has the following explicit form.

Corollary 2.

For an $n$ -dimensional Gaussian distribution $N_{n}(\boldsymbol{\mu,\Sigma}),\mathbf{\Sigma}=(\rho_{ij})_{n\times n}$ , the privacy leakage of the weakest adversary is $l_{\mathcal{A}_{i,\phi}}=|1+\rho_{ii}^{-1}\sum_{j\neq i}\rho_{ij}|M/\lambda,i\in[n]$ .

Proof.

In this case, $\mathbf{x}_{2}=\{x_{i}\}$ and $\mathbf{x}_{1}=\mathbf{x}_{-i}$. According to Lemma 1, $\boldsymbol{\mu}_{1|2}=(\mu_{1},\cdots,\mu_{i-1},\mu_{i+1},\cdots,\mu_{n})^{\top}+(\rho_{i1},\cdots,\rho_{i,i-1},\rho_{i,i+1},\cdots,\rho_{in})^{\top}\cdot\rho_{ii}^{-1}\cdot(x_{i}-\mu_{i}).$ By the definition of $\mu_{0}$, we have

[TABLE]

Here, $\mu_{0i}$ , the coefficient of $x_{i}$ in $\mu_{0}$ , is $\rho_{ii}^{-1}\sum_{j\neq i}\rho_{ij}$ . Then, we complete the proof by applying Theorem 6. ∎
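The computation of $\mu_{0i}$ can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes the leakage takes the form $|1+\mu_{0i}|M/\lambda$, which is what Corollary 2 and the term $(1+\mu_{0i})x_{i}$ in Appendix C suggest, and it reads $\mu_{0i}$ off the row vector $\mathbf{1}^{\top}\mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}$ from Lemma 1. The function names and the 0-based indexing are assumptions made for this sketch.

```python
import numpy as np

def mu0_coefficient(sigma, i, known):
    """Coefficient of x_i in mu_0 = 1^T mu_{1|2} (hypothetical helper)."""
    n = sigma.shape[0]
    cond = [i] + list(known)                      # x_2 = {x_i, x_K}
    unknown = [k for k in range(n) if k not in cond]  # x_1 = x_U
    if not unknown:
        return 0.0                                # no unknown tuples: mu_0 = 0
    sigma_12 = sigma[np.ix_(unknown, cond)]
    sigma_22 = sigma[np.ix_(cond, cond)]
    # Row vector 1^T Sigma_12 Sigma_22^{-1}; its first entry multiplies x_i.
    weights = np.linalg.solve(sigma_22.T, sigma_12.sum(axis=0))
    return float(weights[0])

def leakage(sigma, i, known, M, lam):
    """Assumed closed form |1 + mu_0i| * M / lambda."""
    return abs(1.0 + mu0_coefficient(sigma, i, known)) * M / lam

# Weakest adversary on a 3-tuple database with all covariances equal to 1.
sigma = np.array([[2.0, 1.0, 1.0],
                  [1.0, 2.0, 1.0],
                  [1.0, 1.0, 2.0]])
weakest = leakage(sigma, 0, [], M=1.0, lam=1.0)
```

For the weakest adversary the helper reduces to $\rho_{ii}^{-1}\sum_{j\neq i}\rho_{ij}$, so `weakest` agrees with Corollary 2 for this matrix.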

The proof of Corollary 2 demonstrates how to compute the coefficient $\mu_{0i}$ in Eq. (25) for the special case of the weakest adversary. As a special case of Theorem 6, Corollary 2 shows that the privacy leakage of the weakest adversary has an explicit relationship to the data correlation. That is, the privacy leakage of the weakest adversary depends on the sum of all covariances between $x_{i}$ and the other tuples, which represents the data correlations. The next example shows the impacts of the correlations and prior knowledge for tuples that follow a two-dimensional Gaussian distribution.

Example 5. Consider a continuous-valued database $\mathbf{x}=\{x_{1},x_{2}\}$; the mean vector and covariance matrix of $\mathbf{x}$ are $\boldsymbol{\mu}=[0,0]$, $\boldsymbol{\Sigma}=\left(\begin{array}[]{cc}1&\rho_{12}\\ \rho_{12}&1\\ \end{array}\right)$, respectively. $\rho_{12}\in[-1,1]$ is the correlation coefficient of $x_{1}$ and $x_{2}$. From Corollary 2, $l_{\mathcal{A}_{1,\emptyset}}=|1+\rho_{12}|M/\lambda$. From the definition of PDP, we have $l_{\mathcal{A}_{1,\{2\}}}=M/\lambda$. If $\rho_{12}>0$, then $l_{\mathcal{A}_{1,\emptyset}}>l_{\mathcal{A}_{1,\{2\}}}$. This means that when the correlation is positive, the weakest adversary obtains more privacy gain than the strongest adversary. If $\rho_{12}<0$, then $l_{\mathcal{A}_{1,\emptyset}}<l_{\mathcal{A}_{1,\{2\}}}$. This means that when the correlation is negative, the strongest adversary obtains more privacy gain. These results are consistent with Examples 2 and 3, which concern discrete-valued data.
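As a numeric instance of Example 5, the two expressions can be evaluated for concrete values. The choices $M=1$, $\lambda=1$ and $\rho_{12}=\pm 0.5$ are assumptions picked only for illustration.

```latex
\begin{align*}
\rho_{12}=0.5:\quad  & l_{\mathcal{A}_{1,\emptyset}} = |1+0.5|\cdot\tfrac{1}{1} = 1.5
   \;>\; l_{\mathcal{A}_{1,\{2\}}} = \tfrac{1}{1} = 1,\\
\rho_{12}=-0.5:\quad & l_{\mathcal{A}_{1,\emptyset}} = |1-0.5|\cdot\tfrac{1}{1} = 0.5
   \;<\; l_{\mathcal{A}_{1,\{2\}}} = 1.
\end{align*}
```

In both cases the leakage of the strongest adversary stays at $M/\lambda$, while the leakage of the weakest adversary moves with the sign and magnitude of $\rho_{12}$.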

VII Numerical Simulations

In this section, we conduct extensive experiments to demonstrate the impact of prior knowledge and data correlations on privacy leakage, and to validate the effectiveness and efficiency of our proposed algorithms for computing privacy leakage.

VII-A Simulations Setting

We synthesized a database with 15 tuples (as described in Subsection III-A, a tuple refers to an attribute in the database), in which the average Pearson correlation coefficient changes from $-0.8$ to $0.8$. For the discrete-valued database, we generated a corresponding WHG by assigning beta-distributed values to its edges. For the continuous-valued database, we generated the covariance matrix with covariance $Cov(x_{i},x_{j})=1,(i\neq j)$ for a positive correlation, and $Cov(x_{i},x_{j})=-1,(i\neq j)$ for a negative correlation. We adjusted the principal diagonal element $Cov(x_{i},x_{i})$ to control the correlation coefficient.
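One way to read the covariance construction above is sketched below. This is an assumption-laden illustration rather than the authors' generator: it uses a single common diagonal value $1/|\rho|$, so that every pairwise Pearson coefficient equals $\rho$, and the function names are hypothetical. The eigenvalue check is an added safeguard, since not every diagonal value yields a positive semidefinite matrix.

```python
import numpy as np

def equicorrelated_cov(n, rho):
    """Covariance with off-diagonal +1 or -1 and diagonal 1/|rho| (assumed scheme)."""
    if rho == 0 or abs(rho) > 1:
        raise ValueError("rho must be nonzero and lie in [-1, 1]")
    off = 1.0 if rho > 0 else -1.0
    sigma = np.full((n, n), off)
    np.fill_diagonal(sigma, 1.0 / abs(rho))   # Pearson coefficient = off / diag
    return sigma

def is_valid_covariance(sigma, tol=1e-9):
    """True if the symmetric matrix has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh(sigma).min() >= -tol)

sigma_pos = equicorrelated_cov(15, 0.5)
pearson = sigma_pos[0, 1] / np.sqrt(sigma_pos[0, 0] * sigma_pos[1, 1])
```

Here `pearson` recovers the requested coefficient, and `is_valid_covariance(sigma_pos)` confirms that the matrix can serve as a Gaussian covariance.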

In our experiments, we considered an adversary who can infer information from a Laplace-mechanism-based privacy-preserving sum query on the database. The noise scale of the Laplace mechanism was set as $\lambda=1$ , and the domain size of all tuples was set as 1. In all simulations, the prior knowledge was measured by the number of tuples compromised by the adversary, ranging from $14$ to [math]. Then, the privacy leakage the adversary caused was calculated according to our analytical results (Theorems 2, 5, 6) in Sections V and VI.
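The perturbed sum query used in these experiments can be sketched as follows. This is a minimal illustration using only the standard library; the function names are assumptions, and the noise is drawn as the difference of two exponential variables, which is one standard way to sample $Lap(\lambda)$.

```python
import random

def laplace_noise(lam, rng):
    """Sample Lap(lam): difference of two i.i.d. exponentials with mean lam."""
    return rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)

def noisy_sum(x, lam, rng):
    """Laplace-mechanism output r = f(x) + z for the sum query f."""
    return sum(x) + laplace_noise(lam, rng)

rng = random.Random(0)                 # fixed seed, for reproducibility only
database = [0.2, 0.7, 0.5]             # hypothetical tuple values in [0, 1]
r = noisy_sum(database, lam=1.0, rng=rng)
```

With $\lambda=1$ as in the setting above, repeated calls to `noisy_sum` scatter around the true sum with an expected absolute deviation of $\lambda$.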

VII-B Simulation Results

For simplicity, let averCorr denote the average value of the edges in the WHG, and averCoeff denote the average value of the correlation coefficient in the MGD. averCorr and averCoeff represent the correlation degree for discrete-valued and continuous-valued data, respectively. According to the structure of the WHG, the layer number in a WHG represents the number of unknown tuples for an adversary. Therefore, a smaller layer number means a stronger adversary with more prior knowledge and vice versa. Fig. 4(a) and Fig. 4(b) show the privacy leakage of discrete-valued data. Fig. 4(c) and Fig. 4(d) show the privacy leakage of continuous-valued data.

VII-B1 Privacy Leakage vs Correlation

This subsection investigates the impacts of data correlations on privacy leakage when the prior knowledge is fixed.

Figs. 4(a)-4(d) show that the privacy leakage remains unchanged with averCorr and averCoeff when the adversary has the strongest prior knowledge (layer number=1, i.e., fewest unknown tuples). This is because the uncertainty only occurs from the attack object and no information gain can be obtained from the correlations, which corresponds to our analysis in Theorem 4.

Fig. 4(a) shows that the privacy leakage generally increases with averCorr when averCorr is positive, for discrete-valued data. The main reason is that with the increase in positive correlations, tuples are more likely to show similar trends and the difference of the sum aggregation becomes much larger, from which the adversary could obtain more information gain about unknown tuples based on his prior knowledge (Theorem 3). Fig. 4(b) shows similar results when averCorr is negative. Fig. 4(c) and Fig. 4(d) show similar results for continuous data.

VII-B2 Privacy Leakage vs Prior knowledge

This subsection investigates the impacts of prior knowledge on privacy leakage when the correlation is fixed.

Figs. 4(a) and 4(c) show that, when the correlation is positive, the privacy leakage increases with the layer number for both discrete and continuous data. That is, the privacy leakage decreases as the prior knowledge grows. This is because, given the positive correlation, more unknown tuples can cause a larger aggregation difference (Theorem 5) and less uncertainty for adversaries from the privacy-preserving results. In particular, the weakest adversary with the least prior knowledge obtains the largest privacy gain, thus leading to the highest privacy leakage (Corollary 1).

Fig. 4(b) shows that there are no monotone trends between the privacy leakage and layer number when the data correlation is negative for discrete-valued data because tuples with mutually negative correlations will cancel each other out and show no general trend in the aggregation result, which makes it difficult for any adversary to achieve privacy gain. This corresponds to our analysis that the privacy leakage computed by the chain rule (Theorem 5) does not decrease with $IC_{ij,\mathcal{K}^{\prime}}$ when $IC_{ij,\mathcal{K}^{\prime}}<0$ . Additionally, as we can see, the highest privacy leakage is achieved when the layer number is $1$ (the strongest prior knowledge), which is consistent with Corollary 1.

Fig. 4(d) shows that, for continuous data with negative correlation, privacy leakage increases with the amount of prior knowledge (i.e., decreases with the layer number) because the tuples with a mutually negative correlation will cancel each other out in the aggregation, which reduces the uncertainty of aggregation and makes it difficult for the adversary to infer an individual tuple. Specifically, based on the multivariate Gaussian model, more unknown tuples will lead to a stronger “canceling” effect and less privacy leakage for a weaker adversary.

VII-B3 Accuracy and Time Complexity

This subsection investigates the accuracy and efficiency of the fast-searching algorithm compared to the full-space-searching algorithm. In the simulation, we set the tuple number ranging from $1$ to $15$ , and the average correlation from $0.2$ to $0.8$ . Then, we computed the corresponding maximal privacy leakage in each case. Each simulation was run 30 times; both the average privacy leakage and its variance were reported.

Figs. 5(a), 5(b), and 5(c) compare the privacy leakage when averCorr equals 0.2, 0.5, and 0.8, respectively. The privacy leakage computed with the fast-searching algorithm was generally smaller than that of the full-space-searching algorithm, and thus overestimated the privacy, since the search space of the fast-searching algorithm is a subset of that of the full-space-searching algorithm. However, when the average correlation was stronger (e.g., averCorr=0.5, 0.8), the privacy leakage computed with the fast-searching algorithm was very close to the accurate privacy leakage computed with the full-space-searching algorithm. Moreover, the fast-searching algorithm was far more efficient than the full-space-searching algorithm. In particular, Fig. 5(d) compares the average computational time of both algorithms. As we can see, the fast-searching algorithm, with a time complexity of $O(n^{4})$, required much less computational time than the full-space-searching algorithm, with a time complexity of $O(n^{4}2^{n-1})$.
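To give a sense of scale for the two stated complexities, the ratio of the leading terms can be worked out directly. This ignores constant factors and lower-order terms, so it is only an order-of-magnitude indication and not a measured speed-up.

```latex
\[
\frac{n^{4}\,2^{\,n-1}}{n^{4}} = 2^{\,n-1},
\qquad\text{so for } n=15:\quad 2^{14} = 16384 .
\]
```

At the largest database size used in the simulations, the full-space search therefore examines on the order of $10^{4}$ times more candidates than the fast search.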

VIII Conclusion

In this paper, we present a unified analysis to investigate the impacts of general (positive, negative, and hybrid) data correlations and arbitrary prior knowledge possessed by adversaries on privacy leakage. For continuous data, we obtain a closed-form expression of privacy leakage as a function of general data correlation and prior knowledge by using multivariate Gaussian distributions. For discrete data, a chain rule is derived to represent the privacy leakage, by using a WHG that can model the adversaries with arbitrary prior knowledge. All our analytical results are obtained by rigorous mathematical proofs and hold for general linear queries. Numerical simulations validate our theoretical analysis. Future work will extend our analysis to nonlinear queries.

Appendix A Proof of Theorem 2

Proof.

For two adversaries $\mathcal{A}_{i,\mathcal{K}}$ and its ancestor $\mathcal{A}_{i,\mathcal{K}^{\prime}}$ , $\mathcal{K}^{\prime}=\mathcal{K}\backslash\{j\}$ , by the law of total probability, we have

[TABLE]

Let $l_{\mathcal{A}_{i,\mathcal{K}}}$ denote the value of the node $(i,\mathcal{K})$ . By the definition of PDP, $\sup_{x_{i},x_{i}^{\prime}}\log\frac{\mathrm{Pr}(r|x_{i},\mathbf{x}_{\mathcal{K}})}{\mathrm{Pr}(r|x_{i}^{\prime},\mathbf{x}_{\mathcal{K}})}\in[-l_{\mathcal{A}_{i,\mathcal{K}}},l_{\mathcal{A}_{i,\mathcal{K}}}]$ . Therefore

[TABLE]

Eq. (26) uses the Laplace mechanism and Eq. (27) is the definition of $IC_{ij,\mathcal{K}^{\prime}}$ . ∎

Appendix B Proof of Theorem 3

To prove Theorem 3, we propose the next lemma, which is used to express the correlation by its conditional distribution.

Lemma 3.

Consider a database $\mathbf{x}$ with two tuples $x_{1}=\{x_{1,1},x_{1,2}\}$ and $x_{2}=\{x_{2,1},x_{2,2}\}$. Let $y_{i}=\mathbb{E}(x_{2}|x_{1}=x_{1,i}),i=1,2,$ be the conditional expectation of $x_{2}$ given $x_{1}=x_{1,i}$. Let the joint distribution be $p_{ij}=\Pr(x_{1}=x_{1,i},x_{2}=x_{2,j}),p_{i\cdot}=p_{i1}+p_{i2},i,j\in\{1,2\}$. Then, we have the following equivalent conditions for the Pearson correlation coefficient of $x_{1}$ and $x_{2}$, denoted as $\rho_{12}$.

[TABLE]

Proof.

Based on the definition of the Pearson correlation coefficient, the plus-minus sign of $\rho_{12}$ is determined by its covariance $Cov(x_{1},x_{2})$ . Using the properties of conditional expectation, we have

[TABLE]

Therefore, the covariance $Cov(x_{1},x_{2})$ can be written as

[TABLE]

The last equation uses the fact that $p_{1\cdot}+p_{2\cdot}=1$ . Note that the plus-minus sign of $\rho_{12}$ is equivalent to the sign of $Cov(x_{1},x_{2})$ , then we prove the left half of Eq. (28) by setting $x_{1,2}>x_{1,1}$ as usual.

Next, we prove the right half. Based on the definition of the conditional expectation of $y_{i}$ , we have

[TABLE]

Eq. (B) uses the facts that $\frac{p_{11}}{p_{1\cdot}}+\frac{p_{12}}{p_{1\cdot}}=1$ , and $\frac{p_{21}}{p_{2\cdot}}+\frac{p_{22}}{p_{2\cdot}}=1$ . Set $x_{2,2}>x_{2,1}$ , then we complete the proof of the right half. ∎
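The two halves of this argument can be checked numerically for a binary pair. The sketch below relies on two standard identities that follow from the steps above, $Cov(x_{1},x_{2})=p_{1\cdot}p_{2\cdot}(x_{1,2}-x_{1,1})(y_{2}-y_{1})$ and $y_{2}-y_{1}=(x_{2,2}-x_{2,1})(p_{22}/p_{2\cdot}-p_{12}/p_{1\cdot})$; their exact form in the paper's displayed equations is not reproduced here, and the function name and the example distribution are assumptions.

```python
def binary_pair_stats(p, x1, x2):
    """p[i][j] = Pr(x_1 = x1[i], x_2 = x2[j]) for i, j in {0, 1}."""
    row = [p[0][0] + p[0][1], p[1][0] + p[1][1]]              # marginals p_{i.}
    # Conditional expectations y_i = E(x_2 | x_1 = x1[i]).
    y = [(p[i][0] * x2[0] + p[i][1] * x2[1]) / row[i] for i in range(2)]
    e1 = sum(row[i] * x1[i] for i in range(2))
    e2 = sum(row[i] * y[i] for i in range(2))
    e12 = sum(p[i][j] * x1[i] * x2[j] for i in range(2) for j in range(2))
    return e12 - e1 * e2, y, row

p = [[0.4, 0.1], [0.1, 0.4]]            # hypothetical positively correlated pair
x1, x2 = [0.0, 1.0], [0.0, 1.0]
cov, y, row = binary_pair_stats(p, x1, x2)

# Left half: the covariance has the sign of y_2 - y_1 when x_{1,2} > x_{1,1}.
left = row[0] * row[1] * (x1[1] - x1[0]) * (y[1] - y[0])
# Right half: y_2 - y_1 has the sign of p_22/p_2. - p_12/p_1. when x_{2,2} > x_{2,1}.
right = (x2[1] - x2[0]) * (p[1][1] / row[1] - p[0][1] / row[0])
```

For this distribution `cov` and `left` coincide and are positive, and `right` equals `y[1] - y[0]`, matching the sign conditions used in the proof.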

Proof of Theorem 3

Proof.

(1) We prove that for any database $\mathbf{x}$ , the value of $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ is bounded in $[-1,1]$ .

Based on $\sum_{x_{j}}\mathrm{Pr}(x_{j}|x_{i},\mathbf{x}_{\mathcal{K}^{\prime}})=1$ , we have

[TABLE]

Eq. (30) holds for all $x_{i}$ . We replace $x_{i}$ with two different values, $x_{i,m}$ and $x_{i,n}$ and have the following inequalities.

[TABLE]

Therefore, according to Eq. (13), the definition of $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})$ , we have $IR_{j,\mathcal{K}^{\prime}}(x_{i,m},x_{i,n})\in[-1,1]$ .

(2) Consider a database $\mathbf{x}$ that contains two tuples $x_{i}=\{x_{i,1},x_{i,2}\}$ and $x_{j}=\{x_{j,1},x_{j,2}\}$. The conditional joint distribution of $x_{i}$ and $x_{j}$ under $\mathbf{x}_{\mathcal{K}^{\prime}}$ is $\Pr(x_{i},x_{j}|\mathbf{x}_{\mathcal{K}^{\prime}})$. We will prove that the correlations have a direct relation to $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})$. According to Eq. (13),

[TABLE]

Let

[TABLE]

Obviously, $\mu_{1},\mu_{2}\in[0,1]$ , and define the next function.

[TABLE]

where the numerator and denominator are monotonically increasing with respect to $\mu_{1}$ , and $\mu_{2}$ , respectively. Based on these, we prove the three cases in Theorem 3 by using Lemma 3.

If $\rho_{ij,\mathcal{K}^{\prime}}>0$ , by Lemma 3, $1\geq\mu_{1}>\nu_{1}\geq 0$ . Therefore

[TABLE]

So, $f(\mu_{1},\nu_{1})\in\left(0,LS_{j}(f)/{\lambda}\right]$ , and $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})\in(0,1]$ .

If $\rho_{ij,\mathcal{K}^{\prime}}<0$ , by Lemma 3, $0\leq\mu_{1}<\nu_{1}\leq 1$ . Therefore

[TABLE]

So, $f(\mu_{1},\nu_{1})\in\left[-LS_{j}(f)/{\lambda},0\right)$ , and $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})\in[-1,0).$

If $\rho_{ij,\mathcal{K}^{\prime}}=0$ , by Lemma 3, $\mu_{1}=\nu_{1}\in[0,1]$ . Therefore, $f(\mu_{1},\nu_{1})\equiv 0$ , and $IR_{j,\mathcal{K}^{\prime}}(x_{i,1},x_{i,2})=0$ .

(3) The conditions are the same as Case (2) except that $x_{j}=\{x_{j,1},\cdots,x_{j,s}\},s\geq 3.$ Let $y_{m}=\mathbb{E}(x_{j}|x_{i,m},\mathbf{x}_{\mathcal{K}^{\prime}}),m=1,2.$ The left half of Lemma 3 still holds in this setting; the proof is similar and is omitted.

Next, we prove Case (3). According to Eq. (13),

[TABLE]

For $k=1,2,\cdots,s$ , let

[TABLE]

Then, we have

[TABLE]

The last approximation is obtained by using $e^{x}\approx 1+x$ and the fact $\sum_{k}\mu_{k}=1$ . Then, we get

[TABLE]

With the additional condition $\lambda>GS(f)$, we have $x_{j,k}/\lambda<1,\forall k\in[s]$. Combining this with $\sum_{k}\mu_{k}=\sum_{k}\nu_{k}=1$, we obtain $\sum_{k}\mu_{k}{x_{j,k}}/{\lambda}<1$ and $\sum_{k}\nu_{k}{x_{j,k}}/{\lambda}<1$. Based on the extended form of the left half of Lemma 3, we get

[TABLE]

∎

Appendix C Proof of Theorem 6

Proof.

According to the PDP for a continuous-valued database, we compute the following

[TABLE]

for any $\theta,r,|x_{i}-x_{i}^{\prime}|\leq M$ , where $s^{\prime}=s_{\mathcal{U}}+x_{i}^{\prime}+s_{\mathcal{K}}$ . In accordance with Lemma 1, set $\mathbf{x}_{u}=\mathbf{x}_{1}|\mathbf{x}_{2}$ , where $\mathbf{x}_{1}=\mathbf{x}_{\mathcal{U}}$ , $\mathbf{x}_{2}=\{x_{i},\mathbf{x}_{\mathcal{K}}\}$ . Here, $u$ is the number of variables in $\mathbf{x}_{\mathcal{U}}$ . According to Lemma 1, $\mathbf{x}_{u}$ follows $u$ -dimensional Gaussian distribution, with the density function

[TABLE]

where $A=(2\pi)^{-{u}/{2}}\left|\mathbf{\Sigma}_{1|2}\right|^{-{1}/{2}}$ , $\boldsymbol{\mu}_{1|2}=\boldsymbol{\mu}_{1}+\mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}),\mathbf{\Sigma}_{1|2}=\mathbf{\Sigma}_{11}-\mathbf{\Sigma}_{12}\mathbf{\Sigma}_{22}^{-1}\mathbf{\Sigma}_{21}.$ Because we adopt the Laplace mechanism,

[TABLE]

where $s=s_{\mathcal{U}}+x_{i}+s_{\mathcal{K}}$ denotes the sum of unknown tuples, attack object tuple and known tuples. According to Lemma 1, $s_{\mathcal{U}}=\sum_{k\in\mathcal{U}}x_{k}$ follows the Gaussian distribution, i.e.,

[TABLE]

where $\mu_{0}=\mathbf{1}^{\top}\cdot\boldsymbol{\mu}_{1|2},\sigma_{0}^{2}=\mathbf{1}^{\top}\cdot\mathbf{\Sigma}_{1|2}\cdot\mathbf{1}.$

By Eq. (21), $\sigma_{0}^{2}$ is a constant independent of $x_{i}$. By Eq. (20), $\mu_{0}$ depends on $x_{i}$ and $\mathbf{x}_{\mathcal{K}}$. To analyze the influence of $x_{i}$, we isolate the term containing $x_{i}$. Therefore, we expand $\mu_{0}$ and get

[TABLE]

where $\mu_{00}$ collects all terms that do not depend on $x_{i}$ or $x_{k}$, $k\in\mathcal{K}$. Therefore, for given $x_{k},k\in\mathcal{K}$, $\mu_{0}$ depends only on $x_{i}$. Combining Eq. (36) and Eq. (37), the density function of $s_{\mathcal{U}}$ is

[TABLE]

Let $z=r-s,t=r-s_{\mathcal{K}}-\mu_{00}-\sum_{k\in\mathcal{K}}\mu_{0k}x_{k}-(1+\mu_{0i})x_{i}$ . Substituting Eq. (35) and Eq. (38) into Eq. (33), we have

[TABLE]

where $t^{\prime}=r-s_{\mathcal{K}}-\mu_{00}-\sum_{k\in\mathcal{K}}\mu_{0k}x_{k}-(1+\mu_{0i})x_{i}^{\prime}$ . So

[TABLE]

By the mean value theorem and Lemma 2, we further have

[TABLE]

Under the assumption $|x_{i}-x_{i}^{\prime}|\leq M$ , the privacy leakage is

[TABLE]

∎
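The bound obtained in this proof can be sanity-checked numerically in the two-dimensional setting of Example 5 with the weakest adversary, where $\mu_{0}=\rho_{12}x_{1}$, $\sigma_{0}^{2}=1-\rho_{12}^{2}$ and $\mu_{0i}=\rho_{12}$. The sketch below is an illustration under those assumptions, not part of the original proof: it integrates the Laplace density against the Gaussian density of $s_{\mathcal{U}}$ with the trapezoidal rule and compares the largest log-ratio over a grid of outputs $r$ with $|1+\mu_{0i}||x_{i}-x_{i}^{\prime}|/\lambda$. Function names, grid sizes and tolerances are arbitrary choices.

```python
import math

def output_density(r, x_i, rho, lam, steps=4000):
    """Pr(r | x_i) for r = x_i + s_U + Lap(lam), s_U | x_i ~ N(rho*x_i, 1 - rho^2)."""
    mu0, sd = rho * x_i, math.sqrt(1.0 - rho * rho)
    lo, h = mu0 - 10.0 * sd, 20.0 * sd / steps
    total = 0.0
    for k in range(steps + 1):
        s = lo + k * h
        gauss = math.exp(-0.5 * ((s - mu0) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        lap = math.exp(-abs(r - x_i - s) / lam) / (2.0 * lam)
        weight = 0.5 if k in (0, steps) else 1.0      # trapezoidal end weights
        total += weight * gauss * lap * h
    return total

def max_log_ratio(x_i, x_i_prime, rho, lam):
    """Largest |log Pr(r|x_i) / Pr(r|x_i')| over r in [-8, 8]."""
    rs = [-8.0 + 0.25 * k for k in range(65)]
    return max(abs(math.log(output_density(r, x_i, rho, lam)
                            / output_density(r, x_i_prime, rho, lam)))
               for r in rs)

rho, lam = 0.5, 1.0
observed = max_log_ratio(0.0, 1.0, rho, lam)
bound = abs(1.0 + rho) * 1.0 / lam        # |1 + mu_0i| * |x_i - x_i'| / lambda
```

In this setting `observed` stays below `bound` up to integration error and approaches it for outputs far from the mean, which is consistent with the supremum in the final step of the proof.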

Bibliography

[1] T. Dalenius, “Towards a methodology for statistical disclosure control,” Statistik Tidskrift, vol. 15, pp. 429–444, 1977.
[2] L. Cox, “Suppression methodology and statistical disclosure control,” Journal of the American Statistical Association, vol. 75, no. 370, pp. 377–385, 1980.
[3] A. Juels, “RFID security and privacy: a research survey,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 2, pp. 381–394, 2006.
[4] X. Fang, S. Misra, G. Xue, and D. Yang, “Smart grid-the new and improved power grid: A survey,” IEEE Communications Surveys & Tutorials, vol. 14, no. 4, pp. 944–980, 2012.
[5] X. Yang, T. Wang, X. Ren, and W. Yu, “Survey on improving data utility in differentially private sequential data publishing,” IEEE Transactions on Big Data, no. 1, pp. 1–1, 2017.
[6] P. Voigt and A. Von dem Bussche, The EU General Data Protection Regulation (GDPR). Springer, 2017, vol. 18.
[7] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
[8] C. Dwork, “Differential privacy: A survey of results,” in International Conference on Theory and Applications of Models of Computation, 2008, pp. 1–19.
