Asymptotic Limits of Privacy in Bayesian Time Series Matching

Nazanin Takbiri; Dennis L. Goeckel; Amir Houmansadr; Hossein Pishro-Nik

arXiv:1902.06404·cs.IT·February 19, 2019


TL;DR

This paper establishes theoretical bounds on user privacy in Bayesian time series matching, analyzing how anonymized data can still be vulnerable to re-identification through statistical matching, for i.i.d. and Markov models.

Contribution

It provides the first theoretical bounds on privacy loss in Bayesian time series matching, covering both i.i.d. and Markov-dependent data models.

Findings

01

Derived achievability and converse bounds for i.i.d. data traces.

02

Extended bounds to Markov chain data models.

03

Identified conditions under which privacy can be compromised.



Equations

$\textbf{X}_{u}=[X_{u}(1),X_{u}(2),\cdots,X_{u}(m)]^{T},\ \ \textbf{X}=[\textbf{X}_{1},\textbf{X}_{2},\cdots,\textbf{X}_{n}],$

$\textbf{W}_{u}=[W_{u}(1),W_{u}(2),\cdots,W_{u}(l)]^{T},\ \ \textbf{W}=[\textbf{W}_{1},\textbf{W}_{2},\cdots,\textbf{W}_{n}].$

$\textbf{Y}_{u}=[Y_{u}(1),Y_{u}(2),\cdots,Y_{u}(m)]^{T},\ \ \textbf{Y}=[\textbf{Y}_{1},\textbf{Y}_{2},\cdots,\textbf{Y}_{n}].$

$0<\delta_{1}<f_{\textbf{P}}(\textbf{x})<\delta_{2}<\infty.$

$P_{e}(u,k)=\mathbb{P}\left(\widetilde{X_{u}(k)}\neq X_{u}(k)\right),$

$P_{e}^{*}(u,k)=\inf_{{\cal E}}\mathbb{P}\left(\widetilde{X_{u}(k)}\neq X_{u}(k)\right)\to 0.$

$\lim\limits_{n\to\infty}H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\to+\infty,$

$W_{u}(k)\sim Bernoulli\left(p_{u}\right),$

$X_{u}(k)\sim Bernoulli\left(p_{u}\right),\ \ Y_{u}(k)\sim Bernoulli\left(p_{\Pi(u)}\right).$

$0<\delta_{1}<f_{P}(x)<\delta_{2}<\infty.$

$I\left(\Pi(1);\textbf{W}|\textbf{Y}\right)\leq I\left(\Pi(1);\textbf{P}|\textbf{Y}\right);$

$H\left(\Pi(1)|\textbf{Y}\right)-H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\leq H\left(\Pi(1)|\textbf{Y}\right)-H\left(\Pi(1)|\textbf{P},\textbf{Y}\right),$

$H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\geq H\left(\Pi(1)|\textbf{P},\textbf{Y}\right).$

$H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\to+\infty,$

$\overline{Y_{u}}=\frac{Y_{u}(1)+Y_{u}(2)+\cdots+Y_{u}(m)}{m},$

$\overline{Y_{\Pi(u)}}=\frac{X_{u}(1)+X_{u}(2)+\cdots+X_{u}(m)}{m},$

$\overline{W_{u}}=\frac{W_{u}(1)+W_{u}(2)+\cdots+W_{u}(l)}{l}.$

$\mathbb{P}\left(\left\lvert\overline{X_{u}}-\overline{W_{u}}\right\rvert\leq\Delta_{n}\right)\to 1.$

$\mathbb{P}\left(\left\lvert\overline{X_{u}}-\overline{W_{u}}\right\rvert\geq\Delta_{n}\right)=\mathbb{P}\left(\left\lvert\left(\overline{X_{u}}-p_{u}\right)-\left(\overline{W_{u}}-p_{u}\right)\right\rvert\geq\Delta_{n}\right)$

$\leq\mathbb{P}\left(\left\lvert\overline{X_{u}}-p_{u}\right\rvert+\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\Delta_{n}\right)$

$\leq\mathbb{P}\left(\left\{\left\lvert\overline{X_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right\}\bigcup\left\{\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right\}\right)$

$\leq\mathbb{P}\left(\left\lvert\overline{X_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right)+\mathbb{P}\left(\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right)$

$\leq 2e^{-\frac{m\Delta_{n}^{2}}{12p_{u}}}+2e^{-\frac{l\Delta_{n}^{2}}{12p_{u}}}$

$=2e^{-\frac{cn^{2+\alpha}\cdot n^{-2-\frac{\alpha}{2}}}{12p_{u}}}+2e^{-\frac{c^{\prime}n^{2+\alpha}\cdot n^{-2-\frac{\alpha}{2}}}{12p_{u}}}$

$=2e^{-\frac{cn^{\frac{\alpha}{2}}}{12}}+2e^{-\frac{c^{\prime}n^{\frac{\alpha}{2}}}{12}}\to 0,$

$\mathbb{P}\left(\left\lvert\overline{Y_{\Pi(1)}}-\overline{W_{1}}\right\rvert\leq\Delta_{n}\right)\to 1,$

$\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert p_{u}-p_{1}\right\rvert\leq 4\Delta_{n}\right\}\right)\to 0.$

$\mathbb{P}\left(\left\lvert p_{u}-p_{1}\right\rvert\leq 4\Delta_{n}\right)\leq 8\Delta_{n}\delta_{2},$

$\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert p_{u}-p_{1}\right\rvert\leq 4\Delta_{n}\right\}\right)\leq 8n\Delta_{n}\delta_{2}=8n^{-\frac{\alpha}{4}}\delta_{2}\to 0,$

$\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert\overline{W_{u}}-\overline{W_{1}}\right\rvert\leq 2\Delta_{n}\right\}\right)\to 0.$

$\mathbb{P}\left(\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\Delta_{n}\right)\leq 2e^{-\frac{l\Delta_{n}^{2}}{3p_{u}}}\leq 2e^{-\frac{l\Delta_{n}^{2}}{3}}.$

$\mathbb{P}\left(\left\lvert\overline{W_{1}}-p_{1}\right\rvert\geq\Delta_{n}\right)\leq 2e^{-\frac{c^{\prime}n^{\frac{\alpha}{2}}}{3}}\to 0,$


Taxonomy

Topics: Privacy-Preserving Technologies in Data · Mobile Crowdsensing and Crowdsourcing · Privacy, Security, and Data Protection

Full text

Asymptotic Limits of Privacy in Bayesian Time Series Matching

Nazanin Takbiri, Electrical and Computer Engineering, UMass-Amherst

Dennis L. Goeckel, Electrical and Computer Engineering, UMass-Amherst

Amir Houmansadr, Information and Computer Sciences, UMass-Amherst

Hossein Pishro-Nik, Electrical and Computer Engineering, UMass-Amherst

This work was supported by the National Science Foundation under grants CCF–1421957 and CNS–1739462.

Abstract

Various modern and highly popular applications make use of user data traces in order to offer specific services, often for the purpose of improving the user’s experience while using such applications. However, even when user data is privatized by employing privacy-preserving mechanisms (PPM), users’ privacy may still be compromised by an external party who leverages statistical matching methods to match users’ traces with their previous activities. In this paper, we obtain the theoretical bounds on user privacy for situations in which user traces are matchable to sequences of prior behavior, despite anonymization of data time series. We provide both achievability and converse results for the case where the data trace of each user consists of independent and identically distributed (i.i.d.) random samples drawn from a multinomial distribution, as well as the case that the users’ data points are dependent over time and the data trace of each user is governed by a Markov chain model.

Index Terms:

Anonymization, information theoretic privacy, Internet of Things (IoT), Markov chain model, statistical matching, Privacy-Preserving Mechanism (PPM).

I Introduction

The Internet of Things (IoT) is an important emerging technology and is growing at a rapid pace: by 2020, over 50 billion devices will be connected together as part of the IoT network [1]. Environmental monitoring, infrastructure management, energy management, medical and healthcare systems, building and home automation, and transport systems are some examples which indicate that IoT devices will affect nearly every aspect of our daily lives. However, this ubiquity of impact also raises grave privacy concerns. In particular, each IoT user in each application generates a sequence of data that can be modeled as a random process; for example, in location-based services, each user generates location traces. These sequences of data in IoT systems often contain sensitive information about users, such as their locations, health information, and hobbies. As a result, the huge amount of data generated by IoT devices can critically damage users’ privacy, thereby presenting a significant obstacle to the adoption of IoT applications. Thus, IoT privacy has drawn the attention of the research community [2, 3, 4] to investigate effective privacy-preserving mechanisms (PPMs).

PPMs are used to increase the assurance that private data is not accessible to third parties. Two promising classes of PPMs are identity perturbation and data perturbation [5, 6, 7, 8, 9, 10, 11, 12]. The identity perturbation technique or anonymization is the process of hiding the true identity of the data owner [5, 6, 7, 8, 9]. This technique removes personal identifiers or converts personally identifiable information into aggregated data. The data perturbation or obfuscation is the process of hiding the users’ data by adding noise [10, 11, 12]. However, perturbation techniques reduce utility to provide better privacy protection; thus, obtaining the optimum levels of anonymization and obfuscation is important.

In [7, 13], a comprehensive analysis of the asymptotic (in the length of the time series) optimal matching of time series to source distributions is presented in a non-Bayesian setting, where the number of users is a fixed, finite value. However, in [14, 15, 16, 17, 18, 19, 20], a Bayesian setting was adopted in which the adversary has accurate prior distributions for user behavior through past observations or other sources, and the asymptotic limits of user privacy were obtained.

In addition, Li et al. [21] provide an optimal hypothesis test in the case where the adversary has training sequences from the group of users rather than the exact probability distribution.

In this paper, we adopt the same setting as [21]; however, our work has a significantly different flavor from that of [21]. First, [21] finds the optimal test in the non-asymptotic regime where there exist two users, while here, the asymptotic limits of user privacy for the case of a large number of users are obtained. Second, [21] obtains the necessary conditions for breaking privacy, while here, conditions for both perfect anonymity and no privacy are obtained. Third, [21] establishes the optimal test for the case with binary alphabets where each user’s trace consists of independent and identically distributed (i.i.d.) samples drawn from a Bernoulli distribution, while here, we extend our results to the case where each user’s trace is governed by i.i.d. random samples of a multinoulli distribution. We also extend our results to a more general Markov chain model.

The remainder of this paper is organized as follows. Section II discusses the system model and the metrics used in the paper. Achievability and converse results for the two-state i.i.d. model are presented in Section III, and their extensions to the $r$ -state i.i.d. model are presented in Section IV. In addition, achievability and converse results for a more general Markov chain model are presented in Section V. Section VI provides some final conclusions and directions for future work.

II Framework

We assume a system with $n$ users. Each user creates a length-$m$ sequence of data, which is denoted by $\textbf{X}_{u}$,

$\textbf{X}_{u}=[X_{u}(1),X_{u}(2),\cdots,X_{u}(m)]^{T},\ \ \textbf{X}=[\textbf{X}_{1},\textbf{X}_{2},\cdots,\textbf{X}_{n}],$

where $X_{u}(k)$ is the actual data point of user $u$ at time $k$. For each user, there also exists a length-$l$ sequence of its past behavior, which is denoted by $\textbf{W}_{u}$,

$\textbf{W}_{u}=[W_{u}(1),W_{u}(2),\cdots,W_{u}(l)]^{T},\ \ \textbf{W}=[\textbf{W}_{1},\textbf{W}_{2},\cdots,\textbf{W}_{n}],$

where $W_{u}(k)$ is the observation of the prior behavior of user $u$ at time $k$.

The adversary has access to the observations of the users’ prior behavior and wants to use this knowledge to break users’ privacy despite the use of PPMs. As shown in Figure 1, an anonymization technique is employed in order to perturb the users’ identities before the data is provided to the IoT application. In this figure, $Y_{u}(k)$ is the reported data point of user $u$ at time $k$ after applying anonymization; hence, the adversary observes

$\textbf{Y}_{u}=[Y_{u}(1),Y_{u}(2),\cdots,Y_{u}(m)]^{T},\ \ \textbf{Y}=[\textbf{Y}_{1},\textbf{Y}_{2},\cdots,\textbf{Y}_{n}],$

where $\textbf{Y}$ is the permuted version of $\textbf{X}$.

II-A Models and Metrics

Data Points Model: We assume there exist $r$ possible values $\{0,1,\cdots,r-1\}$ for each data point. As shown in Figure 1, there exist two traces for each user: one that is termed "training data" and one that is termed "actual data," which needs to be protected from a malicious adversary. Remember that these two traces are generated from the same unknown probability distribution. In other words, for $k\in\{1,2,\cdots,m\}$ and $k^{\prime}\in\{1,2,\cdots,l\}$ , both $X_{u}(k)$ and $W_{u}(k^{\prime})$ are drawn from a user-specific probability distribution denoted as $\textbf{p}_{u}$ . While all $\textbf{p}_{u}$ ’s are unknown to the adversary, each of them is drawn independently from a continuous density function $f_{\textbf{P}}(\textbf{x})$ , where for all $x$ in the support of $f_{\textbf{P}}(\textbf{x})$ , we assume

$0<\delta_{1}<f_{\textbf{P}}(\textbf{x})<\delta_{2}<\infty.$

Anonymization Mechanism: As shown in Figure 1, the mapping between users and data sequences is randomly permuted in order to achieve privacy. This random permutation is chosen uniformly at random among all $n!$ possible permutations on the set of $n$ users $\left(\mathbf{\Pi}:\{1,2,\cdots,n\}\mapsto\{1,2,\cdots,n\}\right)$; then, $\textbf{Y}_{u}=\textbf{X}_{\Pi^{-1}(u)}$ and $\textbf{Y}_{\Pi(u)}=\textbf{X}_{u}$.

Adversary Model: The adversary tries to match each sequence in the collection of training data traces $\left\{\textbf{W}_{u},u=1,2,\cdots,n\right\}$ with the sequence in the observation data traces $\left\{\textbf{Y}_{u},u=1,2,\cdots,n\right\}$ that is drawn from the same probability distribution, which we term statistical matching. This is equivalent to finding the permutations of the user identities between two collections. Note that the adversary knows the anonymization mechanism; however, he/she does not know the realization of the random permutation function.
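The model above can be sketched in a few lines of Python for the two-state case ($r=2$). This is only an illustration under stated assumptions: a uniform density on $(0.05,0.95)$ stands in for $f_{\textbf{P}}$, users are indexed from 0, and the function names `draw_traces` and `anonymize` are hypothetical, not from the paper.

```python
import random

def draw_traces(n, m, l, rng):
    """Draw hidden parameters p_u, actual traces X_u (length m), training traces W_u (length l)."""
    # Assumption: a uniform density on (0.05, 0.95) plays the role of f_P.
    p = [rng.uniform(0.05, 0.95) for _ in range(n)]
    X = [[1 if rng.random() < p[u] else 0 for _ in range(m)] for u in range(n)]
    W = [[1 if rng.random() < p[u] else 0 for _ in range(l)] for u in range(n)]
    return p, X, W

def anonymize(X, rng):
    """Pick a uniformly random permutation Pi and set Y[Pi[u]] = X[u]."""
    n = len(X)
    Pi = list(range(n))
    rng.shuffle(Pi)
    Y = [None] * n
    for u in range(n):
        Y[Pi[u]] = X[u]
    return Pi, Y

rng = random.Random(0)
p, X, W = draw_traces(n=5, m=8, l=6, rng=rng)
Pi, Y = anonymize(X, rng)  # the adversary sees W and Y, but neither p nor Pi
```

The adversary's task is then to recover `Pi` from `W` and `Y` alone.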

Following [17], the definition of no privacy is as follows:

Definition 1.

For an algorithm of the adversary that tries to estimate the actual data point of user $u$ at time $k$, define the error probability as

$P_{e}(u,k)=\mathbb{P}\left(\widetilde{X_{u}(k)}\neq X_{u}(k)\right),$

where $X_{u}(k)$ is the actual data point of user $u$ at time $k$, and $\widetilde{X_{u}(k)}$ is the adversary’s estimated data point of user $u$ at time $k$. Now, define ${\cal E}$ as the set of all possible estimators of the adversary. Then, user $u$ has no privacy at time $k$ if and only if, for large enough $n$,

$P_{e}^{*}(u,k)=\inf_{{\cal E}}\mathbb{P}\left(\widetilde{X_{u}(k)}\neq X_{u}(k)\right)\to 0.$

Hence, a user has no privacy if there exists an algorithm for the adversary to estimate $X_{u}(k)$ with diminishing error probability as $n$ goes to infinity.

In this paper, we also consider the situation in which there is perfect anonymity.

Definition 2.

User $u$ has perfect anonymity at time $k$ if and only if

$\lim\limits_{n\to\infty}H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\to+\infty,$

where $H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)$ is the entropy of $\Pi(1)$ given W and Y.
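As a point of reference (an observation that is not stated in the paper but follows from the uniform choice of $\mathbf{\Pi}$): without any side information, $\Pi(1)$ is uniform on $\{1,2,\cdots,n\}$, and conditioning cannot increase entropy, so

```latex
H\left(\Pi(1)\,\middle|\,\textbf{W},\textbf{Y}\right)\;\leq\;H\left(\Pi(1)\right)\;=\;\log n .
```

Definition 2 therefore asks that the adversary's remaining uncertainty about the pseudonym of user $1$ still grows without bound as $n\to\infty$, although it can never exceed $\log n$.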

III Two-State i.i.d. Model

In this section, we assume each user’s trace consists of samples from an i.i.d. random process and there are only two possible values for each user data point, $X_{u}(k)\in\{0,1\}$. Thus, both training traces and real data traces are governed by an i.i.d. Bernoulli distribution with parameter $p_{u}$, where $p_{u}$ is the probability that a data point of user $u$ takes the value $1$; hence,

$W_{u}(k)\sim Bernoulli\left(p_{u}\right),$

and

$X_{u}(k)\sim Bernoulli\left(p_{u}\right),\ \ Y_{u}(k)\sim Bernoulli\left(p_{\Pi(u)}\right).$

As discussed in Section II, while the $p_{u}$’s are unknown to the adversary, they are drawn independently from a known continuous density function $f_{P}(x)$, where for all $x\in(0,1)$, we have

$0<\delta_{1}<f_{P}(x)<\delta_{2}<\infty.$

III-A Perfect Anonymity Analysis

The following theorem states that if $m$ or $l$ are significantly smaller than $n^{2}$ in this two-state model, then all users have perfect anonymity.

Theorem 1.

For the above two-state i.i.d. model, if Y is the anonymized version of X, and W is the past behavior of the users as defined above, and

•

at least one of $m$ or $l$ is less than or equal to $cn^{2-\alpha}$ for any $c,\alpha>0$ ;

then, user $1$ has perfect anonymity at time $k$ .

Proof.

First, consider the case $m\leq l$ . Here, W is considered as the training set and Y is considered as the observed set; thus, given Y, $\textbf{W}\rightarrow\textbf{P}\rightarrow\Pi(1)$ forms a Markov chain. According to the data processing inequality,

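$I\left(\Pi(1);\textbf{W}|\textbf{Y}\right)\leq I\left(\Pi(1);\textbf{P}|\textbf{Y}\right);$

thus,

$H\left(\Pi(1)|\textbf{Y}\right)-H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\leq H\left(\Pi(1)|\textbf{Y}\right)-H\left(\Pi(1)|\textbf{P},\textbf{Y}\right),$

and

$H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\geq H\left(\Pi(1)|\textbf{P},\textbf{Y}\right).$

In [15, Theorem 1], it is shown that if $m=n^{2-\alpha}$, then $H\left(\Pi(1)|\textbf{P},\textbf{Y}\right)\to+\infty$; so, we can conclude

$H\left(\Pi(1)|\textbf{W},\textbf{Y}\right)\to+\infty,$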

as $n\to\infty$ .

Now, consider the case $l\leq m$ . By symmetry of the problem Y can be considered as the training set and W can be considered as the observed data. Thus, we can similarly prove the same results. ∎

III-B No Privacy Analysis

The following theorem states that if both $m$ and $l$ are significantly larger than $n^{2}$ in this two-state model, then the adversary can find an algorithm to successfully estimate users’ data points with arbitrarily small error probability, and as a result break users’ privacy.

Theorem 2.

For the above two-state i.i.d. model, if Y is the anonymized version of X, and W is the past behavior of the users as defined above, and

•

$m=cn^{2+\alpha}$ for any $c,\alpha>0$ ;

•

$l=c^{\prime}n^{2+\alpha}$ for any $c^{\prime},\alpha>0$ ;

then, user $1$ has no privacy at time $k$ .

Proof.

For $u\in\{1,2,\cdots,n\}$, define

$\overline{Y_{u}}=\frac{Y_{u}(1)+Y_{u}(2)+\cdots+Y_{u}(m)}{m},\ \ \overline{Y_{\Pi(u)}}=\frac{X_{u}(1)+X_{u}(2)+\cdots+X_{u}(m)}{m},$

and

$\overline{W_{u}}=\frac{W_{u}(1)+W_{u}(2)+\cdots+W_{u}(l)}{l}.$

We claim that for $m=cn^{2+\alpha}$, $l=c^{\prime}n^{2+\alpha}$, and large enough $n$:

1. $\mathbb{P}\left(\left\lvert\overline{Y_{\Pi(1)}}-\overline{W_{1}}\right\rvert\leq\Delta_{n}\right)\to 1,$

2. $\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert\overline{Y_{\Pi(u)}}-\overline{W_{1}}\right\rvert\leq\Delta_{n}\right\}\right)\to 0,$

where $\Delta_{n}=n^{-(1+\frac{\alpha}{4})}$ . Thus, the adversary can match $\textbf{W}_{1}$ to $\textbf{Y}_{\Pi(1)}$ .

First Step: We want to show

$\mathbb{P}\left(\left\lvert\overline{X_{u}}-\overline{W_{u}}\right\rvert\leq\Delta_{n}\right)\to 1.$

Note $\mathbb{E}[X_{u}(k)]=\mathbb{E}[W_{u}(k)]=p_{u}$ , so as $n\to\infty$ ,

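$\mathbb{P}\left(\left\lvert\overline{X_{u}}-\overline{W_{u}}\right\rvert\geq\Delta_{n}\right)=\mathbb{P}\left(\left\lvert\left(\overline{X_{u}}-p_{u}\right)-\left(\overline{W_{u}}-p_{u}\right)\right\rvert\geq\Delta_{n}\right)$

$\leq\mathbb{P}\left(\left\lvert\overline{X_{u}}-p_{u}\right\rvert+\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\Delta_{n}\right)$

$\leq\mathbb{P}\left(\left\{\left\lvert\overline{X_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right\}\bigcup\left\{\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right\}\right)$

$\leq\mathbb{P}\left(\left\lvert\overline{X_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right)+\mathbb{P}\left(\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\frac{\Delta_{n}}{2}\right)$

$\leq 2e^{-\frac{m\Delta_{n}^{2}}{12p_{u}}}+2e^{-\frac{l\Delta_{n}^{2}}{12p_{u}}}$

$=2e^{-\frac{cn^{2+\alpha}\cdot n^{-2-\frac{\alpha}{2}}}{12p_{u}}}+2e^{-\frac{c^{\prime}n^{2+\alpha}\cdot n^{-2-\frac{\alpha}{2}}}{12p_{u}}}$

$=2e^{-\frac{cn^{\frac{\alpha}{2}}}{12}}+2e^{-\frac{c^{\prime}n^{\frac{\alpha}{2}}}{12}}\to 0,$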

where the first inequality follows from the fact that $\mathinner{\!\left\lvert a-b\right\rvert}\leq\mathinner{\!\left\lvert a\right\rvert}+\mathinner{\!\left\lvert b\right\rvert}$ , and as a result, $\mathbb{P}\left(\ \mathinner{\!\left\lvert a-b\right\rvert}\geq\Delta_{n}\right)\leq\mathbb{P}\left(\ \mathinner{\!\left\lvert a\right\rvert}+\mathinner{\!\left\lvert b\right\rvert}\geq\Delta_{n}\right)$ . The union bound yields the third inequality, and the fourth inequality follows from Chernoff bounds. Now, for u=1, we have

$\mathbb{P}\left(\left\lvert\overline{Y_{\Pi(1)}}-\overline{W_{1}}\right\rvert\leq\Delta_{n}\right)\to 1,$

as $n\to\infty.$
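To get a feel for how quickly the first step takes effect, the final expression of the Chernoff argument, $2e^{-cn^{\alpha/2}/12}+2e^{-c^{\prime}n^{\alpha/2}/12}$, can be evaluated numerically. The sketch below is illustrative only; the function name and the chosen values of $\alpha$, $c$ and $c^{\prime}$ are assumptions, not values from the paper.

```python
import math

def first_step_bound(n, alpha, c, c_prime):
    """Evaluate 2*exp(-c*n^(alpha/2)/12) + 2*exp(-c'*n^(alpha/2)/12)."""
    growth = n ** (alpha / 2.0)  # equals m * Delta_n^2 / c when m = c * n^(2 + alpha)
    return 2.0 * math.exp(-c * growth / 12.0) + 2.0 * math.exp(-c_prime * growth / 12.0)

# Hypothetical parameter choice: alpha = 1 and c = c' = 1.
for n in (10, 100, 1000, 10000):
    print(n, first_step_bound(n, alpha=1.0, c=1.0, c_prime=1.0))
```

For small $n$ the expression exceeds 1 and says nothing; it only becomes informative once $n^{\alpha/2}$ is large compared with $12/c$ and $12/c^{\prime}$, which is consistent with the asymptotic nature of the result.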

Second Step: First, we show as $n\to\infty$ ,

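$\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert p_{u}-p_{1}\right\rvert\leq 4\Delta_{n}\right\}\right)\to 0.$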

According to (1), for all $u\in\{2,3,\cdots,n\}$ , we have

$\mathbb{P}\left(\left\lvert p_{u}-p_{1}\right\rvert\leq 4\Delta_{n}\right)\leq 8\Delta_{n}\delta_{2},$

and according to the union bound,

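$\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert p_{u}-p_{1}\right\rvert\leq 4\Delta_{n}\right\}\right)\leq 8n\Delta_{n}\delta_{2}=8n^{-\frac{\alpha}{4}}\delta_{2}\to 0,$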

as $n\to\infty$ . Thus, for $u\in\{2,3,\cdots,n\}$ , the distance between $p_{u}$ and $p_{1}$ is bigger than $4\Delta_{n}$ with high probability.
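The separation argument rests on two facts: the density of each $p_{u}$ is at most $\delta_{2}$, and the interval $[p_{1}-4\Delta_{n},p_{1}+4\Delta_{n}]$ has length $8\Delta_{n}$. A minimal sketch of the resulting union bound $8n\Delta_{n}\delta_{2}=8n^{-\alpha/4}\delta_{2}$ follows; the function names are hypothetical.

```python
def delta_n(n, alpha):
    """Delta_n = n^-(1 + alpha/4), the matching tolerance used in the proof."""
    return n ** -(1.0 + alpha / 4.0)

def collision_bound(n, alpha, delta_2):
    """Union bound 8 * n * Delta_n * delta_2 on P(some p_u lies within 4*Delta_n of p_1)."""
    return 8.0 * n * delta_n(n, alpha) * delta_2

# With n = 16, alpha = 4 and delta_2 = 1 the bound is 8 * 16**-1 = 0.5.
print(collision_bound(16, 4.0, 1.0))
```

Unlike the exponentially small Chernoff terms, this bound decays only polynomially, as $n^{-\alpha/4}$, so it is the slowest-vanishing term in the argument.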

Next, we show as $n\to\infty$ ,

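$\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert\overline{W_{u}}-\overline{W_{1}}\right\rvert\leq 2\Delta_{n}\right\}\right)\to 0.$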

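Note that, for all $u\in\{1,2,\cdots,n\}$, the Chernoff bound yields

$\mathbb{P}\left(\left\lvert\overline{W_{u}}-p_{u}\right\rvert\geq\Delta_{n}\right)\leq 2e^{-\frac{l\Delta_{n}^{2}}{3p_{u}}}\leq 2e^{-\frac{l\Delta_{n}^{2}}{3}}.$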

As a result, for $u=1$ , we have

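$\mathbb{P}\left(\left\lvert\overline{W_{1}}-p_{1}\right\rvert\geq\Delta_{n}\right)\leq 2e^{-\frac{c^{\prime}n^{\frac{\alpha}{2}}}{3}}\to 0,$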

as $n\to\infty$ . In other words, with high probability, the distance between $\overline{W_{1}}$ and $p_{1}$ is less than $\Delta_{n}$ .

Now, given the fact that the distance between all $p_{u}$ ’s and $p_{1}$ is bigger than $4\Delta_{n}$ , and the fact that the distance between $\overline{W_{1}}$ and $p_{1}$ is less than $\Delta_{n}$ , for all $u\in\{2,3,\cdots,n\}$ , we have

[TABLE]

Thus,

[TABLE]

as $n\to\infty$ .

Now, we claim that given the fact that the distances between each of the $\overline{{W}_{u}}$ ’s and $\overline{{W}_{1}}$ are bigger than $2\Delta_{n}$ , we have

[TABLE]

Note, using (2), we have

[TABLE]

Thus, by using union bound, we have

[TABLE]

as $n\to\infty$ .

After completing the first and second steps, we can conclude if $m=cn^{2+\alpha}$ and $l=c^{\prime}n^{2+\alpha}$ , users have no privacy as $n\to\infty$ . ∎
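The matching rule implied by the two steps can be sketched as follows: compute the empirical mean of the training trace $\textbf{W}_{1}$, compute the empirical mean of every anonymized trace, and accept the match only when exactly one anonymized trace lies within the tolerance. In the theorem the tolerance is $\Delta_{n}=n^{-(1+\frac{\alpha}{4})}$; the function names and the toy traces below are illustrative assumptions, and the toy traces are hand-made rather than drawn from the model.

```python
def empirical_mean(trace):
    """Fraction of ones in a binary trace."""
    return sum(trace) / len(trace)

def candidates_within(w_bar, y_bars, delta):
    """Indices j with |y_bars[j] - w_bar| <= delta."""
    return [j for j, y_bar in enumerate(y_bars) if abs(y_bar - w_bar) <= delta]

def match_user(W_u, Y, delta):
    """Return the matched index in Y, or None when the match is not unique."""
    y_bars = [empirical_mean(y) for y in Y]
    hits = candidates_within(empirical_mean(W_u), y_bars, delta)
    return hits[0] if len(hits) == 1 else None

# Hand-made toy traces to show the call pattern.
W_1 = [1, 0, 1, 0]
Y = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0]]
matched = match_user(W_1, Y, delta=0.1)  # index of the trace whose mean is closest to 0.5
```

Returning `None` when several traces fall inside the tolerance mirrors the role of the second step: the proof shows that, in the stated regime, such ambiguity has vanishing probability.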

IV $r$ -State i.i.d. Model

In this section, we assume each user’s trace consists of samples from an i.i.d. random process, and users’ data points can have $r$ possibilities, where $X_{u}(k)\in\{0,1,\cdots,r-1\}$ . Thus, both training traces and real data traces are governed by an i.i.d. multinoulli distribution with parameter $\textbf{p}_{u}$ , and

[TABLE]

where $p_{u}(i)$ is the probability that a datum of user $u$ has value $i$ .

As discussed in Section II, while $\textbf{p}_{u}$ ’s are unknown to the adversary, they are drawn independently from a known continuous density function $f_{\textbf{P}}(\textbf{x})$ , where for all $\textbf{x }\in\mathcal{R}_{\textbf{p}}$ ,

[TABLE]

we have

[TABLE]

IV-A Perfect Anonymity Analysis

The following theorem states that if $m$ or $l$ are significantly smaller than $n^{\frac{2}{r-1}}$ in this $r$ -state model, then all users have perfect anonymity.

Theorem 3.

For the above $r$ -state i.i.d. model, if Y is the anonymized version of X, and W is the past behavior of the users as defined above, and

•

at least one of $m$ or $l$ is less than or equal to $cn^{\frac{2}{r-1}-\alpha}$ for any $c,\alpha>0$ ;

then, user $1$ has perfect anonymity at time $k$ .

Proof.

Repeating the reasoning of the proof of Theorem 1 and then applying [15, Theorem 2] completes the proof. ∎

IV-B No Privacy Analysis

The following theorem states that if both $m$ and $l$ are significantly larger than $n^{\frac{2}{r-1}}$ in this $r$ -state model, then the adversary can find an algorithm to successfully estimate users’ data points with arbitrarily small error probability, and as a result break users’ privacy.

Theorem 4.

For the above $r$ -state i.i.d. model, if Y is the anonymized version of X, and W is the past behavior of the users as defined above, and

•

$m=cn^{\frac{2}{r-1}+\alpha}$ for any $c,\alpha>0$ ;

•

$l=c^{\prime}n^{\frac{2}{r-1}+\alpha}$ for any $c^{\prime},\alpha>0$ ;

then, user $1$ has no privacy at time $k$ .

Proof.

The proof of Theorem 4 is similar to the proof of Theorem 2, so we just provide the general idea. We similarly define the empirical probability that the user with pseudonym $u$ has data sample $i$ as follows:

[TABLE]

and

[TABLE]

We also have

[TABLE]

The difference from the proof of Theorem 2 is that, for each $u\in\{1,2,\cdots,n\}$ , $\overline{\textbf{Y}_{u}}$ and $\overline{\textbf{W}_{u}}$ are vectors of length $r-1$ . In other words,

[TABLE]

and we claim for $m=cn^{\frac{2}{r-1}+\alpha}$ , $l=c^{\prime}n^{\frac{2}{r-1}+\alpha}$ , and large enough $n$ ,

1. $\mathbb{P}\left(\left\lvert\overline{\textbf{Y}_{\Pi(1)}}-\overline{\textbf{W}_{1}}\right\rvert\leq\Delta^{\prime}_{n}\right)\to 1,$

2. $\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert\overline{\textbf{Y}_{\Pi(u)}}-\overline{\textbf{W}_{1}}\right\rvert\leq\Delta^{\prime}_{n}\right\}\right)\to 0,$

where $\Delta^{\prime}_{n}=n^{-\left(\frac{1}{r-1}+\frac{\alpha}{4}\right)}$ . ∎
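For the $r$-state case, the scalar empirical mean is replaced by a vector of $r-1$ empirical frequencies. The sketch below makes two assumptions that the text does not fix: the retained components are those of the values $0,\cdots,r-2$ (the last frequency is implied because all $r$ sum to one), and the distance $\lvert\cdot\rvert$ is taken to be the Euclidean norm. All function names are hypothetical.

```python
import math

def empirical_distribution(trace, r):
    """Relative frequencies of the values 0..r-2 in the trace (assumed choice of r-1 components)."""
    m = len(trace)
    return [sum(1 for x in trace if x == i) / m for i in range(r - 1)]

def euclidean_distance(a, b):
    """Assumed norm for comparing two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def delta_prime(n, r, alpha):
    """Delta'_n = n^-(1/(r-1) + alpha/4)."""
    return n ** -(1.0 / (r - 1) + alpha / 4.0)

freqs = empirical_distribution([0, 1, 2, 0], r=3)  # frequencies of the values 0 and 1
```

With $r=2$ the tolerance `delta_prime(n, 2, alpha)` reduces to $\Delta_{n}=n^{-(1+\frac{\alpha}{4})}$ from the two-state proof.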

V $r$ -State Markov Chain Model

In Sections III and IV, the data trace of each user is governed by an i.i.d. random process, while here the data trace of each user is governed by an irreducible and aperiodic $r$-state Markov chain, where $E$ is the set of edges. Let us define the transition probability from state $i$ to state $j$ as:

[TABLE]

thus, $(i,j)\in E$ if and only if $p_{u}(i,j)>0$ .

Here, we assume the same Markov chain structure for all of the users, but different users have different transition matrices. Note that a subset of the transition probabilities with size $|E|-r$ is sufficient for recovering the whole transition matrix. Let this subset be called $\textbf{p}_{u}$ , so

[TABLE]

where $p_{u}(i)$ is the probability that a datum of user $u$ has value $i$ . As discussed in Section II, while $\textbf{p}_{u}$ ’s are unknown to the adversary, they are drawn independently from a known continuous density function $f_{\textbf{P}}(\textbf{x})$ , where for all $\textbf{x }\in\mathcal{R}_{\textbf{p}}$ ,

[TABLE]

we have

[TABLE]
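As an illustrative example of the parameter count (not taken from the paper), consider $r=2$ with all four transitions allowed, so $|E|=4$. Each row of the transition matrix sums to one, so one entry per row is determined by the others, which leaves $|E|-r$ free parameters. Choosing the off-diagonal entries as the free ones is an arbitrary choice made here for concreteness:

```latex
P_{u}=\begin{bmatrix}
1-p_{u}(0,1) & p_{u}(0,1)\\
p_{u}(1,0) & 1-p_{u}(1,0)
\end{bmatrix},
\qquad |E|-r=4-2=2 .
```

In this example $\textbf{p}_{u}$ could be taken as $\left[p_{u}(0,1),p_{u}(1,0)\right]$, a vector of length 2.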

V-A Perfect Anonymity Analysis

The following theorem states that if $m$ or $l$ are significantly smaller than $n^{\frac{2}{|E|-r}}$ in this $r$ -state Markov chain model, then all users have perfect anonymity.

Theorem 5.

For the above $r$ -state Markov chain model, if Y is the anonymized version of X, and W is the past behavior of the users as defined above, and

•

at least one of $m$ or $l$ is less than or equal to $cn^{\frac{2}{|E|-r}-\alpha}$ for any $c,\alpha>0$ ;

then, user $1$ has perfect anonymity at time $k$ .

Proof.

Repeating the reasoning of the proof of Theorem 1 and then applying [15, Theorem 3] completes the proof. ∎

V-B No Privacy Analysis

The following theorem states that if both $m$ and $l$ are significantly larger than $n^{\frac{2}{|E|-r}}$ , then the adversary can find an algorithm to successfully estimate users’ data points with arbitrarily small error probability, and as a result, break users’ privacy.

Theorem 6.

For the above $r$ -state Markov chain model, if Y is the anonymized version of X, and W is the past behavior of the users as defined above, and

•

$m=cn^{\frac{2}{|E|-r}+\alpha}$ for any $c,\alpha>0$ ;

•

$l=c^{\prime}n^{\frac{2}{|E|-r}+\alpha}$ for any $c^{\prime},\alpha>0$ ;

then, user $1$ has no privacy at time $k$ .

Proof.

The proof of Theorem 6 is similar to the proof of Theorem 2, so we just provide the general idea. For each $u\in\{1,2,\cdots,n\}$ , we similarly define $\overline{\textbf{Y}_{u}}$ and $\overline{\textbf{W}_{u}}$ as vectors of length $|E|-r$ :

[TABLE]

We claim that for $m=cn^{\frac{2}{|E|-r}+\alpha}$ , $l=c^{\prime}n^{\frac{2}{|E|-r}+\alpha}$ , and large enough $n$ ,

1. $\mathbb{P}\left(\left\lvert\overline{\textbf{Y}_{\Pi(1)}}-\overline{\textbf{W}_{1}}\right\rvert\leq\Delta^{\prime\prime}_{n}\right)\to 1,$

2. $\mathbb{P}\left(\bigcup\limits_{u=2}^{n}\left\{\left\lvert\overline{\textbf{Y}_{\Pi(u)}}-\overline{\textbf{W}_{1}}\right\rvert\leq\Delta^{\prime\prime}_{n}\right\}\right)\to 0,$

where $\Delta^{\prime\prime}_{n}=n^{-\left(\frac{1}{|E|-r}+\frac{\alpha}{4}\right)}$ . ∎
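In the Markov chain case, the per-user statistic has to summarize transitions rather than single values. One natural choice, assumed here rather than taken from the paper, is the empirical transition frequency: the number of observed $i\to j$ transitions divided by the number of times state $i$ is left. The function name is hypothetical.

```python
from collections import Counter

def empirical_transitions(trace):
    """Estimate p(i, j) as (# of i->j transitions) / (# of transitions leaving i)."""
    pair_counts = Counter(zip(trace, trace[1:]))  # consecutive pairs (i, j)
    from_counts = Counter(trace[:-1])             # every state that has a successor
    return {(i, j): c / from_counts[i] for (i, j), c in pair_counts.items()}

# Toy trace over states {0, 1}: transitions 0->1, 1->0, 0->1, 1->1.
estimates = empirical_transitions([0, 1, 0, 1, 1])
```

A vector of length $|E|-r$ can then be formed by keeping, for each state, all but one of its outgoing transition estimates, and compared against the tolerance $\Delta^{\prime\prime}_{n}$ in the same way as in the i.i.d. proofs.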

VI Conclusion

In this paper, we have derived the theoretical bounds on user privacy in situations in which user traces are matchable to prior user behavior despite anonymization protection. In particular, the adversary employs statistical matching of the user traces to previous behavior of users within a network to compromise their privacy.

As shown in Figure 2, which displays the characterized privacy limits for the i.i.d. case, we demonstrated that the parameter plane, with coordinates length of learning set ($l$) and length of observed set ($m$), can be divided into two regions: in the first region, all users have perfect anonymity, and in the second region, no user has any privacy whatsoever. Specifically, we showed that if either $l$ or $m$ is significantly smaller than $n^{\frac{2}{r-1}}$, users have perfect anonymity and the adversary cannot identify the permutation function $\left(\mathbf{\Pi}\right)$, and, if both of them are significantly larger than $n^{\frac{2}{r-1}}$, users have no privacy. It is worth noting that in the case where the adversary has accurate prior information, which is discussed in [15, 16] and is shown in Figure 3, users have no privacy as long as the number of adversary observations per user, $m$, is significantly larger than $n^{\frac{2}{r-1}}$.

For the case where the users’ data points are governed by an irreducible and aperiodic $r$ -state Markov chain with $|E|$ edges, we demonstrated similar results: if either $l$ or $m$ is significantly smaller than $n^{\frac{2}{|E|-r}}$ , users have perfect anonymity, and, if both of them are significantly larger than $n^{\frac{2}{|E|-r}}$ , users have no privacy.
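The critical exponents from the six theorems can be collected in a small helper. The sketch below is only a convenience for comparing trace lengths against the thresholds; since the results are asymptotic in $n$, reading a regime off a single finite $n$ is a heuristic and not a guarantee. All function names are hypothetical.

```python
import math

def critical_exponent_iid(r):
    """Threshold exponent 2/(r-1) for the r-state i.i.d. model (Theorems 1 to 4)."""
    return 2.0 / (r - 1)

def critical_exponent_markov(num_edges, r):
    """Threshold exponent 2/(|E|-r) for the r-state Markov chain model (Theorems 5 and 6)."""
    return 2.0 / (num_edges - r)

def growth_exponent(length, n):
    """The exponent e such that length = n^e, i.e. log(length) / log(n)."""
    return math.log(length) / math.log(n)

# Two-state i.i.d. model: the threshold is n^2.
print(critical_exponent_iid(2))
# Fully connected two-state Markov chain (|E| = 4): the threshold is n^1.
print(critical_exponent_markov(4, 2))
```

Perfect anonymity corresponds to the smaller of `growth_exponent(m, n)` and `growth_exponent(l, n)` staying below the critical exponent by a fixed margin $\alpha>0$, and no privacy to both staying above it by such a margin.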

Bibliography

[1] J. Bausch. (2016) The internet of things forecast of 50 billion connected devices by 2020 is grossly over-estimated and entirely misleading. [Online]. Available: https://www.electronicproducts.com/Internet_of_Things/Research/The_Internet_of_Things_forecast_of_50_billion_connected_devices_by_2020_is_grossly_over_estimated_and_entirely_misleading.aspx
[2] Federal Trade Commission Staff, “Internet of things: Privacy and security in a connected world,” 2015.
[3] A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-privacy: To be private or not to be private,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). Toronto, ON, Canada: IEEE, 2014, pp. 123–124.
[4] S. Hosseinzadeh, S. Rauti, S. Hyrynsalmi, and V. Leppänen, “Security in the internet of things through obfuscation and diversification,” in IEEE Conference on Computing, Communication and Security (ICCCS). Pamplemousses, Mauritius: IEEE, 2015, pp. 1–5.
[5] B. Hoh and M. Gruteser, “Protecting location privacy through path confusion,” in First International Conference on Security and Privacy for Emerging Areas in Communications Networks (SecureComm). Pamplemousses, Mauritius: IEEE, 2005, pp. 194–205.
[6] J. Freudiger, M. Raya, M. Félegyházi, P. Papadimitratos, and J. P. Hubaux, “Mix-zones for location privacy in vehicular networks,” Vancouver, 2007.
[7] F. M. Naini, J. Unnikrishnan, P. Thiran, and M. Vetterli, “Where you are is who you are: User identification by matching statistics,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 2, pp. 358–372, 2016.
[8] R. Soltani, D. Goeckel, D. Towsley, and A. Houmansadr, “Towards provably invisible network flow fingerprints,” in 51st Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2017.
