Asymptotic Limits of Privacy in Bayesian Time Series Matching
Nazanin Takbiri, Dennis L. Goeckel, Amir Houmansadr, Hossein Pishro-Nik

TL;DR
This paper establishes theoretical bounds on user privacy in Bayesian time series matching, analyzing how anonymized data can still be vulnerable to re-identification through statistical matching, for i.i.d. and Markov models.
Contribution
It provides the first theoretical bounds on privacy loss in Bayesian time series matching, covering both i.i.d. and Markov-dependent data models.
Findings
Derived achievability and converse bounds for i.i.d. data traces.
Extended bounds to Markov chain data models.
Identified conditions under which privacy can be compromised.
Asymptotic Limits of Privacy in Bayesian Time Series Matching
Nazanin Takbiri, Electrical and Computer Engineering, UMass-Amherst
Dennis L. Goeckel, Electrical and Computer Engineering, UMass-Amherst
Amir Houmansadr, Information and Computer Sciences, UMass-Amherst
Hossein Pishro-Nik, Electrical and Computer Engineering, UMass-Amherst
This work was supported by the National Science Foundation under grants CCF–1421957 and CNS–1739462.
Abstract
Various modern and highly popular applications make use of user data traces in order to offer specific services, often for the purpose of improving the user’s experience while using such applications. However, even when user data is privatized by employing privacy-preserving mechanisms (PPM), users’ privacy may still be compromised by an external party who leverages statistical matching methods to match users’ traces with their previous activities. In this paper, we obtain the theoretical bounds on user privacy for situations in which user traces are matchable to sequences of prior behavior, despite anonymization of data time series. We provide both achievability and converse results for the case where the data trace of each user consists of independent and identically distributed (i.i.d.) random samples drawn from a multinomial distribution, as well as the case that the users’ data points are dependent over time and the data trace of each user is governed by a Markov chain model.
Index Terms:
Anonymization, information theoretic privacy, Internet of Things (IoT), Markov chain model, statistical matching, Privacy-Preserving Mechanism (PPM).
I Introduction
The Internet of Things (IoT) is an important emerging technology and is growing at a rapid pace: by 2020, over 50 billion devices will be connected together as part of the IoT network [1]. Environmental monitoring, infrastructure management, energy management, medical and healthcare systems, building and home automation, and transport systems are some examples which indicate that IoT devices will affect nearly every aspect of our daily lives. However, this ubiquity of impact also raises grave privacy concerns. In particular, each IoT user in each application generates a sequence of data that can be modeled as a random process; for example, in location-based services, each user generates location traces. These sequences of data in IoT systems often contain sensitive information about users, such as their locations, health information, and hobbies. As a result, the huge amount of data generated by IoT devices can critically damage users’ privacy, thereby posing a significant obstacle to the adoption of IoT applications. Thus, IoT privacy has drawn the attention of the research community [2, 3, 4] to investigate effective privacy-preserving mechanisms (PPMs).
PPMs are used to increase the assurance that private data is not accessible to third parties. Two promising classes of PPMs are identity perturbation and data perturbation [5, 6, 7, 8, 9, 10, 11, 12]. The identity perturbation technique or anonymization is the process of hiding the true identity of the data owner [5, 6, 7, 8, 9]. This technique removes personal identifiers or converts personally identifiable information into aggregated data. The data perturbation or obfuscation is the process of hiding the users’ data by adding noise [10, 11, 12]. However, perturbation techniques reduce utility to provide better privacy protection; thus, obtaining the optimum levels of anonymization and obfuscation is important.
In [7, 13], a comprehensive analysis of the asymptotic (in the length of the time series) optimal matching of time series to source distributions is presented in a non-Bayesian setting, where the number of users is a fixed, finite value. However, in [14, 15, 16, 17, 18, 19, 20], a Bayesian setting was adopted in which the adversary has accurate prior distributions for user behavior through past observations or other sources, and the asymptotic limits of user privacy were obtained.
In addition, Li et al. [21] provide an optimal hypothesis test in the case where the adversary has training sequences from the group of users rather than the exact probability distribution.
In this paper, we adopt the same setting as [21]; however, our work has a significantly different flavor from that of [21]. First, [21] finds the optimal test in the non-asymptotic regime where there exist two users, while here, the asymptotic limits of user privacy for the case of a large number of users are obtained. Second, [21] obtains the necessary conditions for breaking privacy, while here, conditions for both perfect anonymity and no privacy are obtained. Third, [21] establishes the optimal test for the case with binary alphabets where each user’s trace consists of independent and identically distributed (i.i.d.) samples drawn from a Bernoulli distribution, while here, we extend our results to the case where each user’s trace consists of i.i.d. random samples drawn from a multinoulli distribution. We also extend our results to a more general Markov chain model.
The remainder of this paper is organized as follows. Section II discusses the system model and the metrics used in the paper. Achievability and converse results for the two-state i.i.d. model are presented in Section III, and their extensions to the $r$-state i.i.d. model are presented in Section IV. In addition, achievability and converse results for a more general Markov chain model are presented in Section V. Section VI provides some final conclusions and directions for future work.
II Framework
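We assume a system with $n$ users. Each user $u$ creates a length-$m$ sequence of data, which is denoted by $\mathbf{X}_u$,
$$\mathbf{X}_u = [X_u(1), X_u(2), \cdots, X_u(m)]^T, \qquad \mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_n],$$
where $X_u(k)$ is the actual data point of user $u$ at time $k$. For each user, there also exists a length-$l$ sequence of its past behavior which is denoted as $\mathbf{W}_u$,
$$\mathbf{W}_u = [W_u(1), W_u(2), \cdots, W_u(l)]^T, \qquad \mathbf{W} = [\mathbf{W}_1, \mathbf{W}_2, \cdots, \mathbf{W}_n],$$
where $W_u(k)$ is the observation of the prior behavior of user $u$ at time $k$.
The adversary has access to the observations of the users’ prior behavior and wants to use this knowledge to break users’ privacy despite the usage of some PPMs. As shown in Figure 1, an anonymization technique is employed in order to perturb the users’ identity before the data is provided to the IoT application. In this figure, $Y_u(k)$ is the reported data point of user $u$ at time $k$ after applying anonymization; hence, the adversary observes
$$\mathbf{Y}_u = [Y_u(1), Y_u(2), \cdots, Y_u(m)]^T, \qquad \mathbf{Y} = [\mathbf{Y}_1, \mathbf{Y}_2, \cdots, \mathbf{Y}_n],$$
where $\mathbf{Y}$ is the permuted version of $\mathbf{X}$.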
II-A Models and Metrics
Data Points Model: We assume there exist $r$ possible values for each data point. As shown in Figure 1, there exist two traces for each user: one that is termed "training data" and one that is termed "actual data," which needs to be protected from a malicious adversary. Recall that these two traces are generated from the same unknown probability distribution. In other words, for $k \in \{1, 2, \cdots, m\}$ and $k' \in \{1, 2, \cdots, l\}$, both $X_u(k)$ and $W_u(k')$ are drawn from a user-specific probability distribution denoted as $\mathbf{p}_u$. While all $\mathbf{p}_u$’s are unknown to the adversary, each of them is drawn independently from a continuous density function $f_{\mathbf{P}}(\mathbf{x})$, where for all $\mathbf{x}$ in the support of $f_{\mathbf{P}}$, we assume
$$0 < \delta_1 \leq f_{\mathbf{P}}(\mathbf{x}) \leq \delta_2 < \infty.$$
Anonymization Mechanism: As shown in Figure 1, the mapping between users and data sequences is randomly permuted in order to achieve privacy. This random permutation $\Pi$ is chosen uniformly at random among all $n!$ possible permutations on the set of $n$ users; then, $\mathbf{Y}_u = \mathbf{X}_{\Pi^{-1}(u)}$, $\mathbf{Y}_{\Pi(u)} = \mathbf{X}_u$.
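The anonymization step can be sketched in a few lines of Python. This is a minimal illustration, assuming traces are stored as a list indexed by user and that the trace of a user is reported under the pseudonym the permutation assigns to that user; the function name `anonymize` and the toy trace values are hypothetical and not taken from the paper.

```python
import random


def anonymize(traces, rng):
    """Report user traces under uniformly random pseudonyms.

    traces[u] is the data trace of user u. Returns (reported, perm), where
    perm[u] is the pseudonym assigned to user u and reported[perm[u]] is a
    copy of traces[u].
    """
    n = len(traces)
    perm = list(range(n))
    rng.shuffle(perm)  # Fisher-Yates shuffle: uniform over all n! permutations
    reported = [None] * n
    for user, trace in enumerate(traces):
        reported[perm[user]] = list(trace)
    return reported, perm


# Toy example with n = 3 users and traces of length 4 (values are made up).
actual = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
reported, perm = anonymize(actual, random.Random(7))
```

The adversary is assumed to see `reported` but not `perm`; recovering `perm` is exactly the matching problem studied in the rest of the paper.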
Adversary Model: The adversary tries to match each sequence in the collection of training data traces with the sequence in the observation data traces that is drawn from the same probability distribution, which we term statistical matching. This is equivalent to finding the permutations of the user identities between two collections. Note that the adversary knows the anonymization mechanism; however, he/she does not know the realization of the random permutation function.
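To make statistical matching concrete, the following sketch pairs each training trace with the reported trace whose empirical mean is closest. It is a simplified nearest-mean rule for binary data, not the test analyzed in the proofs, and the user probabilities, trace lengths, and permutation below are made-up values chosen so that the users are well separated.

```python
import random


def empirical_mean(trace):
    return sum(trace) / len(trace)


def match_by_empirical_mean(training, reported):
    """For each user, guess the pseudonym whose trace has the closest mean."""
    reported_means = [empirical_mean(y) for y in reported]
    guesses = []
    for w in training:
        target = empirical_mean(w)
        guesses.append(min(range(len(reported)),
                           key=lambda j: abs(reported_means[j] - target)))
    return guesses


def bernoulli_trace(p, length, rng):
    return [1 if rng.random() < p else 0 for _ in range(length)]


rng = random.Random(1)
probs = [0.1, 0.3, 0.5, 0.7, 0.9]   # hypothetical user parameters
perm = [3, 0, 4, 1, 2]              # pseudonym of each user, hidden from the adversary
training = [bernoulli_trace(p, 2000, rng) for p in probs]
actual = [bernoulli_trace(p, 2000, rng) for p in probs]
reported = [None] * len(probs)
for user, pseudonym in enumerate(perm):
    reported[pseudonym] = actual[user]

guesses = match_by_empirical_mean(training, reported)
```

With only five widely spaced users and 2000 samples per trace, the guesses coincide with the hidden permutation. The asymptotic question is how long the traces must be when the number of users grows and their distributions crowd together.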
Following [17], the definition of no privacy is as follows:
Definition 1.
For an algorithm of the adversary that tries to estimate the actual data point of user $u$ at time $k$, define the error probability as
$$P_e(u,k) \triangleq P\left(\widetilde{X_u(k)} \neq X_u(k)\right),$$
where $X_u(k)$ is the actual data point of user $u$ at time $k$, and $\widetilde{X_u(k)}$ is the adversary’s estimated data point of user $u$ at time $k$. Now, define $\mathcal{E}$ as the set of all possible estimators of the adversary. Then, user $u$ has no privacy at time $k$, if and only if for large enough $n$,
$$P_e^{*}(u,k) \triangleq \inf_{\mathcal{E}} P\left(\widetilde{X_u(k)} \neq X_u(k)\right) \rightarrow 0.$$
Hence, a user has no privacy if there exists an algorithm for the adversary to estimate $X_u(k)$ with diminishing error probability as $n$ goes to infinity.
In this paper, we also consider the situation in which there is perfect anonymity.
Definition 2.
User $u$ has perfect anonymity at time $k$ if and only if
[TABLE]
where $H(X_u(k) \mid \mathbf{W}, \mathbf{Y})$ is the entropy of $X_u(k)$ given $\mathbf{W}$ and $\mathbf{Y}$.
III Two-State i.i.d. Model
In this section, we assume each user’s trace consists of samples from an i.i.d. random process and there are only two possible values for each user data point ($r = 2$). Thus, both training traces and real data traces are governed by an i.i.d. Bernoulli distribution with parameter $p_u$, where $p_u$ is the probability that a data point of user $u$ takes the value 1; hence,
$$W_u(k) \sim \mathrm{Bernoulli}(p_u),$$
and
$$X_u(k) \sim \mathrm{Bernoulli}(p_u).$$
As discussed in Section II, while $p_u$’s are unknown to the adversary, they are drawn independently from a known continuous density function ($f_P(x)$), where for all $x \in (0,1)$, we have
$$0 < \delta_1 \leq f_P(x) \leq \delta_2 < \infty.$$
III-A Perfect Anonymity Analysis
The following theorem states that if the observed-trace length $m$ or the training-trace length $l$ is significantly smaller than $n^2$ in this two-state model, then all users have perfect anonymity.
Theorem 1.
For the above two-state i.i.d. model, if $\mathbf{Y}$ is the anonymized version of $\mathbf{X}$, and $\mathbf{W}$ is the past behavior of the users as defined above, and
- at least one of $m$ or $l$ is less than or equal to $n^{2-\alpha}$ for any $\alpha > 0$;
then, user $u$ has perfect anonymity at time $k$.
Proof.
First, consider the case . Here, W is considered as the training set and Y is considered as the observed set; thus, given Y, forms a Markov chain. According to the data processing inequality,
[TABLE]
thus,
[TABLE]
and
[TABLE]
In [15, Theorem 1], it is shown that if , , so, we can conclude
[TABLE]
as .
Now, consider the case . By symmetry of the problem Y can be considered as the training set and W can be considered as the observed data. Thus, we can similarly prove the same results. ∎
III-B No Privacy Analysis
The following theorem states that if both the observed-trace length $m$ and the training-trace length $l$ are significantly larger than $n^2$ in this two-state model, then the adversary can find an algorithm to successfully estimate users’ data points with arbitrarily small error probability, and as a result break users’ privacy.
Theorem 2.
For the above two-state i.i.d. model, if $\mathbf{Y}$ is the anonymized version of $\mathbf{X}$, and $\mathbf{W}$ is the past behavior of the users as defined above, and
- $m = cn^{2+\alpha}$ for any $c, \alpha > 0$;
- $l = c'n^{2+\alpha'}$ for any $c', \alpha' > 0$;
then, user $u$ has no privacy at time $k$.
Proof.
For , define
[TABLE]
[TABLE]
and
[TABLE]
We claim that for , and large enough :
2. 2.
where . Thus, the adversary can match to .
First Step: We want to show
[TABLE]
Note , so as ,
[TABLE]
where the first inequality follows from the fact that , and as a result, . The union bound yields the third inequality, and the fourth inequality follows from Chernoff bounds. Now, for u=1, we have
[TABLE]
as
Second Step: First, we show as ,
[TABLE]
According to (1), for all , we have
[TABLE]
and according to the union bound,
[TABLE]
as . Thus, for , the distance between and is bigger than with high probability.
Next, we show as ,
[TABLE]
Note for all , Chernoff bounds yields:
[TABLE]
As a result, for , we have
[TABLE]
as . In other words, with high probability, the distance between and is less than .
Now, given the fact that the distance between all ’s and is bigger than , and the fact that the distance between and is less than , for all , we have
[TABLE]
Thus,
[TABLE]
as .
Now, we claim that given the fact that the distances between each of the ’s and are bigger than , we have
[TABLE]
Note, using (2), we have
[TABLE]
Thus, by using union bound, we have
[TABLE]
as .
After completing the first and second steps, we can conclude that if $m = cn^{2+\alpha}$ and $l = c'n^{2+\alpha'}$, users have no privacy as $n \rightarrow \infty$. ∎
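The role of the concentration bounds can be illustrated numerically. The sketch below uses Hoeffding's inequality, $P(|\overline{W} - p| \geq \Delta) \leq 2e^{-2L\Delta^2}$ for the mean of $L$ Bernoulli samples, as a stand-in for the Chernoff bounds in the proof, combined with a union bound over $n$ users. The tolerance $\Delta = n^{-(1+\alpha/4)}$ and the function names are assumptions made for illustration only; the constants in the proof may differ. The intuition is that $n$ parameters drawn from a bounded density on $(0,1)$ are spaced roughly $1/n$ apart, so the empirical means must be accurate to within about $1/n$.

```python
import math


def hoeffding_union_bound(n, length, delta):
    """Upper bound on the probability that at least one of n empirical means,
    each computed from `length` samples in [0, 1], deviates from its true
    mean by delta or more."""
    return min(1.0, 2.0 * n * math.exp(-2.0 * length * delta ** 2))


def bound_at_scaling(n, alpha):
    """Evaluate the bound for trace length n**(2 + alpha) and the assumed
    tolerance delta = n**-(1 + alpha / 4), so length * delta**2 = n**(alpha / 2)."""
    length = math.ceil(n ** (2.0 + alpha))
    delta = n ** -(1.0 + alpha / 4.0)
    return hoeffding_union_bound(n, length, delta)


above_small = bound_at_scaling(100, 1.0)      # n = 100, length = n**3
above_large = bound_at_scaling(10_000, 1.0)   # n = 10**4, length = n**3
below = hoeffding_union_bound(100, 100, 0.01) # length = n, tolerance = 1/n
```

Above the $n^2$ scale the product of trace length and squared tolerance grows with $n$, so the bound shrinks as $n$ increases. With a trace length of only $n$ and a tolerance of $1/n$, the bound saturates at 1 and gives no guarantee.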
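IV $r$-State i.i.d. Model
In this section, we assume each user’s trace consists of samples from an i.i.d. random process, and users’ data points can take $r$ possible values, where $r \geq 2$. Thus, both training traces and real data traces are governed by an i.i.d. multinoulli distribution with parameter $\mathbf{p}_u$, and
$$p_u(i) = P(X_u(k) = i), \qquad \mathbf{p}_u = [p_u(1), p_u(2), \cdots, p_u(r-1)]^T,$$
where $p_u(i)$ is the probability that a datum of user $u$ has value $i$.
As discussed in Section II, while $\mathbf{p}_u$’s are unknown to the adversary, they are drawn independently from a known continuous density function $f_{\mathbf{P}}(\mathbf{x})$, where for all $\mathbf{x} \in R_{\mathbf{P}}$,
$$R_{\mathbf{P}} = \left\{(x_1, x_2, \cdots, x_{r-1}) \in (0,1)^{r-1} : x_i > 0,\ x_1 + x_2 + \cdots + x_{r-1} < 1\right\},$$
we have
$$0 < \delta_1 \leq f_{\mathbf{P}}(\mathbf{x}) \leq \delta_2 < \infty.$$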
IV-A Perfect Anonymity Analysis
The following theorem states that if the observed-trace length $m$ or the training-trace length $l$ is significantly smaller than $n^{\frac{2}{r-1}}$ in this $r$-state model, then all users have perfect anonymity.
Theorem 3.
For the above $r$-state i.i.d. model, if $\mathbf{Y}$ is the anonymized version of $\mathbf{X}$, and $\mathbf{W}$ is the past behavior of the users as defined above, and
- at least one of $m$ or $l$ is less than or equal to $n^{\frac{2}{r-1}-\alpha}$ for any $\alpha > 0$;
then, user $u$ has perfect anonymity at time $k$.
Proof.
We can now repeat the similar reasoning as Theorem 1; then, by using [15, Theorem 2], the proof is complete. ∎
IV-B No Privacy Analysis
The following theorem states that if both the observed-trace length $m$ and the training-trace length $l$ are significantly larger than $n^{\frac{2}{r-1}}$ in this $r$-state model, then the adversary can find an algorithm to successfully estimate users’ data points with arbitrarily small error probability, and as a result break users’ privacy.
Theorem 4.
For the above $r$-state i.i.d. model, if $\mathbf{Y}$ is the anonymized version of $\mathbf{X}$, and $\mathbf{W}$ is the past behavior of the users as defined above, and
- $m = cn^{\frac{2}{r-1}+\alpha}$ for any $c, \alpha > 0$;
- $l = c'n^{\frac{2}{r-1}+\alpha'}$ for any $c', \alpha' > 0$;
then, user $u$ has no privacy at time $k$.
Proof.
The proof of Theorem 4 is similar to the proof of Theorem 2, so we just provide the general idea. We similarly define the empirical probability that the user with pseudonym has data sample as follows:
[TABLE]
and
[TABLE]
We also have
[TABLE]
The difference from the proof of Theorem 2 is that, for each , and are vectors of length . In other words,
[TABLE]
[TABLE]
and we claim for , , and large enough ,
2. 2.
where . ∎
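For the multi-valued case, the empirical quantities become frequency vectors rather than scalar means. The sketch below computes such vectors and matches traces by the largest componentwise gap. It is an illustration only: it keeps all $r$ components for readability (one of them is redundant because the frequencies sum to one), it uses a nearest-vector rule rather than the thresholds of the proof, and the tiny traces are made up.

```python
def empirical_distribution(trace, r):
    """Relative frequency of each value 0..r-1 in the trace."""
    counts = [0] * r
    for value in trace:
        counts[value] += 1
    return [c / len(trace) for c in counts]


def max_abs_distance(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


def match_by_frequency_vector(training, reported, r):
    """For each user, guess the pseudonym with the closest frequency vector."""
    reported_freqs = [empirical_distribution(y, r) for y in reported]
    guesses = []
    for w in training:
        target = empirical_distribution(w, r)
        guesses.append(min(
            range(len(reported)),
            key=lambda j: max_abs_distance(reported_freqs[j], target)))
    return guesses


# Three users, r = 3 possible values, hand-made traces of length 4.
training = [[0, 0, 1, 2], [2, 2, 2, 1], [1, 1, 0, 1]]
reported = [[1, 1, 1, 0], [0, 1, 0, 2], [2, 1, 2, 2]]
guesses = match_by_frequency_vector(training, reported, 3)
```

Here user 0 is matched to pseudonym 1, user 1 to pseudonym 2, and user 2 to pseudonym 0, because those reported traces have exactly the same value frequencies as the corresponding training traces.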
V $r$-State Markov Chain Model
In Sections III and IV, the data trace of each user is governed by an i.i.d. random process, while here the data trace of each user is governed by an irreducible and aperiodic $r$-state Markov chain, where $E$ is the set of edges. Let us define the transition probability from state $i$ to state $j$ as:
$$p_u(i,j) = P\left(X_u(k+1) = j \mid X_u(k) = i\right),$$
thus, $p_u(i,j) > 0$ if and only if $(i,j) \in E$.
Here, we assume the same Markov chain structure for all of the users, but different users have different transition matrices. Note that a subset of the transition probabilities with size $|E| - r$ is sufficient for recovering the whole transition matrix. Let this subset be called $\mathbf{p}_u$, so
[TABLE]
where is the probability that a datum of user has value . As discussed in Section II, while ’s are unknown to the adversary, they are drawn independently from a known continuous density function , where for all ,
[TABLE]
we have
[TABLE]
V-A Perfect Anonymity Analysis
The following theorem states that if the observed-trace length $m$ or the training-trace length $l$ is significantly smaller than $n^{\frac{2}{|E|-r}}$ in this $r$-state Markov chain model, then all users have perfect anonymity.
Theorem 5.
For the above $r$-state Markov chain model, if $\mathbf{Y}$ is the anonymized version of $\mathbf{X}$, and $\mathbf{W}$ is the past behavior of the users as defined above, and
- at least one of $m$ or $l$ is less than or equal to $n^{\frac{2}{|E|-r}-\alpha}$ for any $\alpha > 0$;
then, user $u$ has perfect anonymity at time $k$.
Proof.
We can now repeat the similar reasoning as Theorem 1; then, by using [15, Theorem 3], the proof is complete. ∎
V-B No Privacy Analysis
The following theorem states that if both the observed-trace length $m$ and the training-trace length $l$ are significantly larger than $n^{\frac{2}{|E|-r}}$, then the adversary can find an algorithm to successfully estimate users’ data points with arbitrarily small error probability, and as a result, break users’ privacy.
Theorem 6.
For the above $r$-state Markov chain model, if $\mathbf{Y}$ is the anonymized version of $\mathbf{X}$, and $\mathbf{W}$ is the past behavior of the users as defined above, and
- $m = cn^{\frac{2}{|E|-r}+\alpha}$ for any $c, \alpha > 0$;
- $l = c'n^{\frac{2}{|E|-r}+\alpha'}$ for any $c', \alpha' > 0$;
then, user $u$ has no privacy at time $k$.
Proof.
The proof of Theorem 6 is similar to the proof of Theorem 2, so we just provide the general idea. For each , we similarly define and as vectors of length :
[TABLE]
[TABLE]
We claim that for , , and large enough ,
2. 2.
where . ∎
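In the Markov setting the natural empirical quantity is the fraction of times each transition is taken. The sketch below estimates a transition matrix from a single trace by counting consecutive pairs; the function name and the example trace are hypothetical, and the estimator is the standard count-based one rather than a construction quoted from the proof.

```python
def empirical_transition_matrix(trace, r):
    """Estimate P(next = j | current = i) from consecutive pairs in a trace
    whose values lie in 0..r-1."""
    counts = [[0] * r for _ in range(r)]
    for current, nxt in zip(trace, trace[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left within the trace yields an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Two-state example: transitions are 0->1, 1->0, 0->1, 1->1, 1->0.
trace = [0, 1, 0, 1, 1, 0]
estimate = empirical_transition_matrix(trace, 2)
```

For this trace the estimate is row 0 = [0, 1] and row 1 = [2/3, 1/3]. Matching users then amounts to comparing such estimates from the training and reported traces, restricted to the free transition probabilities.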
VI Conclusion
In this paper, we have derived the theoretical bounds on user privacy in situations in which user traces are matchable to prior user behavior despite anonymization protection. In particular, the adversary employs statistical matching of the user traces to previous behavior of users within a network to compromise their privacy.
As shown in Figure 2, which displays the characterized privacy limits for the i.i.d. case, we demonstrated that the parameter plane, with coordinates length of learning set ($l$) and length of observed set ($m$), can be divided into two regions: in the first region, all users have perfect anonymity and in the second region no user has any privacy whatsoever. Specifically, we showed that if either $m$ or $l$ is significantly smaller than $n^{\frac{2}{r-1}}$, users have perfect anonymity and the adversary cannot identify the permutation function $\Pi$, and, if both of them are significantly larger than $n^{\frac{2}{r-1}}$, users have no privacy. It is worth noting that in the case where the adversary has accurate prior information, which is discussed in [15, 16] and is shown in Figure 3, users have no privacy as long as the number of adversary observations per user is larger than $n^{\frac{2}{r-1}}$.
For the case where the users’ data points are governed by an irreducible and aperiodic $r$-state Markov chain with $|E|$ edges, we demonstrated similar results: if either $m$ or $l$ is significantly smaller than $n^{\frac{2}{|E|-r}}$, users have perfect anonymity, and, if both of them are significantly larger than $n^{\frac{2}{|E|-r}}$, users have no privacy.
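As a small worked example of how the threshold depends on the model, the sketch below computes the exponent $e$ in a threshold of the form $n^{e}$, assuming the thresholds are $n^{2/(r-1)}$ for the $r$-state i.i.d. model and $n^{2/(|E|-r)}$ for the Markov chain model. In both cases the denominator is the number of free parameters per user. The function names are hypothetical.

```python
from fractions import Fraction


def iid_threshold_exponent(r):
    """Exponent e of the n**e threshold for the r-state i.i.d. model."""
    if r < 2:
        raise ValueError("the model needs at least two states")
    return Fraction(2, r - 1)


def markov_threshold_exponent(num_edges, r):
    """Exponent e of the n**e threshold for an r-state Markov chain with
    num_edges allowed transitions."""
    free_parameters = num_edges - r
    if free_parameters < 1:
        raise ValueError("the chain needs at least one free transition probability")
    return Fraction(2, free_parameters)


two_state_iid = iid_threshold_exponent(2)          # threshold n**2
three_state_iid = iid_threshold_exponent(3)        # threshold n
two_state_chain = markov_threshold_exponent(4, 2)  # fully connected, threshold n
three_state_chain = markov_threshold_exponent(9, 3)
```

More free parameters per user lower the exponent: a fully connected three-state chain has six free transition probabilities, giving an exponent of 1/3, so far shorter traces suffice to distinguish users than in the two-state i.i.d. model.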
References
- [1] J. Bausch. (2016) The internet of things forecast of 50 billion connected devices by 2020 is grossly over-estimated and entirely misleading. [Online]. Available: https://www.electronicproducts.com/Internet_of_Things/Research/The_Internet_of_Things_forecast_of_50_billion_connected_devices_by_2020_is_grossly_over_estimated_and_entirely_misleading.aspx
- [2] Federal Trade Commission Staff, “Internet of things: Privacy and security in a connected world,” 2015.
- [3] A. Ukil, S. Bandyopadhyay, and A. Pal, “IoT-privacy: To be private or not to be private,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). Toronto, ON, Canada: IEEE, 2014, pp. 123–124.
- [4] S. Hosseinzadeh, S. Rauti, S. Hyrynsalmi, and V. Leppänen, “Security in the internet of things through obfuscation and diversification,” in IEEE Conference on Computing, Communication and Security (ICCCS). Pamplemousses, Mauritius: IEEE, 2015, pp. 1–5.
- [5] B. Hoh and M. Gruteser, “Protecting location privacy through path confusion,” in First International Conference on Security and Privacy for Emerging Areas in Communications Networks (SecureComm). Pamplemousses, Mauritius: IEEE, 2005, pp. 194–205.
- [6] J. Freudiger, M. Raya, M. Félegyházi, P. Papadimitratos, and J. P. Hubaux, “Mix-zones for location privacy in vehicular networks,” Vancouver, 2007.
- [7] F. M. Naini, J. Unnikrishnan, P. Thiran, and M. Vetterli, “Where you are is who you are: User identification by matching statistics,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 2, pp. 358–372, 2016.
- [8] R. Soltani, D. Goeckel, D. Towsley, and A. Houmansadr, “Towards provably invisible network flow fingerprints,” in 51st Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2017.
