FDI: Quantifying Feature-based Data Inferability
Shouling Ji, Haiqin Weng, Yiming Wu, Qinming He, Raheem Beyah, Ting Wang

TL;DR
This paper introduces a method to quantify feature-based data inferability, analyzing conditions for successful user identification in security contexts and evaluating implications for privacy and security systems.
Contribution
It provides a novel quantification framework for feature-based data inferability under different data models and applies it to real-world security scenarios.
Findings
Explicit conditions for Top-K inferability
Evaluation of user inferability in network forensics
Implications for privacy-preserving inference systems
Abstract
Motivated by many existing security and privacy applications, e.g., network traffic attribution, linkage attacks, private web search, and feature-based data de-anonymization, in this paper, we study the Feature-based Data Inferability (FDI) quantification problem. First, we conduct the FDI quantification under both naive and general data models from both a feature distance perspective and a feature distribution perspective. Our quantification explicitly shows the conditions to have a desired fraction of the target users to be Top-K inferable (K is an integer parameter). Then, based on our quantification, we evaluate the user inferability in two cases: network traffic attribution in network forensics and feature-based data de-anonymization. Finally, based on the quantification and evaluation, we discuss the implications of this research for existing feature-based inference systems.
| Dataset | # of users | # of features | # of user-feature relationships |
|---|---|---|---|
| Apr-Domain | 5,888 | 290,537 | 3,968,361 |
| Apr-Path | 5,888 | 1,685,439 | 17,389,051 |
| July-Domain | 5,610 | 391,290 | 3,739,246 |
| July-Path | 5,610 | 1,855,415 | 16,010,442 |
| Oct-Domain | 5,268 | 270,604 | 3,868,538 |
| Oct-Path | 5,268 | 1,741,781 | 16,895,932 |
| Dec-Domain | 5,699 | 298,490 | 3,736,956 |
| Dec-Path | 5,699 | 2,159,448 | 16,926,145 |
| Dataset | # of users | # of features | # of user-feature relationships |
|---|---|---|---|
| Google+ | 107,614 | 19,044 | 387,261 |
| | 4,039 | 1,283 | 37,257 |
| Twitter | 81,306 | 216,839 | 1,245,234 |
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Privacy, Security, and Data Protection · Internet Traffic Analysis and Secure E-voting
FDI: Quantifying Feature-based Data Inferability
Shouling Ji*†,‡, Haiqin Weng†, Yiming Wu†, Qinming He†, Raheem Beyah‡*, Ting Wang♯
† Zhejiang University
‡ Georgia Institute of Technology
♮ Huazhong University of Science and Technology
♯ Lehigh University
Abstract
Motivated by many existing security and privacy applications, e.g., network traffic attribution, linkage attacks, private web search, and feature-based data de-anonymization, in this paper, we study the Feature-based Data Inferability (FDI) quantification problem. First, we conduct the FDI quantification under both naive and general data models from both a feature distance perspective and a feature distribution perspective. Our quantification explicitly shows the conditions to have a desired fraction of the target users to be Top-K inferable (K is an integer parameter). Then, based on our quantification, we evaluate the user inferability in two cases: network traffic attribution in network forensics and feature-based data de-anonymization. Finally, based on the quantification and evaluation, we discuss the implications of this research for existing feature-based inference systems.
I Introduction
Many existing security and privacy applications/techniques can be characterized as feature-based inference systems, e.g., network traffic attribution in network forensic applications, private web search, and feature-based data de-anonymization [1]-[8]. To conduct network traffic attribution, a network traffic attribution system is usually first learned from the features extracted from historical network traces. Later, when new network traffic arrives, features are first extracted from the new traffic, and the system then automatically attributes the data to the users who generated them based on those features (as shown in Fig.1) [1]. In fact, the network traffic attribution system can be directly considered as a feature-based inference system, where the system is first learned from the historical/training data (specifically, the features of the historical/training data) and then used to infer the new data (in this scenario, the users who generate the new traffic) based on their features (as shown in Fig.2). Another example is the code stylometry-based de-anonymization attack on programmers proposed in [6]. In this kind of attack, the code stylometry features of training programs are first extracted to train a de-anonymization model. This model can then be used to de-anonymize the programmers of the target programs based on their code stylometry features. In this example, the code stylometry-based de-anonymization model can also be considered as a feature-based inference system that infers (de-anonymizes) target data (the programmers of the target programs).
This raises two interesting questions: how can we quantify the performance of these feature-based inference systems for security and privacy applications, and how does the performance of existing feature-based inference techniques compare to the inherent theoretical performance bound? Answering these questions is important for accurately evaluating and understanding the performance of existing feature-based inference systems/techniques and for developing improved ones. Unfortunately, although many feature-based inference systems/techniques already exist for various security and privacy applications, the answers to these questions remain unclear. Therefore, to address these open problems, in this paper, we study the Feature-based Data Inferability (FDI) quantification for existing feature-based inference systems/techniques in various security and privacy applications. In particular, we make the following contributions.
- We first quantify the FDI under a naive data model, where each user-feature relationship is characterized by a binary function (a user either has a feature or does not have it). Under the naive model, we quantify the conditions for a target dataset to be -inferable, i.e., to have target users to be Top-K inferable, where is a parameter in , is the number of overlapped users between the training data of the inference model and the target data (thus, is the number of users that can be correctly Top-K inferred), and K is an integer specifying the desired inference accuracy.
- Subsequently, we extend our FDI quantification to a general data model. Under the general data model, we quantify the FDI from both the feature distance perspective and the feature distribution perspective to have a target dataset to be -inferable. Our quantification in the general scenarios answers the open problems raised above and, to the best of our knowledge, provides for the first time a theoretical foundation for existing feature-based inference systems in various security and privacy applications.
- Based on our FDI quantification, we conduct a large-scale evaluation leveraging real-world data. Specifically, we evaluate the user inferability in two cases: network traffic attribution in network forensics and feature-based data de-anonymization. We explicitly demonstrate the -inferability of users in these two cases and analyze the reasons.
- Based on our quantification and evaluation, we discuss the implications of this paper for practical feature-based inference systems/techniques. We also point out future research directions.
The rest of this paper is organized as follows. In Section II, we describe the motivating applications and formalize the problem. In Section III, we quantify the FDI under both naive and general data models. In Section IV, we evaluate the FDI in two scenarios. We provide further discussion in Section V. In Section VI, we summarize the related work, and we conclude the paper in Section VII.
II Problem Formalization
In this section, we formalize the studied problem. To make the problem easy to understand and to further motivate our research, we start by introducing motivating examples to which our study applies.
II-A Motivation Examples
In this paper, we study data’s feature-based inferability. Our study is motivated by several existing security and privacy applications, e.g., network traffic attribution in network security forensics [1][2][3], linkage attacks and private web search [4][5], and data de-anonymization [6][7][8].
Network traffic attribution is one of the fundamental issues in network security forensics, under which users, who are responsible for the observed activities and behaviors on network interfaces, are inferred [1][3]. Taking the network traffic attribution system Kaleido proposed in [1] and shown in Fig.1 as an example, a typical network traffic attribution system works as follows: ①, based on the historical network traces, a set of features (corresponding to each user) are extracted; ②, a learning model is designed to learn a discriminant model based on the features of historical network traces, which is used for network traffic attribution and/or new user (could be an intruder) identification; ③, when new network traffic comes, the features of the new network traffic are extracted; and ④, taking the features of the new network traffic as input, the discriminant model either attributes the traffic to a set of candidate users or concludes that the traffic is generated by a new user (a set of new users).
Web searching is one of the most fundamental computer applications, by which users obtain desired knowledge and/or find websites of interest. Intuitively, users’ web search traces carry users’ interests and intents. Therefore, potential adversaries (e.g., eavesdroppers) may design linkage attacks and exploit users’ web search traces to infer users’ profiles and other sensitive information [4][5]. The key idea of a linkage attack is that (1) an adversary first learns a linkage function based on the features of target users’ historical web search data and then (2) determines whether newly generated web search data/events belong to the target users. To defend against the linkage attack in web search applications, several obfuscation mechanisms have been proposed for private web search [4][5]. The basic idea is to obfuscate users’ web search data by adding noise, i.e., obfuscating the features of users’ web search data such that the linkage attack cannot effectively infer the generator of the data.
Our study in this paper is also motivated by existing feature-based de-anonymization attacks and techniques, e.g., programmer de-anonymization [6], authorship attribution in underground forums and multi-author detection [7], and movie rating data de-anonymization [8]. In these de-anonymization attacks/techniques, a feature-based de-anonymization model is first learned from a training dataset. Subsequently, newly arriving data (generated by an existing user or a new user) are de-anonymized by the de-anonymization model based on the data’s features.
Mathematically, all the aforementioned security and privacy applications can be reduced to a simple yet general system as shown in Fig.2: ①, a model is learned based on the features of historical data; ②, the target data are input to the model; and ③, inferences, e.g., candidate users who generate the data and/or identified new users, are concluded based on the results of the learned model. Now, after observing the success of the aforementioned security and privacy applications [1]-[8], e.g., Kaleido is able to identify the responsible users with over accuracy, two interesting questions arise: why are these techniques/attacks successful, and, given the target data, how can we determine the performance of these techniques/attacks relative to the intrinsic inferability of the target data, e.g., how good is the accuracy of Kaleido, and is it possible to achieve better accuracy than ? To answer these two questions, we study the intrinsic inferability of the target data given the historical data (training data). Therefore, our research in this paper can serve as the theoretical foundation of the aforementioned security and privacy applications. Furthermore, our quantification enables the development of a tool to evaluate the relative performance of the aforementioned techniques/attacks and guides the development of future research (as discussed in Section V).
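To make the three-step system above concrete, the following is a minimal sketch and not the method of any cited system. Everything in it is an assumption introduced for illustration: the frequency-based extractor `extract_features`, the L1 distance, the `new_user_threshold` cutoff, and the `"<new user>"` marker.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List

FeatureVector = Dict[Hashable, float]
NEW_USER = "<new user>"  # hypothetical marker for "not seen in training"


def extract_features(events: Iterable[Hashable]) -> FeatureVector:
    """Hypothetical extractor: relative frequency of each observed item."""
    counts = Counter(events)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()} if total else {}


def l1_distance(a: FeatureVector, b: FeatureVector) -> float:
    """L1 distance between two sparse feature vectors."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def learn_model(history: Dict[str, List[Hashable]]) -> Dict[str, FeatureVector]:
    """Step 1: learn a (trivial) model, here one feature vector per known user."""
    return {user: extract_features(events) for user, events in history.items()}


def infer(model: Dict[str, FeatureVector], target_events: Iterable[Hashable],
          k: int = 1, new_user_threshold: float = 1.5) -> List[str]:
    """Steps 2 and 3: extract target features, then return up to k candidates,
    or the new-user marker when even the best candidate is too far away."""
    target = extract_features(target_events)
    ranked = sorted(model, key=lambda u: (l1_distance(model[u], target), u))
    best = ranked[:k]
    if not best or l1_distance(model[best[0]], target) > new_user_threshold:
        return [NEW_USER]
    return best


history = {"alice": ["a.com", "a.com", "b.com"], "bob": ["c.com", "d.com"]}
model = learn_model(history)
candidates = infer(model, ["a.com", "b.com"])
```

The threshold plays the role of the "new user" decision in step ③; in a real system both the distance and the cutoff would be learned from the training data.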
II-B Problem Formalization and Models
Now, we formalize the studied problem. During the formalization, the basic principle is to make the problem sufficiently general and meanwhile mathematically tractable.
We denote the training data (e.g., the historical data in the network traffic attribution scenario) as . Since we do not distinguish a user and the data generated by that user, we assume consists of users (or the data generated by users), and further assume , where is a user (or the data generated by a user). For , it represents a user or the data generated by a user depending on the context. To model the feature extraction process (as shown in Fig.1 and Fig.2), we assume there is a feature extraction mechanism (footnote: in practice, could be any specific feature extraction mechanism, e.g., the ones in [1]-[8]), where denotes some particular feature function and is the dimension of the feature space. Applying to , we can get the features of , denoted by set . In this paper, we focus on the scenario that is a finite set, i.e., is some finite value (footnote: with this assumption, the studied problem is still sufficiently general to be applied to many existing security and privacy applications. For instance, in network security forensics [1][2][3], linkage attacks and private web search [4][5], and data de-anonymization [6][7][8], the extracted features of the training data can be modeled by a finite set). Specifically, for , its features with respect to are denoted by vector , where denotes the feature of with respect to the feature function .
Similar to formalizing the training data and taking account of the security and privacy applications ([1]-[8]), we denote the target data by , where is a user (or the data generated by a target user) in the target data and is the number of users in the target data. As shown in Fig.1 and Fig.2 ([1]-[8]), before inferring the users in , we apply the same to extract the features of denoted by , which is again assumed to be a finite set. For , its features with respect to are denoted by vector , where denotes the feature of with respect to the feature function . After having , the task now is to infer the users in using an inference model (e.g., the network traffic discriminant model as shown in Fig.1).
Based on the aforementioned definitions, the studied problem in this paper can be formalized as follows:
Definition II.1.
Feature-based Data Inferability (FDI). Given , , and , we quantify the inferability of with respect to and .
In this paper, we study the intrinsic FDI of the security and privacy applications shown in Section II-A. Mathematically, the FDI study can serve as the theoretical foundation of the applications in Section II-A, e.g., the network traffic attribution system Kaleido proposed in [1]. Practically, the FDI study can be employed to evaluate the relative performance of the existing techniques in the applications of Section II-A, and to guide the development of new/improved techniques.
III FDI Quantification
In this section, we conduct the FDI quantification. We start the quantification from a naive scenario. Then, we generalize the FDI quantification to the more practical cases.
To make the following discussion easy to understand, we use the network traffic attribution application in network security forensics as the context of study in the rest of this paper unless otherwise specified. Our discussion applies directly to the scenarios of the linkage attack and private web search [4][5] and data de-anonymization [6][7][8].
III-A Preliminary
Following the security and privacy applications in [1]-[8], an inferring model can be learned from as shown in Fig.1 and Fig.2, e.g., the discriminant model in the network security forensics application [1][3], the linkage attack model in private web searching [4][5], and the de-anonymization model in [6][7][8]. We denote the inference (attack, de-anonymization) model by . Then, is employed to infer the new coming data, i.e., the target data.
When employing to infer users (data generated by users) in the target data, employs some inference function learned from . We here model the inference function of by . Then, , when inferring using , we denote the process by and denote the inference result by , where denotes a new user (the data generated by a new user) such that . We further explain the inference result definition as follows: when employing to infer the target user (data generated by the target user) , it may be inferred to some candidate users in the training data if the inference function is satisfied. Otherwise, is more confident to infer as a new user that never appeared in . For instance, in the network traffic attribution application, when using Kaleido ( in our definition) to monitor the on-line network traffic, the inference result could be that the traffic is generated by some existing user (used for training Kaleido) or that the traffic is generated by some new user that has not appeared before (possibly an intruder). Now, we are ready to start our quantification.
III-B Warmup: Naive Quantification
In this subsection, we conduct the FDI quantification for a naive scenario, where we assume that , is a binary feature function, i.e., or , or either has feature or not. Then, we have , , i.e., the feature vector of is a -dimensional 0-1 vector with respect to . Furthermore, for , we define . Given two 0-1 vectors and where , we define , where is the logical binary XOR operation.
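As an illustration of the naive model, the sketch below builds 0-1 feature vectors and compares them with XOR. The exact formula of the source's XOR-based quantity did not survive extraction, so summing the element-wise XOR (i.e., the Hamming distance) is an assumption made here, and the function names are hypothetical.

```python
from typing import Hashable, List, Sequence, Set


def binary_features(user_items: Set[Hashable],
                    feature_space: Sequence[Hashable]) -> List[int]:
    """0-1 feature vector: entry i is 1 iff the user has the i-th feature."""
    return [1 if f in user_items else 0 for f in feature_space]


def xor_distance(x: Sequence[int], y: Sequence[int]) -> int:
    """Assumed form of the XOR-based comparison: number of positions where
    the two 0-1 vectors disagree (Hamming distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    return sum(a ^ b for a, b in zip(x, y))


feature_space = ["f1", "f2", "f3", "f4"]
u = binary_features({"f1", "f3"}, feature_space)
v = binary_features({"f1", "f4"}, feature_space)
d = xor_distance(u, v)
```

Here the two users agree on `f1` and `f2` and disagree on `f3` and `f4`, so two positions contribute to the sum.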
For and , we denote the scenario that and correspond to the same user (or the data generated by the same user) and otherwise, e.g., the network traffic generated by the same user in different time windows or not. To conduct the FDI quantification, the first step is to understand and quantify the correlation of the features of and . Toward this objective, for and , we assume that for , i.e., the probability that preserves the same property of with respect to a feature is . Now, for and , suppose while . Then, we have the following lemma, which quantifies the inferability of with respect to and (footnote: all the quantifications in this paper are statistically meaningful, i.e., statistically, with probability 1, the FDI quantifications hold).
Lemma 1.
If and , then such that , i.e., is inferable with respect to and .
Proof: To prove this lemma, we first analyze the difference between and . To facilitate our analysis, we partition the feature space into four disjoint subsets with respect to and , denoted by , , , and respectively as shown in Fig.3, where (the set of features that has while does not have), (the set of features that both and have), (the set of features that does not have while has), and (the set of features that neither nor has). Let for , where is the cardinality of a set. Furthermore, for and , let be the feature vector of with respect to the features in . Evidently, is a subvector of . Furthermore, let . Then, it is easy to show that , .
Let . Since and , we have . Now, we consider each separately: (1) since both and have the features in , we have ; (2) similar to , since neither nor has any feature in , we have ; (3) for , the set of features held by while not , statistically, we have and , where is a binomial variable with parameters and ; and (4) for , the set of features held by while not , statistically, we have and . Then, we have
[TABLE]
Now, we consider two cases. First, if , we have . Then, applying the Pedarsani-Grossglauser lemma [9], we have
[TABLE]
Since , we have
[TABLE]
Then, according to the Borel-Cantelli Lemma and statistically, we have , which implies that statistically, .
Second, we consider the case that . In this case, we have . Then, applying the Pedarsani-Grossglauser lemma [9], we have
[TABLE]
Considering that , we have
[TABLE]
According to the Borel-Cantelli Lemma and statistically, we have , i.e., .
Now, we need to show that such that . Based on our proof, it is trivial to show that (1) when , if is an increasing function with respect to , where ; and similarly, when , if is a decreasing function with respect to . Therefore, for our purpose it is easy to design using existing techniques [1]-[8]. To name a naive one, we can set as shown in Algorithm 1.
In Lemma 1, we quantified the condition to successfully infer user from with respect to . We further discuss Lemma 1 as follows. First, one condition is that . This is consistent with our intuition. If , the features of each user are uniformly and equiprobably distributed in . Then, theoretically, all the users are equivalent with respect to and thus it is difficult (if not impossible) to successfully infer based on the features in by any model. Second, when , we explicitly specify the condition under which is statistically guaranteed to be successfully inferable with respect to . In our proof, we also show how to design . Note that the specified condition is sufficient but not necessary to have inferable with respect to . Even if the condition is not satisfied, it is still possible to successfully infer with respect to . In particular, we show this fact in the following corollary.
Corollary 1.
For and , suppose and . If , then such that .
Proof: This corollary can be proven using a technique similar to that of Lemma 1.
In Lemma 1, we quantify the FDI of with respect to . Now, we quantify the FDI of with respect to . In practice, we usually infer to a set of candidate users in . For instance, in the network traffic attribution system Kaleido [1], the user responsible for newly arriving traffic might be inferred to a set of users. Therefore, given , we define the Top-K candidate set of as follows.
Definition III.1.
Top-K candidate set and Top-K inferable. For , suppose that such that . Then, the Top-K candidate set of , denoted by , is defined as such that and . is Top-K inferable with respect to if such that , i.e., returns a subset of with size and is in that subset.
Now, we quantify the Top- FDI of a user . Let be a subset of such that and . We show the result in the following lemma.
Lemma 2.
For , suppose that . Then, is Top-K inferable if and such that , where .
Proof: we prove this lemma by considering two cases. First, we consider the case that . We define an event as such that . Then, we have according to Boole’s inequality. From Lemma 1, when , . Then, we have
[TABLE]
According to the Borel-Cantelli Lemma, we have , i.e., .
Second, we consider the case that . In this case, we define as an event that such that . Then, similar to the case that , we have
[TABLE]
Again, according to the Borel-Cantelli Lemma, we have , i.e., .
Now, we discuss how to design and how to find . Based on our proof, if and such that , then (1) when , , which implies that among , there are at least users having their values greater than ; and (2) when , , there are at least users having their values smaller than . According to this observation, we give a preliminary implementation of as shown in Algorithm 2. Basically, if , Algorithm 2 returns a set consisting of users from that have the top- minimum values; and if , Algorithm 2 returns a set consisting of users from that have the top- maximum values. By a contradiction-based technique, we can show that the shown in Algorithm 2 returns a Top- candidate set of , i.e., is Top- inferable.
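The selection step described for Algorithm 2 (return the K users with the smallest values in one case and the K users with the largest values in the other) can be sketched as follows. Algorithm 2 itself is not reproduced in this text, and the condition that decides between the two cases was lost in extraction, so it is modeled here by a hypothetical `smaller_is_better` flag; the per-user `scores` are assumed to have been computed beforehand.

```python
import heapq
from typing import Dict, Hashable, List


def top_k_candidates(scores: Dict[Hashable, float], k: int,
                     smaller_is_better: bool = True) -> List[Hashable]:
    """Return the k training users with the smallest (or largest) score.

    scores: hypothetical precomputed value of each training user against
            one target user.
    smaller_is_better: stands in for the case distinction in the text.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    pick = heapq.nsmallest if smaller_is_better else heapq.nlargest
    return [user for user, _ in pick(k, scores.items(), key=lambda kv: kv[1])]


scores = {"u1": 3.0, "u2": 1.0, "u3": 7.0, "u4": 2.0}
closest = top_k_candidates(scores, k=2)
farthest = top_k_candidates(scores, k=2, smaller_is_better=False)
```

Using `heapq` avoids sorting all training users when K is much smaller than their number; a full sort would give the same candidate set.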
In Lemma 2, the conditions for a user to be Top- inferable are quantified. If the specified conditions are satisfied, we also provide an implementation of in the proof (Algorithm 2). In fact, there are also many other techniques to implement , e.g., the techniques proposed in [1]-[8]. Further, similar to Lemma 1, the conditions in Lemma 2 are sufficient while not necessary for to be Top- inferable. When the conditions are satisfied, it is statistically guaranteed that is Top- inferable. Otherwise, is still Top- inferable with some probability. Particularly, we show that probability in the following corollary.
Corollary 2.
For , suppose that . Then, if , , where and .
Now, we consider an even more general scenario where we try to infer multiple users in . A practical application corresponding to this scenario is to attribute the monitored network traffic generated by multiple users in network forensics [1][3]. Let , i.e., is a set of users that appeared in both and . Furthermore, let be a constant and . Then, we define the -inferability of (i.e., is -inferable) as follows.
Definition III.2.
-Inferable. is -inferable if at least users in are Top-K inferable (footnote: without loss of generality, we assume is an integer in . In the case that is not an integer, we can define as ).
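The definition above is a statement about a fraction of the overlapping users being Top-K inferable. As an empirical counterpart, and only as a sketch under assumptions, one can count how many overlapping target users have their true match inside the returned candidate set. The names `candidate_sets`, `true_match`, and `fraction` are hypothetical, and ground-truth matches are assumed to be available for evaluation.

```python
from typing import Dict, Hashable, Iterable


def fraction_top_k_inferable(candidate_sets: Dict[Hashable, Iterable[Hashable]],
                             true_match: Dict[Hashable, Hashable]) -> float:
    """Fraction of overlapping target users whose true training-side match
    appears in their Top-K candidate set."""
    overlap = [t for t in candidate_sets if t in true_match]
    if not overlap:
        return 0.0
    hits = sum(1 for t in overlap if true_match[t] in set(candidate_sets[t]))
    return hits / len(overlap)


def meets_target(candidate_sets: Dict[Hashable, Iterable[Hashable]],
                 true_match: Dict[Hashable, Hashable],
                 fraction: float) -> bool:
    """True if at least the requested fraction of users is Top-K inferred."""
    return fraction_top_k_inferable(candidate_sets, true_match) >= fraction


candidate_sets = {"t1": ["a", "b"], "t2": ["c", "d"],
                  "t3": ["a", "c"], "t4": ["b", "d"]}
true_match = {"t1": "a", "t2": "a", "t3": "c", "t4": "d"}
observed = fraction_top_k_inferable(candidate_sets, true_match)
```

This measures what a particular inference function achieves on one dataset; the definition in the text concerns whether some inference function can achieve it.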
Then, we quantify the -inferability of in the following theorem.
Theorem 1.
Let be any subset of and . is -inferable if and , such that , and .
Proof: We first prove this theorem for the case that . For , suppose . Evidently, . Now, to prove this theorem, it is sufficient to show that , is Top- inferable. Let be the event that such that is not Top- inferable. Then, we have
[TABLE]
Then, according to Lemma 1 and Lemma 2, we have
[TABLE]
Following the Borel-Cantelli Lemma, we have , i.e., is Top- inferable which implies that is -inferable.
For the case that , we have
[TABLE]
Then, according to Lemma 1 and Lemma 2, we have
[TABLE]
Again, following the Borel-Cantelli Lemma, we have , which implies that is -inferable.
In Theorem 1, we quantify the -inferability of . When comparing Theorem 1 and Lemma 2, we can see that the conditions specified in Theorem 1 are stronger than those in Lemma 2 in two respects. First, in Theorem 1, it is required that for , there exists one desired . This is for the purpose of making Top-K inferable. Second, the required is stronger in Theorem 1 than that in Lemma 2. This can be explained from the statistical perspective. In Lemma 2, the objective is to make one user statistically Top-K inferable, while in Theorem 1, the objective is to make all the users in statistically Top-K inferable (simultaneously).
If the specified conditions in Theorem 1 are satisfied, an interesting question is how to design a to make -inferable. A preliminary implementation of can be built using the procedure in Algorithm 2: for each user in , we use Algorithm 2 to find a for it. Then, by an argument similar to that of Lemma 2, we can conclude that is -inferable under .
In this subsection, we conduct the FDI quantification under the assumption that each feature function is binary. Apparently, this assumption may not hold in many real applications. Nevertheless, the quantification in this subsection can shed light on sophisticated FDI analysis. In the following subsections, we consider general FDI quantification by removing this assumption.
III-C General Quantification: From the Distance Perspective
In the previous FDI quantification, we assume that , is a binary function, i.e., . Although this assumption holds in many real applications (e.g., linkage attacks and data de-anonymization attacks), may not be a binary function in many other applications. Therefore, in the following FDI quantification, we assume that can be any function with a real-valued output. Furthermore, given , an inference model may assign different weights to each feature (usually, the weights are learned from the features of the training data, i.e., ). To characterize this situation, we model that each feature in corresponds to a weight value in , which can be obtained by a weight function . In addition, to make our FDI quantification sufficiently general and meanwhile mathematically tractable, we model the correlation between the feature function and the weight function by another function , i.e., is a function defined on and (footnote: here, to make our model sufficiently general, we do not specify the dedicated definition of . In a specific application, can be specified accordingly. For instance, we may have as in a linear regression model). Now, for a user (or ), we have its feature vector as , where is the function defined on the feature function and the weight function of .
Given learned from , we quantify the FDI of using . For instance, could be the new monitored network traffic or the new collected web search data. For , to infer to some user in (or the data in generated by the same user) or to determine whether is a new user (or the data generated by a new user), two fundamental approaches are usually employed in : distance-based approach and distribution-based approach [1]-[8]. In the distance-based approach, computes the feature distance between and each in , i.e., the distance between and for . Then, infers to a subset of candidates in (either has the minimum or the maximum distance value). In the distribution-based approach, computes the feature distribution similarity between and each in , i.e., the distribution similarity between and for . Then, infers to a subset of candidates in (usually, the users in who have the most similar feature distributions with that of ). In this paper, we quantify the FDI for both approaches. Specifically, in this subsection, we focus on distance-based FDI quantification.
To facilitate our quantification, we first make the following definitions and assumptions. For , we define their feature distance as . In practice, can be defined in an application-oriented manner. For instance, can be defined using the -norm distance as follows:
[TABLE]
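For illustration only, a norm-based feature distance over weighted real-valued features might look like the sketch below. The order of the norm and the exact way features and weights are combined did not survive extraction, so the general order `p` and the product `weight * feature` used here are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def combine(features: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Assumed combination of feature values and learned weights:
    element-wise product, as in a linear model."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same dimension")
    return [w * f for f, w in zip(features, weights)]


def p_norm_distance(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """p-norm distance between two real-valued feature vectors (p >= 1)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1 for a valid norm")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


weights = [0.5, 2.0]
u = combine([1.0, 2.0], weights)
v = combine([3.0, 1.0], weights)
dist = p_norm_distance(u, v, p=2.0)
```

Any bounded, application-specific distance could be substituted for `p_norm_distance`; the quantification in the text only relies on the distance being bounded.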
Let be the expectation/mean value of a random variable. Then, we define the expectation value of as . Furthermore, we assume that , i.e., the feature distance between and is lower bounded by 0 (which is an intuitive assumption) and upper bounded by some value . Now, for and , suppose that and . We quantify the inferability of with respect to and in the following lemma.
Lemma 3.
(1) When , is inferable if ; (2) When , is inferable if .
Proof: We start from proving the first conclusion. Let , , and . When , we have
[TABLE]
Applying Chernoff bound (as shown in Lemma 7 in the Appendix), we have
[TABLE]
According to the Borel-Cantelli Lemma, we have when , i.e., . Therefore, by comparing the feature distance, we can distinguish from and , i.e., is inferable with respect to and .
Now, we prove the second conclusion. When , Let , , and . Then, we have
[TABLE]
Applying Chernoff bound, we have
[TABLE]
Thus, we have , i.e., . Therefore, is inferable with respect to and by comparing the feature distance.
In Lemma 3, we quantify the feature distance-based FDI conditions of with respect to and . In fact, the proof of Lemma 3 corresponds to an implementation of : when the specified conditions are satisfied, using a procedure as shown in Algorithm 1 can make inferable with respect to and (now, we should change to ). Also, can be implemented using other techniques, e.g., [1]-[8]. When the conditions are satisfied, as long as is an increasing function on , can successfully infer with respect to and .
Now, based on Lemma 3, we study the Top- inferability of with respect to . Again, we assume that such that . The Top- FDI of is quantified in the following lemma.
Lemma 4.
is Top-K inferable if such that , for , and , where and .
Proof: To prove this lemma, it is sufficient to prove that such that , , and . Let be the event that , is not inferable with respect to . Then,
[TABLE]
Since for and based on Lemma 1, we have is not inferable with respect to (this can be proven by considering and respectively). Therefore, we have
[TABLE]
Therefore, , which implies that is inferable with respect to .
Now, let be the procedure as shown in Algorithm 2 while changing to . Based on our proof, we conclude that the obtained of Algorithm 2 satisfies that and (actually, ).
In Lemma 4, we quantified the conditions for a user to be Top- inferable. Based on Lemma 3 and Lemma 4, we can quantify the -inferability of . We show the result in the following theorem.
Theorem 2.
Let be any subset of with . is -inferable if for , such that , for , and , where , and .
Proof: To prove this theorem, we take a similar approach as in proving Theorem 1. Let be the event that , is not Top- inferable. Then, we have
[TABLE]
Therefore, , which implies that is Top- inferable, i.e., is -inferable.
In Theorem 2, we quantify the feature distance-based -FDI of . When the specified conditions are satisfied, a can be constructed on top of the procedure in Algorithm 2 (changing the -items to the -items): call Algorithm 2 for each user . Then, according to the similar argument as in Lemma 2, we can show that is -inferable under . Again, since the conditions quantified in Theorem 2 (as well as in Lemma 3 and Lemma 4) are sufficient while not necessary, it is possible to design some sophisticated to achieve better inference performance.
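A minimal sketch of a distance-based Top-K candidate selection, in the spirit of calling a Top-K procedure once per target user, is given below. It is an illustration under stated assumptions rather than the paper's Algorithm 2: `top_k_candidates` and `infer_all` are hypothetical names, users are represented as a dictionary from user identifier to a dense feature vector, and the l_p-norm is used as the feature distance.

```python
import heapq


def top_k_candidates(target_vec, training, k, p=2.0):
    """Return the k training users whose vectors are closest to target_vec."""
    def dist(vec):
        return sum(abs(a - b) ** p for a, b in zip(target_vec, vec)) ** (1.0 / p)

    # heapq.nsmallest returns the k keys with the smallest distance, closest first.
    return heapq.nsmallest(k, training, key=lambda user: dist(training[user]))


def infer_all(targets, training, k, p=2.0):
    """Compute a Top-k candidate list for every target user."""
    return {t: top_k_candidates(vec, training, k, p) for t, vec in targets.items()}
```

A target user is then counted as Top-K inferable in this sketch when its true identity appears in the returned candidate list.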
III-D General Quantification: From the Distribution Perspective
In the previous subsection, we conducted the FDI quantification for applications that employ a feature distance-based inference model. In many other applications, may employ a feature distribution-based inference model [1]-[8], i.e., determine whether and are the same user (or the data generated by the same user) according to the feature distribution similarity of and . To provide the theoretical foundation for this kind of inference model, we quantify the feature distribution-based FDI in this subsection.
For and , there are many approaches to measure the distribution similarity of and . Among them, one of the most widely adopted approaches is the Cosine-similarity based method [2][4][5][7]. Therefore, we focus on quantifying the Cosine similarity-based FDI in this paper. Our technique is expected to shed light on the FDI quantification based on other distribution similarity measurements. Before the quantification, we formally define the Cosine similarity first. Let , , and . Furthermore, let be the magnitude of a vector and . Then, we define the feature distribution similarity between and as
[TABLE]
where the is the dot product here.
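As a concrete illustration of the Cosine similarity defined above, the following sketch computes the dot product of two feature vectors divided by the product of their magnitudes. The function name `cosine_similarity` is hypothetical, and returning 0 for an all-zero vector is an assumption made here to avoid division by zero; the paper does not specify that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an empty feature vector is similar to nothing
    return dot / (norm_x * norm_y)
```

Vectors that are scalar multiples of each other, such as `[1, 2, 3]` and `[2, 4, 6]`, have similarity 1, while vectors with no common feature have similarity 0.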
Now, given and , we assume that and . We start our quantification from the scenario that is inferable with respect to . Let and be two random variables such that and . Furthermore, we assume that . Then, we have the following lemma to quantify the inferability of with respect to .
Lemma 5.
* is inferable with respect to if , where is the expectation value of and is a constant value.*
Proof: To prove this lemma, statistically, it is sufficient to prove that as . According to the Cosine similarity definition, we have
[TABLE]
Therefore, to prove , it is equivalent to prove that . Now, instead of proving directly, we prove , where is some constant value. According to the Chernoff bound, we have
[TABLE]
Thus, , i.e., , which implies is inferable with respect to .
In Lemma 5, we quantify the inferability of with respect to . Following the proof of the lemma, a can be easily constructed such that when the specified conditions are satisfied: simply returns the one who has a higher feature distribution similarity with . Based on Lemma 5, we can further quantify the Top- inferability of . The result is shown in Lemma 6.
Lemma 6.
* is Top- inferable if , where , , and is a constant value.*
Proof: This lemma can be proven based on Lemma 5. Let be any subset of such that and . Then, we first prove that . Let be the event that such that . Then, applying Lemma 5, we have
[TABLE]
Therefore, we have . Now, we design a for to be Top- inferable. Similar to the one in Algorithm 2, we can design a under which the users in who have the Top- feature distribution similarity scores (Cosine similarity scores) with are returned as . Then, based on our proof, we have .
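The similarity-based Top-K selection described in the proof can be sketched as follows. The sketch assumes that each user is stored sparsely as a dictionary from feature to non-negative weight, which is a representation chosen here for illustration; `cosine_sparse` and `top_k_by_similarity` are hypothetical names.

```python
import heapq
import math


def cosine_sparse(x, y):
    """Cosine similarity of two sparse feature-weight dictionaries."""
    dot = sum(w * y.get(f, 0.0) for f, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: users without features match nobody
    return dot / (norm_x * norm_y)


def top_k_by_similarity(target, training, k):
    """Return the k training users most similar to target, best match first."""
    return heapq.nlargest(k, training, key=lambda u: cosine_sparse(target, training[u]))
```

The sparse representation keeps the cost of each comparison proportional to the number of features a user actually has, which matters when most users have only a few features.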
In Lemma 6, the feature distribution-based Top- inferability of is quantified. When the specified conditions are satisfied, we also discussed how to implement in the proof. Based on Lemma 5 and Lemma 6 we can quantify the -inferability of . The result is shown in the following theorem.
Theorem 3.
* is -inferable if , where and and is a constant value.*
Proof: Let be any subset of with size . To prove this theorem, it is sufficient to prove that all the users in are Top- inferable. Let be the event that such that cannot be Top- inferable. Then, we have
[TABLE]
According to Lemma 6, we have
[TABLE]
Therefore, we have , i.e., statistically, all the users in are -inferable.
In Theorem 3, we quantify the feature distribution similarity-based -FDI of . When the specified conditions are satisfied, we can also design a using the one shown in Lemma 6: finding the for each user using the shown in Lemma 6. According to our proof, we can see that is -inferable under such a . Furthermore, similar to that in Theorem 1 and Theorem 2, the conditions in Theorem 3 are sufficient while not necessary. Therefore, a sophisticated could be implemented to improve the inference performance. Here, our FDI quantification can serve as a theoretical baseline to facilitate and guide the design of better inference models.
III-E Discussion: Inferring New User/Data
In the previous subsections, we focus on quantifying the feature distance and distribution based FDI of the users that appear in both the training data and the target data . In reality, it is possible that there are some new users/data that appear in while not in . Formally, it is possible that while such that . In this case, an ideal inference model will infer as a new user (or data generated by a new user), e.g., an intruder in network forensics applications [1][4]. In practical inference models [1]-[8], a user in is inferred as a new user (or data generated by a new user) if the feature distance is larger than a threshold for , or the feature distribution similarity is smaller than a threshold for .
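The threshold rule used by practical inference models can be sketched as below. The function names and the way the thresholds are supplied are assumptions for illustration; how to choose the threshold values is a separate estimation problem and is not addressed by this sketch.

```python
def is_new_user_by_distance(distances, threshold):
    """Flag a target as new if it is farther than threshold from every known user."""
    return all(d > threshold for d in distances)


def is_new_user_by_similarity(similarities, threshold):
    """Flag a target as new if it is less similar than threshold to every known user."""
    return all(s < threshold for s in similarities)
```

Note that with an empty list of known users both functions return `True`, i.e., every target is treated as new when there is no training data.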
Theoretically, it is challenging (or even impossible) to quantify the precise inferability of a new user in general with a statistical guarantee (this is why an inference system has false positives and false negatives). The reason is that, theoretically, the feature characteristics of a new user (data generated by a new user) might be arbitrarily similar to those of an existing user (e.g., network intruders keep improving their camouflaging techniques). Nevertheless, our FDI quantification still has meaningful implications for inferring new users. For , , and , let when is a feature distance based model and when is a feature distribution similarity based model. Then, when is significantly apart from or depending on (distance or distribution based), can be inferred as a new user (the data generated by a new user) with a higher confidence, i.e., or can be set as the threshold values in practical applications. The underlying reason for this fact can be explained by the following corollary, which is a direct result of the Chernoff bound.
Corollary 3.
(1) Let and . When and , is a new user if or for all . (2) Let . When and , is a new user if for all .
In practice, the accurate value of or is usually difficult, if not impossible, to obtain. Frequently, or can only be estimated based on the observed data and thus may change as more data arrive, i.e., the threshold estimation problem is itself an interesting problem. We leave quantifying the correlation between the threshold setting and the false positive/negative rate of as one of our future research directions.
IV Evaluation
In this section, we evaluate the user inferability of real world security and privacy applications based on our FDI quantification. Specifically, we evaluate two scenarios: network traffic attribution in network forensics and feature-based data de-anonymization (as shown in Section II-A).
IV-A Network Traffic Attribution
IV-A1 Data Collection and Analysis
In this scenario, we evaluate the user inferability of four large-scale network traces generated by the employees of a large enterprise. These four traces were collected in four periods of 2014: April 1 – April 30, which consists of the network traffic generated by 5888 users; July 1 – July 31, which consists of the network traffic generated by 5610 users; October 1 – October 31, which consists of the network traffic generated by 5268 users; and December 1 – December 31, which consists of the network traffic generated by 5699 users. Each network trace is composed of three parts: HTTP request headers, netflow measures, and DNS queries.
Here, we do not consider the network traffic payloads, e.g., the HTTP payloads, for the following reasons. First, those data are highly sensitive and using them may cause legal issues. Second, although network traffic payloads may provide more information, our network traces are sufficient to infer many users, as shown in our experiments. Finally, as indicated in [1], most commonly available traces do not include those payloads. Therefore, studying the common feature-based data inferability is more useful and general for security and privacy applications.
IV-A2 Feature Extraction
After collecting these four traces, we extract their features. Here, we use the feature extraction model proposed in [1]. Although more features could be extracted, for our purpose it is sufficient to extract two kinds of lexical features for our FDI analysis: the domain feature and the (tokenized) path feature. Basically, these two features characterize the behaviors of users in terms of what types of websites they have visited and how they interacted with the websites. For instance, given an HTTP request “www.google.com/search?q=ndss+2016&ie=utf-8&oe=utf-8”, we extract the domain feature “www.google.com”. For the path features, we tokenize each path (URL) using ‘?’, ‘=’, ‘&’, etc. as delimiters and employ a bag-of-words representation of the tokens. We refer interested readers to [1] for more details of the feature extraction model. Finally, we show the feature extraction results of the four traces in Table I, where Apr, July, Oct, and Dec represent the four traces collected in April, July, October, and December of 2014, “-Domain” means the domain features, “-Path” means the tokenized path features, is the number of users in the dataset, and is the number of extracted features. Note that each user-feature relationship in Table I has an associated weight, which indicates how many times the feature appeared in the user’s trace. For instance, if Bob visited “www.google.com” 100 times in April 2014, then the weight associated with the “Bob – www.google.com” relationship is 100 in the Apr-Domain dataset in Table I.
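The domain and path feature extraction described above can be sketched as follows. This is a simplified illustration, not the feature extraction model of [1]: the function name `extract_features` is hypothetical, the request is assumed to be given without a scheme (as in the example), and only the three delimiters named in the text are used, whereas the text indicates that further delimiters exist.

```python
import re
from collections import Counter

# Assumption: only the delimiters explicitly named in the text are used here.
PATH_DELIMITERS = re.compile(r"[?=&]")


def extract_features(request):
    """Split a scheme-less HTTP request into a domain and a bag of path tokens."""
    domain, _, path = request.partition("/")
    tokens = Counter(t for t in PATH_DELIMITERS.split(path) if t)
    return domain, tokens


domain, tokens = extract_features("www.google.com/search?q=ndss+2016&ie=utf-8&oe=utf-8")
```

For the example request, `domain` is `www.google.com` and the token `utf-8` receives a count of 2 in the bag-of-words representation, which mirrors how weights accumulate for repeated features.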
Now, we define the degree of each user as the number of features this user has and the degree of each feature as the number of users that have this feature. Then, we show the user degree distribution and feature degree distribution of the traces in Table I with respect to the domain feature and the path feature in Fig.4 and Fig.5, respectively. From Fig.4 and Fig.5, we have the following observations: both the user degree and feature degree generally follow a power-law-like distribution [11], especially the feature degree distribution, i.e., most of the users have a small number of features while only a few users have many features, and meanwhile, most of the features only appear in the traces of a few users while a small number of features appear in the traces of a large number of users. These distributions together suggest that these features could be employed to effectively infer the users.
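The two degree definitions can be computed directly from a list of user-feature relationships, as in the sketch below. The function names and the `(user, feature)` tuple representation are assumptions for illustration; weights are ignored because the degree only counts distinct features or users.

```python
from collections import Counter


def degrees(relationships):
    """Return (user_degree, feature_degree) from (user, feature) pairs."""
    pairs = set(relationships)  # a repeated pair counts once toward the degree
    user_degree = Counter(user for user, _ in pairs)
    feature_degree = Counter(feature for _, feature in pairs)
    return user_degree, feature_degree


def degree_histogram(degree):
    """Map each degree value to the number of users (or features) having it."""
    return Counter(degree.values())
```

Plotting `degree_histogram` on log-log axes is one common way to inspect whether a distribution looks power-law-like.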
IV-A3 Evaluation Methodology
To conduct the FDI evaluation, we basically follow the same process as shown in Fig.1 and Fig.2. Meanwhile, since we focus on evaluating the statistically inherent FDI, we also make the evaluation process mathematically tractable. Following the models shown in Fig.1 and Fig.2, we first determine the training data and the testing data. Here, instead of partitioning the raw data into two parts for training and testing respectively (as in much of the existing literature, e.g., [1]), we take a different but theoretically equivalent approach: following the FDI quantification in Section III, we first construct the training dataset by keeping all the users and features in each trace while sampling the user-feature relationships independently and identically using a probability (see footnote 6). Similarly, we construct the testing/targeting dataset using the same process as in obtaining the training dataset. We use this approach to construct the training and testing data for two reasons. First, mathematically, this approach is equivalent to the traditional method in [1]. In the traditional method, the raw data is partitioned into the training data and the testing data and then features are extracted from both datasets. Evidently, the reason that existing inference techniques can work is that the training data and the testing data share some common features (or similar distributions over a feature space). Therefore, statistically, we can consider the training and testing data as sampled versions of the original raw data, i.e., each method of partitioning the training and testing data mathematically corresponds to one here. Second, using this approach to obtain the training and testing/targeting data makes it easier to apply our FDI quantification analysis. We further discuss closing the gap between theory and practice in Section V.
Footnote 6: It is not necessary for the training data and the testing data to have the same group of users or features. If they are not the same, we can either apply our theory to the overlapping users/features, or make them the same by adding isolated users/features that only appeared in the other dataset. Theoretically, different user/feature groups will not change the validity of our quantification.
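The sampling-based construction of the training and testing datasets can be sketched as follows. The function name `sample_relationships`, the `seed` parameter, and the choice to sample each relationship as a single unit regardless of its weight are assumptions for illustration.

```python
import random


def sample_relationships(relationships, prob, seed=None):
    """Keep each user-feature relationship independently with probability prob."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0, 1]")
    rng = random.Random(seed)
    return [r for r in relationships if rng.random() < prob]


raw = [("bob", "google.com"), ("bob", "nytimes.com"), ("eve", "google.com")]
training = sample_relationships(raw, 0.8, seed=1)  # one sampled version
testing = sample_relationships(raw, 0.8, seed=2)   # an independent sampled version
```

Because users and features are kept and only relationships are sampled, the two datasets share the same user and feature sets while differing in which relationships each one observes.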
After obtaining the training and testing data, we quantify the FDI of the four traces using the general scenario FDI quantification technique in Section III. Specifically, for the network traffic attribution application, most of the existing inference models are based on feature distance [1][2][3]. Therefore, we evaluate the FDI using the distance-based quantification technique here. Following Theorem 2, we can easily construct an inference model on top of the procedure of Algorithm 2 as shown in Section III-C. Then, we apply to quantify the FDI of each dataset. We further discuss the implications of our quantification and of the results in Section V.
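Once an inference model has produced a ranked candidate list for every target user, the fraction of Top-K inferable users can be tallied as in the sketch below. This is only an empirical bookkeeping step, not the paper's quantification itself; the names `top_k_fraction` and `averaged_fraction` are hypothetical, and the sketch assumes that target users and training users share the same identifiers so that a hit can be checked by membership.

```python
from statistics import mean


def top_k_fraction(ranked, k):
    """Fraction of target users whose true identity is among their first k candidates.

    ranked maps each target user to candidates ordered from best to worst match.
    """
    if not ranked:
        return 0.0
    hits = sum(1 for user, candidates in ranked.items() if user in candidates[:k])
    return hits / len(ranked)


def averaged_fraction(runs, k):
    """Average the Top-k fraction over several independent runs."""
    return mean(top_k_fraction(run, k) for run in runs)
```

Averaging over repeated runs, each with freshly sampled training and testing data, reduces the variance introduced by the random sampling.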
IV-A4 Results and Analysis
Now, we evaluate the FDI of the four traces following the above evaluation methodology. To reduce any bias, all the experiments are run 10 times (e.g., for the same ). The final results are the averages over the 10 runs. We show the -inferability of the four datasets with respect to the domain and path features in Fig.6 respectively, where we set , i.e., we are targeting a user to be Top-10 inferable. From Fig.6, we have the following observations.
- With the increase of , also increases, which implies that more and more users become Top-10 inferable. The reason is that a large implies more common features are shared by the training data and the targeting data, i.e., there is more knowledge available to an inference model. Therefore, statistically, more users are Top-10 inferable.
- When comparing the domain feature-based data inferability (Fig.6 (a)) with the path feature-based data inferability (Fig.6 (b)), we find that the path features are more powerful in inferring the users than the domain features. This can be explained based on the results in Table I, Fig.4, and Fig.5. First, each dataset has many more path features than domain features (Table I), i.e., much more knowledge can be used to conduct path feature-based inference. Second, the users of each dataset have higher path feature-based degrees than domain feature-based degrees (Fig.4), and meanwhile, both the domain and the path feature degree distributions generally follow similar power-law-like distributions. Thus, users are more distinguishable with respect to the path features than with respect to the domain features.
In our evaluation, we also examined the data inferability with respect to other settings: changing the value of and combining the domain and path features. The results are as expected and we put them in the technical report [12]. Here, we briefly summarize the results. When increasing (from to ), more users are Top- inferable given the same . The reason is evident: increasing implies decreasing the desired inference accuracy, so statistically, more users become Top- inferable. Furthermore, after combining the domain and path features together, more users are inferable compared to the scenario of applying the domain and path features separately. The reason is also straightforward: more features imply that more knowledge is available for inferring users, and thus the inference accuracy is improved.
IV-B Data De-anonymization
Now, we evaluate the users’ feature-based inferability in the data de-anonymization application [6][7][8].
IV-B1 Data Collection and Features
In our evaluation, we use three social network datasets, Google+, Facebook, and Twitter, which are publicly available at the Stanford Large Network Dataset Collection [13]. We use these datasets because they are published along with well-defined user features, e.g., birthdays, education, hometown, languages, career, etc. For de-anonymization attacks, an adversary may directly employ these features to de-anonymize users. For our purpose, we can also employ these features to quantify users’ FDI. We show the statistics of these three datasets in Table II. By comparing the datasets in Tables I and II, we find that the three social datasets have far fewer features. Furthermore, for the three datasets in Table II, there is no weight information associated with the user-feature relationships.
We show the user degree distribution of Google+, Facebook, and Twitter in Fig.7. Basically, the user degree of these three datasets also shows a power-law-like distribution (similar to the datasets in Table I, the feature degree of these three datasets also shows a power-law-like distribution [12]). This suggests that the users in the three datasets could be inferred (i.e., de-anonymized here) based on the associated features.
IV-B2 FDI Evaluation and Analysis
To evaluate the FDI of Google+, Facebook and Twitter, we follow the same methodology as in the previous subsection. We show the FDI of the three datasets in Fig.8, where , i.e., we also target the Top-10 inferability of users. From the result, we have the following observations.
- Again, with the increase of , more users become Top-10 inferable in the three datasets. The reason is the same as that in analyzing Fig.6.
- Google+ is much less inferable than Facebook and Twitter. For instance, when , Facebook users and Twitter users are Top-10 inferable, while only Google+ users are Top-10 inferable. Even if , only Google+ users are Top-10 inferable. This can be explained based on the results in Table II and Fig.7. First, the user-feature relationships of Google+ are much sparser than those of the other two datasets. Second, the degree of most of the Google+ users is very low. Therefore, there is not much information that can be leveraged to infer the Google+ users.
In reality, it is possible to improve the data de-anonymization performance using more auxiliary information (more features). Here, our FDI quantification results can provide a benchmark for evaluating the performance of a data de-anonymization attack.
V Discussion
In this section, we further discuss the proposed FDI quantification technique and then point out future research directions.
V-A Theory versus Practice
Motivated by many existing security and privacy applications, in this paper, we study the FDI quantification problem. To the best of our knowledge, we provide the first FDI quantification technique for general feature-based inference models from both the distance perspective and the feature distribution perspective. Using our quantification technique, we also evaluate the FDI of feature-based network forensics and data de-anonymization applications.
Our quantification is important from several perspectives. First, our quantification provides the theoretical foundation of many existing feature-based security and privacy applications, e.g., network traffic attribution in network forensics [1][3], linkage attacks and private web search [4][5], and feature-based data de-anonymization [6][7][8]. Therefore, for such applications, our quantification closes the gap between practice and theory.
Second, our quantification can be employed to evaluate the performance of the existing techniques in the aforementioned security and privacy applications. Note that we aim to quantify the users who can be inferred with a statistical guarantee based on their features as well as other users’ features. Meanwhile, we also provide insights on how to design the inference model (as shown in Section III). Therefore, the quantification results (e.g., the evaluation results in Section IV) can serve as a benchmark to evaluate the performance of existing techniques. For instance, to evaluate the performance of the network traffic attribution system Kaleido [1], we can employ the evaluation results in Section IV directly (see footnote 7). Similarly, we can also employ the FDI quantification to evaluate existing feature-based query linkage attacks, private searching techniques, data de-anonymization attacks, etc.
Footnote 7: We can first derive based on the training and testing data used in Kaleido. Then, we apply our FDI quantification to derive the inherent user inferability. Finally, we can use the theoretical user inferability to evaluate the performance of Kaleido. If Kaleido’s performance meets the theoretical results, we can conclude that Kaleido performs well. Otherwise, we can also tell how much room there is for improving Kaleido.
Finally, since our quantification can provide a benchmark of existing feature-based security and privacy applications, it is evident that our quantification is helpful for researchers to study and develop new techniques for these applications.
V-B Future Work
In this paper, to the best of our knowledge, we take the first step in understanding the theoretical foundation of many existing security and privacy applications. Specifically, we propose the FDI quantification techniques for distance-based and distribution-based inference models. There are still several interesting directions in which to continue the research. First, it is interesting to further generalize our quantification to inference models that take account of both feature distance and feature distribution. Second, in addition to the distance/distribution-based models, it is also meaningful to quantify data inferability under other models for more security and privacy applications. Third, it is an interesting and meaningful direction to develop an FDI-based evaluation tool that conveniently supports data inferability analysis for existing feature-based inference-oriented security and privacy applications.
VI Related Work
In this section, we survey the related work. To the best of our knowledge, no prior work studies the theoretical foundation or the inferability quantification problem for existing feature-based security and privacy applications. We therefore focus on briefly summarizing the applications to which our FDI quantification can be applied.
User-System Interaction Trace Attribution. In [1], Wang et al. designed a network traffic attribution system Kaleido. Kaleido leverages a class of inductive discriminant models to extract user- and context-aware features of network traffic and then builds an efficient inference model to conduct real-time traffic attribution over high-volume network traces. Another feature-based network forensics application is [3], where Neasbitt et al. proposed ClickMiner, a novel system that aims to automatically reconstruct user-browser interactions from network traces. A comprehensive survey on network trace-based forensic frameworks can be found in [14].
In addition to network traffic-based forensic applications, there are also many other trace attribution-based security and privacy applications. For instance, in [15], Bergadano et al. proposed to employ keystroke dynamics (traces) to perform user authentication; in [16], Monrose et al. designed a technique to reliably generate a cryptographic key from a user’s voice while speaking a password; and in [17], Zheng et al. implemented an efficient user verification system based on mouse movement traces.
Linkage Attacks and Privacy-preserving Web Search. In [4], Gervais et al. proposed a quantitative framework to understand web-search privacy given an adversary’s background knowledge and attacks. In [18], Peddinti and Saxena analyzed whether query obfuscation can preserve users’ privacy against an adversarial search engine. In [19], Jones presented attacks on users’ query logs that broke users’ privacy. Recently, Balsa et al. presented a SoK paper on linkage attacks and privacy-preserving web search [5].
Feature-based Data De-anonymization. In [6], Caliskan-Islam et al. presented a novel data de-anonymization attack against programmers leveraging code stylometry. Afroz et al. presented another stylometry-based de-anonymization attack in [7], by which they can identify the authors of anonymous texts. In [8], Narayanan and Shmatikov presented a new class of statistical de-anonymization attacks on high-dimensional micro-data, e.g., recommendation data, transaction data, and so on. An off-line de-anonymization attack on bubble forms is presented in [20] by Calandrino et al.
Remark. In addition to the aforementioned security and privacy applications, there are also other applications, e.g., feature-based malware detection systems and intrusion detection systems, to which our quantification can be applied for analysis. Although there are many feature-based inference techniques for various security and privacy applications, their theoretical foundation remains unclear. Furthermore, there is also no theoretical benchmark to evaluate the performance of existing techniques relative to the inherent performance bound. To close this gap, we conduct the first FDI quantification in general scenarios from both distance and distribution perspectives.
VII Conclusion
Considering that many security and privacy applications can be characterized by the feature-based inference problem, we study the FDI issue in this paper. First, we conduct the FDI quantification under a naive data model, under which we demonstrate the conditions to have a desired fraction of target users to be Top- inferable. Subsequently, we extend our quantification to a general data model by conducting the FDI quantification from both a distance perspective and a distribution perspective. Our quantification addresses several important yet open problems and lays the foundation of existing feature-based inference systems/techniques. Then, based on our quantification, we evaluate the user inferability in both the network traffic attribution case and the feature-based data de-anonymization case. Finally, we point out the implications of this research for existing feature-based inference systems/techniques for various security and privacy applications.
A General Version of Chernoff Bound
The following version of Chernoff bound applies to bounded variables with any distribution [10].
Lemma 7.
*Let be random variables such that for all . Let and set (i.e., the expectation value of ). Then, for all : and *
References
- [1] T. Wang, F. Wang, D. Schales, and R. Sailer, Kaleido: Network Traffic Attribution using Multifaceted Footprinting, SDM 2014.
- [2] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, Information-theoretic Metric Learning, ICML 2007.
- [3] C. Neasbitt, R. Perdisci, K. Li, and T. Nelms, ClickMiner: Towards Forensic Reconstruction of User-Browser Interactions from Network Traces, CCS 2014.
- [4] A. Gervais, R. Shokri, A. Singla, S. Capkun, and V. Lenders, Quantifying Web-Search Privacy, CCS 2014.
- [5] E. Balsa, C. Troncoso, and C. Diaz, OB-PWS: Obfuscation-based Private Web Search, S&P 2012.
- [6] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, De-anonymizing Programmers via Code Stylometry, USENIX Security 2015.
- [7] S. Afroz, A. Caliskan-Islam, A. Stolerman, R. Greenstadt, and D. McCoy, Doppelgänger Finder: Taking Stylometry to the Underground, S&P 2014.
- [8] A. Narayanan and V. Shmatikov, Robust De-anonymization of Large Sparse Datasets, S&P 2008.
