Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization
Mathias Lecuyer, Riley Spahn, Roxana Geambasu, Tzu-Kuo Huang, and Siddhartha Sen

TL;DR
Pyramid is a data management system that uses count featurization to minimize data exposure during machine learning training, effectively reducing the amount of data needed while maintaining model quality.
Contribution
It introduces a novel approach leveraging count featurization for data protection and selectivity in big data management, integrated into Spark Velox.
Findings
Pyramid trains models on less than 1% of raw data.
Achieves state-of-the-art model performance with minimal data exposure.
Demonstrates effective data protection in ML workloads.
| App | Dataset | Observations | Features | Baseline |
|---|---|---|---|---|
| Ad targeting (classification) | Criteo Kaggle [35] | 45M | 39 | neural net in Kaggle [36] |
| Ad targeting (classification) | Criteo Full [37] | 1.2B | 39 | regularized linear model |
| Movie recommendation (classification) | MovieLens [38] | 22M | 21 | matrix factorization [33] |
| Movie recommendation (regression) | MovieLens [38] | 22M | 21 | matrix factorization [33] |
| News personalization (regression) | MSN.com production | 24M | 507 | contextual bandits [39, 40] |

| Dataset | Model | Parameters |
|---|---|---|
| Criteo-Kaggle | B: neural net (nn) | VW. One 35 nodes hidden layer with tanh activation. LR: 0.15. BP: 25. Passes: 20. Early Terminate: 1. |
| | logistic regression (log. reg.) | VW. LR: 0.5. BP: 26. |
| | gradient boosted trees (gbt) | Sklearn. 100 trees with 8 leaves. Subsample: 0.5. LR: 0.1. BP: 8. |
| Criteo-Full | B: ridge regression (rdg. reg.) | VW. L2 penalty: . LR: 0.5. BP: 26. |
| MovieLens Regression | B: singular value decomposition (svd) | VW. Rank 10. L2 penalty: 0.001. LR: 0.015. BP: 18. Passes: 20. LR Decay: 0.97. PowerT: 0. |
| | linear regression (lin. reg.) | VW. LR: 0.5. BP: 22. Passes: 5. Early Terminate: 1. |
| | gradient boosted trees (gbt) | Sklearn. 100 trees with 8 leaves. Subsample: 0.5. LR: 0.1. BP: 8. |
| MovieLens Classification | B: singular value decomposition (svd) | VW. Rank 10. L2 penalty: 0.001. LR: 0.015. BP: 18. Passes: 20. LR Decay: 0.97. PowerT: 0. |
| | logistic regression (log. reg.) | VW. LR: 0.5. BP: 22. Passes: 5. Early Terminate: 1. |
| | gradient boosted trees (gbt) | Sklearn. 100 trees with 8 leaves. Subsample: 0.5. LR: 0.1. BP: 8. |
| MSN.com | contextual bandit | VW. IPS context. bandit. LR: 0.02. BP: 18. |
Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization*
Mathias Lecuyer**1, Riley Spahn**1, Roxana Geambasu1, Tzu-Kuo Huang†2, and Siddhartha Sen3
1Columbia University, 2Uber Advanced Technologies Group, and 3Microsoft Research
*Technical report version of the IEEE S&P’17 paper with the same name and authors. This technical report describes a recent addition to Pyramid to make some of our processes differentially private (§A).
**First authors in alphabetical order.
†Work done while at Microsoft Research.
Abstract
Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected “just in case” would help these organizations to limit the latter’s exposure to attack. A natural approach might be to monitor data use and retain only the working-set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today’s big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads to improve performance or scalability.
We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and proof-of-concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data.
I Introduction
Driven by cheap storage and the immense perceived potential of “big data,” both public and private sectors are accumulating vast quantities of personal data: clicks, locations, visited websites, social interactions, and more. Data offers unique opportunities to improve personal and business effectiveness. It can boost applications’ utility by personalizing their features; increase business revenues via targeted product placement; improve social processes such as healthcare, disaster response and crime prevention. Its commercialization potential, whether real or perceived, drives unprecedented efforts to grab and store raw data resources that can later be mined for profit.
Unfortunately, this “collect-everything” mentality poses serious risks for organizations by exposing extensive data stores to external and internal attacks. The hacking and exploiting of sensitive corporate and governmental information have become commonplace [1, 2]. Privacy-transgressing employees have been discovered snooping into data stores to spy on friends, family, and job candidates [3, 4]. Although organizations strive to restrict access to particularly sensitive data (such as passwords, SSNs, emails, banking data), properly managing access controls for diverse and potentially sensitive information remains an unanswered problem.
Compounding this challenge is a significant new thrust in the public and private spheres to integrate data collected from multiple sources into a single, giant repository (or “data lake”) and make that available to any applications that might benefit from it [5, 6, 7]. This practice magnifies the data exposure problem, transforming big data into what some have called a “toxic asset” [8].
Our goal in this paper is to explore a more rigorous and selective approach to big data protection. We hypothesize that not all data that is collected and archived is, or may ever be, needed or used. The ability to distinguish data needed now or in the future from data collected “just in case” could enable organizations to restrict the latter’s exposure to attacks. For example, one could ship unused data to a tightly controlled store, whose read accesses are carefully mediated and audited. Turning this hypothesis into a reality requires finding ways to: (1) minimize data kept in the company’s widely-accessible data lakes, and (2) avoid the need to access the controlled store to meet current and evolving workload needs.
A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage; data unused for some time is evicted to the protected store [9]. However, many of today’s big data applications involve machine learning (ML) workloads that are periodically retrained to incorporate new data, resulting in frequent accesses to all data. How can we determine and minimize the training set—the “working set” for emerging ML workloads—to adopt a more rigorous and selective approach to big data protection?
We observe that for ML workloads, significant research is devoted to limiting the amount of data required for training. The reasons are many but typically do not involve data protection. Rather, they include increasing performance, dealing with sparsity, and limiting labeling effort. Techniques such as dimensionality reduction [10], feature hashing [11], vector quantization [12], and count featurization [13] are routinely applied in practice to reduce data dimensionality so models can be trained on manageable training sets. Semi-supervised [14] and active learning [15] reduce the amount of labeled data needed for training when labeling requires manual effort.
Can such mechanisms also be used to limit exposure of the data being collected? How can an organization that already uses these methods develop a more robust data protection architecture around them? What kinds of protection guarantees can this architecture provide?
As a first step to answering these questions, we present Pyramid, a limited-exposure big-data management system built around a specific training set minimization method called count featurization [16, 17, 18, 13]. Also called historical statistics, count featurization is a widely used technique for reducing training times by feeding ML algorithms with a limited subset of the collected data combined (or featurized) with historical aggregates from much larger amounts of data. The method is valuable when features with strong predictive power are highly dimensional, requiring large quantities of data (and large amounts of time and resources) to be properly modeled. Applications that use count featurization include targeted advertising, recommender systems, and content personalization systems. Such applications rely on user information to predict clicks, but since there can be hundreds of millions of users, training can be very expensive without some way to aggregate users, like count featurization. The advertising systems at Microsoft, Facebook, and Yahoo are all built upon this mechanism [19], and Microsoft Azure offers it as a service [20].
Pyramid builds on count featurization to construct a selective data protection architecture that minimizes exposure of individual observations (e.g., individual clicks). To highlight, Pyramid: keeps a small, rolling window of accessible raw data (the hot window); summarizes the history with privacy-preserving aggregates (called counts); trains application models with hot raw data featurized with counts; and rolls over the counts to forget all traces of observations past a specified retention period. Counts are infused with differentially private noise [21] to protect individual observations that are no longer in the hot window but still fall within the retention period. Counts can support modifications and additions of many (but not all) types of models; historical raw data, which may be needed for workloads not supported by count featurization, is kept in an encrypted store whose decryption requires special access.
While count featurization is not new, our paper is the first to retrofit it for data protection. Doing so raises significant challenges. We first need to define meaningful requirements and protection guarantees that can be achieved with this mechanism, such as the amount of exposed information or the granularity of protection. We then need to achieve these protection guarantees without affecting model accuracy and scalability, despite using much less raw data. Finally, to make the historical raw data store easier to protect, we need to access it as little as possible. This means supporting workload evolution, such as parameter tuning or trying new algorithms, without going back to the historical raw data store.
We overcome these challenges with three main techniques: (1) weighted noise infusion, which automatically shares the privacy budget to give noise-sensitive features less noise; (2) an unbiased private count-median sketch, a data structure akin to a count-min sketch that resolves the large negative bias arising from applying differentially private noise to a count-min sketch; and (3) automatic count selection, which detects potentially useful groups of features to count together, to avoid accesses to the historical data. Together, these techniques reduce the impact of differentially private noise and count featurization.
We built Pyramid and integrated it into Spark Velox, a targeting and personalization framework, to add rigor and selectivity to its data management. We evaluated three applications: a targeted advertising system using the Criteo dataset, a movie recommender using the MovieLens dataset, and MSN’s production news personalization system. Results show that: (1) Pyramid approaches state-of-the-art models while training on less than 1% of the raw data. (2) Protecting historical counts with differential privacy has only 2% impact on accuracy. (3) Pyramid adds just 5% performance overhead.
Overall, we make the following contributions:
1. Formulating the selective data protection problem for emerging ML workloads as a training set minimization problem, for which many mechanisms already exist.
2. The design of Pyramid, the first selective data management system that minimizes data exposure in anticipation of attack. Built upon count featurization, Pyramid is particularly suited for targeting and personalization workloads.
3. A set of new techniques to balance solid protection guarantees with model accuracy and scalability, such as our unbiased private count-median sketches.
4. Pyramid’s code, both integrated into Spark Velox and as a stand-alone library ready to integrate in other targeting/personalization frameworks. https://columbia.github.io/selective-data-systems/
II Motivation and Goals
This paper argues for needs-based selectivity in big data protection: protecting data differently depending on whether or not it is actually needed to handle a company’s day-to-day workloads. Intuitively, data that is needed day-to-day is less amenable to certain kinds of protection (e.g., auditing or case-by-case access control) than data needed only for exceptional situations. A key question is whether a company’s day-to-day needs can be captured with a limited and well-defined data subset. While we do not claim to answer this question in full, we present with Pyramid the first evidence that selectivity can be achieved in one important big-data workload domain: ML-based targeting and personalization. The following scenario motivates selectivity and shows how and in what contexts Pyramid helps improve protection.
II-A Example Use Case
MediaCo, a media conglomerate, collects observations of user behavior from its hundreds of affiliate news and entertainment sites. Observations include the articles users read and share, the ads they click, and how they respond to A/B testing. MediaCo uses this data to optimize various processes, including recommending articles to users, showing the most relevant articles first, and targeting ads. Initially, MediaCo collected observations from affiliate sites in separate, isolated repositories; different engineering teams used different repos to optimize these processes for each affiliate site. Recently, MediaCo has started to track users across sites using cookies and to integrate all data into a central data lake. Excited about the potential of the much richer information in the data lake, MediaCo plans to provide indiscriminate access to all engineers. However, aware of recent external hacking and insider attacks affecting other companies, it worries about the risks it assumes with such wide access.
MediaCo decides to use Pyramid to limit the exposure of historical observations in anticipation of such attacks. For MediaCo’s main workloads, which consist of targeting and personalization, the company already uses count featurization to address sparsity challenges; hence, Pyramid is directly applicable for those workloads. They configure it by keeping Pyramid’s hot window of raw observations, along with its noise-infused historical statistics, in the widely accessible data lake so all engineers can train their models, tune them, and explore new algorithms every day. Pyramid absorbs many workload needs—current and evolving—as long as the algorithms draw on the same user data to predict the same outcome (e.g., whether a user will click on an ad). MediaCo also configures a one-year retention period for all observations; after this period, Pyramid removes observations from the statistics and launches retraining of all application models to purge the old activity. Finally, MediaCo stores all raw observations in an encrypted store whose read accesses are disabled by default. Access to this store is granted temporarily and on a case-by-case basis to engineers who demonstrate the need for statistics beyond those that Pyramid maintains.
In addition to targeting/personalization workloads, MediaCo has other, potentially non-ML workloads, such as business analytics, trend studies, and forensics; for these, count featurization may not apply. Hence, MediaCo gives direct access to the raw-data store to engineers managing these workloads and isolates their computational resources from the targeting/personalization teams.
With this configuration, MediaCo minimizes access to its collected data on a needs basis. Assuming no entity with full access to the historical raw data is malicious, Pyramid guarantees the following (detailed in §II-B). (1) Any observations preceding the hot window when an attack begins will be hidden from the attacker. (2) Hiding is done at an individual observation level during the retention period and in bulk past the retention period. (3) Only in exceptional circumstances do engineers get access to the historical raw data. With these guarantees, MediaCo negotiates lower data loss insurance premiums and gains PR benefits for its efforts to protect user data.
II-B Threat Model
Fig. 1 illustrates Pyramid’s threat model and guarantees. Pyramid gives guarantees similar to those of forward secrecy: a one-time compromise will not allow an adversary to access all past data. Attacks are assumed to have a well-defined start time, $T^{\text{start}}_{\text{attack}}$, when the adversary gains access to the machines charged with running Pyramid, and a well-defined end time, $T^{\text{end}}_{\text{attack}}$, when administrators discover and stop the intrusion. Adversaries are assumed to not have had access to the system before $T^{\text{start}}_{\text{attack}}$, nor to have performed any action in anticipation of their attack (e.g., monitoring external predictions, the hot window, or the models’ state), nor to have continued access after $T^{\text{end}}_{\text{attack}}$. The attacker’s goal is to exfiltrate individual observations of user activities (e.g., to know if a user clicked on a specific article/ad). Historical raw data is assumed to be protected through independent means and not compromised in this attack. Pyramid’s goal is to limit the hot data in active use, which is widely accessible to the attacker.
Examples of adversaries that fit our threat model can be found among both the internal and external adversaries of a company. An external adversary may be a hacker who breaks into the company’s computing infrastructure at time $T^{\text{start}}_{\text{attack}}$ and starts looking for data that may prove of value (e.g., information about celebrities’ specific activities, what they liked or disliked, where they were in the past, etc.). An internal adversary may be a privacy-transgressing employee who spontaneously decides at $T^{\text{start}}_{\text{attack}}$ to look into some past action of a family member or friend (e.g., to check if the person has visited or liked a particular page).
After compromising Pyramid’s internal state, the attacker will gain access to data in three different representations: the hot data store containing plaintext observations, the historical counts, and the trained models themselves. The plaintext observations in the hot data store are not protected in any way. The historical statistics store contains differentially private count tables of the recent past. The attacker will learn some information from the count tables but individual records will be protected with a differentially private guarantee. Pyramid forces models to be retrained when observations are removed from the hot raw data store, so the attacker will not be able to learn anything from the models beyond what they have already learned above.
Pyramid provides three protection levels:
- P1: No protection for present or future observations. Observations in the hot data store when the attack begins, plus observations added to the hot data store while the attack is ongoing, receive no protection; i.e., observations received between ($T^{\text{start}}_{\text{attack}} - \Delta_{\text{hot}}$) and $T^{\text{end}}_{\text{attack}}$ receive no protection, where $\Delta_{\text{hot}}$ is the length of the hot window.
- P2: Protection for individual observations for the length of the retention period. Statistics about observations are retained in differentially private count tables for a predefined retention period $\Delta_{\text{retention}}$. The attacker may learn broad statistics about observations in the interval $[T^{\text{start}}_{\text{attack}} - \Delta_{\text{retention}},\ T^{\text{start}}_{\text{attack}} - \Delta_{\text{hot}}]$ but will not be able to confidently determine if a specific observation is present in the table.
- P3: Protection in bulk past the retention period. Observations past their retention period (i.e., older than $T^{\text{start}}_{\text{attack}} - \Delta_{\text{retention}}$) have been phased out of the historical statistics store and are protected separately by the historical raw data store.
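The three levels can be read as a simple classification of an observation by its timestamp relative to the start of the attack. The following is a minimal sketch of that reading, not part of Pyramid's code: the function name, the `Protection` enum, and the choice of inclusive boundaries are all assumptions made for illustration, and the sketch only covers observations received up to the end of the attack.

```python
from enum import Enum


class Protection(Enum):
    P1 = "no protection (hot window, or received during the attack)"
    P2 = "individual protection (differentially private counts)"
    P3 = "bulk protection (historical raw data store only)"


def protection_level(obs_time: float, attack_start: float,
                     hot_window: float, retention: float) -> Protection:
    """Classify an observation by its timestamp relative to the attack start.

    All arguments share one time unit (e.g., days). `hot_window` and
    `retention` are durations with hot_window < retention. Whether the
    boundaries are inclusive is an assumption of this sketch.
    """
    if obs_time >= attack_start - hot_window:
        return Protection.P1   # still in (or later added to) the hot raw data store
    if obs_time >= attack_start - retention:
        return Protection.P2   # only visible through noisy count tables
    return Protection.P3       # phased out of the count tables entirely
```

For example, with a 7-day hot window and a 365-day retention period, an observation made 50 days before the attack falls under P2.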
Finally, we assume that no states created based on the hot raw data persist once the hot window is rolled over. While we explicitly launch retraining of models registered with Pyramid, we operate under the assumption that (1) the models’ states are securely erased [22] and (2) no other state was created out of band based on the raw hot data (such as copies made by programmers).
II-C Design Requirements
Given the threat model, our design requirements are:
- R1: Limit widely accessible data. The hot data window is exposed to attackers; hence, Pyramid must limit its size subject to application-level requirements, such as the accuracy of models trained with it.
- R2: Avoid accesses to historical raw data even for evolving workloads. Pyramid must absorb as many current and evolving workload needs as possible to limit access to the historical raw data.
- R3: Support retention policies. Pyramid must enforce a company’s retention policies. Although Pyramid provides a differential privacy guarantee, no protection is stronger than securely deleting data.
- R4: Limit impact on accuracy, performance, scalability. We intend to preserve the functional properties of applications and models running on Pyramid.
III The Pyramid Architecture
Pyramid, the first selective data management architecture, builds upon the ML technique of count-based featurization and augments it with new mechanisms to meet the preceding design requirements.
III-A Background on Count-Based Featurization
Training predictive models can be challenging on data that contains categorical variables (features) with large numbers of possible values (e.g., an ID or an interest vector). Existing ML techniques that handle large feature spaces often make strong assumptions about the data, e.g., assuming a linear relationship between the features and the label (e.g., Lasso [23]). If the data does not meet these assumptions, results can be very poor.
Count-based featurization [13] is a popular approach to handling categorical variables of high cardinality. Rather than directly using the value of a categorical variable, this technique featurizes the data with the number of times a particular feature value (e.g., a user ID) was observed with each label and the conditional probability of the label given the feature value. This substantially reduces dimensionality. Suppose the raw data contains $d$ categorical features with an average cardinality of $K$ and a label of cardinality $L$, where $K \gg L$; e.g., in click prediction $K$ can be millions (number of users), while $L$ is 2 (click, non-click). Standard encoding of categorical variables [24] results in a feature space of dimension $O(dK)$, whereas with count featurization it is $O(dL)$. Count featurization can also be applied to continuous variables or continuous labels by first discretizing them; this increases dimensionality but only by a small factor.
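As a worked instance of this reduction, write $d$ for the number of categorical features, $K$ for their average cardinality, and $L$ for the label cardinality. The numbers below are hypothetical and chosen only for illustration; they are not taken from the paper's datasets.

```latex
\[
d = 10,\quad K = 10^{6},\quad L = 2:
\qquad
\underbrace{dK}_{\text{standard encoding}} = 10 \times 10^{6} = 10^{7}
\qquad\text{vs.}\qquad
\underbrace{dL}_{\text{count featurization}} = 10 \times 2 = 20 .
\]
```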
The dramatic dimensionality reduction yields important benefits. It is known that fewer dimensions permit more efficient learning, both statistically and computationally, potentially at the cost of reducing predictive accuracy. However, count featurization makes it feasible to apply advanced, nonlinear models, such as neural networks, boosted trees, and random forests. This combination of succinct data representation and powerful learning models enables substantial reduction of the training data with little loss in predictive performance. Quantified in §V, this is the insight behind our use of count-based featurization to limit data exposure.
III-B Architectural Components
Fig. 2 shows Pyramid’s architecture. Pyramid manages collected data (observations) on behalf of application models hosted by a model management system. In our case, we use Velox [25], built on Spark. Velox facilitates ML-based targeting and personalization services by implementing three functions: (1) fast, but incomplete, incorporation of new observations into models that programmers register with Velox; (2) low-latency prediction serving from these models; and (3) periodic retraining of the models to correct inconsistencies created by the incomplete incorporation of new observations. Velox saves observations in a separate data management component, Spark’s Tachyon. Pyramid replaces this component to ensure rigorous and selective protection of observations.
Pyramid itself consists of four architectural components, shown across the top of the highlighted box in Fig. 2. The first is count featurization, which leverages the known ML mechanism to count featurize observations before feeding them to models for training and prediction. The second, third, and fourth are noise infusion, data retention, and count selection, which augment count featurization with differential privacy and a set of new mechanisms to meet Pyramid’s design requirements. We discuss each component in turn.
III-B1 Count Featurization
Pyramid hijacks the stream of observations collected by Velox (the observe method) and count-featurizes them. An observation is a pair $\langle \vec{x}, l \rangle$ with a feature vector $\vec{x}$ and a label $l$. Application models predict the label (or a probability for each possible label) for a given feature vector by training on count-featurized observations. When an observation arrives, Pyramid incorporates it into two data structures: (1) the hot raw data store, which retains observations from the recent past, and (2) the historical statistics store, which consists of multiple count tables that maintain the number of occurrences of each feature with each label. We maintain count tables for all features in $\vec{x}$ and for some feature combinations. A separate set of count tables is maintained for each time window.
Featurization transforms a feature vector $\vec{x}$ into a count-featurized feature vector $\vec{x}'$, by replacing each feature $x_i$ with the conditional probabilities of each label value given $x_i$’s value. The conditional probabilities are computed directly from the count tables as discussed below. To train its models, an application requests a training set from Pyramid (getTrainSet). Pyramid featurizes the hot raw data with historical counts and returns it to the application. To predict the label for a feature vector $\vec{x}$, the application requests its featurization from Pyramid (featurize); Pyramid returns $\vec{x}'$.
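The observe / getTrainSet / featurize flow described above can be sketched with a toy in-memory class. This is an illustrative stand-in under simplifying assumptions, not Pyramid's implementation: the class name `TinyPyramid`, the snake_case method names, the dictionary-based feature vectors, and the default probability are all hypothetical, and the sketch omits noise infusion, time windows, and feature combinations.

```python
from collections import defaultdict, deque


class TinyPyramid:
    """Toy stand-in for the observe / getTrainSet / featurize flow."""

    def __init__(self, hot_size, labels=(0, 1), default_prob=0.5):
        # Hot raw data store: only the most recent `hot_size` observations.
        self.hot = deque(maxlen=hot_size)
        self.labels = labels
        self.default_prob = default_prob  # assumed fallback for unseen values
        # counts[feature_name][feature_value][label] -> number of occurrences
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def observe(self, x, label):
        """Add an observation to the hot window and to the count tables."""
        self.hot.append((x, label))
        for name, value in x.items():
            self.counts[name][value][label] += 1

    def featurize(self, x):
        """Replace each feature with P(label | feature value) for every label."""
        out = []
        for name in sorted(x):
            row = self.counts.get(name, {}).get(x[name], {})
            total = sum(row.get(lab, 0) for lab in self.labels)
            for lab in self.labels:
                out.append(row.get(lab, 0) / total if total else self.default_prob)
        return out

    def get_train_set(self):
        """Featurize the hot raw data with the (historical) counts."""
        return [(self.featurize(x), label) for x, label in self.hot]
```

Note that the count tables keep growing while the hot window stays bounded, which is the property the architecture relies on: models see a small amount of raw data enriched with aggregates over a much longer history.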
Example. Fig. 3 shows (a) a sample observation format, (b) some count tables used by Pyramid to count-featurize it, and (c) a sample count-featurized observation.
Observation format. In targeting and personalization, an observation’s feature vector typically consists of user features (e.g., id, gender, age, and previously compiled preferences) and contextual information for the observation (e.g., the URL of the article or the ad shown to the user, plus any features of these). The label might indicate whether the user clicked on the article/ad.
Count tables. Once an observation stream of the preceding type is registered with Pyramid, the userId table maintains for each user the number of clicks the user has made on any ad shown and the number of non-clicks; it therefore encodes each user’s propensity to click on ads. The urlHash table maintains for each URL the number of clicks that each user made on any ad shown on that page; it therefore encodes the page’s inherent “ad-clickability.” Pyramid maintains count tables for every feature in $\vec{x}$ and for some feature combinations with predictive potential, such as the ⟨urlHash, adId⟩ table, which encodes the joint probability of a particular ad being clicked when it is shown on a particular page.
Count featurization. To count-featurize a feature vector $\vec{x}$, Pyramid first replaces each of its features $x_i$ with the conditional probabilities computed from the count tables, e.g., $P(\text{click} \mid x_i) = \frac{\text{clicks}}{\text{clicks} + \text{non-clicks}}$, where the clicks and non-clicks are taken from the row matching the value of $x_i$ in the table corresponding to $x_i$. Pyramid also appends to $\vec{x}'$ the conditional probabilities for any feature combinations it maintains. Fig. 3(c) shows an example of feature vector $\vec{x}$ and its count-featurized version $\vec{x}'$. This is a simplified version of the count featurization function. We can also include the raw counts in $\vec{x}'$, and support non-binary categorical labels by including conditional probabilities for each label. To avoid featurizing with an effectively random probability when a given feature value has very few counts, we estimate the variance of our probability estimate and, if it is too high, featurize with a default probability $P_0$.
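The fallback to a default probability for rarely seen feature values can be sketched as follows. The text does not say which variance estimator or threshold Pyramid uses, so both are assumptions here: the sketch uses the variance of a binomial proportion, computed from an add-one smoothed estimate so that a single click out of one impression is not treated as a certain outcome. The function name and default values are hypothetical.

```python
def click_probability(clicks: int, non_clicks: int,
                      default_prob: float = 0.5,
                      max_variance: float = 0.01) -> float:
    """Return P(click | feature value), or a default when the estimate is unreliable.

    The variance estimator and the threshold are assumptions of this sketch.
    """
    n = clicks + non_clicks
    if n == 0:
        return default_prob
    p = clicks / n
    # Add-one smoothing keeps the variance estimate away from zero
    # when every observation so far has the same label.
    p_smooth = (clicks + 1) / (n + 2)
    variance = p_smooth * (1.0 - p_smooth) / n
    return p if variance <= max_variance else default_prob
```

With these defaults, a value seen 100 times (4 clicks, 96 non-clicks) is featurized with 0.04, while a value seen once falls back to the default.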
Training and prediction. Suppose a boosted-tree model is trained on a count-featurized dataset ($\langle \vec{x}', l \rangle$ pairs). It might find that for users with a click propensity over 0.04, the chances of a click are high for ads whose clickability exceeds 0.05 placed on websites with ad-clickability over 0.1. In this case, the model would predict a “click” label for the feature vector $\vec{x}'$ in Fig. 3(c).
Process. Pyramid count-featurizes all features for each observation type. For categorical features, we featurize them as described above. For low-cardinality features, we can additionally include the raw feature values in $\vec{x}'$ alongside the conditional probabilities. Continuous features are first mapped to a discrete space, binning them by percentiles, and then count-featurized as categorical. We do the same with continuous labels.
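Percentile binning of a continuous feature can be sketched with the Python standard library alone. The helper names and the number of bins are assumptions for illustration; the text does not specify how many bins Pyramid uses or how cut points are computed.

```python
import bisect
import statistics


def fit_percentile_bins(values, n_bins=10):
    """Return the n_bins - 1 cut points that split `values` into equal-mass bins."""
    return statistics.quantiles(values, n=n_bins)


def to_bin(value, cut_points):
    """Map a continuous value to a bin index in [0, len(cut_points)]."""
    return bisect.bisect_right(cut_points, value)
```

The resulting bin index can then be treated like any other categorical feature value and counted per label.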
Pyramid maintains hot windows and count tables as follows. There is one hot window for each observation stream. There is one count table per feature or feature group; it has a column for each label and a row for each value the feature can take. To support granular retention times, each count table is composed of multiple windowed count tables holding data for observations collected during disjoint windows of time. The complete count table is the sum of the associated windowed count tables. When a new observation arrives, it is added to the hot window and made immediately available to the models for (re)training. The hot window is a sliding window that may be sized differently from the count table window. The observation is also added to the current windowed count table; this table is withheld when computing the complete count table until it has finished populating. At that point, Pyramid begins using it as part of the featurization process, phases out the oldest count table if it is past its retention period, and begins populating a new count table that has been initialized with differentially private noise. Once count tables are incorporated into the featurization process, they are never updated again.
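The windowed count table lifecycle can be sketched as below. This is a simplified illustration, not Pyramid's code: the class name is hypothetical, retention is expressed as a number of finished windows, timestamps are assumed to arrive in non-decreasing order, and the differentially private noise that initializes each new window is omitted.

```python
from collections import Counter, deque


class WindowedCountTable:
    """One count table made of per-window tables over disjoint time windows."""

    def __init__(self, window_len, retention_windows):
        self.window_len = window_len
        # Finished windows; the oldest is dropped once retention is exceeded.
        self.finished = deque(maxlen=retention_windows)
        self.current = Counter()   # still populating, withheld from queries
        self.current_start = 0

    def add(self, t, value, label):
        """Count one observation of (value, label) at time t (non-decreasing)."""
        while t >= self.current_start + self.window_len:
            self._roll()
        self.current[(value, label)] += 1

    def _roll(self):
        self.finished.append(self.current)   # frozen: never updated again
        self.current = Counter()             # a real system would add noise here
        self.current_start += self.window_len

    def count(self, value, label):
        """Complete count: the sum over finished windows only."""
        return sum(table[(value, label)] for table in self.finished)
```

Because finished tables are never modified and the oldest one is discarded wholesale, forgetting observations past the retention period requires no per-record bookkeeping.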
Count-min sketches (CMSes). A key challenge with count featurization is its storage requirement. For a categorical variable of cardinality $K$ and a label of cardinality $L$, the count table is of size $O(KL)$. A common solution, used in Azure [20], is to store each table in a Count-Min Sketch (CMS) [26], a data structure that approximates counts in sub-linear space. A CMS consists of a 2D array with an independent hash function for each row. When a new feature arrives, the CMS uses the hash function for each row to assign the feature to a column and increment the value in that cell.
We query the CMS for a feature count by hashing the feature into a column of each row and taking the minimum value. Despite overcounting from collisions, CMS provides sufficiently accurate count estimates to train ML models. With a CMS, we can maintain more and/or larger count tables with bounded storage overheads. This gives developers flexibility in the types of modeling they can do atop in-use data without tapping into the historical data store. The CMS poses challenges to our noise infusion process, as described next.
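A minimal count-min sketch, following the update and query rules described above, might look like this. The salted BLAKE2b hash is a stand-in chosen for the sketch (it is an assumption, not the hash family used by Pyramid), and the class name and the `depth`/`width` parameters are hypothetical.

```python
import hashlib


class CountMinSketch:
    """2D array of counters with one hash function per row."""

    def __init__(self, depth, width):
        self.depth = depth    # number of rows, i.e. hash functions
        self.width = width    # number of columns per row
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        # Salting with the row index gives a different hash per row.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, amount=1):
        """Increment the key's cell in every row."""
        for row in range(self.depth):
            self.rows[row][self._column(row, key)] += amount

    def estimate(self, key):
        """Collisions only inflate cells, so the minimum is the best estimate."""
        return min(self.rows[row][self._column(row, key)] for row in range(self.depth))


sketch = CountMinSketch(depth=4, width=32)
for key, times in [("user:1", 3), ("user:2", 5), ("ad:9", 1)]:
    for _ in range(times):
        sketch.add(key)
```

Without noise, the estimate never falls below the true count; it can only exceed it when another key collides with the queried key in every row.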
III-B2 Noise Infusion
Pyramid’s key contribution is to retrofit count featurization, a technique developed for performance and scalability, to protect past observations against exposure to attack. Pyramid infuses noise into the count tables to protect these observations. While we leverage differential privacy methods [21], correctly applying these methods in our context poses scaling challenges. For example, each observation contributes to multiple count tables, increasing the noise required to guarantee differential privacy, and a naïve application degrades accuracy when there are many count tables. We present two techniques to address this challenge. First, we use a weighted noise infusion technique to mitigate the impact of noise, allowing us to navigate the privacy/utility trade-off. Second, for high noise levels, we replace the CMS by a count-median sketch [27], a data structure with weaker accuracy guarantees than CMS but that provides an unbiased frequency estimate, making it more robust to negative noise values. To our knowledge, we are the first to observe that the count-median sketch structure is better suited to differential privacy. After a brief overview of differential privacy, we describe these techniques.
Differential privacy properties. Pyramid’s noise infusion component uses four differential privacy properties:
1. Privacy guarantees: Let $D_1$ be the database of past observations, $D_2$ be a database that differs from $D_1$ by exactly one observation (i.e., $D_2$ adds or removes 1 observation), and $S$ any set of count tables in the range of a randomized query $Q$ that builds a count table from a window of observations. The count table query $Q$ is $\epsilon$-differentially private if $P[Q(D_1) \in S] \le e^{\epsilon} \cdot P[Q(D_2) \in S]$. In other words, adding or removing an observation in $D_1$ does not significantly change the probability distribution of possible count tables; therefore, the count table does not leak significant information about any specific observation [21]. $\epsilon$ is called the query’s privacy budget.
2. Laplace distribution: Let a query’s sensitivity be the magnitude of the change in the query result triggered by adding or removing a single observation. If the query has sensitivity $s$, then adding noise drawn from a Laplace distribution with scale parameter $b = s/\epsilon$ guarantees that the result is $\epsilon$-differentially private [21]. Increasing $b$ increases the standard deviation of the distribution (the standard deviation of a Laplace distribution with parameter $b$ is $b\sqrt{2}$).
3. Composability: Differentially private queries are composable: the sum of $\epsilon_i$-differentially private queries is $\sum_i \epsilon_i$-differentially private [28]. This lets us maintain multiple count tables, possibly with different budgets, and combine them without breaking guarantees. (Advanced composition theorems allow sublinear loss in the privacy budget by relaxing the guarantees to $(\epsilon, \delta)$-differential privacy [29], but we do not explore that here.)
4. Post-processing resilience: Any computation on a differentially private data release remains differentially private [29]. This is a crucial point for Pyramid’s protection guarantees: it ensures that guarantee P2, the protection of individual past observations during their lifetime, holds for each model’s internal state and outputs. As long as models comply with retrain calls and erase all internal state when they do, their output is differentially private with regard to observations outside the hot window.
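The Laplace and composability properties can be made concrete with a short sketch. It uses only the standard library; the helper names are hypothetical, and sampling a Laplace variable as the scaled difference of two unit exponential draws is a standard identity, used here for illustration rather than taken from the paper.

```python
import math
import random
import statistics


def laplace_scale(sensitivity, epsilon):
    """Scale b = s / epsilon that makes a sensitivity-s query epsilon-DP."""
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale): scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, sensitivity, epsilon, rng):
    """Release a count with Laplace noise calibrated to (sensitivity, epsilon)."""
    return true_count + laplace_noise(laplace_scale(sensitivity, epsilon), rng)


def composed_budget(epsilons):
    """Basic composition: budgets of separately released queries add up."""
    return sum(epsilons)


rng = random.Random(0)
b = laplace_scale(sensitivity=1, epsilon=0.5)             # b = 2.0
samples = [laplace_noise(b, rng) for _ in range(20000)]
empirical_stdev = statistics.pstdev(samples)              # close to b * sqrt(2)
```

Any function applied to `private_count` outputs, such as training a model on them, needs no extra budget; that is the post-processing property.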
Basic noise infusion process. We apply these known properties when creating count tables for the hot window. Upon creating a count table, we initialize each cell of the CMS storing that table with a random draw from a Laplace distribution. This noise is added only once: the count tables are updated as observations arrive and are sealed when the hot window rolls over. To determine the correct parameter for the Laplace distribution, $b$, we must account for three factors: (1) the internal structure of the CMS, (2) the number of observations we want to hide simultaneously, and (3) the number of count tables (features or feature combinations) we are maintaining.
First, an exact count table has sensitivity $1$, since adding or removing an observation can only change one count by 1. For a CMS, each observation is counted once per hash function; hence, the sensitivity is $h$, the number of hash functions. Second, if we aim to hide any group of $k$ observations with a privacy budget of $\epsilon$, then we make a count table $\epsilon/k$-differentially private by adding noise from a Laplace distribution of parameter $b = hk/\epsilon$ in every cell of the CMS. Third, we must maintain multiple count tables for the different features and feature groups. Since each observation affects every count table, we need to split the privacy budget among them, e.g., splitting it evenly across $n$ count tables by adding noise with $b = nhk/\epsilon$ to each table.
The third consideration poses a significant challenge for Pyramid: the amount of noise we apply grows linearly with the number of count tables we keep. Since the amount of noise directly affects application accuracy, this yields a protection/accuracy tradeoff, which we address with weighted noise infusion.
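The three factors can be combined step by step. In the derivation below, $h$ is the number of CMS hash functions, $k$ the number of observations hidden together, $n$ the number of count tables, and $\epsilon$ the total privacy budget; these symbol names are chosen here for illustration.

```latex
\begin{align*}
b_{\text{CMS}}   &= \frac{h}{\epsilon}
  && \text{(sensitivity } h \text{: one cell per hash function)} \\
b_{\text{group}} &= \frac{hk}{\epsilon}
  && \text{(hide any group of } k \text{ observations)} \\
b_{\text{table}} &= \frac{nhk}{\epsilon}
  && \text{(budget split evenly over } n \text{ count tables)} \\
\sigma_{\text{table}} &= \sqrt{2}\, b_{\text{table}}
  && \text{(standard deviation of the noise in each cell)}
\end{align*}
```

As a purely hypothetical example, $h = 5$, $k = 1$, $n = 20$, and $\epsilon = 1$ give $b_{\text{table}} = 100$ and $\sigma_{\text{table}} \approx 141$, so doubling the number of count tables doubles the noise in every cell.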
Weighted noise infusion process. We note that count tables are not all equally susceptible to noise. For example, in our movie recommender, the per-user count table most likely contains low values, since each user rates only a few movies. Moreover, we do not expect these counts to change significantly when adding more data, since single users will not rate significantly more movies. Each genre table, however, contains higher values (1M or more), since each genre characterizes multiple movies, each rated by many users. Sharing noise equally between tables would pollute all counts with the same standard deviation: a reasonable amount for genres, but devastating for the user feature, which essentially becomes random.
Pyramid’s weighted noise infusion distributes the privacy budget unevenly across count tables, adding less noise to low-count features. This way, we retain more utility from those tables, and the composability property of differential privacy preserves our protection guarantees. Each table’s share of noise is determined automatically, based on the count values observed in the hot window. Specifically, the user specifies a quantile, and the privacy budget is shared among the features proportionally to this quantile of their counts. For instance, we use the first percentile, so that 99% of the counts for a feature will be less affected by the noise. Sharing the privacy budget proportionally to the counts is a heuristic that makes the noise’s standard deviation proportional to the typical counts of each feature. This scheme is also independent of the learning algorithm.
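One way to read this heuristic is sketched below. It is an interpretation, not the paper's exact procedure: it assumes each table's budget share is inversely proportional to the chosen quantile of its counts, so that the resulting Laplace scale is proportional to that quantile. The function names and the counts are hypothetical, and the quantile here is computed without the differentially private step that the weight selection requires.

```python
import math


def quantile(counts, q):
    """Nearest-rank q-quantile (0 < q <= 1) of a non-empty list of counts."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def weighted_scales(table_counts, epsilon, sensitivity, q=0.01):
    """Split epsilon so each table's noise scale tracks its typical count.

    Returns {table_name: (epsilon_share, laplace_scale)}.
    """
    typical = {name: max(1, quantile(c, q)) for name, c in table_counts.items()}
    inverse_total = sum(1.0 / t for t in typical.values())
    result = {}
    for name, t in typical.items():
        share = epsilon * (1.0 / t) / inverse_total  # low counts get more budget
        result[name] = (share, sensitivity / share)  # so they get less noise
    return result


# Hypothetical counts: a per-user table with small counts, a genre table with large ones.
counts = {"user": [5, 10, 20, 40], "genre": [1_000_000, 2_000_000]}
scales = weighted_scales(counts, epsilon=1.0, sensitivity=1.0)
```

In this example an even split would give both tables a scale of 2, whereas the weighted split gives the user table a scale close to 1 and moves almost all of the noise to the genre table, whose counts are large enough to absorb it.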
Section V shows that weighted noise infusion is vital for providing protection while preserving accuracy at scale: without it, the cost of hiding single observations is a 15% accuracy loss; with it, the loss is less than 5%.
The weight selection process must itself be made differentially private, lest it leak information about the hot window used to compute the weights. While our IEEE Security & Privacy paper [30] did not address this problem, we have since modified Pyramid to compute feature weights in a differentially private way. §-A describes our method, which can be summarized as follows. We compute the weights every so often (e.g., every month) using the data in one hot window. We use a configurable portion of one window’s privacy budget and leverage smooth sensitivity [31] to compute differentially private count percentiles, which we then use as feature weights. We compute differentially private percentiles by adapting the J-List algorithm for the differentially private median described in [31]. §-A2 shows that we can make the weighted noise infusion calculation differentially private without reducing the accuracy wins gained from doing weighted noise infusion.
Unbiased private count-median sketch. Another factor that degrades performance when adding differentially private noise is the interaction between the noise and the CMS. In the CMS, the final estimate for a count is $\min_i c_i$, where $c_i$ is the cell that the key hashes to in row $i$. The minimum makes sense here since collisions can only increase the counts. The Laplace distribution, however, is symmetric around zero, so we may add negative noise to the counts. Taking the minimum of multiple draws (each cell is initialized with a random draw from the distribution) thus selects the most extreme negative values, creating a downward bias that can be very large for a small $\epsilon$.
We observe that because the mean of the Laplace distribution is 0, an unbiased estimator would not suffer from this drawback. For tables with large noise, we thus use a count-median sketch [27], which differs in two ways: 1) each row $i$ has another hash function that maps the key to a random sign $s_i \in \{+1, -1\}$, with the key’s cell in that row updated by $s_i$ rather than by 1; 2) the estimator is the median of all counts multiplied by their sign, instead of the minimum. The signed update means that collisions have an expected impact of zero, since they have an equal chance of being negative or positive, making the cell an unbiased estimate of the true count. The median is a robust estimate that preserves the unbiased property.
Using this count-median sketch reduces the impact of noise, since values from the Laplace distribution are exponentially concentrated around the mean of zero. §V shows that for small $\epsilon$, or a large number of features, it is worth trading the CMS’s better guarantees for reduced noise impact with the count-median sketch.
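The difference between the two estimators can be seen in a small sketch. It assumes salted BLAKE2b hashes for the column and sign functions and a Laplace draw in every cell at creation; the class, the helper names, and the parameters are hypothetical rather than taken from the prototype.

```python
import hashlib
import random
import statistics


def laplace(scale, rng):
    """Draw from Laplace(0, scale) as a scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def keyed_hash(row, key, purpose):
    """Per-row hash; `purpose` separates the column hash from the sign hash."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=row.to_bytes(8, "little"),
        person=purpose,
    ).digest()
    return int.from_bytes(digest, "little")


class NoisyCountMedianSketch:
    def __init__(self, depth, width, noise_scale, rng):
        self.depth = depth
        self.width = width
        # Every cell starts from one Laplace draw, added once at creation.
        self.rows = [
            [laplace(noise_scale, rng) for _ in range(width)] for _ in range(depth)
        ]

    def _cell(self, row, key):
        column = keyed_hash(row, key, b"column") % self.width
        sign = 1 if keyed_hash(row, key, b"sign") % 2 == 0 else -1
        return column, sign

    def add(self, key, amount=1):
        """Signed update: colliding keys cancel out in expectation."""
        for row in range(self.depth):
            column, sign = self._cell(row, key)
            self.rows[row][column] += sign * amount

    def estimate(self, key):
        """Median of the signed cells instead of the minimum."""
        signed = []
        for row in range(self.depth):
            column, sign = self._cell(row, key)
            signed.append(sign * self.rows[row][column])
        return statistics.median(signed)


def mean_noise_bias(depth, noise_scale, trials, rng):
    """Average min-of-depth and median-of-depth over pure Laplace noise."""
    mins, medians = [], []
    for _ in range(trials):
        draws = [laplace(noise_scale, rng) for _ in range(depth)]
        mins.append(min(draws))
        medians.append(statistics.median(draws))
    return statistics.fmean(mins), statistics.fmean(medians)


rng = random.Random(0)
sketch = NoisyCountMedianSketch(depth=5, width=64, noise_scale=0.0, rng=rng)
for _ in range(7):
    sketch.add("user:123")
exact = sketch.estimate("user:123")  # no noise and no other keys: the true count

min_bias, median_bias = mean_noise_bias(
    depth=5, noise_scale=10.0, trials=2000, rng=rng
)
```

`mean_noise_bias` isolates the effect discussed above: over pure noise, the minimum of five draws averages well below zero, while the median stays close to zero.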
III-B3 Data Retention
While differential privacy provides a reasonable level of protection for past observations, complete removal of information remains the cleanest, strongest form of protection (design R3 in §II-C). Pyramid supports data expiration with windowed count tables. When an observation arrives, Pyramid updates the count tables for the current count window only. To featurize an observation, Pyramid sums the relevant counts across windows. Periodically, it drops the oldest window and invokes retraining of all models in Velox (retrain method). Our use of count-based featurization supports such behaviors because retraining is cheap (§V-E), so we can afford to do it frequently.
III-B4 Count Selection
Pyramid seeks to support workload evolution (model changes/additions, such as future model M4 in Fig. 2) using only the widely accessible stores without tapping into the historical raw data store. To do so, it uses two approaches. First, it stores the count tables in a very compact representation, the count-median sketches, so it can afford to keep plenty of count tables. Second, it includes an automatic process of count table selection that inspects the data to identify feature combinations worth counting, whether they are used in the current workloads or not. This technique is useful because count featurization tends to obscure correlations between features. For example, different users may have different opinions about specific ads. Although that information could be inferred by a learning algorithm from the raw data points, it is not accessible in the count-featurized data unless we explicitly count the joint occurrences of specific users with specific ads, i.e., maintain a table for the (user, ad) group.
We adapted several feature selection techniques [32] to select feature groups and describe one here. Mutual Information (MI) is a measure of dependence between two random variables. A common feature selection technique keeps features of high MI with the label. We extend this mechanism for group count selection. Our goal is to identify feature groups that provide more information about the label than individual features. For each feature $x_i$, we find all other features $x_j$ such that $x_i$ and $x_j$ together exhibit higher MI with the label than $x_i$ alone. From these groups, we select a configurable number with highest MIs. To find promising groups of larger sizes, we apply this process greedily, trying out new features with existing groups. For each selected group, Pyramid creates and maintains a count table.
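The pair-selection step can be sketched as follows, using the empirical mutual information of discrete sequences. This is an illustrative simplification: it evaluates unordered pairs once and keeps a pair when its joint MI exceeds the MI of at least one member alone, it omits the greedy extension to larger groups, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


def promising_pairs(features, labels, top_k, tolerance=1e-9):
    """Return up to top_k feature pairs, ranked by joint MI with the label."""
    single = {name: mutual_information(col, labels) for name, col in features.items()}
    scored = []
    for a, b in combinations(features, 2):
        joint_mi = mutual_information(list(zip(features[a], features[b])), labels)
        # Keep the pair only if it is more informative than a member alone.
        if joint_mi > min(single[a], single[b]) + tolerance:
            scored.append((joint_mi, (a, b)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_k]]


# Toy data: the label depends on the user and the ad jointly (an XOR pattern),
# so neither feature is informative alone, and "site" is uninformative.
features = {"user": [0, 0, 1, 1], "ad": [0, 1, 0, 1], "site": [0, 0, 0, 0]}
labels = [0, 1, 1, 0]
selected = promising_pairs(features, labels, top_k=2)
```

In this toy data each feature has zero MI with the label on its own, yet the (user, ad) pair determines the label completely, which is the kind of correlation that per-feature count tables cannot capture.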
This exploration of promising groups operates on the hot window of raw data. Because the hot raw data is limited, the selection may not be entirely reliable. Therefore, count tables for new groups are added on a “trial basis.” As more data accumulates in the counts, Pyramid re-evaluates them by computing the MI metric on the count tables. With the increased amount of data, Pyramid can make a more reliable decision regarding which count tables to keep and which to drop. Because count selection—like feature selection—is never perfect, we give engineers an API to specify groups that they know are worth counting from domain knowledge. Finally, like the weight selection process, count selection should be made differentially private so the groups selected in a particular hot window, which are preserved over time, do not leak information about the window’s data in the future. §-A3 proposes a method for making count selection private.
III-C Supported Workload Evolution
Count featurization is a model-independent preprocessing step, allowing Pyramid to absorb some common evolutions during an ML application’s life cycle without tapping the historical raw data store. §V-G gives anecdotal evidence of this claim from a production workload. This section reviews the types of workload changes Pyramid currently absorbs.
A developer may want to change four aspects of the model: (1) the algorithm used to train the model, (2) hyperparameters for the model or for the underlying optimization algorithm, (3) features used by the model, and (4) the predicted label. Pyramid supports (1) and (2), partially supports (3), and usually does not support (4).
Algorithm changes: Supported. Pyramid allows developers to move between types of models and libraries used to train those models as long as they are using features and labels that are already counted. In our evaluation we experimented with linear models and neural networks in Vowpal Wabbit [33] and gradient boosted trees in scikit-learn [34] using the same count tables.
Hyperparameter tuning: Supported. By far the most common type of model change we encountered, both in our own evaluation and in reports from a production setting, was hyperparameter tuning. For example, a developer may want to change model hyperparameters, such as the number of hidden units in a neural network, or tune parameters of the underlying optimization algorithm, such as the learning rate or an L1/L2 regularization penalty. Changing hyperparameters is independent from the underlying features so is supported by Pyramid.
Feature changes: Partially supported. Pyramid supports making minimal feature changes. A developer may want to perform one of three types of feature changes: adding new features, removing existing features, or adding interactions between existing features. Pyramid trivially supports removing existing features, and lets developers add new features if they are based on existing ones. However, the developer could not create an interaction between two features if those features were not already counted together. Introducing new feature combinations or interactions requires creating new count tables. This highlights the importance of count selection to support workload evolution.
Label changes: Mostly unsupported. Changes in predicted labels are not supported unless the new label is a subset of an existing label. For example, a news recommender could not start predicting retention time instead of clicks unless retention time was previously declared as a label. If, however, a label exists that tracks retention time in time buckets, Pyramid can support new, coarser labels, such as the three classes “0 seconds,” “less than a minute,” and “more than a minute.”
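One plausible way to derive a coarser label from existing counts is to merge the label columns of each count-table row, as sketched below. The paper does not spell out this mechanism, so the function, the bucket names, and the counts are assumptions made for illustration.

```python
def coarsen_labels(count_row, mapping):
    """Merge the label columns of one count-table row into coarser labels."""
    merged = {}
    for label, count in count_row.items():
        coarse = mapping[label]
        merged[coarse] = merged.get(coarse, 0) + count
    return merged


# Hypothetical retention-time buckets counted for one feature value.
row = {"0s": 40, "1-30s": 25, "31-59s": 15, "60-300s": 12, ">300s": 8}
mapping = {
    "0s": "0 seconds",
    "1-30s": "less than a minute",
    "31-59s": "less than a minute",
    "60-300s": "more than a minute",
    ">300s": "more than a minute",
}
coarse_row = coarsen_labels(row, mapping)
```

Since the merge only reads counts that were already released, it falls under the post-processing property and consumes no additional privacy budget, although the noise of the merged columns adds up.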
III-D Summary
With these components, Pyramid meets the design requirements noted in §II-C, as follows. R1: By enhancing the training set with historical statistics gathered over a longer period of time, we minimize the hot data. R2: By automatically identifying combinations of features worth maintaining, we avoid having to access the historical raw data for workloads that use the same observation streams to predict the same label. R3: By rolling the count windows and retraining the application models, we support data retention policies, albeit at a coarse level. §V evaluates R4: accuracy and performance impact.
IV Prototype
Pyramid is implemented in 2600 lines of Scala, as a modular library. It integrates into the feature engineering stage of an ML pipeline, before the actual learning algorithms are invoked. The modular backend allows count tables to be stored locally in memory or in a remote datastore such as Redis or Cassandra.
We integrated Pyramid into the Velox model management system [25] with minimal effort, by adding/modifying around 500 lines of code. The changes we made to Velox involve interposing on all of Velox’s interfaces that interact with raw data (e.g., adding observations, making predictions, and retraining). Now prediction requests are passed through the Pyramid featurization layer, which performs count featurization.
One of Velox’s key contributions is performing low latency predictions by pushing models to application servers. To enable low-latency predictions, Pyramid periodically replicates snapshots of the central count tables to the application servers, allowing them to perform featurization locally. §V-E evaluates prediction performance in Velox/Pyramid with and without this optimization.
V Evaluation
We evaluate Pyramid using different versions of three data-driven applications: two ad targeting applications, two movie recommendation applications, and MSN’s production news personalization system. We compare models on count-featurized data to state-of-the-art models trained on raw data, and answer these questions:
- Q1. Can we accurately learn on less data using counts?
- Q2. How does past-data protection impact utility?
- Q3. Does counting feature groups improve accuracy?
- Q4. How efficient is Pyramid?
- Q5. To what problems does Pyramid apply?
Our evaluation yields four findings: (1) On classification problems, count featurization lets models perform within 4% of state-of-the-art models while training on less than 1% of the data. (2) Count featurization enables powerful nonlinear algorithms, such as neural networks and boosted trees, that would otherwise be infeasible due to high-cardinality features. (3) Protecting individual past observations with differential privacy adds a 1% penalty to the accuracy, which remains within 5% of state-of-the-art models. (4) Pyramid’s performance overheads are small.
V-A Methodology
References
- [1] J. Eng, “OPM hack: Government finally starts notifying 21.5 million victims,” http://www.nbcnews.com/tech/security/opm-hack-government-finally-starts-notifying-21-5-million-victims-n437126, 2015.
- [2] T. Gryta, “T-Mobile customers’ information compromised by data breach at credit agency,” http://www.wsj.com/articles/experian-data-breach-may-have-compromised-roughly-15-million-consumers-1443732359, 2015.
- [3] S. Gorman, “NSA officers spy on love interests,” http://blogs.wsj.com/washwire/2013/08/23/nsa-officers-sometimes-spy-on-love-interests/, 2013.
- [4] C. Ornstein, “Celebrities’ medical records tempt hospital workers to snoop,” https://www.propublica.org/article/clooney-to-kardashian-celebrities-medical-records-hospital-workers-snoop, 2015.
- [5] D. Wilson, “Hearst’s VP of data on connecting the data dots,” http://www.pubexec.com/article/hearsts-vp-data-connecting-data-dots/, 2014.
- [6] L. Rao, “Google consolidates privacy policy; will combine user data across services,” http://techcrunch.com/2012/01/24/google-consolidates-privacy-policy-will-combine-user-data-across-services/, 2012.
- [7] O. Chiu, “Introducing Azure Data Lake,” https://azure.microsoft.com/en-us/blog/introducing-azure-data-lake/, 2015.
- [8] B. Schneier, “Data is a toxic asset,” https://www.schneier.com/blog/archives/2016/03/data_is_a_toxic.html, 2015.
