The Adversarial Robustness of Sampling
Omri Ben-Eliezer, Eylon Yogev

TL;DR
This paper investigates the vulnerability of common sampling methods like Bernoulli and reservoir sampling to adaptive adversarial attacks in streaming data, revealing that robustness depends on the complexity of the set system and proposing a modified sample size bound.
Contribution
It demonstrates that standard sampling sizes are vulnerable to adaptive adversaries and proposes a simple modification replacing VC-dimension with the logarithm of the set system's size to ensure robustness.
Findings
Adaptive adversaries can unrepresentatively skew samples with small sizes in certain set systems.
Replacing VC-dimension with log of set system size in sample bounds enhances robustness.
The proposed modification nearly matches the attack's theoretical lower bound.
Omri Ben-Eliezer Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
Eylon Yogev Department of Computer Science, Technion, Haifa, Israel. Supported by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 742754, and by a grant from the Israel Science Foundation (no. 950/16).
Abstract
Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet “representative” subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample “very unrepresentative” of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner.
Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d/\varepsilon^2)$ is an $\varepsilon$-approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U, \mathcal{R})$. Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is negative: We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of $1$) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack.
However, this attack is “theoretical only”, requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log |\mathcal{R}|$. That is, the Bernoulli and reservoir sampling algorithms with sample size $\Omega(\log |\mathcal{R}| / \varepsilon^2)$ output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack.
1 Introduction
Random sampling is a simple, generic, and universal method to deal with massive amounts of data across all scientific disciplines. It has wide-ranging applications in statistics, databases, networking, data mining, approximation algorithms, randomized algorithms, machine learning, and other fields (see e.g., [CJSS03, JMR05, JPA04, CDK+11, CG05, CMY11] and [Cha01, Chapter 4]). Perhaps the central reason for its wide applicability is the fact that it (provably, and with high probability) suffices to take only a small number of random samples from a large dataset in order to “represent” the dataset truthfully (the precise geometric meaning is explained later). Thus, instead of performing costly and sometimes infeasible computations on the full dataset, one can sample a small yet “representative” subset of the data, perform the required analysis on this small subset, and extrapolate (approximate) conclusions from the small subset to the entire dataset.
The analysis of sampling algorithms has mostly been studied in the non-adaptive (or static) setting, where the data is fixed in advance, and then the sampling procedure runs on the fixed data. However, it is not always realistic to assume that the data does not change during the sampling procedure, as described in [MNS11, GHR+12, GHS+12, HW13, NY15]. In this work, we study the robustness of sampling in an adaptive adversarial environment.
The adversarial environment.
At a high level, the model is a two-player game between a randomized streaming algorithm, called Sampler, and an adaptive player, Adversary. In each round,
1. Adversary first submits an element to Sampler. The choice of the element can depend, possibly in a probabilistic manner, on all elements submitted by Adversary up to this point, as well as all information that Adversary observed from Sampler up to this point.
2. Next, Sampler probabilistically updates its internal state, i.e., the sample that it currently maintains. An update step usually involves an insertion of the newly received element to the sample with some probability, and sometimes deletion of old elements from the sample.
3. Finally, Adversary is allowed to observe the current (updated) state of Sampler, before proceeding to the next round.

Adversary’s goal is to make the sample as unrepresentative as possible, causing Sampler to reach false conclusions about the data stream. The game is formally described in Section 2.
Adversarial scenarios are common and arise in different settings. An adversary uses adversarial examples to fool a trained machine learning model [SZS+14, MHS19]; in the field of online learning [Haz16], adversaries are typically adaptive [SS17, LMPL18]. An online store suggests recommended items based on a sample of previous purchases, which in turn influences future sales [Sha12, GHR+12]. A network device routes traffic according to statistics pulled from a sampled substream of packets [DLT05], and an adversary that observes the network’s traffic and learns the device’s routing choices might cause a denial-of-service attack by generating a small amount of adversarial traffic [NY15]. A high-frequency stock trading algorithm monitors a stream of stock orders and places buy/sell requests based on statistics drawn from samples; a competitor might fool the sampling algorithm by observing its requests and modifying future stock orders accordingly. An autonomous vehicle receives physical signals from its immediate environment (which might be adversarial [SBM+18]) and has to decide on a suitable course of action.
Even when there is no apparent adversary, the adaptive perspective is sometimes natural and required. For instance, adaptive data analysis [DFH+15, WFRS18] aims to understand the challenges arising when data arrives online, such as data reuse, the implicit bias “collected” over time in scientific discovery, and the evolution of statistical hypotheses over time. In graph algorithms, [CGP+18] observed that an adversarial analysis of dynamic spanners would yield a simpler (and quantitatively better) alternative to their work.
In view of the importance of robustness against adaptive adversaries, and the fact that random sampling is very widely used in practice (including in streaming settings), we ask the following.
*Are sampling algorithms robust against adaptive adversaries?*
Bernoulli and reservoir sampling.
We mainly focus on two of the most basic and well-known sampling algorithms: Bernoulli sampling and reservoir sampling. The Bernoulli sampling algorithm with parameter $p \in [0, 1]$ runs as follows: whenever it receives a stream element $x_i$, the algorithm stores the element with probability $p$. For a stream of length $n$ the sample size is expected to be $np$; furthermore, it is well-concentrated around this value. We denote this algorithm by BernoulliSample.
The classical reservoir sampling algorithm [Vit85] (see also [Knu97, Section 3.4.2] and a formal description in Section 2) with parameter $k \in [1, n]$ maintains a uniform sample of fixed size $k$, acting as follows. The first $k$ elements it receives, $x_1, \ldots, x_k$, are simply added to the memory with probability one. When the algorithm receives its $i$-th element $x_i$, where $i > k$, it stores it with probability $k/i$, by overriding a uniformly random element from the memory (so the memory size is kept fixed to $k$). We henceforth denote this algorithm by ReservoirSample.
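The two procedures described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function names `bernoulli_sample` and `reservoir_sample` and the optional `rng` argument are assumptions introduced here.

```python
import random

def bernoulli_sample(stream, p, rng=None):
    """Keep each stream element independently with probability p."""
    rng = rng or random.Random()
    return [x for x in stream if rng.random() < p]

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform sample of fixed size k over the stream."""
    rng = rng or random.Random()
    memory = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            memory.append(x)              # the first k elements are always stored
        elif rng.random() < k / i:        # the i-th element is stored with probability k/i
            memory[rng.randrange(k)] = x  # it overrides a uniformly random slot
    return memory
```

For a stream of length n, `bernoulli_sample` returns about n*p elements on average, while `reservoir_sample` always returns exactly k elements once the stream has at least k of them.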
Attacking sampling algorithms.
To answer the question above of whether sampling algorithms are robust against adversarially chosen streams, we must first define a notion of a representative sample, as several notions might be appropriate. However, we begin the discussion with an example showing how to attack the Bernoulli (and reservoir) sampling algorithm with respect to nearly any definition of “representative”.
Consider a setting where the stream consists of points in the one-dimensional range of real numbers $[0, 1]$. BernoulliSample receives these points and samples each one independently with probability $p$. One can observe that, in the static setting and for sufficiently large $p$, the sampled set will be a good representation of the entire $n$ points for various definitions of the term “representation”. For example, the median of the stream will be $\varepsilon$-close¹ to the median of the sampled elements with high probability, as long as $p \ge c/(\varepsilon^2 n)$ for some constant $c > 0$ (this also holds for any other quantile).
¹ The term “close” here means that the median of the sampled set will be an element whose order among the elements of the full stream, when the elements are sorted by value from smallest to largest, is within the range $(1 \pm \varepsilon) n/2$, with high probability, where the parameter $\varepsilon$ depends on the probability $p$.
Consider the following adaptive adversary, which demonstrates how the adaptive setting differs. Adversary keeps a “working range” at any point during the game, starting with the full range $[0, 1]$. In the first round, Adversary chooses the number $x_1 = 1/2$ as the first element in the stream. If $x_1$ is sampled, then Adversary moves to the range $[1/2, 1]$, and otherwise, to the range $[0, 1/2]$. Next, Adversary submits $x_2$ as the middle of the current range. This continues for $n$ steps. Formally, Adversary’s strategy is as follows. Set $a_1 = 0$ and $b_1 = 1$. In round $i$, where $i$ runs from $1$ to $n$, Adversary submits $x_i = (a_i + b_i)/2$ to Sampler; if $x_i$ is sampled then Adversary sets $a_{i+1} = x_i$ and $b_{i+1} = b_i$, and otherwise, it sets $a_{i+1} = a_i$ and $b_{i+1} = x_i$. The final stream is $X = (x_1, \ldots, x_n)$.
Note that at any point throughout the process, Adversary always submits an element that is larger than all elements in the current sampled set, and also smaller than all the non-sampled elements of the stream. Therefore, after this process is over, with probability 1, the sampled elements are precisely the smallest elements in the stream. Of course, the median of the sampled set is far from the median of the stream, as such a subset is very unrepresentative of the data. Actually, one might consider it as the “most unrepresentative” subset of the data.
The exact same attack on BernoulliSample works almost as effectively against ReservoirSample. In this case, the attack will cause all of the $k$ sampled elements at the end of the process to lie among the first $O(k \ln n)$ elements (in sorted order) with high probability. For more details, see Section 5.
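To see the attack in action, the following sketch simulates the bisection adversary against Bernoulli sampling. It is an illustration under stated assumptions: the function name `bisection_attack` is hypothetical, and exact rationals (`fractions.Fraction`) stand in for real numbers so that repeated halving loses no precision.

```python
import random
from fractions import Fraction

def bisection_attack(n, p, rng=None):
    """Run the bisection adversary for n rounds against Bernoulli sampling."""
    rng = rng or random.Random()
    lo, hi = Fraction(0), Fraction(1)   # the adversary's current working range
    stream, sample = [], []
    for _ in range(n):
        x = (lo + hi) / 2               # submit the middle of the working range
        stream.append(x)
        if rng.random() < p:            # the sampler's coin; the adversary sees the outcome
            sample.append(x)
            lo = x                      # sampled: all later elements will be larger
        else:
            hi = x                      # not sampled: all later elements will be smaller
    return stream, sample

stream, sample = bisection_attack(n=200, p=0.1, rng=random.Random(1))
# The sample coincides with the smallest elements of the stream.
assert sorted(sample) == sorted(stream)[:len(sample)]
```

The final check does not depend on the seed: whatever the coin flips are, every sampled element ends up below every non-sampled one.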
The good news.
This attack joins a line of attacks in the adversarial model. Lipton and Naughton [LN93] showed that an adversary that can measure the time of operations in a dictionary can use this information to increase the probability of a collision and, as a result, significantly decrease the performance of the hashtable. Hardt and Woodruff [HW13] showed that linear sketches are inherently non-robust and cannot be used to compute the Euclidean norm of their input (where in the static setting they are used mainly for this reason). Naor and Yogev [NY15] showed that Bloom filters are susceptible to attacks by an adaptive stream of queries if the adversary is computationally unbounded, and they also constructed a robust Bloom filter against computationally bounded adversaries.
In our case, one might categorize the given attack as “theoretical” only. In practice, it is unrealistic to assume that the universe from which Adversary can pick elements is an infinite set; how would the attack look, then, if the universe is the discrete set $U = \{1, \ldots, N\}$? Adversary splits the range in half $n$ times, meaning that the precision of the elements required is exponential; the analogous attack in the discrete setting requires $N$ to be exponentially large with respect to the stream size $n$. Such a universe size is large and “unrealistic”: for Sampler to memorize even a single element requires memory size that is linear in $n$, whilst sampling and streaming algorithms usually aim to use an amount of memory sublinear in $n$.
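The precision requirement can be made explicit with a short calculation. As a sketch, write $[a_i, b_i]$ for the adversary's working range before round $i$ (notation assumed here for illustration), starting from $[0, 1]$, and let $n$ be the stream length:

```latex
b_{i+1} - a_{i+1} = \frac{b_i - a_i}{2}
\quad\Longrightarrow\quad
b_{n+1} - a_{n+1} = 2^{-n}.
```

Every submitted midpoint is then a multiple of $2^{-n}$, so roughly $n$ bits are needed to write one down. On a discrete universe of size $N$ the same bisection needs about $2^n$ distinct values, that is, $\log_2 N \gtrsim n$.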
Thus, the question remains whether there exist attacks that can be performed on elements using substantially less precision, that is, on a significantly smaller size of discrete universe. In this work, we bring good news to both the Bernoulli and reservoir sampling algorithms by answering this question negatively. We show that both sampling algorithms, with the right parameters, will output a representative sample with good probability regardless of Adversary’s strategy, thus exhibiting robustness for these algorithms in adversarial settings.
We note that any deterministic algorithm that works in the static setting is inherently robust in the adversarial adaptive setting as well. However, in many cases, deterministic algorithms with small memory simply do not exist, or they are complicated and tailored for a specific task. Here, we enjoy the simplicity of a generic randomized sampling algorithm combined with the robust guarantees of our framework.
What is a representative sample?
Perhaps the most standard and well-known notion of being representative is that of an $\varepsilon$-approximation, first suggested by Vapnik and Chervonenkis [VC71] (see also [MV17]), which originated as a natural notion of discrepancy [Cha01] in the geometric literature. It is closely related to the celebrated notion of VC-dimension [VC71, Sau72, She72], and captures many quantitative properties that are desired in a random subset. Let $X = (x_1, \ldots, x_n)$ be a sequence of elements from the universe $U$ (repetitions are allowed) and let $R \subseteq U$. The density of $R$ in $X$ is the fraction of elements in $X$ that are also in $R$ (i.e., $d_R(X) = |\{i \in [n] : x_i \in R\}| / n$).
A set system is simply a pair $(U, \mathcal{R})$ where $\mathcal{R} \subseteq 2^U$ is a collection of subsets. A non-empty subsequence $S$ of $X$ is an $\varepsilon$-approximation of $X$ with respect to the set system $(U, \mathcal{R})$ if it preserves densities (up to an additive error of $\varepsilon$) for all subsets $R \in \mathcal{R}$.
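**Definition 1.1** ($\varepsilon$-approximation).
We say that a (non-empty) sample $S$ is an $\varepsilon$-approximation of $X$ with respect to $\mathcal{R}$ if for any subset $R \in \mathcal{R}$ it holds that
$$|d_R(X) - d_R(S)| \le \varepsilon.$$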
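The definition translates directly into a brute-force check. The sketch below is only practical for small, explicitly listed set systems; the helper names `density` and `is_eps_approximation` are assumptions made for this illustration, and each range is represented as a Python `set`.

```python
from fractions import Fraction

def density(R, seq):
    """Fraction of the elements of seq (repetitions counted) that lie in the set R."""
    return Fraction(sum(1 for x in seq if x in R), len(seq))

def is_eps_approximation(sample, stream, ranges, eps):
    """True if the sample preserves the density of every range up to an additive eps."""
    if not sample:
        return False                      # the definition requires a non-empty sample
    return all(abs(density(R, stream) - density(R, sample)) <= eps for R in ranges)

stream = [1, 2, 3, 4, 5, 6, 7, 8]
prefixes = [set(range(1, b + 1)) for b in stream]   # the ranges {1, ..., b}
print(is_eps_approximation([2, 4, 6, 8], stream, prefixes, Fraction(1, 8)))
print(is_eps_approximation([1, 2, 3, 4], stream, prefixes, Fraction(1, 4)))
```

The evenly spread sample `[2, 4, 6, 8]` passes with error 1/8, whereas the sample `[1, 2, 3, 4]`, which consists of the smallest elements only, fails even with error 1/4: the prefix `{1, ..., 4}` has density 1/2 in the stream but 1 in the sample.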
If the universe $U$ is well-ordered, it is natural to take $\mathcal{R}$ as the collection of all consecutive intervals in $U$, that is, $\mathcal{R} = \{[a, b] : a \le b \in U\}$ (including all singletons $[a, a]$). With this set system in hand, $\varepsilon$-approximation is a natural form of “good representation” in the streaming setting, as evidenced by its deep connection to multiple classical problems in the streaming literature, like approximate median, and more generally, quantile estimation [MRL99, GK01, WLYC13, GK16, KLL16] and range searching [BCEG07]. In particular, if $S$ is an $\varepsilon$-approximation of $X$ w.r.t. $(U, \mathcal{R})$, then any $q$-quantile of $S$ is $\varepsilon$-close to the $q$-quantile of $X$; this holds simultaneously for all quantiles (see Section 1.2).
1.1 Our Results
Fix a set system $(U, \mathcal{R})$ over the universe $U$. A sampling algorithm is called $(\varepsilon, \delta)$-robust if for any (even computationally unbounded) strategy of Adversary, the output sample $S$ is an $\varepsilon$-approximation of the whole stream $X$ with respect to $(U, \mathcal{R})$, with probability at least $1 - \delta$.
Our main result is an upper bound (“good news”) on the $(\varepsilon, \delta)$-robustness of Bernoulli and reservoir sampling, later complemented with near-matching lower bounds.
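**Theorem 1.2.**
For any $0 < \varepsilon, \delta < 1$, set system $(U, \mathcal{R})$, and stream length $n$, the following holds.
- *BernoulliSample with parameter $p \ge 10 \cdot \frac{\ln|\mathcal{R}| + \ln(4/\delta)}{\varepsilon^2 n}$ is $(\varepsilon, \delta)$-robust.*
- *ReservoirSample with parameter $k \ge 2 \cdot \frac{\ln|\mathcal{R}| + \ln(2/\delta)}{\varepsilon^2}$ is $(\varepsilon, \delta)$-robust.*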
The proof appears in Section 4. As the total number of elements sampled by BernoulliSample is well-concentrated around $np$, the above theorem implies that a sample of total size (at least) $\Theta\left(\frac{\ln|\mathcal{R}| + \ln(1/\delta)}{\varepsilon^2}\right)$, obtained by any of the algorithms, BernoulliSample or ReservoirSample, is an $\varepsilon$-approximation with probability $1 - \delta$.
This should be compared with the static setting, where the same result is known as long as $p \ge c \cdot \frac{d + \ln(1/\delta)}{\varepsilon^2 n}$ for BernoulliSample, and $k \ge c \cdot \frac{d + \ln(1/\delta)}{\varepsilon^2}$ for ReservoirSample, where $d$ is the VC-dimension of $(U, \mathcal{R})$ and $c > 0$ is a constant [VC71, Tal94, LLS01] (see also [MV17]).
As one can see, to make the static sampling algorithm robust in the adaptive setting one solely needs to modify the sample size by replacing the VC-dimension term $d$ with the cardinality dimension $\ln|\mathcal{R}|$ (and update the multiplicative constant). Below, in our lower bounds, we show that this increase in the sample size is inherent, and not a byproduct of our analysis.
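To get a feel for the gap between the two regimes, the sketch below evaluates both sample-size expressions with the leading constant dropped, so the outputs are orders of magnitude rather than exact requirements. The helper names are hypothetical, and the example set system (all intervals over an ordered universe of size N, which has VC-dimension 2 and N(N+1)/2 members) is chosen here purely for illustration.

```python
import math

def static_size(vc_dim, eps, delta):
    """(d + ln(1/delta)) / eps^2: static-setting sample size, constant factor dropped."""
    return (vc_dim + math.log(1 / delta)) / eps ** 2

def adaptive_size(num_ranges, eps, delta):
    """(ln|R| + ln(1/delta)) / eps^2: adaptive-setting sample size, constant factor dropped."""
    return (math.log(num_ranges) + math.log(1 / delta)) / eps ** 2

N = 2 ** 20                        # assumed universe size
num_intervals = N * (N + 1) // 2   # all intervals [a, b] with a <= b
print(static_size(2, eps=0.1, delta=0.01))
print(adaptive_size(num_intervals, eps=0.1, delta=0.01))
```

In this example the whole difference comes from replacing the VC-dimension 2 by the logarithm of the number of intervals (about 27), scaled by 1/eps^2.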
Lower Bounds.
We next show that being adaptively robust comes at a price. That is, the dependence on the cardinality dimension, as opposed to the VC dimension, is necessary. By an improved version of the attack described in the introduction, we show the following:
**Theorem 1.3.**
There exists a constant $c > 0$ and a set system $(U, \mathcal{R})$ with VC-dimension 1, where such that for any :
1. The BernoulliSample algorithm with parameter is not $(\varepsilon, \delta)$-robust.
2. The ReservoirSample algorithm with parameter is not $(\varepsilon, \delta)$-robust.

Moreover, for any , there exists $(U, \mathcal{R})$ as above where .
The proof can be found in Section 5.
Continuous robustness.
The condition of $(\varepsilon, \delta)$-robustness requires that the sample will be $\varepsilon$-representative of the stream at the end of the process. What if we wish the sample to be representative of the stream at any point throughout the stream? Formally, we say that a sampling algorithm is $(\varepsilon, \delta)$-continuously robust if, with probability at least $1 - \delta$, at any point $i \in [n]$ the sampled set $S_i$ is an $\varepsilon$-approximation of the first $i$ elements of the stream, i.e., of $(x_1, \ldots, x_i)$. The next theorem shows that continuous robustness of ReservoirSample can be obtained with just a small overhead compared to “standard” robustness. (For BernoulliSample one cannot hope for such a result to be true, at least for the above definition of continuous robustness.)
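**Theorem 1.4.**
There exists $c > 0$, such that for any $0 < \varepsilon, \delta < 1$, set system $(U, \mathcal{R})$, and stream length $n$, ReservoirSample with parameter $k \ge c \cdot \frac{\ln|\mathcal{R}| + \ln(1/\delta) + \ln(1/\varepsilon) + \ln\ln n}{\varepsilon^2}$ is $(\varepsilon, \delta)$-continuously robust.
Moreover, if only continuous robustness against a static adversary is desired, then the $\ln|\mathcal{R}|$ term can be replaced with the VC-dimension of $(U, \mathcal{R})$.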
We are not aware of a previous analysis of continuous robustness, even in the static setting. The proof, appearing in Section 6, follows by applying Theorem 1.2 (or its static analogue) at carefully picked “checkpoints” $i_1, \ldots, i_t$ along the stream, where $t = O(\varepsilon^{-1} \ln n)$. It shows that if the sample is representative of the stream at each of the points $i_1, \ldots, i_t$, then with high probability, the sample is also representative at any other point along the stream. (We remark that a similar statement with weaker dependence on $n$ can be obtained from Theorem 1.2 by a straightforward union bound.)
Comparison to deterministic sampling algorithms.
Our results show that sampling algorithms provide an $\varepsilon$-approximation in the adversarial model. One advantage of using the notion of $\varepsilon$-approximation is its wide array of applications, where for each such task we get a streaming algorithm in the adversarial model as described in the following subsection. We stress that for any specific task a deterministic algorithm that works in the static setting will also automatically be robust in the adversarial setting. However, deterministic algorithms tend to be more complicated, and in some cases they require larger memory. Here, we focus on showing that the simplest and most generic sampling algorithms “as is” are robust in our adaptive model and yield a representative sample of the data that can be used for many different applications.
The best known deterministic algorithm for computing an $\varepsilon$-approximating sample in the streaming model is that of Bagchi et al. [BCEG07]. The sample size they obtain is ; the working space of their algorithm and the processing time per element are of the form , where is the scaffold dimension² of the set system. The exact bounds are rather intricate, see Corollary 4.2 in [BCEG07]. While the space requirement of their approach does not have a dependence on , its dependence on and is generally worse than ours, making their bounds somewhat incomparable to ours. Finally, we note that there exist more efficient methods to generate an $\varepsilon$-approximation in some special cases, e.g., when the set system consists of rectangles or halfspaces [STZ04].
² The scaffold dimension is a variant of the VC-dimension equal to .
1.2 Applications of Our Results
We next describe several representative applications and usages of $\varepsilon$-approximations (see also [BCEG07] for more applications in the area of robust statistics). For some of these applications, there exist deterministic algorithms known to require less memory than the simple random sampling methods discussed in this paper. However, one area where our generic random sampling approach shines compared to deterministic approaches is the query complexity or running time (under a suitable computational model). Indeed, while deterministic algorithms must inherently query all elements in the stream in order to run correctly, our random sampling methods query just a small sublinear portion of the elements in the stream.
Consequently, to the best of our knowledge, Bernoulli and reservoir sampling are the first two methods known to compute an $\varepsilon$-approximation (and as a byproduct, solve the tasks described in this subsection) in adversarial situations where it is unrealistic or too costly to query all elements in the stream. The last part of this subsection exhibits an example of one such situation.
Quantile approximation.
As was previously mentioned, $\varepsilon$-approximations have a deep connection to approximate median (and more generally, quantile estimation). Assume the universe $U$ is well-ordered. We say that a streaming algorithm is an $(\varepsilon, \delta)$-robust quantile sketch if, in our adversarial model, it provides a sample that allows one to approximate the rank³ of any element in the stream up to additive error $\varepsilon n$ with probability at least $1 - \delta$. Observe that this is achieved with an $\varepsilon$-approximation with respect to the set system $(U, \mathcal{R})$ where $\mathcal{R} = \{[1, b] : b \in U\}$. For example, set $b$ to be the median of the stream. Since the density of the range $[1, b]$ is preserved in the sample, we know that the median of the sample will be $\varepsilon$-close to the median of the stream. This works for any other quantile simultaneously. The sample size is $O\left(\frac{\ln|U| + \ln(1/\delta)}{\varepsilon^2}\right)$.
³ The rank of an element $x$ in a stream $X = (x_1, \ldots, x_n)$ is the total number of elements $x_i$ in the stream such that $x_i \le x$.
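**Corollary 1.5.**
For any $0 < \varepsilon, \delta < 1$, well-ordered universe $U$, and stream length $n$, BernoulliSample with parameter $p \ge 10 \cdot \frac{\ln|U| + \ln(4/\delta)}{\varepsilon^2 n}$ is an $(\varepsilon, \delta)$-robust quantile sketch. The same holds for the algorithm ReservoirSample with parameter $k \ge 2 \cdot \frac{\ln|U| + \ln(2/\delta)}{\varepsilon^2}$.
A corollary in the same spirit regarding continuously robust quantile sketches can be derived from Theorem 1.4.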
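As a small illustration of how a sample serves as a quantile sketch, the snippet below reads a quantile off the sorted sample and compares its rank in the full stream with the target rank. The function names are hypothetical, and the evenly spaced sample is a hand-picked stand-in for a representative random sample.

```python
import math

def sample_quantile(sample, q):
    """Return the q-quantile of the sample (0 < q <= 1) by sorting it."""
    ordered = sorted(sample)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]

def rank(x, stream):
    """Number of stream elements that are at most x."""
    return sum(1 for y in stream if y <= x)

stream = list(range(1, 1001))   # n = 1000
sample = stream[4::10]          # 5, 15, ..., 995: a representative sample of size 100
median_estimate = sample_quantile(sample, 0.5)
# The rank of the estimate is within 0.01 * n of the true median rank n / 2.
assert abs(rank(median_estimate, stream) - 500) <= 10
```

The same lookup works for any other quantile, since a representative sample preserves the density of every prefix of the ordered universe at once.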
Range queries.
Suppose that the universe is of the form $U = [m]^d$ for some parameters $m$ and $d$. One basic problem is that of range queries: one is given a set of ranges $\mathcal{R}$ and each query consists of a range $R \in \mathcal{R}$ where the desired answer is the number of points in the stream that are in this range. Popular choices of such ranges are axis-aligned or rotated boxes, spherical ranges and simplicial ranges. An $\varepsilon$-approximation allows us to answer such range queries up to an additive error of $\varepsilon n$. Suppose the sampled set is $S$, then an answer is given by computing $d_R(S) \cdot n$. For example, when $\mathcal{R}$ consists of all axis-parallel boxes, $\ln|\mathcal{R}| = O(d \ln m)$ and thus the sample size required to answer range queries that are robust against adversarial streams is $O\left(\frac{d \ln m + \ln(1/\delta)}{\varepsilon^2}\right)$; for rotated boxes, one should replace $d$ with $d^2$ in this expression. See [BCEG07] for more details on the connection between $\varepsilon$-approximations and range queries.
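Answering a range query from a sample amounts to rescaling the sample's density by the stream length. The sketch below does this for axis-parallel boxes; the function name `estimate_count` and the representation of a box as one `(low, high)` pair per coordinate are assumptions made for illustration.

```python
def estimate_count(sample, n, box):
    """Estimate how many of the n stream points lie in an axis-parallel box.

    box holds one (low, high) pair per coordinate, both ends inclusive.
    """
    inside = sum(
        1 for pt in sample
        if all(low <= c <= high for c, (low, high) in zip(pt, box))
    )
    return n * inside / len(sample)   # density in the sample times stream length

sample = [(1, 1), (2, 5), (4, 4), (9, 9)]
print(estimate_count(sample, n=1000, box=((1, 4), (1, 4))))
```

Here two of the four sampled points fall in the box, so the estimate is half of the stream length.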
Center points.
Our result is also useful for computing $\beta$-center points. A point $c$ in the stream is a $\beta$-center point if every closed halfspace containing $c$ in fact contains at least $\beta n$ points of the stream. In [CEM+96, Lemma 6.1] it has been shown that an $\varepsilon$-approximation (with respect to half-spaces) can be used to get a $\beta$-center point for suitable choices of the parameters. For example, setting $\varepsilon = \beta/5$ we get that a $6\beta/5$-center of the sample $S$ is a $\beta$-center of the stream $X$. Thus, we can compute a $\beta$-center of a stream in the adversarial model. See also [BCEG07].
Heavy hitters.
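Finding those elements that appear many times in a stream is a fundamental problem in data mining, with a myriad of practical applications. In the heavy hitters problem, there is a threshold $\alpha$ and an error parameter $\varepsilon$. The goal is to output a list of elements such that if an element $x$ appears more than $\alpha n$ times in the stream (i.e., $d_{\{x\}}(X) \ge \alpha$) it must be included in the list, and if an element appears less than $(\alpha - \varepsilon) n$ times in the stream (i.e., $d_{\{x\}}(X) \le \alpha - \varepsilon$) it cannot be included in the list.
Our results yield a simple and efficient heavy hitters streaming algorithm in the adversarial model. For any universe $U$ let $\mathcal{R} = \{\{a\} : a \in U\}$ be the set of all singletons. Now, pick $\varepsilon' = \varepsilon/3$ and use either Bernoulli or reservoir sampling to compute an $\varepsilon'$-approximation $S$ of the stream $X$, outputting all elements $x$ with $d_{\{x\}}(S) \ge \alpha - \varepsilon'$. Indeed, if $d_{\{x\}}(X) \ge \alpha$ then $d_{\{x\}}(S) \ge \alpha - \varepsilon'$. On the other hand, if $d_{\{x\}}(X) \le \alpha - \varepsilon$ then $d_{\{x\}}(S) \le \alpha - \varepsilon + \varepsilon' < \alpha - \varepsilon'$.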
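**Corollary 1.6.**
There exists $c > 0$ such that for any $0 < \varepsilon, \delta < 1$, universe $U$, and stream length $n$, BernoulliSample with parameter $p \ge c \cdot \frac{\ln|U| + \ln(1/\delta)}{\varepsilon^2 n}$ solves the heavy hitters problem with error $\varepsilon$ in the adversarial model. The same holds for ReservoirSample with parameter $k \ge c \cdot \frac{\ln|U| + \ln(1/\delta)}{\varepsilon^2}$.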
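The reporting step of this heavy hitters scheme can be sketched as follows. The function name is hypothetical, and the cutoff of threshold minus one third of the error is an assumption of this sketch: it is one choice that separates the two cases whenever the sample approximates every singleton density to within a third of the error.

```python
from collections import Counter

def heavy_hitters_from_sample(sample, alpha, eps):
    """Report every element whose density in the sample is at least alpha - eps/3."""
    counts = Counter(sample)
    cutoff = (alpha - eps / 3) * len(sample)
    return {x for x, c in counts.items() if c >= cutoff}

sample = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
print(heavy_hitters_from_sample(sample, alpha=0.4, eps=0.15))
```

With these numbers the cutoff is 35 occurrences out of 100, so only `"a"` is reported.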
Clustering.
The task of partitioning data elements into separate groups, where the elements in each group are “similar” and elements in different groups are “dissimilar”, is fundamental and useful for numerous applications across computer science. There has been lots of interest in clustering in a streaming setting, see e.g. [GLA16] for a survey on recent results. Our results suggest a generic framework to accelerate clustering algorithms in the adversarial model: Instead of running clustering on the full data, one can simply sample the data to obtain (with high probability, even against an adversary) an $\varepsilon$-approximation of it, run the clustering algorithm on the sample, and then extrapolate the results to the full dataset.
Sampling in modern data-processing systems.
It is very common to use random sampling (sometimes “in disguise”) in modern data-intensive systems that operate on streaming data, arriving in an online manner. As an illustrative example, consider the following distributed database [OV11] setting. Suppose that a database system must receive and process a huge amount of queries per second. It is unrealistic for a single server to handle all the queries, and hence, for load balancing purposes, each incoming query is randomly assigned to one of $K$ query-processing servers. Seeing that the set of queries that each such server receives is essentially a Bernoulli random sample (with parameter $p = 1/K$) of the full stream, one hopes that the portion of the stream sampled by each of these servers would truthfully represent the whole data stream (e.g., for query optimization purposes), even if the stream changes with time (either unintentionally or by a malicious adversary). Such “representation guarantees” are also desirable in distributed machine learning systems [GDG+17, SKYL17], where each processing unit learns a model according to the portion of the data it received, and the models are then aggregated, with the hope that each of the units processed “similar” data.
In general, modern data-intensive systems like those described above become more and more complicated with time, consisting of a large number of different components. Making these systems robust against environmental changes in the data, let alone adversarial changes, is one of the greatest challenges in modern computer science. From our perspective, the following question naturally emerges:
Is random sampling a risk in modern data processing systems?
Fortunately, our results indicate that the answer to this question is largely negative. Our upper bounds, Theorems 1.2 and 1.4, show that a sufficiently large sample suffices to circumvent adversarial changes of the environment.
1.3 Related Work
Online learning.
One related field to our work is online learning, which was introduced for settings where the data is given in a sequential online manner or where it is necessary for the learning algorithm to adapt to changes in the data. Examples include stock price predictions, ad click prediction, and more (see [Sha12] for an overview and more examples).
Similar to our model, online learning is viewed as a repeated game between a learning algorithm (or a predictor) and the environment (i.e., the adversary). The game proceeds in rounds: in each round the environment submits an instance, the learning algorithm makes a prediction for it, and the environment, in turn, chooses a loss for this prediction and sends it as feedback to the algorithm. The goal in this model is usually to minimize regret (the sum of losses) compared to the best fixed prediction in hindsight. This is the typical setting (e.g., [HAK07, SST10]); however, many different variants exist (e.g., [DGS15, ZLZ18]).
PAC learning.
In the PAC-learning framework [Val84], the learner algorithm receives samples generated from an unknown distribution and must choose a hypothesis function from a family of hypotheses that best predicts the data with respect to the given distribution. It is known that the number of samples required for a class to be learnable in this model depends on the VC-dimension of the class.
A recent work of Cullina et al. [CBM18] investigates the effect of evasion adversaries on the PAC-learning framework, coining the term of adversarial VC-dimension for the parameter governing the sample complexity. Despite the name similarity, their context is seemingly unrelated to ours (in particular, it is not a streaming setting), and correspondingly, their notion of adversarial VC-dimension does not seem to relate to our work.
Adversarial examples in deep learning.
A very popular line of research in modern deep learning proposes methods to attack neural networks, and countermeasures to these attacks. In such a setting, an adversary performs adaptive queries to the learned model in order to fool the model via a malicious input. The learning algorithms usually have an underlying assumption that the training and test data are generated from the same statistical distribution. However, in practice, the presence of an adaptive adversary violates this assumption. There are many devastating examples of attacks on learning models [SZS+14, BCM+13, PMG+17, BR18, MHS19] and we stress that currently, the understanding of techniques to defend against such adversaries is rather limited [GMP18, MW18, MM19, MHS19].
Maintaining random samples.
Reservoir sampling is a simple and elegant algorithm for maintaining a random sample of a stream [Vit85], and since its proposal, many flavors have been introduced. Chung, Tirthapura, and Woodruff [CTW16] generalized reservoir sampling to the setting of multiple distributed streams, which need to coordinate in order to continuously respond to queries over the union of all streams observed so far (see also Cormode et al. [CMYZ12]). Another variant is weighted reservoir sampling, where the probability of sampling an element is proportional to a weight associated with the element in the stream [ES06, BOV15]. A distributed version as above was recently considered for the weighted case as well [JSTW19].
1.4 Paper Organization
Section 2 contains an overview of our adversarial model and a more precise and detailed definition than the one given in the introduction. In Section 3 we mention several concentration inequalities required for our analysis. In Section 4 we present and prove our main technical lemma, from which we derive Theorem 1.2. This includes the analysis of both Bernoulli sampling and reservoir sampling. In Section 5 we present our “attack”, i.e., our lower bound showing the tightness of our result. Finally, in Section 6, we prove our upper bounds in the continuous setting.
2 The Adversarial Model for Sampling
In this section, we formally define the online adversarial model discussed in this paper. Roughly speaking, we say that a sampling algorithm is $(\varepsilon, \delta)$-robust for a set system $(U, \mathcal{R})$ if, for any adversary choosing an adaptive stream of elements $x_1, \ldots, x_n$, the final state of the sampling algorithm is an $\varepsilon$-approximation of the stream with probability at least $1 - \delta$. This is formulated using a game between two players, the sampler and the adversary.
Rules of the game:
1. The sampler is a streaming algorithm, which gets a sequence of elements one by one in an online manner (the sampling algorithms we discuss in this paper do not need to know the stream length in advance). Upon receiving an element , the sampler can perform an arbitrary computation (the running time can be unbounded) and update a local state . We denote the local state after steps by , and write .
2. The stream is chosen adaptively by the adversary: a probabilistic (unbounded) player that, given all previously sent elements and the current state , chooses the next element to submit. The strategy that Adversary employs along the way, that is, the probability distribution over the choice of given any possible set of values and , is fixed in advance. The underlying (finite or infinite) set from which the adversary is allowed to choose elements during the game is called the universe, and denoted by $U$. We assume that $U$ does not change along the game.
3. Once all rounds of the game have ended, the sampler outputs . For the sampling algorithms discussed in this paper, is a subsequence of the stream . is usually called the sample obtained by the sampler in the game.
For an illustration on the rules of the game see Figure 1.
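The rules of the game can be mimicked by a short simulation loop. The following Python sketch is illustrative only: the names `play_adaptive_game`, `bernoulli_update`, and `state_aware_adversary`, as well as the choice to represent the local state as a tuple of sampled elements, are assumptions made here and are not notation from the paper.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, ...]

def play_adaptive_game(
    update: Callable[[State, int, int, random.Random], State],
    adversary: Callable[[List[int], State], int],
    n: int,
    seed: int = 0,
) -> Tuple[List[int], State]:
    """Run n rounds; the adversary sees the full history and the current state."""
    rng = random.Random(seed)
    stream: List[int] = []
    state: State = ()
    for i in range(1, n + 1):
        x = adversary(stream, state)      # adaptive choice of the next element
        stream.append(x)
        state = update(state, x, i, rng)  # the sampler updates its local state
    return stream, state

def bernoulli_update(p: float):
    """Bernoulli sampling: keep each element independently with probability p."""
    def update(state: State, x: int, i: int, rng: random.Random) -> State:
        return state + (x,) if rng.random() < p else state
    return update

def state_aware_adversary(stream: List[int], state: State) -> int:
    """Toy strategy: the next element is the number of elements sampled so far."""
    return len(state)
```

The adversary here is a plain function of the history and the state, which matches the requirement that its strategy be fixed in advance while its choices remain adaptive.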
Using the game defined above, we now describe what it means for a sampling algorithm to be (adversarially) robust.
Definition 2.1 (Robust sampling algorithm).
We say that a sampling algorithm is -robust with respect to the set system and the stream length if for and any (even unbounded) strategy of , it holds that
[TABLE]
The memory size used by is defined to be the maximal size of throughout the process of .
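Definition 2.1 asks that the final sample be an approximation of the stream with respect to the set system. Assuming the usual meaning of that term (for every range in the set system, the fraction of stream elements in the range and the fraction of sample elements in the range differ by at most the approximation parameter), a check can be sketched as follows. The function names, and the convention that an empty sequence has density 0, are assumptions made for this illustration.

```python
from typing import List, Sequence, Set

def density(r: Set[int], seq: Sequence[int]) -> float:
    """Fraction of positions in seq holding an element of r (0.0 if seq is empty)."""
    if not seq:
        return 0.0
    return sum(1 for x in seq if x in r) / len(seq)

def approximation_error(
    ranges: List[Set[int]], stream: Sequence[int], sample: Sequence[int]
) -> float:
    """Largest density gap between stream and sample over all ranges."""
    return max(abs(density(r, stream) - density(r, sample)) for r in ranges)

def is_eps_approximation(
    ranges: List[Set[int]], stream: Sequence[int], sample: Sequence[int], eps: float
) -> bool:
    """True if every range has a density gap of at most eps."""
    return approximation_error(ranges, stream, sample) <= eps
```

For example, with ranges `{1, 2}` and `{3, 4}` and stream `[1, 2, 3, 4]`, the sample `[1, 3]` has error 0, while the sample `[1, 2]` has error 0.5.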
A stronger requirement that one can impose on the sampling algorithm is to hold an -approximation of the stream at any step during the game. To handle this, we define a continuous variant of which we denote , presented in Figure 2.
For the sampling algorithms that we consider, the state at any time is essentially equal to the sample . In any case, the definition of the framework given in Figure 2 generally allows to contain additional information, if needed. A sampling algorithm is called -continuously robust if the following holds with probability at least : for any strategy of , and all , the sample is an -approximation of the stream at time .
Definition 2.2 (Continuously robust sampling algorithm).
We say that a sampling algorithm is -continuously robust with respect to the set system and the stream length if for and any (even unbounded) strategy of , it holds that
[TABLE]
The memory size used by is defined to be the maximal size of throughout the process of .
Reservoir sampling.
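For completeness, we provide the pseudocode of the reservoir sampling algorithm [Vit85, Knu97]. Here, $k$ denotes the (fixed) memory size of the algorithm, $i$ denotes the current round number, and $x_i$ is the currently received element.
Given the previous state $\sigma_{i-1}$ and the element $x_i$:
1. If $i \le k$, then parse $\sigma_{i-1} = (s_1, \ldots, s_{i-1})$ and output $\sigma_i = (s_1, \ldots, s_{i-1}, x_i)$.
2. Otherwise, parse $\sigma_{i-1} = (s_1, \ldots, s_k)$.
3. With probability $k/i$ do: choose $j \in [k]$ uniformly at random and output $\sigma_i = (s_1, \ldots, s_{j-1}, x_i, s_{j+1}, \ldots, s_k)$.
4. Otherwise, output $\sigma_i = \sigma_{i-1}$.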
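The pseudocode above translates almost line by line into Python. The sketch below is a minimal illustration of standard reservoir sampling; the function name `reservoir_sample` and the optional `rng` argument are choices made here, not part of the paper.

```python
import random
from typing import Iterable, List, Optional

def reservoir_sample(
    stream: Iterable[int], k: int, rng: Optional[random.Random] = None
) -> List[int]:
    """Maintain a uniform random sample of size at most k over a stream."""
    rng = rng or random.Random()
    sample: List[int] = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)         # the first k elements are always kept
        elif rng.random() < k / i:   # keep x with probability k / i
            j = rng.randrange(k)     # uniformly random slot to overwrite
            sample[j] = x
        # otherwise the sample is left unchanged
    return sample
```

Calling `reservoir_sample(range(100), 10)` returns ten distinct stream elements; a stream shorter than `k` is returned in full.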
3 Technical Preliminaries
The logarithms in this paper are usually of base $e$, and denoted by $\ln$. The exponential function is $\exp(x) = e^x$. For an integer $n$ we denote by $[n]$ the set $\{1, \ldots, n\}$. We state some concentration inequalities, useful for our analysis in later sections. We start with the well-known Chernoff inequality for sums of independent random variables.
Theorem 3.1 (Chernoff Bound [Che52]; see Theorem 3.2 in [CL06]).
Let be independent random variables that take the value 1 with probability and 0 otherwise, , and . Then for any ,
[TABLE]
and
[TABLE]
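As a numerical sanity check, the sketch below compares exact binomial tails with two common textbook forms of the Chernoff bound, namely $\exp(-\delta^2 \mu / 2)$ for the lower tail and $\exp(-\delta^2 \mu / (2 + \delta))$ for the upper tail. These particular forms and constants are an assumption made for illustration and are not necessarily the exact statement of Theorem 3.1; the function names are hypothetical.

```python
from math import comb, exp

def binom_pmf(n: int, p: float, j: int) -> float:
    """Pr[X = j] for X ~ Binomial(n, p)."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)

def upper_tail(n: int, p: float, t: int) -> float:
    """Exact Pr[X >= t]."""
    return sum(binom_pmf(n, p, j) for j in range(t, n + 1))

def lower_tail(n: int, p: float, t: int) -> float:
    """Exact Pr[X <= t]."""
    return sum(binom_pmf(n, p, j) for j in range(0, t + 1))

def chernoff_upper(mu: float, delta: float) -> float:
    """Assumed textbook bound on Pr[X >= (1 + delta) * mu]."""
    return exp(-(delta**2) * mu / (2 + delta))

def chernoff_lower(mu: float, delta: float) -> float:
    """Assumed textbook bound on Pr[X <= (1 - delta) * mu]."""
    return exp(-(delta**2) * mu / 2)
```

With 100 trials of success probability 0.3 (mean 30) and relative deviation 0.5, the thresholds are 45 and 15, so one would compare `upper_tail(100, 0.3, 45)` with `chernoff_upper(30, 0.5)` and `lower_tail(100, 0.3, 15)` with `chernoff_lower(30, 0.5)`.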
Our analysis of adversarial strategies crucially makes use of martingale inequalities. We thus provide the definition of a martingale.
Definition 3.2.
A martingale is a sequence $X = (X_0, \ldots, X_m)$ of random variables with finite means, so that for $0 \le i < m$, it holds that $\mathbb{E}[X_{i+1} \mid X_0, \ldots, X_i] = X_i$.
The most basic and well-known martingale inequality, Azuma’s (or Hoeffding’s) inequality, asserts that martingales with bounded differences are well-concentrated around their mean. For our purposes, this inequality does not suffice, and we need a generalized variant of it, due to McDiarmid [McD98, Theorem 3.15]; see also Theorem 4.1 in [Fre75]. The formulation that we shall use is given as Theorem 6.1 in the survey of Chung and Lu [CL06].
Lemma 3.3 (See [CL06], Theorem 6.1).
Let be a martingale. Suppose further that for any , the variance satisfies for some values , and there exists some so that always holds. Then, for any , we have
[TABLE]
In particular,
[TABLE]
Unlike Azuma’s inequality, Lemma 3.3 is well-suited to deal with martingales where the maximum value of is large, but the maximum is rarely attained (making the variance much smaller than ). The martingales we investigate in this paper depict this behavior.
4 Adaptive Robustness of Sampling: Main Technical Result
In this section, we prove the main technical lemma underlying our upper bounds for Bernoulli sampling and reservoir sampling. The lemma asserts that for both sampling methods, and any given subset of the universe , the fraction of elements from within the sample typically does not differ by much from the corresponding fraction among the whole stream.
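Lemma 4.1.
Fix , a universe and a subset , and let be the sequence chosen by the adversary in the game against either Bernoulli sampling or reservoir sampling.
1. For Bernoulli sampling with parameter $p$, we have .
2. For reservoir sampling with memory size $k$, it holds that .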
Both of these bounds are tight up to an absolute multiplicative constant, even for a static adversary (that has to submit all elements in advance); see Section 6 for more details.
The proof of Theorem 1.2 follows immediately from Lemma 4.1, and is given below. The proof of Theorem 1.4 requires slightly more effort, and is given in Section 6.
Proof of Theorem 1.2.
Let , , , be as in the statement of the theorem, and let and denote the stream and sample, respectively. We start with the Bernoulli sampling case, and assume that . For each , we apply the first part of Lemma 4.1 with parameters and , concluding that
[TABLE]
In the event that for any , by definition is an -approximation of . Taking a union bound over all , we conclude that the probability of this event not to hold is bounded by , meaning that with as above is -robust.
The proof for is identical, except that we replace the condition on with the condition that , and apply the second part of Lemma 4.1. ∎
It is important to note that the typical proofs given for statements of this type in the static setting (i.e., when Adversary submits all elements in advance, and cannot act adaptively) do not apply for our adaptive setting. Indeed, the usual proof of the static analogue of the above lemma goes along the following lines: Adversary chooses which elements to submit in advance, and in particular, determines the number of elements from sent, call it . Then, the number of sampled elements from is distributed according to the binomial distribution for Bernoulli sampling, and for reservoir sampling. One can then employ Chernoff bound to conclude the proof. This kind of analysis crucially relies on the adversary being static.
Here, we need to deal with an adaptive adversary. Recall that at any given point is modeled as a probabilistic process, that given the sequence of elements sent until now, and the current state of , probabilistically decides which element to submit next. Importantly, this makes for a well-defined probability space, and allows us to analyze ’s behavior with probabilistic tools, specifically with concentration inequalities.
The Chernoff bound cannot be used here, as it requires the choices made by the adversary along the process to be independent of each other, which is clearly not the case. In contrast, martingale inequalities are suitable for this setting. We shall thus employ these, specifically Lemma 3.3, to prove both parts of our main result in this section.
4.1 The Bernoulli Sampling Case
We start by proving the Bernoulli sampling case (first statement of Lemma 4.1). Recall that here each element is sampled, independently, with probability . At any given point along the process, let denote the sequence of elements submitted by the adversary until round , and let denote the subsequence of sampled elements from . Note that and , and hence, to prove the lemma, we need to show that .
As a first attempt, it might make sense to try applying a martingale concentration inequality on the sequence of random variables , where we define . Indeed, our end-goal is to bound the probability that significantly deviates from zero. However, a straightforward calculation shows that this is not a martingale, since the condition that does not hold in general. To overcome this, we show that a slightly different formulation of the random variables at hand does yield a martingale. Given the above , for any we define the random variables
[TABLE]
where, as before, the intersection between a set and a sequence is the subsequence of consisting of all elements that also belong to .
Importantly, as is described in the next claim, the sequence of random variables defined above forms a martingale. The claim also demonstrates several useful properties of these random variables, to be used later in combination with Lemma 3.3.
Claim 4.2.
The sequence is a martingale. Furthermore, the variance of conditioned on is bounded by , and it always holds that .
We shall prove Claim 4.2 later on; first we use it to complete the proof of the main result.
Proof of Lemma 4.1, Bernoulli sampling case.
It suffices to prove the following two inequalities for any satisfying the conditions of the lemma for the Bernoulli sampling case:
[TABLE]
Indeed, taking a union bound over these two inequalities, applying the triangle inequality, and observing that , we conclude that , as desired.
The first inequality follows from Claim 4.2 and Lemma 3.3. Indeed, in view of Claim 4.2, we can apply Lemma 3.3 on with parameters , , and . As , we have , and so
[TABLE]
The right hand side is bounded by when , settling the first inequality of (2).
We next prove the second inequality of (2). Observe that . Since each element is added to the sample with probability , independently of other elements, the size of is distributed according to the binomial distribution , regardless of the adversary’s strategy. Applying Chernoff inequality with , we get that
[TABLE]
This probability is bounded by provided that . Conditioning on this event not occurring, we have that
[TABLE]
where the first inequality follows from the fact that densities (in this case, ) are always bounded from above by one, and the second inequality follows from our conditioning. This completes the proof of the second inequality in (2). ∎
The proof of Claim 4.2 is given next.
Proof of Claim 4.2.
We first show that is a martingale. Fix , and suppose that the first rounds of have just ended (so the values of are already fixed), and that now picks an element to submit in round of the game.
If then and and so , which trivially means that as desired.
When , we have
[TABLE]
Recall that uses Bernoulli sampling with probability , that is, is sampled with probability (regardless of the outcome of the previous rounds). Therefore, we have that
[TABLE]
The analysis of both cases and implies that , as desired.
We now turn to prove the other two statements of Claim 4.2. The maximum of the expression is , obtained when . The variance of given is zero given the additional assumption that ; assuming that , the variance satisfies
[TABLE]
Combining both cases, we conclude that , completing the proof. ∎
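The martingale property can be checked exactly with rational arithmetic. The sketch below assumes that the random variable has the form $Z_i = |A \cap S_i| / (np) - |A \cap X_i| / n$, where $X_i$ and $S_i$ are the stream and sample after $i$ rounds; this form is an assumption made for illustration and should be compared against the paper's definition. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Tuple

def z_bernoulli(a_s: int, a_x: int, n: int, p: Fraction) -> Fraction:
    """Assumed form: Z = |A ∩ S| / (n p) - |A ∩ X| / n."""
    return Fraction(a_s) / (n * p) - Fraction(a_x, n)

def one_step_moments(
    a_s: int, a_x: int, n: int, p: Fraction, in_A: bool
) -> Tuple[Fraction, Fraction]:
    """Conditional mean and variance of Z_{i+1} - Z_i for the next element."""
    if not in_A:
        return Fraction(0), Fraction(0)  # Z is unchanged when x is outside A
    z = z_bernoulli(a_s, a_x, n, p)
    up = z_bernoulli(a_s + 1, a_x + 1, n, p) - z  # x sampled (probability p)
    down = z_bernoulli(a_s, a_x + 1, n, p) - z    # x not sampled (probability 1 - p)
    mean = p * up + (1 - p) * down
    var = p * up**2 + (1 - p) * down**2 - mean**2
    return mean, var
```

Under this assumed form the conditional mean of the increment is exactly zero, and the conditional variance equals $(1-p)/(n^2 p)$, which is at most $1/(n^2 p)$.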
4.2 The Reservoir Sampling Case
We continue to the proof of the second statement of Lemma 4.1, which considers reservoir sampling. At a high level, the proof goes along the same lines, except that we work with a different martingale. Specifically, for we define
[TABLE]
whereas for we simply define . (This is a natural extension of the definition for ; specifically, in view of the definition of , note that as long as no more than elements appear in the stream, the reservoir simply keeps all of the stream’s elements.)
The following claim is the analogue of Claim 4.2 for the setting of reservoir sampling.
Claim 4.3.
The sequence is a martingale. Furthermore, the variance of conditioned on is bounded by , and it always holds that .
Proof.
We follow the same kind of analysis as in Claim 4.2. Fix (for the claim holds trivially), and suppose that the first rounds have ended, so are already fixed. Denote the next element that the adversary submits by . First, it is easy to verify that
[TABLE]
The calculation of requires a more subtle case analysis. Given and , the value of is determined by three factors: (i) is or not? (ii) is sampled or not? and (iii) conditioning on being sampled, does it replace an element from in the sample, or an element not in ? We separate the analysis into several cases; in cases where is sampled, we denote the element removed from the sample to make room for by .
Case 1: .
In the cases where is either not sampled, or sampled but with , elements from are neither added nor removed from the sample. That is, . Hence,
[TABLE]
where the first equality is by definition, and the third equality follows again by definition and since for .
It remains to consider the event where is sampled and . The probability that is sampled equals , and conditioning on this occurring, the probability that belongs to is , so the above event holds with probability . In this case, one element from is removed from the sample, that is, , and therefore
[TABLE]
Thus, conditioned on , the expectation of is
[TABLE]
Since when , we deduce that
[TABLE]
Case 2: .
Similarly, whenever we have that . The only case where this does not hold is when is sampled and , which has probability . In this case, , implying that
[TABLE]
Combining these two we get, conditioned on , that the expectation of is
[TABLE]
Finally, since when , we have that
[TABLE]
The analysis of these two cases implies that is indeed a martingale.
It remains to obtain the bounds on the difference and the variance of given . This follows rather easily as a byproduct of the above analysis (and the fact that the density is always bounded between zero and one). When , we know from the analysis that and , whereas if , we have and . In both cases, we conclude that .
We next bound the variance of conditioned on the values of (the analysis also implicitly conditions on the value ; the bound we shall eventually derive holds regardless of this value). We start with the case that , and revisit Case 1 above: with probability , the value of is smaller than its expectation by ; and otherwise (with probability ), the value of is larger than its expectation by . Thus, we have that
[TABLE]
We next address the case where , which corresponds to Case 2 above. Here, with probability , the value of is larger than its conditional expectation by ; otherwise, is smaller than the expectation by . Thus,
[TABLE]
As the conditional variance is always bounded by , the bound remains intact if we remove the conditioning on the value of and the predicate assessing whether or not. In other words, , completing the proof. ∎
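The case analysis of Claim 4.3 can be replayed with exact arithmetic. The sketch below assumes that, once the reservoir is full, the martingale has the form $Z_i = i \cdot |A \cap S_i| / k - |A \cap X_i|$ (that is, $i$ times the difference of the two densities); this form is an assumption made for illustration, and the function names are hypothetical.

```python
from fractions import Fraction

def z_reservoir(i: int, a_s: int, a_x: int, k: int) -> Fraction:
    """Assumed form: Z_i = i * |A ∩ S_i| / k - |A ∩ X_i|."""
    return Fraction(i * a_s, k) - a_x

def expected_next_z(i: int, a_s: int, a_x: int, k: int, in_A: bool) -> Fraction:
    """E[Z_{i+1}] given the counts after round i (for i >= k)."""
    p_sampled = Fraction(k, i + 1)  # the new element enters the reservoir
    p_evict_A = Fraction(a_s, k)    # the evicted element belongs to A
    gain = 1 if in_A else 0
    new_a_x = a_x + gain
    outcomes = [
        (1 - p_sampled, a_s),                           # x is not sampled
        (p_sampled * p_evict_A, a_s - 1 + gain),        # x replaces an element of A
        (p_sampled * (1 - p_evict_A), a_s + gain),      # x replaces an element outside A
    ]
    return sum(prob * z_reservoir(i + 1, s, new_a_x, k) for prob, s in outcomes)
```

For every admissible combination of counts, `expected_next_z` coincides with `z_reservoir` evaluated at the current round, which is the martingale property under the assumed form.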
The proof of the second part of Lemma 4.1 now follows from the last claim.
Proof of Lemma 4.1, reservoir sampling case.
Observe that
[TABLE]
In view of Claim 4.3, we apply Lemma 3.3 on the martingale with , for any (for , we can set ), and . We get that
[TABLE]
where the second inequality holds for . Therefore, it suffices to require to get the bound . ∎
5 An Adaptive Attack on Sampling
In this section, we present our lower bounds. Specifically, we show that the sample size cannot depend solely on the VC-dimension, but rather that the dependency on the cardinality is necessary. This is done by describing a set system with large and VC-dimension of one, together with a strategy for the adversary that will make the sampled set unrepresentative with respect to . That is, the sampled set will not be an -approximation of with high probability. This is in contrast to the static setting, where the same sample size suffices for an -approximation with high probability. Moreover, in the case of Bernoulli sampling, the sampled set under attack is extremely unrepresentative, consisting precisely of the smallest elements in the stream (where is the total sample size at the end of the stream).
Proof of Theorem 1.3.
Set the universe to be the well-ordered set for an arbitrary and let . Clearly, has VC-dimension 1. ’s strategy (for both sampling algorithms and ) is described in Figure 3.
Let denote the subsequence of elements sampled by the algorithm along the stream. The expected size of is , and it follows from the well-known Markov inequality (see e.g. [AS16], Appendix A) that (in fact the probability is much smaller, by Chernoff inequality, but we will not need the stronger bound). From here on, we condition on the complementary event: we assume that . The next claim asserts that for of this size, Adversary’s strategy does not fail, in the sense that it never runs out of elements (i.e., for all ).
Claim 5.1.
If then for any .
Proof.
For any , set . We prove by induction that . If is sampled, then we have that and otherwise we have that , where the inequality follows from the induction assumption. Since , we get that
[TABLE]
where the third inequality holds since for small enough , and the last inequality follows since and , which means that
[TABLE]
This proves the induction step, and completes the proof of the claim. ∎
The last claim means that if , then the attack in Figure 3 successfully generates a stream of elements. We now show that the sampled set is not an -approximation. We begin by analyzing the algorithm.
Claim 5.2.
Consider the adversary’s attack on Bernoulli sampling described in Figure 3. At round of the game,
- All elements that were previously submitted by the adversary and sampled are no bigger than .
- All elements that were previously submitted but not sampled are no smaller than .
- The element submitted during round is between and .
Proof.
By induction, where the base case is trivial. Suppose that the claim holds for the first rounds; we now prove it for round . By definition of the attack, and from Claim 5.1 it holds that and so any of the elements for satisfies the desired condition, by the induction assumption. It remains to address the case where . If was sampled, then the attack sets , that is, is a sampled element and satisfies . Otherwise, the attack sets and so is a non-sampled element and satisfies . Finally, always holds. Thus, the three desired conditions are retained. ∎
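The invariant of Claim 5.2 can be exercised with a small simulation. The sketch below is a simplified variant of the attack against Bernoulli sampling, not a reproduction of Figure 3: it works over rational numbers in an interval rather than a finite well-ordered universe, and the split parameter `q` is a hypothetical stand-in for whatever rule the attack uses to place the next element inside the current interval.

```python
import random
from fractions import Fraction
from typing import List, Tuple

def bisection_attack(
    n: int, p: float, q: Fraction, seed: int = 0
) -> Tuple[List[Fraction], List[Fraction]]:
    """Simulate n rounds of the attack against Bernoulli sampling with rate p.

    q in (0, 1) is a hypothetical split parameter: each new element is placed
    a q-fraction of the way from the lower end a to the upper end b.
    Returns (sampled, unsampled) elements in the order they were submitted.
    """
    rng = random.Random(seed)
    a, b = Fraction(0), Fraction(1)
    sampled: List[Fraction] = []
    unsampled: List[Fraction] = []
    for _ in range(n):
        x = a + q * (b - a)   # strictly between a and b
        if rng.random() < p:  # the sampler keeps x
            sampled.append(x)
            a = x             # all later elements will be larger than x
        else:
            unsampled.append(x)
            b = x             # all later elements will be smaller than x
    return sampled, unsampled
```

Whatever the random choices of the sampler, every sampled element ends up smaller than every non-sampled element, so a range consisting of an initial segment of the order separates the sample from the rest of the stream.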
As the last claim depicts, all sampled elements are smaller than all non-sampled ones at any point along the stream. This, of course, suffices for the sampled set to not be an -approximation of . Denote the sampled set by , and let be the maximal element in (if is empty, we are done). Consider now the range : its density in the sampled set is , namely, , while its density in the stream is . To summarize,
[TABLE]
Altogether, the attack does not fail provided that , which holds with probability at least . Thus, with parameter as in the theorem’s statement is not -robust.
The analysis of the reservoir sampling algorithm is very similar. Recall that denotes the sample size, and let be the total number of elements that were sampled during the reservoir sampling process. That is, also counts sampled elements that were evicted at a future iteration. We bound as follows. . Again, Markov's inequality shows that with probability at least , we will have . Using the previous analysis, we know that all elements are the smallest elements in the stream. The sample set consists of some elements among these elements (in other words, the sample set is not necessarily the set of smallest elements, but it is still a subset of the smallest elements). Thus, taking the interval where is the maximal element among the elements, we have that the density of in the sample is . On the other hand, the density of in the stream is
[TABLE]
Together, we conclude that
[TABLE]
meaning that with as in the statement of the theorem is not -robust. ∎
6 Continuous Robustness
In this section, we prove that the reservoir sampling algorithm is -continuously robust against static and adaptive adversaries. Recall that a sampling algorithm is -continuously robust if the following holds with probability at least : at any point throughout the stream, the current sample held by is an -approximation of the current stream (i.e., of the set of all elements submitted by until now).
With this definition in hand, Bernoulli sampling cannot possibly be continuously robust in general (even in the static setting). To see this, consider any set system where $\mathcal{R}$ contains a singleton $\{x\}$ for some $x \in U$, which is the first element of the stream. With probability $1 - p$ this element is not sampled and the density of $\{x\}$ in the sample at the current point is $0$, while its density in the stream is $1$. This violates the $\varepsilon$-approximation requirement (unless $\varepsilon \ge 1$). We thus restrict our discussion to reservoir sampling from here on, and turn to the proof of Theorem 1.4. The proof examines carefully picked points along the stream, applying Theorem 1.2 on each of the points. It then shows that if the sample is a good approximation of the stream at all of these points, then continuous robustness is guaranteed with high probability.
Proof of Theorem 1.4.
We provide the proof for the setting of an adaptive adversary. The proof for the static setting is essentially identical, with the only difference being that, instead of making black-box applications of Theorem 1.2, we apply the static analogue of it. Recall that the bound in the static analogue is of the form , compared to the bound appearing in the statement of Theorem 1.2.
Let , , , be as in the statement of the theorem. As a warmup, let us analyze a simple yet non-optimal proof based on a naïve union bound. Denote the stream and sample after rounds by and , respectively. Consider for a moment the first rounds of the game as a “standalone” game where the stream length is . Applying the second part of Theorem 1.2 with parameters , where , we get that if the memory size of satisfies
[TABLE]
then regardless of ’s strategy,
[TABLE]
Taking a union bound, the probability that is an -approximation of for all is at least . Thus, it follows that whose parameter satisfies the condition of (3) is -continuously robust.
We now continue to the proof of the improved bound, appearing in the statement of the theorem. The proof is also based, at its core, on a union bound argument, albeit a more efficient one. The key idea is to take a sparse set of “checkpoints” along the stream, where , and to apply Theorem 1.2 at each of the times to make sure the sample is an -approximation of the stream at each of these times. Finally, we show that with high probability, for any , the approximation is preserved (the approximation factor might become slightly worse, but no worse than ) in the “gaps” between any pair of neighboring checkpoints.
For this, we first need the following simple claims.
Claim 6.1.
Let be two sequences of length over , which differ in up to values. Then for any . In particular, if is an -approximation of some sequence , then is an -approximation of .
Proof.
For any subset we have . Dividing by , and recalling that and , we conclude that , that is, . To prove the second part, note that
[TABLE]
for any . ∎
Claim 6.2.
Suppose that are three sequences over , where is an -approximation of , and . Then is an -approximation of .
Proof.
For any subset , we have that . We also know that , since is an -approximation of . On the one hand, it follows that
[TABLE]
On the other hand,
[TABLE]
As these inequalities hold for any , the claim follows. ∎
As a consequence of the above two claims, we get the following useful claim. (Recall that for any , the sample and stream after rounds are denoted by and , respectively.)
Claim 6.3.
Consider with memory size , and suppose that exactly elements were sampled in rounds of the game, where . If is an -approximation of , then is an -approximation of .
Proof.
By Claim 6.2, is an -approximation of . As differs from by at most elements, we conclude from Claim 6.1 that is an -approximation of . ∎
The last claim equips us with an approach to ensure continuous robustness, which is more efficient compared to the simple union bound approach. Suppose that there exists a set of integers satisfying the following for any .
1. is an -approximation of , where .
2. , where .
3. The number of elements sampled in rounds is bounded by .
We claim that the above three conditions suffice to ensure that is an -approximation of for any . Indeed, for , is trivially an -approximation. When , consider the maximum for which , and apply Claim 6.3 with , , and as dictated above. Since , the claim implies that is an -approximation of , as desired.
Specifically, given satisfying the assumption of Theorem 1.4, we pick recursively as follows: we start with ; and given we set as the largest integer satisfying that . It is not hard to verify that (this implicitly relies on the fact that , ensured by the assumption of the theorem). Note that . We next show that for this choice of , the above three conditions are satisfied simultaneously for all with probability at least . This shall conclude the proof.
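The recursive choice of checkpoints can be sketched as a geometric sequence of integers. In the Python sketch below, `start`, `n`, and `beta` are hypothetical parameters: `start` plays the role of the first checkpoint, `n` of the stream length, and `beta` of the allowed relative gap between neighboring checkpoints. The exact growth factor used in the proof depends on the approximation parameter and is not assumed here.

```python
from fractions import Fraction
from typing import List

def checkpoints(start: int, n: int, beta: Fraction) -> List[int]:
    """Each checkpoint is the largest integer <= (1 + beta) * previous one."""
    points = [start]
    while points[-1] < n:
        nxt = int((1 + beta) * points[-1])  # floor, since the value is positive
        if nxt <= points[-1]:
            # No progress: beta * start must be at least 1.
            raise ValueError("beta * start must be at least 1 for progress")
        points.append(nxt)
    return points
```

The requirement that `beta * start` be at least 1 mirrors the remark in the text that the construction implicitly relies on a lower bound on the memory size; the number of checkpoints grows only logarithmically in `n / start` for fixed `beta`.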
For the first condition, apply Theorem 1.2 for any with parameters where , concluding that if the memory size satisfies
[TABLE]
then for any ,
[TABLE]
Taking a union bound, with probability at least the first condition holds for all .
The second condition, regarding the boundedness of as a function of , holds trivially (and deterministically) for our choice of .
Finally, it remains to address the third condition. For any , let denote the total number of sampled elements in rounds of the game. Note that each such is a random variable. We wish to show that
[TABLE]
Indeed, if (4) is true for any , then the probability that the third condition holds for any is at least , which (in combination with our analysis of the other two conditions) completes the proof. Thus, it remains to prove (4).
Recall that the probability of an element to be sampled in round is exactly , and that . Hence, is a sum of up to independent random variables, each of which has probability less than to be sampled. In particular, the mean of is less than . From Chernoff bound (Theorem 3.1), we get the desired bound:
[TABLE]
where the last inequality holds for , for a sufficiently large constant ; note that in the theorem’s statement indeed satisfies this inequality. ∎
Acknowledgments
We are grateful to Moni Naor for suggesting the study of streaming algorithms in the adversarial setting and for helpful and informative discussions about it. We additionally thank Noga Alon, Nati Linial, and Ohad Shamir for invaluable comments and suggestions for the paper.
References
- [AS16] Noga Alon and Joel H. Spencer. The Probabilistic Method. Wiley Publishing, 4th edition, 2016.
- [BCEG07] Amitabha Bagchi, Amitabh Chaudhary, David Eppstein, and Michael T. Goodrich. Deterministic sampling and range counting in geometric data streams. ACM Transactions on Algorithms, 3(2):16, 2007.
- [BCM+13] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD, pages 387–402, 2013.
- [BOV15] Vladimir Braverman, Rafail Ostrovsky, and Gregory Vorsanger. Weighted sampling without replacement from data streams. Information Processing Letters, 115(12):923–926, 2015.
- [BR18] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
- [CBM18] Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of evasion adversaries. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS, pages 228–239, 2018.
- [CDK+11] Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM Journal on Computing, 40(5):1402–1431, 2011.
- [CEM+96] Kenneth L. Clarkson, David Eppstein, Gary L. Miller, Carl Sturtivant, and Shang-Hua Teng. Approximating center points with iterative radon points. International Journal of Computational Geometry and Applications, 6(3):357–377, 1996.
