The Adversarial Robustness of Sampling

Omri Ben-Eliezer; Eylon Yogev

arXiv:1906.11327·cs.DS·June 28, 2019

TL;DR

This paper investigates the vulnerability of common sampling methods like Bernoulli and reservoir sampling to adaptive adversarial attacks in streaming data, revealing that robustness depends on the complexity of the set system and proposing a modified sample size bound.

Contribution

It demonstrates that standard sampling sizes are vulnerable to adaptive adversaries and proposes a simple modification replacing VC-dimension with the logarithm of the set system's size to ensure robustness.

Findings

1. In certain set systems, an adaptive adversary can make a small sample highly unrepresentative of the stream.

2. Replacing the VC-dimension with the logarithm of the set system's size in the sample-size bound guarantees robustness.

3. The resulting sample-size bound nearly matches the lower bound implied by the attack.

Equations

$\Pr[\mathsf{AdaptiveGame}(\mathsf{Sampler},\mathsf{Adversary})=1]\geq 1-\delta$

$\Pr[\mathsf{ContinuousAdaptiveGame}(\mathsf{Sampler},\mathsf{Adversary})=1]\geq 1-\delta$

$\Pr[X\leq(1-\delta)\mu]\leq\exp\left(-\frac{\delta^{2}\mu}{2}\right)$

$\Pr[X\geq(1+\delta)\mu]\leq\exp\left(-\frac{\delta^{2}\mu}{2+2\delta/3}\right)$

$\Pr(X-X_{0}\geq\lambda)\leq\exp\left(-\frac{\lambda^{2}}{2\sum_{i=1}^{n}\sigma_{i}^{2}+M\lambda/3}\right)$

$\Pr(|X-X_{0}|\geq\lambda)\leq 2\exp\left(-\frac{\lambda^{2}}{2\sum_{i=1}^{n}\sigma_{i}^{2}+M\lambda/3}\right)$

$\Pr(|d_{R}(X)-d_{R}(S)|\geq\varepsilon)\leq\delta/|\mathcal{R}|$

$A_{i}^{R}=\frac{i}{n}\cdot d_{R}(X_{i})=\frac{|R\cap X_{i}|}{n};\quad B_{i}^{R}=\frac{|R\cap S_{i}|}{np};\quad Z_{i}^{R}=B_{i}^{R}-A_{i}^{R}$

$\Pr(|A_{n}^{R}-B_{n}^{R}|\geq\varepsilon/2)\leq\delta/2;\quad\Pr(|B_{n}^{R}-d_{R}(S_{n})|\geq\varepsilon/2)\leq\delta/2$

$\Pr(|A_{n}^{R}-B_{n}^{R}|\geq\varepsilon/2)\leq 2\exp\left(-\frac{(\varepsilon/2)^{2}}{2n\cdot\frac{1}{n^{2}p}+\frac{\varepsilon}{6np}}\right)<2\exp\left(-\frac{\varepsilon^{2}np}{9}\right)$

$\Pr(\big||S_{n}|-np\big|\geq\varepsilon np/2)\leq 2\exp\left(-\frac{(\varepsilon/2)^{2}np}{2+\varepsilon/3}\right)<2\exp\left(-\frac{\varepsilon^{2}np}{10}\right)$

$\big|d_{R}(S_{n})-B^{R}_{n}\big|=\left|1-\frac{|S_{n}|}{np}\right|\cdot d_{R}(S_{n})\leq\left|1-\frac{|S_{n}|}{np}\right|\leq\frac{\varepsilon}{2}$

$A_{i}^{R}=A_{i-1}^{R}+\frac{1}{n}$

$\mathbb{E}[Z_{i}^{R}\mid Z_{0}^{R},\dots,Z_{i-1}^{R};\ x_{i}\in R]=Z_{i-1}^{R}+p\cdot\left(\frac{1}{np}-\frac{1}{n}\right)+(1-p)\cdot\left(-\frac{1}{n}\right)=Z_{i-1}^{R}$

$\mathrm{Var}(Z_{i}^{R}\mid Z_{0}^{R},\dots,Z_{i-1}^{R};\ x_{i}\in R)=(1-p)\cdot\left(\frac{1}{n}\right)^{2}+p\cdot\left(\frac{1}{np}-\frac{1}{n}\right)^{2}=\frac{1}{n^{2}}\left(\frac{1}{p}-1\right)\leq\frac{1}{n^{2}p}$

$A^{R}_{i}=\begin{cases}A^{R}_{i-1}&x_{i}\notin R\\ A^{R}_{i-1}+1&x_{i}\in R\end{cases}$

$B_{i}^{R}=\frac{i}{k}\cdot|R\cap S_{i}|=\frac{i-1}{k}\cdot|R\cap S_{i-1}|+\frac{1}{k}\cdot|R\cap S_{i-1}|=B_{i-1}^{R}+d_{R}(S_{i-1})$

$B_{i}^{R}=\frac{i}{k}\cdot|R\cap S_{i}|=\frac{i}{k}\cdot|R\cap S_{i-1}|-\frac{i}{k}=B_{i-1}^{R}+d_{R}(S_{i-1})-\frac{i}{k}$

$\left(1-\frac{k}{i}\cdot d_{R}(S_{i-1})\right)\cdot\left(B_{i-1}^{R}+d_{R}(S_{i-1})\right)+\frac{k}{i}\cdot d_{R}(S_{i-1})\cdot\left(B_{i-1}^{R}+d_{R}(S_{i-1})-\frac{i}{k}\right)=B_{i-1}^{R}$

$\mathbb{E}[Z_{i}^{R}\mid Z_{0}^{R},\dots,Z_{i-1}^{R};\ x_{i}\notin R]=Z_{i-1}^{R}$

$B_{i}^{R}=\frac{i}{k}\cdot|R\cap S_{i}|=\frac{i}{k}\cdot|R\cap S_{i-1}|+\frac{i}{k}=B_{i-1}^{R}+d_{R}(S_{i-1})+\frac{i}{k}$

$B_{i-1}^{R}+d_{R}(S_{i-1})+\left(\frac{k}{i}\cdot(1-d_{R}(S_{i-1}))\right)\cdot\frac{i}{k}=B_{i-1}^{R}+1$

$\mathbb{E}[Z_{i}^{R}\mid Z_{0}^{R},\dots,Z_{i-1}^{R};\ x_{i}\in R]=Z_{i-1}^{R}$

$\mathrm{Var}(Z_{i}^{R}\mid Z_{0}^{R},\dots,Z_{i-1}^{R};\ x_{i}\notin R)=\frac{k}{i}\cdot d_{R}(S_{i-1})\cdot\left(\frac{i}{k}-d_{R}(S_{i-1})\right)^{2}+\left(1-\frac{k}{i}\cdot d_{R}(S_{i-1})\right)\cdot\left(d_{R}(S_{i-1})\right)^{2}=\frac{i}{k}\cdot d_{R}(S_{i-1})-\left(d_{R}(S_{i-1})\right)^{2}\leq\frac{i}{k}$

$\mathrm{Var}(Z_{i}^{R}\mid Z_{0}^{R},\dots,Z_{i-1}^{R};\ x_{i}\in R)=\frac{k}{i}\cdot(1-d_{R}(S_{i-1}))\cdot\left(\frac{i}{k}+d_{R}(S_{i-1})-1\right)^{2}+\left(1-\frac{k}{i}\cdot(1-d_{R}(S_{i-1}))\right)\cdot\left(1-d_{R}(S_{i-1})\right)^{2}=\frac{i}{k}\cdot(1-d_{R}(S_{i-1}))-\left(1-d_{R}(S_{i-1})\right)^{2}\leq\frac{i}{k}$


Full text

Omri Ben-Eliezer Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.

Eylon Yogev Department of Computer Science, Technion, Haifa, Israel. Supported by the European Union’s Horizon 2020 research and innovation program under grant agreement no. 742754, and by a grant from the Israel Science Foundation (no. 950/16).

Abstract

Random sampling is a fundamental primitive in modern algorithms, statistics, and machine learning, used as a generic method to obtain a small yet “representative” subset of the data. In this work, we investigate the robustness of sampling against adaptive adversarial attacks in a streaming setting: An adversary sends a stream of elements from a universe $U$ to a sampling algorithm (e.g., Bernoulli sampling or reservoir sampling), with the goal of making the sample “very unrepresentative” of the underlying data stream. The adversary is fully adaptive in the sense that it knows the exact content of the sample at any given point along the stream, and can choose which element to send next accordingly, in an online manner.

Well-known results in the static setting indicate that if the full stream is chosen in advance (non-adaptively), then a random sample of size $\Omega(d/\varepsilon^{2})$ is an $\varepsilon$ -approximation of the full data with good probability, where $d$ is the VC-dimension of the underlying set system $(U,\mathcal{R})$ . Does this sample size suffice for robustness against an adaptive adversary? The simplistic answer is negative: We demonstrate a set system where a constant sample size (corresponding to a VC-dimension of $1$ ) suffices in the static setting, yet an adaptive adversary can make the sample very unrepresentative, as long as the sample size is (strongly) sublinear in the stream length, using a simple and easy-to-implement attack.

However, this attack is “theoretical only”, requiring the set system size to (essentially) be exponential in the stream length. This is not a coincidence: We show that in order to make the sampling algorithm robust against adaptive adversaries, the modification required is solely to replace the VC-dimension term $d$ in the sample size with the cardinality term $\log|\mathcal{R}|$ . That is, the Bernoulli and reservoir sampling algorithms with sample size $\Omega(\log|\mathcal{R}|/\varepsilon^{2})$ output a representative sample of the stream with good probability, even in the presence of an adaptive adversary. This nearly matches the bound imposed by the attack.

1 Introduction

Random sampling is a simple, generic, and universal method to deal with massive amounts of data across all scientific disciplines. It has wide-ranging applications in statistics, databases, networking, data mining, approximation algorithms, randomized algorithms, machine learning, and other fields (see e.g., [CJSS03, JMR05, JPA04, CDK+11, CG05, CMY11] and [Cha01, Chapter 4]). Perhaps the central reason for its wide applicability is the fact that it (provably, and with high probability) suffices to take only a small number of random samples from a large dataset in order to “represent” the dataset truthfully (the precise geometric meaning is explained later). Thus, instead of performing costly and sometimes infeasible computations on the full dataset, one can sample a small yet “representative” subset of the data, perform the required analysis on this small subset, and extrapolate (approximate) conclusions from the small subset to the entire dataset.

The analysis of sampling algorithms has mostly been studied in the non-adaptive (or static) setting, where the data is fixed in advance, and then the sampling procedure runs on the fixed data. However, it is not always realistic to assume that the data does not change during the sampling procedure, as described in [MNS11, GHR+12, GHS+12, HW13, NY15]. In this work, we study the robustness of sampling in an adaptive adversarial environment.

The adversarial environment.

At a high level, the model is a two-player game between a randomized streaming algorithm, called $\mathsf{Sampler}$, and an adaptive player, $\mathsf{Adversary}$. In each round:

1. $\mathsf{Adversary}$ first submits an element to $\mathsf{Sampler}$. The choice of the element can depend, possibly in a probabilistic manner, on all elements submitted by $\mathsf{Adversary}$ up to this point, as well as all information that $\mathsf{Adversary}$ observed from $\mathsf{Sampler}$ up to this point.

2. Next, $\mathsf{Sampler}$ probabilistically updates its internal state, i.e., the sample that it currently maintains. An update step usually involves an insertion of the newly received element to the sample with some probability, and sometimes deletion of old elements from the sample.

3. Finally, $\mathsf{Adversary}$ is allowed to observe the current (updated) state of $\mathsf{Sampler}$, before proceeding to the next round.

$\mathsf{Adversary}$’s goal is to make the sample as unrepresentative as possible, causing $\mathsf{Sampler}$ to reach false conclusions about the data stream. The game is formally described in Section 2.

Adversarial scenarios are common and arise in different settings. An adversary uses adversarial examples to fool a trained machine learning model [SZS+14, MHS19]; in the field of online learning [Haz16], adversaries are typically adaptive [SS17, LMPL18]. An online store suggests recommended items based on a sample of previous purchases, which in turn influences future sales [Sha12, GHR+12]. A network device routes traffic according to statistics pulled from a sampled substream of packets [DLT05], and an adversary that observes the network’s traffic and learns the device’s routing choices might cause a denial-of-service attack by generating a small amount of adversarial traffic [NY15]. A high-frequency stock trading algorithm monitors a stream of stock orders and places buy/sell requests based on statistics drawn from samples; a competitor might fool the sampling algorithm by observing its requests and modifying future stock orders accordingly. An autonomous vehicle receives physical signals from its immediate environment (which might be adversarial [SBM+18]) and has to decide on a suitable course of action.

Even when there is no apparent adversary, the adaptive perspective is sometimes natural and required. For instance, adaptive data analysis [DFH+15, WFRS18] aims to understand the challenges arising when data arrives online, such as data reuse, the implicit bias “collected” over time in scientific discovery, and the evolution of statistical hypotheses over time. In graph algorithms, [CGP+18] observed that an adversarial analysis of dynamic spanners would yield a simpler (and quantitatively better) alternative to their work.

In view of the importance of robustness against adaptive adversaries, and the fact that random sampling is very widely used in practice (including in streaming settings), we ask the following.

*Are sampling algorithms robust against adaptive adversaries?*

Bernoulli and reservoir sampling.

We mainly focus on two of the most basic and well-known sampling algorithms: Bernoulli sampling and reservoir sampling. The Bernoulli sampling algorithm with parameter $p\in[0,1]$ runs as follows: whenever it receives a stream element $x_{i}$ , the algorithm stores the element with probability $p$ . For a stream of length $n$ the sample size is expected to be $np$ ; and furthermore, it is well-concentrated around this value. We denote this algorithm by $\mathsf{BernoulliSample}$ .

The classical reservoir sampling algorithm [Vit85] (see also [Knu97, Section 3.4.2] and a formal description in Section 2) with parameter $k\in[n]$ maintains a uniform sample of fixed size $k$, acting as follows. The first $k$ elements it receives, $x_{1},\ldots,x_{k}$, are simply added to the memory with probability one. When the algorithm receives its $i$-th element $x_{i}$, where $i>k$, it stores it with probability $k/i$, by overwriting a uniformly random element from the memory (so the memory size is kept fixed at $k$). We henceforth denote this algorithm by $\mathsf{ReservoirSample}$.
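The two algorithms described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's formal description from Section 2; the function names `bernoulli_sample` and `reservoir_sample` and the optional `rng` argument are assumptions made for this example.

```python
import random


def bernoulli_sample(stream, p, rng=None):
    """Store each stream element independently with probability p."""
    if rng is None:
        rng = random.Random()
    return [x for x in stream if rng.random() < p]


def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform sample of fixed size k over the stream."""
    if rng is None:
        rng = random.Random()
    memory = []
    for i, x in enumerate(stream, start=1):
        if i <= k:
            # The first k elements are stored with probability one.
            memory.append(x)
        elif rng.random() < k / i:
            # Store x_i with probability k/i by overwriting a uniformly
            # random slot, so the memory size stays exactly k.
            memory[rng.randrange(k)] = x
    return memory
```

For a stream of length $n$, `bernoulli_sample` returns about $np$ elements on average, while `reservoir_sample` always returns exactly $k$ elements once the stream has at least $k$ elements.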

Attacking sampling algorithms.

To answer the question above of whether sampling algorithms are robust against adversarially chosen streams, we must first define a notion of a representative sample, as several notions might be appropriate. However, we begin the discussion with an example showing how to attack the Bernoulli (and reservoir) sampling algorithm with respect to nearly any definition of “representative”.

Consider a setting where the stream consists of $n$ points $x_{1},\ldots,x_{n}$ in the real interval $[0,1]$. $\mathsf{BernoulliSample}$ receives these points and samples each one independently with probability $p<1$. One can observe that, in the static setting and for sufficiently large $p$, the sampled set will be a good representation of all $n$ points for various definitions of the term “representation”. For example, the median of the stream will be $\varepsilon$-close (see Footnote 1) to the median of the sampled elements with high probability, as long as $p=\frac{c}{\varepsilon^{2}n}$ for some constant $c>0$ (this also holds for any other quantile).

Footnote 1: The term “close” here means that the median of the sampled set will be an element whose order among the elements of the full stream, when the elements are sorted by value from smallest to largest, is within the range $(1\pm\varepsilon)n/2$, with high probability, where the parameter $\varepsilon$ depends on the probability $p$.

Consider the following adaptive adversary, which demonstrates how the adaptive setting differs. $\mathsf{Adversary}$ keeps a “working range” at any point during the game, starting with the full range $[0,1]$. In the first round, $\mathsf{Adversary}$ chooses the number $x_{1}=0.5$ as the first element in the stream. If $x_{1}$ is sampled, then $\mathsf{Adversary}$ moves to the range $[0.5,1]$, and otherwise, to the range $[0,0.5]$. Next, $\mathsf{Adversary}$ submits $x_{2}$ as the middle of the current range. This continues for $n$ steps. Formally, $\mathsf{Adversary}$’s strategy is as follows. Set $a_{1}=0$ and $b_{1}=1$. In round $i$, where $i$ runs from $1$ to $n$, $\mathsf{Adversary}$ submits $x_{i}=\frac{a_{i}+b_{i}}{2}$ to $\mathsf{BernoulliSample}$; if $x_{i}$ is sampled then $\mathsf{Adversary}$ sets $a_{i+1}=x_{i},b_{i+1}=b_{i}$, and otherwise, it sets $a_{i+1}=a_{i},b_{i+1}=x_{i}$. The final stream is $x_{1},\ldots,x_{n}$.

Note that at any point throughout the process, $\mathsf{Adversary}$ always submits an element that is larger than all elements in the current sampled set, and also smaller than all the non-sampled elements of the stream. Therefore, the end result is that after this process is over, with probability 1, the $k$ sampled elements are precisely the smallest $k$ elements in the stream. Of course, the median of the sampled set is far from the median of the stream as such a subset is very unrepresentative of the data. Actually, one might consider it as the “most unrepresentative” subset of the data.
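The attack above is easy to simulate. The sketch below plays it against Bernoulli sampling and checks that every sampled element is smaller than every non-sampled one. The function name `bisection_attack` and the chosen values of `n`, `p`, and the seed are assumptions made for this example.

```python
import random
from fractions import Fraction


def bisection_attack(n, p, rng):
    """Play the bisection attack against Bernoulli sampling for n rounds.

    Returns (stream, sampled_flags), where sampled_flags[i] tells whether
    stream[i] was stored by the sampler.
    """
    a, b = Fraction(0), Fraction(1)   # current working range [a, b]
    stream, sampled_flags = [], []
    for _ in range(n):
        x = (a + b) / 2               # submit the midpoint of the range
        sampled = rng.random() < p    # the sampler's coin, observed by the adversary
        stream.append(x)
        sampled_flags.append(sampled)
        if sampled:
            a = x                     # all later elements are larger than x
        else:
            b = x                     # all later elements are smaller than x
    return stream, sampled_flags


stream, flags = bisection_attack(n=200, p=0.1, rng=random.Random(1))
sample = [x for x, s in zip(stream, flags) if s]
rest = [x for x, s in zip(stream, flags) if not s]
# The sample consists of the smallest elements of the stream.
attack_succeeded = not sample or not rest or max(sample) < min(rest)
```

Exact rationals (`fractions.Fraction`) are used instead of floats because each round needs one more bit of precision: the element submitted in round $i$ has denominator $2^{i}$.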

The exact same attack on $\mathsf{BernoulliSample}$ works almost as effectively against $\mathsf{ReservoirSample}$ . In this case, the attack will cause all of the $k$ sampled elements at the end of the process to lie among the first $O(k\ln n)$ elements with high probability. For more details, see Section 5.

The good news.

This attack joins a line of attacks in the adversarial model. Lipton and Naughton [LN93] showed that an adversary that can measure the time of operations in a dictionary can use this information to increase the probability of a collision and as a result, significantly decrease the performance of the hash table. Hardt and Woodruff [HW13] showed that linear sketches are inherently non-robust and cannot be used to compute the Euclidean norm of their input (where in the static setting they are used mainly for this reason). Naor and Yogev [NY15] showed that Bloom filters are susceptible to attacks by an adaptive stream of queries if the adversary is computationally unbounded and they also constructed a robust Bloom filter against computationally bounded adversaries.

In our case, we note that one might categorize the given attack as “theoretical” only. In practice, it is unrealistic to assume that the universe from which $\mathsf{Adversary}$ can pick elements is an infinite set; how would the attack look, then, if the universe is the discrete set $[N]=\{1,\ldots,N\}$? $\mathsf{Adversary}$ halves the range $[0,1]$ $n$ times, meaning that the precision required of the elements is exponential; the analogous attack in the discrete setting requires $N$ to be exponentially large with respect to the stream size $n$. Such a universe size is large and “unrealistic”: for $\mathsf{Sampler}$ to memorize even a single element requires memory size that is linear in $n$, whilst sampling and streaming algorithms usually aim to use an amount of memory sublinear in $n$.

Thus, the question remains whether there exist attacks that can be performed on elements using substantially less precision, that is, on a significantly smaller size of discrete universe. In this work, we bring good news to both the Bernoulli and reservoir sampling algorithms by answering this question negatively. We show that both sampling algorithms, with the right parameters, will output a representative sample with good probability regardless of $\mathsf{Adversary}$ ’s strategy, thus exhibiting robustness for these algorithms in adversarial settings.

We note that any deterministic algorithm that works in the static setting is inherently robust in the adversarial adaptive setting as well. However, in many cases, deterministic algorithms with small memory simply do not exist, or they are complicated and tailored for a specific task. Here, we enjoy the simplicity of a generic randomized sampling algorithm combined with the robust guarantees of our framework.

What is a representative sample?

Perhaps the most standard and well-known notion of being representative is that of an $\varepsilon$ -approximation, first suggested by Vapnik and Chervonenkis [VC71] (see also [MV17]), which originated as a natural notion of discrepancy [Cha01] in the geometric literature. It is closely related to the celebrated notion of VC-dimension [VC71, Sau72, She72], and captures many quantitative properties that are desired in a random subset. Let $X=(x_{1},\ldots,x_{n})$ be a sequence of elements from the universe $U$ (repetitions are allowed) and let $R\subseteq U$ . The density of $R$ in $X$ is the fraction of elements in $X$ that are also in $R$ (i.e., $d_{R}(X)=\Pr_{i\in[n]}[x_{i}\in R]$ ).

A set system is simply a pair $(U,\mathcal{R})$ where $\mathcal{R}\subseteq 2^{U}$ is a collection of subsets. A non-empty subsequence $S$ of $X$ is an $\varepsilon$-approximation of $X$ with respect to the set system $(U,\mathcal{R})$ if it preserves densities (up to an additive error of $\varepsilon$) for all subsets $R\in\mathcal{R}$.

Definition 1.1 ( $\varepsilon$ -approximation).

We say that a (non-empty) sample $S$ is an $\varepsilon$ -approximation of $X$ with respect to $\mathcal{R}$ if for any subset $R\in\mathcal{R}$ it holds that $\left|d_{R}(X)-d_{R}(S)\right|\leq\varepsilon.$
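Definition 1.1 translates directly into a brute-force check. The sketch below is only an illustration; the names `density` and `is_eps_approximation` are assumptions, and each range $R$ is represented by any Python container supporting the `in` operator.

```python
def density(R, X):
    """Fraction of the elements of the sequence X that belong to the set R."""
    return sum(1 for x in X if x in R) / len(X)


def is_eps_approximation(S, X, ranges, eps):
    """Check Definition 1.1: |d_R(X) - d_R(S)| <= eps for every R in ranges."""
    if not S:
        return False  # the definition requires a non-empty sample
    return all(abs(density(R, X) - density(R, S)) <= eps for R in ranges)


# Example: the stream 1..100 with all prefix ranges [1, b].
X = list(range(1, 101))
prefixes = [range(1, b + 1) for b in X]
evenly_spread = list(range(5, 100, 10))   # 5, 15, ..., 95
smallest_ten = X[:10]                     # the kind of sample the attack produces
```

Here `evenly_spread` preserves every prefix density up to an error of about $0.05$, whereas `smallest_ten` is off by $0.9$ on the range $[1,10]$.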

If the universe $U$ is well-ordered, it is natural to take $\mathcal{R}$ as the collection of all consecutive intervals in $U$, that is, $\mathcal{R}=\{[a,b]:a\leq b\in U\}$ (including all singletons $[a,a]$). With this set system in hand, $\varepsilon$-approximation is a natural form of “good representation” in the streaming setting, as evidenced by its deep connection to multiple classical problems in the streaming literature, like approximate median, and more generally, quantile estimation [MRL99, GK01, WLYC13, GK16, KLL16] and range searching [BCEG07]. In particular, if $S$ is an $\varepsilon$-approximation of $X$ w.r.t. $(U,\mathcal{R})$, then any $q$-quantile of $S$ is $\varepsilon$-close to the $q$-quantile of $X$; this holds simultaneously for all quantiles (see Section 1.2).

1.1 Our Results

Fix a set system $(U,\mathcal{R})$ over the universe $U$ . A sampling algorithm is called $(\varepsilon,\delta)$ -robust if for any (even computationally unbounded) strategy of $\mathsf{Adversary}$ , the output sample $S$ is an $\varepsilon$ -approximation of the whole stream $X$ with respect to $(U,\mathcal{R})$ , with probability at least $1-\delta$ .

Our main result is an upper bound (“good news”) on the $(\varepsilon,\delta)$-robustness of Bernoulli and reservoir sampling, later complemented with near-matching lower bounds.

Theorem 1.2.

For any $0<\varepsilon,\delta<1$ , set system $(U,\mathcal{R})$ , and stream length $n$ , the following holds.

• $\mathsf{BernoulliSample}$ with parameter $p\geq 10\cdot\frac{\ln|\mathcal{R}|+\ln(4/\delta)}{\varepsilon^{2}n}$ is $(\varepsilon,\delta)$-robust.

• $\mathsf{ReservoirSample}$ with parameter $k\geq 2\cdot\frac{\ln|\mathcal{R}|+\ln(2/\delta)}{\varepsilon^{2}}$ is $(\varepsilon,\delta)$-robust.

The proof appears in Section 4. As the total number of elements sampled by $\mathsf{BernoulliSample}$ is well-concentrated around $np$ , the above theorem implies that a sample of total size (at least) $\Theta(\frac{\ln|\mathcal{R}|+\ln\frac{1}{\delta}}{\varepsilon^{2}})$ , obtained by any of the algorithms, $\mathsf{BernoulliSample}$ or $\mathsf{ReservoirSample}$ , is an $\varepsilon$ -approximation with probability $1-\delta$ .

This should be compared with the static setting, where the same result is known as long as $p\geq c\cdot\frac{d+\ln\frac{1}{\delta}}{\varepsilon^{2}n}$ for $\mathsf{BernoulliSample}$ , and $k\geq c\cdot\frac{d+\ln\frac{1}{\delta}}{\varepsilon^{2}}$ for $\mathsf{ReservoirSample}$ , where $d$ is the VC-dimension of $(U,\mathcal{R})$ and $c>0$ is a constant [VC71, Tal94, LLS01] (see also [MV17]).

Thus, to make the static sampling algorithm robust in the adaptive setting one only needs to modify the sample size by replacing the VC-dimension term $d$ with the cardinality dimension $\ln|\mathcal{R}|$ (and update the multiplicative constant). Below, in our lower bounds, we show that this increase in the sample size is inherent, and not a byproduct of our analysis.
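To get a feel for the sample sizes in Theorem 1.2, the bounds can be evaluated numerically. The helper names `robust_bernoulli_p` and `robust_reservoir_k` and the example values are assumptions made for this sketch; capping $p$ at $1$ is also a choice made here, not part of the theorem.

```python
import math


def robust_bernoulli_p(num_ranges, eps, delta, n):
    """Smallest p satisfying the BernoulliSample bound of Theorem 1.2.

    If the bound exceeds 1 the value is capped at 1; the theorem does not
    cover that case, but p = 1 simply stores the whole stream.
    """
    p = 10 * (math.log(num_ranges) + math.log(4 / delta)) / (eps ** 2 * n)
    return min(1.0, p)


def robust_reservoir_k(num_ranges, eps, delta):
    """Smallest integer k satisfying the ReservoirSample bound of Theorem 1.2."""
    return math.ceil(2 * (math.log(num_ranges) + math.log(2 / delta)) / eps ** 2)


# Example: |R| = 2^32 ranges, eps = 0.05, delta = 0.01, stream length 10^7.
k = robust_reservoir_k(2 ** 32, eps=0.05, delta=0.01)
p = robust_bernoulli_p(2 ** 32, eps=0.05, delta=0.01, n=10 ** 7)
```

Note that $k$ does not depend on the stream length $n$, while $p$ scales as $1/n$ so that the expected sample size $np$ stays fixed.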

Lower Bounds.

We next show that being adaptively robust comes at a price. That is, the dependence on the cardinality dimension, as opposed to the VC dimension, is necessary. By an improved version of the attack described in the introduction, we show the following:

Theorem 1.3.

There exists a constant $c>0$ and a set system $(U,\mathcal{R})$ with VC-dimension 1, such that for any $0<\varepsilon,\delta<1/2$:

1. The $\mathsf{BernoulliSample}$ algorithm with parameter $p<c\cdot\frac{\ln|\mathcal{R}|}{n\ln n}$ is not $(\varepsilon,\delta)$-robust.

2. The $\mathsf{ReservoirSample}$ algorithm with parameter $k<c\cdot\frac{\ln|\mathcal{R}|}{\ln n}$ is not $(\varepsilon,\delta)$-robust.

Moreover, for any $n^{6\ln n}\leq N\leq 2^{n/2}$ , there exists $(U,\mathcal{R})$ as above where $|\mathcal{R}|=|U|=N$ .

The proof can be found in Section 5.

Continuous robustness.

The condition of $(\varepsilon,\delta)$-robustness requires that the sample be $\varepsilon$-representative of the stream at the end of the process. What if we wish the sample to be representative of the stream at any point throughout the stream? Formally, we say that a sampling algorithm is $(\varepsilon,\delta)$-continuously robust if, with probability at least $1-\delta$, at any point $i\in[n]$ the sampled set $S_{i}$ is an $\varepsilon$-approximation of the first $i$ elements of the stream, i.e., of $X_{i}=(x_{1},\ldots,x_{i})$. The next theorem shows that continuous robustness of $\mathsf{ReservoirSample}$ can be obtained with just a small overhead compared to “standard” robustness. (For $\mathsf{BernoulliSample}$ one cannot hope for such a result to be true, at least for the above definition of continuous robustness.)

Theorem 1.4.

There exists $c>0$ , such that for any $0<\varepsilon,\delta<1/2$ , set system $(U,\mathcal{R})$ , and stream length $n$ , $\mathsf{ReservoirSample}$ with parameter $k\geq c\cdot\frac{\ln|\mathcal{R}|+\ln 1/\delta+\ln 1/\varepsilon+\ln\ln n}{\varepsilon^{2}}$ is $(\varepsilon,\delta)$ -continuously robust.

Moreover, if only continuous robustness against a static adversary is desired, then the $\ln|\mathcal{R}|$ term can be replaced with the VC-dimension of $(U,\mathcal{R})$ .

We are not aware of a previous analysis of continuous robustness, even in the static setting. The proof, appearing in Section 6, follows by applying Theorem 1.2 (or its static analogue) at carefully picked “checkpoints” $k=i_{1}\leq i_{2}\leq\ldots\leq i_{t}=n$ along the stream, where $t=O(\varepsilon^{-1}\ln n)$. It shows that if the sample $S_{i}$ is representative of the stream $X_{i}$ at each of the points $i=i_{1},\ldots,i_{t-1}$, then with high probability, the sample is also representative at every other point along the stream. (We remark that a similar statement with weaker dependence on $n$ can be obtained from Theorem 1.2 by a straightforward union bound.)
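One natural way to obtain $O(\varepsilon^{-1}\ln n)$ checkpoints is to space them geometrically, so that consecutive checkpoints differ by a factor of roughly $1+\varepsilon$. The sketch below illustrates this count only. It is an assumption made for this example and not necessarily the construction used in Section 6; the name `geometric_checkpoints` is hypothetical.

```python
import math


def geometric_checkpoints(k, n, eps):
    """Checkpoints k = i_1 < i_2 < ... < i_t = n with roughly geometric spacing.

    Each checkpoint is about (1 + eps) times the previous one, and always
    advances by at least 1 so that the loop terminates.
    """
    points = [k]
    while points[-1] < n:
        nxt = max(points[-1] + 1, math.floor((1 + eps) * points[-1]))
        points.append(min(n, nxt))
    return points


checkpoints = geometric_checkpoints(k=100, n=10 ** 6, eps=0.1)
```

With these example values the number of checkpoints is on the order of $\varepsilon^{-1}\ln(n/k)$, far fewer than the $n-k$ points a naive union bound over all stream positions would use.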

Comparison to deterministic sampling algorithms.

Our results show that sampling algorithms provide an $\varepsilon$ -approximation in the adversarial model. One advantage of using the notion of $\varepsilon$ -approximation is its wide array of applications, where for each such task we get a streaming algorithm in the adversarial model as described in the following subsection. We stress that for any specific task a deterministic algorithm that works in the static setting will also automatically be robust in the adversarial setting. However, deterministic algorithms tend to be more complicated, and in some cases they require larger memory. Here, we focus on showing that the most simple and generic sampling algorithms “as is” are robust in our adaptive model and yield a representative sample of the data that can be used for many different applications.

The best known deterministic algorithm for computing an $\varepsilon$-approximating sample in the streaming model is that of Bagchi et al. [BCEG07]. The sample size they obtain is $O(\varepsilon^{-2}\ln 1/\varepsilon)$; the working space of their algorithm and the processing time per element are of the form $\varepsilon^{-2d-O(1)}(\ln n)^{O(d)}$, where $d$ is the scaffold dimension (see Footnote 2) of the set system. The exact bounds are rather intricate, see Corollary 4.2 in [BCEG07]. While the space requirement of their approach does not have a dependence on $\ln|\mathcal{R}|$, its dependence on $\varepsilon$ and $\ln n$ is generally worse than ours, making their bounds somewhat incomparable to ours. Finally, we note that there exist more efficient methods to generate an $\varepsilon$-approximation in some special cases, e.g., when the set system consists of rectangles or halfspaces [STZ04].

Footnote 2: The scaffold dimension is a variant of the VC-dimension equal to $\lceil\ln|\mathcal{R}|/\ln|U|\rceil$.

1.2 Applications of Our Results

We next describe several representative applications and usages of $\varepsilon$-approximations (see also [BCEG07] for more applications in the area of robust statistics). For some of these applications, there exist deterministic algorithms known to require less memory than the simple random sampling models discussed in this paper. However, one area where our generic random sampling approach shines compared to deterministic approaches is the query complexity or running time (under a suitable computational model). Indeed, while deterministic algorithms must inherently query all elements in the stream in order to run correctly, our random sampling methods query just a small sublinear portion of the elements in the stream.

Consequently, to the best of our knowledge, Bernoulli and reservoir sampling are the first two methods known to compute an $\varepsilon$ -approximation (and as a byproduct, solve the tasks described in this subsection) in adversarial situations where it is unrealistic or too costly to query all elements in the stream. The last part of this subsection exhibits an example of one such situation.

Quantile approximation.

As was previously mentioned, $\varepsilon$ -approximations have a deep connection to approximate median computation (and more generally, quantile estimation). Assume the universe $U$ is well-ordered. We say that a streaming algorithm is an $(\varepsilon,\delta)$ -robust quantile sketch if, in our adversarial model, it provides a sample that allows one to approximate the rank of any element in the stream up to additive error $\varepsilon n$ with probability at least $1-\delta$ ; here, the rank of an element $x_{i}$ in a stream $x_{1},\ldots,x_{n}$ is the total number of elements $x_{j}$ in the stream such that $x_{j}\leq x_{i}$ . Observe that this is achieved with an $\varepsilon$ -approximation with respect to the set system $(U,\mathcal{R})$ where $\mathcal{R}=\{[1,b]:b\in U\}$ . For example, set $b$ to be the median of the stream. Since the density of the range $[1,b]$ is preserved in the sample, we know that the median of the sample will be $\varepsilon$ -close to the median of the stream. This works for any other quantile simultaneously. The sample size is $\Theta(\frac{\ln|U|+\ln(1/\delta)}{\varepsilon^{2}})$ .

Corollary 1.5.

For any $0<\varepsilon,\delta<1$ , well-ordered universe $U$ , and stream length $n$ , $\mathsf{BernoulliSample}$ with parameter $p\geq 10\cdot\frac{\ln|U|+\ln(4/\delta)}{\varepsilon^{2}n}$ is an $(\varepsilon,\delta)$ -robust quantile sketch. The same holds for the $\mathsf{ReservoirSample}$ algorithm with parameter $k\geq 2\cdot\frac{\ln|U|+\ln(2/\delta)}{\varepsilon^{2}}$ .

A corollary in the same spirit regarding continuously robust quantile sketches can be derived from Theorem 1.4.
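As an informal illustration of the quantile application, the following Python sketch draws a Bernoulli sample and uses it to estimate ranks and quantiles. It is only a sketch under stated assumptions: the universe is taken to be a set of integers, and the helper names `bernoulli_sample`, `estimate_rank` and `sample_quantile` are hypothetical and do not appear in the paper.

```python
import math
import random
from bisect import bisect_right


def bernoulli_sample(stream, p, rng):
    """Keep each stream element independently with probability p."""
    return [x for x in stream if rng.random() < p]


def estimate_rank(sample, b, n):
    """Estimate the rank of b in a length-n stream as d_R(S) * n for R = [1, b]."""
    if not sample:
        return 0.0
    ordered = sorted(sample)
    return bisect_right(ordered, b) / len(ordered) * n


def sample_quantile(sample, q):
    """Return the q-quantile of the sample (0 < q <= 1) as a proxy for the stream."""
    ordered = sorted(sample)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


if __name__ == "__main__":
    rng = random.Random(0)
    stream = [rng.randint(1, 1000) for _ in range(100_000)]
    sample = bernoulli_sample(stream, 0.05, rng)
    print(estimate_rank(sample, 500, len(stream)), sample_quantile(sample, 0.5))
```

The sampling rate `0.05` in the usage lines is an arbitrary choice for demonstration; the rate required for a formal guarantee is the one stated in Corollary 1.5.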

Range queries.

Suppose that the universe is of the form $U=[m]^{d}$ for some parameters $m$ and $d$ . One basic problem is that of range queries: one is given a set of ranges $\mathcal{R}$ and each query consists of a range $R\in\mathcal{R}$ where the desired answer is the number of points in the stream that are in this range. Popular choices of such ranges are axis-aligned or rotated boxes, spherical ranges and simplicial ranges. An $\varepsilon$ -approximation allows us to answer such range queries up to an additive error of $\varepsilon n$ . Suppose the sampled set is $S$ ; then an answer is given by computing $d_{R}(S)\cdot n=|R\cap S|\cdot n/|S|$ . For example, when $\mathcal{R}$ consists of all axis-parallel boxes, $\ln|\mathcal{R}|=O(d\ln m)$ and thus the sample size required to answer range queries that are robust against adversarial streams is $|S|=O\left(\frac{d\ln m+\ln(1/\delta)}{\varepsilon^{2}}\right)$ ; for rotated boxes, one should replace $d$ with $d^{2}$ in this expression. See [BCEG07] for more details on the connection between $\varepsilon$ -approximations and range queries.
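To make the range-query estimate concrete, here is a small Python sketch for axis-parallel boxes. It assumes points are tuples of integers and that a box is given by its lower and upper corners; the function names `in_box` and `estimate_range_count` are hypothetical.

```python
def in_box(point, lo, hi):
    """Check whether a point lies in the axis-parallel box [lo_1, hi_1] x ... x [lo_d, hi_d]."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))


def estimate_range_count(sample, lo, hi, n):
    """Estimate how many of the n stream points fall in the box, as d_R(S) * n."""
    if not sample:
        return 0.0
    hits = sum(1 for point in sample if in_box(point, lo, hi))
    return hits / len(sample) * n
```

If the sample is an $\varepsilon$ -approximation with respect to the family of boxes, the returned value is within $\varepsilon n$ of the true count.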

Center points.

Our result is also useful for computing $\beta$ -center points. A point $c$ in the stream is a $\beta$ -center point if every closed halfspace containing $c$ in fact contains at least $\beta n$ points of the stream. In [CEM+96, Lemma 6.1] it has been shown that an $\varepsilon$ -approximation (with respect to half-spaces) can be used to get a $\beta$ -center point for suitable choices of the parameters. For example, setting $\varepsilon=\beta/5$ we get that a $6\beta/5$ -center of the sample $S$ is a $\beta$ -center of the stream $X$ . Thus, we can compute a $\beta$ -center of a stream in the adversarial model. See also [BCEG07].

Heavy hitters.

Finding those elements that appear many times in a stream is a fundamental problem in data mining, with a myriad of practical applications. In the heavy hitters problem, there is a threshold $\alpha$ and an error parameter $\varepsilon$ . The goal is to output a list of elements such that if an element $x$ appears more than $\alpha n$ times in the stream (i.e., $d_{x}(X)\geq\alpha$ ) it must be included in the list, and if an element appears fewer than $(\alpha-\varepsilon)n$ times in the stream (i.e., $d_{x}(X)\leq\alpha-\varepsilon$ ) it cannot be included in the list.

Our results yield a simple and efficient heavy hitters streaming algorithm in the adversarial model. For any universe $U$ let $\mathcal{R}=\{\{a\}:a\in U\}$ be the set of all singletons. Now, pick $\varepsilon^{\prime}=\varepsilon/3$ and use either Bernoulli or reservoir sampling to compute an $\varepsilon^{\prime}$ -approximation $S$ of the stream $X$ , outputting all elements $x\in S$ with $d_{\{x\}}(S)\geq\alpha-\varepsilon^{\prime}$ . Indeed, if $d_{x}(X)\geq\alpha$ then $d_{x}(S)\geq\alpha-\varepsilon^{\prime}$ . On the other hand, if $d_{x}(X)\leq\alpha-\varepsilon$ then $d_{x}(S)\leq\alpha-\varepsilon+\varepsilon^{\prime}<\alpha-\varepsilon^{\prime}$ .

Corollary 1.6.

There exists $c>0$ such that for any $0<\varepsilon,\delta<1/2$ , universe $U$ , and stream length $n$ , $\mathsf{BernoulliSample}$ with parameter $p\geq c\cdot\frac{\ln|U|+\ln(1/\delta)}{\varepsilon^{2}n}$ solves the heavy hitters problem with error $\varepsilon$ in the adversarial model. The same holds for $\mathsf{ReservoirSample}$ with parameter $k\geq c\cdot\frac{\ln|U|+\ln(1/\delta)}{\varepsilon^{2}}$ .
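The reporting rule described above can be sketched in a few lines of Python. This is an illustration only: the function name `heavy_hitters_from_sample` is hypothetical, and the sample is assumed to have been produced already by Bernoulli or reservoir sampling with a parameter as in Corollary 1.6.

```python
from collections import Counter


def heavy_hitters_from_sample(sample, alpha, eps):
    """Report every x in the sample whose density d_x(S) is at least alpha - eps/3."""
    if not sample:
        return set()
    threshold = alpha - eps / 3
    counts = Counter(sample)
    return {x for x, c in counts.items() if c / len(sample) >= threshold}
```

For example, on the sample `["a"] * 5 + ["b"] * 3 + ["c"] * 2` with `alpha = 0.4` and `eps = 0.15`, the threshold is `0.35`, so only `"a"` (density `0.5`) is reported.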

Clustering.

The task of partitioning data elements into separate groups, where the elements in each group are “similar” and elements in different groups are “dissimilar”, is fundamental and useful for numerous applications across computer science. There has been much interest in clustering in a streaming setting; see e.g. [GLA16] for a survey on recent results. Our results suggest a generic framework to accelerate clustering algorithms in the adversarial model: Instead of running clustering on the full data, one can simply sample the data to obtain (with high probability, even against an adversary) an $\varepsilon$ -approximation of it, run the clustering algorithm on the sample, and then extrapolate the results to the full dataset.

Sampling in modern data-processing systems.

It is very common to use random sampling (sometimes “in disguise”) in modern data-intensive systems that operate on streaming data, arriving in an online manner. As an illustrative example, consider the following distributed database [OV11] setting. Suppose that a database system must receive and process a huge amount of queries per second. It is unrealistic for a single server to handle all the queries, and hence, for load balancing purposes, each incoming query is randomly assigned to one of $K$ query-processing servers. Seeing that the set of queries that each such server receives is essentially a Bernoulli random sample (with parameter $p=1/K$ ) of the full stream, one hopes that the portion of the stream sampled by each of these servers would truthfully represent the whole data stream (e.g., for query optimization purposes), even if the stream changes with time (either unintentionally or by a malicious adversary). Such “representation guarantees” are also desirable in distributed machine learning systems [GDG+17, SKYL17], where each processing unit learns a model according to the portion of the data it received, and the models are then aggregated, with the hope that each of the units processed “similar” data.

In general, modern data-intensive systems like those described above become more and more complicated with time, consisting of a large number of different components. Making these systems robust against environmental changes in the data, let alone adversarial changes, is one of the greatest challenges in modern computer science. From our perspective, the following question naturally emerges:

Is random sampling a risk in modern data processing systems?

Fortunately, our results indicate that the answer to this question is largely negative. Our upper bounds, Theorems 1.2 and 1.4, show that a sufficiently large sample suffices to circumvent adversarial changes of the environment.

1.3 Related Work

Online learning.

One related field to our work is online learning, which was introduced for settings where the data is given in a sequential online manner or where it is necessary for the learning algorithm to adapt to changes in the data. Examples include stock price predictions, ad click prediction, and more (see [Sha12] for an overview and more examples).

Similar to our model, online learning is viewed as a repeated game between a learning algorithm (or a predictor) and the environment (i.e., the adversary). It considers $n$ rounds, where in each round the environment submits an instance $x_{i}$ , the learning algorithm then makes a prediction for $x_{i}$ , and the environment, in turn, chooses a loss for this prediction and sends it as feedback to the algorithm. The goal in this model is usually to minimize regret (the sum of losses) compared to the best fixed prediction in hindsight. This is the typical setting (e.g., [HAK07, SST10]); however, many different variants exist (e.g., [DGS15, ZLZ18]).

PAC learning.

In the PAC-learning framework [Val84], the learner algorithm receives samples generated from an unknown distribution and must choose a hypothesis function from a family of hypotheses that best predicts the data with respect to the given distribution. It is known that the number of samples required for a class to be learnable in this model depends on the VC-dimension of the class.

A recent work of Cullina et al. [CBM18] investigates the effect of evasion adversaries on the PAC-learning framework, coining the term of adversarial VC-dimension for the parameter governing the sample complexity. Despite the name similarity, their context is seemingly unrelated to ours (in particular, it is not a streaming setting), and correspondingly, their notion of adversarial VC-dimension does not seem to relate to our work.

Adversarial examples in deep learning.

A very popular line of research in modern deep learning proposes methods to attack neural networks, and countermeasures to these attacks. In such a setting, an adversary performs adaptive queries to the learned model in order to fool the model via a malicious input. The learning algorithms usually have an underlying assumption that the training and test data are generated from the same statistical distribution. However, in practice, the presence of an adaptive adversary violates this assumption. There are many devastating examples of attacks on learning models [SZS+14, BCM+13, PMG+17, BR18, MHS19] and we stress that currently, the understanding of techniques to defend against such adversaries is rather limited [GMP18, MW18, MM19, MHS19].

Maintaining random samples.

Reservoir sampling is a simple and elegant algorithm for maintaining a random sample of a stream [Vit85], and since its proposal, many flavors have been introduced. Chung, Tirthapura, and Woodruff [CTW16] generalized reservoir sampling to the setting of multiple distributed streams, which need to coordinate in order to continuously respond to queries over the union of all streams observed so far (see also Cormode et al. [CMYZ12]). Another variant is weighted reservoir sampling, where the probability of sampling an element is proportional to a weight associated with the element in the stream [ES06, BOV15]. A distributed version as above was recently considered for the weighted case as well [JSTW19].

1.4 Paper Organization

Section 2 contains an overview of our adversarial model and a more precise and detailed definition than the one given in the introduction. In Section 3 we mention several concentration inequalities required for our analysis. In Section 4 we present and prove our main technical lemma, from which we derive Theorem 1.2. This includes analysis of both $\mathsf{BernoulliSample}$ and $\mathsf{ReservoirSample}$ . In Section 5 we present our “attack”, i.e., our lower bound showing the tightness of our result. Finally, in Section 6, we prove our upper bounds in the continuous setting.

2 The Adversarial Model for Sampling

In this section, we formally define the online adversarial model discussed in this paper. Roughly speaking, we say that $\mathsf{Sampler}$ is an $(\varepsilon,\delta)$ -robust sampling algorithm for a set system $(U,\mathcal{R})$ if for any adversary choosing an adaptive stream of elements $X=(x_{1},\ldots,x_{n})$ , the final state of the sampling algorithm $\sigma_{n}$ is an $\varepsilon$ -approximation of the stream with probability $1-\delta$ . This is formulated using a game, $\mathsf{AdaptiveGame}$ , between two players, $\mathsf{Sampler}$ and $\mathsf{Adversary}$ .

Rules of the game:

1. $\mathsf{Sampler}$ is a streaming algorithm, which gets a sequence of $n$ elements one by one $x_{1},\ldots,x_{n}$ in an online manner (the sampling algorithms we discuss in this paper do not need to know $n$ in advance). Upon receiving an element $x_{i}$ , $\mathsf{Sampler}$ can perform an arbitrary computation (the running time can be unbounded) and update a local state $\sigma$ . We denote the local state after $i$ steps by $\sigma_{i}$ , and write $\sigma_{i}\leftarrow\mathsf{Sampler}(\sigma_{i-1},x_{i})$ .

2. The stream is chosen adaptively by $\mathsf{Adversary}$ : a probabilistic (unbounded) player that, given all previously sent elements $x_{1},\ldots,x_{i-1}$ and the current state $\sigma_{i-1}$ , chooses the next element $x_{i}$ to submit. The strategy that $\mathsf{Adversary}$ employs along the way, that is, the probability distribution over the choice of $x_{i}$ given any possible set of values $x_{1},\ldots,x_{i-1}$ and $\sigma_{i-1}$ , is fixed in advance. The underlying (finite or infinite) set from which $\mathsf{Adversary}$ is allowed to choose elements during the game is called the universe, and denoted by $U$ . We assume that $U$ does not change along the game.

3. Once all $n$ rounds of the game have ended, $\mathsf{Sampler}$ outputs $\sigma_{n}$ . For the sampling algorithms discussed in this paper, $S:=\sigma_{n}$ is a subsequence of the stream $X=(x_{1},\ldots,x_{n})$ . $S$ is usually called the sample obtained by $\mathsf{Sampler}$ in the game.

For an illustration of the rules of the game see Figure 1.

Using the game defined above, we now describe what it means for a sampling algorithm to be (adversarially) robust.

Definition 2.1 (Robust sampling algorithm).

We say that a sampling algorithm $\mathsf{Sampler}$ is $(\varepsilon,\delta)$ -robust with respect to the set system $(U,\mathcal{R})$ and the stream length $n$ if for any (even unbounded) strategy of $\mathsf{Adversary}$ , it holds that

$$\Pr\left[\mathsf{AdaptiveGame}_{\varepsilon}(\mathsf{Sampler},\mathsf{Adversary})=1\right]\geq 1-\delta.$$

The memory size used by $\mathsf{Sampler}$ is defined to be the maximal size of $\sigma$ throughout the process of $\mathsf{AdaptiveGame}$ .
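The game can be simulated directly, which may help in experimenting with concrete adversarial strategies. The Python sketch below is an informal illustration under our own assumptions: the state is taken to be the sample itself, the set system is given as an explicit list of Python sets, and the names `adaptive_game`, `bernoulli_step` and `contrarian_adversary` are hypothetical.

```python
import random


def density(R, seq):
    """d_R(seq): fraction of elements of seq lying in R (0 for an empty sequence)."""
    return sum(1 for x in seq if x in R) / len(seq) if seq else 0.0


def adaptive_game(sampler_step, adversary, ranges, n, eps, rng):
    """Play n rounds; return 1 if the final sample is an eps-approximation, else 0."""
    stream, state = [], []
    for i in range(1, n + 1):
        x = adversary(stream, state, rng)       # sees x_1..x_{i-1} and sigma_{i-1}
        state = sampler_step(state, x, i, rng)  # sigma_i <- Sampler(sigma_{i-1}, x_i)
        stream.append(x)
    worst = max(abs(density(R, stream) - density(R, state)) for R in ranges)
    return 1 if worst <= eps else 0


def bernoulli_step(p):
    """Build the update rule of Bernoulli sampling with rate p."""
    def step(state, x, i, rng):
        return state + [x] if rng.random() < p else state
    return step


def contrarian_adversary(stream, state, rng):
    """Hypothetical adaptive strategy: send 1 while {1} is not over-represented in the sample."""
    return 1 if density({1}, state) <= density({1}, stream) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    wins = sum(adaptive_game(bernoulli_step(0.2), contrarian_adversary,
                             [{1}], 2000, 0.1, rng) for _ in range(50))
    print(wins / 50)  # empirical success frequency of the sampler
```

Repeating the game many times, as in the usage lines, gives an empirical estimate of the probability in Definition 2.1 for one fixed adversary; it says nothing about the worst case over all strategies.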

A stronger requirement that one can impose on the sampling algorithm is to hold an $\varepsilon$ -approximation of the stream at any step during the game. To handle this, we define a continuous variant of $\mathsf{AdaptiveGame}$ which we denote $\mathsf{ContinuousAdaptiveGame}$ , presented in Figure 2.

For the sampling algorithms that we consider, the state at any time $\sigma_{i}$ is essentially equal to the sample $S_{i}$ . In any case, the definition of the framework given in Figure 2 generally allows $\sigma_{i}$ to contain additional information, if needed. A sampling algorithm is called $(\varepsilon,\delta)$ -continuously robust if the following holds with probability at least $1-\delta$ : for any strategy of $\mathsf{Adversary}$ , and all $i\in[n]$ , the sample $S_{i}$ is an $\varepsilon$ -approximation of the stream at time $i$ .

Definition 2.2 (Continuously robust sampling algorithm).

We say that a sampling algorithm $\mathsf{Sampler}$ is $(\varepsilon,\delta)$ -continuously robust with respect to the set system $(U,\mathcal{R})$ and the stream length $n$ if for any (even unbounded) strategy of $\mathsf{Adversary}$ , it holds that

$$\Pr\left[\mathsf{ContinuousAdaptiveGame}_{\varepsilon}(\mathsf{Sampler},\mathsf{Adversary})=1\right]\geq 1-\delta.$$

The memory size used by $\mathsf{Sampler}$ is defined to be the maximal size of $\sigma$ throughout the process of $\mathsf{ContinuousAdaptiveGame}$ .

Reservoir sampling.

For completeness, we provide the pseudocode of the reservoir sampling algorithm [Vit85, Knu97]. Here, $k$ denotes the (fixed) memory size of the algorithm, $i$ denotes the current round number, and $x_{i}$ is the currently received element.

$\mathsf{ReservoirSample}(k,i,\sigma_{i-1},x_{i})$ :

1. If $i\leq k$ , then parse $\sigma_{i-1}=x_{1},\ldots,x_{i-1}$ and output $\sigma_{i}=x_{1},\ldots,x_{i}$ .

2. Otherwise, parse $\sigma_{i-1}=s_{1},\ldots,s_{k}$ .

3. With probability $k/i$ do: choose $j\in[k]$ uniformly at random and output $\sigma_{i}=s_{1},\ldots,s_{j-1},x_{i},s_{j+1},\ldots,s_{k}$ .

4. Otherwise, output $\sigma_{i}=\sigma_{i-1}$ .
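For readers who prefer executable code, the pseudocode can be rendered in Python roughly as follows. This is a sketch rather than a reference implementation: the state is a Python list, rounds are 1-indexed, the reservoir is filled during the first $k$ rounds, and the wrapper `reservoir_sample` and the explicit `rng` argument are our own additions.

```python
import random


def reservoir_step(k, i, state, x, rng):
    """One round of reservoir sampling: return sigma_i given sigma_{i-1} and x_i."""
    if i <= k:
        return state + [x]                      # reservoir not yet full: keep x_i
    if rng.random() < k / i:                    # sample x_i with probability k/i
        j = rng.randrange(k)                    # uniformly random slot to evict
        return state[:j] + [x] + state[j + 1:]
    return state                                # x_i is not sampled


def reservoir_sample(stream, k, rng):
    """Run reservoir sampling over a whole stream and return the final sample."""
    state = []
    for i, x in enumerate(stream, start=1):
        state = reservoir_step(k, i, state, x, rng)
    return state
```

Copying the list in every round keeps the correspondence with $\sigma_{i-1}$ and $\sigma_{i}$ explicit; an in-place update would be the natural choice when efficiency matters.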

3 Technical Preliminaries

The logarithms in this paper are usually of base $e$ , and denoted by $\ln$ . The exponential function $\exp{(x)}$ is $e^{x}$ . For an integer $n\in{\mathbb{N}}$ we denote by $[n]$ the set $\{1,\ldots,n\}$ . We state some concentration inequalities, useful for our analysis in later sections. We start with the well-known Chernoff’s inequality for sums of independent random variables.

Theorem 3.1 (Chernoff Bound [Che52]; see Theorem 3.2 in [CL06]).

Let $X_{1},\ldots,X_{m}$ be independent random variables that take the value 1 with probability $p_{i}$ and 0 otherwise, $X=\sum_{i=1}^{m}X_{i}$ , and $\mu={\mathbb{E}}\!\left[{X}\right]$ . Then for any $0<\delta<1$ ,

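$$\Pr\left(X\leq(1-\delta)\mu\right)\leq\exp\left(-\frac{\delta^{2}\mu}{2}\right)$$

and

$$\Pr\left(X\geq(1+\delta)\mu\right)\leq\exp\left(-\frac{\delta^{2}\mu}{2+2\delta/3}\right).$$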

Our analysis of adversarial strategies crucially makes use of martingale inequalities. We thus provide the definition of a martingale.

Definition 3.2.

A martingale is a sequence $X=(X_{0},\ldots,X_{m})$ of random variables with finite means, so that for $0\leq i<m$ , it holds that ${\mathbb{E}}\!\left[{X_{i+1}\mid X_{0},\ldots,X_{i}}\right]=X_{i}$ .

The most basic and well-known martingale inequality, Azuma’s (or Hoeffding’s) inequality, asserts that martingales with bounded differences $|X_{i+1}-X_{i}|$ are well-concentrated around their mean. For our purposes, this inequality does not suffice, and we need a generalized variant of it, due to McDiarmid [McD98, Theorem 3.15]; see also Theorem 4.1 in [Fre75]. The formulation that we shall use is given as Theorem 6.1 in the survey of Chung and Lu [CL06].

Lemma 3.3 (See [CL06], Theorem 6.1).

Let $X=(X_{0},X_{1},\ldots,X_{n})$ be a martingale. Suppose further that for any $1\leq i\leq n$ , the variance satisfies $\text{Var}(X_{i}|X_{0},\ldots,X_{i-1})\leq\sigma_{i}^{2}$ for some values $\sigma_{1},\ldots,\sigma_{n}\geq 0$ , and there exists some $M\geq 0$ so that $|X_{i}-X_{i-1}|\leq M$ always holds. Then, for any $\lambda\geq 0$ , we have

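$$\Pr\left(X_{n}-X_{0}\geq\lambda\right)\leq\exp\left(-\frac{\lambda^{2}}{2\left(\sum_{i=1}^{n}\sigma_{i}^{2}+M\lambda/3\right)}\right).$$

In particular,

$$\Pr\left(|X_{n}-X_{0}|\geq\lambda\right)\leq 2\exp\left(-\frac{\lambda^{2}}{2\left(\sum_{i=1}^{n}\sigma_{i}^{2}+M\lambda/3\right)}\right).$$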

Unlike Azuma’s inequality, Lemma 3.3 is well-suited to deal with martingales where the maximum value $M$ of $|X_{i+1}-X_{i}|$ is large, but the maximum is rarely attained (making the variance much smaller than $M^{2}$ ). The martingales we investigate in this paper depict this behavior.
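The gap between the two inequalities can be seen numerically. The Python sketch below assumes the standard two-sided forms of both bounds, namely $2\exp(-\lambda^{2}/(2nM^{2}))$ for Azuma's inequality and $2\exp(-\lambda^{2}/(2(\sum_{i}\sigma_{i}^{2}+M\lambda/3)))$ for the variance-sensitive bound of [CL06, Theorem 6.1]; the function names are hypothetical. The parameters plugged in are those of the Bernoulli sampling analysis in Section 4 ( $M=1/np$ and $\sum_{i}\sigma_{i}^{2}=1/np$ ).

```python
import math


def azuma_tail(lam, n, M):
    """Two-sided Azuma bound for n steps with increments bounded by M (capped at 1)."""
    return min(1.0, 2 * math.exp(-lam ** 2 / (2 * n * M ** 2)))


def variance_tail(lam, sigma_sq_sum, M):
    """Two-sided variance-sensitive bound in the assumed form of Lemma 3.3 (capped at 1)."""
    return min(1.0, 2 * math.exp(-lam ** 2 / (2 * (sigma_sq_sum + M * lam / 3))))


if __name__ == "__main__":
    n, eps, delta = 10 ** 6, 0.1, 0.01
    p = 10 * math.log(4 / delta) / (eps ** 2 * n)   # rate from Lemma 4.1
    M = 1 / (n * p)
    print(azuma_tail(eps / 2, n, M), variance_tail(eps / 2, 1 / (n * p), M))
```

With these parameters the Azuma bound is vacuous (it is capped at 1), whereas the variance-sensitive bound is below $\delta/2$ , which is the reason the analysis relies on Lemma 3.3.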

4 Adaptive Robustness of Sampling: Main Technical Result

In this section, we prove the main technical lemma underlying our upper bounds for Bernoulli sampling and reservoir sampling. The lemma asserts that for both sampling methods, and any given subset $R$ of the universe $U$ , the fraction of elements from $R$ within the sample typically does not differ by much from the corresponding fraction among the whole stream.

Lemma 4.1.

Fix $\varepsilon,\delta>0$ , a universe $U$ and a subset $R\subseteq U$ , and let $X=(x_{1},x_{2},\ldots,x_{n})$ be the sequence chosen by $\mathsf{Adversary}$ in $\mathsf{AdaptiveGame}_{\varepsilon}$ against either $\mathsf{BernoulliSample}$ or $\mathsf{ReservoirSample}$ .

1. For $\mathsf{BernoulliSample}$ with parameter $p\geq 10\cdot\frac{\ln(4/\delta)}{\varepsilon^{2}n}$ , we have $\Pr(|d_{R}(X)-d_{R}(S)|\geq\varepsilon)\leq\delta$ .

2. For $\mathsf{ReservoirSample}$ with memory size $k\geq 2\cdot\frac{\ln(2/\delta)}{\varepsilon^{2}}$ , it holds that $\Pr(|d_{R}(X)-d_{R}(S)|\geq\varepsilon)\leq\delta$ .

Both of these bounds are tight up to an absolute multiplicative constant, even for a static adversary (that has to submit all elements in advance); see Section 6 for more details.

The proof of Theorem 1.2 follows immediately from Lemma 4.1, and is given below. The proof of Theorem 1.4 requires slightly more effort, and is given in Section 6.

Proof of Theorem 1.2.

Let $(U,\mathcal{R})$ , $\varepsilon$ , $\delta$ , $n$ be as in the statement of the theorem, and let $X$ and $S$ denote the stream and sample, respectively. We start with the Bernoulli sampling case, and assume that $p\geq 10\cdot\frac{\ln(4/\delta)+\ln|\mathcal{R}|}{\varepsilon^{2}n}=10\cdot\frac{\ln(4|\mathcal{R}|/\delta)}{\varepsilon^{2}n}$ . For each $R\in\mathcal{R}$ , we apply the first part of Lemma 4.1 with parameters $\varepsilon$ and $\delta/|\mathcal{R}|$ , concluding that

$$\Pr\left(|d_{R}(X)-d_{R}(S)|\geq\varepsilon\right)\leq\frac{\delta}{|\mathcal{R}|}.$$

In the event that $|d_{R}(X)-d_{R}(S)|\leq\varepsilon$ for every $R\in\mathcal{R}$ , by definition $S$ is an $\varepsilon$ -approximation of $X$ . Taking a union bound over all $R\in\mathcal{R}$ , we conclude that the probability that this event fails is at most $|\mathcal{R}|\cdot(\delta/|\mathcal{R}|)=\delta$ , meaning that $\mathsf{BernoulliSample}$ with $p$ as above is $(\varepsilon,\delta)$ -robust.

The proof for $\mathsf{ReservoirSample}$ is identical, except that we replace the condition on $p$ with the condition that $k\geq 2\cdot\frac{\ln(2/\delta)+\ln|\mathcal{R}|}{\varepsilon^{2}}$ , and apply the second part of Lemma 4.1. ∎
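The parameter conditions used in this proof are easy to evaluate in practice. The following Python sketch computes them for a finite set system of a given size; the function names `bernoulli_rate` and `reservoir_size` are hypothetical, and capping the rate at 1 and rounding the reservoir size up are our own conventions.

```python
import math


def bernoulli_rate(num_ranges, n, eps, delta):
    """Smallest p with p >= 10 (ln|R| + ln(4/delta)) / (eps^2 n), capped at 1."""
    p = 10 * (math.log(num_ranges) + math.log(4 / delta)) / (eps ** 2 * n)
    return min(1.0, p)


def reservoir_size(num_ranges, eps, delta):
    """Smallest integer k with k >= 2 (ln|R| + ln(2/delta)) / eps^2."""
    return math.ceil(2 * (math.log(num_ranges) + math.log(2 / delta)) / eps ** 2)
```

For instance, `reservoir_size(1000, 0.1, 0.05)` evaluates to `2120`, independently of the stream length.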

It is important to note that the typical proofs given for statements of this type in the static setting (i.e., when $\mathsf{Adversary}$ submits all elements in advance, and cannot act adaptively) do not apply for our adaptive setting. Indeed, the usual proof of the static analogue of the above lemma goes along the following lines: $\mathsf{Adversary}$ chooses which elements to submit in advance, and in particular, determines the number of elements from $R$ sent, call it $n_{R}$ . Then, the number of sampled elements from $R$ is distributed according to the binomial distribution $\textsf{Bin}(n_{R},p)$ for Bernoulli sampling, and $\textsf{Bin}(n_{R},k/n)$ for reservoir sampling. One can then employ the Chernoff bound to conclude the proof. This kind of analysis crucially relies on the adversary being static.

Here, we need to deal with an adaptive adversary. Recall that $\mathsf{Adversary}$ at any given point is modeled as a probabilistic process, that given the sequence $X_{i-1}=(x_{1},\ldots,x_{i-1})$ of elements sent until now, and the current state $\sigma_{i-1}$ of $\mathsf{Sampler}$ , probabilistically decides which element $x_{i}$ to submit next. Importantly, this makes for a well-defined probability space, and allows us to analyze $\mathsf{Adversary}$ ’s behavior with probabilistic tools, specifically with concentration inequalities.

The Chernoff bound cannot be used here, as it requires the choices made by the adversary along the process to be independent of each other, which is clearly not the case. In contrast, martingale inequalities are suitable for this setting. We shall thus employ these, specifically Lemma 3.3, to prove both parts of our main result in this section.

4.1 The Bernoulli Sampling Case

We start by proving the Bernoulli sampling case (first statement of Lemma 4.1). Recall that here each element is sampled, independently, with probability $p$ . At any given point $0\leq i\leq n$ along the process, let $X_{i}=(x_{1},\ldots,x_{i})$ denote the sequence of elements submitted by the adversary until round $i$ , and let $S_{i}\subseteq X_{i}$ denote the subsequence of sampled elements from $X_{i}$ . Note that $X_{n}=X$ and $S_{n}=S$ , and hence, to prove the lemma, we need to show that $|d_{R}(X_{n})-d_{R}(S_{n})|\leq\varepsilon$ with probability at least $1-\delta$ .

As a first attempt, it might make sense to try applying a martingale concentration inequality on the sequence of random variables $(Y_{0},Y_{1},\ldots,Y_{n})$ , where we define $Y_{i}=d_{R}(X_{i})-d_{R}(S_{i})$ . Indeed, our end-goal is to bound the probability that $Y_{n}$ significantly deviates from zero. However, a straightforward calculation shows that this is not a martingale, since the condition that ${\mathbb{E}}[Y_{i}\mid Y_{0},\ldots,Y_{i-1}]=Y_{i-1}$ does not hold in general. To overcome this, we show that a slightly different formulation of the random variables at hand does yield a martingale. Given the above $R\subseteq U$ , for any $0\leq i\leq n$ we define the random variables

$$A^{R}_{i}=\frac{|R\cap X_{i}|}{n},\qquad B^{R}_{i}=\frac{|R\cap S_{i}|}{np},\qquad Z^{R}_{i}=B^{R}_{i}-A^{R}_{i},$$

where, as before, the intersection between a set $R$ and a sequence $X_{i}$ is the subsequence of $X_{i}$ consisting of all elements that also belong to $R$ .

Importantly, as is described in the next claim, the sequence of random variables $Z^{R}=(Z^{R}_{0},\ldots,Z^{R}_{n})$ defined above forms a martingale. The claim also demonstrates several useful properties of these random variables, to be used later in combination with Lemma 3.3.

Claim 4.2.

The sequence $(Z^{R}_{0},Z^{R}_{1},\ldots,Z^{R}_{n})$ is a martingale. Furthermore, the variance of $Z^{R}_{i}$ conditioned on $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ is bounded by $1/n^{2}p$ , and it always holds that $|Z^{R}_{i}-Z^{R}_{i-1}|\leq 1/np$ .

We shall prove Claim 4.2 later on; first we use it to complete the proof of the main result.

Proof of Lemma 4.1, Bernoulli sampling case.

It suffices to prove the following two inequalities for any $p$ satisfying the conditions of the lemma for the Bernoulli sampling case:

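$$\Pr\left(|A^{R}_{n}-B^{R}_{n}|\geq\frac{\varepsilon}{2}\right)\leq\frac{\delta}{2}\qquad\text{and}\qquad\Pr\left(|B^{R}_{n}-d_{R}(S_{n})|\geq\frac{\varepsilon}{2}\right)\leq\frac{\delta}{2}.\tag{2}$$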

Indeed, taking a union bound over these two inequalities, applying the triangle inequality, and observing that $A^{R}_{n}=d_{R}(X_{n})$ , we conclude that $\Pr(|d_{R}(X_{n})-d_{R}(S_{n})|\geq\varepsilon)\leq\delta$ , as desired.

The first inequality follows from Claim 4.2 and Lemma 3.3. Indeed, in view of Claim 4.2, we can apply Lemma 3.3 on $(Z^{R}_{0},\ldots,Z^{R}_{n})$ with parameters $\lambda=\varepsilon/2$ , $\sigma_{i}^{2}=1/n^{2}p$ , and $M=1/np$ . As $Z^{R}_{0}=0$ , we have $|A^{R}_{n}-B^{R}_{n}|=|Z^{R}_{n}-Z^{R}_{0}|$ , and so

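$$\Pr\left(|A^{R}_{n}-B^{R}_{n}|\geq\frac{\varepsilon}{2}\right)\leq 2\exp\left(-\frac{(\varepsilon/2)^{2}}{2\left(n\cdot\frac{1}{n^{2}p}+\frac{1}{np}\cdot\frac{\varepsilon}{6}\right)}\right)=2\exp\left(-\frac{\varepsilon^{2}np}{8+4\varepsilon/3}\right).$$

Since $8+4\varepsilon/3<10$ for $\varepsilon\leq 1$ , the right-hand side is bounded by $\delta/2$ when $np\geq\frac{10}{\varepsilon^{2}}\ln(4/\delta)$ , which is the assumption of the lemma, settling the first inequality of (2).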

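We next prove the second inequality of (2). Observe that $B^{R}_{n}=d_{R}(S_{n})\cdot\frac{|S_{n}|}{np}$ . Since each element is added to the sample with probability $p$ , independently of other elements, the size of $S_{n}$ is distributed according to the binomial distribution $\textsf{Bin}(n,p)$ , regardless of the adversary’s strategy. Applying the Chernoff bound (Theorem 3.1) with its deviation parameter set to $\varepsilon/2$ , we get that

$$\Pr\left(\left||S_{n}|-np\right|\geq\frac{\varepsilon}{2}\cdot np\right)\leq 2\exp\left(-\frac{(\varepsilon/2)^{2}\,np}{2+\varepsilon/3}\right)=2\exp\left(-\frac{\varepsilon^{2}np}{8+4\varepsilon/3}\right).$$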

This probability is bounded by $\delta/2$ provided that $np\geq\frac{10\ln(4/\delta)}{\varepsilon^{2}}$ . Conditioning on this event not occurring, we have that

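$$|B^{R}_{n}-d_{R}(S_{n})|=d_{R}(S_{n})\cdot\left|\frac{|S_{n}|}{np}-1\right|\leq\left|\frac{|S_{n}|}{np}-1\right|\leq\frac{\varepsilon}{2},$$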

where the first inequality follows from the fact that densities (in this case, $d_{R}(S_{n})$ ) are always bounded from above by one, and the second inequality follows from our conditioning. This completes the proof of the second inequality in (2). ∎

The proof of Claim 4.2 is given next.

Proof of Claim 4.2.

We first show that $(Z^{R}_{0},Z^{R}_{1},\ldots,Z^{R}_{n})$ is a martingale. Fix $1\leq i\leq n$ , and suppose that the first $i-1$ rounds of $\mathsf{AdaptiveGame}_{\varepsilon}$ have just ended (so the values of $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ are already fixed), and that $\mathsf{Adversary}$ now picks an element $x_{i}$ to submit in round $i$ of the game.

If $x_{i}\notin R$ then $A^{R}_{i}=A^{R}_{i-1}$ and $B^{R}_{i}=B^{R}_{i-1}$ and so $Z^{R}_{i}=Z^{R}_{i-1}$ , which trivially means that ${\mathbb{E}}\!\left[{Z^{R}_{i}\ |\ Z^{R}_{0},\ldots,Z^{R}_{i-1}\ ;\ x_{i}\notin R}\right]=Z^{R}_{i-1}$ as desired.

When $x_{i}\in R$ , we have

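$$A^{R}_{i}=A^{R}_{i-1}+\frac{1}{n}\qquad\text{and}\qquad B^{R}_{i}=\begin{cases}B^{R}_{i-1}+\frac{1}{np}&\text{if }x_{i}\text{ is sampled,}\\ B^{R}_{i-1}&\text{otherwise.}\end{cases}$$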

Recall that $\mathsf{Sampler}$ uses Bernoulli sampling with probability $p$ , that is, $x_{i}$ is sampled with probability $p$ (regardless of the outcome of the previous rounds). Therefore, we have that

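$${\mathbb{E}}\!\left[{Z^{R}_{i}\ |\ Z^{R}_{0},\ldots,Z^{R}_{i-1}\ ;\ x_{i}\in R}\right]=Z^{R}_{i-1}+p\cdot\frac{1}{np}-\frac{1}{n}=Z^{R}_{i-1}.$$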

The analysis of both cases $x_{i}\notin R$ and $x_{i}\in R$ implies that $E[Z^{R}_{i}|Z^{R}_{0},\ldots,Z^{R}_{i-1}]=Z^{R}_{i-1}$ , as desired.

We now turn to prove the other two statements of Claim 4.2. The maximum of the expression $|Z^{R}_{i}-Z^{R}_{i-1}|$ is $\max\{\frac{1}{n},\frac{1}{np}-\frac{1}{n}\}\leq\frac{1}{np}$ , obtained when $x_{i}\in R$ . The variance of $Z^{R}_{i}$ given $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ is zero given the additional assumption that $x_{i}\notin R$ ; assuming that $x_{i}\in R$ , the variance satisfies

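$$\mathsf{Var}(Z^{R}_{i}\ |\ Z^{R}_{0},\ldots,Z^{R}_{i-1}\ ;\ x_{i}\in R)=\mathsf{Var}(B^{R}_{i}\ |\ Z^{R}_{0},\ldots,Z^{R}_{i-1}\ ;\ x_{i}\in R)=p(1-p)\cdot\left(\frac{1}{np}\right)^{2}\leq\frac{1}{n^{2}p}.$$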

Combining both cases, we conclude that $\mathsf{Var}(Z^{R}_{i}\ |\ Z^{R}_{0},\ldots,Z^{R}_{i-1})\leq\frac{1}{n^{2}p}$ , completing the proof. ∎
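The one-step computations in this proof can be double-checked with exact rational arithmetic. The Python sketch below assumes the normalization under which every element of $R$ raises $A^{R}_{i}$ by $1/n$ and every sampled element of $R$ raises $B^{R}_{i}$ by $1/np$ , with $Z^{R}_{i}=B^{R}_{i}-A^{R}_{i}$ (this matches the increments $\frac{1}{n}$ and $\frac{1}{np}-\frac{1}{n}$ appearing in the proof); the function name is hypothetical.

```python
from fractions import Fraction


def bernoulli_increment_moments(n, p):
    """Exact mean and variance of Z_i - Z_{i-1} given x_i in R, for a Fraction rate p."""
    up = Fraction(1) / (n * p) - Fraction(1, n)   # x_i is sampled (probability p)
    down = -Fraction(1, n)                        # x_i is not sampled (probability 1 - p)
    mean = p * up + (1 - p) * down
    var = p * up ** 2 + (1 - p) * down ** 2 - mean ** 2
    return mean, var


if __name__ == "__main__":
    n, p = 100, Fraction(1, 20)
    mean, var = bernoulli_increment_moments(n, p)
    print(mean, var, var <= Fraction(1) / (n ** 2 * p))
```

The mean of the increment is exactly zero (the martingale property), and the variance equals $(1-p)/(n^{2}p)$ , which is at most the bound $1/n^{2}p$ of Claim 4.2.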

4.2 The Reservoir Sampling Case

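We continue to the proof of the second statement of Lemma 4.1, which considers reservoir sampling. At a high level, the proof goes along the same lines, except that we work with a different martingale. Specifically, for $k<i\leq n$ we define

$$A^{R}_{i}=|R\cap X_{i}|,\qquad B^{R}_{i}=\frac{i}{k}\cdot|R\cap S_{i}|=i\cdot d_{R}(S_{i}),\qquad Z^{R}_{i}=B^{R}_{i}-A^{R}_{i},$$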

whereas for $i\leq k$ we simply define $A^{R}_{i}=B^{R}_{i}=|R\cap X_{i}|$ . (This is a natural extension of the definition for $i>k$ ; specifically, in view of the definition of $B^{R}_{i}$ , note that as long as no more than $k$ elements appear in the stream, the reservoir simply keeps all of the stream’s elements.)

The following claim is the analogue of Claim 4.2 for the setting of reservoir sampling.

Claim 4.3.

The sequence $(Z^{R}_{0},Z^{R}_{1},\ldots,Z^{R}_{n})$ is a martingale. Furthermore, the variance of $Z^{R}_{i}$ conditioned on $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ is bounded by $i/k$ , and it always holds that $|Z^{R}_{i}-Z^{R}_{i-1}|\leq i/k$ .

Proof.

We follow the same kind of analysis as in Claim 4.2. Fix $i>k$ (for $i\leq k$ the claim holds trivially), and suppose that the first $i-1$ rounds have ended, so $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ are already fixed. Denote the next element that the adversary submits by $x_{i}$ . First, it is easy to verify that

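$$A^{R}_{i}=\begin{cases}A^{R}_{i-1}+1&\text{if }x_{i}\in R,\\ A^{R}_{i-1}&\text{if }x_{i}\notin R.\end{cases}$$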

The calculation of $B^{R}_{i}$ requires a more subtle case analysis. Given $B^{R}_{0},\ldots,B^{R}_{i-1}$ and $x_{i}$ , the value of $B^{R}_{i}$ is determined by three factors: (i) is $x_{i}\in R$ or not? (ii) is $x_{i}$ sampled or not? and (iii) conditioning on $x_{i}$ being sampled, does it replace an element from $R$ in the sample, or an element not in $R$ ? We separate the analysis into several cases; in cases where $x_{i}$ is sampled, we denote the element removed from the sample to make room for $x_{i}$ by $r_{i}$ .

Case 1: $x_{i}\notin R$ .

In the cases where $x_{i}$ is either not sampled, or sampled but with $r_{i}\notin R$ , elements from $R$ are neither added nor removed from the sample. That is, $R\cap S_{i}=R\cap S_{i-1}$ . Hence,

$$B^{R}_{i}=\frac{i}{k}\cdot|R\cap S_{i}|=\frac{i}{k}\cdot|R\cap S_{i-1}|=B^{R}_{i-1}+d_{R}(S_{i-1}),$$

where the first equality is by definition, and the third equality follows again by definition and since $|S_{i-1}|=k$ for $i>k$ .

It remains to consider the event where $x_{i}$ is sampled and $r_{i}\in R$ . The probability that $x_{i}$ is sampled equals $k/i$ , and conditioning on this occurring, the probability that $r_{i}$ belongs to $R$ is $d_{R}(S_{i-1})$ , so the above event holds with probability $(k/i)\cdot d_{R}(S_{i-1})$ . In this case, one element from $R$ is removed from the sample, that is, $|R\cap S_{i}|=|R\cap S_{i-1}|-1$ , and therefore

$$B^{R}_{i}=\frac{i}{k}\cdot\left(|R\cap S_{i-1}|-1\right)=B^{R}_{i-1}+d_{R}(S_{i-1})-\frac{i}{k}.$$

Thus, conditioned on $x_{i}\notin R$ , the expectation of $B^{R}_{i}$ is

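$${\mathbb{E}}\!\left[{B^{R}_{i}\ |\ B^{R}_{0},\ldots,B^{R}_{i-1}\ ;\ x_{i}\notin R}\right]=B^{R}_{i-1}+d_{R}(S_{i-1})-\frac{k}{i}\cdot d_{R}(S_{i-1})\cdot\frac{i}{k}=B^{R}_{i-1}.$$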

Since $A^{R}_{i}=A^{R}_{i-1}$ when $x_{i}\notin R$ , we deduce that

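$${\mathbb{E}}\!\left[{Z^{R}_{i}\ |\ Z^{R}_{0},\ldots,Z^{R}_{i-1}\ ;\ x_{i}\notin R}\right]=B^{R}_{i-1}-A^{R}_{i-1}=Z^{R}_{i-1}.$$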

Case 2: $x_{i}\in R$ .

Similarly, whenever $|R\cap S_{i}|=|R\cap S_{i-1}|$ we have that $B^{R}_{i}=B^{R}_{i-1}+d_{R}(S_{i-1})$ . The only case where this does not hold is when $x_{i}$ is sampled and $r_{i}\notin R$ , which has probability $(k/i)\cdot(1-d_{R}(S_{i-1}))$ . In this case, $|R\cap S_{i}|=|R\cap S_{i-1}|+1$ , implying that

$$B^{R}_{i}=\frac{i}{k}\cdot\left(|R\cap S_{i-1}|+1\right)=B^{R}_{i-1}+d_{R}(S_{i-1})+\frac{i}{k}.$$

Combining these two we get, conditioned on $x_{i}\in R$ , that the expectation of $B^{R}_{i}$ is

[TABLE]

Finally, since $A^{R}_{i}=A^{R}_{i-1}+1$ when $x_{i}\in R$ , we have that

[TABLE]

The analysis of these two cases implies that $(Z^{R}_{0},\ldots,Z^{R}_{n})$ is indeed a martingale.
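The two cases can be condensed into a short calculation. The sketch below assumes the definitions $A^{R}_{i}=|R\cap X_{i}|$ , $B^{R}_{i}=i\cdot d_{R}(S_{i})$ and $Z^{R}_{i}=B^{R}_{i}-A^{R}_{i}$ (they are not restated in this proof, so they should be read as assumptions), writes $d=d_{R}(S_{i-1})$ for brevity, and takes all expectations conditioned on the first $i-1$ rounds.

$$
\begin{aligned}
x_{i}\notin R:\quad & \mathbb{E}[B^{R}_{i}] = \Bigl(1-\tfrac{k}{i}d\Bigr)\bigl(B^{R}_{i-1}+d\bigr) + \tfrac{k}{i}d\,\Bigl(B^{R}_{i-1}+d-\tfrac{i}{k}\Bigr) = B^{R}_{i-1}, \\
x_{i}\in R:\quad & \mathbb{E}[B^{R}_{i}] = \Bigl(1-\tfrac{k}{i}(1-d)\Bigr)\bigl(B^{R}_{i-1}+d\bigr) + \tfrac{k}{i}(1-d)\Bigl(B^{R}_{i-1}+d+\tfrac{i}{k}\Bigr) = B^{R}_{i-1}+1.
\end{aligned}
$$

In the first case $A^{R}_{i}$ is unchanged and in the second it grows by one, so under these assumptions the conditional expectation of $Z^{R}_{i}$ equals $Z^{R}_{i-1}$ in both cases.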

It remains to obtain the bounds on the difference $|Z^{R}_{i}-Z^{R}_{i-1}|$ and the variance of $Z^{R}_{i}$ given $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ . This follows rather easily as a byproduct of the above analysis (and the fact that the density $d_{R}$ is always bounded between zero and one). When $x_{i}\notin R$ , we know from the analysis that $A^{R}_{i}=A^{R}_{i-1}$ and $B^{R}_{i-1}-i/k\leq B^{R}_{i}\leq B^{R}_{i-1}+1$ , whereas if $x_{i}\in R$ , we have $A^{R}_{i}=A^{R}_{i-1}+1$ and $B^{R}_{i-1}\leq B^{R}_{i}\leq B^{R}_{i-1}+1+i/k$ . In both cases, we conclude that $|Z^{R}_{i}-Z^{R}_{i-1}|\leq i/k$ .
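The increment bound can also be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the definitions $A^{R}_{i}=|R\cap X_{i}|$ , $B^{R}_{i}=i\cdot d_{R}(S_{i})$ and $Z^{R}_{i}=B^{R}_{i}-A^{R}_{i}$ , and the function name `reservoir_z_values`, the test stream and the range $R=\{x<30\}$ are illustrative choices. Exact rational arithmetic is used so that the comparison with $i/k$ is not affected by rounding.

```python
import random
from fractions import Fraction

def reservoir_z_values(stream, k, in_range, rng):
    """Run reservoir sampling with memory k over `stream` and return [Z_0, ..., Z_n].

    Assumed definitions (not restated in this proof): A_i = |R ∩ X_i|,
    B_i = i * d_R(S_i), Z_i = B_i - A_i. `in_range(x)` tests membership in R.
    """
    sample = []
    a = 0                      # A_i: number of stream elements in R so far
    z = [Fraction(0)]          # Z_0 = 0
    for i, x in enumerate(stream, start=1):
        if i <= k:
            sample.append(x)                 # the first k elements are always kept
        elif rng.random() < k / i:           # x_i is sampled with probability k/i
            sample[rng.randrange(k)] = x     # evict a uniformly random slot
        a += 1 if in_range(x) else 0
        hits = sum(1 for y in sample if in_range(y))
        b = Fraction(i * hits, len(sample))  # B_i = i * d_R(S_i), kept exact
        z.append(b - a)
    return z

rng = random.Random(0)
k = 20
stream = [rng.randrange(100) for _ in range(500)]
z = reservoir_z_values(stream, k, lambda x: x < 30, rng)
# Check |Z_i - Z_{i-1}| <= i/k at every round i > k.
print(all(abs(z[i] - z[i - 1]) <= Fraction(i, k) for i in range(k + 1, len(z))))
```

The stream here is fixed in advance, but the bound is deterministic, so it should hold equally for an adaptively chosen stream.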

We next bound the variance of $Z^{R}_{i}$ conditioned on the values of $Z^{R}_{0},\ldots,Z^{R}_{i-1}$ (the analysis also implicitly conditions on the value $d_{R}(S_{i-1})$ ; the bound we shall eventually derive holds regardless of this value). We start with the case that $x_{i}\notin R$ , and revisit Case 1 above: with probability $(k/i)\cdot d_{R}(S_{i-1})$ , the value of $Z^{R}_{i}$ is smaller than its expectation by $i/k-d_{R}(S_{i-1})$ ; and otherwise (with probability $1-(k/i)\cdot d_{R}(S_{i-1})$ ), the value of $Z^{R}_{i}$ is larger than its expectation by $d_{R}(S_{i-1})$ . Thus, we have that

[TABLE]

We next address the case where $x_{i}\in R$ , which corresponds to Case 2 above. Here, with probability $(k/i)\cdot(1-d_{R}(S_{i-1}))$ , the value of $Z^{R}_{i}$ is larger than its conditional expectation by $i/k+d_{R}(S_{i-1})-1$ ; otherwise, $Z^{R}_{i}$ is smaller than the expectation by $1-d_{R}(S_{i-1})$ . Thus,

[TABLE]

As the conditional variance is always bounded by $i/k$ , the bound remains intact if we remove the conditioning on the value of $d_{R}(S_{i-1})$ and the predicate assessing whether $x_{i}\in R$ or not. In other words, $\mathsf{Var}(Z^{R}_{i}|Z^{R}_{0},\ldots,Z^{R}_{i-1})\leq i/k$ , completing the proof. ∎
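For completeness, the two variance computations can be carried out explicitly. Writing $d=d_{R}(S_{i-1})$ and using only the probabilities and deviations listed in the proof, one possible way to fill in the algebra is the following (the identity $(k/i)\,d\cdot(i/k)=d$ does most of the work):

$$
\begin{aligned}
x_{i}\notin R:\quad & \tfrac{k}{i}d\,\Bigl(\tfrac{i}{k}-d\Bigr)^{2} + \Bigl(1-\tfrac{k}{i}d\Bigr)d^{2} = d\,\Bigl(\tfrac{i}{k}-d\Bigr) \leq \tfrac{i}{k}, \\
x_{i}\in R:\quad & \tfrac{k}{i}(1-d)\Bigl(\tfrac{i}{k}-(1-d)\Bigr)^{2} + \Bigl(1-\tfrac{k}{i}(1-d)\Bigr)(1-d)^{2} = (1-d)\Bigl(\tfrac{i}{k}-(1-d)\Bigr) \leq \tfrac{i}{k}.
\end{aligned}
$$

Both final inequalities use $0\leq d\leq 1<i/k$ .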

The proof of the second part of Lemma 4.1 now follows from the last claim.

Proof of Lemma 4.1, reservoir sampling case.

Observe that

[TABLE]

In view of Claim 4.3, we apply Lemma 3.3 to the martingale $Z^{R}=(Z^{R}_{0},\ldots,Z^{R}_{n})$ with $\lambda=\varepsilon n$ , $\sigma^{2}_{i}=i/k$ for any $i>k$ (for $i\leq k$ , we can set $\sigma^{2}_{i}=0$ ), and $M=n/k$ . We get that

[TABLE]

where the second inequality holds for $n\geq 2$ . Therefore, it suffices to require $k\geq\frac{2}{\varepsilon^{2}}\ln\left(\frac{2}{\delta}\right)$ to get the bound $\Pr(|d_{R}(X)-d_{R}(S)|\geq\varepsilon)\leq\delta$ . ∎
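As a quick numerical illustration of this requirement, the sketch below computes the smallest integer $k$ satisfying $k\geq\frac{2}{\varepsilon^{2}}\ln\left(\frac{2}{\delta}\right)$ for a single fixed range $R$ . The function name is hypothetical, and the failure bound $2\exp(-\varepsilon^{2}k/2)$ printed at the end is an assumption: it is the form of tail bound that this requirement on $k$ corresponds to.

```python
import math

def reservoir_size_for_one_range(eps, delta):
    """Smallest integer k with k >= (2 / eps^2) * ln(2 / delta)."""
    return math.ceil(2.0 / eps**2 * math.log(2.0 / delta))

k = reservoir_size_for_one_range(0.1, 0.05)
print(k)                                 # 738
print(2 * math.exp(-0.1**2 * k / 2))     # assumed failure bound; at most delta = 0.05
```

Note that this is the size for one range only; handling all ranges of $\mathcal{R}$ simultaneously is where the additional $\ln|\mathcal{R}|$ term enters.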

5 An Adaptive Attack on Sampling

In this section, we present our lower bounds. Specifically, we show that the sample size cannot depend solely on the VC-dimension; rather, a dependence on the cardinality of the set system is necessary. This is done by describing a set system $(U,\mathcal{R})$ with large $|U|$ and VC-dimension one, together with a strategy for the adversary that makes the sampled set unrepresentative with respect to $(U,\mathcal{R})$ . That is, with high probability the sampled set will not be an $\varepsilon$ -approximation of $(U,\mathcal{R})$ . This is in contrast to the static setting, where the same sample size suffices to obtain an $\varepsilon$ -approximation with high probability. Moreover, in the case of the $\mathsf{BernoulliSample}$ algorithm, the sampled set under attack is extremely unrepresentative, consisting precisely of the $k$ smallest elements in the stream (where $k$ is the total sample size at the end of the stream).

Proof of Theorem 1.3.

Set the universe to be the well-ordered set $U=\{1,2,\ldots,N\}$ for an arbitrary $N$ satisfying $n^{6\ln n}\leq N\leq 2^{n/2}$ , and let $\mathcal{R}=\{[1,b]:b\in U\}$ . Clearly, $(U,\mathcal{R})$ has VC-dimension 1. $\mathsf{Adversary}$ ’s strategy (for both sampling algorithms $\mathsf{BernoulliSample}$ and $\mathsf{ReservoirSample}$ ) is described in Figure 3.

Let $S$ denote the subsequence of elements sampled by the algorithm $\mathsf{BernoulliSample}$ along the stream. The expected size of $S$ is $np\leq np^{\prime}$ , and it follows from Markov's inequality (see e.g. [AS16], Appendix A) that $\Pr(|S|\geq 2np^{\prime})\leq 1/2$ (in fact the probability is much smaller, by the Chernoff bound, but we will not need the stronger bound). From here on, we condition on the complementary event: we assume that $|S|<2np^{\prime}$ . The next claim asserts that for $S$ of this size, $\mathsf{Adversary}$ ’s strategy does not fail, in the sense that it never runs out of elements (i.e., $a_{i}<b_{i}$ for all $i\in[n]$ ).

Claim 5.1.

If $|S|<2np^{\prime}$ then $b_{i}-a_{i}\geq n$ for any $i\in[n]$ .

Proof.

For any $i\in[n]$ , set $\ell_{i}=b_{i}-a_{i}$ . We prove by induction that $\ell_{i}\geq n$ . If $x_{i}$ is sampled, then we have that $\ell_{i+1}\geq p^{\prime}\ell_{i}$ and otherwise we have that $\ell_{i+1}=(1-p^{\prime})\ell_{i}-2\geq(1-2p^{\prime})\ell_{i}$ , where the inequality follows from the induction assumption. Since $|S|<2np^{\prime}$ , we get that

[TABLE]

where the third inequality holds since $3p\geq\ln(\frac{1}{1-2p})$ for small enough $p>0$ , and the last inequality follows since $p^{\prime}\leq\frac{\ln N}{6n\ln n}$ and $p^{\prime}\geq\ln n/n$ , which means that

[TABLE]

This proves the induction step, and completes the proof of the claim. ∎

The last claim means that if $|S|<2np^{\prime}$ , then the attack in Figure 3 successfully generates a stream of $n$ elements. We now show that the sampled set is not an $\varepsilon$ -approximation. We begin by analyzing the $\mathsf{BernoulliSample}$ algorithm.

Claim 5.2.

Consider $\mathsf{Adversary}$ ’s attack on $\mathsf{BernoulliSample}$ described in Figure 3. At round $i$ of the game,

• All elements that were previously submitted by $\mathsf{Adversary}$ and sampled are no bigger than $a_{i}$ .

• All elements that were previously submitted but not sampled are no smaller than $b_{i}$ .

• The element submitted during round $i$ is between $a_{i}$ and $b_{i}$ .

Proof.

By induction, where the base case $i=1$ is trivial. Suppose that the claim holds for the first $i-1$ rounds; we now prove it for round $i$ . By definition of the attack, and from Claim 5.1 it holds that $a_{i-1}\leq a_{i}<b_{i}\leq b_{i-1}$ and so any of the elements $x_{j}$ for $j<i-1$ satisfies the desired condition, by the induction assumption. It remains to address the case where $j=i-1$ . If $x_{i-1}$ was sampled, then the attack sets $a_{i}=x_{i-1}$ , that is, $x_{i-1}$ is a sampled element and satisfies $x_{i-1}\leq a_{i}$ . Otherwise, the attack sets $b_{i}=x_{i-1}$ and so $x_{i-1}$ is a non-sampled element and satisfies $x_{i-1}\geq b_{i}$ . Finally, $a_{i}<x_{i}<b_{i}$ always holds. Thus, the three desired conditions are retained. ∎

As the last claim shows, all sampled elements are smaller than all non-sampled ones at any point along the stream. This, of course, suffices for the sampled set to not be an $\varepsilon$ -approximation of $(U,\mathcal{R})$ . Denote the sampled set by $S$ , and let $s$ be the maximal element in $S$ (if $S$ is empty, we are done). Consider now the range $[1,s]\in\mathcal{R}$ : its density in the sampled set is $1$ , namely, $d_{[1,s]}(S)=1$ , while its density in the stream is $d_{[1,s]}(X)=|S|/n$ . To summarize,

[TABLE]

Altogether, the attack does not fail provided that $|S|<2np^{\prime}$ , which holds with probability at least $1/2$ . Thus, $\mathsf{BernoulliSample}$ with parameter $p$ as in the theorem’s statement is not $(\varepsilon,\delta)$ -robust.
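The attack on $\mathsf{BernoulliSample}$ can be simulated in a few lines. Figure 3 is not reproduced here, so the sketch below rests on assumptions: the adversary keeps an open interval $(a,b)$ of unused elements, submits a point roughly a $(1-p^{\prime})$ fraction of the way from $a$ to $b$ , and then sets $a$ to that point if it was sampled and $b$ to it otherwise. The update of $a$ and $b$ follows the proof of Claim 5.2; the exact choice of the submitted point, the function name `attack_bernoulli`, and the parameters $n=4000$ , $p=p^{\prime}=1/100$ are illustrative and not tuned to the hypotheses of the theorem.

```python
import random
from fractions import Fraction

def attack_bernoulli(n, p, p_prime, universe_size, rng):
    """Play the adaptive game against Bernoulli sampling with rate p.

    Returns (sampled, rejected): the submitted elements that were / were not
    sampled, in submission order. The rule for choosing x is an assumption.
    """
    a, b = 0, universe_size + 1      # unused elements lie strictly between a and b
    sampled, rejected = [], []
    for _ in range(n):
        if b - a < 2:
            raise RuntimeError("adversary ran out of elements")
        # Submit a point roughly (1 - p') of the way from a to b (exact integer math).
        x = a + (b - a) * (p_prime.denominator - p_prime.numerator) // p_prime.denominator
        x = min(max(x, a + 1), b - 1)
        if rng.random() < p:         # the sampler's coin; the adversary sees the outcome
            sampled.append(x)
            a = x                    # all later elements will be larger than x
        else:
            rejected.append(x)
            b = x                    # all later elements will be smaller than x
    return sampled, rejected

n, p = 4000, 0.01
sampled, rejected = attack_bernoulli(n, p, Fraction(1, 100), 2 ** (n // 2), random.Random(0))
if sampled and rejected:
    # Density of [1, s] is 1 in the sample and |S|/n in the stream, for s = max(sampled).
    print(max(sampled) < min(rejected), 1 - len(sampled) / n)
```

Because a sampled element raises the lower end of the interval and a rejected one lowers the upper end, every sampled element ends up below every rejected one, which is exactly the separation used in the argument above.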

The analysis of the $\mathsf{ReservoirSample}$ algorithm is very similar. Recall that $k$ denotes the sample size, and let $k^{\prime}$ be the total number of elements that were sampled during the reservoir sampling process. That is, $k^{\prime}$ also counts sampled elements that were evicted at a later iteration. We bound $k^{\prime}$ as follows: since the element in round $i>k$ is sampled with probability $k/i$ , we have ${\mathbb{E}}\!\left[{k^{\prime}}\right]=k+\sum_{i=k+1}^{n}k/i\leq 2k\ln n$ . Again, Markov's inequality shows that with probability at least $1/2$ , we have $k^{\prime}\leq 4k\ln n$ . Using the previous analysis, we know that these $k^{\prime}$ elements are the smallest elements in the stream. The sample set $S$ consists of some $k$ elements among these $k^{\prime}$ elements (in other words, the sample set is not necessarily the set of $k$ smallest elements, but it is still a subset of the $k^{\prime}$ smallest elements). Thus, taking the interval $[1,s]$ where $s$ is the maximal element among the $k^{\prime}$ elements, we have that the density of $[1,s]$ in the sample is $d_{[1,s]}(S)=\frac{k}{k}=1$ . On the other hand, the density of $[1,s]$ in the stream is

[TABLE]

Together, we conclude that

[TABLE]

meaning that $\mathsf{ReservoirSample}$ with $k$ as in the statement of the theorem is not $(\varepsilon,\delta)$ -robust. ∎
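The bound on the expected number of elements that ever enter the reservoir is easy to evaluate numerically. The sketch below assumes, as recalled in Section 6, that the element of round $i>k$ is sampled with probability exactly $k/i$ , so that the expectation is $k+\sum_{i=k+1}^{n}k/i$ ; the function name and the values of $n$ and $k$ are illustrative.

```python
import math

def expected_total_sampled(n, k):
    """k + sum_{i=k+1}^{n} k/i: expected number of elements ever placed in the reservoir."""
    return k + sum(k / i for i in range(k + 1, n + 1))

n, k = 100_000, 100
# Compare the exact expectation with the bound 2 k ln n used in the proof.
print(expected_total_sampled(n, k), 2 * k * math.log(n))
```

The sum behaves like $k\ln(n/k)$ , so the bound $2k\ln n$ is comfortable rather than tight.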

6 Continuous Robustness

In this section, we prove that the $\mathsf{ReservoirSample}$ algorithm is $(\varepsilon,\delta)$ -continuously robust against static and adaptive adversaries. Recall that a sampling algorithm is $(\varepsilon,\delta)$ -continuously robust if the following holds with probability at least $1-\delta$ : at any point throughout the stream, the current sample held by $\mathsf{Sampler}$ is an $\varepsilon$ -approximation of the current stream (i.e., of the set of all elements submitted by $\mathsf{Adversary}$ until now).

With this definition in hand, $\mathsf{BernoulliSample}$ cannot possibly be continuously robust in general (even in the static setting).⁴ We thus restrict our discussion to $\mathsf{ReservoirSample}$ from here on, and turn to the proof of Theorem 1.4. The proof examines $O(\varepsilon^{-1}\ln n)$ carefully picked points along the stream, applying Theorem 1.2 at each of the points. It then shows that if the sample is a good approximation of the stream at all of these points, then continuous robustness is guaranteed with high probability.

⁴ To see this, consider any set system $(U,\mathcal{R})$ where $\mathcal{R}$ contains a singleton $\{u\}$ for some $u\in U$ , and let $u$ be the first element of the stream. With probability $1-p$ this element is not sampled, so the density of $\{u\}$ in the sample at that point is $0$ , while its density in the stream is $1$ . This violates the $\varepsilon$ -approximation requirement (unless $p\geq 1-\delta$ ).

Proof of Theorem 1.4.

We provide the proof for the setting of an adaptive adversary. The proof for the static setting is essentially identical, with the only difference being that, instead of making black-box applications of Theorem 1.2, we apply its static analogue. Recall that the bound in the static analogue is of the form $\Theta\left(\frac{d+\ln 1/\delta}{\varepsilon^{2}}\right)$ , compared to the $\Theta\left(\frac{\ln|\mathcal{R}|+\ln 1/\delta}{\varepsilon^{2}}\right)$ bound appearing in the statement of Theorem 1.2.

Let $(U,\mathcal{R})$ , $n$ , $\varepsilon$ , $\delta$ be as in the statement of the theorem. As a warmup, let us analyze a simple yet non-optimal proof based on a naïve union bound. Denote the stream and sample after $i$ rounds by $X_{i}$ and $S_{i}$ , respectively. Consider for a moment the first $i$ rounds of the game as a “standalone” game where the stream length is $i$ . Applying the second part of Theorem 1.2 with parameters $(U,\mathcal{R}),\varepsilon,\delta^{\prime},i$ , where $\delta^{\prime}=\delta/n$ , we get that if the memory size $k$ of $\mathsf{ReservoirSample}$ satisfies

[TABLE]

then regardless of $\mathsf{Adversary}$ ’s strategy,

[TABLE]

Taking a union bound, the probability that $S_{i}$ is an $\varepsilon$ -approximation of $X_{i}$ for all $i\in[n]$ is at least $1-n\cdot(\delta/n)=1-\delta$ . Thus, it follows that $\mathsf{ReservoirSample}$ whose parameter $k$ satisfies the condition of (3) is $(\varepsilon,\delta)$ -continuously robust.

We now continue to the proof of the improved bound, appearing in the statement of the theorem. The proof is also based, at its core, on a union bound argument, albeit a more efficient one. The key idea is to take a sparse set of “checkpoints” $i_{1},\ldots,i_{t}$ along the stream, where $i_{j+1}=(1+\Theta(\varepsilon))i_{j}$ , and apply Theorem 1.2 at each of the times $i_{1},\ldots,i_{t}$ to make sure the sample is an $(\varepsilon/4)$ -approximation of the stream at each of these times. Finally, we show that with high probability, for any $j\in[t-1]$ , the approximation is preserved (the approximation factor might become slightly worse, but no worse than $\varepsilon$ ) in the “gap” between the neighboring checkpoints $i_{j}$ and $i_{j+1}$ .

For this, we first need the following simple claims.

Claim 6.1.

Let $T,T^{\prime}$ be two sequences of length $k$ over $U$ , which differ in up to $v$ values. Then $|d_{R}(T)-d_{R}(T^{\prime})|\leq v/k$ for any $R\subseteq U$ . In particular, if $T$ is an $\alpha$ -approximation of some sequence $X\supseteq T,T^{\prime}$ , then $T^{\prime}$ is an $(\alpha+v/k)$ -approximation of $X$ .

Proof.

For any subset $R\subseteq U$ we have $-v\leq|R\cap T|-|R\cap T^{\prime}|\leq v$ . Dividing by $k=|T|=|T^{\prime}|$ , and recalling that $d_{R}(T)=|R\cap T|/|T|$ and $d_{R}(T^{\prime})=|R\cap T^{\prime}|/|T^{\prime}|$ , we conclude that $-v/k\leq d_{R}(T)-d_{R}(T^{\prime})\leq v/k$ , that is, $|d_{R}(T)-d_{R}(T^{\prime})|\leq v/k$ . To prove the second part, note that

[TABLE]

for any $R\subseteq U$ . ∎
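Claim 6.1 is simple enough to check on a concrete instance. The sketch below is illustrative only: the helper `density`, the universe $\{0,\ldots,99\}$ and the range $R=\{x<40\}$ are assumptions made for the example, and exact fractions are used so that the boundary case $|d_{R}(T)-d_{R}(T^{\prime})|=v/k$ is compared correctly.

```python
import random
from fractions import Fraction

def density(seq, in_range):
    """d_R(seq): fraction of positions of seq whose element lies in R."""
    return Fraction(sum(1 for x in seq if in_range(x)), len(seq))

rng = random.Random(0)
k, v = 50, 7
t = [rng.randrange(100) for _ in range(k)]
t_prime = list(t)
for pos in rng.sample(range(k), v):   # overwrite v positions: T and T' differ in at most v values
    t_prime[pos] = rng.randrange(100)

in_r = lambda x: x < 40
print(abs(density(t, in_r) - density(t_prime, in_r)) <= Fraction(v, k))
```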

Claim 6.2.

Suppose that $T\subseteq X\subseteq X^{\prime}$ are three sequences over $U$ , where $T$ is an $\alpha$ -approximation of $X$ , and $|X^{\prime}|\leq(1+\beta)|X|$ . Then $T$ is an $(\alpha+\beta)$ -approximation of $X^{\prime}$ .

Proof.

For any subset $R\subseteq U$ , we have that $|R\cap X|\leq|R\cap X^{\prime}|\leq|R\cap X|+\beta|X|$ . We also know that $|d_{R}(T)-d_{R}(X)|\leq\alpha$ , since $T$ is an $\alpha$ -approximation of $X$ . On the one hand, it follows that

[TABLE]

On the other hand,

[TABLE]

As these inequalities hold for any $R\subseteq U$ , the claim follows. ∎
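One way to fill in the two estimates, using only the facts stated in the proof together with $|X|\leq|X^{\prime}|\leq(1+\beta)|X|$ and $\frac{1}{1+\beta}\geq 1-\beta$ , is sketched below; the intermediate steps are a reconstruction and may differ from the authors' own derivation.

$$
\begin{aligned}
d_{R}(X^{\prime}) &= \frac{|R\cap X^{\prime}|}{|X^{\prime}|} \leq \frac{|R\cap X|+\beta|X|}{|X|} = d_{R}(X)+\beta \leq d_{R}(T)+\alpha+\beta, \\
d_{R}(X^{\prime}) &\geq \frac{|R\cap X|}{(1+\beta)|X|} = \frac{d_{R}(X)}{1+\beta} \geq (1-\beta)\,d_{R}(X) \geq d_{R}(X)-\beta \geq d_{R}(T)-\alpha-\beta.
\end{aligned}
$$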

As a consequence of the above two claims, we get the following useful claim. (Recall that for any $i\in[n]$ , the sample and stream after $i$ rounds are denoted by $S_{i}$ and $X_{i}$ , respectively.)

Claim 6.3.

Consider $\mathsf{ReservoirSample}$ with memory size $k$ , and suppose that exactly $v$ elements were sampled in rounds $l+1,l+2,\ldots,m$ of the game, where $k\leq l<m\leq(1+\beta)l$ . If $S_{l}$ is an $\alpha$ -approximation of $X_{l}$ , then $S_{m}$ is an $(\alpha+\beta+v/k)$ -approximation of $X_{m}$ .

Proof.

By Claim 6.2, $S_{l}$ is an $(\alpha+\beta)$ -approximation of $X_{m}$ . As $S_{m}$ differs from $S_{l}$ by at most $v$ elements, we conclude from Claim 6.1 that $S_{m}$ is an $(\alpha+\beta+v/k)$ -approximation of $X_{m}$ . ∎

The last claim equips us with an approach to ensure continuous robustness, which is more efficient compared to the simple union bound approach. Suppose that there exists a set of integers $k=i_{1}<i_{2}<\ldots<i_{t}=n$ satisfying the following for any $j\in[t-1]$ .

1. $S_{i_{j}}$ is an $\alpha$ -approximation of $X_{i_{j}}$ , where $\alpha=\varepsilon/4$ .

2. $i_{j+1}\leq(1+\beta)i_{j}$ , where $\beta=\varepsilon/4$ .

3. The number of elements sampled in rounds $i_{j}+1,i_{j}+2,\ldots,i_{j+1}$ is bounded by $v=\varepsilon k/2$ .

We claim that the above three conditions suffice to ensure that $S_{i}$ is an $\varepsilon$ -approximation of $X_{i}$ for any $i\in[n]$ . Indeed, for $i\leq k$ , $S_{i}=X_{i}$ is trivially an $\varepsilon$ -approximation. When $i>k$ , consider the maximum $j<t$ for which $i_{j}\leq i$ , and apply Claim 6.3 with $l=i_{j}$ , $m=i$ , and $\alpha,\beta,v$ as dictated above. Since $\alpha+\beta+v/k=\varepsilon$ , the claim implies that $S_{i}$ is an $\varepsilon$ -approximation of $X_{i}$ , as desired.

Specifically, given $k$ satisfying the assumption of Theorem 1.4, we pick $i_{1},i_{2},\ldots,i_{t}$ recursively as follows: we start with $i_{1}=k$ ; and given $i_{j}$ we set $i_{j+1}\leq n$ as the largest integer satisfying $i_{j+1}\leq(1+\beta)i_{j}=(1+\varepsilon/4)i_{j}$ . It is not hard to verify that $i_{j}=k\cdot(1+\Theta(\varepsilon))^{j-1}$ (this implicitly relies on the fact that $k\geq 4/\varepsilon$ , ensured by the assumption of the theorem). Note that $t=O(\log_{1+\varepsilon}n)=O(\varepsilon^{-1}\ln n)$ . We next show that for this choice of $i_{1},\ldots,i_{t}$ , the above three conditions are satisfied simultaneously for all $j\in[t-1]$ with probability at least $1-\delta$ . This will conclude the proof.
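The recursive choice of checkpoints can be sketched as follows. This is a minimal illustration under the stated rule ( $i_{1}=k$ , and $i_{j+1}$ the largest integer at most $\min\{n,(1+\varepsilon/4)i_{j}\}$ ); the function name `checkpoints` and the parameter values are hypothetical, and $\varepsilon$ is passed as an exact fraction to avoid rounding in the floor.

```python
from fractions import Fraction
from math import floor

def checkpoints(n, k, eps):
    """Return i_1 = k < i_2 < ... < i_t = n with i_{j+1} <= (1 + eps/4) * i_j."""
    beta = Fraction(eps) / 4
    if k > n or k * beta < 1:
        raise ValueError("need k <= n and k >= 4/eps")
    points = [k]
    while points[-1] < n:
        # Largest integer not exceeding (1 + beta) * i_j, capped at n.
        points.append(min(n, floor((1 + beta) * points[-1])))
    return points

pts = checkpoints(n=1_000_000, k=1_000, eps=Fraction(1, 10))
print(len(pts))   # the number t of checkpoints
```

The requirement $k\geq 4/\varepsilon$ is what guarantees progress: it makes $\beta i_{j}\geq 1$ , so each checkpoint is strictly larger than the previous one.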

For the first condition, apply Theorem 1.2 for any $j\in[t-1]$ with parameters $(U,\mathcal{R}),\varepsilon/4,\delta^{\prime},i_{j}$ where $\delta^{\prime}=\delta/2t$ , concluding that if the memory size $k$ satisfies

[TABLE]

then for any $j\in[t-1]$ ,

[TABLE]

Taking a union bound, with probability at least $1-\delta/2$ the first condition holds for all $j\in[t-1]$ .

The second condition, regarding the boundedness of $i_{j+1}$ as a function of $i_{j}$ , holds trivially (and deterministically) for our choice of $i_{1}\leq i_{2}\leq\ldots\leq i_{t}$ .

Finally, it remains to address the third condition. For any $j\in[t-1]$ , let $A_{j}$ denote the total number of sampled elements in rounds $i_{j}+1,i_{j}+2,\ldots,i_{j+1}$ of the game. Note that each such $A_{j}$ is a random variable. We wish to show that

[TABLE]

Indeed, if (4) is true for every $j\in[t-1]$ , then by a union bound the probability that the third condition holds for all $j$ is at least $1-\delta/2$ , which (in combination with our analysis of the other two conditions) completes the proof. Thus, it remains to prove (4).

Recall that the probability of an element to be sampled in round $i$ is exactly $k/i$ , and that $i_{j+1}\leq(1+\varepsilon/4)i_{j}$ . Hence, $A_{j}$ is a sum of up to $\lfloor\varepsilon i_{j}/4\rfloor$ independent indicator random variables, one per round, each equal to one with probability less than $k/i_{j}$ . In particular, the mean of $A_{j}$ is less than $(\varepsilon i_{j}/4)\cdot(k/i_{j})=\varepsilon k/4$ . From the Chernoff bound (Theorem 3.1), we get the desired bound:

[TABLE]

where the last inequality holds for $k\geq c\cdot\varepsilon^{-1}(\ln 1/\delta+\ln 1/\varepsilon+\ln\ln n)$ , for a sufficiently large constant $c>0$ ; note that $k$ in the theorem’s statement indeed satisfies this inequality. ∎
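The bound on the mean of $A_{j}$ can be verified exactly for a given gap. The sketch below sums the sampling probabilities $k/i$ over the rounds $i_{j}+1,\ldots,i_{j+1}$ ; the function name and the concrete values of $k$ , $\varepsilon$ and $i_{j}$ are illustrative assumptions.

```python
from fractions import Fraction

def expected_sampled_in_gap(k, i_j, i_next):
    """Exact mean of A_j: sum of k/i over rounds i_j + 1, ..., i_next."""
    return sum(Fraction(k, i) for i in range(i_j + 1, i_next + 1))

k, eps = 1_000, Fraction(1, 10)
i_j = 20_000
i_next = int((1 + eps / 4) * i_j)       # 20_500, the farthest allowed next checkpoint
mean = expected_sampled_in_gap(k, i_j, i_next)
print(float(mean), float(eps * k / 4))  # the mean stays below eps * k / 4 = 25
```

The threshold $v=\varepsilon k/2$ in the third condition is twice this upper bound on the mean, which is what leaves room for a Chernoff-type tail estimate.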

Acknowledgments

We are grateful to Moni Naor for suggesting the study of streaming algorithms in the adversarial setting and for helpful and informative discussions about it. We additionally thank Noga Alon, Nati Linial, and Ohad Shamir for invaluable comments and suggestions for the paper.

Bibliography

[AS16] Noga Alon and Joel H. Spencer. The Probabilistic Method. Wiley Publishing, 4th edition, 2016.
[BCEG07] Amitabha Bagchi, Amitabh Chaudhary, David Eppstein, and Michael T. Goodrich. Deterministic sampling and range counting in geometric data streams. ACM Transactions on Algorithms, 3(2):16, 2007.
[BCM+13] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD, pages 387–402, 2013.
[BOV15] Vladimir Braverman, Rafail Ostrovsky, and Gregory Vorsanger. Weighted sampling without replacement from data streams. Information Processing Letters, 115(12):923–926, 2015.
[BR18] Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317–331, 2018.
[CBM18] Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of evasion adversaries. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS, pages 228–239, 2018.
[CDK+11] Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM Journal on Computing, 40(5):1402–1431, 2011.
[CEM+96] Kenneth L. Clarkson, David Eppstein, Gary L. Miller, Carl Sturtivant, and Shang-Hua Teng. Approximating center points with iterative radon points. International Journal of Computational Geometry and Applications, 6(3):357–377, 1996.
