Optimal Random Sampling from Distributed Streams Revisited

Srikanta Tirthapura; David P. Woodruff

arXiv:1903.12065·cs.DC·March 29, 2019

TL;DR

This paper presents an improved algorithm for distributed random sampling that reduces communication and computation costs, achieving optimal message complexity and also enhancing heavy hitter detection across multiple sites.

Contribution

It introduces a new algorithm for distributed sampling that improves efficiency and provides a matching lower bound, also advancing heavy hitter detection methods.

Findings

01

Reduced total messages sent compared to prior algorithms

02

Achieved asymptotic optimality in message complexity

03

Enhanced heavy hitter detection across distributed sites

Abstract

We give an improved algorithm for drawing a random sample from a large data stream when the input elements are distributed across multiple sites which communicate via a central coordinator. At any point in time the set of elements held by the coordinator represents a uniform random sample from the set of all the elements observed so far. When compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system as well as the computation required of the coordinator. We also present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. As a byproduct, we obtain an improved algorithm for finding the heavy hitters across multiple distributed sites.

Equations

$$O\left(\frac{k\log(n/s)}{\log(1+(k/s))}\right)$$

$$E[\xi]\leq\frac{\log(n/s)}{\log r}+2$$

$$\Pr[\xi\geq z+\ell]\leq\Pr\left[u\leq\left(\frac{s}{n}\right)\frac{1}{r^{\ell}}\right]$$

$$\Pr[\xi\geq z+\ell]\leq\Pr[Y\geq s]\leq\Pr[Y\geq r^{\ell}E[Y]]\leq\frac{1}{r^{\ell}}$$

$$E[\xi]$$

$$\mu_{i}=k+2X_{i}$$

$$\mu=\sum_{j=0}^{\xi-1}\mu_{j}=\xi k+2\sum_{j=0}^{\xi-1}X_{j}$$

$$Y_{j}=\begin{cases}0&\text{if }w_{j}\geq m_{i}\\1&\text{if }m_{i}/r<w_{j}<m_{i}\\2&\text{if }w_{j}\leq m_{i}/r\end{cases}$$

$$X_{i}<Y$$

$$Y(\alpha)=\sum_{j=1}^{\tau(\alpha)}Y_{j}(\alpha)$$

$$E\left[\sum_{n=1}^{\theta}Z_{n}\right]=E[\theta]\,E[Z_{1}]$$

$$E[Y(\alpha)]=(r+1)s$$

$$E[Y(\alpha)]=E\left[\sum_{j=1}^{\tau(\alpha)}Y_{j}(\alpha)\right]$$

$$E[Y(\alpha)]=E[\tau(\alpha)]\,E[Y_{1}(\alpha)]$$

$$E[Y_{1}(\alpha)]=0\cdot(1-\alpha)+1\cdot(\alpha-\alpha/r)+2\cdot\alpha/r=\alpha(1+1/r)$$

$$E[X_{i}]\leq(r+1)s$$

$$E[\mu]\leq(k+2(r+1)s)\left(\frac{\log(n/s)}{\log r}+2\right)$$

$$E[\mu_{i}]\leq k+2(r+1)s=k+2s+2rs$$

$$\mu=\sum_{i=1}^{\infty}\mu_{i}\,\mathcal{I}\{\xi>(i-1)\}$$

$$\begin{aligned}E[\mu]&\leq(k+2s+2rs)\sum_{i=1}^{\infty}\Pr[\xi>(i-1)]\\&=(k+2s+2rs)\,E[\xi]\\&\leq(k+2s+2rs)\left(\frac{\log(n/s)}{\log r}+2\right)\end{aligned}$$

$$E[\mu]\leq(k+12s)\left(\frac{\log(n/s)}{\log 2}\right)\leq 20s\log\left(\frac{n}{s}\right)=O\left(s\log\left(\frac{n}{s}\right)\right)$$

$$E[\mu]=O\left(\frac{k\log\left(\frac{n}{s}\right)}{\log\left(\frac{k}{s}\right)}\right).$$

$$\Pr[X<{\bf E}[X]/2]\leq\exp(-{\bf E}[X]/8)\leq\exp(-(\ln(n/s))/8)\leq\left(\frac{s}{n}\right)^{1/8}.$$

$$\Pr[Y_{i}<{\bf E}[Y_{i}]/2]\leq\exp(-{\bf E}[Y_{i}]/8)\leq\exp(-(s\ln\beta+O(1))/8)\leq\frac{1}{\beta^{s/8}}.$$

$$\Pr_{\tau,\{P_{i}\},\{\sigma_{i}\}}[E]\geq q.$$

$$\Pr_{\tau,\{P_{i}\},\{\sigma_{i}\}}[F]\geq 1-q/2.$$

$$\Pr_{\{\sigma_{i}\}}\left[E\mid\tau=\tau^{\prime},(P_{0},P_{1},\ldots,P_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})\right]\geq q-q/2=q/2.$$

$$\Pr_{\{\sigma_{i}\}}\left[i^{*}\text{ is balanced}\mid\tau=\tau^{\prime},(P_{0},P_{1},\ldots,P_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})\right]\geq q/2-q/4=q/4.$$

$$\Pr_{\sigma_{i^{*}}}\left[i^{*}\text{ is balanced}\mid\tau=\tau^{\prime},(P_{0},P_{1},\ldots,P_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})\right]\geq q/4.$$

Taxonomy

Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Data Stream Mining Techniques

Full text

Optimal Random Sampling from Distributed Streams Revisited

Footnote 1: This writeup is a revised version of a paper with the same title and authors, which appeared in the Proceedings of the International Conference on Distributed Computing (DISC) 2011. It corrects an error in the proof of the upper bound on message complexity (Section 4). The proofs on pages 9, 10, and 11 (excluding Theorem 2) have been rewritten relative to the DISC 2011 version. None of the main theorem statements (Theorems 2, 3, 4) have changed from the DISC 2011 version.

Footnote 2: We thank Rajesh Jayaram for pointing out the error in the conference version.

Srikanta Tirthapura, Iowa State University, [email protected]

David P. Woodruff, Carnegie Mellon University, [email protected]

1 Introduction

For many data analysis tasks, it is impractical to collect all the data at a single site and process it in a centralized manner. For example, data arrives at multiple network routers at extremely high rates, and queries are often posed on the union of data observed at all the routers. Since the data set is changing, the query results could also be changing continuously with time. This has motivated the continuous, distributed, streaming model [8]. In this model there are $k$ physically distributed sites receiving high-volume local streams of data. These sites talk to a central coordinator, who has to continuously respond to queries over the union of all streams observed so far. The challenge is to minimize the communication between the different sites and the coordinator, while providing an accurate answer to queries at the coordinator at all times.

A fundamental problem in this setting is to obtain a random sample drawn from the union of all distributed streams. This generalizes the classic reservoir sampling problem (see, e.g., [15], where the algorithm is attributed to Waterman; see also [19]) to the setting of multiple distributed streams, and has applications to approximate query answering, selectivity estimation, and query planning. For example, in the case of network routers, maintaining a random sample from the union of the streams is valuable for network monitoring tasks involving the detection of global properties [13]. Other problems on distributed stream processing, including the estimation of the number of distinct elements [7, 8] and heavy hitters [4, 14, 17, 21], use random sampling as a primitive.

The study of sampling in distributed streams was initiated by Cormode et al. [9]. Consider a set of $k$ different streams observed by the $k$ sites, with the total number of current items in the union of all streams equal to $n$. The authors of [9] show how $k$ sites can maintain a random sample of $s$ items without replacement from the union of their streams using an expected $O((k+s)\log n)$ messages between the sites and the central coordinator. The memory requirement of the central coordinator is $s$ machine words, and the time requirement is $O((k+s)\log n)$. The memory requirement of the remote sites is a single machine word, with constant time per stream update. Cormode et al. also prove that the expected number of messages sent in any scheme is $\Omega(k+s\log(n/s))$. Each message is assumed to be a single machine word, which can hold an integer of magnitude $(kns)^{O(1)}$.

Notation. All logarithms are to the base 2 unless otherwise specified. Throughout the paper, when we use asymptotic notation, the variable that is going to infinity is $n$ , and $s$ and $k$ are functions of $n$ .

1.1 Our Results

Our main contribution is an algorithm for sampling without replacement from distributed streams, as well as a matching lower bound showing that the message complexity of our algorithm is optimal. A summary of our results and a comparison with earlier work is shown in Figure 1.

New Algorithm: We present an algorithm which uses an expected

$$O\left(\frac{k\log(n/s)}{\log(1+(k/s))}\right)$$

number of messages for continuously maintaining a random sample of size $s$ from $k$ distributed data streams of total size $n$ . Notice that if $s<k/8$ , this number is $O\left(\frac{k\log(n/s)}{\log(k/s)}\right)$ , while if $s\geq k/8$ , this number is $O(s\log(n/s))$ .

The memory requirement in our protocol at the central coordinator is $s$ machine words, and the time requirement is $O\left(\frac{k\log(n/s)}{\log(1+k/s)}\right)$. The former is the same as that in the protocol of [9], while the latter improves their $O((k+s)\log n)$ time requirement. The remote sites in our scheme store a single machine word and use constant time per stream update, which is clearly optimal.

Our result leads to a significant improvement in the message complexity in the case when $k$ is large. For example, for the basic problem of maintaining a single random sample from the union of distributed streams ( $s=1$ ), our algorithm leads to a factor of $O(\log k)$ decrease in the number of messages sent in the system over the algorithm in [9].

Our algorithm is simple, and only requires the central coordinator to communicate with a site if the site initiates the communication. This is useful in a setting where a site may go offline, since it does not require the ability of a site to receive broadcast messages.

Lower Bound: We also show that for any constant $q>0$, any correct protocol must send $\Omega\left(\frac{k\log(n/s)}{\log(1+(k/s))}\right)$ messages with probability at least $1-q$. This also yields a bound of $\Omega\left(\frac{k\log(n/s)}{\log(1+(k/s))}\right)$ on the expected message complexity of any correct protocol, showing that the expected number of messages sent by our algorithm is optimal, up to constant factors.

In addition to being quantitatively stronger than the lower bound of [9], our lower bound is also qualitatively stronger, because the lower bound in [9] is on the expected number of messages transmitted in a correct protocol. However, this does not rule out the possibility that with large probability, far fewer messages are sent in the optimal protocol. In contrast, we lower bound the number of messages that must be transmitted in any protocol $99\%$ of the time. Since the time complexity of the central coordinator is at least the number of messages received, the time complexity of our protocol is also optimal.

Sampling With Replacement. We also show how to modify our protocol to obtain a random sample of $s$ items from $k$ distributed streams with replacement. Here we achieve a protocol with $O\left(\left(\frac{k}{\log(2+(k/(s\log s)))}+s\log s\right)\log n\right)$ messages, improving the $O((k+s\log s)\log n)$ -message protocol of [9]. We obtain the same improvement in the time complexity of the central coordinator.

Heavy-Hitters. As a corollary, we obtain a protocol for estimating the heavy hitters in distributed streams with the best known message complexity. In this problem we would like to find a set $H$ of items so that if an element $e$ occurs at least an $\varepsilon$ fraction of times in the union of the streams, then $e\in H$, and if $e$ occurs less than an $\varepsilon/2$ fraction of times in the union of the streams, then $e\notin H$. It is known that $O(\varepsilon^{-2}\log n)$ random samples suffice to estimate the set of heavy hitters with high probability, and the previous best algorithm [9] was obtained by plugging $s=O(\varepsilon^{-2}\log n)$ into a protocol for distributed sampling. We thus improve the message complexity from $O((k+\varepsilon^{-2}\log n)\log n)$ to $O\left(\frac{k\log(\varepsilon n)}{\log(\varepsilon k)}+\varepsilon^{-2}\log(\varepsilon n)\log n\right)$. This can be significant when $k$ is large compared to $1/\varepsilon$.
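As a rough illustration of how a uniform sample can be turned into a heavy-hitter estimate, the sketch below reports every item whose share of the sample reaches a cutoff lying between $\varepsilon/2$ and $\varepsilon$. The function name `heavy_hitters_from_sample` and the specific cutoff $3\varepsilon/4$ are assumptions made for this example; the paper does not fix either.

```python
from collections import Counter


def heavy_hitters_from_sample(sample, eps):
    """Items whose share of the sample is at least 3*eps/4 (an assumed cutoff)."""
    counts = Counter(sample)
    cutoff = 0.75 * eps * len(sample)
    return {item for item, c in counts.items() if c >= cutoff}


# A toy sample of 100 items: 'a' and 'b' are frequent, 'c' is rare.
toy_sample = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + list(range(15))
frequent = heavy_hitters_from_sample(toy_sample, eps=0.2)
```

In the distributed setting the `sample` argument would be the set ${\cal P}$ held by the coordinator; the quality of the estimate then depends on the sample size, as stated above.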

1.2 Related Work

In addition to work discussed above, other research in the continuous distributed streaming model includes estimating frequency moments and counting the number of distinct elements [7, 8], and estimating the entropy [2]. The reservoir sampling technique has been used extensively in large scale data mining applications, see for example [10, 16, 1]. Stream sampling under sliding windows has been considered in [6, 3]. Deterministic algorithms for heavy-hitters over distributed streams, and corresponding lower bounds were considered in [21].

Stream sampling under sliding windows over distributed streams has been considered in [9]. Their algorithm for sliding windows is already optimal up to lower-order additive terms (see Theorems 4.1 and 4.2 in [9]). Hence our improved results for the non-sliding window case do not translate into an improvement for the case of sliding windows.

A related model of distributed streams was considered in [11, 12]. In this model, the coordinator was not required to continuously maintain an estimate of the required aggregate, but when the query was posed to the coordinator, the sites would be contacted and the query result would be constructed. In their model, the coordinator could be said to be “reactive”, whereas in the model considered in this paper, the coordinator is “pro-active”.

Roadmap: We first present the model and problem definition in Section 2, and then the algorithm followed by a proof of correctness in Section 3. The analysis of message complexity and the lower bound are presented in Sections 4 and 5 respectively, followed by an algorithm for sampling with replacement in Section 6.

2 Model

Consider a system with $k$ different sites, numbered from $1$ to $k$, each receiving a local stream of elements. Let ${\cal S}_{i}$ denote the stream observed at site $i$. There is one “coordinator” node, which is different from any of the sites. The coordinator does not observe a local stream, but all queries for a random sample arrive at the coordinator. Let ${\cal S}=\cup_{i=1}^{k}{\cal S}_{i}$ be the entire stream observed by the system, and let $n=|{\cal S}|$. The sample size $s$ is a parameter supplied to the coordinator and to the sites during initialization.

The task of the coordinator is to continuously maintain a random sample ${\cal P}$ of size $\min\{n,s\}$ consisting of elements chosen uniformly at random without replacement from ${\cal S}$ . The cost of the protocol is the number of messages transmitted.

We assume a synchronous communication model, where the system progresses in “rounds”. In each round, each site can observe one element (or none), and send a message to the coordinator, and receive a response from the coordinator. The coordinator may receive up to $k$ messages in a round, and respond to each of them in the same round. This model is essentially identical to the model assumed in previous work [9]. Later we discuss how to handle the case of a site observing multiple elements per round.

The sizes of the different local streams at the sites, their order of arrival, and the interleaving of the streams at different sites, can all be arbitrary. The algorithm cannot make any assumption about these.

3 Algorithm

The idea of the algorithm is as follows. Each site associates a random “weight” with each element that it receives. The coordinator then maintains the set ${\cal P}$ of the $s$ elements with the minimum weights in the union of the streams at all times, and this is a random sample of ${\cal S}$. This idea is similar in spirit to centralized reservoir sampling algorithms. In a distributed setting, the interesting question is when the sites communicate with the coordinator, and vice versa.

In our algorithm, the coordinator maintains $u$ , which is the $s$ -th smallest weight so far in the system, as well as the sample ${\cal P}$ , consisting of all the elements that have weight no more than $u$ . Each site need only maintain a single value $u_{i}$ , which is the site’s view of the $s$ -th smallest weight in the system so far. Note that it is too expensive to keep the view of each site synchronized with the coordinator’s view at all times – to see this, note that the value of the $s$ -th smallest weight changes $O(s\log(n/s))$ times, and updating every site each time the $s$ -th minimum changes takes a total of $O(sk\log(n/s))$ messages.

In our algorithm, when site $i$ sees an element with a weight smaller than $u_{i}$, it sends it to the central coordinator. The coordinator updates $u$ and ${\cal P}$, if needed, and then replies to $i$ with the current value of $u$, which is the true $s$-th smallest weight in the union of all streams. Thus each time a site communicates with the coordinator, it either makes a change to the random sample, or, at least, gets to refresh its view of $u$.

The algorithm at each site is described in Algorithms 1 and 2. The algorithm at the coordinator is described in Algorithm 3.
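Algorithms 1 to 3 are not reproduced in this text, so the following Python sketch is only one possible reading of the prose description above, not the authors' listing. The class names `Site` and `Coordinator`, the helper `run`, and the use of a heap are assumptions of this example; the distributed streams are simulated by assigning each arriving element to a random site.

```python
import heapq
import random


class Coordinator:
    """Keeps the s smallest-weight (weight, element) pairs it has received."""

    def __init__(self, s):
        self.s = s
        self.u = 1.0       # s-th smallest weight so far (1 until s elements arrive)
        self._heap = []    # max-heap emulated with negated weights: (-w, e)

    def receive(self, element, weight):
        """Handle one message from a site and return the reply, the current u."""
        heapq.heappush(self._heap, (-weight, element))
        if len(self._heap) > self.s:
            heapq.heappop(self._heap)      # drop the pair with the largest weight
        if len(self._heap) == self.s:
            self.u = -self._heap[0][0]     # largest kept weight = s-th smallest
        return self.u

    def sample(self):
        return sorted((-neg_w, e) for neg_w, e in self._heap)


class Site:
    def __init__(self, coordinator, rng):
        self.coordinator = coordinator
        self.rng = rng
        self.u_i = 1.0     # this site's possibly stale view of u
        self.messages = 0  # messages sent or received by this site

    def observe(self, element):
        weight = self.rng.random()         # uniform random weight
        if weight < self.u_i:              # only then contact the coordinator
            self.u_i = self.coordinator.receive(element, weight)
            self.messages += 2             # one message in each direction
        return weight


def run(k, s, n, seed=0):
    """Feed n elements to k sites in a random interleaving."""
    rng = random.Random(seed)
    coord = Coordinator(s)
    sites = [Site(coord, rng) for _ in range(k)]
    weights = []
    for t in range(n):
        site = sites[rng.randrange(k)]
        weights.append((site.observe(t), t))
    return coord, sites, weights
```

Note that the coordinator never initiates communication in this sketch: it only replies to the site that contacted it, matching the property highlighted in Section 1.1.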

3.1 Correctness

The following two lemmas establish the correctness of the algorithm.

Lemma 1

Let $n$ be the number of elements in ${\cal S}$ so far. (1) If $n\leq s$, then the set ${\cal P}$ at the coordinator contains all the $(e,w)$ pairs seen at all the sites so far. (2) If $n>s$, then ${\cal P}$ at the coordinator consists of the $s$ pairs $(e,w)$ with the smallest weights in the stream so far.

**Proof: **

The variable $u$ is stored at the coordinator, and $u_{i}$ is stored at site $i$ . First we note that the variables $u$ and $u_{i}$ are non-increasing with time; this can be verified from the algorithms.

Next, we note that for every $i$ from $1$ to $k$, at every round, $u_{i}\geq u$. This can be seen because initially, $u_{i}=u=1$, and $u_{i}$ changes only in response to receiving $u$ from the coordinator.

Thus, if fewer than $s$ elements have appeared in the stream so far, $u$ is $1$ , and hence $u_{i}$ is also $1$ for each site $i$ . The next element observed in the system is also sent to the coordinator. Thus, if $n\leq s$ , then the set ${\cal P}$ consists of all elements seen so far in the system.

Next, we consider $n>s$ . Note that $u$ maintains the $s$ -th smallest weight seen at the coordinator, and ${\cal P}$ consists of the $s$ elements seen at the coordinator with the smallest weights. We only have to show that if an element $e$ , observed at site $i$ is such that $w(e)<u$ then $i$ must have sent $(e,w(e))$ to the coordinator. This follows because $u_{i}\geq u$ at all times, and if $w(e)<u$ , then it must be true that $w(e)<u_{i}$ , and in this case, $(e,w(e))$ is sent to the coordinator.

Lemma 2

At the end of each round, sample ${\cal P}$ at the coordinator consists of a uniform random sample of size $\min\{n,s\}$ chosen without replacement from ${\cal S}$ .

**Proof: **

In case $n\leq s$, then from Lemma 1, we know that ${\cal P}$ contains every element of ${\cal S}$. In case $n>s$, from Lemma 1, it follows that ${\cal P}$ consists of the $s$ elements with the smallest weights from ${\cal S}$. Since the weights are assigned randomly, each element in ${\cal S}$ has a probability of $\frac{s}{n}$ of belonging to ${\cal P}$, showing that this is a uniform random sample. Since an element can appear no more than once in the sample, this is a sample chosen without replacement.
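The uniformity argument can be checked exhaustively for small parameters. The sketch below relies on the observation that only the relative order of the weights matters, and that for independent continuous weights every ordering is equally likely; the function name `sample_counts` is an assumption of this example.

```python
from collections import Counter
from itertools import permutations


def sample_counts(n, s):
    """Count how often each s-subset holds the s smallest weights.

    Every one of the n! rank assignments is enumerated once, so equal
    counts mean that every s-subset is equally likely to be the sample.
    """
    counts = Counter()
    for ranks in permutations(range(n)):   # ranks[e] = rank of element e's weight
        chosen = frozenset(e for e in range(n) if ranks[e] < s)
        counts[chosen] += 1
    return counts


counts = sample_counts(n=5, s=2)
```

For $n=5$ and $s=2$ all $\binom{5}{2}$ subsets receive the same count, and each element belongs to the sample in a $2/5$ fraction of the orderings.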

4 Analysis of the Algorithm (Upper Bound)

We now analyze the message complexity of the maintenance of a random sample.

For the sake of analysis, we divide the execution of the algorithm into “epochs”, where each epoch consists of a sequence of rounds. The epochs are defined inductively. Let $r>1$ be a parameter, which will be fixed later. Recall that $u$ is the $s$-th smallest weight so far in the system (if there are fewer than $s$ elements so far, $u=1$). Epoch $0$ is the set of all rounds from the beginning of execution until (and including) the earliest round where $u$ is $\frac{1}{r}$ or smaller. Let $m_{i}$ denote the value of $u$ at the end of epoch $i-1$. Then epoch $i$ consists of all rounds subsequent to epoch $i-1$ until (and including) the earliest round when $u$ is $\frac{m_{i}}{r}$ or smaller. Note that the algorithm does not need to be aware of the epochs, and this is only used for the analysis.
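The epoch definition can be made concrete with a small helper that splits a recorded sequence of $u$ values into epochs. The function name `count_epochs` and the convention of recording one value of $u$ per round are assumptions of this sketch; a trailing epoch that has not yet ended is counted as well.

```python
def count_epochs(u_trace, r):
    """Number of epochs covered by a trace of u values, one per round.

    u_trace[t] is the coordinator's u at the end of round t (non-increasing,
    starting from 1). The first epoch ends at the first round with u <= 1/r;
    an epoch that starts with threshold m ends at the first round with
    u <= m/r.
    """
    epochs = 0
    m = 1.0               # value of u when the current epoch started
    open_epoch = False    # True while some round belongs to an unfinished epoch
    for u in u_trace:
        open_epoch = True
        if u <= m / r:    # this round closes the current epoch
            epochs += 1
            m = u
            open_epoch = False
    return epochs + (1 if open_epoch else 0)


trace = [1, 1, 0.6, 0.5, 0.4, 0.3, 0.2, 0.2]
num_epochs = count_epochs(trace, r=2)
```

With $r=2$, the trace above closes one epoch when $u$ reaches $0.5$, a second when $u$ reaches $0.2\leq 0.25$, and leaves a third epoch open.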

We refer to the original distributed algorithm described in Algorithms 3 and 2 as Algorithm $A$. For the analysis, we consider a slightly different distributed algorithm, Algorithm $B$, described below. Algorithm $B$ is identical to Algorithm $A$ except for the fact that at the beginning of each epoch, the value $u$ is broadcast by the coordinator to all sites.

While Algorithm $A$ is natural, Algorithm $B$ is easier to analyze. We first note that on the same inputs, the value of $u$ (and ${\cal P}$ ) at the coordinator at any round in Algorithm $B$ is identical to the value of $u$ (and ${\cal P}$ ) at the coordinator in Algorithm $A$ at the same round. Hence, the partitioning of rounds into epochs is the same for both algorithms, for a given input. The correctness of Algorithm $B$ follows from the correctness of Algorithm $A$ . The only difference between them is in the total number of messages sent. In $B$ we have the property that for all $i$ from $1$ to $k$ , $u_{i}=u$ at the beginning of each epoch (though this is not necessarily true throughout the epoch), and for this, $B$ has to pay a cost of at least $k$ messages in each epoch.

Lemma 3

The number of messages sent by Algorithm $A$ for a set of input streams ${\cal S}_{j},j=1\ldots k$ is never more than twice the number of messages sent by Algorithm $B$ for the same input.

**Proof: **

Consider site $v$ in a particular epoch $i$ . In Algorithm $B$ , $v$ receives $m_{i}$ at the beginning of the epoch through a message from the coordinator. In Algorithm $A$ , $v$ may not know $m_{i}$ at the beginning of epoch $i$ . We consider two cases.

Case I: $v$ sends a message to the coordinator in epoch $i$ in Algorithm $A$ . In this case, the first time $v$ sends a message to the coordinator in this epoch, $v$ will receive the current value of $u$ , which is smaller than or equal to $m_{i}$ . This communication costs two messages, one in each direction. Henceforth, in this epoch, the number of messages sent in Algorithm $A$ is no more than those sent in $B$ . In this epoch, the number of messages transmitted to/from $v$ in $A$ is at most twice the number of messages as in $B$ , which has at least one transmission from the coordinator to site $v$ .

Case II: $v$ did not send a message to the coordinator in this epoch, in Algorithm $A$ . In this case, the number of messages sent in this epoch to/from site $v$ in Algorithm $A$ is smaller than in Algorithm $B$ .
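Lemma 3 can also be observed empirically. The following sketch runs Algorithms $A$ and $B$ on one shared random input and counts the messages of each, under the simplifying assumption that exactly one element arrives per round. The function name `simulate` and the bookkeeping with two arrays of site views are assumptions of this example, not part of the paper.

```python
import bisect
import random


def simulate(k, s, n, r, seed=0):
    """Message counts (A, B) for Algorithms A and B on one shared random input."""
    rng = random.Random(seed)
    kept = []                  # sorted weights of the current sample
    u = 1.0                    # coordinator's s-th smallest weight
    m = 1.0                    # value of u at the start of the current epoch
    view_a = [1.0] * k         # u_i at each site under Algorithm A
    view_b = [1.0] * k         # u_i at each site under Algorithm B
    msgs_a = msgs_b = 0
    new_epoch = True
    for _ in range(n):
        if new_epoch:          # B only: the coordinator broadcasts u to all sites
            msgs_b += k
            view_b = [u] * k
            new_epoch = False
        site, w = rng.randrange(k), rng.random()
        send_a = w < view_a[site]
        send_b = w < view_b[site]
        if w < u:              # then both algorithms send, so u evolves identically
            bisect.insort(kept, w)
            del kept[s:]       # keep only the s smallest weights
            if len(kept) == s:
                u = kept[-1]
        if send_a:             # element goes up, refreshed u comes back
            msgs_a += 2
            view_a[site] = u
        if send_b:
            msgs_b += 2
            view_b[site] = u
        if u <= m / r:         # this round closes the epoch
            m = u
            new_epoch = True
    return msgs_a, msgs_b


msgs_a, msgs_b = simulate(k=20, s=3, n=3000, r=2, seed=0)
```

Because both algorithms see the same weights, the coordinator state is shared in the simulation, which mirrors the observation that $u$ and ${\cal P}$ evolve identically in $A$ and $B$.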

Let $\xi$ denote the total number of epochs.

Lemma 4

If $r\geq 2$ ,

$$E[\xi]\leq\frac{\log(n/s)}{\log r}+2$$

**Proof: **

Let $z=\left(\frac{\log(n/s)}{\log r}\right)$ . First, we note that in each epoch, $u$ decreases by a factor of at least $r$ . Thus after $(z+\ell)$ epochs, $u$ is no more than $\frac{1}{r^{z+\ell}}=(\frac{s}{n})\frac{1}{r^{\ell}}$ . Thus, we have

$$\Pr[\xi\geq z+\ell]\leq\Pr\left[u\leq\left(\frac{s}{n}\right)\frac{1}{r^{\ell}}\right]$$

Let $Y$ denote the number of elements (out of $n$) that have been assigned a weight of $\frac{s}{nr^{\ell}}$ or less. $Y$ is a binomial random variable with expectation $\frac{s}{r^{\ell}}$. Note that if $u\leq\frac{s}{nr^{\ell}}$, it must be true that $Y\geq s$.

$$\Pr[\xi\geq z+\ell]\leq\Pr[Y\geq s]\leq\Pr[Y\geq r^{\ell}E[Y]]\leq\frac{1}{r^{\ell}}$$

where we have used Markov’s inequality.

Since $\xi$ takes only positive integral values,

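$$E[\xi]=\sum_{i\geq 1}\Pr[\xi\geq i]\leq z+\sum_{\ell\geq 1}\Pr[\xi\geq z+\ell]\leq z+\sum_{\ell\geq 1}\frac{1}{r^{\ell}}\leq z+\frac{1}{1-1/r}\leq z+2$$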

where we have assumed $r\geq 2$ .

Let $\mu$ denote the total number of messages sent during the entire execution. Let $\mu_{i}$ denote the total number of messages sent in epoch $i$ . Let $X_{i}$ denote the number of messages sent from the sites to the coordinator in epoch $i$ . $\mu_{i}$ is the sum of two parts, (1) $k$ messages sent by the coordinator at the start of the epoch, and (2) two times the number of messages sent from the sites to the coordinator.

$$\mu_{i}=k+2X_{i}\qquad(1)$$

$$\mu=\sum_{j=0}^{\xi-1}\mu_{j}=\xi k+2\sum_{j=0}^{\xi-1}X_{j}$$

For epoch $i$ , consider the stochastic process $\mathcal{Y}=\{Y_{j}:j\geq 1\}$ . For each $j$ , choose a random number $w_{j}$ uniformly from $(0,1)$ .

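$$Y_{j}=\begin{cases}0&\text{if }w_{j}\geq m_{i}\\1&\text{if }m_{i}/r<w_{j}<m_{i}\\2&\text{if }w_{j}\leq m_{i}/r\end{cases}$$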

Let $\tau$ denote the smallest time $t$ such that there are at least $s$ elements of $\{Y_{1},Y_{2},\ldots,Y_{t}\}$ that are equal to $2$ . Let $Y=\sum_{j=1}^{\tau}Y_{j}$ .

Lemma 5

$$X_{i}<Y$$

**Proof: **

Consider the correspondence between the $j$ th element received in epoch $i$ and $Y_{j}$ . Each time a message is sent upon receiving the $j$ th element, it must be true that $Y_{j}\geq 1$ , since the random weight chosen $w_{j}$ must be less than $m_{i}$ for a message to be sent (note that the threshold could be stricter than $m_{i}$ ). Further, the number of elements in this epoch is less than or equal to $\tau$ , since by the time $s$ elements are seen, each with a weight less than $m_{i}/r$ , the epoch would have ended (it may have ended earlier).

Consider the conditional random variables $Y_{j}(\alpha)=(Y_{j}|m_{i}=\alpha)$ , $Y(\alpha)=(Y|m_{i}=\alpha)$ , and $\tau(\alpha)=(\tau|m_{i}=\alpha)$ .

$$Y(\alpha)=\sum_{j=1}^{\tau(\alpha)}Y_{j}(\alpha)$$

Definition 1

Let $\mathcal{Z}=\{Z_{n}:n\geq 1\}$ be a stochastic process. A stopping time $\theta$ with respect to $\mathcal{Z}$ is a random time such that for each $n\geq 0$ , the event $\{\theta=n\}$ is completely determined by the total information known up to time $n$ , i.e. $\{Z_{1},Z_{2},\ldots,Z_{n}\}$ .

Theorem 1 (Wald’s Equation)

If $\theta$ is a stopping time with respect to an i.i.d. sequence $\{Z_{n}:n\geq 1\}$ and if $E[\theta]<\infty$ and $E[|Z_{1}|]<\infty$, then

$$E\left[\sum_{n=1}^{\theta}Z_{n}\right]=E[\theta]\,E[Z_{1}]$$

Lemma 6

$$E[Y(\alpha)]=(r+1)s$$

**Proof: **

$$E[Y(\alpha)]=E\left[\sum_{j=1}^{\tau(\alpha)}Y_{j}(\alpha)\right]$$

Note that the different $Y_{j}(\alpha)$ are independent and identically distributed since each $w_{j}$ is chosen independently from the same distribution. Further, $\tau(\alpha)$ is a stopping time for $\mathcal{Y}$, since for $n\geq 1$, the event $\tau(\alpha)=n$ can be determined by looking at the information up to time $n$, i.e. $\{Y_{1},Y_{2},\ldots,Y_{n}\}$, and checking the number of $Y_{j}$s, for $j\leq n$, that are equal to $2$.

Further, we note that $E[\tau(\alpha)]$ is finite, and $E[Y_{j}]$ is also finite. Using Wald’s equation (Theorem 1), we get

$$E[Y(\alpha)]=E[\tau(\alpha)]\,E[Y_{1}(\alpha)]$$

Note that $\tau(\alpha)$ is the number of trials until $s$ successes, where the probability of a success is $\alpha/r$ . Hence, $\tau(\alpha)$ is the sum of $s$ geometric random variables each with a parameter of $\alpha/r$ , and $E[\tau(\alpha)]=sr/\alpha$ .

From the definition of $Y_{j}$ and conditioning on $m_{i}=\alpha$ , we have $Y_{j}=0$ with probability $1-\alpha$ , $1$ with probability $\alpha-\alpha/r$ and $2$ with probability $\alpha/r$ . Hence:

$$E[Y_{1}(\alpha)]=0\cdot(1-\alpha)+1\cdot(\alpha-\alpha/r)+2\cdot\alpha/r=\alpha(1+1/r)$$

Combining the above, the proof is complete.
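Multiplying the two expectations gives $E[Y(\alpha)]=\frac{sr}{\alpha}\cdot\alpha\left(1+\frac{1}{r}\right)=(r+1)s$, independent of $\alpha$. As a sanity check, the sketch below estimates $E[Y(\alpha)]$ by simulation. The function names `draw_Y` and `mean_Y` and the chosen parameter values are assumptions of this example.

```python
import random


def draw_Y(alpha, r, s, rng):
    """One draw of Y(alpha): add up Y_j until s of them are equal to 2."""
    total, twos = 0, 0
    while twos < s:
        w = rng.random()           # weight w_j, uniform in [0, 1)
        if w >= alpha:
            y = 0
        elif w > alpha / r:
            y = 1
        else:
            y = 2
            twos += 1
        total += y
    return total


def mean_Y(alpha, r, s, runs, seed=0):
    """Monte Carlo estimate of E[Y(alpha)] over the given number of runs."""
    rng = random.Random(seed)
    return sum(draw_Y(alpha, r, s, rng) for _ in range(runs)) / runs


estimate = mean_Y(alpha=0.5, r=4, s=5, runs=4000, seed=1)
target = (4 + 1) * 5               # (r + 1) * s
```

Repeating the experiment with a different $\alpha$ should leave the estimate close to $(r+1)s$, since $\alpha$ cancels in the product.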

Lemma 7

$$E[X_{i}]\leq(r+1)s$$

**Proof: **

We have $E[Y]=E[E[Y|(m_{i}=\alpha)]]=E[(r+1)s]=(r+1)s$ , where we used Lemma 6. Using Lemma 5, the proof is complete.

Lemma 8

$$E[\mu]\leq(k+2(r+1)s)\left(\frac{\log(n/s)}{\log r}+2\right)$$

**Proof: **

Using Lemma 7 and Equation 1, we get the expected number of messages in epoch $i$ :

$$E[\mu_{i}]\leq k+2(r+1)s=k+2s+2rs$$

Let $\mathcal{I}\{\xi>i\}$ denote the indicator random variable that is $1$ when $\xi>i$ and $0$ otherwise. The total number of messages can be written as follows.

$$\mu=\sum_{i=1}^{\infty}\mu_{i}\,\mathcal{I}\{\xi>(i-1)\}$$

Since $\mu_{i}$ is independent of the event $\xi>(i-1)$ , we have:

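$$\begin{aligned}E[\mu]&\leq(k+2s+2rs)\sum_{i=1}^{\infty}\Pr[\xi>(i-1)]\\&=(k+2s+2rs)\,E[\xi]\\&\leq(k+2s+2rs)\left(\frac{\log(n/s)}{\log r}+2\right)\end{aligned}$$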

where we have used Lemma 4 for an upper bound on the expected number of epochs.

Theorem 2

The expected message complexity $E[\mu]$ of our algorithm is as follows.

I: If $s\geq\frac{k}{8}$, then $E[\mu]=O\left(s\log\left(\frac{n}{s}\right)\right)$.

II: If $s<\frac{k}{8}$, then $E[\mu]=O\left(\frac{k\log\left(\frac{n}{s}\right)}{\log\left(\frac{k}{s}\right)}\right)$.

**Proof: **

We note that the upper bounds on $E[\mu]$ in Lemma 8 hold for any value of $r\geq 2$ .

Case I: $s\geq\frac{k}{8}$ . In this case, we set $r=2$ . From Lemma 8,

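$$E[\mu]\leq(k+12s)\left(\frac{\log(n/s)}{\log 2}\right)\leq 20s\log\left(\frac{n}{s}\right)=O\left(s\log\left(\frac{n}{s}\right)\right)$$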

Case II: $s<\frac{k}{8}$ . We set $r=\frac{k}{s}$ , and get:

$$E[\mu]=O\left(\frac{k\log\left(\frac{n}{s}\right)}{\log\left(\frac{k}{s}\right)}\right).$$
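To get a feel for the two cases, the sketch below evaluates the bound $(k+2s+2rs)\left(\frac{\log(n/s)}{\log r}+2\right)$, that is, the per-epoch cost $k+2(r+1)s$ times the epoch bound of Lemma 4, with the choice of $r$ made in the proof. For comparison it also evaluates $(k+s)\log n$, the expression inside the $O(\cdot)$ bound of [9] with its hidden constant ignored, so the comparison is only indicative. The function names are assumptions of this example.

```python
import math


def choose_r(k, s):
    """The epoch parameter r used in the two cases of the proof."""
    return 2.0 if s >= k / 8 else k / s


def lemma8_bound(k, s, n):
    """(k + 2s + 2rs) * (log(n/s) / log r + 2) with r chosen by choose_r."""
    r = choose_r(k, s)
    return (k + 2 * s + 2 * r * s) * (math.log(n / s) / math.log(r) + 2)


def prior_bound(k, s, n):
    """(k + s) * log2(n): the O((k+s) log n) expression without its constant."""
    return (k + s) * math.log2(n)


# Many sites, a single sample: k = 1024, s = 1, n = 2**30.
ours = lemma8_bound(1024, 1, 2**30)
theirs = prior_bound(1024, 1, 2**30)
```

With $k=1024$ and $s=1$ the proof sets $r=k/s=1024$, so the number of epochs is governed by $\log(n/s)/\log(k/s)$ instead of $\log n$.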

5 Lower Bound

Theorem 3

For any constant $q,0<q<1$ , any correct protocol must send $\Omega\left(\frac{k\log(n/s)}{\log(1+(k/s))}\right)$ messages with probability at least $1-q$ , where the probability is taken over the protocol’s internal randomness.

**Proof: **

Let $\beta=(1+(k/s))$ . Define $e=\Theta\left(\frac{\log(n/s)}{\log(1+(k/s))}\right)$ epochs as follows: in the $i$ -th epoch, $i\in\{0,1,2,\ldots,e-1\}$ , there are $\beta^{i-1}k$ global stream updates, which can be distributed among the $k$ servers in an arbitrary way.

We consider a distribution on orderings of the stream updates. Namely, we think of a totally-ordered stream $1,2,3,\ldots,n$ of $n$ updates, and in the $i$ -th epoch, we randomly assign the $\beta^{i-1}k$ updates among the $k$ servers, independently for each epoch. Let the randomness used for the assignment in the $i$ -th epoch be denoted $\sigma_{i}$ .

Consider the global stream of updates $1,2,3,\ldots,n$ . Suppose we maintain a sample set ${\cal P}$ of $s$ items without replacement. We let ${\cal P}_{i}$ denote a random variable indicating the value of ${\cal P}$ after seeing $i$ updates in the stream. We will use the following lemma about reservoir sampling.

Lemma 9

For any constant $q>0$ , there is a constant $C^{\prime}=C^{\prime}(q)>0$ for which

– ${\cal P}$ changes at least $C^{\prime}s\log(n/s)$ times with probability at least $1-q$, and

– if $s<k/8$ and $k=\omega(1)$ and $e=\omega(1)$, then with probability at least $1-q/2$, over the choice of $\{{\cal P}_{i}\}$, there are at least $(1-(q/8))e$ epochs for which the number of times ${\cal P}$ changes in the epoch is at least $C^{\prime}s\log(1+(k/s))$.

**Proof: **

Consider the stream $1,2,3,\ldots,n$ of updates. In the classical reservoir sampling algorithm [15], ${\cal P}$ is initialized to $\{1,2,3,\ldots,s\}$ . Then, for each $i>s$ , the $i$ -th element is included in the current sample set ${\cal P}_{i}$ with probability $s/i$ , in which case a random item in ${\cal P}_{i-1}$ is replaced with $i$ .
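The procedure just described can be sketched in a few lines of Python that also count how often the sample changes, which is the quantity the lemma bounds. The function names `reservoir_changes` and `expected_changes` are assumptions of this example.

```python
import random


def reservoir_changes(n, s, seed=0):
    """Run classical reservoir sampling on 1..n; return (sample, changes)."""
    rng = random.Random(seed)
    sample = list(range(1, min(n, s) + 1))   # the first s updates fill the sample
    changes = 0
    for i in range(s + 1, n + 1):
        if rng.random() < s / i:             # update i enters with probability s/i
            sample[rng.randrange(s)] = i     # and replaces a uniformly random item
            changes += 1
    return sample, changes


def expected_changes(n, s):
    """Sum of s/i over i = s+1..n, the expected number of changes after the fill."""
    return sum(s / i for i in range(s + 1, n + 1))


sample, changes = reservoir_changes(n=10000, s=10, seed=3)
mean_changes = expected_changes(n=10000, s=10)
```

The observed number of changes concentrates around the sum of the inclusion probabilities $s/i$, which grows like $s\ln(n/s)$.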

For the first part of Lemma 9, let $X_{i}$ be an indicator random variable for the event that update $i$ causes ${\cal P}$ to change. Let $X=\sum_{i=s+1}^{n}X_{i}$. Then ${\bf E}[X_{i}]=s/i$ for all $i>s$, and ${\bf E}[X]=s(H_{n}-H_{s})$, where $H_{i}=\ln i+O(1)$ is the $i$-th Harmonic number. All of the $X_{i}$, $i>s$, are independent indicator random variables. It follows by a Chernoff bound that

$$\Pr[X<{\bf E}[X]/2]\leq\exp(-{\bf E}[X]/8)\leq\exp(-(\ln(n/s))/8)\leq\left(\frac{s}{n}\right)^{1/8}.$$

For any $s=o(n)$, this is less than any constant $q$, and so the first part of Lemma 9 follows since ${\bf E}[X]/2=\frac{s}{2}(H_{n}-H_{s})$, which is $\Theta(s\ln(n/s))$.

For the second part of Lemma 9, consider the $i$ -th epoch, $i>0$ , which contains $\beta^{i-1}k$ consecutive updates. Let $Y_{i}$ be the number of changes in this epoch. Then ${\bf E}[Y_{i}]=s\ln\beta+O(1)$ . Since $Y_{i}$ can be written as a sum of independent indicator random variables, by a Chernoff bound,

$$\Pr\left[Y_{i}<{\bf E}[Y_{i}]/2\right]\leq\exp\left(-{\bf E}[Y_{i}]/8\right)\leq\frac{1}{\beta^{s/8}}.$$

Hence, the expected number of epochs $i$ for which $Y_{i}<{\bf E}[Y_{i}]/2$ is at most $\sum_{i=1}^{e-1}\frac{1}{\beta^{s/8}}$, which is $o(e)$ since we are promised that $s<k/8$ and $k=\omega(1)$ and $e=\omega(1)$. By a Markov bound, with probability at least $1-q/2$, at most $o(e/q)=o(e)$ epochs $i$ satisfy $Y_{i}<{\bf E}[Y_{i}]/2$. It follows that with probability at least $1-q/2$, there are at least $(1-q/8)e$ epochs $i$ for which the number $Y_{i}$ of changes in epoch $i$ is at least ${\bf E}[Y_{i}]/2\geq\frac{1}{2}s\ln\beta$, as desired.
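The first part of Lemma 9 can be checked empirically. The following is a minimal sketch, not part of the proof: it runs classical reservoir sampling as described above, counts the replacements that occur once the reservoir is full, and compares the count with $\sum_{i>s}s/i=s(H_{n}-H_{s})$. The function name `reservoir_changes`, the parameter values, and the fixed seed are choices made for illustration.

```python
import math
import random

def reservoir_changes(n, s, rng):
    """Run reservoir sampling on 1..n; return (sample, number of replacements)."""
    sample = list(range(1, s + 1))        # P is initialized to {1, ..., s}
    changes = 0
    for i in range(s + 1, n + 1):
        if rng.random() < s / i:          # update i enters with probability s/i
            sample[rng.randrange(s)] = i  # it replaces a uniformly random item
            changes += 1
    return sample, changes

rng = random.Random(1)
n, s = 200_000, 20
sample, changes = reservoir_changes(n, s, rng)
expected = s * sum(1 / i for i in range(s + 1, n + 1))  # s * (H_n - H_s)
print(changes, round(expected, 1), round(s * math.log(n / s), 1))
```

The printed values show the observed number of changes next to its expectation and next to $s\ln(n/s)$, which differs from the expectation only by a lower-order term.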

Corner Cases: When $s\geq k/8$ , the statement of Theorem 3 gives a lower bound of $\Omega(s\log(n/s))$ . In this case Theorem 3 follows immediately from the first part of Lemma 9 since these changes in ${\cal P}$ must be communicated to the central coordinator. Hence, in what follows we can assume $s<k/8$ . Notice also that if $k=O(1)$ , then $\frac{k\log(n/s)}{\log(1+(k/s))}=O(s\log(n/s))$ , and so the theorem is independent of $k$ , and follows simply by the first part of Lemma 9. Notice also that if $e=O(1)$ , then the statement of Theorem 3 amounts to proving an $\Omega(k)$ lower bound, which follows trivially since every site must send at least one message.

Thus, in what follows, we may apply the second part of Lemma 9.

Main Case: Let $C>0$ be a sufficiently small constant, depending on $q$ , to be determined below. Let $\Pi$ be a possibly randomized protocol, which with probability at least $q$ , sends at most $Cke$ messages. We show that $\Pi$ cannot be a correct protocol.

Let $\tau$ denote the random coin tosses of $\Pi$ , i.e., the concatenation of random strings of all $k$ sites together with that of the central coordinator.

Let $\mathcal{E}$ be the event that $\Pi$ sends at most $Cke$ messages. By assumption, $\Pr_{\tau}[\mathcal{E}]\geq q.$ Hence, it is also the case that

$$\Pr_{\tau,\{{\cal P}_{i}\},\{\sigma_{i}\}}[\mathcal{E}]\geq q.$$

For a sufficiently small constant $C^{\prime}>0$ that may depend on $q$ , let $\mathcal{F}$ be the event that there are at least $(1-(q/8))e$ epochs for which the number of times ${\cal P}$ changes in the epoch is at least $C^{\prime}s\log(1+(k/s))$ . By the second part of Lemma 9,

$$\Pr_{\{{\cal P}_{i}\}}[\mathcal{F}]\geq 1-q/2.$$

It follows that there is a fixing of $\tau=\tau^{\prime}$ as well as a fixing of ${\cal P}_{0},{\cal P}_{1},\ldots,{\cal P}_{e}$ to $P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime}$ for which $\mathcal{F}$ occurs and

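$$\Pr_{\{\sigma_{i}\}}\left[\mathcal{E}\mid\tau=\tau^{\prime},\ ({\cal P}_{0},{\cal P}_{1},\ldots,{\cal P}_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})\right]\geq q/2.$$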

Notice that the three (sets of) random variables $\tau$, $\{{\cal P}_{i}\}$, and $\{\sigma_{i}\}$ are independent, and so in particular, $\{\sigma_{i}\}$ is still uniformly random given this conditioning.

By a Markov argument, if event $\mathcal{E}$ occurs, then there are at least $(1-(q/8))e$ epochs for which at most $(8/q)\cdot C\cdot k$ messages are sent. If events $\mathcal{E}$ and $\mathcal{F}$ both occur, then by a union bound, there are at least $(1-(q/4))e$ epochs for which at most $(8/q)\cdot C\cdot k$ messages are sent and ${\cal P}$ changes in the epoch at least $C^{\prime}s\log(1+(k/s))$ times. Call such an epoch balanced.

Let $i^{*}$ be the epoch which is most likely to be balanced, over the random choices of $\{\sigma_{i}\}$ , conditioned on $\tau=\tau^{\prime}$ and $({\cal P}_{0},{\cal P}_{1},\ldots,{\cal P}_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})$ . Since at least $(1-(q/4))e$ epochs are balanced if $\mathcal{E}$ and $\mathcal{F}$ occur, and conditioned on $({\cal P}_{0},{\cal P}_{1},\ldots,{\cal P}_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})$ event $\mathcal{F}$ does occur, and $\mathcal{E}$ occurs with probability at least $q/2$ given this conditioning, it follows that

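$$\Pr_{\{\sigma_{i}\}}\left[i^{*}\text{ is balanced}\mid\tau=\tau^{\prime},\ ({\cal P}_{0},{\cal P}_{1},\ldots,{\cal P}_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})\right]\geq\frac{q}{2}\left(1-\frac{q}{4}\right)\geq\frac{q}{4}.$$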

The property of $i^{*}$ being balanced is independent of $\sigma_{j}$ for $j\neq i^{*}$ , so we also have

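$$\Pr_{\sigma_{i^{*}}}\left[i^{*}\text{ is balanced}\mid\tau=\tau^{\prime},\ ({\cal P}_{0},{\cal P}_{1},\ldots,{\cal P}_{e})=(P_{0}^{\prime},P_{1}^{\prime},\ldots,P_{e}^{\prime})\right]\geq\frac{q}{4}.$$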

If $C^{\prime}s\log(1+(k/s))\geq 1$ , then ${\cal P}$ changes at least once in epoch $i^{*}$ . Suppose, for the moment, that this is the case. Suppose the first update in the global stream at which ${\cal P}$ changes is the $j^{*}$ -th update. In order for $i^{*}$ to be balanced for at least a $q/4$ fraction of the $\sigma_{i^{*}}$ , there must be at least $qk/4$ different servers which receive $j^{*}$ , for which $\Pi$ sends a message. In particular, since $\Pi$ is deterministic conditioned on $\tau$ , at least $qk/4$ messages must be sent in the $i^{*}$ -th epoch. But $i^{*}$ was chosen so that at most $(8/q)\cdot C\cdot k$ messages are sent, which is a contradiction for $C<q^{2}/32$ .

It follows that we reach a contradiction unless $C^{\prime}s\log(1+(k/s))<1$ . Notice, though, that since $C^{\prime}$ is a constant, if $C^{\prime}s\log(1+(k/s))<1$ , then this implies that $k=O(1)$ . However, if $k=O(1)$ , then $\frac{k\log(n/s)}{\log(1+(k/s))}=O(s\log(n/s))$ , and so the theorem is independent of $k$ , and follows simply by the first part of Lemma 9.

Otherwise, we have reached a contradiction, and so it follows that at least $Cke$ messages must be sent with probability at least $1-q$. Since $Cke=\Omega\left(\frac{k\log(n/s)}{\log(1+(k/s))}\right)$, this completes the proof.

6 Sampling With Replacement

We now present an algorithm to maintain a random sample of size $s$ with replacement from ${\cal S}$ . The basic idea is to run in parallel $s$ copies of the single item sampling algorithm from Section 3. Done naively, this will lead to a message complexity of $O(sk\frac{\log n}{\log k})$ . We obtain an improved algorithm based on the following ideas.

We view the distributed streams as $s$ logical streams, ${\cal S}^{i},i=1\ldots s$. Each ${\cal S}^{i}$ is identical to ${\cal S}$, but the algorithm assigns independent weights to the different copies of the same element in the different logical streams. Let $w^{i}(e)$ denote the weight assigned to element $e$ in ${\cal S}^{i}$; $w^{i}(e)$ is a random number between $0$ and $1$. For each $i=1\ldots s$, the coordinator maintains the minimum weight, say $w^{i}$, among all elements in ${\cal S}^{i}$, and the corresponding element.

Let $\beta=\max_{i=1}^{s}w^{i}$; $\beta$ is maintained by the coordinator. Each site $j$ maintains $\beta_{j}$, a local view of $\beta$, which is always greater than or equal to $\beta$. Whenever a logical stream element at site $j$ has weight less than $\beta_{j}$, the site sends it to the coordinator, receives in response the current value of $\beta$, and updates $\beta_{j}$. When a random sample is requested at the coordinator, it returns the set of all minimum weight elements in all $s$ logical streams. It can be easily seen that this algorithm is correct, and at all times, returns a random sample of size $s$ selected with replacement. The main optimization relative to the naive approach described above is that when a site sends a message to the coordinator, it receives $\beta$, which provides partial information about all of the $w^{i}$. This provides a substantial improvement in the message complexity and leads to the following bounds.
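The protocol just described can be sketched as a single-process simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `run_protocol` is hypothetical, the input is given as precomputed site assignments and weights so that the result can be checked against the true minima, and each forwarded element is counted as two messages (the request and the reply carrying $\beta$).

```python
import random

def run_protocol(sites, weights, k):
    """Simulate the coordinator and k sites on a fixed input.

    sites[e]      -- index of the site that observes element e
    weights[e][i] -- weight w^i(e) of element e in logical stream S^i
    Returns (sample, messages): the minimum-weight element of each logical
    stream, and the number of messages exchanged (requests plus replies).
    """
    s = len(weights[0])
    min_w = [1.0] * s        # coordinator: w^i, minimum weight seen in S^i
    min_e = [None] * s       # coordinator: element attaining w^i
    beta_local = [1.0] * k   # site j: beta_j, its possibly stale view of beta
    messages = 0
    for e, (j, ws) in enumerate(zip(sites, weights)):
        for i, w in enumerate(ws):
            if w < beta_local[j]:           # site j forwards the logical element
                messages += 2               # one message up, one reply down
                if w < min_w[i]:
                    min_w[i], min_e[i] = w, e
                beta_local[j] = max(min_w)  # reply carries the current beta
    return min_e, messages

rng = random.Random(0)
n, k, s = 5000, 16, 8
sites = [rng.randrange(k) for _ in range(n)]  # assumption: uniform site choice
weights = [[rng.random() for _ in range(s)] for _ in range(n)]
sample, messages = run_protocol(sites, weights, k)
print(sample, messages)
```

Because $\beta_{j}\geq\beta\geq w^{i}$ at all times, any logical element that would lower $w^{i}$ is always forwarded, so the coordinator's answer matches the true minimum of every logical stream while most of the $sn$ logical elements are filtered locally.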

Theorem 4

The above algorithm continuously maintains a sample of size $s$ with replacement from ${\cal S}$ , and its expected message complexity is $O(s\log s\log n)$ in case $k\leq 2s\log s$ , and $O\left(k\frac{\log n}{\log(\frac{k}{s\log s})}\right)$ in case $k>2s\log s$ .

**Proof:**

We provide a sketch of the proof here and omit the details. The analysis of the message complexity is similar to the case of sampling without replacement. The execution is divided into epochs, where in epoch $i$, the value of $\beta$ at the coordinator decreases by at least a factor of $r$ (a parameter to be determined later). Let $\xi$ denote the number of epochs. It can be seen that $E[\xi]=O(\frac{\log n}{\log r})$. In epoch $i$, let $X_{i}$ denote the number of messages sent from the sites to the coordinator in the epoch, $m_{i}$ denote the value of $\beta$ at the beginning of the epoch, and $n_{i}$ denote the number of elements in ${\cal S}$ that arrived in the epoch.

The $n_{i}$ elements in epoch $i$ give rise to $sn_{i}$ logical elements, and each logical element has a probability of no more than $m_{i}$ of resulting in a message to the coordinator. Similar to the proof of Lemma 7, we can show using conditional expectations that $E[X_{i}]\leq rs\log s$ (the $\log s$ factor comes in due to the fact that $E[n_{i}|m_{i}=\alpha]\leq\frac{r\log s}{\alpha}$). Thus the expected total number of messages in epoch $i$ is bounded by $(k+2sr\log s)$, and in the entire execution is $O((k+2sr\log s)\frac{\log n}{\log r})$. By choosing $r=2$ for the case $k\leq(2s\log s)$, and $r=k/(s\log s)$ for the case $k>(2s\log s)$, we get the desired result.
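To see how the two cases arise from the bound $O((k+2sr\log s)\frac{\log n}{\log r})$, substitute the two choices of $r$. Logarithms are taken to base 2 here, which affects only the constants:

$$r=2,\ k\leq 2s\log s:\quad(k+4s\log s)\frac{\log n}{\log 2}\leq 6s\log s\log n=O(s\log s\log n),$$

$$r=\frac{k}{s\log s},\ k>2s\log s:\quad(k+2k)\frac{\log n}{\log\left(\frac{k}{s\log s}\right)}=O\left(k\frac{\log n}{\log\left(\frac{k}{s\log s}\right)}\right).$$

In the second case $r>2$, so $\log r>1$ and the denominator is bounded away from zero.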

Bibliography


[1] C. C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In VLDB, pages 607–618, 2006.
[2] C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional monitoring without monotonicity. In ICALP (1), pages 95–106, 2009.
[3] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In SODA, pages 633–634, 2002.
[4] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD Conference, pages 28–39, 2003.
[5] V. Braverman and R. Ostrovsky. Effective computations on sliding windows. SIAM Journal on Computing, 39(6):2113–2131, 2010.
[6] V. Braverman, R. Ostrovsky, and C. Zaniolo. Optimal sampling from sliding windows. In PODS, pages 147–156, 2009.
[7] G. Cormode and M. N. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In VLDB, pages 13–24, 2005.
[8] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In SODA, pages 1076–1085, 2008.
