Optimal Random Sampling from Distributed Streams Revisited
Srikanta Tirthapura, David P. Woodruff

This writeup is a revised version of a paper with the same title and authors, which appeared in the Proceedings of the International Conference on Distributed Computing (DISC) 2011. It corrects an error in the proof of the upper bound on message complexity (Section 4). The proofs in pages 9, 10, and 11 (excluding Theorem 2) have been rewritten relative to the DISC 2011 version. None of the main theorem statements (Theorems 2, 3, 4) have changed from the DISC 2011 version.
We thank Rajesh Jayaram for pointing out the error in the conference version.
Srikanta Tirthapura, Iowa State University
David P. Woodruff, Carnegie Mellon University
Abstract
We give an improved algorithm for drawing a random sample from a large data stream when the input elements are distributed across multiple sites which communicate via a central coordinator. At any point in time the set of elements held by the coordinator represents a uniform random sample from the set of all the elements observed so far. When compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system as well as the computation required of the coordinator. We also present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. As a byproduct, we obtain an improved algorithm for finding the heavy hitters across multiple distributed sites.
1 Introduction
For many data analysis tasks, it is impractical to collect all the data at a single site and process it in a centralized manner. For example, data arrives at multiple network routers at extremely high rates, and queries are often posed on the union of data observed at all the routers. Since the data set is changing, the query results could also be changing continuously with time. This has motivated the continuous, distributed, streaming model [8]. In this model there are physically distributed sites receiving high-volume local streams of data. These sites talk to a central coordinator, who has to continuously respond to queries over the union of all streams observed so far. The challenge is to minimize the communication between the different sites and the coordinator, while providing an accurate answer to queries at the coordinator at all times.
A fundamental problem in this setting is to obtain a random sample drawn from the union of all distributed streams. This generalizes the classic reservoir sampling problem (see, e.g., [15], where the algorithm is attributed to Waterman; see also [19]) to the setting of multiple distributed streams, and has applications to approximate query answering, selectivity estimation, and query planning. For example, in the case of network routers, maintaining a random sample from the union of the streams is valuable for network monitoring tasks involving the detection of global properties [13]. Other problems on distributed stream processing, including the estimation of the number of distinct elements [7, 8] and heavy hitters [4, 14, 17, 21], use random sampling as a primitive.
The study of sampling in distributed streams was initiated by Cormode et al. [9]. Consider a set of $k$ different streams observed by the $k$ sites, with the total number of current items in the union of all streams equal to $n$. The authors in [9] show how $k$ sites can maintain a random sample of $s$ items without replacement from the union of their streams using an expected $O((k+s)\log n)$ messages between the sites and the central coordinator. The memory requirement of the central coordinator is $s$ machine words, and the time requirement is $O((k+s)\log n)$. The memory requirement of the remote sites is a single machine word with constant time per stream update. Cormode et al. also prove a lower bound on the expected number of messages sent in any scheme. Each message is assumed to be a single machine word.
Notation. All logarithms are to the base 2 unless otherwise specified. Throughout the paper, when we use asymptotic notation, the variable that is going to infinity is $n$, and $s$ and $k$ are functions of $n$.
1.1 Our Results
Our main contribution is an algorithm for sampling without replacement from distributed streams, as well as a matching lower bound showing that the message complexity of our algorithm is optimal. A summary of our results and a comparison with earlier work is shown in Figure 1.
New Algorithm: We present an algorithm which uses an expected
$$O\left(\frac{k\log(n/s)}{\log(1+k/s)}\right)$$
number of messages for continuously maintaining a random sample of size $s$ from $k$ distributed data streams of total size $n$. Notice that if $s < k/8$, this number is $O\left(\frac{k\log(n/s)}{\log(k/s)}\right)$, while if $s \ge k/8$, this number is $O(s\log(n/s))$.
The memory requirement in our protocol at the central coordinator is $s$ machine words, and the time requirement is $O\left(\frac{k\log(n/s)}{\log(1+k/s)}\right)$. The former is the same as that in the protocol of [9], while the latter improves their time requirement. The remote sites in our scheme store a single machine word and use constant time per stream update, which is clearly optimal.
Our result leads to a significant improvement in the message complexity in the case when $k$ is large. For example, for the basic problem of maintaining a single random sample from the union of distributed streams ($s = 1$), our algorithm leads to a factor of $O(\log k)$ decrease in the number of messages sent in the system over the algorithm in [9].
Our algorithm is simple, and only requires the central coordinator to communicate with a site if the site initiates the communication. This is useful in a setting where a site may go offline, since it does not require the ability of a site to receive broadcast messages.
Lower Bound: We also show that for any constant $q > 0$, any correct protocol must send $\Omega\left(\frac{k\log(n/s)}{\log(1+k/s)}\right)$ messages with probability at least $1-q$. This also yields a bound of $\Omega\left(\frac{k\log(n/s)}{\log(1+k/s)}\right)$ on the expected message complexity of any correct protocol, showing that the expected number of messages sent by our algorithm is optimal, up to constant factors.
In addition to being quantitatively stronger than the lower bound of [9], our lower bound is also qualitatively stronger, because the lower bound in [9] is on the expected number of messages transmitted in a correct protocol. However, this does not rule out the possibility that with large probability, far fewer messages are sent in the optimal protocol. In contrast, we lower bound the number of messages that must be transmitted in any protocol with probability at least $1-q$. Since the time complexity of the central coordinator is at least the number of messages received, the time complexity of our protocol is also optimal.
Sampling With Replacement. We also show how to modify our protocol to obtain a random sample of $s$ items from $k$ distributed streams with replacement. Here we achieve a protocol with $O\left(\left(\frac{k}{\log(2 + k/(s\log s))} + s\log s\right)\log n\right)$ messages, improving the $O((k + s\log s)\log n)$-message protocol of [9]. We obtain the same improvement in the time complexity of the central coordinator.
Heavy-Hitters. As a corollary, we obtain a protocol for estimating the heavy hitters in distributed streams with the best known message complexity. In this problem we would like to find a set $H$ of items so that if an element $e$ occurs at least an $\epsilon$ fraction of times in the union of the streams, then $e \in H$, and if $e$ occurs less than an $\epsilon/2$ fraction of times in the union of the streams, then $e \notin H$. It is known that $O(\epsilon^{-1}\log n)$ random samples suffice to estimate the set of heavy hitters with high probability, and the previous best algorithm [9] was obtained by plugging $s = O(\epsilon^{-1}\log n)$ into a protocol for distributed sampling. We thus improve the message complexity from $O((k + \epsilon^{-1}\log n)\log n)$ to $O\left(\frac{k\log(\epsilon n)}{\log(\epsilon k)} + \epsilon^{-1}\log(\epsilon n)\log n\right)$. This can be significant when $k$ is large compared to $1/\epsilon$.
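As an illustration of how a uniform sample can be turned into a heavy-hitter estimate, here is a minimal Python sketch. The function name, the parameter `epsilon`, and the cutoff of three quarters of `epsilon` (chosen to sit between the two frequency thresholds) are assumptions made for this example; the paper itself only states that a large enough random sample suffices.

```python
from collections import Counter


def heavy_hitters_from_sample(sample, epsilon):
    """Report items whose frequency in the sample is at least 0.75 * epsilon.

    The 0.75 factor is an illustrative cutoff, not a value taken from the paper.
    """
    threshold = 0.75 * epsilon * len(sample)
    counts = Counter(sample)
    return {item for item, count in counts.items() if count >= threshold}


# Toy sample: 'a' and 'b' are frequent, 'c' is rare.
toy_sample = ["a"] * 60 + ["b"] * 35 + ["c"] * 5
reported = heavy_hitters_from_sample(toy_sample, epsilon=0.4)
```

In a distributed deployment the `sample` argument would be whatever the coordinator currently holds, so the only communication cost is that of maintaining the sample.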
1.2 Related Work
In addition to work discussed above, other research in the continuous distributed streaming model includes estimating frequency moments and counting the number of distinct elements [7, 8], and estimating the entropy [2]. The reservoir sampling technique has been used extensively in large scale data mining applications, see for example [10, 16, 1]. Stream sampling under sliding windows has been considered in [6, 3]. Deterministic algorithms for heavy-hitters over distributed streams, and corresponding lower bounds were considered in [21].
Stream sampling under sliding windows over distributed streams has been considered in [9]. Their algorithm for sliding windows is already optimal up to lower-order additive terms (see Theorems 4.1 and 4.2 in [9]). Hence our improved results for the non-sliding window case do not translate into an improvement for the case of sliding windows.
A related model of distributed streams was considered in [11, 12]. In this model, the coordinator was not required to continuously maintain an estimate of the required aggregate, but when the query was posed to the coordinator, the sites would be contacted and the query result would be constructed. In their model, the coordinator could be said to be “reactive”, whereas in the model considered in this paper, the coordinator is “pro-active”.
Roadmap: We first present the model and problem definition in Section 2, and then the algorithm followed by a proof of correctness in Section 3. The analysis of message complexity and the lower bound are presented in Sections 4 and 5 respectively, followed by an algorithm for sampling with replacement in Section 6.
2 Model
Consider a system with $k$ different sites, numbered from $1$ to $k$, each receiving a local stream of elements. Let $\mathcal{S}_i$ denote the stream observed at site $i$. There is one “coordinator” node, which is different from any of the sites. The coordinator does not observe a local stream, but all queries for a random sample arrive at the coordinator. Let $\mathcal{S} = \cup_{i=1}^{k} \mathcal{S}_i$ be the entire stream observed by the system, and let $n = |\mathcal{S}|$. The sample size $s$ is a parameter supplied to the coordinator and to the sites during initialization.
The task of the coordinator is to continuously maintain a random sample $\mathcal{P}$ of size $\min\{n, s\}$ consisting of elements chosen uniformly at random without replacement from $\mathcal{S}$. The cost of the protocol is the number of messages transmitted.
We assume a synchronous communication model, where the system progresses in “rounds”. In each round, each site can observe one element (or none), send a message to the coordinator, and receive a response from the coordinator. The coordinator may receive up to $k$ messages in a round, and respond to each of them in the same round. This model is essentially identical to the model assumed in previous work [9]. Later we discuss how to handle the case of a site observing multiple elements per round.
The sizes of the different local streams at the sites, their order of arrival, and the interleaving of the streams at different sites, can all be arbitrary. The algorithm cannot make any assumption about these.
3 Algorithm
The idea in the algorithm is as follows. Each site associates a random “weight” with each element that it receives. The coordinator then maintains the set $\mathcal{P}$ of $s$ elements with the minimum weights in the union of the streams at all times, and this is a random sample of $\mathcal{S}$. This idea is similar in spirit to all centralized reservoir sampling algorithms. In a distributed setting, the interesting question is when the sites communicate with the coordinator, and vice versa.
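The centralized version of this idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pseudocode: the function name and the use of a heap are choices made here, and weights are drawn uniformly from [0, 1).

```python
import heapq
import random


def min_weight_sample(stream, s, rng):
    """Keep the s elements with the smallest random weights seen so far."""
    # Max-heap on weight (stored negated), so the largest kept weight is heap[0].
    heap = []
    for index, element in enumerate(stream):
        weight = rng.random()  # independent uniform weight in [0, 1)
        entry = (-weight, index, element)  # index breaks ties without comparing elements
        if len(heap) < s:
            heapq.heappush(heap, entry)
        elif weight < -heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the largest kept weight
    return [element for _, _, element in heap]


sample = min_weight_sample(range(1000), s=5, rng=random.Random(7))
```

Because every element receives an independent weight from the same distribution, each subset of the requested size is equally likely to hold the smallest weights, which is why the kept set is a uniform sample without replacement.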
In our algorithm, the coordinator maintains $u$, which is the $s$-th smallest weight so far in the system, as well as the sample $\mathcal{P}$, consisting of all the elements that have weight no more than $u$. Each site $i$ need only maintain a single value $u_i$, which is the site’s view of the $s$-th smallest weight in the system so far. Note that it is too expensive to keep the view of each site synchronized with the coordinator’s view at all times: the value of the $s$-th smallest weight changes $O(s\log(n/s))$ times, and updating every site each time the $s$-th minimum changes takes a total of $O(sk\log(n/s))$ messages.
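The cost of full synchronization can be estimated with a short calculation. The symbols are assumed here to be $n$ for the total number of elements, $s$ for the sample size, and $k$ for the number of sites, with all weights distinct. For $i > s$, the $i$-th element has one of the $s$ smallest weights among the first $i$ elements with probability $s/i$, and this is exactly when the $s$-th smallest weight changes:

```latex
\mathbb{E}[\#\text{changes after the first } s \text{ elements}]
  = \sum_{i=s+1}^{n} \frac{s}{i}
  = s\,(H_n - H_s)
  \le s \ln\frac{n}{s},
\qquad
\mathbb{E}[\#\text{broadcast messages}] \le k \cdot s \ln\frac{n}{s}.
```

Here $H_m$ denotes the $m$-th harmonic number, and the second bound assumes one message per site for every change.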
In our algorithm, when site $i$ sees an element with a weight smaller than $u_i$, it sends it to the central coordinator. The coordinator updates $u$ and $\mathcal{P}$, if needed, and then replies back to $i$ with the current value of $u$, which is the true $s$-th smallest weight in the union of all streams. Thus each time a site communicates with the coordinator, it either makes a change to the random sample, or, at least, gets to refresh its view of $u$.
The algorithm at each site is described in Algorithms 1 and 2. The algorithm at the coordinator is described in Algorithm 3.
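Since the pseudocode listings are not reproduced in this text, the following Python sketch reconstructs the protocol from the prose description alone. All class, method, and variable names are assumptions made for illustration, the interleaving of elements across sites is chosen at random, and a single loop iteration stands in for a round.

```python
import random


class Coordinator:
    """Keeps the s smallest (weight, element) pairs received so far."""

    def __init__(self, s):
        self.s = s
        self.u = 1.0       # s-th smallest weight so far; stays 1 until s pairs arrive
        self.sample = []   # at most s (weight, element) pairs

    def receive(self, element, weight):
        """Process one message from a site and reply with the current u."""
        if weight < self.u:
            self.sample.append((weight, element))
            if len(self.sample) > self.s:
                self.sample.remove(max(self.sample))  # evict the largest weight
            if len(self.sample) == self.s:
                self.u = max(self.sample)[0]
        return self.u


class Site:
    """Stores a single value u_i, a possibly stale view of u."""

    def __init__(self, coordinator, rng):
        self.coordinator = coordinator
        self.rng = rng
        self.u_i = 1.0
        self.messages = 0  # messages to and from this site

    def observe(self, element):
        weight = self.rng.random()   # uniform weight in [0, 1)
        if weight < self.u_i:        # may belong to the sample: report it
            self.u_i = self.coordinator.receive(element, weight)
            self.messages += 2       # one message in each direction
        return weight


def simulate(k, s, n, seed=0):
    """Feed n elements to k sites in a random interleaving."""
    rng = random.Random(seed)
    coordinator = Coordinator(s)
    sites = [Site(coordinator, rng) for _ in range(k)]
    pairs = []                       # every (weight, element), kept for checking only
    for element in range(n):
        weight = sites[rng.randrange(k)].observe(element)
        pairs.append((weight, element))
    return coordinator, sites, pairs


coordinator, sites, pairs = simulate(k=20, s=5, n=5000)
```

After the run, `coordinator.sample` should hold exactly the pairs with the smallest weights in `pairs`, and summing `site.messages` over all sites gives the message cost for this particular input.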
3.1 Correctness
The following two lemmas establish the correctness of the algorithm.
Lemma 1
Let $n$ be the number of elements in $\mathcal{S}$ so far. (1) If $n \le s$, then the set $\mathcal{P}$ at the coordinator contains all the $(e, w)$ pairs seen at all the sites so far. (2) If $n > s$, then $\mathcal{P}$ at the coordinator consists of the $s$ $(e, w)$ pairs such that the weights of the pairs in $\mathcal{P}$ are the smallest weights in the stream so far.
- **Proof: **
The variable $u$ is stored at the coordinator, and $u_i$ is stored at site $i$. First we note that the variables $u$ and $u_i$ are non-increasing with time; this can be verified from the algorithms.
Next, we note that for every $i$ from $1$ to $k$, at every round, $u_i \ge u$. This can be seen because initially, $u_i = u = 1$, and $u_i$ changes only in response to receiving $u$ from the coordinator.
Thus, if fewer than $s$ elements have appeared in the stream so far, $u$ is $1$, and hence $u_i$ is also $1$ for each site $i$. The next element observed in the system is also sent to the coordinator. Thus, if $n \le s$, then the set $\mathcal{P}$ consists of all elements seen so far in the system.
Next, we consider $n > s$. Note that $u$ maintains the $s$-th smallest weight seen at the coordinator, and $\mathcal{P}$ consists of the $s$ elements seen at the coordinator with the smallest weights. We only have to show that if an element $e$, observed at site $i$, is such that $w(e) < u$, then site $i$ must have sent $(e, w(e))$ to the coordinator. This follows because $u_i \ge u$ at all times, and if $w(e) < u$, then it must be true that $w(e) < u_i$, and in this case, $(e, w(e))$ is sent to the coordinator.
Lemma 2
At the end of each round, sample $\mathcal{P}$ at the coordinator consists of a uniform random sample of size $\min\{n, s\}$ chosen without replacement from $\mathcal{S}$.
- **Proof: **
In case $n \le s$, then from Lemma 1, we know that $\mathcal{P}$ contains every element of $\mathcal{S}$. In case $n > s$, from Lemma 1, it follows that $\mathcal{P}$ consists of the $s$ elements with the smallest weights from $\mathcal{S}$. Since the weights are assigned randomly, each element in $\mathcal{S}$ has a probability of $\frac{s}{n}$ of belonging to $\mathcal{P}$, showing that this is a uniform random sample. Since an element can appear no more than once in the sample, this is a sample chosen without replacement.
4 Analysis of the Algorithm (Upper Bound)
We now analyze the message complexity of the maintenance of a random sample.
For the sake of analysis, we divide the execution of the algorithm into “epochs”, where each epoch consists of a sequence of rounds. The epochs are defined inductively. Let $r > 1$ be a parameter, which will be fixed later. Recall that $u$ is the $s$-th smallest weight so far in the system (if there are fewer than $s$ elements so far, $u = 1$). Epoch $0$ is the set of all rounds from the beginning of execution until (and including) the earliest round where $u$ is $\frac{1}{r}$ or smaller. Let $m_j$ denote the value of $u$ at the end of epoch $j-1$. Then epoch $j$ consists of all rounds subsequent to epoch $j-1$ until (and including) the earliest round when $u$ is $\frac{m_j}{r}$ or smaller. Note that the algorithm does not need to be aware of the epochs, and this is only used for the analysis.
Suppose we call the original distributed algorithm described in Algorithms 2 and 3 as Algorithm $A$. For the analysis, we consider a slightly different distributed algorithm, Algorithm $B$, described below. Algorithm $B$ is identical to Algorithm $A$ except for the fact that at the beginning of each epoch, the value $u$ is broadcast by the coordinator to all sites.
While Algorithm $A$ is natural, Algorithm $B$ is easier to analyze. We first note that on the same inputs, the value of $u$ (and $\mathcal{P}$) at the coordinator at any round in Algorithm $B$ is identical to the value of $u$ (and $\mathcal{P}$) at the coordinator in Algorithm $A$ at the same round. Hence, the partitioning of rounds into epochs is the same for both algorithms, for a given input. The correctness of Algorithm $B$ follows from the correctness of Algorithm $A$. The only difference between them is in the total number of messages sent. In $B$ we have the property that for all $i$ from $1$ to $k$, $u_i = u$ at the beginning of each epoch (though this is not necessarily true throughout the epoch), and for this, $B$ has to pay a cost of at least $k$ messages in each epoch.
Lemma 3
The number of messages sent by Algorithm $A$ for a set of input streams is never more than twice the number of messages sent by Algorithm $B$ for the same input.
- **Proof: **
Consider site $v$ in a particular epoch $j$. In Algorithm $B$, $v$ receives $m_j$ at the beginning of the epoch through a message from the coordinator. In Algorithm $A$, $v$ may not know $m_j$ at the beginning of epoch $j$. We consider two cases.
Case I: $v$ sends a message to the coordinator in epoch $j$ in Algorithm $A$. In this case, the first time $v$ sends a message to the coordinator in this epoch, $v$ will receive the current value of $u$, which is smaller than or equal to $m_j$. This communication costs two messages, one in each direction. Henceforth, in this epoch, the number of messages sent in Algorithm $A$ is no more than those sent in $B$. In this epoch, the number of messages transmitted to/from $v$ in $A$ is at most twice the number of messages as in $B$, which has at least one transmission from the coordinator to site $v$.
Case II: $v$ did not send a message to the coordinator in this epoch, in Algorithm $A$. In this case, the number of messages sent in this epoch to/from site $v$ in Algorithm $A$ is smaller than in Algorithm $B$.
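The comparison in Lemma 3 can be checked empirically by replaying the same arrivals through both variants. The sketch below is an illustration under assumptions: the two variants are labelled A (original protocol) and B (with a broadcast at the start of each epoch) for convenience, `r` stands for the epoch parameter, one loop iteration stands for one round with a single arrival, and all function and variable names are invented here.

```python
import random


def count_messages(weights, sites, k, s, r, broadcast):
    """Replay fixed (site, weight) arrivals and return the total message count.

    broadcast=False models the original protocol (variant A);
    broadcast=True adds a k-message broadcast of u at the start of every
    epoch (variant B).
    """
    u = 1.0
    kept = []              # the (at most s) smallest weights at the coordinator
    view = [1.0] * k       # each site's local view of u
    messages = 0
    epoch_start_u = 1.0    # value of u when the current epoch began
    new_epoch = True       # epoch 0 begins before the first arrival
    for site, w in zip(sites, weights):
        if new_epoch:
            if broadcast:
                view = [u] * k
                messages += k
            epoch_start_u = u
            new_epoch = False
        if w < view[site]:
            if w < u:
                kept.append(w)
                kept.sort()
                del kept[s:]           # keep only the s smallest weights
                if len(kept) == s:
                    u = kept[-1]
            view[site] = u             # the reply refreshes the site's view
            messages += 2
        if u <= epoch_start_u / r:
            new_epoch = True           # this round closes the current epoch
    return messages


rng = random.Random(1)
k, s, r, n = 16, 4, 2.0, 2000
sites = [rng.randrange(k) for _ in range(n)]
weights = [rng.random() for _ in range(n)]
msgs_a = count_messages(weights, sites, k, s, r, broadcast=False)
msgs_b = count_messages(weights, sites, k, s, r, broadcast=True)
```

On any input, `msgs_a` should be at most `2 * msgs_b`, which is the statement of the lemma; the simulation only illustrates it and is not a substitute for the proof.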
Let $\xi$ denote the total number of epochs.
Lemma 4
If ,
[TABLE]
- **Proof: **
Let . First, we note that in each epoch, decreases by a factor of at least . Thus after epochs, is no more than . Thus, we have
[TABLE]
Let denote the number of elements (out of ) that have been assigned a weight of or lesser. is a binomial random variable with expectation . Note that if , it must be true that .
[TABLE]
where we have used Markov’s inequality.
Since takes only positive integral values,
[TABLE]
where we have assumed .
Let $\mu$ denote the total number of messages sent during the entire execution. Let $\mu_j$ denote the total number of messages sent in epoch $j$. Let $X_j$ denote the number of messages sent from the sites to the coordinator in epoch $j$. $\mu_j$ is the sum of two parts, (1) $k$ messages sent by the coordinator at the start of the epoch, and (2) two times the number of messages sent from the sites to the coordinator.
$$\mu_j = k + 2X_j \qquad (1)$$
$$\mu = \sum_{j=0}^{\xi-1} \mu_j \qquad (2)$$
where the sum in (2) ranges over all $\xi$ epochs.
For epoch , consider the stochastic process . For each , choose a random number uniformly from .
[TABLE]
Let denote the smallest time such that there are at least elements of that are equal to . Let .
Lemma 5
[TABLE]
- **Proof: **
Consider the correspondence between the th element received in epoch and . Each time a message is sent upon receiving the th element, it must be true that , since the random weight chosen must be less than for a message to be sent (note that the threshold could be stricter than ). Further, the number of elements in this epoch is less than or equal to , since by the time elements are seen, each with a weight less than , the epoch would have ended (it may have ended earlier).
Consider the conditional random variables , , and .
[TABLE]
Definition 1
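Let $\{X_n : n \ge 0\}$ be a stochastic process. A stopping time with respect to $\{X_n\}$ is a random time $\tau$ such that for each $n \ge 0$, the event $\{\tau = n\}$ is completely determined by the total information known up to time $n$, i.e. $\{X_0, \ldots, X_n\}$.
Theorem 1 (Wald’s Equation)
If $\tau$ is a stopping time with respect to an i.i.d. sequence $\{X_n : n \ge 1\}$ and if $E[\tau] < \infty$ and $E[|X|] < \infty$, then
$$E\left[\sum_{n=1}^{\tau} X_n\right] = E[\tau] \cdot E[X]$$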
Lemma 6
[TABLE]
- **Proof: **
[TABLE]
Note that the different are independent and identically distributed since each is chosen independently from the same distribution. Further, is a stopping time for , since for , the event can be determined by looking at the information till time , i.e. and checking the number of s for , that were equal to .
Further, we note that is finite, and is also finite. Using Wald’s equation (Theorem 1), we get
[TABLE]
Note that is the number of trials until successes, where the probability of a success is . Hence, is the sum of geometric random variables each with a parameter of , and .
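For reference, the expectation used in this step is the standard one for a sum of geometric random variables. The symbols below are generic and assumed for this note: $p$ is the per-trial success probability, $s$ the required number of successes, $T$ the number of trials until the $s$-th success, and $G_1, \ldots, G_s$ the waiting times between successive successes.

```latex
T = \sum_{i=1}^{s} G_i, \qquad
\Pr[G_i = t] = (1-p)^{t-1}\,p \quad (t \ge 1), \qquad
\mathbb{E}[G_i] = \frac{1}{p}
\;\Longrightarrow\;
\mathbb{E}[T] = \frac{s}{p}.
```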
From the definition of and conditioning on , we have with probability , with probability and with probability . Hence:
[TABLE]
Combining the above, the proof is complete.
Lemma 7
[TABLE]
- **Proof: **
We have , where we used Lemma 6. Using Lemma 5, the proof is complete.
Lemma 8
[TABLE]
- **Proof: **
Using Lemma 7 and Equation 1, we get the expected number of messages in epoch :
[TABLE]
Let denote the indicator random variable that is when and [math] otherwise. The total number of messages can be written as follows.
[TABLE]
Since is independent of the event , we have:
[TABLE]
where we have used Lemma 4 for an upper bound on the expected number of epochs.
Theorem 2
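The expected message complexity $E[\mu]$ of our algorithm is as follows.
- I: If $s \ge \frac{k}{8}$, then $E[\mu] = O\left(s\log\left(\frac{n}{s}\right)\right)$.
- II: If $s < \frac{k}{8}$, then $E[\mu] = O\left(\frac{k\log(n/s)}{\log(k/s)}\right)$.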
- **Proof: **
We note that the upper bounds on in Lemma 8 hold for any value of .
Case I: . In this case, we set . From Lemma 8,
[TABLE]
Case II: . We set , and get:
[TABLE]
5 Lower Bound
Theorem 3
For any constant $q > 0$, any correct protocol must send $\Omega\left(\frac{k\log(n/s)}{\log(1+k/s)}\right)$ messages with probability at least $1-q$, where the probability is taken over the protocol’s internal randomness.
- **Proof: **
Let . Define epochs as follows: in the -th epoch, , there are global stream updates, which can be distributed among the servers in an arbitrary way.
We consider a distribution on orderings of the stream updates. Namely, we think of a totally-ordered stream of updates, and in the -th epoch, we randomly assign the updates among the servers, independently for each epoch. Let the randomness used for the assignment in the -th epoch be denoted .
Consider the global stream of updates . Suppose we maintain a sample set of items without replacement. We let denote a random variable indicating the value of after seeing updates in the stream. We will use the following lemma about reservoir sampling.
Lemma 9
For any constant , there is a constant for which
- –
* changes at least times with probability at least , and*
- –
If and and , then with probability at least , over the choice of , there are at least epochs for which the number of times changes in the epoch is at least .
- **Proof: **
Consider the stream of updates. In the classical reservoir sampling algorithm [15], the sample is initialized to the first $s$ elements of the stream. Then, for each $i > s$, the $i$-th element is included in the current sample set with probability $s/i$, in which case a random item in the previous sample set is replaced with the $i$-th element.
For the first part of Lemma 9, let be an indicator random variable if causes to change. Let . Hence, for all , and , where is the -th Harmonic number. Then all of the , are independent indicator random variables. It follows by a Chernoff bound that
[TABLE]
For any , this is less than any constant , and so the first part of Lemma 9 follows since .
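The classical reservoir sampling procedure described in this proof, together with a count of how often the sample changes, can be sketched as follows. The function names are assumptions made for this illustration, with `s` the sample size and `n` the stream length; `expected_changes` simply evaluates the sum of the inclusion probabilities, which is the sample size times a difference of harmonic numbers.

```python
import math
import random


def reservoir_sample(stream, s, rng):
    """Classical reservoir sampling; returns (sample, number_of_changes)."""
    sample = []
    changes = 0
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)               # the first s items fill the reservoir
        elif rng.random() < s / i:            # item i enters with probability s/i
            sample[rng.randrange(s)] = item   # replace a uniformly random slot
            changes += 1
    return sample, changes


def expected_changes(n, s):
    """Sum of s/i for i = s+1..n, i.e. s * (H_n - H_s)."""
    return sum(s / i for i in range(s + 1, n + 1))


sample, changes = reservoir_sample(range(10_000), s=10, rng=random.Random(42))
mean_changes = expected_changes(10_000, 10)   # at most 10 * ln(1000)
```

Comparing `changes` with `mean_changes` over several seeds gives a feel for the concentration that the Chernoff bound in the proof formalizes.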
For the second part of Lemma 9, consider the -th epoch, , which contains consecutive updates. Let be the number of changes in this epoch. Then . Since can be written as a sum of independent indicator random variables, by a Chernoff bound,
[TABLE]
Hence, the expected number of epochs for which is at most , which is since we’re promised that and and . By a Markov bound, with probability at least , at most epochs satisfy . It follows that with probability at least , there are at least epochs for which the number of changes in the epoch is at least , as desired.
Corner Cases: When , the statement of Theorem 3 gives a lower bound of . In this case Theorem 3 follows immediately from the first part of Lemma 9 since these changes in must be communicated to the central coordinator. Hence, in what follows we can assume . Notice also that if , then , and so the theorem is independent of , and follows simply by the first part of Lemma 9. Notice also that if , then the statement of Theorem 3 amounts to proving an lower bound, which follows trivially since every site must send at least one message.
Thus, in what follows, we may apply the second part of Lemma 9.
Main Case: Let be a sufficiently small constant, depending on , to be determined below. Let be a possibly randomized protocol, which with probability at least , sends at most messages. We show that cannot be a correct protocol.
Let denote the random coin tosses of , i.e., the concatenation of random strings of all sites together with that of the central coordinator.
Let be the event that sends less than messages. By assumption, Hence, it is also the case that
[TABLE]
For a sufficiently small constant that may depend on , let be the event that there are at least epochs for which the number of times changes in the epoch is at least . By the second part of Lemma 9,
[TABLE]
It follows that there is a fixing of as well as a fixing of to for which occurs and
[TABLE]
Notice that the three (sets of) random variables and are independent, and so in particular, is still uniformly random given this conditioning.
By a Markov argument, if event occurs, then there are at least epochs for which at most messages are sent. If events and both occur, then by a union bound, there are at least epochs for which at most messages are sent and changes in the epoch at least times. Call such an epoch balanced.
Let be the epoch which is most likely to be balanced, over the random choices of , conditioned on and . Since at least epochs are balanced if and occur, and conditioned on event does occur, and occurs with probability at least given this conditioning, it follows that
[TABLE]
The property of being balanced is independent of for , so we also have
[TABLE]
If , then changes at least once in epoch . Suppose, for the moment, that this is the case. Suppose the first update in the global stream at which changes is the -th update. In order for to be balanced for at least a fraction of the , there must be at least different servers which receive , for which sends a message. In particular, since is deterministic conditioned on , at least messages must be sent in the -th epoch. But was chosen so that at most messages are sent, which is a contradiction for .
It follows that we reach a contradiction unless . Notice, though, that since is a constant, if , then this implies that . However, if , then , and so the theorem is independent of , and follows simply by the first part of Lemma 9.
Otherwise, we have reached a contradiction, and so it follows that messages must be sent with probability at least . Since , this completes the proof.
6 Sampling With Replacement
We now present an algorithm to maintain a random sample of size $s$ with replacement from $\mathcal{S}$. The basic idea is to run in parallel $s$ copies of the single item sampling algorithm from Section 3. Done naively, this will lead to a message complexity of $O\left(sk\frac{\log n}{\log k}\right)$. We obtain an improved algorithm based on the following ideas.
We view the distributed streams as $s$ logical streams, $\mathcal{S}^1, \ldots, \mathcal{S}^s$. Each $\mathcal{S}^i$ is identical to $\mathcal{S}$, but the algorithm assigns independent weights to the different copies of the same element in the different logical streams. Let $w^i(e)$ denote the weight assigned to element $e$ in $\mathcal{S}^i$. $w^i(e)$ is a random number between $0$ and $1$. For each $i = 1, \ldots, s$, the coordinator maintains the minimum weight, say $w^i$, among all elements in $\mathcal{S}^i$, and the corresponding element.
Let $\beta = \max_{i=1}^{s} w^i$; $\beta$ is maintained by the coordinator. Each site $j$ maintains $\beta_j$, a local view of $\beta$, which is always greater than or equal to $\beta$. Whenever a logical stream element at site $j$ has weight less than $\beta_j$, the site sends it to the coordinator, receives in response the current value of $\beta$, and updates $\beta_j$. When a random sample is requested at the coordinator, it returns the set of all minimum weight elements in all $s$ logical streams. It can be easily seen that this algorithm is correct, and at all times, returns a random sample of size $s$ selected with replacement. The main optimization relative to the naive approach described above is that when a site sends a message to the coordinator, it receives $\beta$, which provides partial information about all the $w^i$. This provides a substantial improvement in the message complexity and leads to the following bounds.
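The with-replacement scheme described above can be sketched in Python as follows. This is an illustration built from the prose only: the class and attribute names (`beta` for the largest of the per-stream minimum weights, `beta_i` for a site's local view of it) are assumptions, and the random interleaving of arrivals is chosen for the demonstration.

```python
import random


class ReplacementCoordinator:
    def __init__(self, s):
        self.min_weight = [1.0] * s     # smallest weight seen in each logical stream
        self.min_element = [None] * s   # element attaining that weight
        self.beta = 1.0                 # largest of the s minimum weights

    def receive(self, j, element, weight):
        """Process a report for logical stream j and reply with beta."""
        if weight < self.min_weight[j]:
            self.min_weight[j] = weight
            self.min_element[j] = element
            self.beta = max(self.min_weight)
        return self.beta


class ReplacementSite:
    def __init__(self, coordinator, s, rng):
        self.coordinator = coordinator
        self.s = s
        self.rng = rng
        self.beta_i = 1.0               # local view of beta, never below beta
        self.messages = 0

    def observe(self, element):
        # One independent weight per logical copy of the element.
        weights = [self.rng.random() for _ in range(self.s)]
        for j, weight in enumerate(weights):
            if weight < self.beta_i:
                self.beta_i = self.coordinator.receive(j, element, weight)
                self.messages += 2      # report plus reply
        return weights


def simulate_with_replacement(k, s, n, seed=0):
    rng = random.Random(seed)
    coordinator = ReplacementCoordinator(s)
    sites = [ReplacementSite(coordinator, s, rng) for _ in range(k)]
    true_min = [(1.0, None)] * s        # ground truth per logical stream
    for element in range(n):
        weights = sites[rng.randrange(k)].observe(element)
        for j, weight in enumerate(weights):
            if weight < true_min[j][0]:
                true_min[j] = (weight, element)
    return coordinator, sites, true_min


coordinator, sites, true_min = simulate_with_replacement(k=8, s=4, n=500, seed=3)
```

The list `coordinator.min_element` is the sample returned on a query. The same element can appear in more than one position, which is what makes this a sample with replacement.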
Theorem 4
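The above algorithm continuously maintains a sample of size $s$ with replacement from $\mathcal{S}$, and its expected message complexity is $O(s\log s\log n)$ in case $k \le 2s\log s$, and $O\left(k\frac{\log n}{\log\left(\frac{k}{s\log s}\right)}\right)$ in case $k > 2s\log s$.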
- **Proof: **
We provide a sketch of the proof here. The analysis of the message complexity is similar to the case of sampling without replacement. We sketch the analysis here, and omit the details. The execution is divided into epochs, where in epoch , the value of at the coordinator decreases by at least a factor of (a parameter to be determined later). Let denote the number of epochs. It can be seen that . In epoch , let denote the number of messages sent from the sites to the coordinator in the epoch, denote the value of at the beginning of the epoch, and denote the number of elements in that arrived in the epoch.
The elements in epoch give rise to logical elements, and each logical element has a probability of no more than of resulting in a message to the coordinator. Similar to the proof of Lemma 7, we can show using conditional expectations that (the factor comes in due to the fact that . Thus the expected total number of messages in epoch is bounded by , and in the entire execution is . By choosing for the case , and for the case , we get the desired result.
References
- [1] C. C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In VLDB, pages 607–618, 2006.
- [2] C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional monitoring without monotonicity. In ICALP (1), pages 95–106, 2009.
- [3] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In SODA, pages 633–634, 2002.
- [4] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD Conference, pages 28–39, 2003.
- [5] V. Braverman and R. Ostrovsky. Effective computations on sliding windows. SIAM Journal on Computing, 39(6):2113–2131, 2010.
- [6] V. Braverman, R. Ostrovsky, and C. Zaniolo. Optimal sampling from sliding windows. In PODS, pages 147–156, 2009.
- [7] G. Cormode and M. N. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In VLDB, pages 13–24, 2005.
- [8] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In SODA, pages 1076–1085, 2008.
