Parallel mining of time-faded heavy hitters

Massimo Cafaro; Marco Pulimeno; Italo Epicoco

arXiv:1701.03004·cs.DS·January 12, 2017


TL;DR

This paper introduces PFDCMSS, a parallel algorithm for efficiently mining time-faded heavy hitters that maintains high accuracy and scalability on message-passing architectures.

Contribution

It presents the first parallel algorithm for time-faded heavy hitters that is mergeable, accurate, and scalable, based on a novel augmented sketch data structure.

Findings

1. Retains the accuracy and error bounds of the sequential FDCMSS algorithm.

2. Achieves excellent parallel scalability on message-passing architectures.

3. Proves the mergeability of the augmented sketch data structure.

Abstract

We present PFDCMSS, a novel message-passing based parallel algorithm for mining time-faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly two counters, is mergeable. Whilst mergeability of traditional sketches derives immediately from theory, we show that merging our augmented sketch is non trivial. Nonetheless, the resulting parallel algorithm is fast and simple to implement. To the best of our knowledge, PFDCMSS is the first parallel algorithm solving the problem of mining time-faded heavy hitters on message-passing parallel architectures. Extensive experimental results confirm that PFDCMSS retains the extreme accuracy and error bound provided by FDCMSS whilst providing…

Tables

Table 1: Design of experiments. The input stream size $n$ is expressed in billions, $\rho$ denotes the skewness of the input data distribution, $\phi$ is the support threshold, $w$ represents the number of columns in the sketch data structure, $p$ is the number of cores and $gs$ denotes the grain size, i.e. the number of elements (expressed in billions) in the input stream for each core. The nf value stands for not fixed.

Exp.	Aim	Varying	$n$	$\rho$	$\phi$	$w$	$p$	$gs$
1	Algorithm accuracy	$n = \{1, 2, 4, 8\}$	nf	1.1	0.01	1340	16	0.5
1	Algorithm accuracy	$\rho = \{1, 1.4, 1.8, 2.2\}$	8	nf	0.01	1340	16	0.5
1	Algorithm accuracy	$\phi = \{0.001, 0.004, 0.008, 0.016\}$	8	1.1	nf	1340	16	0.5
1	Algorithm accuracy	$w = \{800, 1600, 3200, 6400\}$	8	1.1	0.01	nf	16	0.5
2	Parallel alg. accuracy	$p = \{1, 16, 128, 256, 512\}$	8	1.1	0.01	1340	nf	nf
3	Computational performance	$p = \{1, 16, 128, 256, 512\}$	8	1.1	0.01	1340	nf	nf
3	Computational performance	$p = \{1, 16, 128, 256, 512\}$	nf	1.1	0.01	1340	nf	0.5

Equations

$$\left|\mathcal{S}\right| = ||\textbf{f}||_{1},$$

$$\hat{f}_{v} - \hat{f}^{min} \leq f_{v} \leq \hat{f}_{v}, \quad v \in \mathcal{S},$$

$$f_{v} \leq \hat{f}^{min}, \quad v \notin \mathcal{S},$$

$$\hat{f}^{min} \leq \left\lfloor \frac{||\textbf{f}||_{1}}{k} \right\rfloor.$$

$$\hat{f}_{v} - f_{v} \leq \hat{f}^{min} \leq \left\lfloor \frac{||\textbf{f}||_{1}}{k} \right\rfloor, \quad v \in \mathcal{U}.$$

$$\left|\mathcal{S}\right| \leq ||\textbf{f}||_{1},$$

$$\left|\mathcal{N}\right| := \sum_{x \in N} f_{\mathcal{N}}(x),$$

$$\left|N\right| := Card(N) = \sum_{x \in N} 1.$$

$$\left|\mathcal{S}\right| = \left|\mathcal{N}\right|,$$

$$\hat{f_{\mathcal{S}}}(e) - \hat{f_{\mathcal{S}}}^{min} \leq f_{\mathcal{N}}(e) \leq \hat{f_{\mathcal{S}}}(e), \quad e \in \Sigma,$$

$$f_{\mathcal{N}}(e) \leq \hat{f_{\mathcal{S}}}^{min}, \quad e \notin \Sigma,$$

$$\hat{f_{\mathcal{S}}}^{min} \leq \left\lfloor \frac{\left|\mathcal{N}\right|}{2} \right\rfloor.$$

$$\left|\mathcal{S}_{i}\right| \leq \left|\mathcal{N}_{i}\right|, \quad i = 1, 2$$

$$\hat{f}_{\mathcal{S}_{C}}(e)=\begin{cases}\hat{f}_{\mathcal{S}_{1}}(e)+\hat{f}_{\mathcal{S}_{2}}(e), & e\in\Sigma_{1}\cap\Sigma_{2},\\ \hat{f}_{\mathcal{S}_{1}}(e)+\hat{f}_{\mathcal{S}_{2}}^{min}, & e\in\Sigma_{1}\setminus\Sigma_{2},\\ \hat{f}_{\mathcal{S}_{2}}(e)+\hat{f}_{\mathcal{S}_{1}}^{min}, & e\in\Sigma_{2}\setminus\Sigma_{1},\end{cases}$$

$$\left|\mathcal{S}_{C}\right| = \left|\mathcal{S}_{1}\right| + \left|\mathcal{S}_{2}\right| + x\delta = \left|\mathcal{N}_{1}\right| + \left|\mathcal{N}_{2}\right| + x\delta$$

$$\left|\mathcal{S}_{M}\right| = \left|\mathcal{N}_{1}\right| + \left|\mathcal{N}_{2}\right| + x\delta - \sum_{i=1}^{x} \hat{f}_{\mathcal{S}_{C}}(e_{i}),$$

Taxonomy

Topics: Data Mining Algorithms and Applications · Algorithms and Data Compression · Rough Sets and Fuzzy Logic

Full text

Parallel mining of time–faded heavy hitters

Massimo Cafaro, Marco Pulimeno, Italo Epicoco

University of Salento, Lecce, Italy

Abstract

We present PFDCMSS, a novel message–passing based parallel algorithm for mining time–faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly two counters, is mergeable. Whilst mergeability of traditional sketches derives immediately from theory, we show that merging our augmented sketch is non-trivial. Nonetheless, the resulting parallel algorithm is fast and simple to implement. To the best of our knowledge, PFDCMSS is the first parallel algorithm solving the problem of mining time–faded heavy hitters on message–passing parallel architectures. Extensive experimental results confirm that PFDCMSS retains the extreme accuracy and error bound provided by FDCMSS whilst providing excellent parallel scalability.

Keywords: message–passing, heavy hitters, time fading model, sketches.

1 Introduction

In this paper we deal with the problem of mining time–faded heavy hitters (also called frequent items) in parallel, and we present PFDCMSS, a novel message–passing based parallel algorithm which is a parallel version of the recently published FDCMSS sequential algorithm [Cafaro-Pulimeno-Epicoco-Aloisio].

Mining of heavy hitters in a data stream has been thoroughly studied, and the problem is regarded as one of the most important in the streaming algorithms literature. Depending on the particular application, the problem is reported in the literature as hot list analysis [Gibbons], market basket analysis [Brin] and iceberg query [Fang98computingiceberg], [Beyer99bottom-upcomputation].

Even though there are many possible applications, we recall here some of the most important contexts to which the problem has been successfully applied: network traffic analysis [DemaineLM02], [Estan], [Pan], analysis of web logs [Charikar], and computational and theoretical linguistics [CICLing].

All of the algorithms for detecting heavy hitters can be classified as being either counter or sketch based, the difference being that counter–based algorithms rely on a set of counters which are used to keep track of stream items, whilst sketch–based algorithms monitor the data stream by using a sketch data structure, often a bi-dimensional array data structure containing a counter in each cell. Stream items are mapped by hash functions to corresponding cells in the sketch. The former algorithms (counter–based) are deterministic, whilst the latter (sketch–based) are probabilistic.

Regarding counter–based algorithms, the first sequential algorithm was designed by Misra and Gries [Misra82]. Their algorithm was rediscovered independently about twenty years later by Demaine et al. [DemaineLM02] (this algorithm is known in the literature as the Frequent algorithm) and Karp et al. [Karp]. Among the counter–based algorithms developed since, we recall here Sticky Sampling and Lossy Counting [Manku02approximatefrequency], and Space Saving [Metwally2006]. Sketch–based solutions include CountSketch [Charikar], Group Test [Cormode-grouptest], Count-Min [Cormode05] and hCount [Jin03].

Relevant parallel algorithms include [cafaro-tempesta], [Cafaro-Pulimeno] and [Cafaro-Pulimeno-Tempesta], which are message-passing based parallel versions of the Frequent and Space Saving algorithms. Shared-memory algorithms have been designed as well, including a parallel version of Frequent [Zhang2013], a parallel version of Lossy Counting [Zhang2012], and parallel versions of Space Saving [Roy2012] [Das2009]. Further shared-memory parallel algorithms for heavy hitters were recently proposed in [Tangwongsan2014]. Finally, accelerator based algorithms exploiting a GPU (Graphics Processing Unit) include [Govindaraju2005] and [Erra2012]. Regarding related work, i.e., parallel algorithms specifically designed to solve the problem of mining time–faded heavy hitters, we are not aware of any other algorithm: to the best of our knowledge, ours is the first parallel algorithm solving the problem on message–passing parallel architectures.

In this paper, we are concerned with the problem of detecting heavy hitters in parallel in a stream, with the additional constraint that recent items must be weighted more than older items. The underlying assumption is that, in some applications, recent data is more useful and valuable than older, stale data. Therefore, each item in the stream has an associated timestamp that will be used to determine its weight. In practice, instead of estimating items’ frequencies, we are required to estimate items’ decayed frequencies.

This paper is organized as follows. We recall in Section 2 preliminary definitions and concepts that will be used in the rest of the manuscript. We present in Section 3 our PFDCMSS algorithm and formally prove in Section 4 its correctness. Next, we provide extensive experimental results in Section 5, showing that PFDCMSS retains the extreme accuracy and error bound provided by the sequential FDCMSS whilst providing excellent parallel scalability. Finally, we draw our conclusions in Section 6.

2 Preliminary definitions

In this Section we introduce preliminary definitions and the notation used throughout the paper. We deal with an input data stream $\sigma$ consisting of a sequence of $n$ items drawn from a universe $\mathcal{U}$; without loss of generality, let $m$ be the number of distinct items in $\sigma$, i.e., let $\mathcal{U}=\{1,2,\ldots,m\}$, which we will also denote as $[m]$. Let $f_{i}$ be the frequency of the item $i\in\mathcal{U}$ (i.e., its number of occurrences in $\sigma$), and denote the frequency vector by $\textbf{f}=(f_{1},\ldots,f_{m})$. Moreover, let $0<\phi<1$ be a support threshold, $0<\epsilon<1$ a tolerance such that $\epsilon<\phi$, and denote the 1-norm of $\textbf{f}$ (which represents the total number of occurrences of all of the stream items) by $||\textbf{f}||_{1}$.

In this paper, we are concerned with the problem of detecting frequent items in parallel in a stream, with the additional constraint that recent items must be weighted more than older items. The underlying assumption is that, in some applications, recent data is more useful and valuable than older, stale data. Therefore, each item in the stream has an associated timestamp that will be used to determine its weight. In practice, instead of estimating frequencies, we are required to estimate decayed frequencies. Two different models have been proposed in the literature: the sliding window and the time fading model. PFDCMSS works in the latter model. Furthermore, even though the basic ideas underlying the algorithm are also appropriate for an online distributed setting, here we are assuming that the entire dataset is available for offline processing.

The time fading model [recent-freq-items] [exp-decay] [Chen-Mei] does not use a window sliding over time; freshness of more recent items is instead emphasized by fading the frequency count of older items. This is achieved by computing the item’s decayed frequency through the use of a decay function that assigns greater weight to more recent occurrences of an item than to older ones: the older an occurrence is, the lower its decayed weight.

Definition 1.

Let $w(t_{i},t)$ be a decay function which computes the decayed weight at time $t$ for the occurrence of item $i$ arrived at time $t_{i}$. A decay function must satisfy the following properties:

1. $w(t_{i},t)=1$ when $t_{i}=t$ and $0\leq w(t_{i},t)\leq 1$ for all $t>t_{i}$;

2. $w$ is a monotone non-increasing function as time $t$ increases, i.e., $t^{\prime}\geq t\implies w(t_{i},t^{\prime})\leq w(t_{i},t)$.

Related work has mostly exploited backward decay functions, in which the weight of an item is a function of its age, $a$ , where the age at time $t>t_{i}$ is simply $a=t-t_{i}$ . In this case, $w(t_{i},t)$ is given by $w(t_{i},t)=\frac{h(t-t_{i})}{h(t-t)}=\frac{h(t-t_{i})}{h(0)}$ , where $h$ is a positive monotone non-increasing function.

The term backward decay stems from the aim of measuring from the current time back to the item’s timestamp. Prior algorithms and applications have been using backward exponential decay functions such as $h(a)=e^{-\lambda a}$ , with $\lambda>0$ as decaying factor.

In our algorithm, we use instead a forward decay function, defined as follows (see [forward-decay] for a detailed description of the forward decay approach). Under forward decay, the weight of an item is computed on the amount of time between the arrival of an item and a fixed point $L$ , called the landmark time, which, by convention, is some time earlier than the timestamps of all of the items. The idea is to look forward in time from the landmark to see an item, instead of looking backward from the current time.

Definition 2.

Given a positive monotone non-decreasing function $g$ , and a landmark time $L$ , the forward decayed weight of an item $i$ with arrival time $t_{i}>L$ measured at time $t\geq t_{i}$ is given by $w(t_{i},t)=\frac{g(t_{i}-L)}{g(t-L)}$ .

The denominator is used to normalize the decayed weight so that $w(t_{i},t)$ is always less than or equal to 1, as required by Definition 1.
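As a minimal sketch of Definition 2, the snippet below evaluates the forward decayed weight for two possible choices of $g$. The function names and the parameters `lam` and `beta` are illustrative assumptions; the text does not prescribe a specific $g$. With an exponential $g$ the weight reduces to $e^{-\lambda(t-t_{i})}$, i.e., it coincides with backward exponential decay.

```python
import math

def forward_decay_weight(t_i, t, landmark, g):
    """Forward decayed weight w(t_i, t) = g(t_i - L) / g(t - L) (Definition 2)."""
    if not (landmark < t_i <= t):
        raise ValueError("expected L < t_i <= t")
    return g(t_i - landmark) / g(t - landmark)

def exp_g(lam):
    """Exponential g(n) = exp(lam * n); lam is an illustrative parameter."""
    return lambda n: math.exp(lam * n)

def poly_g(beta):
    """Polynomial g(n) = n ** beta; beta is an illustrative parameter."""
    return lambda n: n ** beta

# An item that arrived at t_i = 8, queried at t = 10, with landmark L = 0.
w_exp = forward_decay_weight(t_i=8.0, t=10.0, landmark=0.0, g=exp_g(0.5))
# An item that arrived at t_i = 5, queried at t = 10, with quadratic g.
w_poly = forward_decay_weight(t_i=5.0, t=10.0, landmark=0.0, g=poly_g(2))
print(w_exp, w_poly)
```

Both weights lie in $[0,1]$ and equal 1 when $t_{i}=t$, in line with Definition 1.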

Definition 3.

The decayed frequency of an item $v$ in the input stream $\sigma$ , computed at time $t$ , is given by the sum of the decayed weights of all the occurrences of $v$ in $\sigma$ : $f_{v}(t)=\sum_{v_{i}=v}w(t_{i},t)$ .

Definition 4.

The decayed count at time $t$ , $C(t)$ , of a stream $\sigma$ of $n$ items is the sum of the decayed weights of all the items occurring in the stream: $C(t)=\sum_{i=1}^{n}w(t_{i},t)$ .
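Definitions 3 and 4 can be illustrated with a small exact computation. This is only a sketch on a toy stream: the helper name `decayed_frequencies`, the stream contents and the quadratic $g$ are assumptions made for the example.

```python
from collections import defaultdict

def decayed_frequencies(stream, t, landmark, g):
    """Return ({item: f_v(t)}, C(t)) for a list of (item, timestamp) pairs."""
    norm = g(t - landmark)          # normalization term g(t - L)
    freq = defaultdict(float)
    for item, t_i in stream:
        freq[item] += g(t_i - landmark) / norm
    # C(t) is the sum of the decayed weights of all occurrences.
    return dict(freq), sum(freq.values())

# Toy stream of (item, timestamp) pairs, queried at t = 4 with landmark L = 0.
stream = [("a", 1.0), ("b", 2.0), ("a", 4.0)]
freqs, count = decayed_frequencies(stream, t=4.0, landmark=0.0, g=lambda n: n ** 2)
print(freqs, count)
```

Note that the old occurrence of `a` at time 1 contributes far less than the recent one at time 4.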

The Approximate Time–Faded Heavy Hitters (ATFHH) problem is formally stated as follows.

Problem 1.

Approximate Time–Faded Heavy Hitters. Given a stream $\sigma$ of items with an associated timestamp, a threshold $0<\phi<1$ and a tolerance $0<\epsilon<1$ such that $\epsilon<\phi$, and letting $g$ be a decaying function used to determine the decayed frequencies and $t$ be the query time, return the set of items $F$, so that:

1. $F$ contains all of the items $v$ whose decayed frequency at time $t$ satisfies $f_{v}(t)>\phi C(t)$ (decayed frequent items);

2. $F$ does not contain any item $v$ such that $f_{v}(t)\leq(\phi-\epsilon)C(t)$.

In the following, when clear from the context, the query time shall be considered an implicit parameter, so we write $f_{v}$ and $C$ instead of $f_{v}(t)$ and $C(t)$ . The algorithm presented makes use of a Count–Min sketch data structure augmented by a Space Saving summary associated to each sketch cell. In the following, we recall the main properties of the Count–Min and the Space Saving algorithms in the case of non decaying frequencies, but the same properties also hold in a time-fading context.

Count–Min is based on a sketch whose dimensions are derived from the input parameters $\epsilon$, the error, and $\delta$, the probability of failure. In particular, for Count–Min $d=\lceil\ln 1/\delta\rceil$ is the number of rows in the sketch and $w=\lceil e/\epsilon\rceil$ is the number of columns. Every cell in the sketch is a counter, and the cells to be updated for a given item are selected by hash functions. By using this data structure, the algorithm solves with probability greater than or equal to $1-\delta$ the frequency estimation problem for arbitrary items. The algorithm may also be extended to solve the approximate frequent items problem, by using an additional heap data structure which is updated each time a cell is updated. Since in Count-Min the frequencies stored in the cells overestimate the true frequencies, a point query for an arbitrary item simply inspects all of the $d$ cells to which the item is mapped by the corresponding hash functions and returns the minimum of those $d$ counters.
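The description above can be sketched as a plain (non-augmented) Count–Min structure. This is a minimal illustration, not the authors' code: the class name, the `seed` parameter and the hash family $((ax+b) \bmod P) \bmod w$ with $P=2^{31}-1$ are assumptions, and items are assumed to be non-negative integers.

```python
import math
import random

class CountMin:
    """Plain Count-Min sketch: d = ceil(ln(1/delta)) rows, w = ceil(e/eps) columns."""

    def __init__(self, eps, delta, seed=0):
        self.d = math.ceil(math.log(1.0 / delta))
        self.w = math.ceil(math.e / eps)
        self.table = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        self.prime = 2_147_483_647  # Mersenne prime 2^31 - 1
        # One (a, b) pair per row for the hash ((a*x + b) mod prime) mod w.
        self.coeffs = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(self.d)]

    def _cell(self, row, item):
        a, b = self.coeffs[row]
        return ((a * item + b) % self.prime) % self.w

    def update(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.d):
            self.table[row][self._cell(row, item)] += count

    def point_query(self, item):
        # Every counter overestimates, so the minimum is the best estimate.
        return min(self.table[row][self._cell(row, item)] for row in range(self.d))

cm = CountMin(eps=0.01, delta=0.05)
for _ in range(5):
    cm.update(7)
for _ in range(2):
    cm.update(9)
print(cm.d, cm.w, cm.point_query(7), cm.point_query(9))
```

Because collisions can only add to a counter, a point query never underestimates the true frequency.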

Space Saving is a counter-based algorithm solving the heavy hitters problem. It makes use of a stream summary data structure composed of a given number of counters $k\ll n$, $n$ being the length of the stream. Each counter monitors an item in the stream and tracks its frequency. A substitution strategy is used when the algorithm processes an item not already monitored and all of the counters are occupied.

Let $\sigma$ be the input stream and denote by $\mathcal{S}$ the summary data structure of $k$ counters used by the Space Saving algorithm. Moreover, denote by $\left|\mathcal{S}\right|$ the sum of the counters in $\mathcal{S}$, by $f_{v}$ the exact frequency of an item $v$ and by $\hat{f}_{v}$ its estimated frequency, and let $\hat{f}^{min}$ be the minimum frequency in $\mathcal{S}$. If there exists at least one counter not monitoring any item, $\hat{f}^{min}$ is zero.

Finally, denote by $\textbf{f}=(f_{1},\ldots,f_{m})$ the frequency vector. The following relations hold (as shown in [Metwally2006]):

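$$\left|\mathcal{S}\right| = ||\textbf{f}||_{1}, \tag{1}$$

$$\hat{f}_{v} - \hat{f}^{min} \leq f_{v} \leq \hat{f}_{v}, \quad v \in \mathcal{S}, \tag{2}$$

$$f_{v} \leq \hat{f}^{min}, \quad v \notin \mathcal{S}, \tag{3}$$

$$\hat{f}^{min} \leq \left\lfloor \frac{||\textbf{f}||_{1}}{k} \right\rfloor. \tag{4}$$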

Therefore, it holds that

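$$\hat{f}_{v} - f_{v} \leq \hat{f}^{min} \leq \left\lfloor \frac{||\textbf{f}||_{1}}{k} \right\rfloor, \quad v \in \mathcal{U}. \tag{5}$$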
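The Space Saving behaviour recalled above can be sketched for unit weights as follows. The function name and the toy stream are assumptions made for illustration; ties on the minimum counter are broken by insertion order here, which is one possible choice among several.

```python
from collections import Counter

def space_saving(stream, k):
    """Unit-weight Space Saving with k counters; returns {item: estimated frequency}."""
    counters = {}
    for item in stream:
        if item in counters:                 # item already monitored
            counters[item] += 1
        elif len(counters) < k:              # a free counter is available
            counters[item] = 1
        else:                                # evict the item with minimum count
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

stream = list("aabacadaab")
summary = space_saving(stream, k=3)
exact = Counter(stream)
# Minimum monitored frequency; zero if some counter is still unused.
f_min = min(summary.values()) if len(summary) == 3 else 0
print(summary, f_min)
```

On this stream the counters sum to the stream length, every monitored estimate overshoots the exact frequency by at most `f_min`, and `f_min` does not exceed $\lfloor n/k\rfloor$.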

3 The algorithm

In this section, we start by recalling our sequential algorithm FDCMSS [Cafaro-Pulimeno-Epicoco-Aloisio]. The key data structure is an augmented Count–Min sketch $D$ , whose dimensions $d$ (rows) and $w$ (columns) are derived by the input parameters $\epsilon$ , the error, and $\delta$ , the probability of failure. Whilst every cell in an ordinary CM sketch contains a counter used for frequency estimation, in our case a cell holds a Space Saving stream summary with exactly two counters. The idea behind the augmented sketch is to monitor the time–faded items that the sketch hash functions map to the corresponding cells by an instance of Space Saving with two counters, so that for a given cell we are able to determine a majority item candidate with regard to the sub-stream of items falling in that cell.

Indeed, by using a data structure $\mathcal{S}$ with two counters in each cell, and letting $C_{i,j}$ denote the total decayed count of the items falling in the cell $D[i][j]$ , the majority item is, if it exists, the item whose decayed frequency is greater than $\frac{C_{i,j}}{2}$ . The corresponding majority item candidate in the cell is the item monitored by the Space Saving counter whose estimated decayed frequency is maximum. We have proved that, with high probability, if a time-faded item is frequent, then, in at least one of the sketch cells where it is mapped, it is a majority item with regard to the sub-stream of items falling in the same cell. Therefore, our algorithm will detect it.

Theorem 1.

If an item $i$ is frequent, then it appears as a majority item candidate in at least one of the $d$ cells in which it falls, with probability greater than or equal to $1-(\frac{1}{2\phi w})^{d}$ .
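To get a feeling for the bound in Theorem 1, the following sketch simply evaluates $1-(\frac{1}{2\phi w})^{d}$. The values $\phi=0.01$ and $w=1340$ are the defaults listed in Table 1; the number of rows $d=4$ is an assumption chosen for illustration. The bound is only informative when $2\phi w>1$.

```python
def detection_probability_lower_bound(phi, w, d):
    """Lower bound 1 - (1 / (2 * phi * w)) ** d from Theorem 1."""
    return 1.0 - (1.0 / (2.0 * phi * w)) ** d

# phi and w as in Table 1; d = 4 is an illustrative assumption.
single_row = detection_probability_lower_bound(phi=0.01, w=1340, d=1)
four_rows = detection_probability_lower_bound(phi=0.01, w=1340, d=4)
print(single_row, four_rows)
```

Adding rows drives the failure term down geometrically, since each row fails independently with probability at most $\frac{1}{2\phi w}$ under the theorem's bound.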

Regarding the error bound of our algorithm, let $f_{i}$ be the exact decayed frequency of item $i$ in the stream $\sigma$ and $\hat{f}_{i}$ be the estimated decayed frequency of item $i$ returned by FDCMSS. Let $C$ be the total decayed count of all of the items in the stream. We have proved the following error bound.

Theorem 2.

$\forall u\in[m]$ , $\hat{f}_{u}$ estimates the exact decayed count $f_{u}$ of $u$ at query time with error less than $\epsilon C$ and probability greater than $1-\delta$ .

The proofs of the aforementioned theorems can be found in [Cafaro-Pulimeno-Epicoco-Aloisio].

The algorithm’s initialization requires as input parameters $\epsilon$, the error; $\delta$, the probability of failure; and $\phi$, the support threshold. The initialization returns a sketch $D$. The procedure starts by deriving $d=\lceil\ln 1/\delta\rceil$, the number of rows in the sketch, and $w=\lceil\frac{e}{2\epsilon}\rceil$, the number of columns in the sketch. Then, for each of the $d*w$ cells available in the sketch $D$ we allocate a data structure $\mathcal{S}$ with two Space Saving counters $c_{1}$ and $c_{2}$. Given a counter $c_{j},j=1,2$, we denote by $c_{j}.i$ and $c_{j}.f$ respectively the counter’s item and its estimated decayed frequency. Finally, we set the support threshold to $\phi$, select $d$ pairwise independent hash functions $h_{1},\ldots,h_{d}:[m]\rightarrow[w]$, mapping $m$ distinct items into $w$ cells, and initialize the count variable, representing the total decayed count of all of the items in the stream, to zero.
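A minimal sketch of the allocation step described above, assuming a nested-list layout in which each cell stores two `[item, decayed_frequency]` counters. The layout and the function name are illustrative choices, and the selection of the hash functions is omitted.

```python
import math

def init_sketch(eps, delta):
    """Derive d and w and allocate a d x w sketch of two-counter cells."""
    d = math.ceil(math.log(1.0 / delta))       # number of rows
    w = math.ceil(math.e / (2.0 * eps))        # number of columns
    # Each cell holds two Space Saving counters as [item, decayed_frequency];
    # a frequency of 0.0 marks a counter that monitors no item yet.
    sketch = [[[[None, 0.0], [None, 0.0]] for _ in range(w)] for _ in range(d)]
    return d, w, sketch

d, w, sketch = init_sketch(eps=0.001, delta=0.04)
count = 0.0  # total (non-normalized) decayed count of the stream
print(d, w)
```

Each cell is a distinct object, so updating one cell never aliases another.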

Updating the sketch upon arrival of a stream item $i$ with timestamp $t_{i}$, shown in pseudo-code as Algorithm 1, requires computing $x$, which is the non-normalized forward decayed weight of the item, and incrementing count by $x$. Then, we update the $d$ cells to which the item is mapped by the corresponding hash functions $h_{j}(i),j=1,\dots,d$, by using the Space Saving item update procedure.

Let $\mathcal{S}$ denote the Space Saving stream summary data structure with two counters corresponding to the cell to be updated. Updating $\mathcal{S}$ upon arrival of an item works as follows. When processing an item which is already monitored by a counter, its estimated frequency is incremented by the non-normalized weight $x$. When processing an item which is not already monitored by one of the available counters, there are two possibilities. If a counter is available, it will be in charge of monitoring the item, and its estimated frequency is set to the non-normalized weight $x$. Otherwise, if all of the counters are already occupied (their frequencies are different from zero), the counter storing the item with minimum frequency is incremented by the non-normalized weight $x$. Then, the monitored item is evicted from the counter and replaced by the new item. This happens since an item which is not monitored cannot have a frequency greater than the minimal frequency.
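The three cases of the cell update can be sketched as follows. The cell layout (a list of two `[item, frequency]` counters, with frequency 0.0 marking a free counter) and the function name are assumptions for illustration, not the paper's Algorithm 1.

```python
def update_cell(cell, item, x):
    """Space Saving update of a two-counter cell with non-normalized weight x."""
    for counter in cell:
        if counter[1] > 0.0 and counter[0] == item:   # already monitored
            counter[1] += x
            return
    for counter in cell:
        if counter[1] == 0.0:                         # free counter available
            counter[0], counter[1] = item, x
            return
    # Both counters occupied: the minimum counter is incremented by x
    # and its monitored item is replaced by the new item.
    victim = min(cell, key=lambda c: c[1])
    victim[0] = item
    victim[1] += x

cell = [[None, 0.0], [None, 0.0]]
for item, x in [("a", 2.0), ("b", 1.0), ("a", 1.5), ("c", 0.5)]:
    update_cell(cell, item, x)
print(cell)
```

After these four updates the two counters sum to the total weight inserted into the cell, which is the quantity a plain Count–Min counter would hold.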

PFDCMSS, the parallel version of our sequential algorithm, works as follows. We assume the offline setting in which the stream items have been stored as a static dataset along with the corresponding timestamps. It is worth noting here immediately that our algorithm works in the streaming (online) setting as well. Indeed, in the former case (offline setting) we partition the input dataset and timestamps using a simple 1D block-based domain decomposition among the available $p$ processes and then process in parallel the sub-streams assigned to the processes using Algorithm 1. In the latter case (online setting), we have instead $p$ distributed sites, each handling a different stream $\sigma_{i},i=1,\ldots,p$ processed again using Algorithm 1.
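For the offline setting, a 1D block decomposition can be sketched as below. The text does not say how a remainder is distributed when $p$ does not divide the number of items; giving one extra item to each of the first ranks is an assumption, as is the function name.

```python
def block_range(n, p, rank):
    """Half-open index range [lo, hi) of the block owned by `rank`
    when n items are split among p processes."""
    base, extra = divmod(n, p)
    # The first `extra` ranks receive one additional item each (assumption).
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

ranges = [block_range(10, 3, r) for r in range(3)]
print(ranges)
```

Each process would then run the sequential update over the items and timestamps in its own range.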

In the parallel version, once the sub-streams have been processed, one of the processes is in charge of determining the time–faded heavy hitters. In order to do so, all of the processes engage in a parallel reduction in which their sketches are merged into a global sketch which preserves all of the information stored in the local sketches. This sketch is then queried and the time–faded heavy hitters are returned.

In the distributed setting, one of the sites may act as a centralized coordinator or there can be another different site taking this responsibility. The coordinator broadcasts, when required, a “query” message to the $p$ sites, which then temporarily stop processing their sub-streams and engage in the sketch merge procedure. We can imagine the distributed sites as being multi-threaded processes, in which one thread executes Algorithm 1, temporarily stops when a query message is received from the coordinator, creates a copy of its local sketch and then resumes stream processing whilst another thread engages in the distributed sketch merging procedure using the sketch copy.

In order to retrieve the time–faded heavy hitters, a query can be posed when needed. The query, shown in pseudo-code as Algorithm 2, starts by determining the global decayed count for the whole stream $\sigma$. This requires a parallel reduction in which the local decayed counts are summed. It is worth noting here that the global decayed count is still non-normalized; the normalization occurs by dividing by $g(t-L)$, where $t$ is the query time and $L$ denotes the landmark time. Then, we build, through a user-defined parallel reduction, a global sketch $G$ which is obtained by merging the local sketches. To do so, each process invokes a parallel reduction by using the MergeSketch operator shown in pseudo-code as Algorithm 3.

The sketches are reduced as follows: for every corresponding cell in two sketches to be merged, the hosted Space Saving summaries are merged following the steps described in [Cafaro-Pulimeno-Tempesta], i.e., building a temporary summary $\mathcal{S}_{C}$ consisting of all of the items monitored by both $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$. Each item in $\mathcal{S}_{C}$ is assigned a decayed frequency computed as follows: if an item is present in both $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$, its frequency is the sum of its corresponding frequencies in each summary; if the item is present in only one of $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$, its frequency is incremented by the minimum frequency of the other summary. Finally, in order to derive the merged summary, we take only the $2$ items in $\mathcal{S}_{C}$ with the greatest frequencies and discard the others.
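The merge steps just described can be sketched as follows, with summaries represented as `{item: frequency}` dictionaries. This is an illustrative sketch, not the paper's MergeSketch operator: the representation, the function names, the tie-breaking rule and the use of `functools.reduce` as a sequential stand-in for the message-passing reduction are all assumptions.

```python
from functools import reduce

def merge_summaries(s1, s2, k=2):
    """Merge two Space Saving summaries given as {item: estimated frequency}."""
    # The minimum frequency is zero when a summary has a free counter.
    min1 = min(s1.values()) if len(s1) == k else 0.0
    min2 = min(s2.values()) if len(s2) == k else 0.0
    combined = {}
    for item in set(s1) | set(s2):
        # An item missing from one summary is credited that summary's minimum.
        combined[item] = s1.get(item, min1) + s2.get(item, min2)
    # Keep the k items with the greatest frequencies (ties broken by name).
    top = sorted(combined.items(), key=lambda kv: (-kv[1], str(kv[0])))[:k]
    return dict(top)

def merge_sketches(d1, d2):
    """Cell-wise merge of two augmented sketches of identical dimensions."""
    return [[merge_summaries(c1, c2) for c1, c2 in zip(r1, r2)]
            for r1, r2 in zip(d1, d2)]

# Three local 1 x 1 sketches, as if produced by three processes.
local_sketches = [
    [[{"a": 5.0, "b": 3.0}]],
    [[{"a": 2.0, "c": 4.0}]],
    [[{"d": 1.0}]],
]
global_sketch = reduce(merge_sketches, local_sketches)
print(global_sketch)
```

In this example the counters of the merged cell sum to 15, the total of the three input cells, which matches the property the merge is meant to preserve.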

It is worth noting here that the sum of the counters in the stream summary data structure $\mathcal{S}$ related to a given cell $D[i][j]$ is equal to the value that the Count–Min sketch–based algorithm would store in the counter variable corresponding to that cell, i.e., the 1-norm of the frequency vector corresponding to the sub–stream falling in the cell through the pairwise independent hash functions. Thus, an augmented sketch is equivalent, from this perspective, to a Count–Min sketch and this property is preserved by the merge procedure. From now on we will call this property 1-norm equivalence.

However, merging Count–Min sketches simply requires adding the corresponding cells’ counters. Indeed, via linearity, the sum of sketches is equal to the sketch of the sums. Instead, in our case, we need an ad hoc procedure in order to correctly merge the two Space Saving stream summaries hosted by the corresponding cells so that 1-norm equivalence property is preserved. Nonetheless, the augmented sketch which results from our parallel merge reduction is 1-norm equivalent to the Count–Min sketch obtained by summing the Count–Min sketches corresponding to our augmented sketches which are the input of the parallel merge reduction.

Once the global sketch is obtained, the query procedure initializes $F$ , an empty set, and then it inspects each of the $d*w$ cells in the sketch $D$ . For a given cell, we determine $c_{m}$ , the counter in the data structure $\mathcal{S}$ with maximum decayed count. We normalize the decayed count stored in $c_{m}$ dividing by $g(t-L)$ , and then compare this quantity with the threshold given by $\phi*gCount$ ( $gCount$ being the normalized global decayed count). If the normalized decayed frequency is greater, we pose a point query for the item $c_{m}.i$ , shown in pseudo-code as Algorithm 4. If $p$ , the returned value, is greater than the threshold $\phi*gCount$ , then we insert in $F$ the pair $(c_{m}.i,p)$ .

The point query for an item $j$ returns its estimated decayed frequency. After initializing the answer variable to infinity, we inspect each of the $d$ cells in which the item is mapped to by the corresponding hash functions, to determine the minimum decayed frequency of the item. In each cell, if the item is stored by one of the Space Saving counters, we set answer to the minimum between answer and the corresponding counter’s decayed frequency. Otherwise (none of the two counters monitors the item $j$ ), we set answer to the minimum between answer and the minimum decayed frequency stored in the counters. Since the frequencies stored in all of the counters of the sketch are not normalized, we return the normalized frequency answer dividing by $g(t-L)$ .

At the end of the query procedure the set $F$ is returned.
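The query and point query steps can be sketched on a small hand-built sketch. Everything concrete here is an assumption made for illustration: the $2\times 2$ sketch contents, the two toy hash functions (which are not pairwise independent), the normalization value, and the use of a dictionary in place of the set of pairs $F$.

```python
import math

def point_query(sketch, hashes, item, norm):
    """Estimated decayed frequency of item, normalized by norm = g(t - L)."""
    answer = math.inf
    for row, h in zip(sketch, hashes):
        cell = row[h(item)]
        monitored = [f for (i, f) in cell if i == item and f > 0.0]
        if monitored:
            answer = min(answer, monitored[0])
        else:
            # Item not monitored: fall back to the minimum counter in the cell.
            answer = min(answer, min(f for (_, f) in cell))
    return answer / norm

def query(sketch, hashes, phi, count, norm):
    """Return {item: estimate} for candidates whose estimate exceeds phi * gCount."""
    threshold = phi * (count / norm)
    result = {}
    for row in sketch:
        for cell in row:
            item, freq = max(cell, key=lambda c: c[1])   # majority item candidate
            if item is not None and freq / norm > threshold:
                p = point_query(sketch, hashes, item, norm)
                if p > threshold:
                    result[item] = p
    return result

# Toy hash functions over items 0..3 (illustrative only).
hashes = [lambda x: x % 2, lambda x: (x // 2) % 2]
# Cells store two [item, non-normalized decayed frequency] counters.
sketch = [
    [[[0, 6.0], [2, 2.0]], [[1, 1.0], [3, 1.0]]],
    [[[0, 6.0], [1, 1.0]], [[2, 2.0], [3, 1.0]]],
]
heavy = query(sketch, hashes, phi=0.25, count=10.0, norm=2.0)
print(heavy)
```

Only item 0 passes both the candidate test and the point query against the threshold $\phi\cdot gCount = 1.25$ in this toy instance.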

4 Correctness

Here, we prove that our algorithm correctly merges two FDCMSS sketches. The merge procedure preserves all of the properties of the sketch, including the fact that, considering the sum of the Space Saving counters in each sketch cell, an FDCMSS sketch is 1-norm equivalent to the classical Count–Min sketch.

It is worth noting here that we would obtain a correct result by using the merge procedure presented in [Cafaro-Pulimeno-Tempesta] to combine the Space Saving summaries stored in the corresponding sketch cells, but we also want to impose 1-norm equivalence, i.e., the additional condition that the sum of counters’ values in each merged cell always reflects the total decayed count of the items which fell in the corresponding cells.

Indeed, in [Cafaro-Pulimeno-Tempesta] we showed how to merge Space Saving stream summaries in parallel. However, there we proved that the merge procedure satisfies the Space Saving properties described by eq. 2-5 and only the following relaxed version of eq. 1:

$$\left|\mathcal{S}\right| \leq ||\textbf{f}||_{1}.$$

As shown in Theorem 3, which is the main result of this section, it turns out that $k=2$ counters (i.e., majority item mining) is a special case: when the Space Saving summaries to be merged hold two counters, then the property in eq. 1 holds for the merged summary in its original form, that is $\left|\mathcal{S}\right|=||\textbf{f}||_{1}$, without modifying the merge procedure designed in [Cafaro-Pulimeno-Tempesta].

Theorem 3.

The parallel merge algorithm provides an augmented sketch that preserves all of the properties of a FDCMSS sketch.

Proof.

The correctness of the parallel FDCMSS sketch merge algorithm derives from the correctness of the Space Saving merge procedure, already shown in [Cafaro-Pulimeno-Tempesta]. It remains to show that, when looking at the sum of the Space Saving counters associated with each cell, the merged augmented sketch is still 1-norm equivalent to a Count–Min sketch, that is, the sum of the counters’ values is equal to the decayed count of all the items that have fallen into that cell.

Let us recall the merge algorithm for Space Saving summaries introduced in [Cafaro-Pulimeno-Tempesta]. We will use the multiset notation, thus let us rewrite the properties of a Space Saving summary stated in equations 1-4, this time with reference to multisets. Indeed, we model the input stream as a multiset (also called a bag), which essentially is a set where the duplication of elements is allowed. We shall use a calligraphic capital letter to denote a multiset, and the corresponding capital Greek letter to denote its underlying set. In particular, we extend the traditional notion of multiset as follows. Instead of considering an indicator function which returns the multiplicity of an item, we use a function providing the decayed frequency of that item. Therefore, summing over all of the items we obtain the total decayed count in place of the cardinality of the multiset.

Definition 5.

A decayed multiset $\mathcal{N}=(N,f_{\mathcal{N}})$ is a pair where $N$ is some set, called the underlying set of elements, and $f_{\mathcal{N}}:N\rightarrow\mathbb{R}$ is a function which provides the decayed frequency for each $x\in N$ according to Definition 3.

The decayed count of $\mathcal{N}$ is expressed by

$$\left|\mathcal{N}\right|=\sum_{x\in N}f_{\mathcal{N}}(x)$$

whilst the cardinality of the underlying set $N$ is

$$\left|N\right|=\sum_{x\in N}1$$
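As a minimal sketch of Definition 5, a decayed multiset can be modelled as a mapping from items to decayed frequencies. The class name `DecayedMultiset` and its method names are hypothetical, and the decayed frequencies are supplied directly here instead of being computed from the decay function of Definition 3.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass
class DecayedMultiset:
    """Hypothetical model of a decayed multiset (N, f_N).

    `freq` maps each element of the underlying set N to its decayed
    frequency; the values are assumed to be given, not derived here.
    """
    freq: Dict[Hashable, float] = field(default_factory=dict)

    def decayed_count(self) -> float:
        # |N| for the multiset: sum of the decayed frequencies of all items.
        return sum(self.freq.values())

    def underlying_cardinality(self) -> int:
        # |N| for the underlying set: number of distinct items.
        return len(self.freq)


example = DecayedMultiset({"a": 1.5, "b": 0.25, "c": 2.0})
total = example.decayed_count()
distinct = example.underlying_cardinality()
```

With these illustrative values the decayed count is the sum of the three frequencies, while the underlying set has three elements.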

From now on, when referring to either the exact or estimated frequency of an item, we shall mean the item’s exact or estimated decayed frequency. Recall that our Space Saving stream summary data structure uses exactly $k=2$ counters, and let $\mathcal{N}=(N,f_{\mathcal{N}})$ be the input decayed multiset, $\mathcal{S}=(\Sigma,\hat{f}_{\mathcal{S}})$ the decayed multiset of all of the monitored items and their respective counters at the end of the sequential Space Saving algorithm’s execution, i.e., the algorithm’s summary data structure. Let $\left|\mathcal{S}\right|$ be the sum of the frequencies stored in the counters, $f_{\mathcal{N}}(e)$ the exact frequency of an item $e$ , $\hat{f}_{\mathcal{S}}(e)$ its estimated frequency and $\hat{f}_{\mathcal{S}}^{min}$ the minimum frequency in $\mathcal{S}$ , where $\hat{f}_{\mathcal{S}}^{min}=0$ when $\left|{\Sigma}\right|<2$ . Indeed, even though a summary data structure has exactly 2 counters, it may monitor fewer than 2 items, since an item is actually monitored if and only if its counter’s frequency is different from zero. The following relations hold, for each item $e\in N$ :

[TABLE]
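To make the two-counter summary concrete, the following is a minimal sketch of the textbook weighted Space Saving update restricted to $k=2$ counters. It is not necessarily the exact update used inside FDCMSS: the function name `space_saving_k2` is hypothetical, and each weight simply stands in for an item's decayed contribution, assumed strictly positive. The sketch illustrates that the counters sum to the total processed weight and never underestimate a monitored item.

```python
from typing import Dict, Hashable, Iterable, Tuple


def space_saving_k2(stream: Iterable[Tuple[Hashable, float]]) -> Dict[Hashable, float]:
    """Textbook weighted Space Saving with exactly two counters.

    Each stream element is (item, weight); the weight is an assumed
    stand-in for the decayed contribution and must be positive.
    """
    counters: Dict[Hashable, float] = {}
    for item, weight in stream:
        if item in counters:
            # Item already monitored: add its weight.
            counters[item] += weight
        elif len(counters) < 2:
            # A free counter exists: start monitoring the item.
            counters[item] = weight
        else:
            # Evict the minimum counter; the newcomer inherits its value.
            victim = min(counters, key=counters.get)
            min_value = counters.pop(victim)
            counters[item] = min_value + weight
    return counters


stream = [("a", 1.0), ("b", 2.0), ("a", 1.0), ("c", 0.5), ("a", 1.0)]
summary = space_saving_k2(stream)
total_weight = sum(w for _, w in stream)
```

Because an eviction transfers the evicted value to the new item, no weight is ever lost, which is why the sum of the two counters equals the total weight of the stream.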

Now, let $\mathcal{S}_{1}=(\Sigma_{1},\hat{f}_{\mathcal{S}_{1}})$ and $\mathcal{S}_{2}=(\Sigma_{2},\hat{f}_{\mathcal{S}_{2}})$ be two summaries related respectively to the input sub-arrays $\mathcal{N}_{1}=(N_{1},f_{\mathcal{N}_{1}})$ and $\mathcal{N}_{2}=(N_{2},f_{\mathcal{N}_{2}})$ , with $\mathcal{N}=\mathcal{N}_{1}\uplus\mathcal{N}_{2}=(N,f_{\mathcal{N}})$ . Let $\mathcal{S}_{M}=(\Sigma_{M},\hat{f}_{\mathcal{S}_{M}})$ be the final merged summary.

Theorem 3 in [Cafaro-Pulimeno-Tempesta] states that if eqs. (10) - (12) hold for $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ and if a relaxed version of eq. (9) is verified, i.e., it holds that

$$\left|\mathcal{S}\right|\leq\left|\mathcal{N}\right|\qquad(13)$$

then these properties continue to be true also for $\mathcal{S}_{M}$ (it is worth noting here that eq. (13) also holds for summaries produced by the sequential Space Saving algorithm). The authors show that this is enough to guarantee the correctness of the merge operation, but, in general $\left|{\mathcal{S}_{M}}\right|\leq\left|{\mathcal{N}}\right|$ .

In order to obtain $\mathcal{S}_{M}$ , we start by combining $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ to obtain $\mathcal{S}_{C}$ , and then, if $\left|\Sigma_{C}\right|>2$ , we take the two counters with the greatest frequency values in $\mathcal{S}_{C}$ in order to build $\mathcal{S}_{M}$ , otherwise we return $\mathcal{S}_{M}=\mathcal{S}_{C}$ .

We can express the combine operation as shown by the following equation:

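$$\mathcal{S}_{C}=(\Sigma_{C},\hat{f}_{\mathcal{S}_{C}}),\qquad\Sigma_{C}=\Sigma_{1}\cup\Sigma_{2},\qquad\hat{f}_{\mathcal{S}_{C}}(e)=\begin{cases}\hat{f}_{\mathcal{S}_{1}}(e)+\hat{f}_{\mathcal{S}_{2}}(e)&\text{if }e\in\Sigma_{1}\cap\Sigma_{2},\\\hat{f}_{\mathcal{S}_{1}}(e)+\hat{f}_{\mathcal{S}_{2}}^{min}&\text{if }e\in\Sigma_{1}\setminus\Sigma_{2},\\\hat{f}_{\mathcal{S}_{2}}(e)+\hat{f}_{\mathcal{S}_{1}}^{min}&\text{if }e\in\Sigma_{2}\setminus\Sigma_{1}.\end{cases}$$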
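The combine-and-prune procedure described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the names `min_frequency`, `combine` and `merge` are hypothetical, summaries are modelled as plain dictionaries, and the sketch assumes that an item monitored in only one summary is credited with the other summary's minimum frequency (taken as 0 when that summary monitors fewer than 2 items).

```python
from typing import Dict, Hashable

Summary = Dict[Hashable, float]  # item -> estimated decayed frequency


def min_frequency(summary: Summary) -> float:
    """Minimum counter value, or 0 when fewer than 2 items are monitored."""
    return min(summary.values()) if len(summary) == 2 else 0.0


def combine(s1: Summary, s2: Summary) -> Summary:
    """Build S_C: common items add up, the others gain the other minimum."""
    m1, m2 = min_frequency(s1), min_frequency(s2)
    combined: Summary = {}
    for item in set(s1) | set(s2):
        # A missing counter is replaced by that summary's minimum frequency.
        combined[item] = s1.get(item, m1) + s2.get(item, m2)
    return combined


def merge(s1: Summary, s2: Summary) -> Summary:
    """Build S_M: keep the two largest counters of S_C (all of them if <= 2)."""
    combined = combine(s1, s2)
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:2])


s1 = {"a": 5.0, "b": 3.0}
s2 = {"b": 7.0, "c": 2.0}
merged = merge(s1, s2)
```

In this illustrative run the combined summary holds three counters, and the one that is pruned carries exactly the sum of the two minimum frequencies, so the merged counters still add up to the total of the two inputs.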

In the special case of stream summaries holding exactly $k=2$ counters, it holds that for $i=1,2$ , $\left|\Sigma_{i}\right|\leq 2$ , and $\left|\Sigma_{C}\right|\leq 4$ . Now, suppose that $\left|\mathcal{S}_{i}\right|=\left|\mathcal{N}_{i}\right|$ (this is true when $\mathcal{S}_{i}$ is produced by the sequential Space Saving algorithm) and let $\delta=\hat{f}_{\mathcal{S}_{1}}^{min}+\hat{f}_{\mathcal{S}_{2}}^{min}$ and $x=\left|{\Sigma_{C}}\right|-2$ . Furthermore, suppose that the entries in $\mathcal{S}_{C}$ are sorted in ascending order with regard to the counters’ frequencies.

As proved in [Cafaro-Pulimeno-Tempesta], it holds that:

[TABLE]

where the sum is extended over the first $x$ entries.

We have to show that the difference $x\delta-\sum_{i=1}^{x}\hat{f}_{\mathcal{S}_{C}}(e_{i})$ is always equal to zero when $k=2$ , so that $\left|{\mathcal{S}_{M}}\right|=\left|{\mathcal{N}}\right|$ .

When $x\leq 0$ , $x\delta=0$ . In that case, $\mathcal{S}_{M}=\mathcal{S}_{C}$ and $\left|{\mathcal{S}_{C}}\right|=\left|{\mathcal{N}}\right|$ .

When $x>0$ , the first $x$ counters of $\mathcal{S}_{C}$ have values equal to $\delta$ . To see this, consider the two cases $x=1$ and $x=2$ .

When $x=2$ , that is, the two summaries to be merged contain different items and $\left|{\Sigma_{C}}\right|=4$ , this is easily seen by simple computations: in fact, $\delta$ is the minimum value a counter in $\mathcal{S}_{C}$ can assume, and there are at least two counters with this value in $\mathcal{S}_{C}$ , obtained by combining the two counters with minimum value in $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ . As a consequence these counters are the first two, and $x\delta-\sum_{i=1}^{x}\hat{f}_{\mathcal{S}_{C}}(e_{i})=0$ .

When $x=1$ , one of the following cases arises:

1. one of the summaries (without loss of generality, let us suppose it is $\mathcal{S}_{1}$ ) contains two counters, the other summary ( $\mathcal{S}_{2}$ ) contains only one counter and no item is in common between the summaries. In this case, $\delta$ is equal to the minimum counter in $\mathcal{S}_{1}$ since $\hat{f}_{\mathcal{S}_{2}}^{min}=0$ , but it is also the minimum counter in $\mathcal{S}_{C}$ , hence it holds that $\delta-\hat{f}_{\mathcal{S}_{C}}(e_{1})=0$ ;

2. both summaries contain two counters and they have exactly one item in common. In this case we further have to distinguish three cases:

(a) the item in common has the minimum frequency in both the summaries. The combined frequency of this item will be equal to $\delta$ , which is the sum of the minimum frequencies of the two summaries. Its combined frequency is also the minimum in $\mathcal{S}_{C}$ , hence it holds that $\delta-\hat{f}_{\mathcal{S}_{C}}(e_{1})=0$ ;

(b) the item in common has the maximum frequency in both the summaries. Its combined frequency is also the maximum value in $\mathcal{S}_{C}$ , and $\mathcal{S}_{C}$ contains two distinct items with combined frequency equal to $\delta$ , which is also the minimum in $\mathcal{S}_{C}$ , hence $\delta-\hat{f}_{\mathcal{S}_{C}}(e_{1})=0$ ;

(c) the item in common appears with minimum frequency in one summary (without loss of generality, let us suppose in $\mathcal{S}_{1}$ ) and with maximum frequency in the other summary ( $\mathcal{S}_{2}$ ). The combined frequency of the item which appears with minimum frequency in $\mathcal{S}_{2}$ is equal to $\delta$ , which again is the minimum frequency of the counters in $\mathcal{S}_{C}$ , hence $\delta-\hat{f}_{\mathcal{S}_{C}}(e_{1})=0$ .

Taking into account that in all of the cases when $x=1$ the combined summary $\mathcal{S}_{C}$ contains at least one item whose combined frequency is equal to $\delta$ , it holds that $x\delta-\sum_{i=1}^{x}\hat{f}_{\mathcal{S}_{C}}(e_{i})=0$ .

∎
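As a worked illustration of the case in which the common item has minimum frequency in one summary and maximum frequency in the other, consider the following hypothetical numbers (the items $a$, $b$, $c$ and their counter values are invented for the example, and the combine step is assumed to add the other summary's minimum frequency to an item monitored in one summary only):

$$
\begin{aligned}
\mathcal{S}_{1}&=\{a\mapsto 5,\;b\mapsto 3\}, &\hat{f}_{\mathcal{S}_{1}}^{min}&=3, &\left|\mathcal{S}_{1}\right|&=8,\\
\mathcal{S}_{2}&=\{b\mapsto 7,\;c\mapsto 2\}, &\hat{f}_{\mathcal{S}_{2}}^{min}&=2, &\left|\mathcal{S}_{2}\right|&=9,\\
\delta&=3+2=5, &x&=\left|\Sigma_{C}\right|-2=1, & &\\
\hat{f}_{\mathcal{S}_{C}}(a)&=5+2=7, &\hat{f}_{\mathcal{S}_{C}}(b)&=3+7=10, &\hat{f}_{\mathcal{S}_{C}}(c)&=2+3=5=\delta,\\
\mathcal{S}_{M}&=\{b\mapsto 10,\;a\mapsto 7\}, &\left|\mathcal{S}_{M}\right|&=17=\left|\mathcal{S}_{1}\right|+\left|\mathcal{S}_{2}\right|. & &
\end{aligned}
$$

The pruned counter, that of $c$, is exactly $\delta$, so $x\delta-\hat{f}_{\mathcal{S}_{C}}(e_{1})=0$ and the merged counters still sum to the combined decayed count of the two inputs.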

We have shown that all of the properties of a Space Saving summary of two counters are preserved by the merge procedure introduced in [Cafaro-Pulimeno-Tempesta]. This suffices to guarantee that all of the properties stated for an FDCMSS sketch continue to hold after the parallel merge procedure depicted in the algorithm presented. In particular, property 1 holds, which guarantees that a merged FDCMSS sketch continues to be 1-norm equivalent to a Count–Min sketch.

5 Experimental results

In this section, we report experimental results on synthetic datasets. Here, we thoroughly test our algorithm using an exponential decay function. All of the experiments have been carried out on the Galileo cluster machine kindly provided by CINECA in Italy. This machine is a Linux CentOS 7.0 NeXtScale cluster with 516 compute nodes; each node is equipped with two 2.40 GHz octa-core Intel Xeon E5-2630 v3 CPUs, 128 GB RAM and two 16 GB Intel Xeon Phi 7120P accelerators (available on 384 nodes only). High-performance networking among the nodes is provided by Intel QDR (40 Gb/s) InfiniBand. All of the codes were compiled using the Intel C++ compiler v17.0.0.

Let $f$ be the true frequency of an item and $\hat{f}$ the corresponding frequency reported by an algorithm. The Relative Error is then defined as $\Delta f=\frac{{\left|{f-\hat{f}}\right|}}{f}$ , and the Average Relative Error is derived by averaging the Relative Errors over all of the measured frequencies.
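The two error metrics just defined can be computed as in the following minimal sketch. The function names and the sample frequencies are hypothetical, and true frequencies are assumed to be strictly positive so that the division is well defined.

```python
from typing import Dict, Hashable


def relative_error(true_freq: float, est_freq: float) -> float:
    """Relative Error |f - f_hat| / f for a single item (f must be > 0)."""
    return abs(true_freq - est_freq) / true_freq


def average_relative_error(true_freqs: Dict[Hashable, float],
                           est_freqs: Dict[Hashable, float]) -> float:
    """Mean of the Relative Errors over all reported (measured) items."""
    errors = [relative_error(true_freqs[item], est)
              for item, est in est_freqs.items()]
    return sum(errors) / len(errors)


# Illustrative values only, not taken from the experiments.
true_freqs = {"a": 100.0, "b": 50.0}
est_freqs = {"a": 110.0, "b": 50.0}
are = average_relative_error(true_freqs, est_freqs)
```

Here item `a` is overestimated by 10% and item `b` is exact, so the average over the two items is half of the first item's relative error.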

Precision, a metric defined as the total number of true heavy hitters reported over the total number of candidate items, quantifies the number of false positives reported by an algorithm in the output stream summary. Recall is the total number of true heavy hitters reported over the number of true heavy hitters given by an exact algorithm. In all of the results we obtained 100% recall, even on a tiny sketch of size 4 × 800 (recall may be less than 100%, but this happens only when the sketch size is really minimal). For this reason, to save space, we do not show recall plots here. Rather, we present Precision, Absolute Error, Average Relative Error (ARE), Updates/ms and runtime/performance plots, since we are interested in understanding the error behavior and the algorithm’s scalability when we use an increasing number of cores. Table 1 reports the experiments carried out. For each metric under examination, we varied $n$ , the stream size in billions of items, $\rho$ , the skew of the Zipfian distribution, $\phi$ , the threshold, and $w$ , the number of sketch columns. All of the other parameters are fixed when varying one of the previous ones, and we show, on top of each plot, the fixed parameters’ values.
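Precision and Recall as defined above can be computed from two sets of items, as in this minimal sketch. The function name `precision_recall` and the sample sets are hypothetical; returning 1.0 for an empty denominator is a convention chosen here, not something stated in the text.

```python
from typing import Hashable, Set, Tuple


def precision_recall(reported: Set[Hashable],
                     true_heavy: Set[Hashable]) -> Tuple[float, float]:
    """Precision and Recall of a set of reported candidate heavy hitters.

    reported   -- candidate items output by the approximate algorithm
    true_heavy -- heavy hitters determined by an exact algorithm
    """
    true_positives = len(reported & true_heavy)
    # Convention (assumed): an empty denominator yields 1.0.
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(true_heavy) if true_heavy else 1.0
    return precision, recall


# Illustrative sets only: one false positive ("d"), no false negatives.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "c"})
```

A false positive lowers Precision without affecting Recall, which is why the two metrics are reported separately.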

Finally, we also present, for the metrics of interest, the results obtained by fixing the stream size and varying the number of cores utilized from 1 to 512. We conclude this section with a comparison between strong and weak scalability.

Experiment 1 aims at measuring the algorithm’s accuracy, experiment 2 aims at measuring how the parallelization affects that accuracy and, finally, experiment 3 measures the computational performance of the parallel algorithm in terms of both strong and weak scalability.

5.1 Algorithm accuracy

As shown by Figure 1, our parallel algorithm provides 100% Precision in all of the experiments carried out. Both the Absolute and the Average Relative Error, depicted respectively in Figures 2 and 3, have extremely low values, in particular with regard to their mean values. We observe that the Absolute Error is only slightly affected by the stream size $n$ , and it is not affected at all by the threshold $\phi$ . The behaviour observed when varying $\rho$ and $w$ is expected. We observe a decrement of the Absolute Error in both cases, since, when the skew is higher, the number of frequent items in the corresponding Zipfian distribution is lower and, when $w$ is higher, increasing the sketch size provides better accuracy and, correspondingly, less error. Regarding the Average Relative Error, we observe the same qualitative behaviors in the experiments carried out.

Finally, the updates done per millisecond, shown in Figure 4, appear to be stable around 100,000 when varying the stream size $n$ and the threshold $\phi$ . There is a visible increment (from 100,000 to 120,000) when varying $\rho$ and a decrement (from 100,000 to 90,000) when varying $w$ . These behaviors are expected for the same reasons we gave when analyzing the error. Indeed, processing a stream in which the number of frequent items is lower is usually faster. On the other hand, increasing the sketch size provides better accuracy, but more time is required to update the sketch.

5.2 Impact of the parallelization on the accuracy

We now discuss the experimental scalability shown by our parallel algorithm. Figure 5 provides the results for the metrics under examination when testing strong scalability. That is, we fix the problem size (i.e., the stream size $n$ ) and increase the number of cores on which the algorithm is executed. As shown, Precision, Absolute and Average Relative Error are not affected at all by a strong scaling of the application, with Precision always equal to 100% and extremely low error values. Finally, the observed increment of the Updates/ms when varying the number of cores utilized is expected, due to the frequency updates made in parallel. Ideally, the throughput of the algorithm, measured as updates/ms, should increase at the same rate as the number of MPI processes.

5.3 Computational performance

Figure 6 compares strong and weak scalability. Weak scalability refers to the scalability of a parallel application when the problem size is increased along with the number of cores, so that we measure how the running time changes with regard to the number of cores for a fixed problem size per core (whilst, for strong scalability, we measure how the running time changes with regard to the number of cores for a fixed total problem size).

As shown, the plot for strong scaling is a log-log plot of the running time versus the number of cores. The dashed straight line with slope -1 indicates ideal scalability, whereas any upward curvature away from that line indicates limited scalability. The plot reports a good strong scalability even on 512 cores; this is due to the high number of items to be processed in the input stream which makes the computational time higher than the parallel overhead.
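The two notions of scalability can be quantified from measured running times as in the following minimal sketch. The function names and all timings are hypothetical and are not taken from the experiments; the sketch uses the usual definitions, with strong-scaling speedup $T_{1}/T_{p}$ (ideal value $p$, i.e. the slope $-1$ line in a log-log plot) and weak-scaling efficiency $T_{1}/T_{p}$ (ideal value 1, since the work per core is fixed).

```python
from typing import Dict, Tuple


def strong_scaling(times: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Map cores p -> (speedup, efficiency) for a fixed total problem size.

    `times` maps the number of cores to the measured running time and must
    contain the single-core time under key 1.
    """
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in times.items()}


def weak_scaling_efficiency(times: Dict[int, float]) -> Dict[int, float]:
    """Map cores p -> efficiency T_1 / T_p for a fixed problem size per core."""
    t1 = times[1]
    return {p: t1 / t for p, t in times.items()}


# Hypothetical timings in seconds, for illustration only.
strong = strong_scaling({1: 100.0, 2: 50.0, 4: 25.0})
weak = weak_scaling_efficiency({1: 10.0, 16: 12.5})
```

In this invented example the strong-scaling run is ideal (efficiency 1.0 at every core count), while the weak-scaling run loses some efficiency at 16 cores because the running time grows although the work per core is unchanged.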

Regarding weak scalability, the corresponding plot indicates a loss of performance when scaling from 1 to 16 cores, while scalability is very good from 16 cores up to 512. This can be explained by considering the parallel architecture used for testing. One computing node consists of two octa-core Xeon processors, so that the cores share the main memory banks and the third level cache memory. In the weak scalability experiment the problem size increases linearly with the number of processes, hence the total memory also increases linearly; since from 1 to 16 cores we use only one computing node, the memory contention between parallel processes increases. When varying from 16 to 512 cores we use from 1 to 32 computing nodes; each node runs 16 processes which compete for memory accesses as already discussed, hence the further slight loss of performance is due to the communication overhead.

6 Conclusions

We have presented PFDCMSS, a novel message–passing based parallel algorithm for mining time–faded heavy hitters, which, to the best of our knowledge, is the first parallel algorithm solving the problem on message–passing parallel architectures. We have formally proved its correctness by showing that the underlying data structure is mergeable, although not trivially so. Nonetheless, the parallel algorithm is fast and simple to implement, and we have shown, through extensive experimental results, that PFDCMSS retains the extreme accuracy and error bound provided by FDCMSS whilst providing very good parallel scalability.
