A High-Performance Algorithm for Identifying Frequent Items in Data Streams

Daniel Anderson; Pryce Bevan; Kevin Lang; Edo Liberty; Lee Rhodes; Justin Thaler

arXiv:1705.07001·cs.DS·May 23, 2017·1 cites

Open Access

TL;DR

This paper presents an optimized version of the Misra-Gries algorithm for identifying frequent items in data streams. It is suited to industrial deployment, handles weighted updates, merges summaries efficiently, and is validated by experiments.

Contribution

It introduces a highly optimized algorithm for frequency estimation in data streams, supporting weighted updates and fast merging, suitable for industrial deployment.

Findings

1. The algorithm handles weighted updates in amortized constant time.
2. It merges summaries with a simple, fast method that asymptotically improves on prior work, even for unweighted streams.
3. Experiments show improved efficiency over prior methods.

Abstract

Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited many times due to its practicality and applicability. We describe a highly optimized version of Misra and Gries' algorithm that is suitable for deployment in industrial settings. Our code is made public via an open source library called DataSketches that is already used by several companies and production systems. Our algorithm improves on two theoretical and practical aspects of prior work. First, it handles weighted updates in amortized constant time, a common requirement in practice. Second, it uses a simple and fast method for merging summaries that asymptotically improves on prior work even for unweighted streams. We describe experiments confirming…
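
For readers unfamiliar with the underlying technique, the following is a minimal textbook-style sketch of a Misra-Gries summary with weighted updates and merging. It is an illustration under stated assumptions, not the paper's optimized algorithm and not the DataSketches API: the class name `MisraGries`, its methods, and the `offset` field are hypothetical, and the reduction step here sorts all counters, so it does not achieve the amortized constant-time updates or the faster merge that the abstract describes.

```python
class MisraGries:
    """Textbook-style Misra-Gries summary (hypothetical names, illustration only)."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k          # maximum number of counters kept after a reduction
        self.counters = {}  # item -> lower bound on its total weight
        self.offset = 0     # total amount subtracted so far (max underestimate)

    def update(self, item, weight=1):
        """Add `weight` to `item`, then reduce if more than k counters exist."""
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.counters[item] = self.counters.get(item, 0) + weight
        if len(self.counters) > self.k:
            self._reduce()

    def _reduce(self):
        # Subtract the (k+1)-th largest counter value from every counter and
        # drop counters that are no longer positive. At most k counters survive.
        # Sorting makes this step O(m log m) for m counters; this is the simple
        # rule, NOT the scheme the paper uses to get amortized constant time.
        cut = sorted(self.counters.values(), reverse=True)[self.k]
        self.offset += cut
        self.counters = {x: c - cut for x, c in self.counters.items() if c > cut}

    def merge(self, other):
        """Fold another summary into this one (add counters, then reduce)."""
        for x, c in other.counters.items():
            self.counters[x] = self.counters.get(x, 0) + c
        self.offset += other.offset
        if len(self.counters) > self.k:
            self._reduce()

    def lower_bound(self, item):
        return self.counters.get(item, 0)

    def upper_bound(self, item):
        return self.counters.get(item, 0) + self.offset
```

In this sketch, the true total weight of any item always lies between `lower_bound(item)` and `upper_bound(item)`, both for a single stream and after `merge` combines summaries built over separate streams.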

Taxonomy

Topics: Data Stream Mining Techniques · Advanced Database Systems and Queries · Data Management and Algorithms