Beating CountSketch for Heavy Hitters in Insertion Streams

Vladimir Braverman; Stephen R. Chestnut; Nikita Ivkin; David P. Woodruff

arXiv:1511.00661 · cs.DS · November 3, 2015

TL;DR

This paper introduces a space-efficient algorithm for identifying heavy hitters in insertion-only data streams. It improves on the space used by the CountSketch data structure by applying Gaussian process techniques.

Contribution

It presents the first algorithm achieving $O(\log n \log\log n)$ bits of space for heavy hitter detection, improving previous bounds and introducing new methods for $F_2$ and $\ell_\infty$ norm estimation.

Findings

1. Achieves $O(\log n \log\log n)$ bits of space for finding heavy hitters.

2. Provides the first low-space algorithm for estimating $F_2$ at all points in the stream.

3. Resolves an open problem on $\ell_\infty$ norm estimation in insertion streams.

Abstract

Given a stream $p_1, \dots, p_m$ of items from a universe $U$, which, without loss of generality, we identify with the set of integers $\{1, 2, \dots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve…
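To make the two guarantees in the abstract concrete, here is a minimal sketch that computes both sets exactly by storing every frequency. It is not the paper's algorithm and not CountSketch; it uses linear space and only illustrates the definitions. It assumes the standard $\ell_2$ threshold $f_j \geq \epsilon \sqrt{F_2}$, and the function names and the example stream are hypothetical.

```python
from collections import Counter
from math import sqrt


def l2_heavy_hitters(stream, eps):
    """Exact l2-heavy hitters: items j with f_j >= eps * sqrt(F_2).

    Brute-force reference that stores all frequencies (linear space).
    """
    f = Counter(stream)
    f2 = sum(c * c for c in f.values())  # F_2 = sum of squared frequencies
    threshold = eps * sqrt(f2)
    return {j for j, c in f.items() if c >= threshold}


def l1_heavy_hitters(stream, eps):
    """Exact l1-heavy hitters: items j with f_j >= eps * m, m = stream length."""
    f = Counter(stream)
    m = len(stream)
    return {j for j, c in f.items() if c >= eps * m}


# Hypothetical stream: item 0 appears 100 times, then 10,000 distinct
# items appear once each. Here m = 10,100 and F_2 = 100^2 + 10,000 = 20,000.
stream = [0] * 100 + list(range(1, 10001))

l2_result = l2_heavy_hitters(stream, eps=0.5)  # threshold is about 70.7
l1_result = l1_heavy_hitters(stream, eps=0.5)  # threshold is 5,050
```

In this example item 0 is an $\ell_2$-heavy hitter but not an $\ell_1$-heavy hitter for the same $\epsilon$, which shows why the $\ell_2$ guarantee is the stronger one: it catches items that are large relative to $\sqrt{F_2}$ even when they make up a small fraction of the stream.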

Taxonomy

Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Data Management and Algorithms