Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

Daniel Ting

arXiv:1709.04048 · stat.CO · September 14, 2017 · 2 cites

Open Access

TL;DR

This paper introduces a data sketch for massive, disaggregated datasets that estimates sums under arbitrary filter conditions without bias and identifies frequent items, with applications including click-through-rate computation, user metrics from event streams, and IP flow measurement.

Contribution

The paper presents a data sketch that estimates subset sums and identifies frequent items in disaggregated data, outperforming priority sampling and uniform sampling, especially on skewed data.

Findings

01. Achieves unbiased, high-accuracy estimates for subset sums.

02. Effectively identifies frequent items and heavy hitters.

03. Outperforms priority sampling and uniform sampling on skewed data.

Abstract

We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario when the data is disaggregated so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show the sketch has good properties for both…
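To make the problem setting concrete, the following is a minimal illustration of the two tasks the abstract names, computed exactly on a tiny click stream. It is not the paper's sketch: it performs the full per-user aggregation that a sketch is meant to avoid. The data, the `aggregate` and `subset_sum` helpers, and the threshold of 3 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical disaggregated click stream: one row per click,
# so a single user (the "unit") appears on many rows.
clicks = [
    ("alice", "US"), ("bob", "DE"), ("alice", "US"),
    ("carol", "US"), ("alice", "US"), ("bob", "DE"),
]

def aggregate(rows):
    """Collapse raw rows into one click total per user (the expensive step)."""
    totals = defaultdict(int)
    country = {}
    for user, ctry in rows:
        totals[user] += 1
        country[user] = ctry
    return totals, country

def subset_sum(totals, country, predicate):
    """Sum the per-user totals over users that satisfy an arbitrary filter."""
    return sum(n for user, n in totals.items() if predicate(user, country[user]))

totals, country = aggregate(clicks)

# Problem 1: a sum under a filter condition (total clicks from US users).
us_clicks = subset_sum(totals, country, lambda user, ctry: ctry == "US")

# Problem 2: frequent items (users with at least 3 clicks; threshold is arbitrary).
heavy = [user for user, n in totals.items() if n >= 3]

print(us_clicks, heavy)
```

The exact approach stores one counter per distinct user, which is the cost that motivates a bounded-size sketch when the number of users is very large.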

Taxonomy

Topics: Data Stream Mining Techniques · Advanced Database Systems and Queries · Complex Network Analysis Techniques