Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
Daniel Ting

TL;DR
This paper introduces a data sketch that gives unbiased subset sum estimates and identifies frequent items in massive datasets, including disaggregated data where each unit's metric is spread across many rows.
Contribution
The paper presents a novel data sketch that accurately estimates subset sums and identifies frequent items in disaggregated data, outperforming existing methods, especially on skewed data.
Findings
Achieves unbiased, high-accuracy estimates for subset sums.
Effectively identifies frequent items and heavy hitters.
Outperforms priority sampling and uniform sampling on skewed data.
Abstract
We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario in which the data is disaggregated, so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. Thus the sketch is suitable for use in a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show the sketch has good properties for both…
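To make the disaggregated setting concrete, the following Python sketch contrasts an exact subset sum, which must first aggregate every row per user, with a small fixed-size counter table that is updated one row at a time. The abstract does not describe the sketch's internals, so the class here is an assumption: a Space Saving-style table with randomized label replacement, written for illustration only and not the paper's reference implementation. The names `exact_subset_sum`, `RandomizedSpaceSaving`, and the toy click stream are all hypothetical.

```python
import random
from collections import Counter


def exact_subset_sum(rows, keep):
    """Aggregate per-item totals, then sum the items that pass the filter."""
    totals = Counter(rows)  # the expensive per-item aggregation
    return sum(count for item, count in totals.items() if keep(item))


class RandomizedSpaceSaving:
    """Fixed-size counter table with randomized label replacement.

    Illustrative assumption, not the paper's code.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.counts = {}  # item label -> counter
        self.rng = random.Random(seed)

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Table is full: increment the smallest counter, then give its
            # label to the new item with probability 1 / (new counter value).
            victim = min(self.counts, key=self.counts.get)
            new_count = self.counts.pop(victim) + 1
            label = item if self.rng.random() < 1.0 / new_count else victim
            self.counts[label] = new_count

    def subset_sum(self, keep):
        """Estimate the filtered sum from the counters that pass the filter."""
        return sum(c for item, c in self.counts.items() if keep(item))

    def frequent_items(self, k):
        """Return the k labels with the largest counters."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]


# Toy click stream: one row per click, so each user spans many rows.
rows = ["u1"] * 50 + ["u2"] * 30 + [f"v{i}" for i in range(20)]
random.Random(0).shuffle(rows)


def is_u(user):
    """Arbitrary filter condition: keep users whose id starts with 'u'."""
    return user.startswith("u")


exact = exact_subset_sum(rows, is_u)

sketch = RandomizedSpaceSaving(capacity=8, seed=1)
for user in rows:
    sketch.update(user)
estimate = sketch.subset_sum(is_u)
```

In this sketch the counters always sum to the number of rows processed, and the table never holds more than `capacity` labels, so memory stays bounded regardless of how many distinct users appear. The estimate varies with the random seed; with a table large enough to hold every distinct item, it coincides with the exact answer.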
Taxonomy
Topics: Data Stream Mining Techniques · Advanced Database Systems and Queries · Complex Network Analysis Techniques
