Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates

Edith Cohen; Haim Kaplan

arXiv:0903.0625·cs.DB·March 5, 2009

TL;DR

This paper introduces new unbiased estimators for set-based aggregates that make use of discarded samples, reducing estimation error on large datasets in applications such as Jaccard similarity and association rules.

Contribution

The paper presents novel estimators that leverage discarded samples to improve the accuracy of aggregate estimations over multiple sets, outperforming traditional union-sketch methods.

Findings

01

The new estimators dominate traditional union-sketch estimators for all queries and datasets.

02

Empirical results show reductions in estimation error ranging from 25% to 4-fold.

03

Applicable to various set-based aggregates like Jaccard coefficient and Hamming distance.

Abstract

Many datasets, such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and membership in particular sets. This general formulation includes basic aggregates such as the Jaccard coefficient, Hamming distance, and association rules. On massive data sets, exact computation can be inefficient or infeasible. Sketches based on coordinated random samples are classic summaries that support approximate query processing. Queries are resolved by generating a sketch (sample) of the union of sets used in the predicate from the sketches of these sets and then applying an estimator to this union-sketch. We derive novel…
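The union-sketch approach described in the abstract can be illustrated with a minimal sketch. It assumes bottom-k sampling with a shared hash function as the coordination mechanism, which is one common way to build coordinated samples and may differ from the paper's exact setup. The names `rank`, `bottom_k_sketch`, `union_sketch`, and `jaccard_estimate` are hypothetical, and the code shows only the traditional baseline, not the new estimators the paper derives.

```python
import hashlib


def rank(key, seed="shared-seed"):
    """Map a key to a pseudo-random rank in [0, 1).

    Assumption: every set uses the same seed, so a key gets the same
    rank in every sketch. That shared rank is what coordinates the samples.
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_k_sketch(keys, k):
    """Keep the k distinct keys with the smallest ranks."""
    return sorted(set(keys), key=rank)[:k]


def union_sketch(sketch_a, sketch_b, k):
    """Bottom-k sketch of A ∪ B, built from the two sketches alone.

    Sampled keys that fall outside the k smallest ranks of the union
    are dropped here and play no part in the estimate.
    """
    return sorted(set(sketch_a) | set(sketch_b), key=rank)[:k]


def jaccard_estimate(sketch_a, sketch_b, k):
    """Baseline estimate: fraction of union-sketch keys present in both sets."""
    union = union_sketch(sketch_a, sketch_b, k)
    if not union:
        return 0.0
    in_a, in_b = set(sketch_a), set(sketch_b)
    both = sum(1 for key in union if key in in_a and key in in_b)
    return both / len(union)


if __name__ == "__main__":
    a = range(0, 1000)
    b = range(500, 1500)  # true Jaccard coefficient is 500 / 1500
    k = 200
    print(jaccard_estimate(bottom_k_sketch(a, k), bottom_k_sketch(b, k), k))
```

A key in the union sketch belongs to a set exactly when it appears in that set's own sketch, because being among the k smallest ranks of the union implies being among the k smallest ranks of the set. The keys that `union_sketch` drops are the kind of discarded samples the title refers to.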

Taxonomy

Topics: Anomaly Detection Techniques and Applications