Leveraging Discarded Samples for Tighter Estimation of Multiple-Set Aggregates
Edith Cohen, Haim Kaplan

TL;DR
This paper introduces new unbiased estimators for set-based aggregates that utilize discarded samples, significantly reducing estimation error in large datasets with applications like Jaccard similarity and association rules.
Contribution
The paper presents novel estimators that leverage discarded samples to improve the accuracy of aggregate estimations over multiple sets, outperforming traditional union-sketch methods.
Findings
The new estimators dominate traditional union-sketch estimators for all queries and datasets.
Empirical results show a reduction in estimation error ranging from 25% to 4-fold.
Applicable to various set-based aggregates like Jaccard coefficient and Hamming distance.
Abstract
Many datasets such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy some selection predicate defined over keys' attributes and membership in particular sets. This general formulation includes basic aggregates such as the Jaccard coefficient, Hamming distance, and association rules. On massive data sets, exact computation can be inefficient or infeasible. Sketches based on coordinated random samples are classic summaries that support approximate query processing. Queries are resolved by generating a sketch (sample) of the union of sets used in the predicate from the sketches of these sets and then applying an estimator to this union-sketch. We derive novel…
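To make the union-sketch baseline described in the abstract concrete, here is a minimal Python sketch, assuming bottom-k sketches coordinated through a shared hash. The function names (`rank`, `bottom_k_sketch`, `union_sketch`, `jaccard_estimate`), the `seed` parameter, and the choice of SHA-256 are illustrative assumptions, not taken from the paper. It shows only the traditional union-sketch Jaccard estimate, not the paper's new estimators.

```python
import hashlib

def rank(key, seed="demo"):
    """Shared pseudo-random rank in [0, 1).

    Coordination means the same key gets the same rank in every set.
    The seed and hash function are assumptions made for this example.
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k_sketch(keys, k):
    """Sketch of a set: its k keys with the smallest ranks."""
    return sorted(set(keys), key=rank)[:k]

def union_sketch(sketch_a, sketch_b, k):
    """Bottom-k sketch of the union of A and B, built from the two sketches alone."""
    return sorted(set(sketch_a) | set(sketch_b), key=rank)[:k]

def jaccard_estimate(sketch_a, sketch_b, k):
    """Union-sketch estimate: fraction of union-sketch keys present in both sets.

    A key in the union sketch belongs to a set exactly when it appears in
    that set's own bottom-k sketch, so membership can be read off the sketches.
    """
    union = union_sketch(sketch_a, sketch_b, k)
    in_a, in_b = set(sketch_a), set(sketch_b)
    both = sum(1 for key in union if key in in_a and key in in_b)
    return both / len(union) if union else 0.0

# Hypothetical usage on two overlapping sets of integer keys.
A = set(range(0, 600))
B = set(range(300, 900))
k = 64
estimate = jaccard_estimate(bottom_k_sketch(A, k), bottom_k_sketch(B, k), k)
```

Keys that appear in either input sketch but fall outside the k smallest ranks of the union are simply dropped by `union_sketch`. Going by the summary above, these are the discarded samples that the paper's estimators put to use.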
Taxonomy
Topics: Anomaly Detection Techniques and Applications
