MTS Sketch for Accurate Estimation of Set-Expression Cardinalities from Small Samples
Reuven Cohen, Liran Katzir, Aviv Yehezkel

TL;DR
This paper introduces the MTS sketch, a new probabilistic data structure that accurately estimates set expression cardinalities from small samples of multiple data streams, enabling efficient query optimization in large databases.
Contribution
The paper presents the Maximal-Term with Subsample (MTS) sketch and a family of algorithms that use it to estimate set-expression cardinalities from only a small sample of each stream.
Findings
The proposed algorithms are unbiased and have low asymptotic variance.
The framework accurately estimates the cardinality of complex set expressions from small samples.
Small fixed-size sketch storage enables scalability in large data systems.
Abstract
Sketch-based streaming algorithms allow efficient processing of big data. These algorithms use small fixed-size storage to store a summary ("sketch") of the input data, and use probabilistic algorithms to estimate the desired quantity. However, in many real-world applications it is impractical to collect and process the entire data stream; the common practice is thus to sample and process only a small part of it. While sampling is crucial for handling massive data sets, it may reduce accuracy. In this paper we present a new framework that can accurately estimate the cardinality of any set expression between any number of streams using only a small sample of each stream. The proposed framework consists of a new sketch, called Maximal-Term with Subsample (MTS), and a family of algorithms that use this sketch. An example of a possible query that can be efficiently answered using the…
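To make the idea of a fixed-size sketch concrete, the following is a minimal Python illustration of the classic k-minimum-values (KMV) technique. It is not the MTS sketch from the paper, and it assumes the full stream is processed rather than a sample. The names `KMVSketch`, `hash01`, `union_estimate`, and `intersection_estimate` are hypothetical and chosen only for this illustration.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-uniform float in (0, 1] using SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


class KMVSketch:
    """Fixed-size summary: the k smallest distinct hash values of a stream."""

    def __init__(self, k):
        self.k = k
        self.heap = []        # max-heap of kept hashes, stored negated
        self.members = set()  # the same hashes, for O(1) duplicate checks

    def add(self, item):
        h = hash01(item)
        if h in self.members:
            return
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -h)
            self.members.add(h)
        elif h < -self.heap[0]:
            # Replace the current largest kept hash with the smaller one.
            evicted = -heapq.heappushpop(self.heap, -h)
            self.members.discard(evicted)
            self.members.add(h)

    def estimate(self):
        """Estimate the number of distinct items seen so far."""
        if len(self.heap) < self.k:
            return float(len(self.heap))  # fewer than k distinct items: exact
        return (self.k - 1) / -self.heap[0]


def _merged_smallest(a, b):
    """The k smallest hashes of the union, recovered from the two sketches."""
    k = min(a.k, b.k)
    return k, heapq.nsmallest(k, a.members | b.members)


def union_estimate(a, b):
    """Estimate |A ∪ B| from the sketches of A and B."""
    k, merged = _merged_smallest(a, b)
    if len(merged) < k:
        return float(len(merged))
    return (k - 1) / merged[-1]


def intersection_estimate(a, b):
    """Estimate |A ∩ B| as (estimated Jaccard similarity) * (estimated |A ∪ B|)."""
    _, merged = _merged_smallest(a, b)
    if not merged:
        return 0.0
    in_both = sum(1 for h in merged if h in a.members and h in b.members)
    return in_both / len(merged) * union_estimate(a, b)
```

Each sketch stores at most `k` hash values regardless of stream length, and the relative error of the distinct-count estimate shrinks roughly as 1/sqrt(k). To use it, feed each stream into its own `KMVSketch` with `add`, then call `union_estimate` or `intersection_estimate` on the pair.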
Taxonomy
Topics: Gene expression and cancer classification
