TL;DR
This paper introduces a supervised learning approach that estimates the number of distinct values (NDV) in database columns from samples, and that outperforms estimators built on heuristics or data assumptions across diverse datasets.
Contribution
The paper formulates NDV estimation as a workload-agnostic supervised learning problem. The model is trained on synthetic data, gives efficient and accurate estimates, and can be deployed as a user-defined function.
Findings
The learned estimator outperforms state-of-the-art methods on real datasets.
The estimator runs inference in microseconds on a CPU.
Training on synthetic data makes the estimator workload-agnostic.
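To make the synthetic-training idea concrete, here is a minimal sketch of how one (features, label) pair could be generated without touching any real workload. The feature choice (a clipped frequency profile plus the column and sample sizes), the skewed value generator, and the names `frequency_profile` and `make_training_example` are assumptions for illustration. This summary does not specify the paper's actual features or data generator.

```python
import random
from collections import Counter


def frequency_profile(sample, max_freq=10):
    """Return [f_1, ..., f_max_freq], where f_j is the number of distinct
    values that occur exactly j times in the sample. Values occurring more
    than max_freq times are counted in the last bucket."""
    counts = Counter(sample)
    profile = [0] * max_freq
    for c in counts.values():
        profile[min(c, max_freq) - 1] += 1
    return profile


def make_training_example(n_rows, n_distinct, sample_size, rng):
    """Build one synthetic column with a known NDV, sample it, and return
    (features, label). Hypothetical generator, not the paper's."""
    values = list(range(n_distinct))
    # Skewed random weights so that some values are much more frequent.
    weights = [rng.random() ** 3 + 1e-9 for _ in values]
    # Each value appears at least once, so the true NDV is exactly n_distinct.
    column = values + rng.choices(values, weights=weights, k=n_rows - n_distinct)
    sample = rng.sample(column, sample_size)  # uniform, without replacement
    features = frequency_profile(sample) + [n_rows, sample_size]
    return features, n_distinct


rng = random.Random(0)
features, label = make_training_example(1000, 50, 100, rng)
```

Because the label is known by construction, many such pairs can be produced across a range of column sizes, NDVs, and sampling rates, then fed to any regression model.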
Abstract
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions:…
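As an illustration of the heuristic, sample-based estimators the abstract contrasts against, the sketch below implements GEE (the Guaranteed-Error Estimator of Charikar et al.), which scales the number of values seen exactly once by the square root of the inverse sampling rate. It is included only as a familiar example of this estimator family. Whether it is among the baselines evaluated in the paper is not stated in this summary, and the function name `gee_estimate` is hypothetical.

```python
import math
from collections import Counter


def gee_estimate(sample, n_rows):
    """GEE estimate of a column's NDV from a uniform random sample.

    D_hat = sqrt(N / n) * f_1 + sum_{j >= 2} f_j
    where N is the column size, n the sample size, and f_j the number of
    distinct values that appear exactly j times in the sample.
    """
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    seen_more_than_once = len(counts) - f1
    return math.sqrt(n_rows / len(sample)) * f1 + seen_more_than_once


# Sample of 4 rows drawn from a 16-row column: f_1 = 2 ('a', 'c'), f_2 = 1 ('b').
estimate = gee_estimate(["a", "b", "b", "c"], n_rows=16)  # sqrt(4) * 2 + 1
```

The fixed scaling factor is the kind of built-in assumption that a learned estimator aims to replace with a function fitted to data.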
