Learning to be a Statistician: Learned Estimator for Number of Distinct Values

Renzhi Wu; Bolin Ding; Xu Chu; Zhewei Wei; Xiening Dai; Tao Guan; Jingren Zhou

arXiv:2202.02800 · cs.LG · February 8, 2022

TL;DR

This paper introduces a supervised learning approach that estimates the number of distinct values (NDV) in a database column from a sample, and reports that it outperforms estimators built on heuristics or data assumptions across diverse datasets.

Contribution

The paper formulates NDV estimation as a workload-agnostic supervised learning problem: the estimator is trained on synthetic data, produces efficient and accurate estimates, and can be deployed as a user-defined function.
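To make the idea of synthetic training data concrete, here is a minimal sketch of how one labeled example could be built: generate a column whose true NDV is known, draw a random sample, and summarize the sample as features. The feature choice (a frequency profile plus the row and sample counts), the skewed generator, and the names `frequency_profile` and `make_training_pair` are all assumptions made for illustration; this summary does not specify the paper's actual features or data generator.

```python
import random
from collections import Counter


def frequency_profile(sample, max_freq=10):
    """Return [f_1, ..., f_max_freq], where f_j is the number of distinct
    values that occur exactly j times in the sample. Frequencies above
    max_freq are folded into the last bucket."""
    value_counts = Counter(sample)
    profile = [0] * max_freq
    for count in value_counts.values():
        profile[min(count, max_freq) - 1] += 1
    return profile


def make_training_pair(num_rows, num_distinct, sample_rate, rng):
    """Build one synthetic (features, label) pair with a known true NDV.

    Hypothetical generator (an assumption, not the paper's): every value
    appears at least once, and the remaining rows are filled with skewed
    draws so that value frequencies vary. Requires num_rows >= num_distinct.
    """
    column = list(range(num_distinct))
    weights = [1.0 / (rank + 1) for rank in range(num_distinct)]
    column += rng.choices(range(num_distinct), weights=weights,
                          k=num_rows - num_distinct)

    # Uniform random sample of rows, without replacement.
    sample_size = max(1, int(num_rows * sample_rate))
    sample = rng.sample(column, sample_size)

    features = frequency_profile(sample) + [num_rows, sample_size]
    label = num_distinct  # ground truth is known by construction
    return features, label


rng = random.Random(0)
features, label = make_training_pair(
    num_rows=10_000, num_distinct=500, sample_rate=0.01, rng=rng
)
```

Because the label comes from the generator rather than from any real table or query log, pairs like these can be produced in bulk without access to a particular workload.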

Findings

01. The learned estimator outperforms state-of-the-art methods on real datasets.

02. The estimator provides microsecond-level inference on CPU.

03. Training on synthetic data makes the estimator workload-agnostic.

Abstract

Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based estimators typically rely on heuristics or assumptions and do not have robust performance across different datasets as the assumptions on data can easily break. On the other hand, deriving an estimator from a principled formulation such as maximum likelihood estimation is very challenging due to the complex structure of the formulation. We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator. To this end, we need to answer several questions:…
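For context on what a heuristic sample-based estimator looks like, the sketch below implements GEE (the Guaranteed Error Estimator of Charikar et al.), a classical baseline and not the learned model proposed here. GEE scales the number of values seen exactly once in the sample by sqrt(N/n), where N is the column size and n the sample size, and counts every repeated value once. The function name `gee_estimate` is chosen for illustration.

```python
import math
from collections import Counter


def gee_estimate(sample, num_rows):
    """Estimate a column's NDV from a uniform random sample using GEE.

    sample:   list of sampled values (n = len(sample))
    num_rows: total number of rows N in the column
    """
    n = len(sample)
    if n == 0:
        return 0.0

    value_counts = Counter(sample)
    # f_1: values that appear exactly once in the sample.
    singletons = sum(1 for count in value_counts.values() if count == 1)
    # Values seen at least twice are each counted once.
    repeated = len(value_counts) - singletons

    return math.sqrt(num_rows / n) * singletons + repeated
```

For a sample `['a', 'b', 'b', 'c']` drawn from 400 rows, the two singletons are scaled by sqrt(400 / 4) = 10 and the one repeated value is counted once, giving 21.0. The fixed scaling factor is the kind of built-in rule that a learned estimator would replace with a trained model.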

Code & Models

Repositories

wurenzhi/learned_ndv_estimator
PyTorch · Official
