Extensive Large-Scale Study of Error in Sampling-Based Distinct Value Estimators for Databases
Vinay Deolalikar, Hernan Laffitte

TL;DR
This large-scale empirical study evaluates 11 distinct value estimators on billion-row datasets, revealing error patterns linked to a key latent parameter and providing visualization tools for better understanding and improvement.
Contribution
The study is the first to scale to billion-row datasets, to analyze error patterns in terms of a new latent parameter, and to introduce visualization frameworks for distinct value estimators.
Findings
Estimator error depends on the average uniform class size.
Error patterns can be visualized to understand estimator performance.
The study scales to datasets of a billion rows, reflecting real-world database sizes.
Abstract
The problem of distinct value estimation has many applications. As a critical component of query optimizers in databases, it also has high commercial impact. Many distinct value estimators have been proposed, using various statistical approaches. However, characterizing the errors incurred by these estimators is an open problem: existing analytical approaches are not powerful enough, and extensive empirical studies at large scale do not exist. We conduct an extensive large-scale empirical study of 11 distinct value estimators from four different approaches to the problem over families of Zipfian distributions whose parameters model real-world applications. Our study is the first that scales to the billion-row size at which today's large commercial databases have to operate. This allows us to characterize the error that is encountered in real-world applications of distinct…
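To make the setting concrete, here is a minimal sketch of one well-known sampling-based estimator, the Guaranteed Error Estimator (GEE) of Charikar et al. The abstract excerpt does not list the 11 estimators studied, so GEE is an illustrative choice and is not claimed to be one of them. The function names, the ratio-error metric, and the small Zipfian demo parameters are all assumptions made for this example, not details taken from the paper.

```python
import math
import random
from collections import Counter


def frequency_profile(sample):
    """Return f, where f[j] is the number of distinct values seen exactly j times."""
    counts = Counter(sample)
    return Counter(counts.values())


def gee_estimate(sample, table_rows):
    """GEE: scale the singletons by sqrt(n / r) and count repeated values once each.

    n = table_rows is the column size, r = len(sample) is the sample size.
    """
    f = frequency_profile(sample)
    r = len(sample)
    singletons = f.get(1, 0)
    repeated = sum(c for j, c in f.items() if j >= 2)
    return math.sqrt(table_rows / r) * singletons + repeated


def ratio_error(estimate, truth):
    """Symmetric ratio error (>= 1); an assumed metric, not necessarily the paper's."""
    return max(estimate / truth, truth / estimate)


if __name__ == "__main__":
    # Hypothetical demo: a small Zipfian column, far below the billion-row scale.
    rng = random.Random(0)
    domain, skew, rows = 5_000, 1.0, 100_000
    weights = [1.0 / (k ** skew) for k in range(1, domain + 1)]
    column = rng.choices(range(1, domain + 1), weights=weights, k=rows)
    sample = rng.sample(column, 1_000)  # 1% sample drawn without replacement
    truth = len(set(column))
    estimate = gee_estimate(sample, rows)
    print(f"true={truth} estimate={estimate:.0f} ratio_error={ratio_error(estimate, truth):.2f}")
```

Varying `skew` and the sampling fraction in the demo gives a rough feel for how estimator error shifts with the shape of the distribution, which is the kind of dependence the study characterizes at much larger scale.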
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Data Quality and Management
