Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

Jiajun Li; Zhewei Wei; Bolin Ding; Xiening Dai; Lu Lu; Jingren Zhou

arXiv:2206.05476 · cs.DB · June 14, 2022

TL;DR

This paper introduces a novel sketch-based distributed method for estimating the number of distinct values in large-scale data, significantly reducing communication costs while maintaining accuracy.

Contribution

A new distributed sampling-based NDV estimation method using sketches that minimizes communication costs and is compatible with existing estimators.

Findings

1. Reduces communication costs by orders of magnitude.

2. Achieves accurate NDV estimation with sub-linear communication.

3. Validated through extensive experiments.

Abstract

In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV can be broadly classified into two categories: i) scanning-based methods, which scan the entire data and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV using sampling data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation is preferable in applications with a large data volume and a permissible error restriction due to its higher scalability. However, while the sampling-based method is more effective on a single machine, it is less practical in a distributed environment with massive data volumes. For obtaining the final NDV estimators, the entire sample must be transferred…
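To make the sampling-based category concrete, here is a minimal single-machine sketch. It is not the paper's method: it uses the well-known GEE estimator (Charikar et al.) as a stand-in for "an existing estimator", and the function names `frequency_profile` and `gee_estimate` are hypothetical names chosen for illustration. The estimator scales the count of values seen exactly once in the sample by sqrt(N/n), where N is the table size and n is the sample size, and adds the count of values seen more than once.

```python
import math
from collections import Counter


def frequency_profile(sample):
    """Return {j: f_j}, where f_j is the number of distinct values
    that appear exactly j times in the sample."""
    value_counts = Counter(sample)          # value -> occurrences in sample
    return Counter(value_counts.values())   # occurrences -> number of values


def gee_estimate(sample, population_size):
    """GEE estimate of the NDV of a column with `population_size` rows,
    given a uniform random sample of its rows.

    Assumption: GEE stands in for any sampling-based estimator here;
    it is not the distributed algorithm proposed in the paper.
    """
    n = len(sample)
    if n == 0:
        return 0.0
    profile = frequency_profile(sample)
    f1 = profile.get(1, 0)                                   # singletons
    repeated = sum(c for j, c in profile.items() if j >= 2)  # seen 2+ times
    return math.sqrt(population_size / n) * f1 + repeated


if __name__ == "__main__":
    # Toy sample of 4 rows drawn from a 16-row column.
    print(gee_estimate([1, 1, 2, 3], population_size=16))
```

Note that the estimate depends on the sample only through its frequency profile (the f_j counts), not on the sampled values themselves. This is a property of GEE and of many other sampling-based estimators, and it is the kind of summary a distributed setting would need to assemble across machines.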

Code & Models

Repositories

llijiajun/ndv_estimation_in_distributed_environment (Official)