Zero-Cost NDV Estimation from Columnar File Metadata
Claude Brisson

TL;DR
This paper introduces a zero-cost method for estimating the number of distinct values in columnar data files using only existing metadata, enabling efficient data profiling and optimization without additional storage or data access.
Contribution
The paper proposes an approach that combines two metadata-based estimators with a lightweight distribution detector to estimate NDV in columnar formats without extra storage or data access.
Findings
Accurate NDV estimates for well-spread data using dictionary size inversion.
Robust NDV estimation for sorted or partitioned data via min/max value counting.
Method generalizes to formats like Parquet, ORC, and F3.
Abstract
We present a method for estimating the number of distinct values (NDV) of a column in columnar file formats using only existing file metadata, with no extra storage and no data access. Two complementary signals are exploited: (1) inverting the dictionary-encoded storage size equation yields accurate NDV estimates when distinct values are well-spread across row groups; (2) counting distinct min/max values across row groups and inverting a coupon collector model provides robust estimates for sorted or partitioned data. A lightweight distribution detector routes between the two estimators. While demonstrated on Apache Parquet, the technique generalizes to any format with dictionary encoding and partition-level statistics, such as ORC and F3. Applications include cost-based query optimization, GPU memory allocation, and data profiling.
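To make the second signal concrete, here is a minimal sketch of inverting a coupon collector model. It uses the textbook form, in which the expected number of distinct values seen after n uniform draws from N values is N(1 - (1 - 1/N)^n), and it treats each row group's min (or max) statistic as one draw. Both choices are assumptions made for illustration; the abstract does not spell out the paper's exact model, and the function names below are hypothetical.

```python
import math


def expected_distinct(ndv: float, draws: int) -> float:
    """Expected distinct values seen after `draws` uniform draws with
    replacement from `ndv` equally likely values (textbook model)."""
    if ndv <= 1.0:
        return 1.0 if draws > 0 else 0.0
    # expm1/log1p keep precision when ndv is very large.
    return ndv * -math.expm1(draws * math.log1p(-1.0 / ndv))


def invert_coupon_collector(observed_distinct: float, draws: int,
                            max_ndv: float = 1e12) -> float:
    """Solve expected_distinct(N, draws) == observed_distinct for N by bisection.

    Assumption: `draws` is the number of row-group statistics inspected and
    `observed_distinct` is how many of them were distinct.
    """
    if observed_distinct >= draws:
        # Every draw was new: the model only says NDV is large.
        return math.inf
    lo, hi = float(observed_distinct), max_ndv
    for _ in range(200):
        mid = (lo + hi) / 2.0
        # expected_distinct is increasing in ndv, so bisection applies.
        if expected_distinct(mid, draws) < observed_distinct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical usage: 200 row groups whose min statistics show 150 distinct values.
estimate = invert_coupon_collector(150, 200)
```

When every statistic is distinct, the model gives no finite answer, which is one reason a second estimator and a routing step are useful.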
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Data Management and Algorithms
