Zero-Cost NDV Estimation from Columnar File Metadata

Claude Brisson

arXiv:2603.24606·cs.DB·March 27, 2026

Open Access

TL;DR

This paper introduces a zero-cost method for estimating the number of distinct values in columnar data files using only existing metadata, enabling efficient data profiling and optimization without additional storage or data access.

Contribution

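The paper combines two metadata-only NDV estimators, dictionary size inversion and min/max counting with a coupon collector inversion, and uses a lightweight distribution detector to route between them. The approach is demonstrated on Apache Parquet and extends to other columnar formats that provide dictionary encoding and partition-level statistics.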

Findings

01

Accurate NDV estimates for well-spread data using dictionary size inversion.

02

Robust NDV estimation for sorted or partitioned data via min/max value counting.

03

Demonstrated on Apache Parquet, the method generalizes to any format with dictionary encoding and partition-level statistics, such as ORC and F3.
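
To make the first finding concrete, here is a minimal sketch of what inverting a dictionary-encoded size equation could look like. The size model is an assumption made for illustration, not the paper's equation: a chunk with `k` distinct values and `n_rows` rows is taken to occupy `k * value_bytes` for the dictionary page plus `n_rows * bit_width / 8` for bit-packed indices, ignoring run-length encoding, page headers, and compression. The function names are hypothetical.

```python
def dict_encoded_size(k: int, n_rows: int, value_bytes: float) -> float:
    """Assumed size model (illustrative only): dictionary page plus bit-packed indices."""
    # Bits needed to index k dictionary entries; at least 1 bit.
    bit_width = max(1, (k - 1).bit_length())
    return k * value_bytes + n_rows * bit_width / 8.0


def invert_dict_size(size_bytes: float, n_rows: int, value_bytes: float) -> int:
    """Return the smallest k whose modeled size reaches size_bytes.

    The model is strictly increasing in k (for value_bytes > 0),
    so an integer binary search over 1..n_rows is sufficient.
    """
    lo, hi = 1, n_rows
    while lo < hi:
        mid = (lo + hi) // 2
        if dict_encoded_size(mid, n_rows, value_bytes) < size_bytes:
            lo = mid + 1
        else:
            hi = mid
    return lo


# Round trip under the assumed model: 1000 distinct 8-byte values in 1M rows.
size = dict_encoded_size(1000, 1_000_000, 8)
estimate = invert_dict_size(size, 1_000_000, 8)
```

Under this model the inversion recovers a per-chunk distinct count from a size the metadata already records. How that count relates to the column-level NDV depends on how values are spread across row groups, which is why the finding is limited to well-spread data.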

Abstract

We present a method for estimating the number of distinct values (NDV) of a column in columnar file formats, using only existing file metadata, with no extra storage and no data access. Two complementary signals are exploited: (1) inverting the dictionary-encoded storage size equation yields accurate NDV estimates when distinct values are well-spread across row groups; (2) counting distinct min/max values across row groups and inverting a coupon collector model provides robust estimates for sorted or partitioned data. A lightweight distribution detector routes between the two estimators. While demonstrated on Apache Parquet, the technique generalizes to any format with dictionary encoding and partition-level statistics, such as ORC and F3. Applications include cost-based query optimization, GPU memory allocation, and data profiling.
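
The second signal in the abstract can be illustrated with the textbook coupon collector relation. The sketch below is not taken from the paper. It assumes the `m` min/max statistics behave like `m` uniform draws with replacement from `N` distinct values, so the expected number of distinct statistics observed is `N * (1 - (1 - 1/N)^m)`, and it solves that relation for `N` by bisection. The function names and the uniform-draw assumption are hypothetical.

```python
import math


def expected_distinct(n: float, m: int) -> float:
    """Expected number of distinct values seen in m uniform draws from n values."""
    return n * (1.0 - (1.0 - 1.0 / n) ** m)


def invert_coupon_collector(d: float, m: int) -> float:
    """Solve expected_distinct(N, m) = d for N (illustrative assumption-based model).

    d is the number of distinct min/max values observed, m the number of
    statistics inspected. If every statistic is distinct (d == m), the
    model gives no finite upper bound, so infinity is returned.
    """
    if m <= 0 or d <= 0 or d > m:
        raise ValueError("require 0 < d <= m")
    if d == m:
        return math.inf
    # expected_distinct is increasing in N and tends to m, so a root exists.
    lo, hi = float(d), 2.0 * d
    while expected_distinct(hi, m) < d:
        hi *= 2.0
    for _ in range(200):  # bisection down to floating-point resolution
        mid = (lo + hi) / 2.0
        if expected_distinct(mid, m) < d:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Round trip: 100 statistics drawn from 50 distinct values.
d_expected = expected_distinct(50, 100)
n_estimate = invert_coupon_collector(d_expected, 100)
```

When almost every statistic is distinct, the estimate grows very quickly and becomes unbounded at `d == m`, so a practical implementation would presumably cap it, for example at the row count.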

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Data Management and Algorithms