The VC-Dimension of Queries and Selectivity Estimation Through Sampling
Matteo Riondato, Mert Akdere, Ugur Cetintemel, Stanley B. Zdonik, Eli Upfal

TL;DR
This paper introduces a method for estimating SQL query selectivity using VC-dimension theory. It yields sample-based estimates whose accuracy guarantees are independent of the database size and of the number of queries, and the method is validated experimentally.
Contribution
The work provides an explicit VC-dimension bound for query outcome spaces and develops a sampling method for accurate selectivity estimation applicable to multiple queries simultaneously.
Findings
The VC-dimension depends on query predicate complexity, not database size.
A small random sample, sized according to the VC-dimension rather than the database size, can accurately estimate query selectivity.
In the reported experiments, the method outperforms the selectivity estimators in PostgreSQL and SQL Server.
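The sampling idea behind these findings can be illustrated with a minimal sketch. It is not the authors' implementation: it handles only a selection predicate on a single table, and the function names, the toy table, the VC-dimension value of 3, and the constant c = 0.5 are all assumptions chosen for illustration. The sample size follows the standard form for an ε-approximation of a range space with VC-dimension d, namely (c / ε²)(d + ln(1/δ)), which does not depend on the number of tuples.

```python
import math
import random


def sample_size(vc_dim, eps, delta, c=0.5):
    """Sample size of the form (c / eps^2) * (d + ln(1/delta)).

    The constant c is an assumption for illustration; the bound itself
    does not involve the size of the table.
    """
    return math.ceil((c / eps ** 2) * (vc_dim + math.log(1.0 / delta)))


def estimate_selectivity(rows, predicate):
    """Fraction of the given rows that satisfy the predicate."""
    if not rows:
        raise ValueError("rows must be non-empty")
    return sum(1 for row in rows if predicate(row)) / len(rows)


def demo(seed=0):
    """Compare the exact selectivity with the estimate from a sample."""
    rng = random.Random(seed)
    # Hypothetical table of 100,000 tuples with two numeric columns.
    table = [
        {"age": rng.randint(18, 90), "income": rng.randint(10_000, 200_000)}
        for _ in range(100_000)
    ]
    # vc_dim=3 is an assumed value, not a bound taken from the paper.
    m = min(len(table), sample_size(vc_dim=3, eps=0.05, delta=0.05))
    sample = rng.sample(table, m)  # uniform sample without replacement

    def predicate(row):
        # Stands in for: WHERE age > 40 AND income < 80000
        return row["age"] > 40 and row["income"] < 80_000

    exact = estimate_selectivity(table, predicate)
    approx = estimate_selectivity(sample, predicate)
    return m, exact, approx


if __name__ == "__main__":
    m, exact, approx = demo()
    print(f"sample size: {m}, exact: {exact:.4f}, estimate: {approx:.4f}")
```

The same sample can be reused for every query in the class, which is what makes the guarantee hold for multiple queries simultaneously; only the predicate changes between calls to estimate_selectivity.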
Abstract
We develop a novel method, based on the statistical concept of the Vapnik-Chervonenkis dimension, to evaluate the selectivity (output cardinality) of SQL queries - a crucial step in optimizing the execution of large scale database and data-mining operations. The major theoretical contribution of this work, which is of independent interest, is an explicit bound to the VC-dimension of a range space defined by all possible outcomes of a collection (class) of queries. We prove that the VC-dimension is a function of the maximum number of Boolean operations in the selection predicate and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. We leverage on this result and develop a method that, given a class of queries, builds a…
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Machine Learning and Algorithms
