Towards Establishing Guaranteed Error for Learned Database Operations
Sepanta Zeighami, Cyrus Shahabi

TL;DR
This paper provides the first theoretical analysis of error guarantees for learned database operations, establishing bounds on the model size needed for guaranteed accuracy in indexing, cardinality estimation, and range-sum estimation.
Contribution
It introduces the first theoretical conditions and lower bounds on model size for achieving guaranteed accuracy in learned database operations.
Findings
Derived lower bounds on model size for guaranteed accuracy.
Required model size is bounded as a function of data size and error tolerance.
Guidelines for integrating learned models into real-world systems.
Abstract
Machine learning models have demonstrated substantial performance enhancements over non-learned alternatives in various fundamental data management operations, including indexing (locating items in an array), cardinality estimation (estimating the number of matching records in a database), and range-sum estimation (estimating aggregate attribute values for query-matched records). However, real-world systems frequently favor less efficient non-learned methods due to their ability to offer (worst-case) error guarantees - an aspect where learned approaches often fall short. The primary objective of these guarantees is to ensure system reliability, ensuring that the chosen approach consistently delivers the desired level of accuracy across all databases. In this paper, we embark on the first theoretical study of such guarantees for learned methods, presenting the necessary conditions for…
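To make the notion of a worst-case error guarantee concrete for the indexing operation, here is a minimal sketch. It is not the paper's method: it assumes a sorted array of distinct numeric keys and a single least-squares line as the "learned model", and the helper names (`fit_linear`, `predict`, `max_error`, `lookup`) are hypothetical. The worst-case position error is measured over all stored keys, and that bound is what lets a lookup search only a small window around the model's guess.

```python
import bisect

def fit_linear(keys):
    """Least-squares line mapping key -> position in the sorted array."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k

def predict(slope, intercept, key):
    """Model's guess for the position of `key` (an integer, possibly out of range)."""
    return round(slope * key + intercept)

def max_error(keys, slope, intercept):
    """Worst-case |predicted position - true position| over all stored keys."""
    return max(abs(predict(slope, intercept, k) - i) for i, k in enumerate(keys))

def lookup(keys, slope, intercept, err, q):
    """Find q by binary search restricted to [guess - err, guess + err]."""
    n = len(keys)
    guess = predict(slope, intercept, q)
    lo = min(n, max(0, guess - err))
    hi = max(lo, min(n, guess + err + 1))
    i = bisect.bisect_left(keys, q, lo, hi)
    return i if i < n and keys[i] == q else None
```

Because `err` is computed over every stored key, each stored key is guaranteed to fall inside its search window. This bound holds only for the dataset it was measured on, whereas the guarantees studied in the paper must hold across all databases.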
Peer Reviews
Decision·ICLR 2024 poster
* As pointed out by the authors, for database operations like indexing, learned estimators have empirically been shown to outperform some well-known traditional methods. However, such learned estimators are not widely used as no guarantees on their errors are known. In this sense, this paper makes a significant contribution by providing such guarantees for some useful database operations.
* The experimental evaluation provides some evidence that the bounds provided in the paper ar
* The cardinality estimation queries considered in the paper are restrictive. For such queries to be useful in practical database systems, they should include more complex queries, in particular, the join operator. In fact, one of the most important cardinality estimation tasks in databases is the estimation of the size of a join query, which is witnessed by the large number of research articles on this subject.
* The range-sum estimation queries considered in the paper are also r
S1) Theoretical results are accompanied by empirical results.
S2) Results provide some insight into the practical complexity of approximating some database operators for multidimensional numerical data with learned models.
S3) The paper is generally easy to read and understand.
W1) Presentation a bit misleading: The paper gives the impression that the results apply to general database operators over all sorts of tabular data (e.g., SQL queries over a mix of categorical/numerical data), while the results are limited to orthogonal/axis-aligned range queries (intersection of range selections along each dimension). In general, the limitations of this work are not outlined clearly.
W2) Significance unclear: The empirical study is too limited to give a clear idea how much pred
This paper tackles an important problem. Existing learned index structures either grow unbounded to support a specific error (e.g., PGM index), or have an unbounded error but a specific size (e.g., RMI). In the former case, the authors' bounds can be used to estimate the size of the fixed-error index structure ahead of time. In the latter case, where the model size is fixed ahead of time and the error is determined during training, the authors' bounds can be used to estimate an initial model siz
While the bounds given by the authors certainly help bring our understanding of learned database components closer to that of traditional data structures, it is not clear to me how these bounds could be used in systems today. The most I seem to be able to say with the authors' bounds is "if your learned index uses S bytes of memory, then for a given domain size, there exists a dataset size n for which your index must have an error larger than e." It is not clear to me how to use these bounds t
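The two regimes contrasted in the last review (fixed size with measured error, versus fixed error with growing size) can be illustrated with a toy model. This is a sketch under stated assumptions, not PGM, RMI, or the paper's construction: sorted distinct numeric keys are split into equal-count segments, each segment interpolates linearly between its endpoint keys, and the number of segments stands in for model size. The function names are hypothetical.

```python
def segment_error(keys, start, end):
    """Max position error inside keys[start:end] for a line through its endpoints."""
    k0, k1 = keys[start], keys[end - 1]
    span = k1 - k0
    worst = 0
    for i in range(start, end):
        frac = (keys[i] - k0) / span if span else 0.0
        pred = start + round(frac * (end - 1 - start))
        worst = max(worst, abs(pred - i))
    return worst

def model_error(keys, num_segments):
    """Fixed-size regime: choose the segment count, then measure the worst error."""
    n = len(keys)
    bounds = [n * j // num_segments for j in range(num_segments + 1)]
    return max(segment_error(keys, a, b)
               for a, b in zip(bounds, bounds[1:]) if b > a)

def size_for_error(keys, eps):
    """Fixed-error regime: double the segment count until the error is at most eps."""
    m = 1
    while model_error(keys, m) > eps:
        m *= 2  # ends once every segment holds at most one key (error 0)
    return m
```

For example, `model_error(keys, 8)` reports the error obtained at a chosen size, while `size_for_error(keys, 2)` reports a size sufficient for a chosen error on that particular dataset. Neither number is a worst-case guarantee over all datasets, which is the gap the paper's lower bounds address.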
Taxonomy
Topics: Advanced Database Systems and Queries · Distributed and Parallel Computing Systems · Reservoir Engineering and Simulation Methods
