ML Based Lineage in Databases

Michael Leybovich; Oded Shmueli

arXiv:2109.06339·cs.DB·October 5, 2021·1 cites

Open Access

TL;DR

This paper introduces a machine learning and NLP-based approximation method for tracking database tuple lineage, reducing space complexity and improving explanation ranking, integrated into PostgreSQL with promising experimental results.

Contribution

It presents a novel ML/NLP approach for approximate lineage tracking in databases, addressing space issues and enabling efficient, ranked explanations.

Findings

01 High precision and recall in lineage approximation

02 Effective handling of multiple generations of tuples

03 Space-efficient lineage summaries

Abstract

We track the lineage of tuples throughout their database lifetime. That is, we consider a scenario in which tuples (records) produced by a query may affect other tuple insertions into the DB as part of a normal workflow. As time goes on, exact provenance explanations for such tuples become deeply nested, consume ever more space, and lose clarity and readability. We present a novel approach for approximating lineage tracking, using a Machine Learning (ML) and Natural Language Processing (NLP) technique, namely word embedding. The basic idea is to summarize (and approximate) the lineage of each tuple via a small set of constant-size vectors (the number of vectors per tuple is a hyperparameter). Therefore, our solution does not suffer from space complexity blow-up over time, and it "naturally ranks" explanations for the existence of a tuple. We devise an…
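
To make the idea of a bounded set of lineage vectors concrete, here is a minimal Python sketch. It is not the authors' algorithm: the greedy pair-merging rule, the max-cosine scoring, and the names `summarize_lineage` and `rank_explanations` are assumptions made purely for illustration, and the three-dimensional vectors are hand-written stand-ins for real word embeddings. It only shows how a per-tuple summary can stay at a fixed size while still yielding a similarity-based ranking of candidate source tuples.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def summarize_lineage(source_vectors, max_vectors):
    """Reduce the source vectors to at most `max_vectors` vectors.

    Assumed strategy (hypothetical): repeatedly average the two most
    similar vectors until the size bound holds.
    """
    summary = [list(v) for v in source_vectors]
    while len(summary) > max_vectors:
        best = None
        for i in range(len(summary)):
            for j in range(i + 1, len(summary)):
                score = cosine(summary[i], summary[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = mean_vector([summary[i], summary[j]])
        summary = [v for k, v in enumerate(summary) if k not in (i, j)]
        summary.append(merged)
    return summary


def rank_explanations(target_summary, candidates):
    """Rank candidate tuples by their best cosine match against the summary.

    `candidates` maps a tuple name to its list of lineage vectors.
    """
    scores = {
        name: max(cosine(t, c) for t in target_summary for c in vectors)
        for name, vectors in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hand-made vectors for four base tuples (illustrative values only).
base = {
    "a": [[1.0, 0.0, 0.0]],
    "b": [[0.9, 0.1, 0.0]],
    "c": [[0.0, 0.0, 1.0]],
    "d": [[0.0, 1.0, 0.0]],
}

# A tuple derived from a, b and c, summarized with at most 2 vectors.
sources = base["a"] + base["b"] + base["c"]
summary = summarize_lineage(sources, max_vectors=2)
ranked = rank_explanations(summary, base)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy run the summary stays at two vectors regardless of how many sources are folded in, and the unrelated tuple `d` ends up at the bottom of the ranking. A derived tuple could in turn pass its own summary to `summarize_lineage` when it contributes to later insertions, which is how the size bound would carry across generations under these assumptions.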

Taxonomy

Topics: Scientific Computing and Data Management · Topic Modeling · Web Data Mining and Analysis