ML Based Lineage in Databases
Michael Leybovich, Oded Shmueli

TL;DR
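This paper introduces an approximate method for tracking the lineage of database tuples, based on word embedding, a technique from machine learning and NLP. Each tuple's lineage is summarized by a small set of constant-size vectors, which avoids space blow-up over time and naturally ranks explanations for a tuple's existence. The method is integrated into PostgreSQL, with promising experimental results.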
Contribution
It presents a novel ML/NLP approach to approximate lineage tracking in databases that keeps lineage storage bounded per tuple and produces ranked explanations.
Findings
High precision and recall in lineage approximation
Effective handling of multiple generations of tuples
Space-efficient lineage summaries
Abstract
We track the lineage of tuples throughout their database lifetime. That is, we consider a scenario in which tuples (records) that are produced by a query may affect other tuple insertions into the DB, as part of a normal workflow. As time goes on, exact provenance explanations for such tuples become deeply nested, increasingly consuming space, and resulting in decreased clarity and readability. We present a novel approach for approximating lineage tracking, using a Machine Learning (ML) and Natural Language Processing (NLP) technique; namely, word embedding. The basic idea is summarizing (and approximating) the lineage of each tuple via a small set of constant-size vectors (the number of vectors per-tuple is a hyperparameter). Therefore, our solution does not suffer from space complexity blow-up over time, and it "naturally ranks" explanations to the existence of a tuple. We devise an…
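The core idea in the abstract, a bounded set of constant-size vectors per tuple, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the merge rule (averaging the two most similar vectors), the cosine-based ranking, and the names MAX_VECTORS, summarize, derive_lineage, and rank_explanations are all assumptions made for illustration.

```python
import math
from itertools import combinations

MAX_VECTORS = 2  # hypothetical hyperparameter: lineage vectors kept per tuple


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def summarize(vectors, max_vectors=MAX_VECTORS):
    """Shrink a list of vectors to at most max_vectors.

    Assumed merge rule: repeatedly average the two most similar vectors.
    """
    vectors = [list(v) for v in vectors]
    while len(vectors) > max_vectors:
        i, j = max(
            combinations(range(len(vectors)), 2),
            key=lambda p: cosine(vectors[p[0]], vectors[p[1]]),
        )
        merged = [(a + b) / 2 for a, b in zip(vectors[i], vectors[j])]
        vectors = [v for k, v in enumerate(vectors) if k not in (i, j)]
        vectors.append(merged)
    return vectors


def derive_lineage(source_lineages, max_vectors=MAX_VECTORS):
    """Lineage of a new tuple: pool the lineage vectors of the tuples
    that produced it, then summarize back down to the size bound."""
    pooled = [v for lineage in source_lineages for v in lineage]
    return summarize(pooled, max_vectors)


def rank_explanations(target_lineage, candidates):
    """Rank candidate tuples (name -> list of vectors) by their best
    cosine similarity to any lineage vector of the target tuple."""
    scores = {
        name: max(cosine(t, c) for t in target_lineage for c in vecs)
        for name, vecs in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because derive_lineage always summarizes back to at most MAX_VECTORS vectors, a tuple derived from earlier derived tuples occupies the same space as a first-generation one, and the similarity scores give an ordering over candidate explanations.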
Taxonomy
Topics: Scientific Computing and Data Management · Topic Modeling · Web Data Mining and Analysis
