Learning Lineage Constraints for Data Science Operations
Jinjin Zhao

TL;DR
This paper proposes a unified architecture for representing data lineage across diverse data science libraries, enabling better debugging and understanding of complex workflows by abstracting logical patterns beyond specific data models.
Contribution
It introduces a novel framework inspired by intermediate representations to specify and infer logical data lineage across multiple libraries in a common, parameterized way.
Findings
Design of a cross-library lineage architecture (XProv)
Linking materialized graphs with abstract logical patterns
Initial ideas for inferring logical patterns from graphs
Abstract
Data science workflows often integrate functionalities from a diverse set of libraries and frameworks. Tasks such as debugging require data lineage that crosses library boundaries. The problem is that the way "lineage" is represented is often intimately tied to particular data models and data manipulation paradigms. Inspired by the use of intermediate representations (IRs) in cross-library performance optimizations, this vision paper proposes a similar architecture for lineage: how do we specify logical lineage across libraries in a common, parameterized way? In practice, cross-library workflows will contain both known operations and unknown operations, so a key design goal of XProv is to link both materialized lineage graphs of data transformations and the aforementioned abstracted logical patterns. We further discuss early ideas on how to infer logical patterns when only the materialized…
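To make the distinction between a materialized lineage graph and a parameterized logical pattern concrete, here is a minimal Python sketch. It is an illustration only and does not reflect XProv's actual design or API: the edge encoding, the three pattern names (identity, reduce, window), and the function infer_pattern are all hypothetical. A materialized graph is modeled as a set of (output_index, input_index) edges, and inference simply checks which candidate pattern, under which parameter, reproduces the observed edges exactly.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# Hypothetical encoding: one edge per (output element, contributing input element).
Edge = Tuple[int, int]


def identity_pattern(n_in: int) -> FrozenSet[Edge]:
    """Element-wise operation: output i depends only on input i."""
    return frozenset((i, i) for i in range(n_in))


def reduce_pattern(n_in: int) -> FrozenSet[Edge]:
    """Full aggregation: a single output depends on every input."""
    return frozenset((0, i) for i in range(n_in))


def window_pattern(n_in: int, k: int) -> FrozenSet[Edge]:
    """Sliding window of size k: output o depends on inputs o .. o+k-1."""
    return frozenset((o, o + d) for o in range(n_in - k + 1) for d in range(k))


def infer_pattern(
    edges: Iterable[Edge], n_in: int, max_k: int = 8
) -> Optional[Tuple[str, Dict[str, int]]]:
    """Return (pattern name, parameters) if some candidate matches exactly.

    Candidates are tried in a fixed order, so ambiguous cases (for example a
    single-element input) resolve to the first match. Returns None when no
    candidate explains the observed edges.
    """
    observed = frozenset(edges)
    if observed == identity_pattern(n_in):
        return ("identity", {})
    if observed == reduce_pattern(n_in):
        return ("reduce", {})
    for k in range(2, min(max_k, n_in) + 1):
        if observed == window_pattern(n_in, k):
            return ("window", {"k": k})
    return None


# Materialized lineage as it might be captured from a rolling sum of width 3
# over 5 inputs, where the operation itself is treated as unknown.
observed_edges = [
    (0, 0), (0, 1), (0, 2),
    (1, 1), (1, 2), (1, 3),
    (2, 2), (2, 3), (2, 4),
]
result = infer_pattern(observed_edges, n_in=5)
```

In this toy setting the inferred pattern is independent of the library that produced the edges, which is the property the abstract attributes to logical lineage; exact set equality is a deliberately simple matching rule chosen for the sketch.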
