Learning Lineage Constraints for Data Science Operations

Jinjin Zhao

arXiv:2506.18252·cs.DB·June 24, 2025


TL;DR

This paper proposes a unified architecture for representing data lineage across diverse data science libraries, enabling better debugging and understanding of complex workflows by abstracting logical patterns beyond specific data models.

Contribution

It introduces a novel framework inspired by intermediate representations to specify and infer logical data lineage across multiple libraries in a common, parameterized way.

Findings

01. Design of a cross-library lineage architecture (XProv)

02. Linking materialized graphs with abstract logical patterns

03. Initial ideas for inferring logical patterns from graphs

Abstract

Data science workflows often integrate functionalities from a diverse set of libraries and frameworks. Tasks such as debugging require data lineage that crosses library boundaries. The problem is that the way "lineage" is represented is often intimately tied to particular data models and data manipulation paradigms. Inspired by the use of intermediate representations (IRs) in cross-library performance optimizations, this vision paper proposes a similar architecture for lineage: how do we specify logical lineage across libraries in a common, parameterized way? In practice, cross-library workflows will contain both known operations and unknown operations, so a key design goal of XProv is to link both materialized lineage graphs of data transformations and the aforementioned abstracted logical patterns. We further discuss early ideas on how to infer logical patterns when only the materialized…
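
To make the distinction between a materialized lineage graph and an abstract logical pattern concrete, here is a minimal Python sketch. It is not XProv's representation or API, which the abstract does not specify. The names `run_filter`, `selection_pattern`, and `infer_mask` are hypothetical, and row-level edges stored as `(output_row, input_row)` pairs are an assumption made only for illustration. The sketch records the lineage of one filter operation, describes selection-like operations with a pattern parameterized by a boolean mask, and recovers that parameter from the materialized graph alone.

```python
from typing import Callable, List, Set, Tuple

# Assumption: an edge (out_row, in_row) means output row `out_row`
# was derived from input row `in_row`.
Edge = Tuple[int, int]


def run_filter(rows: List[int], keep: Callable[[int], bool]) -> Tuple[List[int], Set[Edge]]:
    """Apply a filter and record its materialized (row-level) lineage."""
    output: List[int] = []
    edges: Set[Edge] = set()
    for in_idx, value in enumerate(rows):
        if keep(value):
            # The next output position is the current output length.
            edges.add((len(output), in_idx))
            output.append(value)
    return output, edges


def selection_pattern(mask: List[bool]) -> Set[Edge]:
    """Hypothetical logical pattern for selection-like operations.

    Parameterized only by a boolean mask over the input rows:
    the k-th kept input row becomes output row k.
    """
    kept = [i for i, flag in enumerate(mask) if flag]
    return {(out_idx, in_idx) for out_idx, in_idx in enumerate(kept)}


def infer_mask(edges: Set[Edge], n_inputs: int) -> List[bool]:
    """Recover the pattern parameter from a materialized graph alone."""
    used = {in_idx for _, in_idx in edges}
    return [i in used for i in range(n_inputs)]


rows = [5, 12, 7, 30]
output, materialized = run_filter(rows, lambda v: v > 6)

# Infer the mask from the graph, then check the pattern reproduces the graph.
mask = infer_mask(materialized, len(rows))
pattern_matches = selection_pattern(mask) == materialized
```

The point of the sketch is that the pattern never mentions a specific library or data model. Any operation whose materialized edges are reproduced by `selection_pattern` for some mask could be described the same way, whichever library executed it.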
