In-Memory Indexing and Querying of Provenance in Data Preparation Pipelines
Khalid Belhajjame, Haroun Mezrioui, Yuyan Zhao

TL;DR
This paper introduces an efficient tensor-based indexing mechanism for capturing and querying detailed data provenance in data preparation pipelines, supporting both retrospective and prospective provenance at fine granularity.
Contribution
It presents a novel tensor-based approach that captures and combines retrospective and prospective provenance for efficient querying in data pipelines.
Findings
Supports fine-grained attribute-level provenance
Achieves efficient querying with minimal memory overhead
Demonstrates effectiveness on real and synthetic data
Abstract
Data provenance has numerous applications in the context of data preparation pipelines. It can be used for debugging faulty pipelines, interpreting results, verifying fairness, and identifying data quality issues, which may affect the sources feeding the pipeline execution. In this paper, we present an indexing mechanism to efficiently capture and query pipeline provenance. Our solution leverages tensors to capture fine-grained provenance of data processing operations, using minimal memory. In addition to record-level lineage relationships, we provide finer granularity at the attribute level. This is achieved by augmenting tensors, which capture retrospective provenance, with prospective provenance information, drawing connections between input and output schemas of data processing operations. We demonstrate how these two types of provenance (retrospective and prospective) can be…
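To make the idea in the abstract concrete, here is a minimal sketch of how record-level (retrospective) links and a schema-level (prospective) mapping might be combined to answer an attribute-level lineage query. It is not the authors' implementation. The class `OperationProvenance`, its fields, and the toy operation are all hypothetical, and the sparse binary tensor is approximated by a set of `(output_row, input_row)` coordinates.

```python
class OperationProvenance:
    """Provenance of a single pipeline operation (hypothetical structure)."""

    def __init__(self, record_links, schema_map):
        # Retrospective part: the non-zero entries of a sparse binary
        # matrix, stored as (output_row, input_row) coordinate pairs.
        self.record_links = set(record_links)
        # Prospective part: output attribute -> set of input attributes
        # it is derived from (known from the operation's definition).
        self.schema_map = dict(schema_map)

    def backward(self, out_row, out_attr):
        """Return the input cells (row, attribute) an output cell depends on."""
        in_rows = {i for (o, i) in self.record_links if o == out_row}
        in_attrs = self.schema_map.get(out_attr, set())
        return {(i, a) for i in in_rows for a in in_attrs}


# Toy operation (assumed for illustration): drop rows with a missing "age",
# then derive "age_group" from "age". Input row 1 is filtered out, so
# output rows 0 and 1 come from input rows 0 and 2 respectively.
op = OperationProvenance(
    record_links={(0, 0), (1, 2)},
    schema_map={"name": {"name"}, "age_group": {"age"}},
)

print(sorted(op.backward(out_row=1, out_attr="age_group")))  # [(2, 'age')]
```

The record-level links alone would only say that output row 1 came from input row 2. Intersecting them with the schema mapping narrows the answer to the single input cell that contributed to the queried output attribute.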
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Software System Performance and Reliability
