In-Memory Indexing and Querying of Provenance in Data Preparation Pipelines

Khalid Belhajjame; Haroun Mezrioui; Yuyan Zhao

arXiv:2511.03480·cs.DB·November 6, 2025

Open Access

TL;DR

This paper introduces an efficient tensor-based indexing mechanism for capturing and querying detailed data provenance in data preparation pipelines, supporting both retrospective and prospective provenance at fine granularity.

Contribution

The paper presents a tensor-based approach that captures retrospective and prospective provenance and combines them for efficient querying in data preparation pipelines.

Findings

01 Supports fine-grained attribute-level provenance

02 Achieves efficient querying with minimal memory overhead

03 Demonstrates effectiveness on real and synthetic data

Abstract

Data provenance has numerous applications in the context of data preparation pipelines. It can be used for debugging faulty pipelines, interpreting results, verifying fairness, and identifying data quality issues, which may affect the sources feeding the pipeline execution. In this paper, we present an indexing mechanism to efficiently capture and query pipeline provenance. Our solution leverages tensors to capture fine-grained provenance of data processing operations, using minimal memory. In addition to record-level lineage relationships, we provide finer granularity at the attribute level. This is achieved by augmenting tensors, which capture retrospective provenance, with prospective provenance information, drawing connections between input and output schemas of data processing operations. We demonstrate how these two types of provenance (retrospective and prospective) can be…
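
To make the idea in the abstract concrete, the following is a minimal sketch of how record-level lineage and schema-level mappings might be combined to answer an attribute-level backward query. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the tensor layout, so the sparse binary matrix is modelled here as a set of (output_row, input_row) coordinates, and the names `OpProvenance`, `schema_map`, `backward`, and `trace` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OpProvenance:
    """Provenance of one operation (hypothetical structure, not the paper's)."""
    # Retrospective provenance: a sparse binary matrix stored as the set of
    # (output_row, input_row) coordinates whose entry is 1.
    lineage: set = field(default_factory=set)
    # Prospective provenance: output attribute -> input attributes it derives from.
    schema_map: dict = field(default_factory=dict)

    def backward(self, out_row, out_attr):
        """Return the input cells (row, attr) that contributed to one output cell."""
        in_rows = {i for (o, i) in self.lineage if o == out_row}
        in_attrs = self.schema_map.get(out_attr, set())
        return {(r, a) for r in in_rows for a in in_attrs}


def trace(ops, out_row, out_attr):
    """Compose backward queries through a chain of operations, last one first."""
    cells = {(out_row, out_attr)}
    for op in reversed(ops):
        cells = {c for (r, a) in cells for c in op.backward(r, a)}
    return cells


# Assumed example pipeline: a filter that keeps input rows 0 and 2,
# followed by a projection that derives "label" from "name" and "age".
filter_op = OpProvenance(
    lineage={(0, 0), (1, 2)},
    schema_map={"name": {"name"}, "age": {"age"}},
)
project_op = OpProvenance(
    lineage={(0, 0), (1, 1)},
    schema_map={"label": {"name", "age"}},
)

# Which source cells fed the "label" value of output row 1?
result = trace([filter_op, project_op], out_row=1, out_attr="label")
```

In this sketch the row coordinates answer "which records", the schema map answers "which attributes", and their product gives cell-level lineage without storing a separate entry per cell. A real implementation would likely use a sparse tensor library rather than Python sets.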

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices · Software System Performance and Reliability