Fast Capture of Cell-Level Provenance in Numpy

Jinjin Zhao; Sanjay Krishnan

arXiv:2506.18255 · cs.DB · June 24, 2025

TL;DR

This paper introduces a prototype system for capturing cell-level provenance in numpy arrays, addressing challenges like API evolution and large datasets to improve reproducibility and data governance.

Contribution

It presents a novel annotation system for numpy that efficiently captures cell-level provenance, with memory optimizations to reduce latency.
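To make "cell-level provenance" concrete, here is a minimal sketch of one way to annotate an array so that every output cell can be traced back to the input cell it came from. It is an illustration under stated assumptions, not the paper's implementation: the helper names `annotate` and `apply_with_provenance` are hypothetical, and the approach shown only covers operations that move or select cells (transpose, slicing, reshaping), not operations that combine several cells into one.

```python
import numpy as np

def annotate(source):
    """Hypothetical helper: give each cell its flat index in `source`."""
    return np.arange(source.size, dtype=np.int64).reshape(source.shape)

def apply_with_provenance(op, data, prov):
    """Hypothetical helper: apply a cell-rearranging `op` to the data
    and to its annotation array in the same way."""
    return op(data), op(prov)

source = np.array([[10, 20, 30],
                   [40, 50, 60]])
prov = annotate(source)

# Transpose, then drop the first row; both steps only move or select cells,
# so the annotation array stays aligned with the data.
out, out_prov = apply_with_provenance(lambda a: a.T[1:], source, prov)

# Map every output cell back to the (row, col) of its source cell.
rows, cols = np.unravel_index(out_prov, source.shape)
```

Indexing the source with the recovered coordinates, `source[rows, cols]`, reproduces `out`, which is a quick check that the annotations still point at the right cells.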

Findings

01 Memory optimizations significantly reduce annotation latency

02 Prototype effectively captures cell-level provenance in numpy

03 Envisions integration into broader data governance systems

Abstract

Effective provenance tracking enhances reproducibility, governance, and data quality in array workflows. However, significant challenges arise in capturing this provenance, including: (1) rapidly evolving APIs, (2) diverse operation types, and (3) large-scale datasets. To address these challenges, this paper presents a prototype annotation system designed for arrays, which captures cell-level provenance specifically within the numpy library. With this prototype, we explore straightforward memory optimizations that substantially reduce annotation latency. We envision this provenance capture approach for arrays as part of a broader governance system for tracking structured data workflows and diverse data science applications.
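The abstract does not say which memory optimizations the prototype uses. The sketch below illustrates only one kind of saving that is possible when recording provenance for a reduction, and it should be read as an assumption, not as the authors' method. For a row-wise sum over a C-contiguous array, the cells contributing to each output form a contiguous run of flat indices, so a half-open `[start, stop)` range can stand in for an explicit index list. The names `explicit`, `ranges`, and `expand` are hypothetical.

```python
import numpy as np

source = np.arange(12, dtype=np.float64).reshape(3, 4)
row_sums = source.sum(axis=1)

n_rows, n_cols = source.shape

# Explicit form: row i lists the flat index of every input cell
# that contributes to row_sums[i].
explicit = np.arange(source.size, dtype=np.int64).reshape(source.shape)

# Compact form (illustrative assumption): the same information stored as
# one half-open [start, stop) range of flat indices per output cell.
starts = np.arange(n_rows, dtype=np.int64) * n_cols
ranges = np.stack([starts, starts + n_cols], axis=1)

def expand(range_row):
    """Recover the explicit list of flat indices from a [start, stop) pair."""
    start, stop = range_row
    return np.arange(start, stop, dtype=np.int64)
```

Here `explicit` holds one integer per input cell while `ranges` holds two integers per output cell, so the gap between `explicit.nbytes` and `ranges.nbytes` widens as rows get longer.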
