Fast Capture of Cell-Level Provenance in Numpy
Jinjin Zhao, Sanjay Krishnan

TL;DR
This paper introduces a prototype system for capturing cell-level provenance in numpy arrays, addressing challenges like API evolution and large datasets to improve reproducibility and data governance.
Contribution
It presents a novel annotation system for numpy that efficiently captures cell-level provenance, with memory optimizations to reduce latency.
Findings
Memory optimizations significantly reduce annotation latency
Prototype effectively captures cell-level provenance in numpy
Envisions integration into broader data governance systems
Abstract
Effective provenance tracking enhances reproducibility, governance, and data quality in array workflows. However, significant challenges arise in capturing this provenance, including: (1) rapidly evolving APIs, (2) diverse operation types, and (3) large-scale datasets. To address these challenges, this paper presents a prototype annotation system designed for arrays, which captures cell-level provenance specifically within the numpy library. With this prototype, we explore straightforward memory optimizations that substantially reduce annotation latency. We envision this provenance capture approach for arrays as part of a broader governance system for tracking structured data workflows and diverse data science applications.
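To make "cell-level provenance" concrete, here is a minimal sketch of the idea, not the authors' prototype. The `TrackedArray` class and its `prov` attribute are hypothetical names introduced for illustration. Each output cell carries the set of (array name, index) input cells it was derived from. The sketch assumes same-shaped operands (no broadcasting) and a non-negative `axis`, and it uses a naive per-cell Python set rather than any of the memory optimizations the abstract mentions.

```python
import numpy as np


class TrackedArray:
    """Hypothetical wrapper: values plus, per cell, the set of source cells."""

    def __init__(self, values, name=None, prov=None):
        self.values = np.asarray(values)
        if prov is None:
            # A source array: every cell's provenance is the cell itself.
            prov = np.empty(self.values.shape, dtype=object)
            for idx in np.ndindex(self.values.shape):
                prov[idx] = frozenset({(name, idx)})
        self.prov = prov

    def __add__(self, other):
        # Element-wise op: an output cell depends on the same-index cell
        # of each operand (assumes identical shapes, no broadcasting).
        out = self.values + other.values
        prov = np.empty(out.shape, dtype=object)
        for idx in np.ndindex(out.shape):
            prov[idx] = self.prov[idx] | other.prov[idx]
        return TrackedArray(out, prov=prov)

    def sum(self, axis):
        # Reduction: an output cell depends on every input cell along
        # the reduced axis (assumes a non-negative integer axis).
        out = self.values.sum(axis=axis)
        prov = np.empty(out.shape, dtype=object)
        for idx in np.ndindex(out.shape):
            cells = frozenset()
            for k in range(self.values.shape[axis]):
                src = idx[:axis] + (k,) + idx[axis:]
                cells |= self.prov[src]
            prov[idx] = cells
        return TrackedArray(out, prov=prov)


a = TrackedArray([[1, 2], [3, 4]], name="a")
b = TrackedArray([[10, 20], [30, 40]], name="b")
c = (a + b).sum(axis=0)
print(c.values)           # column sums of a + b
print(sorted(c.prov[0]))  # input cells that contributed to c[0]
```

Even at this scale the cost is visible: the provenance structure is larger than the data it describes, which is presumably why annotation latency and memory layout matter for large arrays.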
