Pipeline Provenance for Analysis, Evaluation, Trust or Reproducibility
Michael A. C. Johnson, Hans-Rainer Klöckner, Albina Muzafarova, Kristen Lackeos, David J. Champion, Marta Dembska, Sirko Schindler, Marcus Paradies

TL;DR
This paper introduces PRAETOR, a software suite that automates the generation, modelling, and analysis of provenance for Python pipelines. It aims to improve reproducibility and trust, and to enable pipeline performance evaluation that can feed machine learning optimization.
Contribution
It presents PRAETOR, a novel tool for automated provenance modeling and analysis in Python pipelines, supporting reproducibility and performance assessment.
Findings
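Automates provenance generation for Python pipelines.
Supports evaluation of pipeline performance using a user-defined quality matrix stored in the provenance.
Makes that evaluation available to dedicated optimisation procedures, as a first step toward machine learning processes.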
Abstract
Data volumes and rates of research infrastructures will continue to increase in the upcoming years and impact how we interact with their final data products. Little of the processed data can be directly investigated and most of it will be automatically processed with as little user interaction as possible. Capturing all necessary information of such processing ensures reproducibility of the final results and generates trust in the entire process. We present PRAETOR, a software suite that enables automated generation, modelling, and analysis of provenance information of Python pipelines. Furthermore, the evaluation of the pipeline performance, based upon a user-defined quality matrix in the provenance, enables the first step of machine learning processes, where such information can be fed into dedicated optimisation procedures.
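Example
The idea of capturing provenance automatically from a Python pipeline can be illustrated with a small sketch. This is not PRAETOR's API: the decorator `record_provenance`, the in-memory `PROVENANCE_LOG`, the helper `attach_quality`, and the two toy pipeline steps are all hypothetical names chosen for illustration. The sketch assumes that a provenance record holds, at minimum, the activity name, what it used, what it generated, and how long it took, and that a user-defined quality value can be attached to a record afterwards.

```python
import functools
import time

# Hypothetical in-memory provenance store (illustration only, not PRAETOR).
PROVENANCE_LOG = []


def record_provenance(func):
    """Wrap func so every call appends one provenance record to the log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,                        # which step ran
            "used": {"args": repr(args), "kwargs": repr(kwargs)},  # its inputs
            "generated": repr(result),                        # its output
            "duration_s": time.perf_counter() - start,        # wall-clock runtime
        })
        return result
    return wrapper


def attach_quality(record, metric_name, value):
    """Attach a user-defined quality value to an existing provenance record."""
    record.setdefault("quality", {})[metric_name] = value


# Two toy pipeline steps, assumed here purely for demonstration.
@record_provenance
def calibrate(samples, gain):
    return [s * gain for s in samples]


@record_provenance
def mean(samples):
    return sum(samples) / len(samples)


result = mean(calibrate([1.0, 2.0, 3.0], gain=2.0))
attach_quality(PROVENANCE_LOG[-1], "mean_value", result)

for entry in PROVENANCE_LOG:
    print(entry["activity"], entry["used"], entry.get("quality"))
```

Records collected this way could then be queried or handed to an optimisation routine that varies parameters such as `gain` and compares the attached quality values. How PRAETOR itself models and stores provenance is described in the paper, not in this sketch.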
