TL;DR
This paper proposes criteria for long-term, sustainable reproducibility of analysis pipelines, introduces Maneage as a proof-of-concept implementation, and demonstrates that longevity can be achieved without sacrificing short-term reproducibility.
Contribution
It defines a set of criteria for durable reproducibility and presents Maneage, a system enabling long-term archiving and verification of scientific analyses.
Findings
Maneage enables cheap archiving, provenance extraction, and peer verification.
Longevity does not compromise short-term reproducibility.
Maneage has been tested in several research publications.
Abstract
Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term. A set of criteria is introduced to address this problem: Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free and open source software. As a proof of concept, we introduce "Maneage" (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications. We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. The caveats (with proposed…
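To make the "verifiable inputs and outputs" criterion concrete, here is a minimal sketch of checksum-based verification. It is an illustration only and not Maneage's own implementation; the function names (sha256_of, verify, demo) and the file name input.txt are hypothetical, and Python is used purely for brevity.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """True if the file's current checksum matches the recorded one."""
    return sha256_of(path) == expected


def demo():
    """Record a checksum, then show that a modified file fails verification."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "input.txt"  # hypothetical input dataset
        data.write_text("1 2 3\n")
        recorded = sha256_of(data)       # stored alongside the analysis
        ok_before = verify(data, recorded)
        data.write_text("1 2 4\n")       # simulate a silently changed input
        ok_after = verify(data, recorded)
    return ok_before, ok_after
```

In this sketch the recorded checksum would be kept in plain text under version control with the analysis, so a later run can detect whether an input or output has changed since publication.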
