How do users design scientific workflows? The Case of Snakemake
Sebastian Pohl, Nourhan Elfaramawy, Kedi Cao, Birte Kehr, and Matthias Weidlich

TL;DR
This paper investigates how users design scientific workflows with Snakemake by analyzing 1602 GitHub repositories, revealing common structures and language features to inform the development of more interactive workflow systems.
Contribution
The paper provides an empirical analysis of real-world Snakemake workflows, identifying typical structural patterns and commonly used language features to inform the design of more user-friendly, interactive workflow systems.
Findings
Identified common workflow structures in Snakemake scripts
Analyzed frequently used language features in workflow specifications
Provided insights to improve interactive workflow design
Abstract
Scientific workflows automate the analysis of large-scale scientific data, fostering the reuse of data processing operators as well as the reproducibility and traceability of analysis results. In exploratory research, however, workflows are continuously adapted, utilizing a wide range of tools and software libraries, to test scientific hypotheses. Script-based workflow engines cater to the required flexibility through direct integration of programming primitives but lack abstractions for interactive exploration of the workflow design by a user during workflow execution. To derive requirements for such interactive workflows, we conduct an empirical study on the use of Snakemake, a popular Python-based workflow engine. Based on workflows collected from 1602 GitHub repositories, we present insights on common structures of Snakemake workflows, as well as the language features typically…
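To make the kind of analysis described in the abstract concrete, the following is a minimal sketch of how one might count rules and directives in a single Snakemake workflow file. It is not the authors' tooling or method: the example workflow text, the function name `summarize_snakefile`, and the regex-based heuristic are all assumptions made for illustration, and a real study would more likely rely on a proper parser than on line-oriented patterns.

```python
import re
from collections import Counter

# Hypothetical example workflow text; not taken from the study's dataset.
SNAKEFILE = '''
rule all:
    input:
        "results/summary.txt"

rule align:
    input:
        "data/{sample}.fastq"
    output:
        "aligned/{sample}.bam"
    shell:
        "aligner {input} > {output}"

rule summarize:
    input:
        "aligned/a.bam"
    output:
        "results/summary.txt"
    script:
        "scripts/summarize.py"
'''

# Heuristic patterns (an assumption, not Snakemake's own grammar):
# a rule starts with "rule <name>:" at the beginning of a line,
# and a directive is an indented keyword followed by a colon.
RULE_RE = re.compile(r"^rule[ \t]+(\w+)[ \t]*:", re.MULTILINE)
DIRECTIVE_RE = re.compile(
    r"^[ \t]+(input|output|params|shell|run|script|wrapper)[ \t]*:",
    re.MULTILINE,
)


def summarize_snakefile(text):
    """Return the rule names and a count of directives found in the text."""
    rules = RULE_RE.findall(text)
    directives = Counter(DIRECTIVE_RE.findall(text))
    return rules, directives


rules, directives = summarize_snakefile(SNAKEFILE)
print("rules:", rules)
print("directives:", dict(directives))
```

Applied across many repositories, per-file summaries of this kind could be aggregated to see which rule structures and directives are common, which is the flavor of question the study addresses.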
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
