Validity Constraints for Data Analysis Workflows
Florian Schintke, Ninon De Mecquenem, David Frantz, Vanessa Emanuela Guarino, Marcus Hilbrich, Fabian Lehmann, Rebecca Sattler, Jan Arne Sparka, Daniel Speckhard, Hermann Stolte, Anh Duc Vu, Ulf Leser

TL;DR
This paper introduces validity constraints (VCs) for data analysis workflows (DAWs) to explicitly specify assumptions, improve error detection, and enhance portability and robustness across different computing environments.
Contribution
It proposes a new concept of validity constraints for DAWs, enabling automatic validation, better error handling, and making implicit assumptions explicit.
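To make the idea concrete, the following is a minimal Python sketch of how an implicit assumption could be stated as an explicit, checkable constraint. It is not the paper's implementation: the `ValidityConstraint` type, the `validate` function, the context keys (`memory_gb`, `input_rows`), and the thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidityConstraint:
    """A named, checkable assumption about a task's context (hypothetical type)."""
    name: str
    check: Callable[[Dict[str, float]], bool]
    message: str


def validate(context: Dict[str, float],
             constraints: List[ValidityConstraint]) -> List[str]:
    """Return a readable message for every constraint the context violates."""
    return [f"{vc.name}: {vc.message}"
            for vc in constraints
            if not vc.check(context)]


# Illustrative assumptions that a task might otherwise leave implicit.
# The keys and thresholds are assumptions made for this sketch.
constraints = [
    ValidityConstraint(
        "min_memory",
        lambda c: c.get("memory_gb", 0) >= 8,
        "task needs at least 8 GB of memory",
    ),
    ValidityConstraint(
        "non_empty_input",
        lambda c: c.get("input_rows", 0) > 0,
        "input dataset must not be empty",
    ),
]

# Check a hypothetical execution context before the task starts.
violations = validate({"memory_gb": 4, "input_rows": 1200}, constraints)
for line in violations:
    print(line)
```

Checking such constraints before a task starts turns a potential crash without a reasonable error message into an explicit report of which assumption was violated.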
Findings
Broad list and classification of VCs
Comparison with related concepts
Initial implementation sketch for existing infrastructures
Abstract
Porting a scientific data analysis workflow (DAW) to a cluster infrastructure, a new software stack, or even only a new dataset with some notably different properties is often challenging. Despite the structured definition of the steps (tasks) and their interdependencies during a complex data analysis in the DAW specification, relevant assumptions may remain unspecified and implicit. Such hidden assumptions often lead to crashing tasks without a reasonable error message, poor performance in general, non-terminating executions, or silently wrong results of the DAW, to name only a few possible consequences. Searching for the causes of such errors and drawbacks in a distributed compute cluster managed by a complex infrastructure stack, where DAWs for large datasets are typically executed, can be tedious and time-consuming. We propose validity constraints (VCs) as a new concept for DAW…
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Distributed systems and fault tolerance
