Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Shawn Bowers, Timothy McPhillips, Bertram Ludäscher

TL;DR
This paper introduces a reasoning framework that validates and infers detailed data dependency annotations in scientific workflows, improving the accuracy of provenance and lineage analysis.
Contribution
It extends previous dependency-annotation work with a consistency-checking and inference framework for workflow specifications, implemented using answer-set programming.
Findings
The framework checks that dependency annotations are consistent with the workflow specification
It can infer complete annotations from partial annotations
An implementation demonstrates practical applicability
Abstract
An advantage of scientific workflow systems is their ability to collect runtime provenance information as an execution trace. Traces include the computation steps invoked as part of the workflow run along with the corresponding data consumed and produced by each workflow step. The information captured by a trace is used to infer "lineage" relationships among data items, which can help answer provenance queries to find workflow inputs that were involved in producing specific workflow outputs. Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In prior work, we defined annotations for…
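The lineage "false positive" problem described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's annotation language or implementation: the trace layout, the step and data-item names, and the helper functions `direct_deps_all_to_all` and `lineage` are hypothetical. The sketch compares lineage computed under the common "all outputs depend on all inputs" assumption with lineage computed from finer-grained, per-output dependencies.

```python
# Hypothetical execution trace: each invocation records the data items
# it consumed and produced (names are illustrative only).
trace = [
    {"step": "partition", "inputs": ["a", "b"], "outputs": ["x", "y"]},
    {"step": "plot", "inputs": ["x"], "outputs": ["out"]},
]

# Hypothetical fine-grained dependencies, as annotations might supply them:
# each output maps to the inputs it actually depends on.
annotated_deps = {"x": {"a"}, "y": {"b"}, "out": {"x"}}


def direct_deps_all_to_all(trace):
    """Assume every output of a step depends on every input of that step."""
    deps = {}
    for invocation in trace:
        for out in invocation["outputs"]:
            deps.setdefault(out, set()).update(invocation["inputs"])
    return deps


def lineage(item, deps):
    """Return all items that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in deps.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


coarse = lineage("out", direct_deps_all_to_all(trace))
fine = lineage("out", annotated_deps)
false_positives = coarse - fine

print(sorted(coarse))           # ['a', 'b', 'x']
print(sorted(fine))             # ['a', 'x']
print(sorted(false_positives))  # ['b']
```

In this toy trace, the coarse assumption reports workflow input `b` as part of the lineage of `out`, even though `out` is derived only from `x`, which in turn depends only on `a`.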
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Advanced Database Systems and Queries
