The data, they are a-changin'
Paolo Missier, Jacek Cala, Eldarina Wijaya

TL;DR
This paper explores how the value of knowledge derived from large datasets decays over time as data processing environments evolve, and proposes an initial model for re-computation decisions, illustrated with a genomics case study.
Contribution
It introduces an initial model for reasoning about change and re-computation, informed by provenance analysis, to maintain data value over time.
Findings
Proposed a model for re-computation decisions based on change analysis.
Demonstrated the approach with a genomics case study.
Showed how provenance analysis informs re-computation choices.
Abstract
The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low-cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. One observation that is often overlooked, however, is that none of these elements is immutable; they all evolve over time. This suggests that the value of such derivative knowledge may decay over time, unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e. through complete or partial re-computation of some of the underlying processes. In this paper we present an initial model for reasoning about change and…
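The idea of selectively reacting to changes by balancing costs and benefits can be illustrated with a minimal sketch. This is not the model from the paper: the `Outcome` fields, the benefit and cost figures, and the greedy benefit-per-cost selection under a budget are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    """A previously computed result that a change may have invalidated (hypothetical)."""
    name: str
    recompute_cost: float    # assumed estimate of the cost to re-run the process
    expected_benefit: float  # assumed estimate of the value of a refreshed result


def select_for_recomputation(outcomes: List[Outcome], budget: float) -> List[Outcome]:
    """Greedily pick outcomes worth refreshing, without exceeding the budget.

    Only outcomes whose expected benefit exceeds their cost are considered;
    they are ranked by benefit per unit cost, highest first.
    """
    def ratio(o: Outcome) -> float:
        # A zero-cost refresh is always worth taking first.
        if o.recompute_cost == 0:
            return float("inf")
        return o.expected_benefit / o.recompute_cost

    candidates = [o for o in outcomes if o.expected_benefit > o.recompute_cost]
    candidates.sort(key=ratio, reverse=True)

    selected: List[Outcome] = []
    spent = 0.0
    for o in candidates:
        if spent + o.recompute_cost <= budget:
            selected.append(o)
            spent += o.recompute_cost
    return selected


# Illustrative figures only; they do not come from the paper.
outcomes = [
    Outcome("A", recompute_cost=10, expected_benefit=30),
    Outcome("B", recompute_cost=5, expected_benefit=4),
    Outcome("C", recompute_cost=20, expected_benefit=40),
    Outcome("D", recompute_cost=8, expected_benefit=12),
]
chosen = select_for_recomputation(outcomes, budget=20)
print([o.name for o in chosen])
```

A greedy ranking is used here only because it is simple to read; it is not guaranteed to be optimal, and a real decision procedure would need cost and benefit estimates derived from the actual change, for example through provenance analysis.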
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Genomics and Phylogenetic Studies
