Preserving the value of large scale data analytics over time through selective re-computation
Paolo Missier, Jacek Cala, Maisha Rathi

TL;DR
This paper addresses the challenge of maintaining the value of large-scale data analytics over time by proposing a formal framework for selective re-computation, considering data drift, algorithm improvements, and evolving reference datasets.
Contribution
It introduces a formalization of the problem of re-computation decision-making in data analytics, emphasizing a generic and customizable approach supported by metadata analysis.
Findings
Initial formalization of the re-computation decision problem
Identification of key challenges in value decay and cost estimation
Proposed approach based on metadata analysis from computation history
Abstract
A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes is subject to decay over time, as the data used to compute it drifts, the algorithms used in the processes are improved, and the external knowledge embodied by reference datasets used in the computation evolves. Deciding when such knowledge outcomes should be refreshed, following a sequence of data change events, requires problem-specific functions to quantify their value and its decay over time, as well as models for estimating the cost of their re-computation. What makes this problem challenging is the ambition to develop a decision support system for informing data analytics re-computation decisions over time, that is both generic and customisable. With the help of a case study from genomics, in this vision paper we offer an initial formalisation of this problem, highlight…
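To make the decision problem concrete, the following is a minimal sketch, not the authors' method. It assumes each knowledge outcome comes with a problem-specific value-decay function and an estimated re-computation cost, and it uses a simple greedy heuristic (highest estimated lost value per unit cost first) under a fixed budget. The names `Outcome`, `value_decay`, `recompute_cost`, and `select_for_recomputation` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Outcome:
    """A previously computed result that may have decayed (hypothetical model)."""
    name: str
    age_days: float                        # time since the outcome was last computed
    recompute_cost: float                  # estimated cost of re-running the analysis
    value_decay: Callable[[float], float]  # maps age to estimated fraction of value lost


def select_for_recomputation(outcomes: List[Outcome], budget: float) -> List[str]:
    """Greedily pick outcomes to refresh, ranked by lost value per unit cost.

    This is an illustrative heuristic under the assumptions stated above,
    not an optimal or paper-specified procedure.
    """
    ranked = [
        (o.value_decay(o.age_days) / o.recompute_cost, o)
        for o in outcomes
        if o.recompute_cost > 0
    ]
    ranked.sort(key=lambda pair: pair[0], reverse=True)

    chosen: List[str] = []
    remaining = budget
    for ratio, outcome in ranked:
        if ratio <= 0:
            continue  # nothing to gain from refreshing this outcome
        if outcome.recompute_cost <= remaining:
            chosen.append(outcome.name)
            remaining -= outcome.recompute_cost
    return chosen
```

In practice both the decay function and the cost estimate would have to be learned or supplied per application, for example from metadata recorded during past executions, which is the kind of customisation the abstract calls for.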
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Scientific Computing and Data Management
