An Abstract View of Big Data Processing Programs
Joao Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante

TL;DR
This paper introduces a formal model for specifying Big Data processing programs that captures both non-iterative and iterative workflows, enabling better understanding and comparison of different frameworks' strategies.
Contribution
It extends an existing data flow model to include iterative programs and generalizes iteration strategies across multiple Big Data frameworks.
Findings
Model captures data flow and transformation at two levels
Generalizes iteration strategies of Spark, DryadLINQ, Beam, Flink
Facilitates formal comparison of Big Data frameworks
Abstract
This paper proposes a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks. The paper focuses on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow Big Data processing frameworks. The proposed model relies on Monoid Algebra and Petri Nets to abstract Big Data processing programs in two levels: a high level representing the program data flow and a lower level representing data transformation operations (e.g., filtering, aggregation, join). We extend the model for data processing programs proposed in [1] to enable the use of iterative programs. The general specification of iterative data processing programs implemented by data flow-based parallel programming models is essential given the democratization of iterative and greedy Big Data analytics…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Distributed systems and fault tolerance
