Reproducible and Portable Big Data Analytics in the Cloud
Xin Wang, Pei Guo, Xingyan Li, Aryya Gangopadhyay, Carl E. Busart, Jade Freeman, Jianwu Wang

TL;DR
This paper introduces an open-source toolkit that automates scalable, reproducible big data analytics in the cloud, addressing portability and automation challenges across different cloud providers using serverless and containerization techniques.
Contribution
It presents a novel toolkit leveraging serverless computing and containerization to automate scalable execution and enable cross-cloud application portability for big data analytics.
Findings
The toolkit supports fully automated end-to-end execution with a single command.
It enables reproducibility of analytics across different cloud environments.
Experiments demonstrate good performance, scalability, and reproducibility on AWS and Azure.
Abstract
Cloud computing has become a major approach to help reproduce computational experiments. Yet there are still two main difficulties in reproducing batch-based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics, including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, the so-called vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports…
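The adapter design pattern mentioned in the abstract can be illustrated with a minimal sketch. This is not the toolkit's code: every class, method, and return value below is hypothetical, and the two "client" classes are in-memory stand-ins rather than real AWS or Azure SDK calls. The sketch only shows the general idea that a pipeline written against one common interface can run on different clouds, with resources provisioned before execution and terminated afterwards.

```python
from abc import ABC, abstractmethod


# Hypothetical stand-ins for vendor SDKs. They deliberately expose
# different method names and handle types, as real providers would.
class FakeAwsClient:
    def create_cluster(self, nodes):
        return f"aws-cluster-{nodes}"

    def submit_job(self, cluster_id, image):
        return f"{cluster_id}:ran:{image}"

    def delete_cluster(self, cluster_id):
        return True


class FakeAzureClient:
    def begin_create_pool(self, size):
        return {"pool": f"azure-pool-{size}"}

    def run_container(self, pool, image_name):
        return f"{pool['pool']}:ran:{image_name}"

    def remove_pool(self, pool):
        return True


class CloudAdapter(ABC):
    """Common interface the pipeline codes against (an assumption of this sketch)."""

    @abstractmethod
    def provision(self, nodes):
        """Create compute resources and return an opaque handle."""

    @abstractmethod
    def execute(self, handle, image):
        """Run a containerized analytics job on the provisioned resources."""

    @abstractmethod
    def terminate(self, handle):
        """Release the provisioned resources."""


class AwsAdapter(CloudAdapter):
    def __init__(self, client):
        self._client = client

    def provision(self, nodes):
        return self._client.create_cluster(nodes)

    def execute(self, handle, image):
        return self._client.submit_job(handle, image)

    def terminate(self, handle):
        return self._client.delete_cluster(handle)


class AzureAdapter(CloudAdapter):
    def __init__(self, client):
        self._client = client

    def provision(self, nodes):
        return self._client.begin_create_pool(nodes)

    def execute(self, handle, image):
        return self._client.run_container(handle, image)

    def terminate(self, handle):
        return self._client.remove_pool(handle)


def run_pipeline(adapter, image, nodes):
    """Provision, execute, and always terminate, independent of the cloud."""
    handle = adapter.provision(nodes)
    try:
        return adapter.execute(handle, image)
    finally:
        adapter.terminate(handle)
```

With this shape, switching providers means passing a different adapter to `run_pipeline`, for example `run_pipeline(AzureAdapter(FakeAzureClient()), "analytics:1.0", 4)`, while the pipeline code itself stays unchanged.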
Taxonomy
Topics: Scientific Computing and Data Management · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
Methods: Adapter
