Optimization Opportunities for Cloud-Based Data Pipeline Infrastructures
Johannes Jablonski, Georg-Daniel Schwarz, Philip Heltweg, Dirk Riehle

TL;DR
This paper systematically reviews optimization opportunities in cloud-based data pipelines, focusing on cost, speed, and resource utilization, and identifies research gaps and future directions.
Contribution
The paper contributes a theory of optimization goals for cloud-based data pipelines, organized along dimensions such as single vs. multi-cloud and batch vs. stream processing, and highlights gaps in primary research.
Findings
Identifies key optimization goals: minimizing cost, reducing execution time, and cost-makespan trade-offs.
Highlights gaps in primary research, including underexplored multi-tenant environments and a lack of industry evaluation.
Suggests directions for future research.
Abstract
Cloud infrastructure supports the efficient operation of data pipelines with respect to requirements such as cost, speed, and resource utilization. We present an integrated view of optimization opportunities for cloud-based data pipelines, based on a systematic review of existing literature on optimizing cloud infrastructure performance for data pipelines. Our study contributes a theory of optimization goals, such as minimizing cost, reducing execution time, and balancing cost-makespan trade-offs, organized along dimensions such as single vs. multi-cloud and batch vs. stream processing. We highlight gaps in primary research, including the underexploration of multi-tenant environments and a lack of industry evaluation, and suggest directions for future research.
