Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach
Hernan Picatto, Georg Heiler, Peter Klimek

TL;DR
This paper presents a cost-effective, multi-platform data orchestration framework using Dagster that improves performance and reduces operational costs compared to traditional PaaS solutions like EMR and Databricks.
Contribution
It introduces a flexible, vendor-agnostic orchestration approach with significant cost savings and performance improvements for big data processing.
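To make "vendor-agnostic" concrete, the following is a minimal plain-Python sketch of the general pattern, not the authors' implementation and not Dagster's API: pipeline steps are written against a small backend interface, and the Spark execution environment is chosen by configuration. The names `SparkBackend`, `LocalSpark`, `ManagedSpark`, `run_pipeline`, and `BACKENDS` are hypothetical, and the backends only return strings instead of launching real jobs.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


class SparkBackend(Protocol):
    """Hypothetical interface: anything that can run a named Spark job."""

    name: str

    def submit(self, job: str) -> str:
        ...


@dataclass
class LocalSpark:
    """Stand-in for a self-managed or in-process Spark session."""

    name: str = "local"

    def submit(self, job: str) -> str:
        # A real backend would start or reuse a Spark session here.
        return f"[{self.name}] ran {job} in-process"


@dataclass
class ManagedSpark:
    """Stand-in for a managed platform such as EMR or Databricks."""

    name: str

    def submit(self, job: str) -> str:
        # A real backend would call the platform's job-submission API here.
        return f"[{self.name}] submitted {job} to a managed cluster"


def run_pipeline(steps: List[str], backend: SparkBackend) -> List[str]:
    """Run every step on the chosen backend; the steps never name a vendor."""
    return [backend.submit(step) for step in steps]


# Assumed registry: switching platforms is a configuration change only.
BACKENDS: Dict[str, SparkBackend] = {
    "local": LocalSpark(),
    "emr": ManagedSpark(name="emr"),
    "databricks": ManagedSpark(name="databricks"),
}

if __name__ == "__main__":
    for line in run_pipeline(["ingest", "aggregate"], BACKENDS["emr"]):
        print(line)
```

In an orchestrator such as Dagster, a backend like this would typically be supplied to the pipeline as a swappable resource, so the same step definitions can be pointed at a different execution environment per run.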
Findings
Achieved 12% performance improvement over EMR.
Realized 40% cost reduction compared to Databricks.
Saved over 300 euros per pipeline run.
Abstract
The rapid advancement of big data technologies has underscored the need for robust and efficient data processing solutions. Traditional Spark-based Platform-as-a-Service (PaaS) solutions, such as Databricks and Amazon Web Services Elastic MapReduce (EMR), provide powerful analytics capabilities but often result in high operational costs and vendor lock-in. These platforms, while user-friendly, can lead to significant inefficiencies due to their cost structures and lack of transparent pricing. This paper introduces a cost-effective and flexible orchestration framework using Dagster. Our solution aims to reduce dependency on any single PaaS provider by integrating various Spark execution environments. We demonstrate how Dagster's orchestration capabilities can enhance data processing efficiency, enforce best coding practices, and significantly reduce operational costs. In our…
Taxonomy
Topics: Big Data and Business Intelligence
