Supercharging Distributed Computing Environments For High Performance Data Engineering
Niranda Perera, Kaiying Shan, Supun Kamburugamuve, Thejaka Amila Kanewala, Chathura Widanage, Arup Sarker, Mills Staylor, Tianle Zhong, Vibhatha Abeykoon, Geoffrey Fox

TL;DR
This paper introduces CylonFlow, a high-performance distributed dataframe system that significantly enhances scalability and speed on Dask and Ray, outperforming existing solutions like Dask Dataframes by up to 30 times.
Contribution
The paper presents CylonFlow, which integrates the Cylon system into Dask and Ray and uses a novel execution paradigm to optimize distributed dataframe processing.
Findings
CylonFlow achieves up to 30x better distributed performance than Dask Dataframes.
CylonFlow delivers superior sequential performance thanks to native C++ execution.
The approach can unify high-performance computing and distributed data engineering ecosystems.
Abstract
The data engineering and data science community has embraced the idea of using Python & R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these applications are now essential in order to process terabytes of data. They can easily exceed the capabilities of a single machine, but also demand significant developer time & effort. Therefore it is essential to design scalable dataframe solutions. There have been multiple attempts to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask/Ray distributed computing features look very promising, we perceive that the Dask Dataframes/Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables…
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Distributed and Parallel Computing Systems
