High Performance Dataframes from Parallel Processing Patterns
Niranda Perera, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Ahmet Uyar, Kaiying Shan, Hasara Maithree, Damitha Lenadora, Thejaka Amila Kanewala, Geoffrey Fox

TL;DR
This paper introduces a framework for high-performance distributed-memory parallel dataframes, exemplified by Cylon, which overcomes performance limitations of traditional serial dataframes and is adaptable to various hardware configurations.
Contribution
It presents a novel framework for building scalable distributed dataframe systems and introduces Cylon as the first such system with demonstrated high performance.
Findings
Cylon achieves scalable high performance on large datasets.
The framework enables flexible and extensible distributed dataframe operations.
Cylon is the first distributed-memory parallel dataframe system available.
Abstract
The data science community today has embraced the concept of Dataframes as the de facto standard for data representation and manipulation. Ease of use, massive operator coverage, and the popularization of the R and Python languages have heavily influenced this transformation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations even when working on moderately large data sets. We believe that there is plenty of room for improvement by investigating the generic distributed patterns of dataframe operators. In this paper, we propose a framework that lays the foundation for building high performance distributed-memory parallel dataframe systems based on these parallel processing patterns. We also present Cylon as a reference runtime implementation. We demonstrate how this framework has enabled Cylon to achieve scalable high performance. We also…
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Scientific Computing and Data Management
