In-depth Analysis On Parallel Processing Patterns for High-Performance Dataframes
Niranda Perera, Arup Kumar Sarker, Mills Staylor, Gregor von Laszewski, Kaiying Shan, Supun Kamburugamuve, Chathura Widanage, Vibhatha Abeykoon, Thejaka Amila Kanewela, Geoffrey Fox

TL;DR
This paper explores parallel processing patterns for high-performance dataframes, introduces a cost model for evaluating these patterns, and assesses Cylon's performance on a supercomputer to improve data preprocessing efficiency.
Contribution
It extends previous work by developing a cost model for parallel dataframe patterns and evaluates Cylon on a supercomputer for high-performance data processing.
Findings
Cylon demonstrates significant performance improvements on large datasets.
The cost model effectively predicts the efficiency of different processing patterns.
Parallel patterns outperform serial dataframes in high-performance computing environments.
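To make the contrast between serial and parallel dataframe execution concrete, here is a minimal sketch of one partition-based pattern: split the rows, aggregate each partition locally, then merge the partial results. It is an illustration under stated assumptions, not Cylon's API or the paper's implementation. The function names (`serial_groupby_sum`, `partition`, `parallel_groupby_sum`) are hypothetical, rows are plain `(key, value)` tuples rather than a real dataframe, and a thread pool stands in for distributed workers only to show the structure. It would not by itself yield a speedup for CPU-bound work in CPython.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def serial_groupby_sum(rows):
    """Serial baseline: sum `value` per `key` over all rows."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def partition(rows, n_parts):
    """Split rows into n_parts partitions in round-robin order."""
    return [rows[i::n_parts] for i in range(n_parts)]


def parallel_groupby_sum(rows, n_parts=4):
    """Partition the rows, aggregate each partition, then merge the partials."""
    parts = partition(rows, n_parts)
    # Assumption: threads stand in for the workers of a distributed runtime.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(serial_groupby_sum, parts))
    # Merge step: combine the per-partition sums into one result.
    merged = defaultdict(int)
    for partial in partials:
        for key, total in partial.items():
            merged[key] += total
    return dict(merged)


rows = [(i % 5, i) for i in range(1000)]
assert parallel_groupby_sum(rows) == serial_groupby_sum(rows)
```

The merge step is where communication would occur in a distributed setting, so its cost relative to the local aggregation is the kind of trade-off a cost model for parallel patterns would need to capture.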
Abstract
The Data Science domain has expanded monumentally in both research and industry communities during the past decade, predominantly owing to the Big Data revolution. Artificial Intelligence (AI) and Machine Learning (ML) are bringing more complexities to data engineering applications, which are now integrated into data processing pipelines to process terabytes of data. Typically, a significant amount of time is spent on data preprocessing in these pipelines, and hence improving its efficiency directly impacts the overall pipeline performance. The community has recently embraced the concept of Dataframes as the de-facto data structure for data representation and manipulation. However, the most widely used serial Dataframes today (R, pandas) experience performance limitations while working on even moderately large data sets. We believe that there is plenty of room for improvement by taking…
Taxonomy
Topics: Advanced Data Storage Technologies · Scientific Computing and Data Management · Cloud Computing and Resource Management
