HiFrames: High Performance Data Frames in a Scripting Language
Ehsan Totoni, Wajih Ul Hassan, Todd A. Anderson, Tatiana Shpeisman

TL;DR
HiFrames is a compiler-integrated data frame system that combines expressive APIs with automatic parallelization, achieving significant speedups over existing solutions like Spark SQL in distributed data analytics tasks.
Contribution
The paper introduces a compiler-based approach that tightly integrates data frames with array computations, enabling high-performance distributed analytics in a scripting language.
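To make "data frames integrated with array computations" concrete, here is a minimal single-node sketch in pandas and NumPy. It is not the HiFrames API and involves no distribution or compilation. The function name, column names, and sample values are all hypothetical. It only shows the program shape the contribution targets: a relational step (a filter) feeding directly into an array step in the same program.

```python
import numpy as np
import pandas as pd

def mean_centered_spend(df: pd.DataFrame, min_qty: int) -> np.ndarray:
    """Filter rows relationally, then do array math on the surviving columns."""
    # Relational step: keep rows whose quantity meets the threshold.
    kept = df[df["qty"] >= min_qty]
    # Array step: treat the columns as plain NumPy arrays.
    spend = kept["price"].to_numpy() * kept["qty"].to_numpy()
    # Center the result around its mean (an ordinary array computation).
    return spend - spend.mean()

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"qty": [1, 5, 10], "price": [2.0, 3.0, 4.0]})
result = mean_centered_spend(df, min_qty=5)
```

In a system like the one described, both steps would be visible to one compiler, so they could be parallelized together rather than handed off between a SQL engine and an array library.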
Findings
HiFrames is 3.6x to 70x faster than Spark SQL on basic relational operations.
It can be up to 20,000x faster than Spark SQL for advanced analytics operations such as weighted moving averages.
HiFrames outperforms Spark SQL on TPCx-BB Q26 by 5x on 64 nodes.
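For readers unfamiliar with the weighted moving average mentioned above, the following is a small NumPy sketch of the operation itself. It is not HiFrames code, and the function name and the 1-2-1 weights are assumptions chosen for illustration. Each output value depends on a window of neighboring rows, which is the stencil-style access pattern that makes this operation a natural fit for array code.

```python
import numpy as np

def weighted_moving_average(x, weights):
    """Slide a normalized weight window over x and return the weighted means."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    # np.convolve flips its second argument, so reverse the weights to
    # apply them in the order given. "valid" keeps only full windows.
    return np.convolve(x, w[::-1], mode="valid")

# Hypothetical input; a 3-point window with weights 1, 2, 1.
wma = weighted_moving_average([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 1.0])
```

With four inputs and a three-point window, only two full windows exist, so `wma` has two entries.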
Abstract
Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not tightly integrated with array computations (e.g., Spark SQL). This paper proposes a novel compiler-based approach where we integrate data frames into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It provides expressive and flexible data frame APIs which are tightly integrated with array operations. HiFrames then automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. We demonstrate that HiFrames is significantly faster than alternatives such as Spark SQL on clusters, without forcing the programmer to switch to…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management · Advanced Data Storage Technologies
