FlashR: R-Programmed Parallel and Scalable Machine Learning using SSDs
Da Zheng, Disa Mhembere, Joshua T. Vogelstein, Carey E. Priebe, Randal Burns

TL;DR
FlashR enables scalable, parallel execution of R code for large datasets by leveraging SSDs and generalized operations, significantly improving performance over existing R implementations and other ML frameworks.
Contribution
It introduces a novel approach to scale R code using SSDs and generalized operations, allowing existing R code to run efficiently on large datasets with minimal modifications.
Findings
FlashR outperforms H2O and Spark MLlib by 2-10x in machine learning tasks.
FlashR's performance closely matches in-memory execution despite out-of-core processing.
It enables R to handle datasets with billions of data points efficiently.
Abstract
R is one of the most popular programming languages for statistics and machine learning, but the R framework is relatively slow and unable to scale to large datasets. The general approach for speeding up an implementation in R is to implement the algorithms in C or FORTRAN and provide an R wrapper. FlashR takes a different approach: it executes R code in parallel and scales the code beyond memory capacity by utilizing solid-state drives (SSDs) automatically. It provides a small number of generalized operations (GenOps) upon which we reimplement a large number of matrix functions in the R base package. As such, FlashR parallelizes and scales existing R code with little/no modification. To reduce data movement between CPU and SSDs, FlashR evaluates matrix operations lazily, fuses operations at runtime, and uses cache-aware, two-level matrix partitioning. We evaluate FlashR on a variety of…
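The lazy evaluation and operation fusion described in the abstract can be illustrated with a minimal sketch. This is Python with NumPy, not FlashR's R API. The names `LazyVector`, `chunked_range`, and `counter` are hypothetical, and the in-memory chunk generator merely stands in for data that would be read from SSDs. The idea shown: elementwise operations are recorded rather than executed, and a final aggregation applies all of them chunk by chunk in a single pass over the data.

```python
import numpy as np


class LazyVector:
    """Hypothetical lazy vector: records elementwise ops, runs them on demand."""

    def __init__(self, source, ops=()):
        self._source = source    # callable returning an iterator of NumPy chunks
        self._ops = tuple(ops)   # pending elementwise functions, in order

    def map(self, fn):
        # Lazy: only remember the operation, do not touch the data.
        return LazyVector(self._source, self._ops + (fn,))

    def sum(self):
        # Fused evaluation: every pending op is applied to a chunk while it
        # is in memory, so the data is streamed exactly once.
        total = 0.0
        for chunk in self._source():
            for fn in self._ops:
                chunk = fn(chunk)
            total += float(chunk.sum())
        return total


def chunked_range(n, chunk_size, counter):
    """Stand-in for out-of-core storage: yields 0..n-1 in fixed-size chunks."""

    def source():
        counter["passes"] += 1   # count full passes over the data
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            yield np.arange(start, stop, dtype=np.float64)

    return source


counter = {"passes": 0}
x = LazyVector(chunked_range(1000, 128, counter))

# Two elementwise steps are queued; no data has been read yet.
centered_sq = x.map(lambda c: c - 499.5).map(np.square)
passes_before_sum = counter["passes"]

# The aggregation triggers one fused pass over all chunks.
sum_of_squares = centered_sq.sum()
```

An eager implementation would materialize the centered vector and the squared vector separately, reading and writing the full dataset for each step. Deferring the work until `sum()` is what allows the two steps to share one pass, which is the kind of data movement the abstract says FlashR aims to reduce.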
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Graph Theory and Algorithms
