A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment
Jianguo Chen, Kenli Li, Zhuo Tang, Kashif Bilal, Shui Yu, Chuliang Weng, Keqin Li

TL;DR
This paper introduces a Parallel Random Forest algorithm optimized for big data in Spark, combining data and task parallelism, data reduction techniques, and dimension reduction to improve accuracy, performance, and scalability.
Contribution
The paper presents a novel hybrid parallelization approach for Random Forests on Spark, enhancing efficiency and accuracy for large, high-dimensional, and noisy datasets.
Findings
Outperforms Spark MLlib in accuracy and speed
Demonstrates high scalability on big data
Effective in noisy and high-dimensional data environments
Abstract
With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the…
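The two data-parallel ideas named in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the function names (`vertical_partition`, `sampling_index_table`, `best_threshold`) are hypothetical, plain Python lists stand in for distributed partitions, and Gini impurity is used as a stand-in split criterion. The sketch assumes that vertical partitioning means one (feature column, labels) subset per feature, and that data multiplexing means each tree stores bootstrap row indices rather than a copy of the data.

```python
import random
from collections import Counter


def vertical_partition(rows, labels):
    """Split row-oriented data into one (feature_index, column, labels) subset per feature.

    Each subset can be scored for splits on its own, so a worker needs only
    one column plus the labels rather than every full row.
    """
    n_features = len(rows[0])
    return [(j, [row[j] for row in rows], labels) for j in range(n_features)]


def sampling_index_table(n_samples, n_trees, seed=0):
    """Build one list of bootstrap row indices per tree (assumed form of data reuse).

    Trees share the same feature subsets and differ only in which row
    indices they read, so the training data is never duplicated.
    """
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_trees)]


def gini(labels):
    """Gini impurity of a list of class labels (stand-in split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(column, labels, indices):
    """Return (weighted_impurity, threshold) of the best binary split of one column.

    Only the rows listed in `indices` are considered, which is how a tree
    applies its bootstrap sample to a shared feature subset.
    """
    values = [column[i] for i in indices]
    ys = [labels[i] for i in indices]
    best = (float("inf"), None)
    # The largest value is skipped because it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, ys) if v <= t]
        right = [y for v, y in zip(values, ys) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[0]:
            best = (score, t)
    return best
```

In this sketch, every (tree, feature) pair is an independent call to `best_threshold`, which is the kind of unit that a distributed engine could schedule in parallel across trees and across features.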
