A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

Jianguo Chen; Kenli Li; Zhuo Tang; Kashif Bilal; Shui Yu; Chuliang Weng; Keqin Li

arXiv:1810.07748·cs.DC·November 26, 2019

TL;DR

This paper introduces a Parallel Random Forest algorithm optimized for big data in Spark, combining data and task parallelism, data reduction techniques, and dimension reduction to improve accuracy, performance, and scalability.

Contribution

The paper presents a novel hybrid parallelization approach for Random Forests on Spark, enhancing efficiency and accuracy for large, high-dimensional, and noisy datasets.

Findings

01 Outperforms Spark MLlib in accuracy and speed

02 Demonstrates high scalability on big data

03 Effective in noisy and high-dimensional data environments

Abstract

With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the…
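
The abstract names two data-parallel ideas, vertical data partitioning and data multiplexing, without showing what they look like in practice. The sketch below is one possible reading in plain Python, not the authors' Spark implementation: it assumes that vertical partitioning means storing each feature column together with the row index and target, and that multiplexing means giving each tree a table of bootstrap row indices instead of its own copy of the data. The function names `vertical_partition`, `sampling_index_table`, and `feature_view` are hypothetical.

```python
import random


def vertical_partition(rows, targets):
    """Split row-oriented data into one subset per feature.

    Assumption: each subset holds (row_id, feature_value, target) triples,
    so a split on one feature can be evaluated without the other columns.
    """
    n_features = len(rows[0])
    return {
        j: [(i, row[j], targets[i]) for i, row in enumerate(rows)]
        for j in range(n_features)
    }


def sampling_index_table(n_rows, n_trees, seed=0):
    """Draw one bootstrap sample of row indices per tree.

    Assumption: trees share the same stored data and differ only in
    which row indices they read, so no per-tree copy is made.
    """
    rng = random.Random(seed)
    return [
        [rng.randrange(n_rows) for _ in range(n_rows)]
        for _ in range(n_trees)
    ]


def feature_view(feature_subset, sample_indices):
    """Look up the bootstrap view of one feature subset for one tree."""
    return [feature_subset[i] for i in sample_indices]


# Toy data: four rows, two features.
rows = [[5.1, 0.2], [6.3, 1.8], [4.9, 0.1], [7.0, 1.4]]
targets = ["a", "b", "a", "b"]

subsets = vertical_partition(rows, targets)
table = sampling_index_table(len(rows), n_trees=3)
view = feature_view(subsets[0], table[0])
```

Under these assumptions, the shared feature subsets are written once and every tree reads them through its own index list, which is the sense in which the training data is reused rather than duplicated.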
