Random Forests for Big Data

Robin Genuer (ISPED; SISTM); Jean-Michel Poggi (UPD5; LM-Orsay); Christine Tuleau-Malot (JAD); Nathalie Villa-Vialaneix (MIAT INRA)

arXiv:1511.08327·stat.ML·March 23, 2017

TL;DR

This paper reviews and evaluates various methods for scaling random forests to handle Big Data, including parallel, online, and divide-and-conquer approaches, through experiments on massive datasets.

Contribution

It provides a comprehensive review of existing scalable random forest methods and compares their performance on large-scale datasets.

Findings

01

Parallel and online methods improve scalability.

02

Different variants have varying accuracy and efficiency.

03

Limitations exist in some approaches for extremely large data.

Abstract

Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems as well as two-class and multi-class classification problems in a single, versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel…
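To make the ideas named in the abstract concrete (bootstrap samples, tree aggregation by voting, and a divide-and-conquer variant that trains on separate chunks of the data), here is a minimal, standard-library-only sketch. It is not the authors' implementation: it uses one-split trees (stumps) instead of full decision trees to stay short, and every function name and parameter (`fit_stump`, `divide_and_conquer_forest`, `n_chunks`, `trees_per_chunk`, and so on) is hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, rng, n_features_tried=1):
    """Fit a one-split tree on a random subset of features (simplified tree)."""
    best = None
    for f in rng.sample(range(len(rows[0])), n_features_tried):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            # Training errors if each side predicts its own majority class.
            errors = (sum(y != l_lab for y in left)
                      + sum(y != r_lab for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:
        # No valid split found: predict the overall majority class everywhere.
        m = majority(labels)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def fit_forest(rows, labels, n_trees, rng):
    """Bagging: each tree is fit on a bootstrap sample drawn with replacement."""
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    """Aggregate the trees' predictions by majority vote."""
    return majority([predict_stump(s, row) for s in forest])


def divide_and_conquer_forest(rows, labels, n_chunks, trees_per_chunk, seed=0):
    """Split the data into chunks, fit a small forest per chunk, merge them.

    Assumption: chunks are built from a random shuffle, so each chunk is
    roughly representative of the full data set.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    merged = []
    for c in range(n_chunks):
        chunk = order[c::n_chunks]
        merged += fit_forest([rows[i] for i in chunk],
                             [labels[i] for i in chunk],
                             trees_per_chunk, rng)
    return merged


if __name__ == "__main__":
    # Toy one-feature data set: label is 1 when x > 0.5.
    xs = [(i / 200,) for i in range(200)]
    ys = [int(x[0] > 0.5) for x in xs]
    forest = divide_and_conquer_forest(xs, ys, n_chunks=4, trees_per_chunk=5)
    print(len(forest), predict_forest(forest, (0.1,)), predict_forest(forest, (0.9,)))
```

In this sketch each chunk could be processed on a different worker, since the per-chunk forests are independent and are only concatenated at the end. Whether the merged forest matches the accuracy of a forest trained on all the data depends on how representative each chunk is.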

Taxonomy

Methods: Linear Regression