Random Forests for Big Data
Robin Genuer (ISPED, SISTM), Jean-Michel Poggi (UPD5, LM-Orsay), Christine Tuleau-Malot (JAD), Nathalie Villa-Vialaneix (MIAT INRA)

TL;DR
This paper reviews and evaluates various methods for scaling random forests to handle Big Data, including parallel, online, and divide-and-conquer approaches, through experiments on massive datasets.
Contribution
It provides a comprehensive review of existing scalable random forest methods and compares their performance on large-scale datasets.
Findings
Parallel and online methods improve scalability.
Different variants have varying accuracy and efficiency.
Limitations exist in some approaches for extremely large data.
Abstract
Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes online data and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel…
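To make the "decision trees combined with aggregation and bootstrap ideas" description concrete, here is a minimal sketch of a divide-and-conquer forest, one of the scaling strategies named in the TL;DR. It is an illustration under stated assumptions, not the paper's implementation: each "tree" is reduced to a one-split stump to keep the example short, and every name (`fit_stump`, `fit_forest`, `fit_dac_forest`, `predict`, `n_try`) is hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, n_try, rng):
    """Fit a one-split tree on n_try randomly chosen features.

    Assumption: a stump stands in for a full decision tree to keep this short.
    Returns (feature, threshold, left_label, right_label).
    """
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in rng.sample(range(len(rows[0])), n_try):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            l_lab, l_cnt = Counter(left).most_common(1)[0]
            r_lab, r_cnt = Counter(right).most_common(1)[0]
            errors = len(labels) - l_cnt - r_cnt
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees, n_try, rng):
    """Bagging: each stump is fit on a bootstrap sample drawn with replacement."""
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], n_try, rng)
        )
    return forest


def fit_dac_forest(rows, labels, n_chunks, trees_per_chunk, n_try, seed=0):
    """Divide and conquer: fit a small forest per data chunk, then pool all trees."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)  # shuffle so each chunk resembles the full data
    forest = []
    for c in range(n_chunks):
        idx = order[c::n_chunks]  # disjoint chunk of observations
        forest += fit_forest(
            [rows[i] for i in idx], [labels[i] for i in idx],
            trees_per_chunk, n_try, rng,
        )
    return forest


def predict(forest, row):
    """Aggregate by majority vote over all pooled stumps."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]
```

In this sketch each chunk could be processed on a separate worker, since the per-chunk forests never share data. The shuffle before chunking is a design choice of the sketch: if chunks are not representative of the whole dataset, the per-chunk forests may each learn from a biased sample.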
Taxonomy
Methods: Linear Regression
