Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
Stanisław Kaźmierczak, Jacek Mańdziuk

TL;DR
This study demonstrates that using a bootstrap sampling rate greater than 1.0 in random forests can significantly improve classification accuracy, challenging previous assumptions about the inefficacy of larger sample sizes.
Contribution
The paper provides empirical evidence that bootstrap rates higher than 1.0 can enhance RF performance and identifies dataset characteristics as the key factor influencing the optimal rate.
Findings
BR values above 1.0 can lead to statistically significant accuracy improvements over standard settings.
Optimal BR depends mainly on dataset characteristics.
Analysis of leaf structures reveals how BR influences tree complexity.
Abstract
Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set (N). Previous research indicates that drawing fewer than N observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than N observations (BR > 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR…
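To make the bootstrap rate concrete, the following is a minimal sketch (not the authors' code) of drawing one bootstrap sample at a given BR and measuring what fraction of the original observations it contains. The function names `bootstrap_indices` and `unique_fraction` are hypothetical, and rounding BR × N to the nearest integer is an assumption. For sampling with replacement, the expected fraction of distinct observations is approximately 1 - exp(-BR), so it is about 63% at BR = 1.0 and about 99% at BR = 5.0.

```python
import random


def bootstrap_indices(n, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n) indices from range(n) with replacement.

    bootstrap_rate may exceed 1.0, in which case the sample is larger than
    the original training set and necessarily contains duplicates.
    """
    sample_size = round(bootstrap_rate * n)  # assumption: round to nearest int
    return [rng.randrange(n) for _ in range(sample_size)]


def unique_fraction(n, bootstrap_rate, rng):
    """Fraction of the n original observations that appear at least once."""
    return len(set(bootstrap_indices(n, bootstrap_rate, rng))) / n


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for repeatability
    n = 10_000
    for br in (0.5, 1.0, 2.0, 5.0):
        print(f"BR={br}: unique fraction = {unique_fraction(n, br, rng):.3f}")
```

In a forest, each tree would be fit on the rows selected by its own call to `bootstrap_indices`. Raising BR above 1.0 adds no new observations; it only changes how many distinct observations each tree sees and how often each one is repeated.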
Peer Reviews
Decision·Submitted to ICLR 2025
Extensive experimental studies
1. Lack of Insight: The study would greatly benefit from a discussion on the underlying rationale behind the observed improvements in classification accuracy with BR > 1. Although the (very limited) empirical results suggest that higher BR values could potentially improve performance, the paper does not offer insights into why this might be the case. Introducing a simplified, theoretical case, such as a linear Gaussian model, could help illustrate the mechanisms at play and provide a foundation
1) Overall, this work re-examines the bootstrap rate in random forests, considering a wide range of BR values and multiple datasets, which sheds light on some basic machine learning configurations. 2) Sufficient experiments on multiple datasets are carried out, covering various aspects of BR values, such as the relationship between BR and classification accuracy, the influence of different RF configurations, and the dependence of the optimal BR on the dataset.
1) Although different BR values have been extensively studied, the explanations for why certain datasets exhibit specific behaviors at specific BR values may not be deep and comprehensive enough, especially for some complex patterns and anomalies. For example, for a model like RF (nf all) that exhibits special behaviors, although some hypotheses have been proposed, more research may be needed to understand the exact mechanism behind it. 2) When analyzing the relationship between the optimal BR
The paper studies the impact of BR on random forest performance and provides insights into the characteristics of datasets for which BR > 1 should be preferred. Extensive simulations are done to corroborate the proposed statements.
- Several related works are missing. Besides, it is not clear from the related work section which contributions are theoretical or practical. See for example: - For empirical performances, see Section 4 in Random Forests for Big Data https://arxiv.org/abs/1511.08327 - For theoretical results quantifying the uncertainty of random forests, depending on BR, see https://jmlr.org/papers/volume17/14-168/14-168.pdf or https://arxiv.org/pdf/1405.0352 - In https://www.esaim-ps.org/articles/p
Taxonomy
Topics: Gaussian Processes and Bayesian Inference · Forest ecology and management · Advanced Statistical Methods and Models
Methods: Sparse Evolutionary Training
