Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance

Stanisław Kaźmierczak; Jacek Mańdziuk

arXiv:2410.04297 · cs.LG · October 23, 2025

Open Access · 3 Reviews

TL;DR

This study demonstrates that using a bootstrap sampling rate greater than 1.0 in random forests can significantly improve classification accuracy, challenging previous assumptions about the inefficacy of larger sample sizes.

Contribution

The paper provides empirical evidence that bootstrap rates above 1.0 can improve RF performance and identifies dataset characteristics as the key factor influencing the optimal rate.

Findings

01

BR values above 1.0 can lead to statistically significant accuracy improvements over the standard setting.

02

Optimal BR depends mainly on dataset characteristics.

03

Analysis of leaf structures reveals how BR influences tree complexity.

Abstract

Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $> 1.0$) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR…
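The definition of the bootstrap rate above can be illustrated with a minimal sketch. It is not the authors' code: the function names `bootstrap_indices` and `expected_unique_fraction`, the rounding of BR · $N$ to an integer sample size, and the toy value $N = 1000$ are all assumptions made for illustration. The sketch draws round(BR · $N$) row indices with replacement and reports what fraction of the original observations appears at least once, which for large $N$ is approximately $1 - e^{-\mathrm{BR}}$.

```python
import numpy as np


def bootstrap_indices(n_train, bootstrap_rate, rng):
    """Draw round(bootstrap_rate * n_train) row indices with replacement.

    Hypothetical helper: the rounding rule is an assumption, not taken
    from the paper.
    """
    sample_size = max(1, int(round(bootstrap_rate * n_train)))
    return rng.integers(0, n_train, size=sample_size)


def expected_unique_fraction(n_train, bootstrap_rate):
    """Expected share of distinct training rows in one bootstrap sample.

    Each row is missed by a single draw with probability 1 - 1/n_train,
    so it is missed by all draws with probability (1 - 1/n_train) ** sample_size.
    """
    sample_size = max(1, int(round(bootstrap_rate * n_train)))
    return 1.0 - (1.0 - 1.0 / n_train) ** sample_size


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_train = 1000  # toy training-set size, chosen only for illustration
    for br in (0.5, 1.0, 2.0, 5.0):
        idx = bootstrap_indices(n_train, br, rng)
        observed = np.unique(idx).size / n_train
        expected = expected_unique_fraction(n_train, br)
        print(f"BR={br:.1f}  sample size={idx.size}  "
              f"unique fraction: observed={observed:.3f}, expected={expected:.3f}")
```

Under these assumptions, BR = 1.0 leaves roughly a third of the observations out of each sample, whereas larger BR values expose each tree to nearly all observations, many of them duplicated.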

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 1 · Confidence 5

Strengths

Extensive experimental studies

Weaknesses

1. Lack of Insight: The study would greatly benefit from a discussion on the underlying rationale behind the observed improvements in classification accuracy with BR > 1. Although the (very limited) empirical results suggest that higher BR values could potentially improve performance, the paper does not offer insights into why this might be the case. Introducing a simplified, theoretical case, such as a linear Gaussian model, could help illustrate the mechanisms at play and provide a foundation

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1) Overall, this work re-examines the bootstrap rate in random forests, considering a wide range of BR values and multiple datasets, which draws attention to some basic machine learning configurations.
2) Sufficient experiments on multiple datasets are carried out, covering various aspects of BR values, such as the relationship between BR and classification accuracy, the influence of different RF configurations, and the dependence of the optimal BR on the dataset.

Weaknesses

1) Although different BR values have been extensively studied, the explanations for why certain datasets exhibit specific behaviors at specific BR values may not be deep and comprehensive enough, especially for some complex patterns and anomalies. For example, for a model like RF (nf all) that exhibits special behaviors, although some hypotheses have been proposed, more research may be needed to understand the exact mechanism behind it.
2) When analyzing the relationship between the optimal BR

Reviewer 03 · Rating 1 · Confidence 5

Strengths

The paper studies the impact of BR on random forest performance and provides insights into the characteristics of datasets for which BR > 1 should be preferred. Extensive simulations are carried out to corroborate the proposed statements.

Weaknesses

- Several related works are missing. Besides, it is not clear from the related work section which contributions are theoretical or practical. See for example:
  - For empirical performances, see Section 4 in Random Forests for Big Data https://arxiv.org/abs/1511.08327
  - For theoretical results quantifying the uncertainty of random forests, depending on BR, see https://jmlr.org/papers/volume17/14-168/14-168.pdf or https://arxiv.org/pdf/1405.0352
  - In https://www.esaim-ps.org/articles/p

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Forest ecology and management · Advanced Statistical Methods and Models

Methods: Sparse Evolutionary Training