The Impact of Bootstrap Sampling Rate on Random Forest Performance in Regression Tasks

Michał Iwaniuk; Mateusz Jarosz; Bartłomiej Borycki; Bartosz Jezierski; Jan Cwalina; Stanisław Kaźmierczak; Jacek Mańdziuk

arXiv:2511.13952·cs.LG·November 19, 2025

Open Access

TL;DR

This study systematically investigates how varying the bootstrap sampling rate (BR) affects random forest regression performance across diverse datasets. It shows that tuning BR can significantly improve results, with the best setting depending on dataset characteristics and noise levels.

Contribution

The paper provides the first comprehensive analysis of how the bootstrap rate (BR) affects RF regression, links dataset characteristics and noise to the preferred BR setting, and highlights the importance of tuning this hyperparameter.

Findings

01 Optimal BR varies across datasets, often differing from the default 1.0.

02 Higher BRs favor datasets with strong global feature-target relationships.

03 Lower BRs benefit high-noise datasets by reducing variance.

Abstract

Random Forests (RFs) typically train each tree on a bootstrap sample of the same size as the training set, i.e., bootstrap rate (BR) equals 1.0. We systematically examine how varying BR from 0.2 to 5.0 affects RF performance across 39 heterogeneous regression datasets and 16 RF configurations, evaluating with repeated two-fold cross-validation and mean squared error. Our results demonstrate that tuning the BR can yield significant improvements over the default: the best setup relied on BR ≤ 1.0 for 24 datasets, BR > 1.0 for 15, and BR = 1.0 was optimal in 4 cases only. We establish a link between dataset characteristics and the preferred BR: datasets with strong global feature-target relationships favor higher BRs, while those with higher local target variance benefit from lower BRs. To further investigate this relationship, we conducted experiments on synthetic datasets with…

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Imbalanced Data Classification Techniques