Random Forests as Statistical Procedures: Design, Variance, and Dependence
Nathaniel S. O'Connell

TL;DR
This paper develops a finite-sample, design-based theory for random forests, introducing a variance identity, a covariance floor concept, and a new resampling method to improve uncertainty quantification and confidence interval accuracy.
Contribution
It provides the first finite-sample variance decomposition for random forests, introduces PASR for covariance estimation, and develops confidence intervals with guaranteed coverage for both regression and classification.
Findings
Prediction intervals achieve nominal coverage with conservative bias.
The PASR estimator is asymptotically unbiased for classification probabilities.
Nominal coverage is maintained across various high-dimensional settings.
Abstract
We develop a finite-sample, design-based theory for random forests in which each tree is a randomized conditional predictor acting on fixed covariates and the forest is their Monte Carlo average. An exact variance identity separates Monte Carlo error from a covariance floor that persists under infinite aggregation. The floor arises through two mechanisms: observation reuse, where the same training outcomes receive weight across multiple trees, and partition alignment, where independently generated trees discover similar conditional prediction rules. We prove the floor is strictly positive under minimal conditions and show that alignment persists even when sample splitting eliminates observation overlap entirely. We introduce procedure-aligned synthetic resampling (PASR) to estimate the covariance floor, decomposing the total prediction uncertainty of a deployed forest into interpretable…
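The split between Monte Carlo error and a covariance floor can be illustrated with the textbook variance formula for an average of B exchangeable predictions, each with variance sigma^2 and pairwise covariance C: Var = sigma^2 / B + (1 - 1/B) * C, which tends to C as B grows. The sketch below is an illustration under a toy assumption, not the paper's exact identity and not PASR: each tree prediction is modeled as a shared component (variance C) plus independent per-tree noise. The function names and parameter values are hypothetical.

```python
import random
import statistics


def forest_variance(sigma2, cov, n_trees):
    """Variance of the mean of n_trees exchangeable predictions.

    sigma2: variance of a single tree's prediction.
    cov: covariance between two distinct trees (the floor in this toy model).
    """
    return sigma2 / n_trees + (1.0 - 1.0 / n_trees) * cov


def simulate_forest_variance(shared_var, tree_var, n_trees, n_reps, seed=0):
    """Monte Carlo estimate of the forest variance in the toy model.

    Each tree prediction is shared + noise_b, so Cov(tree_b, tree_b') = shared_var
    and Var(tree_b) = shared_var + tree_var (an assumption made for illustration).
    """
    rng = random.Random(seed)
    forest_means = []
    for _ in range(n_reps):
        shared = rng.gauss(0.0, shared_var ** 0.5)
        trees = [shared + rng.gauss(0.0, tree_var ** 0.5) for _ in range(n_trees)]
        forest_means.append(sum(trees) / n_trees)
    return statistics.pvariance(forest_means)


if __name__ == "__main__":
    shared_var, tree_var = 0.5, 2.0          # hypothetical values
    sigma2 = shared_var + tree_var
    for n_trees in (1, 10, 100):
        exact = forest_variance(sigma2, shared_var, n_trees)
        approx = simulate_forest_variance(shared_var, tree_var, n_trees, n_reps=2000)
        print(f"B={n_trees:4d}  formula={exact:.3f}  simulated={approx:.3f}")
```

In this toy model, adding trees shrinks only the sigma^2 / B term; the forest variance levels off near the shared-component variance, which plays the role of the floor that persists under infinite aggregation.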
Taxonomy
Topics: Advanced Statistical Modeling Techniques · Data Analysis with R · Explainable Artificial Intelligence (XAI)
