What is the distribution of the number of unique original items in a bootstrap sample?
Alex F. Mendelson, Maria A. Zuluaga, Brian F. Hutton, Sébastien Ourselin

TL;DR
This paper analyzes the distribution of the number of unique original items in a bootstrap sample, giving machine learning researchers the characterization and heuristics needed to understand and control this quantity in resampling methods such as bagging.
Contribution
It characterizes the distribution of the number of unique original items in a bootstrap sample, generalizes it to items drawn from distinct categories (as in classification), and discusses the normal limit in both cases.
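For orientation, the first two moments of this distribution follow from the classical occupancy problem. The block below is a sketch of those standard results, not a quotation from the paper; the symbols n (number of original items, with a bootstrap sample of the same size), K (number of unique items in the sample), p and q are notation assumed here.

```latex
% Assumed notation: n original items, n draws with replacement,
% K = number of distinct original items appearing in the sample.
\begin{align}
  p &= \left(1 - \tfrac{1}{n}\right)^{n}, \qquad
  q  = \left(1 - \tfrac{2}{n}\right)^{n}, \\
  \mathbb{E}[K] &= n\,(1 - p), \qquad
  \frac{\mathbb{E}[K]}{n} \;\longrightarrow\; 1 - e^{-1} \approx 0.632
  \quad (n \to \infty), \\
  \operatorname{Var}[K] &= n\,p\,(1 - p) + n\,(n - 1)\,\bigl(q - p^{2}\bigr).
\end{align}
```

Here p is the probability that a given item is never drawn and q the probability that two given items are both never drawn. The limiting fraction 1 - 1/e is the constant that gives the .632+ scheme its name.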
Findings
Distribution of unique items can be approximated by a normal distribution under certain conditions
Heuristic for when the normal approximation is valid in practice
Extension of the distribution analysis to categorical data in classification
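The findings above can be explored numerically. The following is a minimal sketch, not the authors' code: it computes the exact distribution of the number of unique items by stepping through the draws one at a time, then compares it with a normal density of matching mean and variance. The function names (unique_count_pmf, moments, normal_pdf) and the choice n = 100 are illustrative assumptions.

```python
import math


def unique_count_pmf(n, m=None):
    """Exact PMF of the number of distinct items seen in m draws
    with replacement from n items (m defaults to n, the bootstrap case)."""
    if m is None:
        m = n
    pmf = [0.0] * (n + 1)
    pmf[0] = 1.0  # before any draw, zero distinct items have been seen
    for _ in range(m):
        nxt = [0.0] * (n + 1)
        for k, p in enumerate(pmf):
            if p == 0.0:
                continue
            nxt[k] += p * k / n  # the draw repeats an already-seen item
            if k < n:
                nxt[k + 1] += p * (n - k) / n  # the draw hits a new item
        pmf = nxt
    return pmf


def moments(pmf):
    """Mean and variance of a PMF given as a list indexed by outcome."""
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return mean, var


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Illustrative run: n = 100 is an arbitrary choice, not a value from the paper.
n = 100
pmf = unique_count_pmf(n)
mean, var = moments(pmf)
max_gap = max(abs(p - normal_pdf(k, mean, var)) for k, p in enumerate(pmf))
print(f"mean fraction unique: {mean / n:.4f}")
print(f"variance: {var:.4f}")
print(f"largest pointwise gap to the matched normal density: {max_gap:.5f}")
```

Rerunning the last lines for several values of n shows how the gap to the normal density changes with sample size. This is only an informal check and does not reproduce the heuristic derived in the paper.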
Abstract
Sampling with replacement occurs in many settings in machine learning, notably in the bagging ensemble technique and the .632+ validation scheme. The number of unique original items in a bootstrap sample can have an important role in the behaviour of prediction models learned on it. Indeed, there are uncontrived examples where duplicate items have no effect. The purpose of this report is to present the distribution of the number of unique original items in a bootstrap sample clearly and concisely, with a view to enabling other machine learning researchers to understand and control this quantity in existing and future resampling techniques. We describe the key characteristics of this distribution along with the generalisation for the case where items come from distinct categories, as in classification. In both cases we discuss the normal limit, and conduct an empirical investigation to…
Taxonomy
Topics: Statistical Methods and Inference · Bayesian Modeling and Causal Inference · Statistical Methods and Bayesian Inference
