Quantifying Inherent Randomness in Machine Learning Algorithms
Soham Raste, Rahul Singh, Joel Vaughan, and Vijayan N. Nair

TL;DR
This study empirically quantifies how stochastic elements in model training and in data partitioning affect the performance variability of ML algorithms, specifically Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs), highlighting the importance of controlling randomness for reproducibility.
Contribution
The paper systematically compares how much training randomness and random data splitting each contribute to variation in predictive performance across multiple ML algorithms.
Findings
FFNNs exhibit larger variation due to training randomness than the tree-based methods (RFs and GBMs).
Data splitting causes more variation than training randomness.
Heterogeneous datasets amplify the impact of data splitting variability.
Abstract
Most machine learning (ML) algorithms have several stochastic elements, and their performances are affected by these sources of randomness. This paper uses an empirical study to systematically examine the effects of two sources: randomness in model training and randomness in the partitioning of a dataset into training and test subsets. We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs). Among the different algorithms, randomness in model training causes larger variation for FFNNs compared to tree-based methods. This is to be expected as FFNNs have more stochastic elements that are part of their model initialization and training. We also found that random splitting of datasets leads to higher variation compared to the inherent…
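The comparison described in the abstract can be sketched in a few lines: hold the train/test split fixed and vary only the training seed, then hold the training seed fixed and vary only the split seed, and compare the spread of test accuracy in each case. The sketch below is illustrative only and is not the authors' experimental setup. The synthetic dataset, the toy bagged-stump learner (standing in for an RF, GBM, or FFNN), and all function names (`make_data`, `split`, `train`, `run`) are assumptions made for this example.

```python
import random
import statistics

def make_data(n=300, seed=0):
    # Hypothetical synthetic binary-classification data (not from the paper).
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        label = 1 if x[0] + 0.5 * x[1] + rng.gauss(0, 0.5) > 0 else 0
        data.append((x, label))
    return data

def split(data, split_seed, test_frac=0.3):
    # Source of randomness 1: partitioning into training and test subsets.
    rng = random.Random(split_seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def fit_stump(sample, feature):
    # Threshold at the feature mean; polarity 1 means "above threshold -> class 1".
    thr = statistics.fmean(x[feature] for x, _ in sample)
    hits = sum((x[feature] > thr) == (y == 1) for x, y in sample)
    polarity = 1 if hits >= len(sample) / 2 else 0
    return feature, thr, polarity

def train(train_set, train_seed, n_stumps=15):
    # Source of randomness 2: training (bootstrap resampling, random feature choice).
    rng = random.Random(train_seed)
    stumps = []
    for _ in range(n_stumps):
        boot = [rng.choice(train_set) for _ in train_set]
        stumps.append(fit_stump(boot, rng.randrange(2)))
    return stumps

def predict(stumps, x):
    # Majority vote over the stumps.
    votes = sum((x[f] > thr) == (pol == 1) for f, thr, pol in stumps)
    return 1 if votes * 2 > len(stumps) else 0

def accuracy(stumps, test_set):
    return sum(predict(stumps, x) == y for x, y in test_set) / len(test_set)

def run(split_seed, train_seed, data):
    train_set, test_set = split(data, split_seed)
    return accuracy(train(train_set, train_seed), test_set)

data = make_data()
# Fixed split, varying training seed: isolates training randomness.
train_var = [run(0, s, data) for s in range(20)]
# Fixed training seed, varying split seed: isolates splitting randomness.
split_var = [run(s, 0, data) for s in range(20)]

print("SD of accuracy, training seed varied:", statistics.stdev(train_var))
print("SD of accuracy, split seed varied:   ", statistics.stdev(split_var))
```

The key design choice is that each source of randomness gets its own seed and its own random number generator, so one can be varied while the other is held constant. Which standard deviation comes out larger on this toy problem depends on the data and the learner, so the sketch should not be read as reproducing the paper's results.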
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Test
