TL;DR
This paper reviews best practices for applying statistical learning methods in epidemiology, emphasizing the importance of sensitivity analysis to account for variability introduced by random seed dependence.
Contribution
It highlights seed sensitivity in modern statistical learning methods and recommends repeated analyses with different seeds to improve robustness in epidemiological research.
Findings
All tested methods showed seed-dependent variability.
Variability differed across methods and exposures.
The authors recommend a sensitivity analysis that repeats each model across multiple seeds.
Abstract
Statistical learning (SL) includes methods that extract knowledge from complex data. SL methods beyond generalized linear models are increasingly being implemented in public health research and epidemiology because they can perform better with complex or high-dimensional data, settings in which traditional statistical methods fail. These novel methods, however, often include random sampling, which may induce variability in results. Best practices in data science can help to ensure robustness. As a case study, we included four SL models that have been applied previously to analyze the relationship between environmental mixtures and health outcomes. We ran each model across 100 initializing values for random number generation, or "seeds," and assessed variability in resulting estimation and inference. All methods exhibited some seed-dependent variability in results. The degree of…
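The seed-sensitivity procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic data, the bootstrap-averaged slope estimator, and the function names (simulate_data, bagged_slope, seed_sensitivity) are all assumptions standing in for the paper's four SL models, which are not named here. The only element taken from the source is the design of refitting the same model on the same data across 100 seeds and summarizing how much the estimate moves.

```python
import random
import statistics


def simulate_data(n=200, data_seed=0):
    """Hypothetical fixed data set: outcome y depends on exposure x (true slope 0.5)."""
    rng = random.Random(data_seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """Ordinary least squares slope of y on x."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def bagged_slope(xs, ys, seed, n_boot=25):
    """Stand-in stochastic estimator: mean slope over bootstrap resamples.

    The random resampling is the only source of seed dependence here.
    """
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.fmean(slopes)


def seed_sensitivity(xs, ys, seeds):
    """Refit the same model on the same data once per seed and summarize the spread."""
    estimates = [bagged_slope(xs, ys, s) for s in seeds]
    return {
        "n_seeds": len(estimates),
        "mean": statistics.fmean(estimates),
        "sd": statistics.stdev(estimates),
        "min": min(estimates),
        "max": max(estimates),
    }


# The data are held fixed; only the seed of the estimator changes.
xs, ys = simulate_data()
summary = seed_sensitivity(xs, ys, seeds=range(100))
print(summary)
```

Reporting the across-seed standard deviation and range next to the point estimate makes it visible when a conclusion depends on one particular seed. In a real analysis the stand-in estimator would be replaced by the SL model under study, and the same summary would be computed for each exposure of interest.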
