TL;DR
This study evaluates machine learning methods for selecting social-environmental variables from census data that are truly associated with health outcomes, demonstrating their effectiveness in simulations and real-world prostate cancer data.
Contribution
The paper compares machine learning approaches, including penalized regressions and tree ensembles, for variable selection in high-dimensional social-environmental data, and identifies which methods best detect true associations.
Findings
Elastic net identified many true positives
Lasso controlled false positives well
Sparse group lasso and Bayesian trees showed strong performance
Abstract
Objective: Social-environmental data obtained from the U.S. Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome. Materials and Methods: We compared several popular machine learning methods, including penalized regressions (e.g. lasso, elastic net), and tree ensemble methods. Via simulation, we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1,000 total variables). We applied the…
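The simulation design described in the abstract (10 true associations among 1,000 variables) implies scoring each method's selected variables against a known truth. The abstract does not state which metrics the authors computed, so the following is a minimal sketch under that assumption. The function name `score_selection`, the metric names, and the example selection are hypothetical and are not results from the paper.

```python
def score_selection(selected, truth, n_total):
    """Compare a method's selected variables against the simulated truth.

    selected: indices of variables the method kept (hypothetical input)
    truth:    indices of variables truly associated with the outcome
    n_total:  total number of candidate variables
    """
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # true associations recovered
    fp = len(selected - truth)   # null variables wrongly selected
    n_null = n_total - len(truth)
    return {
        "true_positives": tp,
        "false_positives": fp,
        "sensitivity": tp / len(truth) if truth else 0.0,
        "false_positive_rate": fp / n_null if n_null else 0.0,
        "false_discovery_proportion": fp / len(selected) if selected else 0.0,
    }


# Setup mirroring the abstract: 1,000 variables, the first 10 truly associated.
truth = range(10)
# Made-up selection for illustration only: 8 true variables plus 4 null ones.
selected = [0, 1, 2, 3, 4, 5, 6, 7, 500, 501, 502, 503]
result = score_selection(selected, truth, n_total=1000)
print(result)
```

Reporting sensitivity alongside a false-positive measure reflects the trade-off noted in the findings, where one method may recover many true positives while another keeps false positives low.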
