Regularization method in the variable selection for logistic regression on BRFSS data
Jinbo Niu

TL;DR
This study develops regularized logistic regression models with resampling techniques to predict stroke risk from large-scale BRFSS data. The Lasso model reaches an AUC of 0.761, and Group Lasso identifies the key predictors.
Contribution
It introduces the combined use of regularization and resampling methods for effective feature selection and prediction in high-dimensional, imbalanced health data.
Findings
The Lasso model achieved an AUC of 0.761.
Group Lasso identified the key predictors: age, heart disease, and physical and dental health.
Resampling techniques improved model performance.
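For readers unfamiliar with the two penalties named in these findings, the block below writes them out in their standard textbook form. The notation is an assumption rather than the paper's own: \(\eta_i\) is the linear predictor, \(p_g\) is the number of coefficients in group \(g\), and \(\beta^{(g)}\) is the coefficient sub-vector for that group. The paper's exact parametrization and group definitions may differ.

```latex
% Penalized negative log-likelihood for logistic regression
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \eta_i - \log\!\big(1 + e^{\eta_i}\big) \Big]
  + P_\lambda(\beta),
\qquad \eta_i = \beta_0 + x_i^\top \beta

% Lasso: shrinks individual coefficients, setting some exactly to zero
P_\lambda(\beta) = \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Group Lasso: keeps or drops whole groups of coefficients together
P_\lambda(\beta) = \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \big\lVert \beta^{(g)} \big\rVert_2
```

The group form is a natural fit for survey data, where a single categorical question expands into several dummy variables that are more useful to keep or drop as a unit.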
Abstract
Stroke remains a leading cause of death and disability worldwide, yet effective prediction of stroke risk using large-scale population data remains challenging due to data imbalance and high-dimensional features. In this study, we develop and evaluate regularized logistic regression models for stroke prediction using data from the 2022 Behavioral Risk Factor Surveillance System (BRFSS), comprising 445,132 U.S. adult respondents and 328 health-related variables. To address data imbalance, we apply several resampling techniques including oversampling, undersampling, class weighting, and the Synthetic Minority Oversampling Technique (SMOTE). We further employ Lasso, Elastic Net, and Group Lasso regularization methods to perform feature selection and dimensionality reduction. Model performance is assessed using ROC-AUC, sensitivity, and specificity metrics. Among all methods, the Lasso-based…
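The pipeline the abstract describes (handle class imbalance, fit an L1-penalized logistic regression, read off the selected features, score with ROC-AUC) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: it uses small synthetic data rather than BRFSS, it uses class weighting as the only imbalance remedy, and it uses a plain proximal-gradient solver chosen for brevity. All function names and hyperparameters (`fit_lasso_logistic`, `lam`, `lr`, `n_iter`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_lasso_logistic(X, y, lam=0.1, lr=0.3, n_iter=300):
    """L1-penalized, class-weighted logistic regression via proximal gradient."""
    n, p = len(X), len(X[0])
    n_pos = sum(y)
    # "Balanced" class weights: each class contributes equally to the loss.
    w_pos = n / (2.0 * n_pos)
    w_neg = n / (2.0 * (n - n_pos))
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            w = w_pos if yi == 1 else w_neg
            prob = sigmoid(b0 + sum(b * x for b, x in zip(beta, xi)))
            err = w * (prob - yi)
            g0 += err
            for j in range(p):
                grad[j] += err * xi[j]
        b0 -= lr * g0 / n  # the intercept is not penalized
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        beta = [soft_threshold(b - lr * g / n, lr * lam)
                for b, g in zip(beta, grad)]
    return b0, beta

def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Synthetic, imbalanced data (NOT BRFSS): only features 0 and 1 carry signal.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(6)] for _ in range(400)]
y = [int(random.random() < sigmoid(-3 + 2 * x[0] + 1.5 * x[1])) for x in X]

b0, beta = fit_lasso_logistic(X, y)
selected = [j for j, b in enumerate(beta) if b != 0.0]
scores = [b0 + sum(b * x for b, x in zip(beta, xi)) for xi in X]
auc = roc_auc(scores, y)
print("selected features:", selected)
print("training AUC:", round(auc, 3))
```

The AUC printed here is computed on the training data and says nothing about the paper's reported results. A real evaluation would hold out a test set and tune `lam` by cross-validation.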
