Survey-aware Machine Learning: A Guideline for Valid Population Health Inference based on Scoping Review
YongKyung Oh, Henry W. Zheng, Jeffrey Feng, Alex A. T. Bui

TL;DR
This paper introduces SaML, a nine-step guideline for incorporating survey design metadata into machine learning models to ensure valid population health inferences from complex survey data.
Contribution
It provides a comprehensive, task-specific checklist for survey-aware ML, addressing gaps in current practices and summarizing existing methodological approaches.
Findings
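Ignoring survey design biases estimates, understates uncertainty, and yields fairness assessments that miss population-level disparities; survey-aware ML is proposed to address these problems.
A scoping review of 16 methodological papers summarizes existing work on weighted model training, design-based cross-validation, and survey-adjusted performance evaluation.
The review identifies gaps in hyperparameter tuning and deployment for survey data.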
Abstract
Machine Learning (ML) models trained on complex health surveys such as the National Health and Nutrition Examination Survey (NHANES) often ignore primary sampling units, stratification variables, and sampling weights. This practice violates the independence assumptions of standard evaluation methods. As a result, estimates become biased, uncertainty is underestimated, and fairness assessments fail to reflect population-level disparities. We propose Survey-aware Machine Learning (SaML), a nine-step guideline that incorporates survey design metadata across the ML lifecycle. Through a scoping review of 16 methodological papers, we summarize existing work on weighted model training, design-based cross-validation, and survey-adjusted performance evaluation. We also identify gaps in hyperparameter tuning and deployment. We provide task-specific guidance that clarifies which steps are required…
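To make two of the ideas named in the abstract concrete, the following is a minimal sketch of survey-adjusted performance evaluation (a weight-aware accuracy) and design-based cross-validation (folds that keep each primary sampling unit intact within its stratum). It is an illustration under stated assumptions, not the SaML guideline itself or the authors' implementation: the function names `weighted_accuracy` and `psu_folds`, the toy data, and the round-robin fold assignment are all hypothetical choices made for this example, and PSU identifiers are assumed to be nested within strata.

```python
import numpy as np


def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which each respondent counts in proportion to its sampling weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    weights = np.asarray(weights, dtype=float)
    correct = (y_true == y_pred).astype(float)
    return float(np.sum(weights * correct) / np.sum(weights))


def psu_folds(strata, psu, n_folds, seed=0):
    """Assign a fold index to every row so that all rows of a PSU share one fold.

    Assumption: PSU ids are nested within strata, so a PSU is identified by
    the pair (stratum, psu). PSUs are shuffled within each stratum and dealt
    out to folds round-robin, which spreads each stratum across folds.
    """
    strata = np.asarray(strata)
    psu = np.asarray(psu)
    rng = np.random.default_rng(seed)
    folds = np.empty(len(psu), dtype=int)
    offset = 0
    for s in np.unique(strata):
        in_stratum = strata == s
        units = np.unique(psu[in_stratum])
        rng.shuffle(units)  # random order of PSUs within this stratum
        for i, u in enumerate(units):
            folds[in_stratum & (psu == u)] = (i + offset) % n_folds
        offset += len(units)
    return folds


# Toy data (hypothetical): the one misclassified respondent carries a large weight.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
weights = [10, 10, 70, 10]
unweighted = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))  # 0.75
weighted = weighted_accuracy(y_true, y_pred, weights)                  # 0.3

# Two strata with two PSUs each; rows of one PSU never split across folds.
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]
folds = psu_folds(strata, psu, n_folds=2)
```

In the toy data the unweighted accuracy is 0.75 while the weighted accuracy is 0.3, which shows how ignoring sampling weights can misstate population-level performance. Splitting by PSU rather than by row keeps correlated respondents out of both the training and validation sides of a fold. Variance estimation for the resulting metrics is not covered by this sketch.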
