Towards a methodology for addressing missingness in datasets, with an application to demographic health datasets
Gift Khangamwa, Terence L. van Zyl, Clint J. van Alten

TL;DR
This paper presents a methodology combining synthetic data generation, missing data imputation, and deep learning to effectively address missingness in health datasets, improving predictive accuracy on unseen data.
Contribution
The study introduces a novel approach that integrates Gaussian mixture models and deep learning for synthetic data creation and missing data imputation in demographic health datasets.
Findings
Models trained on synthetic and imputed data achieved 83% accuracy on real unseen data.
DAE-based imputation yielded the lowest log loss, indicating superior performance.
The methodology is adaptable to other contexts beyond health datasets.
Abstract
Missing data is a common concern in health datasets, and its impact on sound decision-making is well documented. Our study's contribution is a methodology that tackles missing data problems by combining synthetic dataset generation, missing data imputation, and deep learning. Specifically, we conducted a series of experiments with these objectives: generating a realistic synthetic dataset, simulating data missingness, recovering the missing data, and analyzing imputation performance. Our methodology used a Gaussian mixture model, whose parameters were learned from a cleaned subset of a real demographic and health dataset, to generate the synthetic data. We simulated several degrees of missingness under the missing completely at random (MCAR) scheme. We used an…
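The first two steps described in the abstract (sampling synthetic records from a Gaussian mixture, then masking cells completely at random) can be illustrated with a minimal sketch. This is not the authors' code. The mixture weights, means, and standard deviations below are hypothetical placeholders, whereas the paper learns them from a cleaned real dataset; the diagonal covariance, the helper names `sample_mixture`, `mask_mcar`, and `missing_fraction`, and the 0.2 missingness rate are likewise assumptions chosen only for illustration.

```python
import random


def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n rows from a Gaussian mixture with diagonal covariance (assumed)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Pick a mixture component in proportion to its weight.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Sample each feature independently from that component.
        rows.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return rows


def mask_mcar(rows, rate, seed=0):
    """Return a copy of rows with each cell set to None with probability `rate`.

    Under MCAR the chance of a cell being missing depends on neither the
    observed nor the unobserved values, so one independent draw per cell suffices.
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def missing_fraction(rows):
    """Fraction of cells that are None."""
    cells = [v for row in rows for v in row]
    return sum(v is None for v in cells) / len(cells)


# Hypothetical two-component mixture over three features (not from the paper).
weights = [0.6, 0.4]
means = [[0.0, 5.0, 1.0], [3.0, -1.0, 2.0]]
stds = [[1.0, 0.5, 0.2], [0.8, 1.5, 0.3]]

synthetic = sample_mixture(2000, weights, means, stds, seed=42)
masked = mask_mcar(synthetic, rate=0.2, seed=7)  # illustrative rate only
print(round(missing_fraction(masked), 3))
```

Keeping the unmasked `synthetic` rows alongside `masked` is what makes imputation measurable: the original values act as ground truth against which recovered values can be scored.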
Taxonomy
Topics: Data Quality and Management · Nutritional Studies and Diet · Health disparities and outcomes
Methods: Test
