Comparison of Missing Data Imputation Methods using the Framingham Heart study dataset
Konstantinos Psychogyios, Loukas Ilias, Dimitris Askounis

TL;DR
This study compares advanced GAN and Autoencoder-based missing data imputation methods on the Framingham Heart Study dataset, demonstrating significant improvements over traditional techniques in both imputation accuracy and predictive performance.
Contribution
It introduces modified GAN and Autoencoder methods for missing data imputation and evaluates their effectiveness on medical datasets, showing notable performance gains.
Findings
Improvements of 0.20 in normalized RMSE
7.00% increase in AUROC
2.50% higher F1-score in post-imputation prediction
Abstract
Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood vessels and, according to the World Health Organization, is the leading cause of death worldwide. Electronic health record (EHR) data for CVD, as for medical cases in general, very frequently contain missing values. The percentage of missingness varies and is linked to instrument errors, manual data entry procedures, etc. Even though the missing rate is usually significant, in many cases missing value imputation is handled poorly, either with case deletion or with simple statistical approaches such as mode and median imputation. These methods are known to introduce significant bias, since they do not account for the relationships between the dataset's variables. Within the medical framework, many datasets consist of lab tests or patient medical tests, where these relationships are present and strong. To…
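The abstract's claim that median imputation ignores relationships between variables can be illustrated with a small sketch. The code below is not the paper's GAN or Autoencoder method and does not use the Framingham data. It assumes a synthetic two-column dataset with a strong linear relationship, and uses a simple linear regression as a stand-in for any relationship-aware imputer. All function names are hypothetical.

```python
import numpy as np

def make_data(n=500, seed=0):
    """Synthetic stand-in data (assumption): column 1 depends strongly on column 0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.1, size=n)
    return np.column_stack([x, y])

def mask_column(data, col=1, rate=0.3, seed=1):
    """Hide a random fraction of one column; return the corrupted copy and the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(data.shape[0]) < rate
    corrupted = data.copy()
    corrupted[mask, col] = np.nan
    return corrupted, mask

def impute_median(corrupted, col=1):
    """Fill missing entries with the column median, ignoring all other columns."""
    out = corrupted.copy()
    missing = np.isnan(out[:, col])
    out[missing, col] = np.nanmedian(out[:, col])
    return out

def impute_regression(corrupted, col=1, pred=0):
    """Fill missing entries from a line fitted on the observed rows (illustrative only)."""
    out = corrupted.copy()
    missing = np.isnan(out[:, col])
    slope, intercept = np.polyfit(out[~missing, pred], out[~missing, col], 1)
    out[missing, col] = slope * out[missing, pred] + intercept
    return out

def rmse_on_missing(truth, imputed, mask, col=1):
    """RMSE computed only over the entries that were hidden."""
    diff = truth[mask, col] - imputed[mask, col]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = make_data()
corrupted, mask = mask_column(truth)
rmse_median = rmse_on_missing(truth, impute_median(corrupted), mask)
rmse_regression = rmse_on_missing(truth, impute_regression(corrupted), mask)
print(f"median imputation RMSE:     {rmse_median:.3f}")
print(f"regression imputation RMSE: {rmse_regression:.3f}")
```

On data like this, the median fill assigns the same value to every missing entry, so its error is on the order of the column's spread, while the regression fill recovers most of the signal from the correlated column. Real clinical data will have weaker and less linear relationships, so the gap shown here should not be read as representative.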
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare
