Generating Synthetic but Plausible Healthcare Record Datasets
Laura Aviñó, Matteo Ruffini, Ricard Gavaldà

TL;DR
This paper introduces a new method for generating synthetic healthcare datasets that are more realistic and interpretable than GAN-based methods, avoiding mode collapse and better preserving data diversity.
Contribution
A novel latent variable-based approach for generating binary healthcare data that outperforms GANs in realism and interpretability.
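To make the idea of a latent-variable generator for binary records concrete, here is a minimal sketch. It assumes one simple form, a mixture of independent Bernoulli variables, which this summary does not confirm as the authors' exact model. The names `sample_records`, `weights`, and `probs` and all parameter values are hypothetical.

```python
import random

def sample_records(weights, probs, n, seed=0):
    """Draw n binary records from a mixture of independent Bernoullis.

    weights[k]  : probability of latent cluster k (assumed, sums to 1)
    probs[k][j] : probability that feature j equals 1 in cluster k (assumed)
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Pick the hidden cluster, then flip one biased coin per feature.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        records.append([1 if rng.random() < p else 0 for p in probs[k]])
    return records

# Hypothetical two-cluster model over three binary codes (e.g. diagnoses).
weights = [0.7, 0.3]
probs = [[0.9, 0.1, 0.05], [0.1, 0.8, 0.6]]
synthetic = sample_records(weights, probs, n=1000)
```

In a model of this kind the parameters can be read directly: each row of `probs` describes one patient profile, which is one sense in which a latent-variable model may be easier to interpret than a GAN.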
Findings
Compared with MedGan output, the synthetic datasets are harder to distinguish from real data, both for a Random Forest classifier and according to the MMD statistic.
The method avoids mode collapse common in GANs.
The learned generative models are easier to interpret than GAN-based models.
Abstract
Generating datasets that "look like" given real ones is an interesting task for healthcare applications of ML and many other fields of science and engineering. In this paper we propose a new method, generally applicable to binary datasets, based on a method for learning the parameters of a latent variable model that we have previously used for clustering patient datasets. We compare our method with a recent proposal (MedGan) based on generative adversarial methods and find that the synthetic datasets we generate are globally more realistic in at least two senses: real and synthetic instances are harder to tell apart by Random Forests, and by the MMD statistic. The most likely explanation is that our method does not suffer from "mode collapse", which is an admitted problem of GANs. Additionally, the generative models we obtain are easy to interpret, unlike the rather obscure GANs.…
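The MMD (maximum mean discrepancy) statistic mentioned in the abstract can be illustrated with a small sketch. This is a generic biased estimator with a Gaussian kernel, not necessarily the kernel or estimator the authors used. The toy records, the function names, and the bandwidth `sigma` are assumptions made for illustration.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, sigma) for u in a for v in b)
        return total / (len(a) * len(b))
    # Within-sample similarity of each set minus twice the cross similarity.
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy binary records (hypothetical data, three features each).
real = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
close = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
collapsed = [[1, 0, 0]] * 4  # every synthetic record identical

print(mmd_squared(real, close))
print(mmd_squared(real, collapsed))
```

On this toy data the collapsed sample, which repeats a single record, scores a larger MMD against the real sample than the more varied one does. This is the kind of gap a sample affected by mode collapse would be expected to show.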
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Music and Audio Processing
