Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction
Marc-Andre Schulz, Bertrand Thirion, Alexandre Gramfort, Gaël Varoquaux, Danilo Bzdok

TL;DR
This paper shows that low-dimensional embeddings derived from a large biobank dataset can improve phenotype prediction when labeled health data are scarce, with Variational Autoencoder embeddings typically scaling better with unlabeled data than PCA or Isomap.
Contribution
It leverages Variational Autoencoder manifolds learned from unlabeled population data to enhance phenotype prediction in data-scarce biomedical contexts, and shows that prediction performance scales with the amount of unlabeled data.
Findings
VAE manifolds outperform PCA and Isomap in phenotype prediction
Embedding spaces improve prediction accuracy in data-scarce settings
Unsupervised learning scales well with increasing unlabeled data
Abstract
High-quality data accumulation is now becoming ubiquitous in the health domain. There is increasing opportunity to exploit rich data from normal subjects to improve supervised estimators in specific diseases with notorious data scarcity. We demonstrate that low-dimensional embedding spaces can be derived from the UK Biobank population dataset and used to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics. Phenotype predictions facilitated by Variational Autoencoder manifolds typically scaled better with increasing unlabeled data than dimensionality reduction by PCA or Isomap. Performance gains from semi-supervised approaches will probably become an important ingredient for various medical data science applications.
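The two-stage idea in the abstract (learn an embedding from plentiful unlabeled data, then fit a predictor on a few labeled samples in that embedding space) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic rather than UK Biobank, PCA stands in for the embedding (the paper also considers Isomap and Variational Autoencoders), and the sample sizes, noise level, and variable names are hypothetical. It is not the authors' pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical sizes: a large unlabeled pool and a small labeled set.
n_unlabeled, n_labeled, n_test = 5000, 40, 1000
n_features, n_latent = 200, 5

# Synthetic generative model (assumption): a few latent factors
# drive many noisy observed variables.
mixing = rng.normal(size=(n_latent, n_features))

def sample(n):
    """Draw n subjects; the label depends on the first latent factor."""
    z = rng.normal(size=(n, n_latent))
    x = z @ mixing + rng.normal(scale=2.0, size=(n, n_features))
    y = (z[:, 0] > 0).astype(int)
    return x, y

X_unlabeled, _ = sample(n_unlabeled)   # labels discarded on purpose
X_small, y_small = sample(n_labeled)
X_test, y_test = sample(n_test)

# Stage 1: unsupervised embedding fitted on unlabeled data only.
embedder = PCA(n_components=n_latent).fit(X_unlabeled)

# Stage 2: supervised model fitted on the few labeled samples,
# represented in the learned embedding space.
clf = LogisticRegression(max_iter=1000)
clf.fit(embedder.transform(X_small), y_small)
embedded_acc = clf.score(embedder.transform(X_test), y_test)

# Reference point: the same classifier on the raw features.
baseline = LogisticRegression(max_iter=1000).fit(X_small, y_small)
raw_acc = baseline.score(X_test, y_test)

print(f"accuracy with embedding: {embedded_acc:.3f}")
print(f"accuracy on raw features: {raw_acc:.3f}")
```

Varying `n_unlabeled` while holding `n_labeled` fixed gives a rough analogue of the scaling comparison described in the abstract. Results on this toy data say nothing about the paper's reported performance.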
Taxonomy
Topics: Biomedical Text Mining and Ontologies
Methods: Principal Components Analysis
