VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science
Youssef Abdalla, Marrisa Taub, Eleanor Hilton, Priya Akkaraju, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T. Cook, Tapabrata Chakraborti, David Shorthouse

TL;DR
VECT-GAN is a novel generative model designed to augment small, noisy pharmaceutical datasets, significantly improving predictive model performance and enabling the development of new therapeutically relevant compounds.
Contribution
This paper introduces VECT-GAN, a variationally encoded conditional tabular GAN, with a pipeline for data augmentation that outperforms existing models and is pre-trained on ChEMBL for broader applicability.
Findings
VECT-GAN improves regression performance on pharmaceutical datasets.
Synthetic data regularizes small tabular datasets effectively.
Pre-trained VECT-GAN enhances generalisability to small molecule data.
Abstract
Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and…
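The augment-then-regress pipeline described in the abstract can be illustrated with a minimal sketch. This is not VECT-GAN: the `augment` function below is a hypothetical placeholder that resamples real rows and adds Gaussian jitter, standing in for a trained generative model, and the one-feature least-squares fit stands in for whichever regression model is used downstream. All function names and parameters are assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def augment(rows, n_synthetic, noise_scale, rng):
    """Placeholder generator (NOT VECT-GAN): resample real (x, y) rows
    and perturb them with Gaussian noise to produce synthetic rows."""
    synthetic = []
    for _ in range(n_synthetic):
        x, y = rng.choice(rows)
        synthetic.append((x + rng.gauss(0.0, noise_scale),
                          y + rng.gauss(0.0, noise_scale)))
    return synthetic


def augment_then_fit(train_rows, n_synthetic, noise_scale=0.1, seed=0):
    """Augment the training rows first, then fit the regression model
    on real + synthetic rows. Returns ((slope, intercept), n_rows_used)."""
    rng = random.Random(seed)
    synthetic = augment(train_rows, n_synthetic, noise_scale, rng)
    combined = list(train_rows) + synthetic
    xs, ys = zip(*combined)
    return fit_line(xs, ys), len(combined)


if __name__ == "__main__":
    # Hold out test rows BEFORE augmenting, so synthetic rows are derived
    # from training data only and the evaluation stays on real data.
    real = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    train, test = real[:7], real[7:]
    (slope, intercept), n_used = augment_then_fit(train, n_synthetic=50)
    errors = [abs((slope * x + intercept) - y) for x, y in test]
    print(n_used, statistics.fmean(errors))
```

The point of the sketch is the ordering: the split happens first, the generator sees only training rows, and the regression model is fitted on the combined set. A jitter-based placeholder is not expected to reproduce the improvements reported for VECT-GAN.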
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Genetics, Bioinformatics, and Biomedical Research · Machine Learning in Healthcare
Methods: Knowledge Distillation
