Enhancing Drug Discovery: Autoencoder-Based Latent Space Augmentation for Improved Molecular Solubility Prediction using LatMixSol
Mohammad Saleh Hasankhani

TL;DR
LatMixSol introduces a latent space augmentation method using autoencoders and guided interpolation to improve molecular solubility prediction, addressing data scarcity and high-dimensional features in drug discovery.
Contribution
The paper presents a novel latent space augmentation framework combining autoencoder encoding and cluster-guided MixUp interpolation for better solubility prediction.
Findings
Achieves 3.2-7.6% RMSE reduction across models
Improves R-squared by 0.5-1.5 points
Most significant gains with HistGradientBoosting
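The findings above are stated as a relative RMSE reduction and an R-squared improvement. As a minimal sketch of how such figures are commonly computed (this summary does not say how the paper computes them, and the helper names `rmse`, `r_squared`, and `relative_rmse_reduction` are hypothetical), the following uses only the Python standard library:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def relative_rmse_reduction(rmse_baseline, rmse_augmented):
    """Percentage by which augmentation lowers RMSE relative to the baseline."""
    return 100.0 * (rmse_baseline - rmse_augmented) / rmse_baseline
```

Under this definition, a model is scored once when trained on the original data and once when trained on the augmented data, and the two RMSE values are passed to `relative_rmse_reduction`.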
Abstract
Accurate prediction of molecular solubility is a cornerstone of early-stage drug discovery, yet conventional machine learning models face significant challenges due to limited labeled data and the high-dimensional nature of molecular descriptors. To address these issues, we propose LatMixSol, a novel latent space augmentation framework that combines autoencoder-based feature compression with guided interpolation to enrich training data. Our approach first encodes molecular descriptors into a low-dimensional latent space using a two-layer autoencoder. Spectral clustering is then applied to group chemically similar molecules, enabling targeted MixUp-style interpolation within clusters. Synthetic samples are generated by blending latent vectors of cluster members and decoding them back to the original feature space. Evaluated on the Huuskonen solubility benchmark, LatMixSol demonstrates…
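The within-cluster interpolation step described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the latent vectors and cluster labels are taken as given (in the paper they would come from the autoencoder and spectral clustering), the mixing weight is drawn from a Beta(alpha, alpha) distribution as in standard MixUp, and the solubility target is blended with the same weight. The function name `cluster_mixup` and the parameters `alpha`, `n_new`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict

def cluster_mixup(latents, targets, cluster_ids, n_new, alpha=0.4, seed=0):
    """Create n_new synthetic (latent, target) pairs by mixing within clusters.

    Assumptions (not confirmed by the source): lambda ~ Beta(alpha, alpha),
    and targets are interpolated with the same lambda as the latent vectors.
    """
    rng = random.Random(seed)

    # Group sample indices by their cluster label.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Only clusters with at least two members can supply a pair to blend.
    eligible = [m for m in members.values() if len(m) >= 2]
    if not eligible:
        raise ValueError("need at least one cluster with two or more members")

    new_latents, new_targets = [], []
    for _ in range(n_new):
        # Pick a cluster, then two distinct members of that cluster.
        i, j = rng.sample(rng.choice(eligible), 2)
        lam = rng.betavariate(alpha, alpha)
        # Convex combination of the two latent vectors and their targets.
        z = [lam * a + (1.0 - lam) * b for a, b in zip(latents[i], latents[j])]
        y = lam * targets[i] + (1.0 - lam) * targets[j]
        new_latents.append(z)
        new_targets.append(y)
    return new_latents, new_targets

# Toy usage with made-up values: two well-separated clusters in a 2-D latent space.
toy_latents = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
toy_targets = [-1.0, -2.0, -5.0, -6.0]
toy_clusters = [0, 0, 1, 1]
synthetic_z, synthetic_y = cluster_mixup(toy_latents, toy_targets, toy_clusters, n_new=20)
```

Because pairs are drawn only from the same cluster, every synthetic latent vector lies on a line segment between two members of one cluster. In the full pipeline, these vectors would then be passed through the decoder to recover descriptor-space features.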
Taxonomy
Topics: Computational Drug Discovery Methods · Crystallization and Solubility Studies
Methods: Spectral Clustering
