SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules
Esben Jannik Bjerrum

TL;DR
This paper demonstrates that using multiple SMILES representations of the same molecule for data augmentation improves neural network performance in molecular property prediction tasks.
Contribution
It introduces SMILES enumeration as a novel data augmentation technique that enhances neural network modeling of molecules.
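To make the idea concrete, here is a minimal sketch of SMILES enumeration in plain Python. It is a toy under stated assumptions, not the paper's implementation: the molecule is passed as a hypothetical atom list plus bond list, it must be acyclic with single bonds only, and its atoms must not need brackets. The names `random_smiles` and `enumerate_smiles` are illustrative. A real pipeline would rely on a cheminformatics toolkit to handle rings, bond orders, and stereochemistry.

```python
import random


def random_smiles(atoms, bonds, rng):
    """Write one SMILES string by a randomized depth-first walk.

    Toy assumptions: acyclic molecule, single bonds only, atoms from the
    organic subset so hydrogens stay implicit.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def walk(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        rng.shuffle(children)  # random branch order gives different strings
        out = atoms[atom]
        # Every child except the last is written as a parenthesized branch.
        for child in children[:-1]:
            out += "(" + walk(child, atom) + ")"
        if children:
            out += walk(children[-1], atom)
        return out

    start = rng.randrange(len(atoms))  # random starting atom
    return walk(start, None)


def enumerate_smiles(atoms, bonds, n_tries=200, seed=0):
    """Collect the distinct SMILES strings found in n_tries random walks."""
    rng = random.Random(seed)
    return sorted({random_smiles(atoms, bonds, rng) for _ in range(n_tries)})


# Ethanol as a hypothetical graph input: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
variants = enumerate_smiles(ethanol_atoms, ethanol_bonds)
print(variants)
```

For ethanol the sketch finds four strings (`CCO`, `OCC`, `C(C)O`, `C(O)C`), all describing the same molecule. Each one can serve as a separate training example with the same label.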
Findings
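The augmented dataset was 130 times larger than the original.
The test-set R2 improved from 0.56 to 0.66 with SMILES enumeration, compared with a model trained on one canonical SMILES string per molecule.
Averaging predictions over a molecule's enumerated SMILES in the prediction phase further increased accuracy.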
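The prediction-phase averaging can be sketched as follows. This is an illustration under assumptions, not the paper's code: `toy_model` is a hypothetical stand-in for a trained network that maps one SMILES string to one number, and `predict_with_enumeration` is an illustrative name. The point is only that a model may score different SMILES strings of the same molecule differently, and that the mean over those strings is a single, steadier estimate.

```python
from statistics import mean


def predict_with_enumeration(model, smiles_variants):
    """Average a model's predictions over several SMILES of one molecule."""
    return mean(model(s) for s in smiles_variants)


def toy_model(smiles):
    """Hypothetical stand-in for a trained network (NOT a real QSAR model).

    Its output depends on the string form, so it scores equivalent SMILES
    differently, much as a sequence model can.
    """
    return 0.1 * len(smiles) + (0.5 if smiles.startswith("O") else 0.0)


# Four SMILES strings that all encode ethanol.
variants = ["CCO", "OCC", "C(C)O", "C(O)C"]

per_string = [toy_model(s) for s in variants]
averaged = predict_with_enumeration(toy_model, variants)

print(per_string)
print(averaged)
```

The averaged value always lies between the lowest and highest per-string prediction, so no single unusual string form dominates the result.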
Abstract
Simplified Molecular Input Line Entry System (SMILES) is a single line text representation of a unique molecule. One molecule can however have multiple SMILES strings, which is a reason that canonical SMILES have been defined, which ensures a one to one correspondence between SMILES string and molecule. Here the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set when compared to a model built with only one canonical SMILES string per molecule. The correlation coefficient R2 on the test set was improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS)…
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Machine Learning in Bioinformatics
