SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Esben Jannik Bjerrum

arXiv:1703.07076 · cs.LG · May 18, 2017 · 302 cites

TL;DR

This paper demonstrates that using multiple SMILES representations of the same molecule for data augmentation improves neural network performance in molecular property prediction tasks.

Contribution

It introduces SMILES enumeration as a novel data augmentation technique that enhances neural network modeling of molecules.
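
To make the idea concrete, the sketch below writes several valid SMILES strings for one small molecule by varying the atom at which traversal starts and the order in which branches are visited. It is a toy illustration, not the paper's implementation: the helper names `write_smiles` and `enumerate_smiles` are hypothetical, and the writer assumes an acyclic molecule with single bonds only (no rings, aromaticity, charges, or stereochemistry). A real pipeline would rely on a cheminformatics toolkit rather than a hand-written traversal.

```python
from itertools import permutations


def write_smiles(atoms, bonds, start, neighbor_rank):
    """Write one SMILES string for an acyclic, single-bonded molecule.

    Toy assumption: no rings, no bond orders, no charges, no stereochemistry.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    def visit(atom, parent):
        # Visit unvisited neighbours in the order given by neighbor_rank.
        children = [n for n in adjacency[atom] if n != parent]
        children.sort(key=lambda n: neighbor_rank[n])
        text = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


def enumerate_smiles(atoms, bonds):
    """Collect the distinct SMILES produced by every atom ordering.

    Brute force over all permutations, so only suitable for tiny molecules.
    """
    results = set()
    for order in permutations(range(len(atoms))):
        rank = {atom: position for position, atom in enumerate(order)}
        results.add(write_smiles(atoms, bonds, order[0], rank))
    return sorted(results)


# Ethanol: heavy atoms C-C-O (hydrogens are implicit in SMILES).
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
print(enumerate_smiles(ethanol_atoms, ethanol_bonds))
```

For ethanol this yields four strings, `C(C)O`, `C(O)C`, `CCO`, and `OCC`, each describing the same molecule; an augmented training set would pair every one of them with the same property label.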

Findings

01 The augmented dataset was 130 times larger than the original.

02 On the test set, R² improved from 0.56 to 0.66 when training with SMILES enumeration.

03 Averaging predictions over the enumerated SMILES of each molecule at prediction time further increased accuracy.
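
Averaging at prediction time can be sketched as follows: each molecule is presented to the model as several SMILES strings, and the per-string predictions are averaged into one value per molecule. This is a minimal sketch under stated assumptions; the function name `average_by_molecule`, the molecule identifiers, and the prediction values are made up for illustration and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean


def average_by_molecule(molecule_ids, predictions):
    """Average per-SMILES predictions into one prediction per molecule.

    molecule_ids[i] names the molecule that prediction i belongs to.
    """
    grouped = defaultdict(list)
    for mol_id, pred in zip(molecule_ids, predictions):
        grouped[mol_id].append(pred)
    return {mol_id: mean(values) for mol_id, values in grouped.items()}


# Hypothetical model outputs: three SMILES variants of mol_a, two of mol_b.
ids = ["mol_a", "mol_a", "mol_a", "mol_b", "mol_b"]
preds = [1.0, 2.0, 3.0, 4.0, 6.0]
averaged = average_by_molecule(ids, preds)
print(averaged)
```

The intuition is that each SMILES variant gives the network a slightly different view of the same molecule, so the mean acts like a small ensemble over those views.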

Abstract

Simplified Molecular Input Line Entry System (SMILES) is a single-line text representation of a unique molecule. One molecule can, however, have multiple SMILES strings, which is why canonical SMILES have been defined to ensure a one-to-one correspondence between SMILES string and molecule. Here, the fact that multiple SMILES strings represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short-term memory (LSTM) cell-based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set than a model built with only one canonical SMILES string per molecule. The correlation coefficient R² on the test set improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS)…
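
The two metrics named in the abstract can be computed as in the sketch below. It assumes that "correlation coefficient R²" means the squared Pearson correlation between measured and predicted values; the paper may compute it differently (for example as a coefficient of determination). The function names and the sample numbers are illustrative only and are not data from the paper.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation (assumed meaning of R² here)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Made-up values purely to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred), squared_pearson(y_true, y_pred))
```

A higher R² and a lower RMS error both indicate predictions that track the measured values more closely, which is the direction of change the abstract reports for the augmented model.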

Taxonomy

Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Machine Learning in Bioinformatics