SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery
Shion Honda, Shoi Shi, Hiroki R. Ueda

TL;DR
The paper introduces SMILES Transformer, a pre-trained language model that generates molecular fingerprints, improving drug discovery predictions especially in small data scenarios by leveraging unsupervised learning on SMILES representations.
Contribution
It presents a novel Transformer-based pre-trained model for molecular fingerprinting, outperforming traditional methods in low-data drug discovery tasks.
Findings
Outperforms existing fingerprints and graph-based methods in benchmarks on 10 datasets
Effective in small-data settings due to pre-training
Introduces a new metric for accuracy and data efficiency
Abstract
In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms…
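As an illustration of how a SMILES string can be turned into input for a sequence model of the kind the abstract describes, the following is a minimal sketch, not the authors' implementation. The token pattern, the `<pad>` and `<unk>` symbols, and the function names `tokenize`, `build_vocab`, and `encode` are all assumptions made for this example; the tokenization actually used in the paper may differ.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical token pattern (an assumption for this sketch): bracket atoms
# such as [C@H] are kept whole, the two-letter halogens Br and Cl are kept
# whole, and every other character becomes its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens using the pattern above."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(corpus: Iterable[str]) -> Dict[str, int]:
    """Assign an integer id to every token seen in the corpus.

    Ids 0 and 1 are reserved for padding and unknown tokens.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in corpus:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map a SMILES string to a fixed-length list of token ids.

    Sequences longer than max_len are truncated; shorter ones are padded.
    """
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokenize(smiles)]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    corpus = ["CCO", "CC(=O)Cl", "C[C@H](N)C(=O)O"]
    vocab = build_vocab(corpus)
    print(tokenize("CC(=O)Cl"))
    print(encode("CCO", vocab, max_len=8))
```

Fixed-length id sequences like these are what an encoder would consume during unsupervised pre-training; the learned fingerprint would then be read from the encoder's internal representation rather than computed by a rule-based algorithm.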
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Protein Structure and Dynamics
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
