Large-Scale Chemical Language Representations Capture Molecular Structure and Properties
Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan, Inkit Padhi, Youssef Mroueh, Payel Das

TL;DR
This paper introduces MoLFormer, a transformer-based model trained on the SMILES sequences of 1.1 billion unlabelled molecules. Its learned embeddings capture molecular structure and properties, and the model outperforms existing models on multiple property prediction tasks.
Contribution
The paper presents a large-scale, efficient transformer model for molecular embeddings trained on massive unlabelled data, demonstrating superior performance and structural understanding over prior models.
Findings
MoLFormer outperforms existing baselines on multiple benchmarks.
The model captures spatial relationships between atoms.
It effectively predicts various molecular properties.
Abstract
Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets.…
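The abstract names rotary positional embeddings as one ingredient of MoLFormer. As a rough illustration only, the sketch below applies the generic rotary formulation to a toy query and key vector and checks the property that makes it attractive: the attention score depends on the relative offset between two positions, not on their absolute values. The function names `rotate` and `dot`, the base of 10000, and the toy vectors are assumptions made for this example; none of them come from the paper's code, and the sketch does not include the linear attention mechanism the model pairs these embeddings with.

```python
import math


def rotate(vec, pos, base=10000.0):
    """Apply a rotary positional embedding to an even-length vector.

    Hypothetical helper for illustration: each consecutive pair of
    dimensions is rotated by an angle proportional to the position.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Rotation angle for this pair; the frequency decays with the pair index.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (assumed values, not taken from the paper).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.2]

# Two position pairs with the same offset (3) but different absolute positions.
score_near = dot(rotate(q, 2), rotate(k, 5))
score_far = dot(rotate(q, 12), rotate(k, 15))
```

Here `score_near` and `score_far` agree up to floating-point error, because rotating both vectors changes their dot product only through the difference of the two positions.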
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Various Chemistry Research Topics
