ChemBERTa-2: Towards Chemical Foundation Models
Walid Ahmad, Elana Simon, Seyone Chithrananda, Gabriel Grand, Bharath Ramsundar

TL;DR
ChemBERTa-2 advances molecular machine learning by leveraging large-scale SMILES data and improved pretraining techniques, achieving competitive performance on benchmark tasks.
Contribution
This work introduces ChemBERTa-2, a chemical foundation model pretrained on up to 77M PubChem compounds, one of the largest SMILES datasets used for molecular pretraining to date, with an optimized pretraining process aimed at better downstream task performance.
Findings
Pretraining on 77 million compounds improves model performance.
ChemBERTa-2 achieves performance competitive with state-of-the-art results on MoleculeNet benchmarks.
Enhanced pretraining translates to better downstream task accuracy.
Abstract
Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with…
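As an illustration of what self-supervised pretraining on SMILES can look like, the sketch below splits a SMILES string into tokens and hides a random subset, so that a model could be trained to recover the hidden tokens. It assumes a BERT-style masked-token objective; the regex tokenizer, the 15% masking rate, the `<mask>` symbol, and the function names are illustrative choices, not details taken from the abstract or the authors' code.

```python
import random
import re

# Simplified atom-level SMILES pattern (an assumption for this sketch):
# bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of string tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=0):
    """Replace each token with mask_token with probability mask_prob.

    Returns (masked, labels): labels holds the original token at masked
    positions and None elsewhere, so a loss would only be computed where
    a token was hidden.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    masked, labels = mask_tokens(tokens)
    print(tokens)
    print(masked)
```

In a real pipeline the tokens would be mapped to integer ids and fed to a transformer encoder; the sketch only shows how unlabeled SMILES strings can supply their own training targets, which is why large unlabeled libraries are useful when labeled data is scarce.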
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Weight Decay · Dropout · Byte Pair Encoding · Linear Warmup With Cosine Annealing
