ChemBERTa-2: Towards Chemical Foundation Models

Walid Ahmad; Elana Simon; Seyone Chithrananda; Gabriel Grand; Bharath Ramsundar

arXiv:2209.01712·cs.LG·September 7, 2022·141 cites

TL;DR

ChemBERTa-2 advances molecular machine learning by leveraging large-scale SMILES data and improved pretraining techniques, achieving competitive performance on benchmark tasks.

Contribution

This work introduces ChemBERTa-2, a chemical foundation model pretrained on up to 77M SMILES strings from PubChem, one of the largest datasets used for molecular pretraining to date, with a pretraining process optimized for downstream task performance.

Findings

01. Pretraining on up to 77 million PubChem compounds improves model performance.

02. ChemBERTa-2 achieves results competitive with state-of-the-art models on MoleculeNet benchmarks.

03. Improvements in pretraining translate to better downstream task accuracy.

Abstract

Large pretrained models such as GPT-3 have had tremendous impact on modern natural language processing by leveraging self-supervised learning to learn salient representations that can be used to readily finetune on a wide variety of downstream tasks. We investigate the possibility of transferring such advances to molecular machine learning by building a chemical foundation model, ChemBERTa-2, using the language of SMILES. While labeled data for molecular prediction tasks is typically scarce, libraries of SMILES strings are readily available. In this work, we build upon ChemBERTa by optimizing the pretraining process. We compare multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size, up to 77M compounds from PubChem. To our knowledge, the 77M set constitutes one of the largest datasets used for molecular pretraining to date. We find that with…
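As a rough illustration of self-supervised pretraining on SMILES, the sketch below tokenizes a SMILES string and hides a fraction of its tokens so that a model could be trained to recover them. It is a minimal sketch and not the authors' pipeline: the regex tokenizer, the 15% mask rate (the usual BERT-style default), the `[MASK]` symbol, and the helper names `tokenize` and `mask_tokens` are all assumptions made for this example.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for this sketch): bracket atoms
# such as [NH4+] first, then the two-letter halogens, then single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Replace a random subset of tokens with mask_token.

    Returns the masked sequence and a dict mapping each masked position to
    its original token, which is what the model would be asked to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens, seed=0)
print(masked)
print(labels)
```

No labels are needed to build these training pairs, which is why readily available SMILES libraries can stand in for scarce labeled data during pretraining.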

Code & Models

Datasets

zpn/clearance · dataset · 24 dl

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Weight Decay · Dropout · Byte Pair Encoding · Linear Warmup With Cosine Annealing