SmilesT5: Domain-specific pretraining for molecular language models
Philip Spence, Brooks Paige, Anne Osbourn

TL;DR
SmilesT5 introduces domain-specific text-to-text pretraining tasks for transformer-based molecular language models trained on SMILES strings. The tasks improve molecular property prediction for drug discovery and make training more data- and compute-efficient.
Contribution
The paper proposes novel domain-specific text-to-text pretraining tasks for molecular language models, which improve performance relative to both traditional likelihood-based training and previously proposed fine-tuning tasks.
Findings
Improved performance on six classification-based molecular property prediction benchmarks.
Enhanced data and computational efficiency with domain-specific pretraining.
Pretrained embeddings perform comparably to fine-tuning at a lower computational cost.
Abstract
Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance in six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show…
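To make the idea of masked, text-to-text training on SMILES strings concrete, the following is a minimal sketch of generic T5-style sentinel masking applied to a tokenized SMILES string. It is not the paper's method: the abstract does not describe the domain-specific tasks, so this only illustrates the baseline style of corruption objective. The tokenizer regex, the function names (`tokenize`, `corrupt_spans`, `restore`), and the 15% mask rate are all assumptions chosen for illustration; the `<extra_id_N>` sentinel naming follows the convention used by common T5 tokenizers.

```python
import random
import re

# Simplified SMILES token pattern (an assumption for illustration): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens; joining the tokens gives back the input."""
    return SMILES_TOKEN.findall(smiles)


def corrupt_spans(tokens, mask_rate=0.15, seed=0):
    """Mask random tokens, replacing each run of adjacent masked tokens with one sentinel.

    Returns (source, target): the source keeps unmasked tokens plus sentinels,
    the target lists each sentinel followed by the tokens it hides.
    Assumes a non-empty token list.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    masked = set(rng.sample(range(len(tokens)), n_mask))

    source, target = [], []
    sentinel_id = 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:
                # Start of a new masked run: emit one sentinel on both sides.
                marker = f"<extra_id_{sentinel_id}>"
                source.append(marker)
                target.append(marker)
                sentinel_id += 1
            target.append(tok)
            prev_masked = True
        else:
            source.append(tok)
            prev_masked = False
    return source, target


def restore(source, target):
    """Invert corrupt_spans by substituting each sentinel with its hidden tokens."""
    spans = {}
    current = None
    for tok in target:
        if tok.startswith("<extra_id_"):
            current = tok
            spans[current] = []
        else:
            spans[current].append(tok)
    restored = []
    for tok in source:
        restored.extend(spans[tok] if tok in spans else [tok])
    return restored


if __name__ == "__main__":
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    source, target = corrupt_spans(tokens)
    print("source:", " ".join(source))
    print("target:", " ".join(target))
```

A sequence-to-sequence model would be trained to produce the target from the source. The `restore` helper is only a sanity check that the corruption loses no information.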
