SmilesT5: Domain-specific pretraining for molecular language models

Philip Spence; Brooks Paige; Anne Osbourn

arXiv:2507.22514 · cs.LG · July 31, 2025

TL;DR

SmilesT5 introduces domain-specific text-to-text pretraining tasks for transformer-based molecular language models trained on SMILES strings. These tasks improve performance on six classification-based molecular property prediction benchmarks, a task relevant to drug discovery, and improve data and computational efficiency.

Contribution

The paper proposes novel domain-specific text-to-text pretraining tasks for molecular language models. These tasks improve performance and efficiency relative to both traditional likelihood-based training and previously proposed fine-tuning tasks.

Findings

01. Improved performance on six molecular property prediction benchmarks.

02. Enhanced data and computational efficiency with domain-specific pretraining.

03. Pretrained embeddings perform comparably to finetuning with lower computational cost.

Abstract

Molecular property prediction is an increasingly critical task within drug discovery and development. Typically, neural networks can learn molecular properties using graph-based, language-based or feature-based methods. Recent advances in natural language processing have highlighted the capabilities of neural networks to learn complex human language using masked language modelling. These approaches to training large transformer-based deep learning models have also been used to learn the language of molecules, as represented by simplified molecular-input line-entry system (SMILES) strings. Here, we present novel domain-specific text-to-text pretraining tasks that yield improved performance in six classification-based molecular property prediction benchmarks, relative to both traditional likelihood-based training and previously proposed fine-tuning tasks. Through ablation studies, we show…
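
To make the idea of a text-to-text masked objective on SMILES concrete, the sketch below tokenizes a SMILES string and applies generic T5-style span corruption: contiguous token spans are replaced by sentinel tokens in the input, and the target lists each sentinel followed by the tokens it hides. This is not the paper's domain-specific pretraining tasks, which the abstract excerpt does not specify. The tokenizer regex, the function names `tokenize` and `span_corrupt`, the sentinel format `<extra_id_N>`, and the defaults `mask_rate=0.15` and `max_span=3` are all assumptions chosen for illustration.

```python
import random
import re

# Assumed tokenizer: a regex that splits SMILES into bracket atoms,
# two-letter halogens, ring-closure labels, organic-subset atoms, and
# bond/branch symbols. It is illustrative, not the paper's tokenizer.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-\+\\/\(\)\.:~@\?\*\$]|\d)"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown input."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens


def span_corrupt(tokens, mask_rate=0.15, max_span=3, seed=0):
    """Generic T5-style span corruption (hypothetical helper).

    Returns (input_tokens, target_tokens). Each masked span becomes one
    sentinel in the input; the target holds each sentinel followed by the
    hidden tokens, and ends with one closing sentinel.
    """
    if not tokens:
        raise ValueError("cannot corrupt an empty token list")
    rng = random.Random(seed)
    n = len(tokens)
    n_mask = min(n, max(1, round(n * mask_rate)))

    # Choose positions to mask in short contiguous runs.
    masked = set()
    while len(masked) < n_mask:
        start = rng.randrange(n)
        length = min(max_span, n_mask - len(masked))
        masked.update(range(start, min(n, start + length)))

    inp, tgt, sentinel = [], [], 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:  # a new span starts here
                marker = f"<extra_id_{sentinel}>"
                inp.append(marker)
                tgt.append(marker)
                sentinel += 1
            tgt.append(tok)
            prev_masked = True
        else:
            inp.append(tok)
            prev_masked = False
    tgt.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return inp, tgt


if __name__ == "__main__":
    aspirin = tokenize("CC(=O)Oc1ccccc1C(=O)O")
    source, target = span_corrupt(aspirin, seed=1)
    print("input :", " ".join(source))
    print("target:", " ".join(target))
```

Splitting on chemically meaningful tokens rather than characters keeps units such as `Cl` or `[Na+]` intact, so a masked span never cuts an atom in half. A seeded `random.Random` makes the corruption reproducible for a given molecule.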
