Tokenization for Molecular Foundation Models

Alexius Wadell; Anoushka Bhutani; Venkatasubramanian Viswanathan

arXiv:2409.15370·cs.LG·January 29, 2026

TL;DR

This paper evaluates existing molecular tokenizers, introduces new open-vocabulary tokenizers with full SMILES coverage, and demonstrates their effectiveness in molecular property prediction tasks, emphasizing the importance of chemically diverse benchmarks.

Contribution

The paper systematically assesses current tokenizers, introduces Smirk and Smirk-GPE with complete SMILES coverage, and validates their utility in molecular modeling.

Findings

01. Existing tokenizers have limited coverage of molecular space.

02. Open-vocabulary tokenizers improve molecular property prediction.

03. Chemically diverse benchmarks are essential for progress.
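
To make the coverage finding concrete, here is a minimal sketch of how a closed, atom-level vocabulary can miss tokens that a finer-grained one still covers. Everything in it is an assumption for illustration: the regular expressions are simplified and do not implement the full OpenSMILES grammar, the function names are hypothetical, and the glyph-level split is only one plausible route to broader coverage, not the authors' Smirk implementation.

```python
import re

# Atom-level pattern: each bracket atom such as "[C@@H]" is a single token.
# Simplified for illustration; not a complete OpenSMILES grammar.
ATOM_LEVEL = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#$:/\\\-+().@*~]")

# Pieces found inside a bracket atom: element symbol, aromatic symbol,
# single digits, chirality marks, charge signs, atom-class colon, wildcard.
BRACKET_GLYPH = re.compile(r"[A-Z][a-z]?|[a-z]+|\d|[@+\-:*]")


def atom_tokens(smiles):
    """Tokenize with bracket atoms kept whole (closed-vocabulary style)."""
    return ATOM_LEVEL.findall(smiles)


def glyph_tokens(smiles):
    """Tokenize, then split each bracket atom into its constituent glyphs."""
    out = []
    for tok in atom_tokens(smiles):
        if tok.startswith("["):
            out.append("[")
            out.extend(BRACKET_GLYPH.findall(tok[1:-1]))
            out.append("]")
        else:
            out.append(tok)
    return out


def unknown_count(tokenize, train, smiles):
    """Count tokens of `smiles` absent from the vocabulary built on `train`."""
    vocab = {t for s in train for t in tokenize(s)}
    return sum(t not in vocab for t in tokenize(smiles))


# Hypothetical toy corpus and a held-out molecule with unseen bracket atoms.
TRAIN = ["CCO", "c1ccccc1", "C[C@H](N)C(=O)O", "[Na+].[Cl-]", "[Cu+2]"]
TEST = "C[C@@H](N)C(=O)[O-]"

print("atom-level unknowns: ", unknown_count(atom_tokens, TRAIN, TEST))
print("glyph-level unknowns:", unknown_count(glyph_tokens, TRAIN, TEST))
```

On this toy corpus the atom-level vocabulary has never seen `[C@@H]` or `[O-]`, so both become unknown tokens, while every glyph in the held-out molecule already appears somewhere in the training set.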

Abstract

Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in material science and molecular design. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification. The proposed tokenizers…
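
The abstract's idea of using n-gram language models as a low-cost proxy can be sketched as follows. This is an assumed, minimal bigram model with add-one smoothing, scored by average cross-entropy per token; the order of the n-gram, the smoothing scheme, and the function names are illustrative choices, not details taken from the paper.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count bigrams over token sequences padded with start/end markers."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1          # times `prev` is used as context
            bigrams[(prev, cur)] += 1    # times `cur` follows `prev`
    return unigrams, bigrams, vocab


def cross_entropy(model, seq):
    """Average bits per token under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot reserves mass for unseen tokens
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)
        total -= math.log2(p)
    return total / (len(padded) - 1)


# Hypothetical usage: sequences are lists of tokens from any tokenizer;
# plain characters are used here only to keep the sketch self-contained.
model = train_bigram(["CCO", "CCN", "CCC"])
print(cross_entropy(model, "CCO"))  # sequence resembling the training data
print(cross_entropy(model, "OOO"))  # sequence unlike the training data
```

Running the same scoring with token sequences from different tokenizers gives a cheap, comparable number per tokenizer, which is the sense in which such a model can act as a proxy before any encoder is trained.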
