Tokenization for Molecular Foundation Models
Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan

TL;DR
This paper evaluates existing molecular tokenizers, introduces new open-vocabulary tokenizers with full SMILES coverage, and demonstrates their effectiveness in molecular property prediction tasks, emphasizing the importance of chemically diverse benchmarks.
Contribution
The paper systematically assesses current tokenizers, introduces Smirk and Smirk-GPE with complete SMILES coverage, and validates their utility in molecular modeling.
Findings
Existing tokenizers have limited coverage of molecular space.
Open-vocabulary tokenizers improve molecular property prediction.
Chemically diverse benchmarks are essential for progress.
Abstract
Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in material science and molecular design. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by pretraining and finetuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification. The proposed tokenizers…
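To make the coverage gap concrete, the following is a minimal sketch of how one might measure the fraction of tokens that a closed-vocabulary tokenizer cannot represent. It is not the paper's evaluation code and not Smirk: the regular expression is a simplified, illustrative atom-level SMILES pattern, and the names `SMILES_PATTERN`, `tokenize`, `unknown_fraction`, the toy vocabulary, and the example molecules are all assumptions made for this example.

```python
import re

# Simplified atom-level SMILES pattern (an assumption for illustration only;
# it does not cover the full OpenSMILES grammar and is not the paper's tokenizer).
# Order matters: bracket atoms and two-letter halogens are tried before single letters.
SMILES_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/():.@*~$]"
)


def tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_PATTERN.findall(smiles)


def unknown_fraction(smiles_list, vocab):
    """Fraction of tokens across all molecules that are missing from vocab."""
    total = 0
    unknown = 0
    for smiles in smiles_list:
        for token in tokenize(smiles):
            total += 1
            if token not in vocab:
                unknown += 1
    return unknown / total if total else 0.0


# Hypothetical closed vocabulary: each bracket atom is a single entry,
# so any bracket atom not listed here would map to an unknown token.
vocab = {"C", "c", "N", "O", "(", ")", "=", "1", "[nH]"}

# Toy molecules: ethanol, benzene, and a gold-containing fragment.
molecules = ["CCO", "c1ccccc1", "C[Au]"]

if __name__ == "__main__":
    print(tokenize("C[Au]"))
    print(unknown_fraction(molecules, vocab))
```

In this toy setup the bracket atom `[Au]` is absent from the vocabulary, so a closed-vocabulary tokenizer of this kind would emit an unknown token for it. A tokenizer that decomposes bracket atoms into smaller pieces could avoid this, which is the general motivation the abstract gives for open-vocabulary designs.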
