Chemical Language Models for Natural Products: A State-Space Model Approach
Ho-Hsuan Wang, Afnan Sultan, Andrea Volkamer, Dietrich Klakow

TL;DR
This paper develops and compares state-space and transformer-based chemical language models tailored for natural products, demonstrating that domain-specific pre-training on a modest dataset can achieve competitive performance in molecule generation and property prediction.
Contribution
It introduces NP-specific state-space models (Mamba and Mamba-2), systematically compares them with transformers, and evaluates various tokenization strategies for natural product tasks.
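To make the difference between tokenization strategies concrete, the sketch below contrasts character-level splitting with a simple atom-aware split of a SMILES string. It is illustrative only: the regular expression and the function names `char_tokenize` and `atom_tokenize` are assumptions made for this example, not the tokenizers evaluated in the paper, and the pattern covers only bracket atoms, `Cl`, `Br`, and two-digit ring closures.

```python
import re

# Assumed, simplified pattern: bracket atoms, two-letter halogens,
# two-digit ring closures (%NN), then any other single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def char_tokenize(smiles: str) -> list[str]:
    """Character-level tokenization: every character is its own token."""
    return list(smiles)


def atom_tokenize(smiles: str) -> list[str]:
    """Atom-aware tokenization: keeps multi-character atoms together."""
    return ATOM_PATTERN.findall(smiles)


if __name__ == "__main__":
    example = "[C@@H](Br)CCl"
    print(char_tokenize(example))
    print(atom_tokenize(example))
```

Character-level splitting breaks `Cl` into `C` and `l`, while the atom-aware split keeps it as one token. Subword methods such as BPE would merge frequent token sequences further.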
Findings
Mamba generates 1-2% more valid and unique molecules than GPT.
Mamba variants outperform GPT in property prediction by 0.02-0.04 MCC.
Domain-specific pre-training on about 1M NPs matches models pre-trained on larger datasets.
Abstract
Language models are widely used in chemistry for molecular property prediction and small-molecule generation, yet Natural Products (NPs) remain underexplored despite their importance in drug discovery. To address this gap, we develop NP-specific chemical language models (NPCLMs) by pre-training state-space models (Mamba and Mamba-2) and comparing them with transformer baselines (GPT). Using a dataset of about 1M NPs, we present the first systematic comparison of selective state-space models and transformers for NP-focused tasks, together with eight tokenization strategies including character-level, Atom-in-SMILES (AIS), byte-pair encoding (BPE), and NP-specific BPE. We evaluate molecule generation (validity, uniqueness, novelty) and property prediction (membrane permeability, taste, anti-cancer activity) using MCC and AUC-ROC. Mamba generates 1-2 percent more valid and unique molecules…
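The three generation metrics named in the abstract can be sketched as follows. The definitions are an assumption based on one common convention (validity over all generated strings, uniqueness over valid ones, novelty over unique valid ones not seen in training); the paper may define them differently. The validity check is passed in as a callable. In practice it is often implemented with RDKit, where `Chem.MolFromSmiles` returns `None` for unparsable SMILES. The `toy_is_valid` function here is a hypothetical stand-in that only checks parentheses, and the sketch assumes all strings are already canonicalized.

```python
from typing import Callable, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict[str, float]:
    """Validity, uniqueness, and novelty of generated SMILES strings."""
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Hypothetical stand-in for a real parser: balanced parentheses only."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


if __name__ == "__main__":
    samples = ["CCO", "CCO", "C(C", "c1ccccc1"]
    print(generation_metrics(samples, training={"CCO"}, is_valid=toy_is_valid))
```

Because the denominators differ, a model can score higher on validity and uniqueness while scoring lower on novelty, which is the trade-off the reviewers discuss.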
Peer Reviews
Decision: Submitted to ICLR 2026
The contribution addresses the gap in modeling natural products in chemistry. The presentation is clear, the methodology is sound.
Aside from targeting a narrow yet important chemical domain, namely natural products, the paper does not report any particularly interesting results. It is solid research that would be best suited for a cheminformatics journal (Journal of Chemical Information and Modeling or something similar).
1. A large and systematic comparison was conducted. 2. The presentation and results discussion are detailed and easy to follow. 3. The covered related work is comprehensive and well-aligned with the discussion. 4. Mamba architectures are applied to natural product modeling for the first time.
**Significance** 1. The focus of the work is natural products, which is a narrow subfield of drug discovery. The significance of the findings to the broader ICLR community is limited. The work would fit better in technical and specialized drug discovery venues. 2. Mamba and some tokenizers are applied to natural products for the first time here. However, these approaches perform similarly to the existing work and only confirm the findings (as cited multiple times by the authors), yielding no new i
1. **Essential field and motivation**: The paper addresses an important and underexplored problem: developing chemical language models specifically for natural products (NPs). Since NPs are chemically more complex and biologically significant than typical synthetic molecules, focusing on them fills a meaningful gap in current research and aligns well with drug discovery applications. By targeting this space, the paper contributes to an essential field with high scientific and practical relevance
The main limitation of the paper is that the contribution largely reduces to an extensive set of empirical comparisons, but without introducing new methodological innovations or deriving deeper insights that could guide future work. 1. **Model comparison (Mamba vs GPT)**: The comparison between Mamba and GPT models offers limited value. The results mainly show that Mamba tends to generate molecules with higher validity and uniqueness, while GPT produces more novel ones. However, this trade-off
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Machine Learning in Bioinformatics
