mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules
Carl Edwards, Chi Han, Gawon Lee, Thao Nguyen, Sara Szymkuć, Chetan Kumar Prasad, Bowen Jin, Jiawei Han, Ying Diao, Ge Liu, Hao Peng, Bartosz A. Grzybowski, Martin D. Burke, Heng Ji

TL;DR
mCLM is a modular chemical language model that tokenizes molecules into functional building blocks, enabling better prediction of molecular functions and synthesizability, thus advancing automated drug discovery and synthesis.
Contribution
The paper introduces mCLM, a novel modular language model that tokenizes molecules into functional blocks, improving function prediction and synthesis compatibility over atom-based models.
Findings
mCLM significantly improves chemical function predictions.
mCLM enhances synthetic accessibility compared to other AI methods.
mCLM outperforms baselines on out-of-distribution drug molecules.
Abstract
Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as…
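The contrast the abstract draws between character-like (atom-level) tokens and block-level tokens can be illustrated with a toy sketch. This is only a string-level analogy, not the paper's tokenizer: the vocabulary `BLOCK_VOCAB`, the token names, and the greedy longest-match rule are all assumptions made for illustration, and real building-block tokenization would operate on molecular graphs and synthesis rules rather than on SMILES substrings.

```python
# Toy illustration only: a hypothetical building-block vocabulary.
# The fragments and token names are invented here, not taken from mCLM.
BLOCK_VOCAB = {
    "c1ccccc1": "<blk:phenyl>",
    "C(=O)N": "<blk:amide>",
    "C(=O)O": "<blk:carboxyl>",
}


def char_tokenize(smiles):
    """Character-level tokenization: one token per SMILES character."""
    return list(smiles)


def block_tokenize(smiles, vocab=BLOCK_VOCAB):
    """Greedy longest-match segmentation of a SMILES string.

    Substrings found in the (hypothetical) vocabulary become one block
    token; anything else falls back to single characters. This is plain
    string matching and is not chemistry-aware.
    """
    fragments = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(smiles):
        for frag in fragments:
            if smiles.startswith(frag, i):
                tokens.append(vocab[frag])
                i += len(frag)
                break
        else:
            # No vocabulary fragment starts here: emit one character.
            tokens.append(smiles[i])
            i += 1
    return tokens


benzoic_acid = "c1ccccc1C(=O)O"
char_tokens = char_tokenize(benzoic_acid)
block_tokens = block_tokenize(benzoic_acid)
print(len(char_tokens), char_tokens)
print(len(block_tokens), block_tokens)
```

In this sketch the same molecule becomes a long sequence of single characters under character-level tokenization, but a short sequence of named units under block-level tokenization, which is the analogy to (sub-)word tokens in text.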
Peer Reviews
Decision · ICLR 2026 Oral
There is novelty in this work. Specifically, there is clear architectural separation between domain-specific encoders and a unified fusion backbone which promotes flexibility and domain transfer. Experiment results are also promising. This work outperforms strong baselines (MolX, ChemBERTa, GraphMVP) on multimodal reasoning tasks, particularly in low-data and cross-domain settings. Figure 4 is particularly useful as it shows module-wise attribution analyses for how modality-specific knowledge
There is, however, limited novelty at the core LLM level. While modularization is effective, the language model itself is adapted rather than fundamentally redesigned for chemistry. There is also a lack of validation of the practicality of this approach, for example on real-world sparse datasets. The evaluation focuses primarily on benchmark datasets, with minimal discussion of noisy experimental spectra or reaction data. Further, the computational cost of the work seems infeasible. Training multiple moda
- Conceptually Innovative but Incremental in Execution: The idea of representing molecules through modular, synthesis-ready building blocks rather than atom-level encoding is conceptually novel and offers a creative bridge between digital design and physical synthesis. This modular approach reflects an original perspective on chemical language modeling. However, the implementation mainly extends existing ideas from reaction-aware and retrosynthesis-based models, making the innovation more increme
- Limited Generalization and Chemical Creativity: The modular tokenization relies on a fixed library of known reaction building blocks and predefined synthesis rules. While this ensures synthetic feasibility, it severely restricts the model’s ability to explore novel chemical spaces or generate fundamentally new scaffolds beyond existing reaction types. Thus, the model’s creativity is constrained by human-curated chemistry knowledge.
- Lack of Experimental Validation and limited ablation and int
- The construction of an LLM framework that jointly considers synthesizability and functionality represents an important step toward practical and interpretable molecular generation.
- The integration of GNN representations with natural language embeddings for modular chemical reasoning is technically novel and well-motivated.
- The figures are clean, well-structured, and enhance the overall readability and understanding of the method.
I would consider raising the score if the following weaknesses are resolved.
- **Comparison to fragment-aware baselines**: While the paper includes comparisons to recent general-purpose and domain-specific molecule LLMs, it omits fragment- or group-aware baselines such as SAFE [1], GROUPSELFIES [2], or Reasyn [3]. Even acknowledging that Reasyn is concurrent, such comparisons (especially against Transformer-based models with other representations, as mCLM itself employs a Transformer backbone) w
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Chemical Synthesis and Analysis
