mCLM: A Modular Chemical Language Model that Generates Functional and Makeable Molecules

Carl Edwards; Chi Han; Gawon Lee; Thao Nguyen; Sara Szymkuć; Chetan Kumar Prasad; Bowen Jin; Jiawei Han; Ying Diao; Ge Liu; Hao Peng; Bartosz A. Grzybowski; Martin D. Burke; Heng Ji

arXiv:2505.12565·cs.AI·March 3, 2026

TL;DR

mCLM is a modular chemical language model that tokenizes molecules into functional building blocks, enabling better prediction of molecular functions and synthesizability, thus advancing automated drug discovery and synthesis.

Contribution

The paper introduces mCLM, a novel modular language model that tokenizes molecules into functional blocks, improving function prediction and synthesis compatibility over atom-based models.

Findings

01. mCLM significantly improves chemical function predictions.
02. mCLM enhances synthetic accessibility compared to other AI methods.
03. mCLM outperforms baselines on out-of-distribution drug molecules.

Abstract

Despite their ability to understand chemical knowledge, large language models (LLMs) remain limited in their capacity to propose novel molecules with desired functions (e.g., drug-like properties). In addition, the molecules that LLMs propose can often be challenging to make, and are almost never compatible with automated synthesis approaches. To better enable the discovery of functional small molecules, LLMs need to learn a new molecular language that is more effective in predicting properties and inherently synced with automated synthesis technology. Current molecule LLMs are limited by representing molecules based on atoms. In this paper, we argue that just like tokenizing texts into meaning-bearing (sub-)word tokens instead of characters, molecules should be tokenized at the level of functional building blocks, i.e., parts of molecules that bring unique functions and serve as…
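The contrast the abstract draws between atom-level and block-level tokenization can be sketched with a toy example. This is a minimal illustration only: the vocabulary, the token names, and the assumption that a molecule arrives already split into building blocks are all hypothetical and are not taken from the mCLM tokenizer, whose splitting procedure is not described in this section.

```python
# Toy illustration only. The fragments and token names below are hypothetical
# and do not come from the mCLM building-block vocabulary.
BLOCK_VOCAB = {
    "c1ccccc1": "<blk_phenyl>",
    "C(=O)N": "<blk_amide>",
    "CC": "<blk_ethyl>",
}


def atom_level_tokens(smiles):
    """Character-level split of a SMILES string (analogue of text characters)."""
    return list(smiles)


def block_level_tokens(blocks):
    """Map each pre-separated building block to a single vocabulary token."""
    return [BLOCK_VOCAB[block] for block in blocks]


# Assumption: the molecule has already been decomposed into blocks.
blocks = ["c1ccccc1", "C(=O)N", "CC"]

# In this toy case, joining the fragments gives the SMILES of N-ethylbenzamide.
smiles = "".join(blocks)

atom_tokens = atom_level_tokens(smiles)
block_tokens = block_level_tokens(blocks)

print(len(atom_tokens), "character-level tokens:", atom_tokens)
print(len(block_tokens), "block-level tokens:", block_tokens)
```

In this sketch the same molecule becomes 16 character-level tokens or 3 block-level tokens. Each block token stands for a whole functional unit, which is the analogy to (sub-)word tokens that the abstract makes.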

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

There is novelty in this work. Specifically, there is a clear architectural separation between domain-specific encoders and a unified fusion backbone, which promotes flexibility and domain transfer. The experimental results are also promising. This work outperforms strong baselines (MolX, ChemBERTa, GraphMVP) on multimodal reasoning tasks, particularly in low-data and cross-domain settings. Figure 4 is particularly useful as it shows module-wise attribution analyses for how modality-specific knowledge

Weaknesses

There is, however, limited novelty at the core LLM level. While modularization is effective, the language model itself is adapted rather than fundamentally redesigned for chemistry. There is also a lack of validation of the approach's practicality, for example on real-world sparse datasets. The evaluation focuses primarily on benchmark datasets, with minimal discussion of noisy experimental spectra or reaction data. Further, the computational cost of the work seems prohibitive. Training multiple moda

Reviewer 02 · Rating 2 · Confidence 5

Strengths

- Conceptually Innovative but Incremental in Execution: The idea of representing molecules through modular, synthesis-ready building blocks rather than atom-level encoding is conceptually novel and offers a creative bridge between digital design and physical synthesis. This modular approach reflects an original perspective on chemical language modeling. However, the implementation mainly extends existing ideas from reaction-aware and retrosynthesis-based models, making the innovation more increme

Weaknesses

- Limited Generalization and Chemical Creativity: The modular tokenization relies on a fixed library of known reaction building blocks and predefined synthesis rules. While this ensures synthetic feasibility, it severely restricts the model’s ability to explore novel chemical spaces or generate fundamentally new scaffolds beyond existing reaction types. Thus, the model’s creativity is constrained by human-curated chemistry knowledge.
- Lack of Experimental Validation and limited ablation and int

Reviewer 03 · Rating 6 · Confidence 5

Strengths

- The construction of an LLM framework that jointly considers synthesizability and functionality represents an important step toward practical and interpretable molecular generation.
- The integration of GNN representations with natural language embeddings for modular chemical reasoning is technically novel and well-motivated.
- The figures are clean, well-structured, and enhance the overall readability and understanding of the method.

Weaknesses

I would consider raising the score if the following weaknesses are resolved.

- **Comparison to fragment-aware baselines**: While the paper includes comparisons to recent general-purpose and domain-specific molecule LLMs, it omits fragment- or group-aware baselines such as SAFE [1], GROUPSELFIES [2], or Reasyn [3]. Even acknowledging that Reasyn is concurrent, such comparisons (especially against Transformer-based models with other representations, as mCLM itself employs a Transformer backbone) w

Code & Models

Models

🤗 language-plus-molecules/mCLM_1k-3b · model · 89 downloads · ♡ 1

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Chemical Synthesis and Analysis