SMolLM: Small Language Models Learn Small Molecular Grammar

Akhil Jindal; Harang Ju

arXiv:2605.06322·cs.LG·May 8, 2026

TL;DR

This paper introduces SMolLM, a 53K-parameter weight-shared transformer that generates novel SMILES strings with 95% validity on ZINC-250K, and uses it to study how language models learn chemical grammar.

Contribution

The paper presents a compact weight-shared transformer that outperforms a standard GPT with 10 times more parameters at molecular generation, and it traces mechanistically how the model resolves SMILES grammar constraints.

Findings

01

SMolLM achieves 95% validity on the ZINC-250K drug-like-molecule benchmark.

02

The same weight-shared block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last.

03

A systematic ablation across attention heads and passes localizes the first bracket-matching step to a single attention head.
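
The fixed order in finding 02 (brackets, then rings, then valence, per the abstract) can be mirrored by an error classifier that checks constraints in the same sequence and reports the first one violated. The sketch below is an illustration only, not the paper's classifier: the function name `classify_smiles_error` is hypothetical, "bracket" is assumed to cover both branch parentheses and square atom brackets, and the valence stage is left out because it would need a chemistry toolkit.

```python
DIGITS = "0123456789"


def classify_smiles_error(smiles: str) -> str:
    """Return the first violated constraint class: 'bracket', 'ring', or 'ok'.

    Hypothetical helper: checks syntax only and does not test valence.
    """
    # Stage 1: branch parentheses and atom brackets must nest and close.
    stack = []
    closers = {")": "(", "]": "["}
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return "bracket"
    if stack:
        return "bracket"

    # Stage 2: every ring-closure label must appear an even number of times.
    # Digits inside [...] are isotopes, H counts, or charges, so skip them.
    open_rings = set()
    in_atom = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_atom = True
        elif ch == "]":
            in_atom = False
        elif not in_atom:
            if ch == "%":  # two-digit ring label such as %12
                label = smiles[i + 1:i + 3]
                if len(label) != 2 or any(c not in DIGITS for c in label):
                    return "ring"
                open_rings ^= {int(label)}  # toggle open/closed
                i += 2
            elif ch in DIGITS:
                open_rings ^= {int(ch)}
        i += 1
    return "ring" if open_rings else "ok"
```

Because the bracket stage runs first, a string such as `C(C1CC` is reported as a bracket error even though its ring is also unclosed, which is the kind of ordering an error breakdown by class would need to make explicit.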

Abstract

Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last, as shown by error classification, linear probing, and sparse autoencoders. A systematic ablation across attention heads and passes further localizes the first bracket-matching step to a single attention head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.
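
To make the term "weight-shared" concrete, the following sketch counts parameters for a stack that reuses one block on every pass versus a stack with a separate block per pass. Everything here is an assumption for illustration: the block layout (four attention projections, a two-layer MLP, two LayerNorms), the function names, and the example sizes `d_model=32`, `d_ff=128`, `n_passes=4` are hypothetical and are not the paper's architecture or hyperparameters. Embedding and output layers are excluded.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one transformer block (hypothetical layout)."""
    attn = 4 * (d_model * d_model + d_model)  # Q, K, V, and output projections
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)  # two LayerNorms, each with scale and shift
    return attn + mlp + norms


def stack_params(d_model: int, d_ff: int, n_passes: int, shared: bool) -> int:
    """Parameters of the block stack; a shared stack stores one block only."""
    n_blocks = 1 if shared else n_passes
    return n_blocks * block_params(d_model, d_ff)


if __name__ == "__main__":
    # Illustrative sizes only; not taken from the paper.
    shared = stack_params(32, 128, n_passes=4, shared=True)
    unshared = stack_params(32, 128, n_passes=4, shared=False)
    print(f"shared: {shared}, unshared: {unshared}")
```

Under these assumptions the shared stack's parameter count is independent of the number of passes, so extra passes add computation but no extra weights.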
