SMolLM: Small Language Models Learn Small Molecular Grammar
Akhil Jindal, Harang Ju

TL;DR
This paper introduces SMolLM, a 53K-parameter weight-shared transformer that generates novel SMILES strings with 95% validity on ZINC-250K, and uses it to study how language models learn chemical grammar.
Contribution
The paper presents a compact transformer that outperforms a standard GPT with 10 times more parameters at molecular generation, and gives a mechanistic account of how the model resolves chemical grammar constraints across passes.
Findings
SMolLM achieves 95% validity on the ZINC-250K benchmark.
The model's single shared block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, valence last.
A single attention head handles the first bracket-matching step.
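To make the error categories concrete, here is a minimal sketch of a syntax-level classifier that sorts an invalid SMILES string into a "bracket" or "ring" error. It is an illustration, not the paper's error-classification code: the function name `classify_smiles_error` is hypothetical, "bracket" is assumed to cover both branch parentheses and square-bracket atoms, and valence errors are not detected because they need a chemistry toolkit.

```python
def classify_smiles_error(smiles: str) -> str:
    """Return 'bracket', 'ring', or 'ok'.

    Syntax-level only (hypothetical helper): valence is NOT checked.
    Bracket errors take precedence over ring errors.
    """
    pairs = {")": "(", "]": "["}
    stack = []          # open '(' and '[' still waiting for a close
    open_rings = set()  # ring-closure labels seen an odd number of times
    bad_ring = False    # malformed '%nn' ring label encountered
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            # A close with no matching open of the same kind is a bracket error.
            if not stack or stack.pop() != pairs[ch]:
                return "bracket"
        elif stack and stack[-1] == "[":
            # Inside a bracket atom, digits are isotope, H-count, or charge.
            pass
        elif ch == "%":
            # Two-digit ring-closure label, e.g. %12.
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and label.isdigit():
                open_rings ^= {int(label)}
                i += 2
            else:
                bad_ring = True
        elif ch.isdigit():
            # Single-digit ring-closure label: toggles open/closed.
            open_rings ^= {int(ch)}
        i += 1
    if stack:
        return "bracket"
    if bad_ring or open_rings:
        return "ring"
    return "ok"


if __name__ == "__main__":
    for s in ["c1ccccc1", "CC(C", "C1CCCC", "[13CH4]", "C1CC(C"]:
        print(s, classify_smiles_error(s))
```

A string with both an unclosed branch and an unclosed ring, such as `C1CC(C`, is reported as a bracket error, an assumed precedence that mirrors the fixed order described above.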
Abstract
Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed order: brackets first, rings second, and valence last, as shown by error classification, linear probing, and sparse autoencoders. A systematic ablation across attention heads and passes further localizes the first bracket-matching step to a single attention head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.
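Weight sharing is what keeps the parameter count small: the same block is applied on every pass, so adding passes adds computation but no parameters. The arithmetic below is a rough sketch under stated assumptions. The dimensions (`d_model=32`, `d_ff=128`, 4 passes) are hypothetical and are not SMolLM's configuration, the block layout (attention, MLP, two LayerNorms, all with biases) is a generic one, and embeddings and the output head are excluded.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one generic transformer block (assumed layout)."""
    attn = 4 * (d_model * d_model + d_model)                      # Q, K, V, output projections
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)    # two linear layers
    norms = 2 * 2 * d_model                                       # scale and shift, two LayerNorms
    return attn + mlp + norms


def stack_params(d_model: int, d_ff: int, n_passes: int, shared: bool) -> int:
    """Block parameters for n_passes applications, with or without weight sharing."""
    n_distinct_blocks = 1 if shared else n_passes
    return n_distinct_blocks * block_params(d_model, d_ff)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print("shared:  ", stack_params(32, 128, n_passes=4, shared=True))
    print("unshared:", stack_params(32, 128, n_passes=4, shared=False))
```

Under these assumed sizes, the shared model's block parameters stay constant as passes are added, while the unshared stack grows linearly with depth.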
