Molecular Representations for Large Language Models
Nicholas T. Runcie, Fergus Imrie, Charlotte M. Deane

TL;DR
This paper introduces MolJSON, a new molecular representation for LLMs, and reports that it outperforms traditional formats such as SMILES strings and IUPAC names in an evaluation spanning translation, shortest path, and constrained generation tasks.
Contribution
The study systematically compares MolJSON with five common chemical formats across four LLMs (GPT-5-nano, GPT-5-mini, GPT-5, and Claude Haiku 4.5), showing its effectiveness and robustness for LLM-based chemical reasoning.
Findings
MolJSON outperforms SMILES strings and IUPAC names in translation accuracy.
GPT-5 achieves 95.3% accuracy when generating MolJSON in constrained generation tasks.
MolJSON is more robust to atom count and ring complexity errors.
Abstract
Large Language Models (LLMs) are increasingly being used to support scientific discovery. In chemistry, tasks such as reaction prediction and structure elucidation require reasoning about the structures of molecules. As such, LLM-based systems for chemistry must interact reliably with molecular structures. Most previous studies of LLMs in chemistry have used SMILES strings or IUPAC names as molecular representations; however, the suitability of these formats has not been systematically assessed. In this work, we introduce MolJSON, a novel molecular representation for LLMs, and systematically compare it with five common chemical formats. We evaluated each representation with GPT-5-nano, GPT-5-mini, GPT-5, and Claude Haiku 4.5 using a set of 78,045 questions spanning translation, shortest path, and constrained generation reasoning tasks. We observed substantial variation across…
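To make the shortest path task concrete, the sketch below stores a molecule as an explicit JSON graph of atoms and bonds and counts the bonds between two atoms with breadth-first search. The schema (the `atoms`, `bonds`, `id`, `element`, `source`, `target`, and `order` keys) and the function name `shortest_path_length` are assumptions made for illustration; this summary does not give the MolJSON specification, and the real format may differ.

```python
import json
from collections import deque

# Hypothetical JSON graph for ethanol (SMILES: CCO), heavy atoms only.
# Illustrative schema; NOT the MolJSON specification.
ETHANOL = json.loads("""
{
  "atoms": [
    {"id": 0, "element": "C"},
    {"id": 1, "element": "C"},
    {"id": 2, "element": "O"}
  ],
  "bonds": [
    {"source": 0, "target": 1, "order": 1},
    {"source": 1, "target": 2, "order": 1}
  ]
}
""")


def shortest_path_length(mol, start, end):
    """Return the number of bonds on the shortest path, or None if unreachable."""
    # Build an undirected adjacency list from the bond records.
    neighbours = {atom["id"]: [] for atom in mol["atoms"]}
    for bond in mol["bonds"]:
        neighbours[bond["source"]].append(bond["target"])
        neighbours[bond["target"]].append(bond["source"])

    # Breadth-first search visits atoms in order of increasing bond distance.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == end:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(shortest_path_length(ETHANOL, 0, 2))  # 2
```

With explicit atom identifiers and bond records, connectivity can be read off directly, whereas a SMILES string encodes the same graph implicitly through character order, branches, and ring-closure digits.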
