Molecular Representations for Large Language Models

Nicholas T. Runcie; Fergus Imrie; Charlotte M. Deane

arXiv:2605.01822·cs.LG·May 5, 2026

TL;DR

This paper introduces MolJSON, a new molecular representation for LLMs, and reports that it outperforms traditional formats such as SMILES strings and IUPAC names on chemistry reasoning tasks.

Contribution

The study systematically compares MolJSON with five common chemical formats, using four LLMs and 78,045 questions, and reports that MolJSON is effective and robust for LLM-based chemical reasoning.

Findings

01. MolJSON outperforms SMILES strings and IUPAC names in translation accuracy.

02. GPT-5 achieves 95.3% accuracy when generating MolJSON in constrained generation tasks.

03. MolJSON is more robust to errors associated with atom count and ring complexity.

Abstract

Large Language Models (LLMs) are increasingly being used to support scientific discovery. In chemistry, tasks such as reaction prediction and structure elucidation require reasoning about the structures of molecules. As such, LLM-based systems for chemistry must interact reliably with molecular structures. Most previous studies of LLMs in chemistry have used SMILES strings or IUPAC names as molecular representations; however, the suitability of these formats has not been systematically assessed. In this work, we introduce MolJSON, a novel molecular representation for LLMs, and systematically compare it with five common chemical formats. We evaluated each representation with GPT-5-nano, GPT-5-mini, GPT-5, and Claude Haiku 4.5 using a set of 78,045 questions spanning translation, shortest path, and constrained generation reasoning tasks. We observed substantial variation across…
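
To make the shortest path task mentioned in the abstract concrete, here is a minimal sketch of how a reference answer could be computed when a molecule is written as an explicit list of atoms and bonds. The JSON layout, its field names (`atoms`, `bonds`, `source`, `target`, `order`), and the `shortest_path` helper are all illustrative assumptions; this page does not show the MolJSON schema or the paper's evaluation code, so none of this should be read as either.

```python
import json
from collections import deque

# Hypothetical explicit-graph encoding of ethanol (heavy atoms only).
# The field names are illustrative assumptions, not the MolJSON schema.
MOLECULE_JSON = """
{
  "atoms": [
    {"id": 0, "element": "C"},
    {"id": 1, "element": "C"},
    {"id": 2, "element": "O"}
  ],
  "bonds": [
    {"source": 0, "target": 1, "order": 1},
    {"source": 1, "target": 2, "order": 1}
  ]
}
"""


def shortest_path(molecule, start, goal):
    """Return the atom ids on a shortest bond path, or None if unreachable."""
    # Build an undirected adjacency list from the bond records.
    neighbours = {atom["id"]: [] for atom in molecule["atoms"]}
    for bond in molecule["bonds"]:
        neighbours[bond["source"]].append(bond["target"])
        neighbours[bond["target"]].append(bond["source"])

    # Breadth-first search: every bond counts as one step.
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk the predecessor links back to the start atom.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbours[current]:
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


molecule = json.loads(MOLECULE_JSON)
path = shortest_path(molecule, 0, 2)
print(path, len(path) - 1)  # atom ids on the path, then the number of bonds
```

With atoms and bonds listed explicitly, the graph can be read directly from the parsed JSON. A string format such as SMILES would first need a chemistry-aware parser to recover the same connectivity.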
