MoleCode unlocks structural intelligence in large language models
Zhiyuan Yan, Chen Liu, Boxuan Zhao, Kaiqing Lin, Jixiang Zhao, Yimi Wang, Liuzhenghao Lv, Hao Li, Shanzhuo Zhang, Li Yuan, Fanyang Mo

TL;DR
MoleCode introduces a graph-explicit molecular language for LLMs that lets them reason directly on molecular structure, improves performance on chemical tasks, and extends to polymers, Markush structures, and scientific documents.
Contribution
This work presents MoleCode, a novel LLM-native, training-free molecular representation that makes structural information explicit and accessible within language models.
Findings
Enhanced reasoning on unfamiliar molecules and topology-sensitive operations.
Improved molecular editing and optimization with property alignment.
Extended applicability to polymers, Markush structures, and scientific documents.
Abstract
Molecules are graphs, but large language models (LLMs) are usually asked to reason about them through linear strings. The most popular molecular representation, SMILES, compresses atoms, bonds, branches and rings into a compact sequence in which topology is implicit, forcing LLMs to reconstruct molecular structure before performing the requested chemical operation. Here we introduce MoleCode, an LLM-native, training-free, graph-explicit molecular language in which all molecular components are represented as typed entities with persistent identifiers and explicit relations. MoleCode makes molecular topology directly readable, editable and auditable within the language context, allowing an LLM to operate on structure rather than recover it from syntax. Across molecular reasoning, editing, generation and analysis tasks, this representational shift improves frontier LLMs most strongly when…
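To make the contrast with SMILES concrete, the sketch below shows one way "typed entities with persistent identifiers and explicit relations" could look for ethanol (SMILES `CCO`). It is an illustration only: the abstract does not give the actual MoleCode syntax, so the `Atom`, `Bond` and `Molecule` classes, the `a1`/`b1` identifier scheme, the `neighbors`, `substitute` and `render` helpers, and the rendered text format are all hypothetical. Hydrogens are left implicit for brevity.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Atom:
    id: str        # persistent identifier (hypothetical scheme: a1, a2, ...)
    element: str


@dataclass(frozen=True)
class Bond:
    id: str        # persistent identifier (hypothetical scheme: b1, b2, ...)
    a: str         # id of the first atom
    b: str         # id of the second atom
    order: str     # e.g. "single", "double"


@dataclass(frozen=True)
class Molecule:
    atoms: tuple
    bonds: tuple


def neighbors(mol, atom_id):
    """Read topology directly from the explicit bond list."""
    found = []
    for bond in mol.bonds:
        if bond.a == atom_id:
            found.append(bond.b)
        elif bond.b == atom_id:
            found.append(bond.a)
    return sorted(found)


def substitute(mol, atom_id, element):
    """Swap one atom's element; every identifier and bond is preserved."""
    atoms = tuple(
        replace(atom, element=element) if atom.id == atom_id else atom
        for atom in mol.atoms
    )
    return Molecule(atoms, mol.bonds)


def render(mol):
    """Serialize to a line-per-entity text form (illustrative, not MoleCode)."""
    lines = [f"atom {a.id} {a.element}" for a in mol.atoms]
    lines += [f"bond {b.id} {b.a} {b.b} {b.order}" for b in mol.bonds]
    return "\n".join(lines)


# Ethanol heavy atoms: C-C-O. In SMILES ("CCO") the bonds are implied by
# adjacency in the string; here each bond is a named, explicit relation.
ethanol = Molecule(
    atoms=(Atom("a1", "C"), Atom("a2", "C"), Atom("a3", "O")),
    bonds=(Bond("b1", "a1", "a2", "single"), Bond("b2", "a2", "a3", "single")),
)

# Edit by identifier: replace the oxygen with nitrogen (heavy atoms of ethylamine).
ethylamine = substitute(ethanol, "a3", "N")

print(render(ethanol))
print(neighbors(ethanol, "a2"))
print(render(ethylamine))
```

Because the edit targets `a3` by identifier, the bond `b2` still refers to the same atom afterwards, which is the kind of auditability the abstract attributes to persistent identifiers. A SMILES-based edit would instead require rewriting the string and re-deriving which characters correspond to which atoms.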
