A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai; Guijuan He; Yi Hu; Jingjing Wang; Joshua Luo; Tianyu Zhu; Srikanth Pilla; Gang Li; Ling Liu; Feng Luo

arXiv:2602.02320 · cs.CL · May 11, 2026

TL;DR

This paper introduces an automated framework for creating a large-scale dataset of molecular structure descriptions aligned with natural language, facilitating chemical reasoning with language models.

Contribution

It presents a rule-regularized annotation method: structural metadata produced by a rule-based IUPAC name parser guides LLMs in generating high-quality, structure-preserving molecular descriptions at scale, enabling better molecule-language alignment.

Findings

01

Achieved 98.6% description precision on a validation subset.

02

Curated a dataset of approximately 163,000 molecule-description pairs.

03

Demonstrated the framework's utility for chemical tasks relying on structural descriptions.

Abstract

Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular descriptions that preserve complete structural details at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structural XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of…
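The flow outlined in the abstract (parsed IUPAC name, then structural XML metadata, then an LLM prompt) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the element and attribute names (`molecule`, `parent`, `substituent`, `locant`, `group`), the helper functions, and the prompt wording are hypothetical and are not taken from the paper or its repository. The hand-written input stands in for the output of a nomenclature parser, and no LLM is actually called.

```python
import xml.etree.ElementTree as ET


def build_structure_xml(name, parent, substituents):
    """Encode a parsed IUPAC name as XML (hypothetical schema, not the paper's)."""
    root = ET.Element("molecule", {"name": name})
    ET.SubElement(root, "parent", {"name": parent})
    # One element per substituent, keeping its position (locant) explicit.
    for locant, group in substituents:
        ET.SubElement(root, "substituent", {"locant": str(locant), "group": group})
    return ET.tostring(root, encoding="unicode")


def build_prompt(structure_xml):
    """Wrap the XML in an instruction for an LLM (illustrative wording only)."""
    return (
        "Describe the molecule below in natural language. "
        "Mention every substituent and its position.\n\n" + structure_xml
    )


# Hand-written stand-in for parser output: 2-chloroethanol is ethanol
# carrying a chloro substituent at position 2.
xml_text = build_structure_xml("2-chloroethanol", "ethanol", [(2, "chloro")])
prompt = build_prompt(xml_text)
print(prompt)
```

The point of the intermediate XML is that positions and groups are stated explicitly, so the language model only has to verbalize structure it is given rather than infer it from the name.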

Code & Models

Repositories

TheLuoFengLab/MolLangData
github

Datasets

ChemFM/MolLangData
dataset · 58 dl