A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method
Feiyang Cai, Guijuan He, Yi Hu, Jingjing Wang, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo

TL;DR
This paper introduces an automated framework for creating a large-scale dataset of molecular structure descriptions aligned with natural language, facilitating chemical reasoning with language models.
Contribution
It presents a rule-based annotation method that generates high-quality, structure-preserving molecular descriptions at scale, enabling better molecule-language alignment.
Findings
Achieved 98.6% description precision on a validation subset.
Curated a dataset of approximately 163,000 molecule-description pairs.
Demonstrated the framework's utility for chemical tasks relying on structural descriptions.
Abstract
Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular descriptions that preserve complete structural details at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structural XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of…
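As a rough illustration of the metadata-to-prompt step described in the abstract, the sketch below turns structural XML into a prompt that constrains what a language model may describe. It is not the authors' implementation: the XML schema (`molecule`, `group`, `type`, `value`, `locant`), the example molecule, the helper names, and the prompt wording are all assumptions made for illustration, and the parser that would produce such metadata from an IUPAC name is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata for one molecule. The element and attribute names
# are illustrative assumptions, not the schema used in the paper.
EXAMPLE_XML = """
<molecule name="2-chloroethanol">
  <group type="chain" value="ethane" atoms="2"/>
  <group type="substituent" value="chloro" locant="2"/>
  <group type="suffix" value="ol" locant="1"/>
</molecule>
"""


def summarize_groups(xml_text):
    """Return the molecule name and one short fact string per <group>."""
    root = ET.fromstring(xml_text)
    facts = []
    for group in root.iter("group"):
        locant = group.get("locant")
        # Only groups with a locant get a position clause.
        where = f" at position {locant}" if locant else ""
        facts.append(f"{group.get('type')} '{group.get('value')}'{where}")
    return root.get("name"), facts


def build_prompt(xml_text):
    """Assemble a prompt that lists every structural fact explicitly."""
    name, facts = summarize_groups(xml_text)
    lines = [
        f"Describe the structure of {name} in plain English.",
        "Mention every structural fact listed below and add none:",
    ]
    lines += [f"- {fact}" for fact in facts]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(EXAMPLE_XML))
```

Listing the facts one per line is a design choice made here so that a generated description can later be checked against each fact; the paper may structure its prompts differently.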
