Improving Chemical Understanding of LLMs via SMILES Parsing
Yunhui Jang, Jaehyung Kim, Sungsoo Ahn

TL;DR
This paper introduces CLEANMOL, a framework that improves LLMs' understanding of molecular structures by reformulating SMILES parsing into structured, deterministic tasks, leading to better molecular comprehension.
Contribution
CLEANMOL reformulates SMILES parsing as a suite of structured, deterministic tasks and pre-trains open-source LLMs on them to improve their molecular understanding.
Findings
CLEANMOL improves LLMs' structural comprehension of molecules.
Pre-training on these structured tasks improves performance on molecular benchmarks.
CLEANMOL achieves state-of-the-art results on Mol-Instructions.
Abstract
Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, even failing to carry out basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks span from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our…
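To make "clean and deterministic" concrete, here is a minimal sketch of how a ring-counting label could be derived directly from a SMILES string. It is an illustration only, not the authors' pipeline: the function name `count_rings` is hypothetical, the input is assumed to be a valid SMILES string, and the count it returns is the number of ring-closure bonds, which equals the number of independent cycles in the molecular graph.

```python
def count_rings(smiles: str) -> int:
    """Count ring-closure bonds in a (assumed valid) SMILES string.

    Each ring-closure label (a digit, or %nn for two-digit labels) appears
    twice, once where the ring opens and once where it closes. Every such
    pair adds one bond beyond the spanning tree of the molecule, so the
    number of pairs equals the number of independent cycles.
    """
    closures = 0
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True       # digits inside [...] are isotopes, H counts, charges
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            if ch == "%":
                closures += 1       # two-digit ring label such as %10
                i += 2              # skip the two digits after '%'
            elif ch.isdigit():
                closures += 1       # single-digit ring label
        i += 1
    return closures // 2


if __name__ == "__main__":
    for name, smi in [("ethanol", "CCO"),
                      ("benzene", "c1ccccc1"),
                      ("naphthalene", "c1ccc2ccccc2c1")]:
        print(name, count_rings(smi))
```

Because the answer follows mechanically from the string, a question such as "how many rings does this molecule have?" has exactly one correct label, which is what makes it usable as noise-free supervision. A production pipeline would more likely rely on a cheminformatics toolkit to parse and validate the SMILES first.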
Taxonomy
Topics: Advanced Graph Neural Networks · Machine Learning in Materials Science · Computational Drug Discovery Methods
