Improving Chemical Understanding of LLMs via SMILES Parsing

Yunhui Jang; Jaehyung Kim; Sungsoo Ahn

arXiv:2505.16340·cs.LG·May 23, 2025

Open Access

TL;DR

This paper introduces CLEANMOL, a framework that improves LLMs' understanding of molecular structures by reformulating SMILES parsing as a suite of structured, deterministic tasks and pre-training on them.

Contribution

CLEANMOL reformulates SMILES parsing as structured tasks and pre-trains LLMs on them, significantly enhancing their molecular understanding capabilities.

Findings

01. CLEANMOL improves LLMs' structural comprehension of molecules.

02. Pretraining on structured tasks enhances performance on molecular benchmarks.

03. CLEANMOL achieves state-of-the-art results on Mol-Instructions.

Abstract

Large language models (LLMs) are increasingly recognized as powerful tools for scientific discovery, particularly in molecular science. A fundamental requirement for these models is the ability to accurately understand molecular structures, commonly encoded in the SMILES representation. However, current LLMs struggle to interpret SMILES, even failing to carry out basic tasks such as counting molecular rings. To address this limitation, we introduce CLEANMOL, a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks explicitly designed to promote graph-level molecular comprehension. These tasks span from subgraph matching to global graph matching, providing structured supervision aligned with molecular structural properties. We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks. Our…
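To make the idea of a "clean and deterministic" SMILES parsing task concrete, the ring-counting example mentioned in the abstract can be sketched in plain Python. This is an illustrative sketch only, not the paper's pipeline: the function names, the prompt wording, and the choice to count ring-closure labels directly (rather than use a cheminformatics toolkit) are all assumptions made here. For a valid SMILES string, the number of matched ring-closure label pairs equals the number of independent cycles in the molecular graph, so the answer is fully determined by the input.

```python
def count_ring_closures(smiles: str) -> int:
    """Count matched ring-closure labels in a SMILES string.

    Assumption: the input is a syntactically valid SMILES. Each matched
    pair of ring-closure labels adds one independent cycle to the graph.
    """
    digits = "0123456789"
    open_labels = set()
    rings = 0
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside a bracket atom are isotopes, H counts or
            # charges, never ring closures, so skip the whole atom.
            close = smiles.find("]", i)
            if close == -1:
                raise ValueError("unclosed bracket atom")
            i = close + 1
            continue
        if ch == "%":
            # Two-digit ring-closure label, written as %nn.
            text = smiles[i + 1 : i + 3]
            if len(text) != 2 or any(c not in digits for c in text):
                raise ValueError("malformed %nn ring-closure label")
            label = int(text)
            i += 3
        elif ch in digits:
            label = int(ch)
            i += 1
        else:
            i += 1
            continue
        if label in open_labels:
            # Second occurrence closes the ring opened earlier.
            open_labels.remove(label)
            rings += 1
        else:
            open_labels.add(label)
    if open_labels:
        raise ValueError("unmatched ring-closure label")
    return rings


def make_ring_count_example(smiles: str) -> dict:
    """Build a hypothetical question/answer pair for a ring-counting task."""
    return {
        "question": f"How many rings are in the molecule {smiles}?",
        "answer": str(count_ring_closures(smiles)),
    }


if __name__ == "__main__":
    for s in ["CCO", "c1ccccc1", "c1ccc2ccccc2c1"]:
        print(make_ring_count_example(s))
```

Because the label is computed from the string itself, every generated example has exactly one correct answer, which is the property the abstract emphasizes. Subgraph-level and graph-level matching tasks would need a full graph parser and are not covered by this sketch.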

Taxonomy

Topics: Advanced Graph Neural Networks · Machine Learning in Materials Science · Computational Drug Discovery Methods