From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

Zehao Li; Yasuhiro Yoshikai; Shumpei Nemoto; Hiroyuki Kusuhara; Tadahaya Mizuno

arXiv:2605.09949 · cs.LG · May 12, 2026

TL;DR

This study investigates how chemical language models learn chiral information from SMILES strings, revealing a complex, encoder-centered process involving transient destabilization and reorganization during training.

Contribution

The paper introduces Pan-CORE models for SMILES translation and provides mechanistic insights into the emergence of chiral semantics in chemical language models.

Findings

01. Chiral-token accuracy improves abruptly after a long plateau.

02. Chiral representations undergo transient destabilization and reorganization.

03. A small set of chiral-sensitive attention heads influences chiral accuracy.

Abstract

Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning…
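To make the notion of chiral-token accuracy concrete, the sketch below shows one way such a metric could be computed. It is an illustration only, not the paper's evaluation code: the regex tokenizer, the rule that a chiral token is a bracket atom containing `@`, and the position-wise comparison (which assumes the prediction is aligned with the target, as under teacher forcing) are all assumptions, and the function names are hypothetical.

```python
import re

# Assumed atom-level SMILES tokenizer: bracket atoms such as "[C@@H]" stay one token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~*$@]")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)


def is_chiral(token):
    """Assumption: a chiral token is a bracket atom carrying '@' or '@@'."""
    return token.startswith("[") and "@" in token


def chiral_token_accuracy(target, predicted):
    """Fraction of chiral target tokens reproduced at the same position.

    Returns None when the target contains no chiral tokens.
    """
    t, p = tokenize(target), tokenize(predicted)
    positions = [i for i, tok in enumerate(t) if is_chiral(tok)]
    if not positions:
        return None
    hits = sum(1 for i in positions if i < len(p) and p[i] == t[i])
    return hits / len(positions)


# L-alanine vs. D-alanine: the strings differ only in "@" vs. "@@".
l_alanine = "C[C@H](N)C(=O)O"
d_alanine = "C[C@@H](N)C(=O)O"
print(chiral_token_accuracy(l_alanine, l_alanine))
print(chiral_token_accuracy(l_alanine, d_alanine))
```

The alanine pair shows why chirality is a demanding test: the two enantiomers differ by a single character in one token, so a model can score almost perfectly on overall token accuracy while getting every stereocenter wrong.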
