From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
Zehao Li, Yasuhiro Yoshikai, Shumpei Nemoto, Hiroyuki Kusuhara, Tadahaya Mizuno

TL;DR
This study investigates how chemical language models learn chiral information from SMILES strings, revealing a complex, encoder-centered process involving transient destabilization and reorganization during training.
Contribution
The paper introduces Pan-CORE models for SMILES translation and provides mechanistic insights into the emergence of chiral semantics in chemical language models.
Findings
Chiral-token accuracy improves abruptly after a long plateau.
Chiral representations undergo transient destabilization and reorganization.
A small set of chiral-sensitive attention heads influences chiral accuracy.
Abstract
Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning…
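To make the metric named in the abstract concrete, here is a minimal sketch of how a chiral-token accuracy could be computed for one target/prediction pair. It is an illustration under stated assumptions, not the authors' evaluation code: the regex tokenizer, the function names `tokenize` and `chiral_token_accuracy`, and the position-aligned comparison (as it would arise under teacher forcing) are all hypothetical choices. In SMILES, `@` and `@@` mark the two possible configurations of a tetrahedral stereocenter, so the sketch scores only those positions.

```python
import re

# Simplified SMILES tokenizer (an assumption, not the paper's tokenizer).
# Bracket atoms are split so the chirality marks '@' and '@@' become
# standalone tokens. '@@' must be tried before '@'.
TOKEN_RE = re.compile(r"@@|@|Cl|Br|%\d{2}|[A-Za-z]|\d|[\[\]()=#+\-/\\.:*]")

CHIRAL_TOKENS = {"@", "@@"}


def tokenize(smiles):
    """Split a SMILES string into a flat list of tokens."""
    return TOKEN_RE.findall(smiles)


def chiral_token_accuracy(target, predicted):
    """Fraction of chiral-token positions in `target` that `predicted` matches.

    Tokens are compared position by position, which assumes the two
    sequences are aligned. Returns None when the target has no chiral tokens.
    """
    t, p = tokenize(target), tokenize(predicted)
    chiral_positions = [i for i, tok in enumerate(t) if tok in CHIRAL_TOKENS]
    if not chiral_positions:
        return None
    correct = sum(1 for i in chiral_positions if i < len(p) and p[i] == t[i])
    return correct / len(chiral_positions)


# The two alanine enantiomers differ only in the chirality mark.
l_alanine = "N[C@@H](C)C(=O)O"
d_alanine = "N[C@H](C)C(=O)O"
print(chiral_token_accuracy(l_alanine, l_alanine))
print(chiral_token_accuracy(l_alanine, d_alanine))
```

Because the two enantiomers share every token except the chirality mark, a model can score near-perfect overall token accuracy while getting the chiral position wrong, which is why a separate chiral-token metric is informative.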
