Notation-level confounding: When inconsistent molecular notations mislead chemical language models
Yosuke Kikuchi, Yasuhiro Yoshikai, Shumpei Nemoto, Ayako Furuhama, Takashi Yamada, Hiroyuki Kusuhara, Tadahaya Mizuno

TL;DR
This paper shows that inconsistent molecular notations, especially the differing "canonical" SMILES strings produced by different toolkits, can mislead chemical language models (CLMs) by distorting their latent representations and spuriously inflating performance metrics. It argues that preprocessing must be reported transparently.
Contribution
The paper introduces the concept of notation-level confounding in CLMs and demonstrates its impact through a survey of 264 CLM papers in PubMed and a molecular translation framework, underscoring the need for standardized reporting of molecular notation.
Findings
About half of the 264 surveyed CLM papers did not specify their canonicalization procedure.
Inconsistent notations can distort latent representations in CLMs.
In some benchmarks, notation-level confounding can produce misleading performance gains.
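The last finding can be illustrated with a toy sketch. Everything here is hypothetical: the six molecules, their labels, and the assumed mechanism (positives written with lowercase aromatic atoms by one source, negatives written in Kekulé form by another) are invented for illustration and are not taken from the paper's benchmarks. The point is only that a "model" reading a surface feature of the string, with no chemistry at all, can score perfectly when notation correlates with the label.

```python
def uses_aromatic_lowercase(smiles):
    """Surface feature: does the string contain lowercase aromatic atom symbols?
    Crude on purpose; adequate only for the toy strings below."""
    return any(ch in "cnos" for ch in smiles)


def accuracy(dataset):
    """Accuracy of a 'classifier' that predicts 1 iff the surface feature is present."""
    hits = sum(int(uses_aromatic_lowercase(smi)) == label for smi, label in dataset)
    return hits / len(dataset)


# Hypothetical mixed-notation dataset: notation style coincides with the label.
mixed = [
    ("c1ccccc1", 1),      # benzene, aromatic form
    ("Cc1ccccc1", 1),     # toluene, aromatic form
    ("Oc1ccccc1", 1),     # phenol, aromatic form
    ("C1=CC=NC=C1", 0),   # pyridine, Kekulé form
    ("NC1=CC=CC=C1", 0),  # aniline, Kekulé form
    ("C1=COC=C1", 0),     # furan, Kekulé form
]

# Same molecules and labels, all rewritten by hand in one (aromatic) convention.
unified = [
    ("c1ccccc1", 1),
    ("Cc1ccccc1", 1),
    ("Oc1ccccc1", 1),
    ("c1ccncc1", 0),
    ("Nc1ccccc1", 0),
    ("c1ccoc1", 0),
]

print("mixed notation:  ", accuracy(mixed))
print("unified notation:", accuracy(unified))
```

Once every string follows a single convention, the surface feature is constant and the shortcut stops working, which is why documenting the canonicalization procedure matters.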
Abstract
Chemical language models (CLMs) are increasingly used for molecular design and property prediction. Because these models learn from textual encodings of molecules, differences in how such encodings are generated may affect their behavior. In cheminformatics, the term canonical SMILES implies a single standardized notation, yet different toolkits define distinct canonicalization rules, yielding multiple canonical strings for the same molecule. To examine how this variability arises and why it matters, we surveyed 264 CLM papers in PubMed and found that about half did not specify their canonicalization procedure, limiting transparency and reproducibility. Using a molecular translation framework, we show that when multiple valid notations are mixed or left undocumented, inconsistent notations distort latent representations and, in some benchmarks, can spuriously inflate predictive…
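To make concrete why a text-based model is sensitive to notation, here is a minimal sketch that tokenizes two valid SMILES strings for the same molecule. The regex tokenizer is a simplified version of a pattern commonly used for SMILES and is an assumption; it is not the tokenizer used in the paper, and it omits some SMILES symbols.

```python
import re

# Simplified SMILES tokenizer (assumed for illustration): bracket atoms,
# two-letter halogens, organic-subset atoms, ring closures, bonds, branches.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-\+\(\)/\\\.@])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


# Two valid notations for benzene: aromatic form and Kekulé form.
aromatic = tokenize("c1ccccc1")
kekule = tokenize("C1=CC=CC=C1")

print(aromatic)
print(kekule)
print("identical token sequences:", aromatic == kekule)
```

The two strings describe one molecule, yet the model receives different token sequences of different lengths, so any undocumented mixing of conventions changes what the model actually sees.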
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Biomedical Text Mining and Ontologies
Methods: Feature Selection
