Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties
Asu Büşra Temizer, Gökçe Uludoğan, Rıza Özçelik, Taha Koulani, Elif Ozkirimli, Kutlu O. Ulgen, Nilgün Karalı, Arzucan Özgür

TL;DR
This paper investigates chemical word segmentation methods for molecules to identify key substructures linked to protein-ligand binding, enhancing understanding of chemical features relevant to drug discovery.
Contribution
It introduces a pipeline to analyze chemical vocabularies generated by subword tokenization algorithms and identifies target-specific key chemical words associated with binding affinity.
Findings
Key chemical words are specific to protein targets.
Identified chemical words correspond to known pharmacophores.
Analysis reveals the chemical significance of subword units in binding.
Abstract
Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words (analogous to the words that make up sentences in human languages) and then apply advanced natural language processing techniques for tasks such as drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. This study aims to investigate the chemical vocabularies generated by popular subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece, and Unigram, and identify key chemical words associated with…
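To make the idea of segmenting a SMILES string into chemical words concrete, here is a minimal sketch of Byte Pair Encoding in plain Python. It is a toy illustration, not the pipeline used in the paper: the function names (`learn_bpe`, `apply_merge`, `segment`) are hypothetical, the corpus is invented, and the starting alphabet is single characters, which is a simplification because it splits two-letter atoms such as Cl and Br.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(smiles_list, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of SMILES strings."""
    # Assumption: start from single characters rather than atom-level tokens.
    corpus = [list(s) for s in smiles_list]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair and record the rule.
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(smiles, merges):
    """Segment a SMILES string into chemical words by replaying learned merges in order."""
    tokens = list(smiles)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Invented toy corpus; a real vocabulary would be learned from a large compound set.
    corpus = ["CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Nc1ccc(O)cc1", "c1ccccc1O"]
    merges = learn_bpe(corpus, num_merges=8)
    print(segment("CC(=O)Oc1ccccc1", merges))
```

The learned merge rules define the vocabulary, and each merged token is a candidate chemical word. WordPiece and Unigram build their vocabularies with different criteria, so the same molecule can be segmented differently by each algorithm.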
Taxonomy
Topics: Computational Drug Discovery Methods · Biomedical Text Mining and Ontologies · Chemical Synthesis and Analysis
Methods: WordPiece
