Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm
Sarwan Ali, Haris Mansoor, Prakash Chourasia, Imdad Ullah Khan, Murray Patterson

TL;DR
This paper introduces a kernel-based method combining RDKFingerprint, Sinkhorn-Knopp algorithm, and kernel PCA to analyze SMILES strings for molecular structure classification and regression, showing improved performance over baselines.
Contribution
It proposes a novel kernel matrix computation approach for SMILES strings using RDKFingerprint and Sinkhorn-Knopp, enhancing molecular analysis accuracy.
Findings
Outperforms baseline methods in drug subcategory classification.
Achieves better solubility and partition coefficient regression results.
Demonstrates effectiveness of kernel PCA in molecular data analysis.
Abstract
In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. The kernel matrix is computed by converting the SMILES strings into molecular structures using the Morgan Fingerprint, which computes a fingerprint for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is used to compute the final…
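The normalization step named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the bit vectors in `toy_fps` are hypothetical stand-ins for real molecular fingerprints, the Tanimoto kernel is an assumed choice (the abstract only says "pairwise kernels function"), and the kernel PCA step is left out. Sinkhorn-Knopp alternately rescales rows and columns of a non-negative matrix until both sum to one.

```python
from typing import List

Matrix = List[List[float]]


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto similarity between two binary fingerprints (assumed kernel choice)."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


def kernel_matrix(fps: List[List[int]]) -> Matrix:
    """Pairwise kernel (Gram) matrix over a list of fingerprints."""
    return [[tanimoto(a, b) for b in fps] for a in fps]


def sinkhorn_knopp(K: Matrix, n_iter: int = 500, tol: float = 1e-9) -> Matrix:
    """Alternate row and column normalization until the matrix is
    (approximately) doubly stochastic. Assumes K has positive entries."""
    n = len(K)
    M = [row[:] for row in K]
    for _ in range(n_iter):
        # Scale each row to sum to 1.
        M = [[v / sum(row) for v in row] for row in M]
        # Scale each column to sum to 1.
        col_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
        M = [[M[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        # Columns are now exact; stop once rows are also within tolerance.
        if max(abs(sum(row) - 1.0) for row in M) < tol:
            break
    return M


# Hypothetical 8-bit fingerprints; real ones would come from a cheminformatics
# toolkit and typically have hundreds or thousands of bits.
toy_fps = [
    [1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 1],
]

K = kernel_matrix(toy_fps)
P = sinkhorn_knopp(K)
```

In a full pipeline, the normalized matrix `P` would be passed to a kernel PCA routine that accepts a precomputed kernel, and the resulting low-dimensional embeddings would feed the classifier or regressor.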
Taxonomy
Topics: Data Mining Algorithms and Applications · Algorithms and Data Compression
