A novel methodology on distributed representations of proteins using their interacting ligands
Hakime Öztürk, Elif Ozkirimli, Arzucan Özgür

TL;DR
This paper introduces SMILESVec, a ligand-based protein representation method using SMILES strings, demonstrating comparable performance to traditional sequence-based methods in protein clustering, with potential applications in bioinformatics tasks.
Contribution
The study presents a novel ligand-based approach to protein representation using SMILES strings, offering an alternative to sequence- or structure-based methods.
Findings
Ligand-based protein representation performs as well as sequence-based methods in clustering.
SMILESVec effectively captures protein functional properties from ligand information.
Ligand-based methods can be applied to various bioinformatics problems.
Abstract
The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins, suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a SMILES-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined using the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated on a protein clustering task using the TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, BLAST and ProtVec, and two compound…
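The idea of describing a protein through the word embeddings of its ligands' SMILES strings can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the 8-character overlapping "words", the averaging of word vectors into ligand and protein vectors, the cosine similarity, and all function names. `toy_embedding` is a hash-based placeholder standing in for a trained embedding model, so the resulting similarity values carry no chemical meaning.

```python
import hashlib
import math


def smiles_words(smiles, k=8):
    """Split a SMILES string into overlapping k-character 'words' (assumed scheme)."""
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def toy_embedding(word, dim=16):
    """Deterministic placeholder vector; a real pipeline would look up a trained embedding."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()  # 32 bytes, so dim <= 32
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ligand_vector(smiles, k=8, dim=16):
    """Assumed ligand representation: mean of the vectors of its SMILES words."""
    return mean_vector([toy_embedding(w, dim) for w in smiles_words(smiles, k)])


def protein_vector(ligand_smiles, k=8, dim=16):
    """Assumed protein representation: mean of the vectors of its interacting ligands."""
    return mean_vector([ligand_vector(s, k, dim) for s in ligand_smiles])


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical ligand sets: arbitrary valid SMILES strings, not real binding data.
protein_a = protein_vector(["CC(=O)OC1=CC=CC=C1C(=O)O", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"])
protein_b = protein_vector(["CC(=O)OC1=CC=CC=C1C(=O)O", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"])
similarity = cosine_similarity(protein_a, protein_b)
```

A pairwise similarity score of this kind is the sort of input a clustering algorithm such as TransClust or MCL would consume.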
