A PLMs based protein retrieval framework
Yuxuan Wu, Xiao Yi, Yang Tan, Huiqun Yu, Guisheng Fan, Gaowei Zheng

TL;DR
This paper introduces a novel protein retrieval framework leveraging protein language models and vector indexing to improve retrieval of both similar and dissimilar proteins, surpassing traditional sequence similarity methods.
Contribution
The proposed framework mitigates sequence similarity bias by embedding proteins in a high-dimensional space using PLMs and employing an accelerated vector database for efficient retrieval.
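The embed-then-index idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: `toy_embed` is a hypothetical stand-in for a PLM encoder (it only hashes k-mer counts), and `BruteForceIndex` is a hypothetical exact-search placeholder for the accelerated vector database.

```python
import math
import zlib

DIM = 64  # toy embedding size; real PLM embeddings are much larger


def toy_embed(sequence: str, k: int = 3) -> list[float]:
    """Hypothetical placeholder for a PLM encoder.

    ASSUMPTION: a real pipeline would return pooled hidden states from a
    protein language model; this only hashes k-mer counts into DIM buckets.
    """
    vec = [0.0] * DIM
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(kmer.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalize


class BruteForceIndex:
    """Hypothetical stand-in for a vector database (exact search, O(n) per query)."""

    def __init__(self) -> None:
        self.ids: list[str] = []
        self.vectors: list[list[float]] = []

    def add(self, protein_id: str, sequence: str) -> None:
        self.ids.append(protein_id)
        self.vectors.append(toy_embed(sequence))

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (pid, sum(a * b for a, b in zip(q, v)))
            for pid, v in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


# Made-up sequences, used only to exercise the sketch.
index = BruteForceIndex()
index.add("protA", "MKTAYIAKQRQISFVKSHFSRQ")
index.add("protB", "MKTAYIAKQRQISFVKSHFSRL")
index.add("protC", "GSSGSSGWWPPLLNNDDEEQQH")
hits = index.search("MKTAYIAKQRQISFVKSHFSRQ", top_k=2)
```

Ranking happens in embedding space, not by alignment score, which is what would let a learned encoder surface proteins that alignment-based tools miss. The toy encoder above still tracks sequence overlap, so it shows only the mechanics, not that benefit.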
Findings
Effectively retrieves both similar and dissimilar proteins.
Identifies proteins overlooked by traditional methods.
Enhances protein mining and biological research.
Abstract
Protein retrieval, which aims to untangle the relationships among sequences, structures, and functions, drives progress in biology. The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based algorithm, has demonstrated how effective such retrieval can be. However, existing protein retrieval tools prioritize sequence similarity and may overlook proteins that are dissimilar in sequence but share homology or function. To tackle this problem, we propose a novel protein retrieval framework that mitigates the bias towards sequence similarity. Our framework harnesses protein language models (PLMs) to embed protein sequences in a high-dimensional feature space, thereby enhancing the representation capacity for subsequent analysis. An accelerated indexed vector database is then constructed to facilitate expedited access and…
Taxonomy
Topics: Gene expression and cancer classification
