HSEARCH: fast and accurate protein sequence motif search and clustering
Haifeng Chen, Ting Chen

TL;DR
HSEARCH finds and clusters protein sequence motifs by converting fixed-length sequences into points in a high-dimensional space and applying locality-sensitive hashing. It is significantly faster than brute-force search for motif finding and achieves high accuracy for motif clustering.
Contribution
It introduces a new method combining high-dimensional data representation and locality-sensitive hashing for fast, accurate protein motif search and clustering.
Findings
HSEARCH is significantly faster than the brute-force algorithm for protein motif finding.
It achieves high accuracy in protein motif clustering.
The method scales well with large datasets.
Abstract
Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching for and clustering protein sequence motifs is computationally intensive. Most existing methods are either not fast enough to analyze large data sets for motif finding or achieve low accuracy for motif clustering. We present a new protein sequence motif finding and clustering algorithm, called HSEARCH. It converts fixed-length protein sequences to data points in a high-dimensional space, and applies locality-sensitive hashing to quickly search for protein sequences homologous to a motif. HSEARCH is significantly faster than the brute-force algorithm for protein motif finding and achieves high accuracy for protein motif clustering.
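The two steps named in the abstract (mapping fixed-length sequences to points, then hashing those points so that similar sequences tend to collide) can be illustrated with a minimal sketch. This is not the authors' implementation: the one-hot encoding, the random-hyperplane hash family, and every name and parameter below (`encode`, `HyperplaneLSH`, `n_bits`, `n_tables`) are assumptions chosen for illustration, since the abstract does not say which encoding or hash family HSEARCH uses.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """One-hot encode a fixed-length sequence as a flat vector of
    length 20 * len(seq). Assumed encoding, for illustration only."""
    vec = [0.0] * (len(AMINO_ACIDS) * len(seq))
    for pos, aa in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return vec


class HyperplaneLSH:
    """Random-hyperplane LSH: each table hashes a vector to the sign
    pattern of its dot products with n_bits random Gaussian vectors."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    @staticmethod
    def _key(planes, vec):
        # One bit per hyperplane: which side of the plane the point lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes
        )

    def add(self, label, vec):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].append(label)

    def candidates(self, vec):
        """Return labels sharing a bucket with vec in at least one table."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._key(planes, vec), []))
        return found


# Index a few hypothetical 6-residue fragments, then query with one of them.
fragments = ["ACDEFG", "ACDEFH", "WYWYWY"]
index = HyperplaneLSH(dim=len(AMINO_ACIDS) * 6)
for frag in fragments:
    index.add(frag, encode(frag))

hits = index.candidates(encode("ACDEFG"))
print(sorted(hits))
```

Only the candidates returned by the hash tables need an exact comparison against the query, which is where the saving over a brute-force scan of every sequence comes from. Whether a near neighbour such as `ACDEFH` lands in the same bucket is probabilistic and depends on `n_bits` and `n_tables`; an identical sequence always collides with itself.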
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Protein Structure and Dynamics
