HSEARCH: fast and accurate protein sequence motif search and clustering

Haifeng Chen; Ting Chen

arXiv:1701.00452 · q-bio.GN · January 3, 2017 · 2 cites

Open Access

TL;DR

HSEARCH finds and clusters protein sequence motifs by converting fixed-length sequences into points in a high-dimensional space and searching them with locality-sensitive hashing. It is significantly faster than brute-force search for motif finding and achieves high accuracy for motif clustering.

Contribution

It introduces a new method combining high-dimensional data representation and locality-sensitive hashing for fast, accurate protein motif search and clustering.

Findings

01

HSEARCH is significantly faster than brute force methods.

02

It achieves high accuracy in protein motif clustering.

03

The method scales well with large datasets.

Abstract

Protein motifs are conserved fragments that occur frequently in protein sequences. They have significant functions, such as forming the active site of an enzyme. Searching for and clustering protein sequence motifs is computationally intensive. Most existing methods are either not fast enough to analyze large data sets for motif finding or achieve low accuracy for motif clustering. We present a new protein sequence motif finding and clustering algorithm called HSEARCH. It converts fixed-length protein sequences to data points in a high-dimensional space and applies locality-sensitive hashing to quickly search for protein sequences homologous to a motif. HSEARCH is significantly faster than the brute-force algorithm for protein motif finding and achieves high accuracy for protein motif clustering.
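The pipeline the abstract describes (fixed-length fragments, points in a high-dimensional space, locality-sensitive hashing) can be illustrated with a minimal sketch. The abstract does not specify the encoding or the hash family, so everything below is an assumption made for illustration: a one-hot encoding stands in for the paper's embedding, random-hyperplane signatures stand in for its hash functions, and the names `windows`, `one_hot`, `make_hyperplanes`, `signature`, `build_index`, and `query` are hypothetical and not taken from HSEARCH.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def windows(sequence, k):
    """Return every fixed-length fragment (k-mer) of a protein sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def one_hot(kmer):
    """Assumed encoding: map a k-mer to a 20*k dimensional 0/1 vector."""
    vec = [0.0] * (len(AMINO_ACIDS) * len(kmer))
    for pos, residue in enumerate(kmer):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1.0
    return vec


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Assumed hash family: one bit per hyperplane, the sign of the dot product."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0.0 else 0
        for plane in planes
    )


def build_index(kmers, planes):
    """Group k-mers into buckets keyed by their LSH signature."""
    buckets = defaultdict(list)
    for kmer in kmers:
        buckets[signature(one_hot(kmer), planes)].append(kmer)
    return buckets


def query(kmer, buckets, planes):
    """Return candidates sharing the query's bucket, without scanning all k-mers."""
    return buckets.get(signature(one_hot(kmer), planes), [])


if __name__ == "__main__":
    k = 6
    planes = make_hyperplanes(len(AMINO_ACIDS) * k, n_bits=8)
    index = build_index(windows("MKTAYIAKQRQISFVK", k), planes)
    print(query("MKTAYI", index, planes))
```

A query only inspects the k-mers in its own bucket, which is where the speedup over a brute-force scan would come from. A single hash table can miss similar fragments that land in neighboring buckets, so LSH schemes typically query several independent tables and then verify the candidates with an exact similarity score.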

Taxonomy

Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Protein Structure and Dynamics