Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

Aleksandar Stojmirovic; Vladimir Pestov

arXiv:cs/0309005·cs.DS·September 4, 2007·3 cites

Open Access

TL;DR

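This paper proposes a family of hierarchical indexing schemes for ungapped, score matrix-based similarity search over large datasets of short (4-12 amino acid) protein fragments. For fragments of length 10, the index returns the 100 nearest neighbors after scanning, on average, less than 1% of the dataset.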

Contribution

The authors build the index on the internal geometry of the amino acid alphabet, enabling fast similarity search in datasets on the order of 60 million protein fragments.

Findings

01

Returns the 100 nearest neighbors of any fragment of length 10 after scanning, on average, less than 1% of the dataset

02

Performs exceptionally well on datasets of short fragments, 4-12 amino acids long

03

Serves as a building block for more complex algorithms and may be used directly in biological investigations

Abstract

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search is important both as a building block for more complex algorithms and for possible direct use in biological investigations where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well, for example outputting 100 nearest neighbours to any possible fragment of length 10 after scanning on average less than one per cent of the entire dataset.
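
To make the search problem in the abstract concrete, the following minimal Python sketch shows ungapped, score matrix-based similarity and a brute-force k-nearest-neighbour query over equal-length fragments. It is not the authors' indexing scheme: it scans the whole dataset, which is the baseline cost that an index of the kind described aims to avoid. The four-letter alphabet, the values in `TOY_SCORES`, and the function names are all assumptions made for illustration.

```python
import heapq

# Hypothetical toy score matrix over four amino acids (A, G, L, K).
# Illustrative values only; a real search would use a full 20x20
# substitution matrix. Only one ordering of each pair is stored.
TOY_SCORES = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "L"): -1, ("A", "K"): -1,
    ("G", "G"): 6, ("G", "L"): -4, ("G", "K"): -2,
    ("L", "L"): 4, ("L", "K"): -2,
    ("K", "K"): 5,
}


def pair_score(a, b):
    """Look up the symmetric score of residues a and b."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def ungapped_score(x, y):
    """Ungapped similarity: sum of per-position scores, no insertions or deletions."""
    if len(x) != len(y):
        raise ValueError("ungapped comparison needs fragments of equal length")
    return sum(pair_score(a, b) for a, b in zip(x, y))


def nearest_neighbours(query, dataset, k):
    """Brute-force baseline: score every fragment and keep the k highest."""
    return heapq.nlargest(k, dataset, key=lambda frag: ungapped_score(query, frag))


fragments = ["AGLL", "KKKK", "AGLK", "GGGG"]
print(nearest_neighbours("AGLK", fragments, 2))
```

Because every fragment is scored, the cost of this baseline grows linearly with the dataset size. The figure quoted in the abstract (less than one per cent of the dataset scanned on average) describes how much of that work the proposed index is reported to skip.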

Taxonomy

Topics: Algorithms and Data Compression · Data Management and Algorithms · Machine Learning in Bioinformatics