Efficient seeding techniques for protein similarity search
Mikhail Roytberg (IMPB-RAS), Anna Gambin, Laurent Noé (LIFL, INRIA Lille - Nord Europe), Slawomir Lasota, Eugenia Furletova (IMPB-RAS), Ewa Szczurek (MPI), Gregory Kucherov (LIFL, INRIA Lille - Nord Europe)

TL;DR
This paper explores the design of efficient subset seed alphabets for protein similarity search, achieving sensitivity/selectivity trade-offs comparable to or better than those of standard methods such as Blastp.
Contribution
It introduces several seed alphabet design methods based on the subset seed formalism and compares the resulting seeds with the Blastp seeding method and with vector seeds, showing similar or better performance in protein sequence analysis.
Findings
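The proposed subset seeds perform similarly to, or better than, Blastp on Bernoulli models of proteins compatible with the BLOSUM62 matrix.
The subset seed formalism is less expressive, but less costly to implement, than the accumulative principle used in Blastp and vector seeds.
Seed alphabets are designed so that the seeds built over them offer the best available sensitivity/selectivity trade-offs.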
Abstract
We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform an analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2,3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the accumulative principle used in Blastp and vector seeds, our seeds show a similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix.
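To make the idea of a subset seed more concrete, here is a minimal Python sketch. It rests on an assumption the abstract does not state: that each seed letter can be modeled as a partition of the 20 amino acids, so that two residues match under a letter when they fall in the same class. The seed letters `#`, `@`, and `_`, the residue groupings, and the function names are all hypothetical and chosen for illustration; they are not the alphabets designed in the paper.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_partition(groups):
    """Map each residue to a class id; residues not listed become singletons."""
    mapping = {}
    for class_id, group in enumerate(groups):
        for aa in group:
            mapping[aa] = class_id
    next_id = len(groups)
    for aa in AMINO_ACIDS:
        if aa not in mapping:
            mapping[aa] = next_id
            next_id += 1
    return mapping


# Hypothetical seed alphabet (illustrative groupings, not taken from the paper).
SEED_ALPHABET = {
    "#": make_partition([]),                                         # exact match
    "@": make_partition(["ILMV", "FWY", "KR", "DE", "NQ", "ST"]),   # grouped match
    "_": make_partition([AMINO_ACIDS]),                              # any residue
}


def seed_key(seed, kmer):
    """Hash key of a k-mer under a seed: one class id per seed position."""
    return tuple(SEED_ALPHABET[letter][aa] for letter, aa in zip(seed, kmer))


def find_hits(seed, query, subject):
    """Return (query_pos, subject_pos) pairs whose k-mers share a seed key."""
    k = len(seed)
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[seed_key(seed, subject[j:j + k])].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(seed_key(seed, query[i:i + k]), []):
            hits.append((i, j))
    return hits


hits = find_hits("#@_#", "MAIKGL", "TTAVWGQ")
```

Under these assumptions a hit is found by a single dictionary lookup per k-mer, which illustrates why a subset seed can be cheaper to implement than an accumulative criterion that sums substitution scores over a word and compares the total with a threshold. In the example, `AIKG` and `AVWG` share a key (A equals A, I and V are in the same group, the third position is unconstrained, G equals G), so `hits` contains the pair `(1, 2)`.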
Taxonomy
Topics: Algorithms and Data Compression · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
