On subset seeds for protein alignment

Mikhail A. Roytberg (IMPB); Anna Gambin; Laurent Noé (LIFL; INRIA Lille - Nord Europe); Slawomir Lasota; Eugenia Furletova (IMPB); Ewa Szczurek (MPI); Gregory Kucherov (LIFL; INRIA Lille - Nord Europe)

arXiv:0901.3198·q-bio.QM·January 18, 2011·IEEE ACM Trans. Comput. Biol. Bioinform.

TL;DR

This paper explores the design of subset seeds for protein similarity search, proposing new seed alphabets and demonstrating their competitive performance against standard methods like BLASTP across various datasets.

Contribution

It introduces novel seed alphabet design methods for protein alignment and provides a comprehensive comparison with existing seeding techniques.

Findings

01

Seeds built over the designed alphabets perform similarly to, or even better than, BLASTP on Bernoulli models of proteins compatible with the BLOSUM62 matrix.

02

The proposed seeds show comparable or better performance than BLASTP in large-scale benchmarks on protein databases.

03

The subset-seed formalism is less expressive, but less costly to implement, than the cumulative principle used in BLASTP and vector seeds.

Abstract

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main…
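To make the subset-seed idea concrete, here is a minimal Python sketch. It assumes a seed is a sequence of seed letters, and that each seed letter is a set of amino-acid pairs allowed at that position and always contains the identity pairs. The residue groupings, the seed itself, and the names `letter_from_groups` and `seed_hits` are hypothetical and chosen for illustration only; they are not the alphabets or seeds designed in the paper. Note that each position is checked independently, whereas a cumulative criterion sums scores over positions.

```python
from typing import FrozenSet, Iterable, List, Tuple

Pair = Tuple[str, str]
SeedLetter = FrozenSet[Pair]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def letter_from_groups(groups: Iterable[str]) -> SeedLetter:
    """Build a seed letter: two residues match if they are identical
    or belong to the same group."""
    pairs = {(a, a) for a in AMINO_ACIDS}  # identity pairs always match
    for group in groups:
        pairs.update((a, b) for a in group for b in group)
    return frozenset(pairs)


# Hypothetical seed letters (illustration only, not taken from the paper).
EXACT = letter_from_groups([])                             # identities only
GROUPED = letter_from_groups(["ILMV", "KR", "DE", "FWY"])  # assumed grouping
JOKER = frozenset((a, b) for a in AMINO_ACIDS for b in AMINO_ACIDS)


def seed_hits(seed: List[SeedLetter], s: str, t: str, i: int, j: int) -> bool:
    """Return True if the seed matches s starting at i against t starting at j.

    Every aligned residue pair must belong to the seed letter at the
    same offset; no score is accumulated across positions."""
    span = len(seed)
    if i < 0 or j < 0 or i + span > len(s) or j + span > len(t):
        return False
    return all((s[i + k], t[j + k]) in seed[k] for k in range(span))


seed = [EXACT, GROUPED, JOKER, EXACT]
print(seed_hits(seed, "MKVLA", "MRWLA", 0, 0))  # K/R share a group here
print(seed_hits(seed, "MKVLA", "MEWLA", 0, 0))  # K/E do not share a group
```

When the seed letters come from groupings like these, each position reduces to comparing group labels, which is one plausible reading of why the formalism is described as less costly to implement.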
