Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding
Yoojin Hong, Jaewoo Kang, Dongwon Lee, Randen L. Patterson, and Damian B. van Rossum

TL;DR
This paper introduces Ada-BLAST, a heuristic embedding strategy for protein sequence analysis that is faster than the authors' previous embedding approach while maintaining sensitivity, enhancing phylogenetic profiling and structural classification even in low sequence similarity scenarios.
Contribution
The paper presents Ada-BLAST, a new algorithm that improves speed and accuracy of protein sequence embedding for phylogenetic and structural analysis.
Findings
Ada-BLAST is approximately 19 times faster than previous methods.
Embedded alignment measurements help identify secondary structural elements.
The approach improves classification of transmembrane domain structures.
Abstract
We theorize that phylogenetic profiles provide a quantitative method that can relate the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of phylogenetic profiles is the interoperable data format (e.g., alignment information, physiochemical information, genomic information, etc.). Indeed, we have previously demonstrated that Position Specific Scoring Matrices (PSSMs) are an informative M-dimension which can be scored from quantitative measures of embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, even in the twilight zone of sequence similarity (<25% identity) (1-5). Although powerful, our previous embedding strategy suffered from contaminating alignments (embedded AND unmodified) and computational expense. Herein, we describe the logic and algorithmic process for a…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks
