Fixed-Length Protein Embeddings using Contextual Lenses
Amir Shanehsazzadeh, David Belanger, David Dohan

TL;DR
This paper introduces a method for generating fixed-length protein embeddings using transformer models and contextual lenses, enabling efficient similarity searches comparable to BLAST without extensive fine-tuning.
Contribution
It presents a novel supervised approach with contextual lenses to learn fixed-length protein embeddings from pretrained transformers, improving nearest-neighbor classification performance.
Findings
Embeddings trained with contextual lenses outperform raw transformer embeddings.
Pretraining significantly boosts family classification accuracy.
Learned embeddings are competitive with traditional BLAST searches.
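The nearest-neighbor family classification mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: cosine similarity is an assumed choice of vector similarity, and the family labels and embedding values are hypothetical toy data.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two fixed-length embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def nearest_neighbor_family(query: Vector, database: List[Tuple[str, Vector]]) -> str:
    """Return the family label of the database embedding most similar to the query."""
    best_label, _ = max(database, key=lambda item: cosine_similarity(query, item[1]))
    return best_label


# Hypothetical labeled embeddings (assumption: 2-dimensional for readability).
database = [
    ("family_A", [1.0, 0.0]),
    ("family_B", [0.0, 1.0]),
    ("family_A", [0.9, 0.1]),
]

predicted_a = nearest_neighbor_family([0.8, 0.2], database)
predicted_b = nearest_neighbor_family([0.1, 0.9], database)
```

A linear scan like this is only for illustration. The practical appeal of fixed-length vectors is that the same lookup can be handed to hardware-accelerated or hashing-based search.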
Abstract
The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which makes it computationally expensive. In contrast to edit distance, a vector similarity approach can be accelerated substantially using modern hardware or hashing techniques. Such an approach requires fixed-length embeddings for biological sequences. There has been recent interest in learning fixed-length protein embeddings using deep learning models, under the hypothesis that the hidden layers of supervised or semi-supervised models could produce useful vector embeddings. We consider transformer (BERT) protein language models that are pretrained on the TrEMBL data set and learn fixed-length embeddings on top of them with contextual…
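To make the fixed-length requirement concrete, the sketch below collapses a variable-length sequence of per-residue vectors into a single vector. It assumes the simplest possible pooling operations (mean and max over the sequence axis). The function names and toy values are hypothetical, and the lenses trained in the paper may differ from these.

```python
from typing import List

Vector = List[float]


def mean_pool_lens(residue_embeddings: List[Vector]) -> Vector:
    """Average the per-residue vectors over the sequence axis."""
    if not residue_embeddings:
        raise ValueError("sequence must contain at least one residue")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / length for d in range(dim)]


def max_pool_lens(residue_embeddings: List[Vector]) -> Vector:
    """Take the per-dimension maximum over the sequence axis."""
    if not residue_embeddings:
        raise ValueError("sequence must contain at least one residue")
    dim = len(residue_embeddings[0])
    return [max(vec[d] for vec in residue_embeddings) for d in range(dim)]


# Hypothetical per-residue outputs for proteins of length 2 and 3
# (assumption: embedding dimension 2, values chosen for readability).
short_protein = [[1.0, 0.0], [3.0, 2.0]]
long_protein = [[0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]

short_embedding = mean_pool_lens(short_protein)
long_embedding = mean_pool_lens(long_protein)
```

Both proteins end up with an embedding of the same dimension regardless of their length, which is what allows them to be compared by vector similarity.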
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Bioinformatics and Genomic Networks
