Learning protein sequence embeddings using information from structure
Tristan Bepler, Bonnie Berger

TL;DR
This paper introduces a novel representation learning framework that generates protein sequence embeddings encoding structural information, improving structural similarity prediction and downstream tasks like transmembrane domain prediction.
Contribution
The authors develop a multi-task LSTM-based approach that learns position-specific protein embeddings from sequence and structural similarity data, outperforming existing methods.
Findings
Outperforms other sequence-based methods in structural similarity prediction.
Embeddings transfer effectively to improve transmembrane domain prediction.
Introduces a soft symmetric alignment (SSA) mechanism that scores the similarity of two proteins by comparing their sequences of per-position embeddings.
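The soft symmetric alignment idea named in the findings can be illustrated with a small NumPy sketch. It assumes the formulation usually attributed to the paper: pairwise L1 distances between per-position embeddings, a softmax over each axis to get soft alignment weights, and a weighted average of negative distances as the score. The function names `softmax` and `soft_symmetric_alignment` are hypothetical, the embeddings are random placeholders rather than model outputs, and this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def soft_symmetric_alignment(z1, z2):
    """Similarity score between two embedded sequences (assumed formulation).

    z1: array of shape (n, d), one embedding per position of sequence 1.
    z2: array of shape (m, d), one embedding per position of sequence 2.
    Returns a scalar <= 0; values closer to 0 mean more similar.
    """
    # Pairwise L1 distances between every position pair, shape (n, m).
    dist = np.abs(z1[:, None, :] - z2[None, :, :]).sum(axis=-1)

    # alpha: each position of z1 softly aligned across positions of z2.
    alpha = softmax(-dist, axis=1)
    # beta: each position of z2 softly aligned across positions of z1.
    beta = softmax(-dist, axis=0)

    # Combined weight: high if either direction aligns the pair.
    a = alpha + beta - alpha * beta

    # Weighted average of negative distances.
    return -(a * dist).sum() / a.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_a = rng.normal(size=(5, 4))  # placeholder embeddings, 5 positions
    seq_b = rng.normal(size=(7, 4))  # placeholder embeddings, 7 positions
    print(soft_symmetric_alignment(seq_a, seq_b))
```

Because the weights are built from both softmax directions, the score is the same whichever sequence is passed first, and the two sequences may have different lengths. In the single-position case, `soft_symmetric_alignment(np.array([[0.0, 0.0]]), np.array([[1.0, 2.0]]))` reduces to the negative L1 distance, -3.0.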
Abstract
Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function. Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We newly approach this problem through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings --- one per amino acid position --- that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information…
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · Bioinformatics and Genomic Networks
