Retrieved Sequence Augmentation for Protein Representation Learning
Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu, Qi Liu, Lingpeng Kong

TL;DR
This paper introduces Retrieved Sequence Augmentation (RSA), a retrieval-based method that enhances protein representation learning by linking sequences to similar ones in a database, improving prediction accuracy and speed without relying on multiple sequence alignments.
Contribution
The paper presents RSA, a novel retrieval-augmented approach for protein modeling that outperforms traditional MSA-based methods in accuracy and efficiency, especially on de novo proteins.
Findings
RSA improves structure and property prediction accuracy by 5% on average over MSA Transformer.
RSA is 373 times faster than MSA Transformer.
RSA transfers better to new protein domains.
Abstract
Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as the de novo and orphan proteins remain great challenges in protein representation learning. In this work, we show that MSA-augmented models inherently belong to retrieval-augmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation (RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences…
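The retrieve-then-combine idea in the abstract can be illustrated with a minimal sketch. This is not the paper's retriever or model: the k-mer count vector is a toy stand-in for a learned sequence encoder, and the names `kmer_vector`, `retrieve`, and `augment` are hypothetical. Since the abstract is cut off before it says how sequences are combined, pairing the query with each hit is only a placeholder for that step.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=2):
    """Toy embedding: counts of overlapping k-mers (assumed stand-in for a learned encoder)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, database, top_k=2):
    """Return the top_k database sequences most similar to the query (no alignment step)."""
    q = kmer_vector(query)
    scored = [(cosine(q, kmer_vector(s)), s) for s in database]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]


def augment(query, database, top_k=2):
    """Placeholder combination step: pair the query with each retrieved sequence."""
    return [(query, hit) for hit in retrieve(query, database, top_k)]


# Made-up sequences, for illustration only.
database = ["MKTAYIAKQR", "MKTAYLAKQR", "GGGGSGGGGS", "PLVWHHHHHH"]
pairs = augment("MKTAYIAKQR", database, top_k=2)
```

In a real system each (query, retrieved) pair would be passed to a downstream model and the predictions aggregated; that part is left out here because the text above does not specify it.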
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Protein Structure and Dynamics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Softmax · Adam · Layer Normalization · Residual Connection · Dense Connections
