ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Minghao Xu, Xinyu Yuan, Santiago Miret, Jian Tang

TL;DR
ProtST enhances protein language models by integrating biomedical texts and protein descriptions, improving function prediction and retrieval through multimodal pre-training tasks.
Contribution
This work introduces the ProtDescribe dataset and the ProtST framework, combining protein sequences and texts to improve protein representation learning.
Findings
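ProtST-induced protein language models outperform previous ones on diverse representation learning benchmarks.
ProtST achieves effective zero-shot protein classification.
ProtST enables functional protein retrieval from a large-scale database without any function annotation.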
Abstract
Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property…
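To make the "multimodal representation alignment" task more concrete, here is a minimal sketch of one common way to align two modalities: a symmetric contrastive (InfoNCE-style) loss over paired protein and text embeddings. The abstract above does not specify the loss, so the function names, the `temperature` value, and the choice of InfoNCE itself are assumptions for illustration, not the authors' implementation.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def log_softmax_at(scores, index):
    # Numerically stable log-softmax, evaluated at a single position.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[index] - log_z

def alignment_loss(protein_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss (hypothetical helper, not from the paper).

    protein_embs[i] and text_embs[i] are assumed to describe the same protein;
    every other pairing in the batch is treated as a negative.
    """
    p = [l2_normalize(v) for v in protein_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(p)
    # Cosine similarity between every protein and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(p[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Protein-to-text direction: each row should peak on its own diagonal entry.
    p2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-protein direction: each column should peak on its own diagonal entry.
    t2p = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
               for j in range(n)) / n
    return 0.5 * (p2t + t2p)

if __name__ == "__main__":
    proteins = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", alignment_loss(proteins, matched_texts))
    print("swapped:", alignment_loss(proteins, swapped_texts))
```

In this toy setup the loss is small when each protein embedding points in the same direction as its own description and large when the descriptions are swapped, which is the behavior an alignment objective is meant to encourage.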
Taxonomy
Topics: Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Protein Structure and Dynamics
