ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Minghao Xu; Xinyu Yuan; Santiago Miret; Jian Tang

arXiv:2301.12040 · q-bio.BM · July 6, 2023 · 31 cites

TL;DR

ProtST enhances protein language models by integrating biomedical texts and protein descriptions, improving function prediction and retrieval through multimodal pre-training tasks.

Contribution

This work introduces the ProtDescribe dataset and the ProtST framework, combining protein sequences and texts to improve protein representation learning.

Findings

01

ProtST outperforms previous models on diverse benchmarks.

02

ProtST achieves effective zero-shot protein classification.

03

ProtST enables functional protein retrieval without function annotations.

Abstract

Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property…
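
As a rough illustration of the "multimodal representation alignment" task named in the abstract, the sketch below computes a symmetric contrastive (InfoNCE-style) loss over a batch of paired protein and text embeddings. The abstract shown here does not state which loss ProtST uses, so the loss form, the function name `alignment_loss`, and the `temperature` value are all assumptions chosen for illustration, not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def alignment_loss(protein_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss (an assumed choice, not confirmed by the source).

    protein_embs[i] and text_embs[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    p = [l2_normalize(v) for v in protein_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(p)
    # sim[i][j]: temperature-scaled cosine similarity of protein i and text j
    sim = [
        [sum(a * b for a, b in zip(p[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Protein -> text direction: row i should concentrate on column i.
    p2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text -> protein direction: column j should concentrate on row j.
    t2p = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (p2t + t2p)


# Toy 2-dimensional embeddings, purely illustrative.
aligned = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)
```

With this kind of objective, a protein embedding is pulled toward the embedding of its own description and pushed away from the other descriptions in the batch, which is the general idea behind aligning the two modalities in a shared space.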

Code & Models

Repositories

deepgraphlearning/protst
PyTorch · Official

Taxonomy

Topics: Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms · Protein Structure and Dynamics