Modeling Protein Using Large-scale Pretrain Language Model

Yijia Xiao; Jiezhong Qiu; Ziang Li; Chang-Yu Hsieh; Jie Tang

arXiv:2108.07435 · cs.LG · December 8, 2021 · 22 cites

Open Access · 1 Repo

TL;DR

This paper shows that language models pretrained at scale on protein sequences capture evolutionary information and improve performance on both token-level and sequence-level protein analysis tasks.

Contribution

The study introduces a large-scale language model trained on protein sequences, showing its effectiveness in modeling biological information and improving analysis accuracy.

Findings

01 Significant improvements in token-level tasks

02 Enhanced sequence-level task performance

03 Effective encoding of evolutionary information

Abstract

Proteins are linked to almost every life process. Analyzing the biological structure and properties of protein sequences is therefore critical to the exploration of life, as well as to disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. Deep learning makes it possible to model patterns in large quantities of data, and interdisciplinary researchers have begun to apply it to large biological datasets, e.g. using long short-term memory and convolutional neural networks for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology…
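To make the analogy between natural language and protein sequences concrete, here is a minimal sketch of how a sequence might be prepared for language-model pretraining: each residue becomes one token, and some tokens are hidden for the model to recover. The vocabulary, the special tokens, the `tokenize` and `mask_tokens` helpers, and the masked-token objective itself are assumptions made for illustration; the abstract as excerpted here does not state the paper's actual tokenizer or training objective.

```python
import random

# The 20 standard amino acids, one letter per residue.
# This vocabulary layout is hypothetical, chosen only for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["[PAD]", "[MASK]", "[UNK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Map each residue to an integer id; unknown letters map to [UNK]."""
    return [VOCAB.get(residue, VOCAB["[UNK]"]) for residue in sequence.upper()]


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace a random subset of positions with [MASK].

    Returns the corrupted ids plus a label list that holds the original id
    at masked positions and -1 everywhere else.
    Assumption: a BERT-style masked-token objective, used here as an example.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            corrupted.append(VOCAB["[MASK]"])
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(-1)
    return corrupted, labels


# Example: a short, made-up peptide string.
ids = tokenize("MKTAYIAKQR")
corrupted, labels = mask_tokens(ids, mask_prob=0.3, seed=1)
```

A model trained to predict the hidden residues from their context has to learn which substitutions are plausible at each position, which is one way evolutionary constraints could end up encoded in its representations.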

Code & Models

Repositories

THUDM/ProteinLM
PyTorch · Official

Taxonomy

Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · RNA and protein synthesis mechanisms