Modeling Protein Using Large-scale Pretrain Language Model
Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang

TL;DR
This paper shows that large-scale pretraining of language models on protein sequences captures evolutionary information and improves performance on both token-level and sequence-level protein analysis tasks.
Contribution
The study introduces a large-scale language model trained on protein sequences, showing its effectiveness in modeling biological information and improving analysis accuracy.
Findings
Significant improvements in token-level tasks
Enhanced sequence-level task performance
Effective encoding of evolutionary information
Abstract
Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data possible. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural network for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology…
Taxonomy
Topics: Machine Learning in Bioinformatics · Protein Structure and Dynamics · RNA and protein synthesis mechanisms
