PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

Yang Tan; Mingchen Li; Pan Tan; Ziyi Zhou; Huiqun Yu; Guisheng Fan; Liang Hong

arXiv:2310.17415·cs.CL·October 27, 2023·2 cites

Open Access · 1 Repo

TL;DR

This paper systematically evaluates how different vocabulary sizes and tokenization methods affect the performance of large protein language models across various downstream tasks, providing insights for optimal model design.

Contribution

It introduces a comprehensive benchmark for protein language models, analyzing the impact of vocabulary size and tokenization on transfer learning performance.

Findings

1. Optimal vocabulary size is between 50 and 200.
2. Vocabulary sizes over 800 impair model performance.
3. Extensive testing across 33 datasets validates the findings.

Abstract

Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is therefore a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there is still no comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random…
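To make the role of vocabulary size concrete, here is a minimal sketch of sub-word tokenization on amino acid sequences. It uses byte-pair encoding (BPE), a common sub-word method chosen here purely for illustration; the excerpt above does not name PETA's three tokenization methods, and this is not the authors' implementation. The function names (`train_bpe`, `apply_merge`, `tokenize`) and the toy sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequences, vocab_size):
    """Learn BPE merges until the vocabulary holds `vocab_size` tokens.

    Training starts from single residues and stops early if no adjacent
    pairs remain to merge.
    """
    corpus = [list(seq) for seq in sequences]
    vocab = sorted({tok for seq in corpus for tok in seq})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # so that training is deterministic.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        if merged not in vocab:
            vocab.append(merged)
        corpus = [apply_merge(seq, best) for seq in corpus]
    return vocab, merges


def tokenize(seq, merges):
    """Tokenize one sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "MKVAYIAKQR"]
    for size in (12, 20):
        vocab, merges = train_bpe(seqs, size)
        print(size, tokenize(seqs[0], merges))
```

A larger vocabulary yields fewer, longer tokens per sequence, but each token is seen less often during pre-training. That trade-off is the one the vocabulary-size comparison in the paper examines.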

Code & Models

Repositories

ginnm/proteinpretraining (PyTorch, official)

Taxonomy

Topics: RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Topic Modeling