A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding
Yiqing Shen, Zan Chen, Michail Mamalakis, Luhan He, Haiyang Xia, Tianbin Li, Yanzhou Su, Junjun He, Yu Guang Wang

TL;DR
This paper introduces a new dataset and benchmark for evaluating large language models' ability to understand proteins, and shows that with specialized training an LLM such as InternLM2-7B can surpass GPT-4 on protein comprehension tasks.
Contribution
The authors created ProteinLMDataset, for pretraining and fine-tuning LLMs, and ProteinLMBench, for evaluating them on protein understanding. Together they supply the paired protein-sequence and text resources that this domain previously lacked.
Findings
InternLM2-7B outperforms GPT-4 on ProteinLMBench
ProteinLMDataset contains 17.46 billion tokens for pretraining
ProteinLMBench includes 944 manually verified questions
Abstract
The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: "Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language?" Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for…
Taxonomy
Topics: Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies · Topic Modeling
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Shrink and Fine-Tune · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention
