A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

Yiqing Shen; Zan Chen; Michail Mamalakis; Luhan He; Haiyang Xia; Tianbin Li; Yanzhou Su; Junjun He; Yu Guang Wang

arXiv:2406.05540·q-bio.QM·July 9, 2024·2 cites

TL;DR

This paper introduces a new dataset and benchmark for evaluating large language models' ability to understand proteins, demonstrating that with specialized training, LLMs can surpass existing models like GPT-4 in protein comprehension tasks.

Contribution

The authors created two resources for protein understanding with LLMs: ProteinLMDataset, used for pretraining and fine-tuning, and ProteinLMBench, used for evaluation.

Findings

01

InternLM2-7B, after specialized training on ProteinLMDataset, outperforms GPT-4 on ProteinLMBench

02

ProteinLMDataset contains 17.46 billion tokens for pretraining

03

ProteinLMBench includes 944 manually verified questions
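
A benchmark comparison of this kind reduces to an accuracy score over a fixed question set. The following is a minimal sketch of that scoring step, assuming a multiple-choice format with one correct option per question. The `Question` record, the `accuracy` function, and the toy items are hypothetical illustrations, not the ProteinLMBench schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Question:
    """Hypothetical multiple-choice item; field names are assumptions."""
    prompt: str
    options: Sequence[str]
    answer_index: int  # index into `options` of the correct choice


def accuracy(
    questions: Sequence[Question],
    predict: Callable[[Question], int],
) -> float:
    """Return the fraction of questions where `predict` picks the correct option."""
    if not questions:
        raise ValueError("cannot score an empty question set")
    correct = sum(1 for q in questions if predict(q) == q.answer_index)
    return correct / len(questions)


# Placeholder items for illustration only; these are not benchmark questions.
toy_items = [
    Question("placeholder question 1", ["A", "B", "C"], 0),
    Question("placeholder question 2", ["A", "B", "C"], 2),
    Question("placeholder question 3", ["A", "B", "C"], 0),
    Question("placeholder question 4", ["A", "B", "C"], 1),
]


def always_first(question: Question) -> int:
    """Trivial baseline predictor that always selects the first option."""
    return 0


baseline_score = accuracy(toy_items, always_first)
```

In practice `predict` would wrap a model call that maps a prompt and its options to a chosen index. Scoring any two models against the same question list with the same function keeps their numbers directly comparable.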

Abstract

The parallels between protein sequences and natural language in their sequential structures have inspired the application of large language models (LLMs) to protein understanding. Despite the success of LLMs in NLP, their effectiveness in comprehending protein sequences remains an open question, largely due to the absence of datasets linking protein sequences to descriptive text. Researchers have then attempted to adapt LLMs for protein understanding by integrating a protein sequence encoder with a pre-trained LLM. However, this adaptation raises a fundamental question: "Can LLMs, originally designed for NLP, effectively comprehend protein sequences as a form of language?" Current datasets fall short in addressing this question due to the lack of a direct correlation between protein sequences and corresponding text descriptions, limiting the ability to train and evaluate LLMs for…

Code & Models

Repositories

tsynbio/proteinlmdataset (Official)

Taxonomy

Topics: Machine Learning in Bioinformatics · Biomedical Text Mining and Ontologies · Topic Modeling

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Shrink and Fine-Tune · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention