Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs
Wei Wu, Chao Wang, Liyi Chen, Mingze Yin, Yiheng Zhu, Kun Fu, Jieping Ye, Hui Xiong, Zheng Wang

TL;DR
This paper introduces SEPIT, a novel framework that enhances protein language models with structural knowledge and instruction tuning, significantly improving general-purpose protein understanding and prediction capabilities.
Contribution
The paper presents a structure-aware module integrated into protein language models (pLMs), a new instruction tuning pipeline, and what the authors describe as the largest protein instruction dataset, together aimed at better general-purpose protein understanding.
Findings
SEPIT outperforms existing models in open-ended and closed-set tasks.
The structure-aware module enriches pLMs with structural knowledge.
The comprehensive dataset facilitates effective training and evaluation.
Abstract
Proteins, as essential biomolecules, play a central role in biological processes, including metabolic reactions and DNA replication. Accurate prediction of their properties and functions is crucial in biological applications. The recent development of protein language models (pLMs) with supervised fine-tuning provides a promising solution to this problem. However, a fine-tuned model is tailored to a particular downstream prediction task, and achieving general-purpose protein understanding remains a challenge. In this paper, we introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach incorporates a novel structure-aware module into pLMs to enrich their structural knowledge, and subsequently integrates these enhanced pLMs with large language models (LLMs) to advance protein understanding. In this framework, we propose a novel instruction…
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research · Glycosylation and Glycoproteins Research · RNA and protein synthesis mechanisms
Methods: Contrastive Learning
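The page tags the paper with contrastive learning but does not say how SEPIT applies it. For orientation only, the following is a minimal, generic sketch of a symmetric InfoNCE-style contrastive loss between paired embeddings. The function name `info_nce`, the pairing of protein and text embeddings, and the temperature value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def info_nce(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Hypothetical helper for illustration: row i of `protein_emb` is assumed
    to be the positive match of row i of `text_emb`; all other rows in the
    batch act as negatives. This is NOT the SEPIT training objective.
    """
    # L2-normalise so the dot product is a cosine similarity.
    p = protein_emb / np.linalg.norm(protein_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, sharpened by the temperature.
    logits = (p @ t.T) / temperature

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the protein-to-text and text-to-protein directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs:", info_nce(emb, emb))
    print("mismatched pairs:", info_nce(emb, emb[::-1]))
```

Correctly paired embeddings should yield a much lower loss than mismatched ones, which is the signal a contrastive objective uses to pull matching representations together.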
