Design Proteins Using Large Language Models: Enhancements and Comparative Analyses

Kamyar Zeinalipour; Neda Jamshidi; Monica Bianchini; Marco Maggini; Marco Gori

arXiv:2408.06396·q-bio.QM·August 14, 2024

TL;DR

This study adapts large language models for protein sequence generation using a small dataset, achieving performance comparable to specialized models and providing publicly available tools for biological sequence design.

Contribution

It demonstrates the feasibility of fine-tuning large language models on limited protein data for biologically relevant sequence generation, a novel approach in computational biology.

Findings

01. Models perform comparably to specialized protein models.

02. Efficient protein sequence generation with limited data.

03. Public release of trained models for community use.

Abstract

Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen…
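
The abstract refers to "valid" and "distinct" protein sequences without stating, in the portion shown here, how either property is checked. As a minimal sketch only, and not the authors' procedure, the following Python treats a sequence as valid when it is non-empty and uses only the 20 canonical amino-acid letters, and removes duplicates after normalizing case. The names `is_valid_protein` and `distinct_valid` are hypothetical helpers introduced for illustration.

```python
# Illustrative assumption: "valid" means non-empty and composed only of the
# 20 canonical amino-acid one-letter codes. The paper's own criteria may differ.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_protein(sequence: str) -> bool:
    """Return True if the sequence is non-empty and uses only canonical residues."""
    cleaned = sequence.strip().upper()
    return bool(cleaned) and set(cleaned) <= CANONICAL_AMINO_ACIDS


def distinct_valid(sequences):
    """Keep the first occurrence of each valid sequence, preserving input order."""
    seen = set()
    kept = []
    for seq in sequences:
        cleaned = seq.strip().upper()
        if is_valid_protein(cleaned) and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept
```

A filter of this kind could plausibly be applied both when assembling a training set and when screening generated sequences; structural or functional plausibility would need separate tools beyond this alphabet check.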

Code & Models

Repositories

kamyarzeinalipour/protein-design-llms (PyTorch, official)

Taxonomy

Topics: Machine Learning in Bioinformatics