Design Proteins Using Large Language Models: Enhancements and Comparative Analyses
Kamyar Zeinalipour, Neda Jamshidi, Monica Bianchini, Marco Maggini, Marco Gori

TL;DR
This study adapts large language models for protein sequence generation using a small dataset, achieving performance comparable to specialized models and providing publicly available tools for biological sequence design.
Contribution
It demonstrates the feasibility of fine-tuning large language models on limited protein data for biologically relevant sequence generation, a novel approach in computational biology.
Findings
The fine-tuned LLMs perform comparably to specialized protein models such as ProGen.
Efficient protein sequence generation with limited data.
Public release of trained models for community use.
Abstract
Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available. Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen…
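The abstract refers to "distinct" training sequences and "valid" generated sequences without stating the exact criteria. As a minimal sketch, assuming validity means a non-empty string over the 20 canonical amino acid letters, a basic filter could look like the following. The function names are hypothetical and are not taken from the paper; the authors' actual evaluation may use different or stricter checks.

```python
# Illustrative sketch only: helper names are hypothetical, and the validity
# rule (20 canonical one-letter amino acid codes) is an assumption, not the
# paper's stated criterion.
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_protein(seq: str) -> bool:
    """Return True if seq is non-empty and uses only canonical residues."""
    seq = seq.strip().upper()
    return bool(seq) and set(seq) <= CANONICAL_AMINO_ACIDS


def distinct_valid(sequences):
    """Keep the first occurrence of each valid sequence, preserving order."""
    seen = set()
    kept = []
    for raw in sequences:
        seq = raw.strip().upper()
        if is_valid_protein(seq) and seq not in seen:
            seen.add(seq)
            kept.append(seq)
    return kept
```

For example, `distinct_valid(["MKT", "mkt ", "MKZ", "ACD"])` keeps `"MKT"` and `"ACD"`: the second entry duplicates the first after normalization, and `"MKZ"` contains a non-canonical letter.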
Taxonomy
Topics: Machine Learning in Bioinformatics
