Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design
Amin Tavakoli, Raswanth Murugan, Ozan Gokdemir, Arvind Ramanathan, Frances Arnold, Anima Anandkumar

TL;DR
This paper introduces a simple, efficient supervised fine-tuning (SFT) method for protein language models (PLMs) that improves the stability, functionality, and diversity of the protein sequences they generate.
Contribution
The authors propose a novel SFT approach that uses the PLM itself for data curation, improving protein generation without relying on costly experimental datasets.
Findings
Generated proteins show increased stability and functionality.
Sequences are more novel and diverse.
The method is effective across different PLMs and protein systems.
Abstract
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, yet its application to protein sequence modeling and protein language models (PLMs) remains ad hoc. This is in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences. Unlike existing approaches that require costly precompiled experimental datasets for SFT, our method leverages the PLM itself, integrating a lightweight curation pipeline with domain-specific filters to construct high-quality training data. These filters can independently refine a PLM's output and identify candidates for in vitro evaluation; when combined with SFT, they enable PLMs to generate more stable…
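The curation idea in the abstract (sample from the model, filter, keep the survivors as SFT data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `sample_sequences` is a hypothetical stand-in that draws random residues instead of querying a real PLM, and the length, composition, and duplicate filters are generic placeholders, not the paper's domain-specific filters. The fine-tuning step itself is omitted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def sample_sequences(n, length, seed=0):
    """Hypothetical stand-in for PLM sampling: uniform random residues."""
    rng = random.Random(seed)
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length)) for _ in range(n)]


def is_valid(seq):
    """True if the sequence is non-empty and uses only standard residues."""
    return len(seq) > 0 and all(ch in AMINO_ACIDS for ch in seq)


def max_residue_fraction(seq):
    """Fraction of the sequence occupied by its most common residue."""
    return max(seq.count(ch) for ch in set(seq)) / len(seq)


def curate(candidates, min_len=50, max_len=500, max_fraction=0.3):
    """Apply the placeholder filters and drop exact duplicates, keeping order."""
    seen = set()
    kept = []
    for seq in candidates:
        if not is_valid(seq):
            continue  # non-standard characters
        if not (min_len <= len(seq) <= max_len):
            continue  # outside the assumed length window
        if max_residue_fraction(seq) > max_fraction:
            continue  # low-complexity sequence
        if seq in seen:
            continue  # exact duplicate
        seen.add(seq)
        kept.append(seq)
    return kept


# Mix sampled candidates with a few sequences the filters should reject.
candidates = sample_sequences(n=20, length=60) + ["A" * 60, "MKT", "MKXB" * 20]
sft_dataset = curate(candidates)  # would serve as training data for the SFT step
```

In a real pipeline, the sampler would be the PLM being fine-tuned and the filters would encode domain knowledge about the target protein family; the abstract notes that such filters can also be used on their own to pick candidates for in vitro evaluation.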
Taxonomy
Topics: Genomics and Rare Diseases · vaccines and immunoinformatics approaches · Biomedical Text Mining and Ontologies
