ProtFIM: Fill-in-Middle Protein Sequence Design via Protein Language Models

Youhan Lee; Hasun Yu

arXiv:2303.16452 · cs.LG · March 30, 2023 · 1 cites


TL;DR

ProtFIM introduces a fill-in-middle protein language model that improves protein sequence infilling tasks by considering surrounding context, outperforming existing models in protein engineering applications.

Contribution

The paper proposes ProtFIM, a novel fill-in-middle training method for protein language models, and introduces SEIFER, a benchmark for infilling sequence design scenarios.

Findings

01 ProtFIM outperforms existing language models on infilling tasks.

02 The SEIFER benchmark evaluates infilling capability and exposes the weakness of existing language models on it.

03 ProtFIM generates protein sequences with meaningful representations.

Abstract

Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have been a promising tool for protein sequence design. In real-world protein engineering, there are many cases where the amino acids in the middle of a protein sequence are optimized while the other residues are kept fixed. Unfortunately, because of their left-to-right nature, existing pLMs modify suffix residues by prompting with prefix residues, which is insufficient for the infilling task, where the whole surrounding context must be considered. To find more effective pLMs for protein engineering, we design a new benchmark, Secondary structureE InFilling rEcoveRy (SEIFER), which approximates infilling sequence design scenarios. By evaluating existing models on the benchmark, we reveal the weakness of existing language models and show that language models trained via fill-in-middle…
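The fill-in-middle idea described in the abstract can be illustrated with a small sketch. It assumes the common formulation in which a sequence is split into prefix, middle, and suffix, then reordered so that a left-to-right model sees both flanks before generating the middle. The sentinel tokens `<PRE>`, `<SUF>`, `<MID>` and all function names below are hypothetical, chosen for illustration; the paper's actual vocabulary and span-sampling scheme are not given in this summary.

```python
import random

# Hypothetical sentinel tokens (an assumption, not the paper's vocabulary).
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim_example(sequence: str, start: int, end: int) -> str:
    """Reorder a sequence so the middle span [start, end) comes last.

    A causal model trained on this layout conditions on both the prefix
    and the suffix when it predicts the middle residues.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span out of range")
    prefix, middle, suffix = sequence[:start], sequence[start:end], sequence[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def random_fim_example(sequence: str, rng: random.Random) -> str:
    """Pick a random middle span (uniform cut points, an assumed scheme)."""
    start = rng.randint(0, len(sequence))
    end = rng.randint(start, len(sequence))
    return to_fim_example(sequence, start, end)


def restore(fim_text: str) -> str:
    """Invert the reordering: rebuild prefix + middle + suffix."""
    body = fim_text[len(PRE):]
    prefix, rest = body.split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # toy amino-acid string, not a real protein
    example = to_fim_example(seq, 3, 6)
    print(example)           # <PRE>MKT<SUF>AKQR<MID>AYI
    print(restore(example))  # MKTAYIAKQR
```

At inference time, under the same assumptions, the prompt would be everything up to and including `<MID>`, and the model's continuation would be taken as the proposed middle residues.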

Taxonomy

Topics: Machine Learning in Bioinformatics · Topic Modeling · Software Engineering Research