Reinforcement Learning for Sequence Design Leveraging Protein Language Models

Jithendaraa Subramanian; Shivakanth Sujit; Niloy Irtisam; Umong Sain; Riashat Islam; Derek Nowrouzezahrai; Samira Ebrahimi Kahou

arXiv:2407.03154·cs.LG·November 19, 2024

Open Access

TL;DR

This paper introduces a reinforcement learning framework utilizing protein language models as reward functions for designing biologically plausible and diverse protein sequences, with an efficient proxy model to reduce computational costs.

Contribution

It proposes a novel RL-based sequence design method leveraging PLMs with a proxy model for efficient optimization, and provides extensive benchmarking and open-source implementation.

Findings

01

RL-based sequences show high biological plausibility.

02

Generated sequences exhibit high diversity.

03

The method outperforms prior approaches in benchmarks.

Abstract

Protein sequence design, in which a protein is specified by its amino acid sequence, is essential to protein engineering problems in drug discovery. Prior approaches have resorted to evolutionary strategies or Monte Carlo methods for protein design, but they often fail to exploit the structure of the combinatorial search space or to generalize to unseen sequences. In the context of discrete black-box optimization over large search spaces, learning a mutation policy to generate novel sequences with reinforcement learning is appealing. Recent advances in protein language models (PLMs) trained on large corpora of protein sequences offer a potential solution to this problem by scoring proteins according to their biological plausibility (such as the TM-score). In this work, we propose to use PLMs as a reward function to generate new sequences. Yet the PLM can be computationally expensive to query due to its large size.…
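To make the "mutation policy" framing concrete, here is a minimal Python sketch of the loop the abstract describes: the state is a sequence, an action is a single-point mutation, and a scoring function supplies the reward. Everything named here is an assumption for illustration. `toy_reward` is a placeholder and not a PLM, `random_policy` is a uniform random stand-in for a learned policy, and none of these functions come from the paper's implementation.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_reward(seq: str) -> float:
    """Placeholder score (assumption): fraction of alanine residues.

    A real setup would query a PLM or a cheaper proxy model here.
    """
    return seq.count("A") / len(seq)


def apply_mutation(seq: str, position: int, residue: str) -> str:
    """Action: replace the residue at `position` with `residue`."""
    return seq[:position] + residue + seq[position + 1:]


def random_policy(seq: str, rng: random.Random) -> tuple[int, str]:
    """Stand-in policy (assumption): pick a position and residue uniformly."""
    return rng.randrange(len(seq)), rng.choice(AMINO_ACIDS)


def rollout(start: str, reward_fn, steps: int, seed: int = 0):
    """Apply `steps` mutations and record (sequence, reward) after each one."""
    rng = random.Random(seed)
    seq = start
    trajectory = []
    for _ in range(steps):
        position, residue = random_policy(seq, rng)
        seq = apply_mutation(seq, position, residue)
        trajectory.append((seq, reward_fn(seq)))
    return trajectory


if __name__ == "__main__":
    for seq, reward in rollout("MKTAYIAK", toy_reward, steps=5):
        print(seq, round(reward, 3))
```

Because `reward_fn` is passed in as an argument, an expensive scorer could be swapped for a cheaper approximation without touching the rollout loop, which is the kind of substitution the abstract's cost concern motivates.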

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Machine Learning in Bioinformatics