Low-N Protein Activity Optimization with FolDE

Jacob B. Roberts; Catherine R. Ji; Isaac Donnell; Thomas D. Young; Allison N. Pearson; Graham A. Hudson; Leah S. Keiser; Mia Wesselkamper; Peter H. Winegar; Janik Ludwig; Sarah H. Klass; Isha V. Sheth; Ezechinyere C. Ukabiala; Maria C. T. Astolfi; Benjamin Eysenbach; and Jay D. Keasling

arXiv:2510.24053·cs.LG·October 29, 2025

TL;DR

FolDE is an active learning method for protein optimization that combines naturalness-based warm-starting with a constant-liar batch selector. In simulations across 20 protein targets, it discovers 23% more top 10% mutants than the best baseline ALDE method.

Contribution

The paper introduces FolDE, an Active Learning-assisted Directed Evolution (ALDE) method that improves protein activity prediction and mutant discovery through naturalness-based warm-starting and a constant-liar batch selector.

Findings

01

In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005).

02

FolDE is 55% more likely than the best baseline ALDE method to find top 1% mutants.

03

The workflow is available as open-source software.

Abstract

Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve…
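
The batch-homogeneity problem described in the abstract, and the constant-liar idea named in the TL;DR, can be illustrated with a small sketch. This is the generic constant-liar heuristic from batch Bayesian optimization, not FolDE's actual implementation: after each pick, the model is refit as if the chosen mutant had returned a fixed pessimistic value (the "lie"), which discourages the next pick from landing right beside it. Everything here is an assumption made for illustration: the toy 2-D features standing in for mutant representations, the small NumPy Gaussian process, the upper-confidence-bound score, and the names `gp_posterior`, `constant_liar_batch`, and `beta`.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    # GP regression with a constant prior mean set to the training mean.
    mu0 = y_train.mean()
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    # Solve for the mean weights and the variance terms in one call.
    sol = np.linalg.solve(k_tt, np.column_stack([y_train - mu0, k_qt.T]))
    mean = mu0 + k_qt @ sol[:, 0]
    var = 1.0 - np.einsum("ij,ji->i", k_qt, sol[:, 1:])
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def constant_liar_batch(x_train, y_train, x_pool, batch_size, beta=2.0):
    # Hypothetical helper: returns indices into x_pool for one batch.
    lie = y_train.min()  # pessimistic lie: assume each pick performs poorly
    x_aug, y_aug = x_train.copy(), y_train.copy()
    available = np.ones(len(x_pool), dtype=bool)
    chosen = []
    for _ in range(batch_size):
        mean, std = gp_posterior(x_aug, y_aug, x_pool)
        score = mean + beta * std          # upper confidence bound
        score[~available] = -np.inf        # never pick the same mutant twice
        pick = int(np.argmax(score))
        chosen.append(pick)
        available[pick] = False
        # Pretend the pick was measured and returned the lie, then refit.
        x_aug = np.vstack([x_aug, x_pool[pick]])
        y_aug = np.append(y_aug, lie)
    return chosen


# Toy usage with random stand-in features and activities.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(8, 2))
y_train = np.sin(x_train[:, 0])
x_pool = rng.normal(size=(50, 2))
batch = constant_liar_batch(x_train, y_train, x_pool, batch_size=5)
```

Compared with taking the top five candidates by predicted mean, the lie lowers both the predicted value and the uncertainty around each pick, so later picks tend to move elsewhere in the pool. Choosing the minimum observed value as the lie is one common option; the mean or maximum would make the batch less or more exploitative, respectively.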
