Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization
Amaru Caceres Arroyo, Lea Bogensperger, Ahmed Allam, Michael Krauthammer, Konrad Schindler, Dominik Narnhofer

TL;DR
This paper introduces CHASE, a framework that combines pretrained protein language models with flow-based generative modeling to generate high-fitness protein variants efficiently, achieving state-of-the-art results on the AAV and GFP benchmarks.
Contribution
CHASE repurposes pretrained protein language models by compressing their embeddings into a compact latent space, then trains a conditional flow-matching model with classifier-free guidance in that space. High-fitness variants are generated directly, avoiding costly gradient-based sampling.
Findings
Achieves state-of-the-art results on AAV and GFP benchmarks.
Enables direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps.
Synthetic data bootstrapping improves performance in data-scarce scenarios.
Abstract
Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.
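To make the sampling idea in the abstract concrete, here is a minimal sketch of classifier-free guidance applied to a flow-matching velocity field, integrated with plain Euler steps. It is not the authors' implementation. The names `guided_velocity`, `sample_latent`, and `toy_velocity`, the guidance weight `w`, and the mixing convention `v_uncond + w * (v_cond - v_uncond)` are all assumptions chosen for illustration. The toy velocity field stands in for a trained network, and the latent vector stands in for a compressed protein language model embedding.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: velocity(z, t, fitness) -> dz/dt.
# fitness=None means "unconditional" (the condition is dropped).
VelocityFn = Callable[[Vector, float, Optional[float]], Vector]


def guided_velocity(v: VelocityFn, z: Vector, t: float,
                    fitness: float, w: float) -> Vector:
    """Mix conditional and unconditional velocities (assumed convention).

    w = 0 gives the unconditional field, w = 1 the conditional field,
    and w > 1 extrapolates beyond the conditional field.
    """
    v_uncond = v(z, t, None)
    v_cond = v(z, t, fitness)
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]


def sample_latent(v: VelocityFn, dim: int, fitness: float,
                  w: float = 1.0, steps: int = 50, seed: int = 0) -> Vector:
    """Integrate dz/dt = guided velocity from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        dz = guided_velocity(v, z, t, fitness, w)
        z = [zi + dt * di for zi, di in zip(z, dz)]
    return z


def toy_velocity(z: Vector, t: float, fitness: Optional[float]) -> Vector:
    """Toy stand-in for a trained model: straight-line flow toward a target.

    The target is 0 when unconditional, else the requested fitness value.
    """
    target = 0.0 if fitness is None else fitness
    return [(target - zi) / (1.0 - t + 1e-3) for zi in z]


if __name__ == "__main__":
    latent = sample_latent(toy_velocity, dim=4, fitness=1.0, w=1.0)
    print([round(x, 2) for x in latent])
```

Note that no fitness predictor or predictor gradient appears in the loop: the condition enters only through the velocity model's input, which is the property the abstract highlights. In a full pipeline the sampled latent would still need to be decoded back to a sequence, a step this sketch leaves out.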
Taxonomy
Topics: Language and cultural evolution · Genomics and Rare Diseases · Topic Modeling
