How to make the most of your masked language model for protein engineering
Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott

TL;DR
This paper introduces a novel sampling method for masked language models in protein engineering, demonstrating its effectiveness through in silico and in vitro evaluations on antibody therapeutics.
Contribution
It proposes stochastic beam search for MLM sampling and systematically evaluates its impact on antibody engineering, highlighting the importance of sampling strategies.
Findings
Sampling with stochastic beam search improves sequence optimization.
Choice of sampling method significantly affects model performance.
In vitro results validate the effectiveness of the proposed sampling approach.
Abstract
A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling…
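To make the idea in the abstract concrete, here is a minimal sketch of stochastic beam search over the 1-edit (single-substitution) neighborhood of a sequence. It is an illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for an MLM-derived whole-sequence score such as negative pseudo-perplexity, the stochastic selection step uses Gumbel-top-k sampling as one common way to draw a beam without replacement, and all function names and parameters (`beam_width`, `steps`, `temperature`) are invented for this example.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(seq):
    """Hypothetical stand-in for an MLM-based sequence score (higher is better).

    A real setup would score each candidate with a masked language model;
    here we simply reward the fraction of hydrophobic residues.
    """
    return sum(1.0 for aa in seq if aa in "AILMFVW") / len(seq)


def one_edit_neighborhood(seq):
    """Yield every sequence that differs from `seq` by one substitution."""
    for i, old in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != old:
                yield seq[:i] + aa + seq[i + 1:]


def gumbel_top_k(candidates, scores, k, temperature, rng):
    """Sample k candidates without replacement via Gumbel-perturbed scores."""
    perturbed = []
    for cand, score in zip(candidates, scores):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        perturbed.append((score / temperature + gumbel, cand))
    perturbed.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in perturbed[:k]]


def stochastic_beam_search(seed, score_fn, beam_width=4, steps=3,
                           temperature=0.1, rng=None):
    """Expand each beam member's 1-edit neighborhood, then resample the beam."""
    rng = rng or random.Random(0)
    beam = [seed]
    for _ in range(steps):
        pool = set(beam)  # keep current members as candidates too
        for seq in beam:
            pool.update(one_edit_neighborhood(seq))
        candidates = sorted(pool)  # fixed order for reproducibility
        scores = [score_fn(c) for c in candidates]
        beam = gumbel_top_k(candidates, scores, beam_width, temperature, rng)
    return beam


if __name__ == "__main__":
    for seq in stochastic_beam_search("GSTKDE", toy_score):
        print(seq, round(toy_score(seq), 3))
```

Because selection depends only on a score for each whole candidate sequence, `score_fn` can in principle be any combination of objectives, for example a weighted sum of a language-model score and a separate property predictor. That is the kind of flexible guidance the abstract describes, though the specific combination used in the paper is not given in this summary.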
