Bidirectional Learning for Offline Model-based Biological Sequence Design
Can Chen, Yingxue Zhang, Xue Liu, Mark Coates

TL;DR
This paper introduces a novel offline model-based biological sequence design method that leverages pre-trained language models and a bi-level optimization framework to improve sequence scoring and design efficiency.
Contribution
It proposes a new proxy model that adds a linear head to a pre-trained LM, together with a bi-level optimization framework that uses an adaptive learning rate to improve sequence design.
Findings
Effective in DNA and protein sequence design tasks
Outperforms existing methods in sequence optimization accuracy
Demonstrates robustness across different biological datasets
Abstract
Offline model-based optimization aims to maximize a black-box objective function with a static dataset of designs and their scores. In this paper, we focus on biological sequence design to maximize some sequence score. A recent approach employs bidirectional learning, combining a forward mapping for exploitation and a backward mapping for constraint, and it relies on the neural tangent kernel (NTK) of an infinitely wide network to build a proxy model. Though effective, the NTK cannot learn features because of its parametrization, and its use prevents the incorporation of powerful pre-trained Language Models (LMs) that can capture the rich biophysical information in millions of biological sequences. We adopt an alternative proxy model, adding a linear head to a pre-trained LM, and propose a linearization scheme. This yields a closed-form loss and also takes into account the biophysical…
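As a rough illustration of why a linear head on fixed features can give a closed-form loss, the sketch below fits a ridge-regression head in both directions: forward (static dataset to candidate designs) and backward (candidate designs to static dataset). It is not the authors' implementation. The random feature matrices stand in for pre-trained LM embeddings, and the function names (`ridge_head`, `bidirectional_loss`), the regularization strength `reg`, and the squared-error form of the two terms are all assumptions made for this example.

```python
import numpy as np


def ridge_head(features, scores, reg=1e-2):
    """Closed-form ridge solution for a linear head on fixed features.

    Solves (F^T F + reg * I) w = F^T y. `reg` is a hypothetical setting.
    """
    dim = features.shape[1]
    gram = features.T @ features + reg * np.eye(dim)
    return np.linalg.solve(gram, features.T @ scores)


def bidirectional_loss(feat_static, y_static, feat_design, y_target, reg=1e-2):
    """Sum of a forward and a backward squared-error term (assumed form).

    Forward: head fit on the static data should predict the target
    scores on the candidate designs (exploitation).
    Backward: head fit on the candidate designs should predict the
    static scores (constraint).
    """
    w_forward = ridge_head(feat_static, y_static, reg)
    forward = np.sum((y_target - feat_design @ w_forward) ** 2)

    w_backward = ridge_head(feat_design, y_target, reg)
    backward = np.sum((y_static - feat_static @ w_backward) ** 2)
    return forward + backward


# Toy data: random features stand in for LM embeddings of sequences.
rng = np.random.default_rng(0)
true_w = rng.normal(size=8)
feat_static = rng.normal(size=(32, 8))
y_static = feat_static @ true_w          # noise-free scores for the toy case
feat_design = rng.normal(size=(4, 8))    # features of candidate designs
y_target = np.full(4, y_static.max())    # ask candidates to reach the best seen score

loss = bidirectional_loss(feat_static, y_static, feat_design, y_target)
print(f"bidirectional loss on toy data: {loss:.4f}")
```

Because both heads are obtained by solving a linear system, the loss is an explicit function of the candidate features, which is what makes it convenient to optimize. How the paper maps discrete sequences to features and back is not covered by this sketch.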
Taxonomy
Topics: Machine Learning in Materials Science · Machine Learning in Bioinformatics · Machine Learning and Algorithms
