Energy-based generative models for monoclonal antibodies

Paul Pereira; Hervé Minoux; Aleksandra M. Walczak; Thierry Mora

PMC · DOI:10.1080/19420862.2025.2584935·November 25, 2025

Energy-based generative models for monoclonal antibodies

Paul Pereira, Hervé Minoux, Aleksandra M. Walczak, Thierry Mora

PDF

Open Access

TL;DR

This paper explores using energy-based generative models to optimize monoclonal antibodies for drug development by balancing solubility, humanness, and affinity.

Contribution

The novel use of energy-based generative models to generate antibodies on optimal Pareto fronts for multiple properties.

Findings

01

Energy-based generative models can optimize antibodies for solubility, humanness, and affinity simultaneously.

02

The model identifies tradeoffs and generates candidates on optimal Pareto fronts for these properties.

Abstract

Since the approval of the first antibody drug in 1986, a total of 162 antibodies have been approved for a wide range of therapeutic areas, including cancer, autoimmune, infectious, or cardiovascular diseases. Despite advances in biotechnology that accelerated the development of antibody drugs, the drug discovery process for this modality remains lengthy and costly, requiring multiple rounds of optimizations before a drug candidate can progress to preclinical and clinical trials. This multi-optimization problem involves increasing the affinity of the antibody to the target antigen while refining additional biophysical properties that are essential to drug development such as solubility, thermostability or aggregation propensity. Additionally, antibodies that resemble natural human antibodies are particularly desirable, as they are likely to offer improved profiles in terms of safety,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases2

cancer autoimmune, infectious, or cardiovascular diseases

Figures50

Click any figure to enlarge with its caption.

C](#f0002)). Using both chains to predict the solubility score leads to a degraded correlation of 0.35. The inability of that more informative model to improve performance may be an artifact of the small test set size (85 antibodies). We compare the performance of our solubility model to other models used to evaluate the solubility of antibodies ([Figure 5D](#f0005)). Removing the SASA in the predictive models leads to a drop in the Spearman correlation coefficient to $[eqn]$ . The SASA model we use is comparable in performance to the commonly used CamSol^[18](#cit0018)^ (0.4 vs 0.33),^[25](#c

C](#f0003)), compared to the maximum possible value of $[eqn]$ since the exploration is limited to the space of sequences that are at most six amino acids away from the wildtype sequence. Novelty shows an expected dependence on the exploration parameter $[eqn]$ : it ranges from $[eqn]$ for a pessimistic acquisition function $[eqn]$ to $[eqn]$ for an exploratory acquisition function $[eqn]$ . Both MCMC and GFlownet sampling methods generate a diverse and novel set of sequences. In addition, the continuity of the cloud of points shows that the methods can generate points all along the Pareto fro

C,D](#f0005)), we again find that using $[eqn]$ is the best choice, for both solubility thresholds. However, for the first split, the method fails to generate many diverse sequences for $[eqn]$ , which explains the low number of optimal sequences found for this choice. On the other hand, looking at dataset split 2 and 3 (SI Figure S12 C,D,G,H), we find that for both solubility thresholds it is best to use a pessimistic $[eqn]$ . However, even those choices fail to generate sequences with affinities $[eqn]$ as high as in the simple epistatic model, dropping to 0 for $[eqn]$ . We conclude that i

C](#f0006), SI Fig S13 C, E). As previously mentioned, the results for the first split are likely due to the small number of unique sequences generated. For the second and third split, the GflowNet with $[eqn]$ performs almost as well as MCMC. Finally, both MCMC sampling and Gflownet outperform the local search of AntBO, the initial dataset and the random baseline. Finally, on the hard epistatic model with $[eqn]$ , first the first split, we find that for an affinity threshold $[eqn]$ , the GFlowNet with $[eqn]$ generates the highest number of valid sequences. For the second split (SI Fig S13

C](#f0007)). With $[eqn]$ , $[eqn]$ and $[eqn]$ = 2 yield the most high-ipTM designs. Both choices appear to have an advantage over $[eqn]$ and 1 due to higher number of sequences generated with $[eqn]$ (12). However, $[eqn]$ also generates many non-binders with high solubility scores whereas $[eqn]$ generates very few sequences with $[eqn]$ . In addition, both $[eqn]$ and $[eqn]$ generate more binders than the local search procedure of AntBO. Among designs with ipTM $[eqn]$ , the average pairwise CDR-loop RMSD is $[eqn]$ Å, and the average novelty score to the predicted structures from the CB

Funding2

—Agence Nationale de la Recherche10.13039/501100001665
—European Research Council10.13039/501100000781

Keywords

Computational designAI modelsmulti-objective optimization

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMonoclonal and Polyclonal Antibodies Research · Protein purification and stability · vaccines and immunoinformatics approaches

Full text

Introduction

Monoclonal antibodies are an important type of biological drug, with 162 therapeutic antibodies currently approved to be used to treat a variety of diseases for different therapeutic areas such as cancer or inflammation. Their development, however, is costly, time consuming and prone to failure. Antibodies are Y-shaped proteins that specifically bind to an antigen. The number of possible antibodies that can be designed is too large to search exhaustively when trying to find an antibody that will bind sufficiently strongly to a target antigen. In addition to affinity, there are a number of other desirable properties that the candidate antibody must have in order to function properly as a drug. For example, it is desirable for the antibody not to have exposed hydrophobic patches that could cause aggregation issues and to be soluble. The antigenicity of the candidate antibody must also be as low as possible in order to avoid an immune reaction in the patient. These constraints further reduce the number of drug candidates and must be taken into account as early as possible in the development process.

We focus on the development of energy-based generative models for monoclonal antibodies. Our work falls within a broad range of methods utilizing wet-lab experiments and machine learning models to optimize antibodies for drug development. In-vitro assays can be used to characterize antibodies and filter-out ones with poor affinity to the target antigen or poor developability properties.^1^ The development of high-throughput assays has made it possible to characterize large numbers of sequences in batches. Methods like phage or yeast display^2,3^ can be used to identify a few antibodies that bind to a target antigen out of a large library of diverse antibodies or to optimize such candidates by comparing many different mutants of the same wild-type sequence.^4^ However, high-throughput methods tend to provide less accurate information than low-throughput methods such as surface plasmon resonance^5^ and can still take a long time to run compared to in-silico methods.

Machine learning models can be trained on the data generated by these wet-lab experiments.^6^ These models can then be used to evaluate the properties of large libraries of antibodies in order to filter-out poor drug candidates and reduce the list of candidates to be tested using wet-lab experiments. This process may lower the experimental cost, allowing for the use of lower-throughput methods when possible. For example, Ref.^7^ used affinity information about $[eqn]$ variants of the clinical antibody trastuzumab binding to Her2 to train a machine learning model that was then used to predict the affinity of $[eqn]$ trastuzumab variants and to identify top binders. The top predicted binders were further tested using in-silico models for viscosity, clearance, solubility and immunogenicity to end up with a few thousand highly optimized antibodies. While this approach has demonstrated its efficacy, its success relies on optimal candidates being already present in the library of candidates. This is not guaranteed and depends on the way the library is designed.

Alternatively, generative algorithms directly generate optimized candidate antibodies. Biswas et al.^8^ successfully combined a predictive model for the fluorescence of green fluorescent proteins and energy-based sampling methods to generate new amino sequences of GFP from Aequorea victoria with increased fluorescence. Jain et al.^9^ demonstrated the ability of autoregressive models to learn to generate amino acid sequences optimized with respect to the output of a proxy predictive model for three different tasks (anti-microbial peptide, TFBind 8 and GFP). More recently, Bennet et al.^10^ demonstrated the ability of their structure generative method to design de-novo single-domain VHH that bind to target antigens for which the epitope is specified.

We develop and analyze a method for single-round optimization of monoclonal antibodies. We start from a wild type (WT) sequence of the variable part of an antibody that binds to a target antigen. While this WT antibody is functional (i.e. binds to the target antigen), we want to improve some of its properties: further increase its affinity to the target antigen or reduce the risk of aggregation. Furthermore, we want to improve these properties by performing a small number of mutations, in order to prevent the loss of functionality from the WT antibody. Finally, in order to decrease the antigenicity of the antibody, we aim to generate antibodies that are as human-like as possible. In short, we explore the tradeoffs generative models need to consider to produce candidate antibodies that satisfy properties needed in practical antibody discovery: solubility, lack of antigenicity and strong affinity.

Results

Multiple objective optimization

Each antibody is made of two copies of a heterodimer, each composed of a heavy chain and a light chain. The variable regions of the antibodies contain the complementarity-determining regions (CDR), flexible loop structures that make up a large part of the paratopes (the section of the antibody that is in contact with the antigen). The heavy chain and light chain each contain three CDR regions: CDRH1, CDRH2 and CDRH3 for the heavy chain and CDRL1, CDRL2 and CDRL3 for the light chain. In this work, we will restrict the optimization to the heavy chain sequence for simplicity although our method can easily be modified to include the light chain.

We use a multi-objective optimization approach. We consider a set of biophysical properties $[eqn]$ , assumed to be a function of the antibody sequence $[eqn]$ , which we wish to maximize all at once. We expect trade-offs to emerge since optimizing some properties may come at the cost of others. We look for Pareto optimal solutions, defined as sequences which cannot be improved in any property without being degraded in another one. Combinations of biophysical properties achieved by Pareto optimal sequences form the Pareto front.

We do not have access to the functions $[eqn]$ directly for arbitrary sequences $[eqn]$ . Instead, we will exploit predictive machine learning models $[eqn]$ trained on large datasets to generate sequences that are close to the Pareto front. We are interested in the single-round setting: we assume that once our generative method has selected candidates for us to test, we will be able to validate those candidates using wet-lab experiments only once, with a fixed budget of $[eqn]$ sequences to be tested. Since we do not expect the models $[eqn]$ to perfectly reproduce the desired properties, we wish to generate a diverse set of $[eqn]$ candidates to maximize the probability that at least one of them will pass validation. We further require that these sequences be different from the ones included in the datasets used to train the models.

In the following, we will focus on two main biophysical properties: solubility (sol) and affinity (aff). Our generative process incorporates the humanness of sequences through the distribution of natural antibodies $[eqn]$ , as given by an auto-regressive transformer previously trained on 558 million human heavy chain sequences from the OAS database.^11,12^ We define $[eqn]$ for our generative model to be as close as possible to $[eqn]$ (as measured by the Kullback-Leibler divergence) while maximizing the mean values of $[eqn]$ and $[eqn]$ . This gives us the explicit expression:

[eqn]

[eqn]

where $[eqn]$ is a normalization, $[eqn]$ may be interpreted as a temperature controlling to cloneness to the Pareto front, and $[eqn]$ is a weight controlling the importance of affinity versus solubility in the optimization.

We then use two energy-based generative methods to sample new heavy chain sequences from $[eqn]$ (Figure 1): Metropolis Hastings (MCMC) and the amortized Monte Carlo method GFlowNet.^13,14^ Markov Chain Monte Carlo (MCMC) is a method with theoretical guarantees of convergence, allowing exact sampling from the target distributions. It provides a valid baseline to compare with GFlowNet, a novel deep learning method with the potential to generate sequences faster than MCMC. GFlowNet has been shown empirically to generate better and more diverse amino acid sequence candidates than other generative methods, for example, on the task of generating new anti-microbial peptides.^9^ Details of these procedures are given in the Methods section. Figure 1.Overview of the generative process. A. Our generative model generates the heavy chain sequence of an antibody undergoing lead optimization. The goal is to find a small number of mutations in the CDRs to improve affinity and solubility. B. For humaness, our predictive model is IGLM, an autoregressive transformer which outputs a log probability estimate of the input heavy chain being human. For solubility, we use a convolutional neural network that estimates the per-residue sasa and combines these estimates with hydrophobicity weights. For the affinity we use a Gaussian process in combination with a protein language model encoder. C Energy-based sampling is used to generate mutants of the wild-type sequence with up to $[eqn]$ mutations. The closest sequences to the pareto front are selected for wet-lab validation.: Cartoon of the generative method for optimizing monoclonal antibodies by modifying the CDRs of the heavy chain. Three sequence-based predictors are used to score antibody candidates and energy-based sampling is used to sample high quality candidates.

Below, we apply this approach to generate optimized binders to the HR2 region of the SARS-CoV-2 spike protein peptide (CB-119) whose amino acid sequence is PDVDLGDISGINAS.

Affinity model

Antibody-antigen affinity is often defined in terms of the dissociation constant ( $[eqn]$ ) between an antibody and its antigen. High affinity means low $[eqn]$ , so we define $[eqn]$ . In order to train our affinity prediction model, we use the dataset from Engelhart et al.^15^ A library of variants was built by first identifying a pair of heavy and light chains that produce an antibody (Ab-14) that binds to CB-119. The heavy chain of Ab-14 has 33 CDR amino acids that were modified to generate 22,000 mutants: 594 single mutants were generated by doing a saturated deep mutational scan over the CDRH1, CDRH2 and CDRH3. In addition, 3,671 double mutants and 22,188 triple mutants were generated by performing random substitutions in the CDRs. The affinity of each sequence was measured 3 times using the AlphaSeq assay to compute their $[eqn]$ with respect to the CB-119 SARS-CoV-2 peptide. Of the 26,453 sequences, only 13,921 sequences had affinity measurements. We approximate their true $[eqn]$ as the average over replicates and remove the sequences for which there are no measurements.

For our affinity prediction method, we use a Gaussian Process. The Gaussian process assumes that outputs (affinities) are drawn from a multivariate Gaussian distribution whose covariance between two outputs depends on the distance between their input sequences $[eqn]$ , and uses Gaussian integration rules to predict the posterior for the affinity of any sequence $[eqn]$ as a function of the training data:

[eqn]

where $[eqn]$ is the training dataset.

We define multiple models using different embeddings for the amino acid sequences and use a Radial Basis Function (RBF) as our kernel function (see Methods).

The parameters of the kernel function ( $[eqn]$ and $[eqn]$ ) are fitted by minimizing the marginal likelihood loss function (see Eq. 14) for 600 steps using a learning rate of $[eqn]$ (see Methods for details). We utilize the Gpytorch^16^ framework to implement the Gaussian Process and fit the parameters.

To score sequences, we define four different versions of the affinity function to be used in the multiple-objective optimization: $[eqn]$ , with $[eqn]$ . When $[eqn]$ is positive we get back the Upper Confidence Bound (UCB) acquisition function^17^ which gives priority to sequences with high uncertainty and encourages exploration of the sequence space. When $[eqn]$ is negative, we obtain the Lower Confidence Bound (LCB) acquisition function, a pessimistic estimate of the affinity of the sequence. Using LCB encourages the model to generate sequences close to the ones in the training set.

We split the dataset into a training and a validation set. We refer to the training set as the set of sequences that are used to fit the parameters of the kernel function and to make a prediction. In order to test the ability of the Gaussian process to correctly predict the affinity to CB-119, we first built a training set of 80% of D and validate the model on the remaining 20%. The Pearson correlation score between the log of the measured association constant $[eqn]$ and the predicted affinity $[eqn]$ on the validation set for the best choice of embedding and when training on $[eqn]$ of the data is $[eqn]$ (Figure 2A) showing that the model is able to discriminate between high- and low-affinity sequences. Figure 2.A. Density plot showing results from predicting kd using Gaussian process B. Pearson’s correlation coefficient for different choices of embedding and different training set sizes C. Scatter plot comparing hic rt to solubility score for 83 monoclonal antibodies (not shown are two outliers with HIC RT equal to 25). The Spearman correlation coefficient and p-value are computed with the outliers. D. Spearman’s correlation score for different solubility methods. HW + SASA refers to the method we retain for the generative process, HW is similar to HW + SASA except the predicted sasa for each residue is set to 1. Camsol refers to the Camsol method^18^ and gravy refers to the biopython hydrophobicity score function.: A. A scatter plot showing the relationship between KD predicted by a Gaussian Process and measured KD, there is a pearson correlation of 0.58. B. Bar plot comparing the pearson correlation for different protein language model embedders (None, ESM2, Antiberty). ESM2 shows a slight improvement over no-embeddings or Antiberty. C. Scatter plot showing the correlation between solubility score and hydrophobicity integration chromatography retention time measured for 83 antibodies. There is a spearman correlation of 0.4. D. Bar plot comparing the spearman correlation for different solubility scoring methods. The best correlation is obtained by using a linear combination of hydrophobicity weights and solvent accessible surface area.

Using the protein language model (PLM) ESM2^19^ for the embeddings gives a modest improvement over the simpler one-hot vector encoding of the sequences (Figure 2B). We observe no differences between using a smaller version of ESM (ESM-T6) versus a larger network (ESM-T30). Using an antibody-specific protein, language model (Antiberty)^20^ gives worse results than a more general PLM (Figure 2). We asked whether reducing the amount of training data significantly decreases the performance of the model. When we reduce the size of the training set to only 30% of the data, the Pearson correlation coefficient decreases by only $[eqn]$ when using the ESM-T6 PLM to compute the embedding. Since the running time for the prediction of a Gaussian Process is quadratic in the number of sequences in the training set, evaluating new sequences during the generative processes, we use the ESM-T6 embeddings and train the Gaussian Process on $[eqn]$ of the dataset to limit the computational cost. In addition, we also considered adding out-of distribution-negative examples in order to try to improve the quality of the model. To do this, we generated new mutants by replacing the CDRs of the WT with randomly generated sequences. The CDRs were generated by drawing uniformly over the set of amino acids for each of 33 possible positions. The assumption we use is that by randomizing the CDRs it is very likely that the sequences we generate are far away (in terms of hamming distance) from the WT and are therefore likely to be non-binders. We therefore assign to each sequence a large $[eqn]$ . We generated 14,000 such negative examples and retrained a separate model using the ESM-T6 embeddings where the training dataset contains the original 80% training split and the generated negative examples. We validated the performance of the model on the original 20% training split and found a drop in performance from 0.58 to 0.55 pearson correlation (see SI Fig S15) compared to not using the negative data. Therefore, we decided to use the model not trained on the negative examples for the rest of the paper.

Solubility model

The solubility of a monoclonal antibody is related to different factors that can impact the efficacy of the drug. Antibodies can aggregate when stored in an aqueous solution which will result in a painful reaction when the drug is administered to the patient. In addition, aggregates can become a target for the immune system and increase the antigenicity of the drug. Hydrophobic interaction chromatography (HIC) estimates the solubility of antibodies by measuring the time it takes the antibody to cross a buffer. Stronger hydrophobic amino acids result in longer times.

We use a method developed by Jain et al.^21^ to estimate the solubility of an antibody based on its sequence. Given a sequence of amino acid of length L: $[eqn]$ , the solubility score of the sequence is defined as

[eqn]

where $[eqn]$ is the residue solvent-accessible surface area of amino acid $[eqn]$ at position $[eqn]$ , and $[eqn]$ is one of the 20 hydrophobic weights describing the hydrophobicity of the amino acid $[eqn]$ . We use the solubility weights provided in Jain et al.^21^ and train a convolutional neural network to predict the per-residue SASA score of each amino acid in an antibody heavy chain sequence.

Since the per-residue SASA score is not available from the sequence alone, Jain et al. compute the per-residue SASA for 902 antibody structures identified from the RCSB PDB database using the Shrake-Rupley algorithm^22^ and train a random forest regressor to predict the per-residue SASA of residues in the variable region from the sequence alone. They then use a private dataset of 5000 antibody sequences for which the HIC retention time (RT) was measured to learn the $[eqn]$ weights using logistic regression. We re-implemented their method but replaced the random forest regressor with a deep convolutional neural network NanoNet,^23^ a structure prediction method for nanobodies, and included more structures taken from the SAbDab database.^24^ Since this architecture is able to accurately predict the structure of nanobodies, we hypothesized it could predict the per-residue SASA of the VH of antibodies. We retrained the model from scratch while modifying the last layer of the network to output a single value corresponding to the SASA score for each residue instead of the 15 values corresponding to the coordinates of the backbone residues (see Methods A). We also trained a second version of the model that takes as input both the VH and VL of the antibodies to predict their per-residue SASA. Using the output of this model, we built a secondary solubility score that considers both VH and VL by taking the sum of the solubility scores (Eq 4) of both chains.

We use a dataset containing 137 clinical-stage antibodies for which the HIC RT was computed.^1^ We used the Thera-SAbDab database^24^ to identify the antibodies in this dataset whose structures were used to train the SASA prediction model and excluded them, leaving us with 85 clinical-stage antibodies. The Spearman correlation coefficient between the HIC RT and the predicted solubility score on the 85 clinical-stage antibodies is 0.40 (Figure 2C). Using both chains to predict the solubility score leads to a degraded correlation of 0.35. The inability of that more informative model to improve performance may be an artifact of the small test set size (85 antibodies). We compare the performance of our solubility model to other models used to evaluate the solubility of antibodies (Figure 5D). Removing the SASA in the predictive models leads to a drop in the Spearman correlation coefficient to $[eqn]$ . The SASA model we use is comparable in performance to the commonly used CamSol^18^ (0.4 vs 0.33),^25^ a sequence-based model that returns a hydrophobicity score for every amino acid in the sequence that can be averaged to produce an overall score. Finally, our method performs much better on this test set than the Gravy biopython score^26^ ( $[eqn]$ vs $[eqn]$ ). For the generative tasks in this paper, we only design the VH of antibodies, i.e., the VL is always fixed. Therefore, since we observe no obvious improvement in the quality of the score when considering both chains, and to limit the computational cost of the solubility model, we use the model that takes into consideration only the VH.

Pareto optimal binders to CB-119 peptide

To generate a diverse set of antibodies binding the CB-119 peptide and optimized for solubility, we generate 20 sets of sequences with 5 different weights $[eqn]$ and the 4 choices of $[eqn]$ . We choose this set of weights for $[eqn]$ empirically, by starting with an affinity weight of 1.0 in order to guarantee to generate sequences with a high predicted affinity to the target and then slowly increasing the solubility weight during subsequent generations until the predicted affinity of the generated sequences was significantly lower (approximately 2 orders of magnitude). We found that an affinity weight of 0.85 worked well for this purpose. Finally, we ran three additional generation procedures with affinity weights of 0.875, 0.9 and 0.95 to obtain a smooth empirical Pareto Front. The fact that the solubility weight always remains low has to do with the scales of the affinity and solubility score. These choices would most likely change if we had considered different properties to optimize. We restrict ourselves to sequences of at most six mutations away from the wildtype (WT) AB-14 (see Discussion).

We define the distance to the Pareto front as:

[eqn]

where $[eqn]$ and $[eqn]$ are the variances of the affinities and solubilities of the generated sequences. Top sequences are defined as the $[eqn]$ sequences with smallest $[eqn]$ , where $[eqn]$ is the number of sequences we wish to generate and generally correspond to a budget of sequences that can be experimentally validated following the generation process. To test the reproducibility, we ran the method 3 times with different random seeds (replicates 1–3). Additionally, since our affinity model uses only 30% of the CB-119 peptide dataset, we created three different affinity models, each trained on a different 30% subset (dataset splits 1–3). This tests whether the optimal sequences generated are consistent across different training data subsets or vary based on which sequences the affinity Gaussian Process uses for inference.

Figure 3 A,B show the empirical Pareto fronts from the generative process for an inverse temperature of $[eqn]$ , $[eqn]$ , using dataset split 1 for the Gaussian Process. The Pareto fronts for $[eqn]$ , $[eqn]$ = 1 and $[eqn]$ = −1 are included in SI Figure S4. The Pareto fronts generated by Metropolis-Hastings sampling and the GflowNet are similar for $[eqn]$ and $[eqn]$ , suggesting neither method is better than the other one at finding optimal sequences. For $[eqn]$ and $[eqn]$ , Metropolis-Hastings sampling generates sequences with higher affinity and solubility than the GFlowNet. We hypothesized that this gap in performance by GFlowNet could be bridged by additional training. We trained a new GFlowNet model for 16,000 training steps (8000 more than previously) for $[eqn]$ and $[eqn]$ (see SI Fig S18). Our results show that extended training does not improve the networks according to the convergence measure we use (see Methods). Indeed, from 8000 to 16,000, the convergence criteria stay flat for every condition we consider. Furthermore, the empirical Pareto fronts generated by MCMC continue to dominate those of GFlowNet for both beta = 1 and beta = 2. Furthermore, we surprisingly observe that the empirical Pareto fronts of the GFlowNet trained for 16,000 steps are dominated by the empirical Pareto fronts of the GFlowNet that was trained for 8000 steps, in particular, for more soluble sequences, suggesting that more training may be harmful (perhaps due to overfitting). This suggests that the performance gap cannot be closed simply by increasing training time. Instead, more powerful neural network architectures and more thorough hyper-parameter tuning may be required to bridge the gap with MCMC. For $[eqn]$ , the generated sequences that have the highest affinity have a solubility score of around $[eqn]$ while the generated sequences that have the highest solubility have an affinity of $[eqn]$ nM. As a point of comparison, the 83 therapeutic antibodies on which we evaluated our solubility prediction method have an average solubility score $[eqn]$ of $[eqn]$ . The shape of the Pareto fronts shows a trade-off between the solubility and the affinity score. The Pareto front obtained using the predictive models suggests that increasing the solubility score from $[eqn]$ to $[eqn]$ , a solubility score higher than any of the therapeutic antibodies used to validate the model, decreases the affinity by 1 order of magnitude. Figure 3.A. Density plots of 1000 sequences sub-sampled from the set generated by Metropolis-Hastings at inverse temperature $[eqn]$ and $[eqn]$ , using dataset split 1. B. Density plots of 1000 sequences sub-sampled from the set generated by the GFlowNet at inverse temperature $[eqn]$ and $[eqn]$ . C. Comparison of diversity based on generative method and choice of $[eqn]$ . D. Comparison of novelty based on generative method and choice of $[eqn]$ .

To evaluate the novelty and diversity of a set of generated sequences $[eqn]$ , we follow Jain et al.^9^ and define a diversity index given by the mean Hamming distance within the generated dataset:

[eqn]

where $[eqn]$ is the Hamming distance between $[eqn]$ and $[eqn]$ , and $[eqn]$ . Novelty is defined as the mean distance to the original dataset:

[eqn]

where $[eqn]$ is the dataset of sequences for which affinity was measured. Figure 3C,D shows the diversity and novelty of the sequences generated for both methods and the three choices of $[eqn]$ .

The average mean pairwise Hamming distance of the generated sequences is between $[eqn]$ and $[eqn]$ depending on the choice of $[eqn]$ and sampling method (Figure 3C), compared to the maximum possible value of $[eqn]$ since the exploration is limited to the space of sequences that are at most six amino acids away from the wildtype sequence. Novelty shows an expected dependence on the exploration parameter $[eqn]$ : it ranges from $[eqn]$ for a pessimistic acquisition function $[eqn]$ to $[eqn]$ for an exploratory acquisition function $[eqn]$ . Both MCMC and GFlownet sampling methods generate a diverse and novel set of sequences. In addition, the continuity of the cloud of points shows that the methods can generate points all along the Pareto front from which we can pick the sequences with the desirable trade-off between affinity and solubility.

The inverse temperature parameter $[eqn]$ provides a balance between diversity and optimality. A low value will lead to more diversity in the generated set but may prevent the method from generating sequences in or near the Pareto optimal set. To verify that $[eqn]$ is sufficiently high to generate Pareto optimal sequences, we generated $[eqn]$ new sets of sequences with $[eqn]$ and $[eqn]$ using MCMC sampling for each choice of $[eqn]$ and $[eqn]$ , using dataset split 1. We then combined all sequences generated for different choices of $[eqn]$ into a single set and identified the Pareto optimal set shown in SI Fig S1. Decreasing the temperature does not generate more optimal sequences, suggesting that an inverse temperature of 20 is sufficient to generate the Pareto optimal set.

Using IgLM as a prior constrains the optimization to sequences that are more human-like. To investigate the cost of this constraint, we generated sequences without using IgLM as a prior and compared the empirical Pareto fronts to empirical Pareto fronts that use the IgLM as a prior (see SI Fig S17). We observed that Pareto fronts generated without the humanness prior slightly dominate the ones with the prior, particularly in terms of solubility, suggesting that the prior does modestly constrain the search. In exchange for this slight loss in affinity and solubility, constraining the model to generate human-like antibodies can help optimize for other important properties that can be difficult to quantify while reducing the introduction of unnatural mutations that prevent the antibody from properly folding. However, the overall trade-off appears minor, likely because all mutations in this task are restricted to the CDRs, which are more adaptable regions. Examining this trade-off in a setting where mutations occur in the framework region, where structural constraints are stronger, would be especially interesting and is a promising direction for future work.

We evaluated the sequences generated using two other developability scores reported in the previous work to evaluate generative methods^26,27^: the average charge and instability. For the instability measure, we use the instability_index() Biopython package function, which takes as input an amino acid sequence and returns a real value instability score.^28^ A protein is considered unstable if the instability score is less than $[eqn]$ . To estimate the charge of the antibody, we sum the contributions from all amino acids in the heavy chain sequence, adding $[eqn]$ for positively charged amino acids R and K, $[eqn]$ for all H and $[eqn]$ for negatively charged amino acids D and E. A good candidate antibody should have a charge score to be between $[eqn]$ and $[eqn]$ . The results for $[eqn]$ and all choices of $[eqn]$ are shown in Figure S3. The sequences generated by our method have an average charge score within the desired range, with $[eqn]$ for $[eqn]$ and $[eqn]$ for $[eqn]$ . The standard deviation is approximately 1 for all choices of $[eqn]$ and we observe that a small fraction of the sequences have a charge score higher than the limit of 2. This is likely due to the fact that positively charged amino acid are more hydrophilic and the generative model can generate more soluble antibodies by adding positively charged amino acids to the antibody sequence. In addition, we observe that for all choices of $[eqn]$ , the instability score remains on average slightly below $[eqn]$ with a standard deviation of $[eqn]$ , with the vast majority of sequences having an instability score below the threshold of $[eqn]$ . In conclusion, these results suggests that our method is able to generate functional proteins, even when evaluated on methods that were not included in the multi-objective optimization function.

In addition, empirical Pareto fronts from three independent MCMC replicates are almost identical within any given train and validation split, demonstrating robustness to sampling noise. By contrast, the fronts diverge markedly across the three splits (Figure S5), although in every case the affinity-solubility trade-off persists. Accordingly, the sets of sequences generated through each split are different. Top 1000 sequences from MCMC replicate from the same split overlap by 37–58%, whereas generated sets from different splits share no more than 10% overlap (Figure S11A). Thus, the generative process is robust within, but not across, data partitions.

When using the RBF kernel, GP models will predict a high affinity to sequences that are close (in Euclidean distance), in the embedding space, to high-affinity training examples. Since those examples vary by split, so do the generated sequences. To verify this, we compared residue-frequency logos of the 15 top binders in each split and the 1 000 generated sequences per replicate, covering the 33 CDR positions (Figure S6; SI Figures S7, S8). Across all splits, the same positions tend to mutate, but specific substitutions differ, depending on what sequences were included in each split. Many solubility-enhancing hydrophilic substitutions (e.g. E or K) appear at variable sites in the training set, meaning that the generator uses those sites to balance solubility and affinity. These mutations confirm that the method does more than just copy beneficial mutations in the training set.

In summary, the generative procedure is robust to sampling variations yet sensitive to how a limited training set is partitioned. Expanding and diversifying the experimental dataset could reduce split-to-split variability and yield more consistent affinity-solubility Pareto fronts, however the amount of data required to find such a Pareto front may be prohibitive. As we will see in the next section on synthetic data, this lack of reproducibility across data splits does not prevent the method from identifying good candidate antibodies.

Validation on synthetic dataset

We have demonstrated the ability of energy-based generative methods to generate Pareto optimal sequences with respect to the proxy functions $[eqn]$ and $[eqn]$ . However, we are not able to experimentally evaluate the $[eqn]$ of HIC of the sequences to verify that the sequences generated are actually functional. To circumvent this limitation and test the validity of the method, we designed a synthetic dataset on which to test our approach.

We define an epistatic affinity model as:

[eqn]

where $[eqn]$ is a wild-type sequence, $[eqn]$ is the change in $[eqn]$ after mutating the amino acid at position $[eqn]$ to $[eqn]$ away from its wildtype value, and $[eqn]$ is the change caused by mutating two amino acids at position $[eqn]$ and $[eqn]$ to $[eqn]$ and $[eqn]$ away from the wildtype. This epistasis model was previously used to estimate the affinity of antibodies to fluorescein based on a dataset similar in composition to the CB-119 peptide dataset studied above^29^ and fits within the broader class of epistatic models decomposed by the order of interaction, here truncated at second order-check Phillips et al.^30^ and Ranganathan review^31^ on epistasis.

Using this framework, we define two synthetic affinity models in which $[eqn]$ and $[eqn]$ are sampled from a Gaussian distribution. The first model, referred to as the “simple” epistasis model has $[eqn]$ and $[eqn]$ . The second “hard” model has $[eqn]$ and $[eqn]$ . The probability that a random mutation will be deleterious is higher in the hard case than in the simple case due to the larger impact of the J terms compared to the H terms, which motivates the hard/simple terminology.

To generate our synthetic dataset, we use the same wildtype sequences as in the SARS-CoV-2 affinity dataset and compute the affinities (Eq 8) of 14,660 random variants containing all 660 single mutants, 2100 double mutants and 11,900 triple mutants, comparable to the affinity Engelhart dataset.^15^ Each mutant was generated by starting from the wildtype sequence AB-14, randomly selecting a number of positions that we want to mutate and performing a random substitution at each of these positions. We further add some random noise $[eqn]$ to each measurement to mimic the experimental error. We generated three such datasets using different random seeds, which we refer to as dataset splits 1, 2 and 3 ( $[eqn]$ , $[eqn]$ , $[eqn]$ ) by analogy with the real data. We use this data to train a Gaussian Process as the new predictive affinity model $[eqn]$ , and generate sequences optimized for that objective as well as $[eqn]$ . For each of the two tasks, we set the inverse temperature to $[eqn]$ and generate $[eqn]$ sets of sequences with $[eqn]$ and $[eqn]$ . We initially tried an inverse temperature of $[eqn]$ , however we found that value to be too high and the process failed to generate diverse sequences. We lowered the inverse temperature to 10 to more closely match the behavior we observed when generating binders to CB-119. For each dataset split, task and choice of $[eqn]$ , we repeated the sampling procedure 3 times, which we refer to as replicates 1, 2 and 3.

Figure 4 shows the $[eqn]$ sequences closest to the empirical Pareto front for the simple epistatic model for replicate 1, as well as the three empirical Pareto fronts generated for all three replicates, all for dataset split 1. These sequences were generated using MCMC sampling for the four choices of $[eqn]$ . The results for the hard epistasis model are shown in Figure S9. Results for dataset splits 2 and 3 are also shown in Figure S9. We observe that for all choices of $[eqn]$ , the method is able to generate a diverse set of sequences along the empirical Pareto front. As with the real data, generating sequences multiple times under the same conditions leads to finding very similar empirical Pareto fronts. We do observe that for the first dataset split, for $[eqn]$ , the method has difficulty generating diverse solutions (only 17 distinct sequences are generated); however, this is not an issue for the other dataset splits. One possible solution might be to increase the temperature for this particular setting. Figure 4.A. Density plots of 1000 sequences sub-sampled from the set generated by Metropolis-Hastings at inverse temperature $[eqn]$ and $[eqn]$ on the simple synthetic task using dataset split 1. B. Density plots of 1000 sequences sub-sampled from the set generated by Metropolis-Hastings at inverse temperature $[eqn]$ and $[eqn]$ on the simple synthetic task. C. Density plots of 1000 sequences sub-sampled from the set generated by Metropolis-Hastings at inverse temperature $[eqn]$ and $[eqn]$ on the simple synthetic task. D. Density plots of 1000 sequences sub-sampled from the set generated by Metropolis-Hastings at inverse temperature $[eqn]$ and $[eqn]$ on the simple synthetic task.

We checked the overlap between the top 1000 generated sequences of each pair of dataset splits and replicated combinations (Fig. S11). Like for the real data on CB-119, we find that there is almost no overlap between sets of sequences generated using different dataset splits, although sets of sequences generated using the same dataset split have some overlap (7−16%). This suggests that this lack of overlap does not reveal issues with the generated sequences but is rather an expected feature of the procedure.

In order to investigate the usefulness of our method for antibody lead optimization, we consider the task of generating antibodies with predicted solubility score $[eqn]$ greater than a threshold $[eqn]$ and with a synthetic affinity $[eqn]$ greater than a threshold $[eqn]$ . For each choice of $[eqn]$ , we take the set of generated sequences using MCMC sampling and select the ones with $[eqn]$ . We rank the selected sequences by $[eqn]$ and keep the top $[eqn]$ sequences, where $[eqn]$ is the budget of sequences that can be validated with wet lab experiments. For every threshold $[eqn]$ we count the number of sequences, out of the $[eqn]$ sequences generated and selected, with synthetic affinity $[eqn]$ . We choose thresholds $[eqn]$ (no solubility threshold) and $[eqn]$ to explore the differences when optimizing for both affinity and solubility vs simply affinity, and a budget of $[eqn]$ . $[eqn]$ is sufficiently high to demonstrate a difference not having a solubility threshold while still generating valid sequences. We repeated this process for each dataset split and replicate. We found that results for each replicate were very similar given each dataset split and therefore only show the results for the first replicate. The results for the simple and hard task for the first dataset split are shown in SI Figure 5. The results for the other dataset splits are included in SI Figure S12.Figure 5.Figure A. shows the number of sequences out the top 500 selected from the set of sequences generated with Metropolis Hastings when trying to optimize the simple epistasis model with an affinity above a certain thresholds for different choices of $[eqn]$ , using dataset split 1, for the first replicate. We compare those results to the set of sequences in the training set, indicated as initial. B shows the percentage of sequences that are above a certain threshold and also have a predicted solubility score above 4. Figure C. and D. show the results when trying to optimize the hard epistasis model.

Using the synthetic dataset, we investigate the most appropriate choice of $[eqn]$ for lead optimization. We observe that for the simple epistatic model, both with and without a solubility threshold, the best choice was $[eqn]$ . With no solubility threshold (Figure 5A,B, SI Figure S12 A,B,E,F), the best choice was $[eqn]$ , although $[eqn]$ performs almost as well. While $[eqn]$ and $[eqn]$ always perform worse, all four choices of $[eqn]$ generate more optimal sequences than the training set. These results suggest that using an optimistic choice of $[eqn]$ helps to discover more optimal sequences, since adding additional mutations is more likely to improve the affinity.

For the hard epistatic model, where it is less likely for mutations far from the wildtype to be beneficial, the story is different. Looking at the results for the first dataset split (Figure 5C,D), we again find that using $[eqn]$ is the best choice, for both solubility thresholds. However, for the first split, the method fails to generate many diverse sequences for $[eqn]$ , which explains the low number of optimal sequences found for this choice. On the other hand, looking at dataset split 2 and 3 (SI Figure S12 C,D,G,H), we find that for both solubility thresholds it is best to use a pessimistic $[eqn]$ . However, even those choices fail to generate sequences with affinities $[eqn]$ as high as in the simple epistatic model, dropping to 0 for $[eqn]$ . We conclude that in a setting where beneficial mutations are rare, using a pessimistic $[eqn]$ is better.

Another advantage of the energy-based framework is the possibility to balance optimality with diversity of the sequences generated by controlling the inverse temperature. To explore the trade-off between maximizing diversity and maintaining affinity, we generated sequences on the both synthetic task while lowering the inverse temperature $[eqn]$ from 10 to 5 and 2. We evaluated the mean synthetic affinity of the top 500 sequences generated as well as their diversity and novelty (see SI Fig S16). The results show a clear trade-off: sequences generated with a higher inverse temperature achieve higher average synthetic affinity but exhibit lower diversity, while lowering the inverse temperature produces more diverse sequences at the cost of reduced affinity.

Comparison to other methods

We can use the synthetic dataset to compare the performance of our approach to previous methods against the ground truth. Khan et al.^26^ define the multi-objective optimization of antibodies as a constrained affinity optimization problem. They use a local search algorithm similar to hill climbing as the inner loop of an active learning framework to optimize the CDRH3 of antibody heavy chain in order to increase their affinity according to a synthetic affinity function computed with the software Absolut!,^32^ as well as three developability properties (net charge, repetition of amino acids and presence of a glycosylation motif). For each developability property, the algorithm requires an interval of valid values to be specified. Sequences that fall outside of these developable regions are automatically rejected by the search algorithm.

In order to compare the local search algorithm from AntBO to our approach, we re-implemented their method and ran it using two different sets of developability restrictions. In the first case, we limit the exploration to sequences with Hamming distance less than $[eqn]$ to the wild type and restrict the search to sequences with $[eqn]$ . We chose −120 because all the sequences generated by MCMC and the GflowNet have $[eqn]$ . In the second case, we add the additional restriction that the sequences must have a solubility predicted score $[eqn]$ . For both cases, we ran the local search algorithm $[eqn]$ times in parallel for $[eqn]$ steps and kept the last sequences of each run as the set of generated sequences.

In addition, as a negative control, we also compared our method to a set of $[eqn]$ randomly generated sequences $[eqn]$ . We generated these sequences by starting with the best mutant in $[eqn]$ . We then randomly select $[eqn]$ positions that have not yet been mutated. For each of these positions, we perform an amino acid substitution with an amino acid uniformly drawn from the $[eqn]$ other amino acids.

We compared the initial mutants $[eqn]$ , the randomly generated mutants $[eqn]$ , and the sequences generated with the local search of AntBO to sequences generated with our method using MCMC sampling and the GflowNet (Figure 6). For both MCMC and the GflowNet, for each choice of $[eqn]$ we combined all the sequences generated for both choices of the affinity weight $[eqn]$ into a single set. For each set of generated sequences, we remove all sequences with $[eqn]$ for $[eqn]$ and $[eqn]$ . Then, we select the top $[eqn]$ sequences according to the acquisition function used during the generation process. We then compute the percentage of sequences with a synthetic affinity $[eqn]$ higher than a chosen threshold $[eqn]$ . Figure 6.Figure A shows the number of sequences out the top 500 selected from the set of sequences generated for different choices of $[eqn]$ and generative method (Metropolis Hastings and GFlowNet) for dataset split 1, for the first replicate. We compare those results to the set of sequences generated by the local search procedure of AntBO and the set of sequences used to train the Gaussian process affinity predictor, indicated as initial, as well as a set of randomly generated sequences in the neighborhood of the best sequence in $[eqn]$ indicated as random. B shows the percentage of sequences that are above a certain threshold and also have a predicted solubility score above 4. Figure C and D show the results when trying to optimize the hard epistasis model.

For the simple epistatic model with no solubility threshold, MCMC samplings with $[eqn]$ and $[eqn]$ are the best choices of parameters and outperform the AntBO local search, except for the dataset split 2 (Figure 6A, SI Fig S13 A, E). However, we observe that the random sequences contain more optimized sequences than any of the sequences generated by our method. This suggests that the task is quite easy. For the simple epistatic model with $[eqn]$ , MCMC sampling with $[eqn]$ , $[eqn]$ and $[eqn]$ outperforms the AntBO local search (Figure 6B, SI Fig S13 B, F). In addition, MCMC sampling performs better than GFlowNet for similar values of except for the second dataset split where the GFlowNet performs the best with $[eqn]$ than the MCMC. All of the methods do better than the random baseline and the initial dataset. For the simple epistatic model, regardless of solubility thresholds, we observe once again that using a positive $[eqn]$ is more beneficial although if $[eqn]$ is too high, the performance starts to decrease a little on average.

For the hard epistatic model with no solubility threshold, the best method is MCMC sampling with $[eqn]$ for the first dataset split and $[eqn]$ for dataset split 2 and 3 (Figure 6C, SI Fig S13 C, E). As previously mentioned, the results for the first split are likely due to the small number of unique sequences generated. For the second and third split, the GflowNet with $[eqn]$ performs almost as well as MCMC. Finally, both MCMC sampling and Gflownet outperform the local search of AntBO, the initial dataset and the random baseline. Finally, on the hard epistatic model with $[eqn]$ , first the first split, we find that for an affinity threshold $[eqn]$ , the GFlowNet with $[eqn]$ generates the highest number of valid sequences. For the second split (SI Fig S13 D) the GFlowNet with $[eqn]$ also performs well although both MCMC and GFlowNet with $[eqn]$ are not far behind. Finally, for the third split, the best method is MCMC with $[eqn]$ with the GflowNet with $[eqn]$ also doing well. Once again, with these settings, our method outperforms AntBO, the random baseline and the initial dataset $[eqn]$ .

These results serve to demonstrate that our energy-based sampling method performs better than constrained optimization in a variety of settings. In particular, we find that the local search algorithm has more difficulty when generating sequences optimized for the hard epistasis affinity function than for the simple epistasis affinity function. In addition, we find that using optimistic choices of $[eqn]$ is better for the simple epistatic model whereas using a pessimistic $[eqn]$ is more appropriate when we expect that most mutations will be deleterious.

Validation using AlphaFold3

To further validate the method beyond synthetic data, we come back to the sequences generated from the models trained on CB-119 binding data and use AlphaFold 3 (AF3)^33^ to assess whether the designed antibodies are likely to bind their target. AF3’s ipTM score has previously been proposed as a proxy for affinity.^34^ The main idea is that AF3 can only be confident about the relative orientation of the structure of the peptide relative to the heavy and light chain if the antibody binds to the peptide.

We first tested the validity of the ipTM score as a predictor of affinity by computing it for 60 antibodies from the original CB-119 data set: the 20 top binders, the 20 worst binders, and 20 random antibody sequences. The ipTM correlated modestly with measured affinity (Spearman $[eqn]$ , Figure 7A). However, defining binders as antibodies with $[eqn]$ nM, and using an ipTM threshold of 0.7 for calling a positive, gives a precision of 0.81 and a recall of 0.38. We thus use the threshold of ipTM = 0.7 for calling binders to assess the performance of the method. Figure 7.A. Scatter plots comparing the measured affinity to the CB-119 peptide ipTM of 60 antibodies selected from the first split of the CB-119 dataset. “Best” refers to the 20 sequences with the lowest $[eqn]$ , “worse” the 20 sequences with the highest $[eqn]$ and “random” refers to the 20 sequences selected at random. B. Plot showing the 60 sequences closest to the empirical Pareto front generated by the first replicate for $[eqn]$ . C. Plot showing the number of generated sequences with CB-119 ipTM higher than certain thresholds, using no solubility threshold. D. Plot showing the number of generated sequences with CB-119 ipTM higher than certain thresholds, using a solubility threshold of 4. E, F. Structural diversity and novelty of the 60 selected for different choices of $[eqn]$ . We compute the diversity and novelty using Eq. (6) and (7), replacing $[eqn]$ by the RMSD of the carbon- $[eqn]$ atoms of the CDRs of the heavy chain between structures x and x,’ after aligning them over the entire heavy chain. For the novelty, the dataset used as $[eqn]$ contains the 60 predicted structures from the “best,” “worse” and “random” sequences.

For each $[eqn]$ we took the 60 sequences closest to the front (split 1, replicate 1) and scored them with AF3’s ipTM. With $[eqn]$ = 0 (Figure 7B) the ipTM = 0.7 line cleanly separates high-affinity, lower-solubility designs ( $[eqn]$ ) from predicted non-binders but more soluble ones, implying that binders with $[eqn]$ are hard to obtain. Similar trends for $[eqn]$ = −1, 1, 2 appear in Figure SS14A-C.

Counting designs above various ipTM thresholds (with or without $[eqn]$ ) shows that $[eqn]$ matters little without the solubility cut (Figure 7C). With $[eqn]$ , $[eqn]$ and $[eqn]$ = 2 yield the most high-ipTM designs. Both choices appear to have an advantage over $[eqn]$ and 1 due to higher number of sequences generated with $[eqn]$ (12). However, $[eqn]$ also generates many non-binders with high solubility scores whereas $[eqn]$ generates very few sequences with $[eqn]$ . In addition, both $[eqn]$ and $[eqn]$ generate more binders than the local search procedure of AntBO. Among designs with ipTM $[eqn]$ , the average pairwise CDR-loop RMSD is $[eqn]$ Å, and the average novelty score to the predicted structures from the CB-119 dataset is 0.15–0.17 Å (Figure 7E-F).

In summary, the method generates many antibodies that are predicted to be good binders according to Alphafold 3. Despite the sequence variety, these antibodies have similar CDR conformations. To get a greater structural diversity would require a more heterogeneous training set.

Discussion

Data limitations and choice of models

We begin by justifying the design choices made when designing this method. First and foremost, data limitations are inherent to the task of antibody optimization. In this paper, we use a dataset that contains $[eqn]$ sequences. While this number may vary depending on the available resources, that number is usually a tiny fraction of the space of sequences we wish to explore. Even limiting the search to sequences six amino acids away from the WT, leaves $[eqn]$ possible candidates. Given the complexity of the task of predicting the affinity of an antibody to a target antigen based on sequence information alone, training a model on the dataset that generalizes well to the entire search space is hard. For this reason, we chose to use a Gaussian Process as this method allows us to directly estimate the epistemic uncertainty of our prediction. Furthermore, due to the limited training set size, when using a protein language model to compute the embeddings of sequences, we chose not to do any fine-tuning of the model to avoid potential over-fitting due to the large number of parameters these models use.

For the solubility model, in absence of a large exploitable dataset for machine learning, we opted to reuse the method developed by.^21^ Since their method relies on first training a SASA prediction method on publicly available structures of antibodies, we were able to implement our own SASA prediction model and then reuse the 20 hydrophobicity weights learned by^21^ on their private dataset containing 5000 pairs of antibody sequences and their HIC RT.

Review of prior approaches

A number of methods for the generation of optimized antibody sequences have previously been proposed. Like our method, AntB0^26^ performs multi-objective optimization by modifying the heavy chain of antibodies. They seek to optimize the affinity estimate given by Absolut! as well as three developability properties. Another similar approach^35^ also performs multi-objective optimization and optimizes both affinity and thermostability. Our approach to multi-objective optimization is different from Ref.^35^ and Ref.^26^ Both of these methods define hard constraints for the developability properties. For example, AntBO requires the charge of the generated sequences to be between −2 and 2. The method in Ref.^35^ requires the predicted melting temperature to be above 60 or 65 $[eqn]$ C. We do not use hard constraints but instead use an energy function made up of a linear combination of the predicted properties and sampled from the Boltzmann distribution defined by this energy function. The advantage of this approach is that it does not require prior knowledge of constraints for each property. In the case of the solubility prediction, for example, the HIC RT can depend on experimental conditions and defining a precise threshold for which antibodies are soluble or not is often not simple. Ref.^36^ developed an active learning loop procedure to optimize the affinity of antibodies based on the predicted free energy of the antibody-antigen complex using the Schrödinger software, using a standard multi-round active learning framework without constraining the search space. Unlike us, they do not perform multi-objective optimization, but they use a Gaussian process as a predictive model for the Schrödinger $[eqn]$ prediction similar to our methods and^35^ and Ref.^26^ These three methods^26,35,36^ use the expected improvement acquisition function to select which sequence to test. In contrast, we use the UCB/LCB acquisition function in order to study the effect of being optimistic vs pessimistic in a single round of optimization. Another key difference is that, we utilize a pretrained autoregressive protein language model to regularize our generative process. Of these methods, Ref.^35^ is the most similar to us in scope as it is designed to perform offline single-round multi-objective optimization, although it optimizes thermostability whereas we optimize solubility. In addition, the idea of performing the exploration in the latent space of the auto-encoder in order to amortize the exploration cost of the search space is similar to our use of the GflowNet. AntBO and Ref.^36^ on the other hand were designed for multi-round online active learning. Both methods use in-silico affinity prediction methods as $[eqn]$ , allowing them to bypass the more expensive and time-consuming wet lab experiments.

In addition,^27^ we trained a GP on the dataset of^15^ and generated new sequences using Monte Carlo Markov chain and the expected improvement acquisition function. They demonstrated that this approach is able to generate new mutants with higher affinity to the CB-119 peptide than in the initial dataset. Unlike us, however, they do not perform multi-objective optimization.

Motivation behind investigating the choice of acquisition function

Our motivation to investigate the use of different choices of $[eqn]$ for the acquisition function came from the fact that we have a limited dataset and the need to constrain the space of sequences on which the exploration is performed, both to facilitate the generative process and limit the search to sequences on which our predictive models are confident. By varying $[eqn]$ , we can choose to let the generative model generate sequences farther away from the original dataset or to remain close. In previous studies,^26,35^ a trust region is used to achieve a similar goal. The trust region limits the exploration and is dynamically updated at the end of each optimization round based on whether a new best solution has been found. It is not possible to use the same approach in the context of a single round as the trust region is only modified at the end of each round. In our case, however, trying out different choices for $[eqn]$ is a principled way of restricting the search space.

We took inspiration from recent papers in offline reinforcement learning to try both optimistic and pessimistic values of $[eqn]$ .^37,38^ The setting of offline reinforcement learning is identical to single-round antibody optimization, wherein we do not have the ability to further query the environment or run more wet lab experiments. In this setting, multiple papers have pointed out that it may be useful to penalize uncertainty, i.e. choose a negative value for $[eqn]$ .^39,40^

The advantage of being pessimistic is that it forces the generative model to generate sequences on which it is more confident that its prediction is accurate. This is particularly important if the number of sequences that can be tested is small and if there are few good sequences (with a high affinity) within the trust region. Instead of being optimistic and generating a large variety of sequences for which we have no guarantee that they will work, we chose to generate sequences closer to the ones in our dataset, but for which we have more accurate affinity predictions, and therefore stronger guarantees that they will bind to the target antigen.

Using the AlphaFold 3’s ipTM score for CB-119 peptide binding showed that both pessimistic and optimistic $[eqn]$ values have benefits. Pessimistic $[eqn]$ yielded many predicted binders among sequences with solubility scores $[eqn]$ , while optimistic $[eqn]$ generated more highly soluble sequences, some of which were binders. This suggests that combining both approaches could be advantageous and that optimism enables simultaneous optimization of multiple properties (producing more soluble sequences), while pessimism increases the likelihood of actual binding but may make it difficult to optimize more properties.

Limitations of our method and future research directions

There are several limitations to our study. We only consider the case of optimizing over both affinity and solubility, although there are several other important properties of interest, such as thermostability. It will be interesting to test this method when there are more properties to optimize over. Furthermore, while this approach can be extended to optimize over more than two properties, the computational cost required to generate sequences along the Pareto front and to identify the best sequences generated grows exponentially with the number of properties considered. The current approach we used of sweeping across values of $[eqn]$ may no longer be appropriate in this situation. Investigating future solutions to this problem is an interesting area of research.

Our solubility predictive model only outputs a predictive score but no uncertainty estimates. It remains a possible research direction to build a SASA prediction model that includes uncertainty estimates. In addition, since the model outputs a prediction for every amino acid residue, it would also need to output an uncertainty estimate for every residue. It would therefore also be interesting to investigate how to combine each estimate into a single uncertainty score that could be included in the acquisition function.

We found that Pareto fronts vary with different training data and generated sequence cluster near the best training examples, suggesting that our method explores locally rather than globally despite using PLM encodings. Achieving global optimization would require larger training sets and more accurate affinity prediction models, which do not yet exist. The quality of the generated candidates depends on the training set quality – better input sequences yield better outputs according to predictive models. While different dataset splits produce different sequences, our method consistently outperforms others in generating candidates with desirable properties. However, AlphaFold 3 predictions show that all generated antibodies have similar structures, indicating limited structural diversity if these predictions are accurate.

Overall, our approach demonstrates the possibility to optimize several antibody properties in parallel. As real-life pharmaceutical projects aim at optimizing more than only two protein properties (affinity and solubility) before identifying a drug candidate, a direct application of this work would be to apply this approach to the optimization of five or more properties with experimental evaluations of the resulting sequences. By providing sequences fulfilling the drug developability criteria, the method described in this paper can help scientists to accelerate drug discovery projects.

Methods

Solubility predictor

In order to select the structures from which to build the training dataset for the solubility predictor, we took all the structures available in SabDab^24^ and clustered them by 98% similarity of the heavy chain sequence using CD-HIT.^41^ From each cluster, we selected the antibody with the longest heavy chain sequence, giving us 2648 distinct heavy chain structures. We use the Shrake-Rupley algorithm^22^ to compute the SASA of every residue in the heavy chains. We divided each computed SASA value by the maximum exposed side-chain SASA in the Ala-X-Ala peptides (where X is the amino acid for which the SASA was computed) as determined by.^42^

We use the convolutional neural network architecture used for NanoNet^23^ with an additional embedding layer added at the beginning. This layer takes as input a one-hot vector encoding of dimension $[eqn]$ . The 20 values are used to determine which amino acid is present at position $[eqn]$ . In addition, for each residue $[eqn]$ , we provide the network with the distance to the first amino acid $[eqn]$ and the distance to the last amino acid $[eqn]$ .^43^ We found that providing both the forward and backward positioning information helps the network deal with sequences of different lengths. The output of the encoding layer is a matrix of dimension $[eqn]$ with $[eqn]$ .

The network was trained using stochastic gradient descent for 10 epochs and using a mini-batch size of 16. We use the MSE loss function:

[eqn]

where $[eqn]$ is the predicted SASA of the residue at position $[eqn]$ in the sequence $[eqn]$ and $[eqn]$ is its true value.

Gaussian process

The Gaussian process assumes that the output function $[eqn]$ was initially drawn at random, but in a way that similar $[eqn]$ have similar $[eqn]$ . The distribution $[eqn]$ for all $[eqn]$ is assumed to be a multivariate Gaussian of constant mean $[eqn]$ and covariance $[eqn]$ , where $[eqn]$ is called the kernel function and determines how similar the outputs of $[eqn]$ and $[eqn]$ should be.

Let $[eqn]$ be the set of training data and $[eqn]$ associated affinity values. We assume that the data is noisy so that $[eqn]$ with $[eqn]$ . Call $[eqn]$ is a new sequence for which we aim to make a prediction for $[eqn]$ .

The joint distribution for the training data and the new sequence is also a multivariate Gaussian: $[eqn]$ with:

[eqn]

$[eqn]$ a matrix of dimensions $[eqn]$ where the $[eqn]$ entry is $[eqn]$ . $[eqn]$ a vector of dimension $[eqn]$ where the ith entry is equal to $[eqn]$ and $[eqn]$ is the identity matrix of dimension $[eqn]$ .

We seek the posterior distribution:

[eqn]

which is Gaussian with mean:

[eqn]

and variance

[eqn]

The posterior distribution for the noisy prediction $[eqn]$ has an additional $[eqn]$ term in its variance.

The model provides both an estimate for the affinity of a new sequence as well as an uncertainty estimate. This uncertainty $[eqn]$ only depends on the training data $[eqn]$ and can be reduced by adding more training examples. Therefore, it captures the “epistemic” uncertainty of the error. On the other hand $[eqn]$ is independent of the training set and $[eqn]$ , therefore it captures the “aleatoric” uncertainty.

We use the common radial basis function (RBF) Kernel:

[eqn]

The RBF kernel^44^ encodes the bias that sequences close to one another in embedding space have similar affinities to the target antigen.

The norm $[eqn]$ is defined as the Euclidian distance in the embedding space. We tested the following embedding choices:

One-hot vector embedding of the amino acid sequences. For a sequence of length L, this is a vector of length $[eqn]$ .
Embeddings computed by the protein language model ESM2.^45^ ESM2 is a BERT^46^ style transformer that was trained using the masked language task on general proteins. We compared the versions with 8 M, 150 M and 650 M parameters.
Embeddings computed by Antiberty,^20^ a bert-style transformer, similar to ESM2, but trained uniquely on human antibody heavy chain sequences.

The parameters $[eqn]$ and $[eqn]$ of the Kernel $[eqn]$ , as well as $[eqn]$ , are learned by minimizing the log marginal likelihood of the training data:

[eqn]

Humaness model

To model humaness parameter $[eqn]$ , we use IGLM,^11^ an auto-regressive transformer trained on 558 M human heavy chain sequences from the OAS database.^12^ Given the amino acid sequence from position 1 to $[eqn]$ denoted $[eqn]$ , the model returns a discrete probability distribution over the set of the 20 amino acids $[eqn]$ and adds the next amino acid. For each sequence of amino acids, $[eqn]$ is the probability that IGLM generates that particular sequence (obtained using the likelihood method of IGLM),

[eqn]

which guarantees that only amino acid sequences that resemble human heavy chain sequences will have a high probability of being generated. Shuai et al.^11^ showed that sequences generated with IGLM have on average good developability properties when looking at solubility, aggregation and CDRH3 length.

Energy-based model

We sample sequences using energy-based models (EBMs) by assigning an energy score $[eqn]$ to each amino acid sequence $[eqn]$ where $[eqn]$ is low for desirable $[eqn]$ . We would like to sample $[eqn]$ from a probability distribution that is as close as possible to the distribution of natural antibodies, $[eqn]$ , while minimizing the mean of $[eqn]$ :

[eqn]

where $[eqn]$ is the Kullback-Leibler divergence. The solution of this minimization is given by the Boltzmann law, Eq. 1. In the particular case where $[eqn]$ is replaced by the uniform distribution, $[eqn]$ becomes the negative entropy of $[eqn]$ , and the problem reduces to the maximum entropy principle, where $[eqn]$ plays the role of an inverse Lagrange parameter enforcing the mean value of $[eqn]$ .

The temperature $[eqn]$ sets the trade-off between the objectives of minimizing the quantity of interest $[eqn]$ , and remaining as close to the basal distribution of antibodies. While $[eqn]$ reduces to finding the best sequences, $[eqn]$ ensures a higher diversity of antibodies that look like natural ones.

The energy function $[eqn]$ is chosen as a linear combination of multiple properties which we wish to minimize, called linear scalarization:

[eqn]

Varying the weights $[eqn]$ makes it possible to explore the Pareto front in the $[eqn]$ limit, when that front is convex. For $[eqn]$ , the vicinity of the front is explored while also ensuring that antibodies still look like natural ones drawn from $[eqn]$ .

MCMC sampling

To perform MCMC sampling of Eq. 1, we start from the WT sequence ‘GFTLNSYGISIYSDGRRTFYGDSVGRAAGTFDS,’ the concatenation of the CDRH1, CDRH2 and CDRH3 of the AB-14 sequence from the CB-119 dataset. The same wild-type sequence was also used to generate the synthetic datasets for the simple and hard synthetic affinity task. We set this sequence to be the first sequence $[eqn]$ . At time $[eqn]$ , we randomly sample a mutant sequence $[eqn]$ from the neighborhood of $[eqn]$ . The neighborhood of $[eqn]$ is defined as the set of sequences that are at most 1 mutation away from $[eqn]$ and at most $[eqn]$ mutations away from $[eqn]$ . For each sequence with probability $[eqn]$ , we accept the mutation and set $[eqn]$ , or otherwise we reject the mutation and set $[eqn]$ .

We ran this process in parallel 8 times. Each time we performed MCMC sampling for a period of 20,000 time steps. To add a burn-in period, we removed all the sequences $[eqn]$ from the eight chains and computed the Gelman-Rubin statistic.^47^ The Gelman-Rubin statistic checks the convergence of multiple Markov Chain Monte Carlo (MCMC) chains by computing the ratio between the variance within each chain to the variance between chains, with values close to 1 indicating convergence. We then selected an initial timestep $[eqn]$ such that the Gelman-Rubin statistic was as close to 1 as possible. If we could not find a timestep $[eqn]$ such that the Gelman-Rubin statistic was lower than 1.1, we ran the sampling for 20,000 additional steps and repeated the process until convergence.

GFlowNet sampling

For the autoregressive model used in our implementation of the GFlowNet, we used a ByteNet^48,49^ architecture. The network starts with an encoding layer that feeds into four ByteNet blocks. Each block uses masked convolutional layers with no dilution and a kernel size of 17. The output of these four blocks is a matrix of dimension $[eqn]$ where $[eqn]$ is the maximum sequence length and $[eqn]$ is a hyper parameter representing the size of the encoding which we choose to be 16. We then feed to a decoder the $[eqn]$ column of this matrix to which we append two values, $[eqn]$ and $[eqn]$ , where $[eqn]$ is the position of the amino acid we are currently sampling and $[eqn]$ is the Levenshtein distance between the sequence generated so far and the prefix of length $[eqn]$ of the WT sequence. We provide $[eqn]$ to the decoder to simplify the learning process and $[eqn]$ in order for the network to be able to learn to only generate sequences with at most $[eqn]$ mutations from the wild-type sequence.

The decoder consists of a two layer multi-layer perceptron with a normalizing layer and ReLu activation layer between the first and second linear layer. The output of the second layer is a vector of dimension 20 containing the logits for each of the 20 amino acids.

For each GflowNet trained, we use a starting learning rate for the parameters of the network of $[eqn]$ , except for the learning rate of $[eqn]$ which is set to $[eqn]$ (an order of magnitude higher per the recommendation of the GflowNet paper). We train each network for 8000 training steps, halve the learning rate after 4000 steps and use a replay buffer with a max size of 20,000. We initialize to contain the sequences in the COVID dataset for the COVID task and the synthetic dataset for the simple and hard synthetic tasks. At each step, we first generate 16 sequences using the generative model and compute their energy $[eqn]$ . We then sample 16 more sequences and their score from the replay buffer and update the parameters of the model using stochastic gradient descent on the set of 32 sequences. The 16 generated sequences are then added to the replay buffer and the process is repeated.

Unlike for the MCMC, there is no principled way to check that the training of the neural network has converged to a local optimum. In order to verify that the network is learning to properly generate sequences with a probability proportional to their reward, we take inspiration from Ref.^50^ Using a separate dataset of sequences from the sequences used to initiate the replay buffer, we periodically, during the training process, compute the log probability of the sequences being generated. We compute the Spearman R correlation score between the scores and the log probability of the generated sequences.

While this method has its flaws, we found that it was a simple method to verify that the generative model was correctly learning for each choice of $[eqn]$ , $[eqn]$ and inverse temperature $[eqn]$ (SI Figure S2).

Alphafold 3 predictions

Here, we provide step-by-step instructions to reproduce our Alphafold three predictions. Datasets containing the sequences of the antibody and the antigen should be downloaded at: https://github.com/statbiophys/ABGen/tree/master/lib/dataset/af3_batch_jobs_generated_beta:-1.0.json,https://github.com/statbiophys/ABGen/tree/master/lib/dataset/af3_batch_jobs_generated_beta:0.0.json,https://github.com/statbiophys/ABGen/tree/master/lib/dataset/af3_batch_jobs_generated_beta:1.0.json,https://github.com/statbiophys/ABGen/tree/master/lib/dataset/af3_batch_jobs_generated_beta:2.0.json,https://github.com/statbiophys/ABGen/tree/master/lib/dataset/af3_batch_jobs_generated_antBO_beta:0.0.json (one file for each value of $[eqn]$ , in addition to the antBO prediction).

After logging into the server (a Google account is required), click on ‘upload json’ and select one of the downloaded json files on your local computer. After loading the json files, each job must be submitted manually, with a limit of 30 jobs per day.

Supplementary Material

figureS2.jpg

figureS8.jpg

figureS5.jpg

figureS13.jpg

figureS6.jpg

figureS1.jpg

figureS15.jpg

figureS4.jpg

figureS9.jpg

figureS17.jpg

figureS12.jpg

figureS14.jpg

figureS11.jpg

figureS10.jpg

figureS16.jpg

figureS7.jpg

figureS3.jpg

figureS18.jpg

Bibliography50

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Jain T, Sun T, Durand S, Hall A, Houston NR, Jh N, Sharkey B, Bobrowicz B, Caffry I, Yu Y. Biophysical properties of the clinical-stage antibody landscape. In: Wells J, editor. Proceedings of the National Academy of Sciences; Vol. 114. 2017. p. 944–24.10.1073/pnas.1616408114 PMC 529311128096333 · doi ↗ · pubmed ↗
2Mc Mahon C, Baier AS, Pascolutti R, Wegrecki M, Zheng S, Ong JX, Erlandson SC, Hilger D, Rasmussen SGF, Ring AM, et al. Yeast surface display platform for rapid discovery of conformationally selective nanobodies. Nat Struct Mol Biol. 2018;25(3):289–296. doi: 10.1038/s 41594-018-0028-6.29434346 PMC 5839991 · doi ↗ · pubmed ↗
3Ledsgaard L, Kilstrup M, Karatt-Vellatt A, Mc Cafferty J, Laustsen AH. Basics of antibody phage display technology. Toxins. 2018;10(6):236. doi: 10.3390/toxins 10060236.29890762 PMC 6024766 · doi ↗ · pubmed ↗
4Ye W, Liu X, He R, Gou L, Lu M, Yang G, Wen J, Wang X, Liu F, Ma S, et al. Improving antibody affinity through in vitro mutagenesis in complementarity determining regions. J Biomed Res. 2022;36(3):155. doi: 10.7555/JBR.36.20220003.35545451 PMC 9179109 · doi ↗ · pubmed ↗
5Hearty S, Leonard P, O’Kennedy R. In: Nevoltris D, Chames P, editors. Antibody Engineering: Methods Protocols, 2nd ed. New York, NY: Humana Press; 2012. p. 411–442.
6Khetan R, Curtis R, Deane CM, Hadsund JT, Kar U, Krawczyk K, Kuroda D, Robinson SA, Sormanni P, Tsumoto K, et al. Current advances in biopharmaceutical informatics: guidelines, impact and challenges in the computational developability assessment of antibody therapeutics. Vol. 14. Taylor & Francis; 2022.10.1080/19420862.2021.2020082 PMC 881277635104168 · doi ↗ · pubmed ↗
7Mason DM, Friedensohn S, Weber CR, Jordi C, Wagner B, Meng SM, Ehling RA, Bonati L, Dahinden J, Gainza P, et al. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat Biomed Eng. 2021;5(6):600–612. doi: 10.1038/s 41551-021-00699-9.33859386 · doi ↗ · pubmed ↗
8Biswas S, Khimulya G, Alley EC, Esvelt KM, Church GM. Low-n protein engineering with data-efficient deep learning. Nat Methods. 2021;18(4):389–396. doi: 10.1038/s 41592-021-01100-y.33828272 · doi ↗ · pubmed ↗