Seq2Bind webserver for binding site prediction from sequences using fine-tuned protein language models

Xiang Ma; Supantha Dey; Vaishnavey SR; Casey Zelinski; Qi Li; Ratul Chowdhury

PMC · DOI:10.1093/nargab/lqaf154·November 22, 2025

Seq2Bind webserver for binding site prediction from sequences using fine-tuned protein language models

Xiang Ma, Supantha Dey, Vaishnavey SR, Casey Zelinski, Qi Li, Ratul Chowdhury

PDF

Open Access

TL;DR

Seq2Bind is a web tool that predicts protein binding sites from sequences alone, using advanced machine learning models to identify key residues without needing structural data.

Contribution

The novel contribution is using fine-tuned protein language models to predict binding residues directly from sequences, bypassing structural requirements.

Findings

01

ESM2 and ProtBERT achieved over 67% interface-residue recovery at N-factor = 3 on 6063 dimer proteins.

02

Seq2Bind outperformed structural docking methods on 14 human health-relevant protein complexes at N-factor = 2.

03

The tool enables rapid screening of disordered proteins and provides comparable accuracy to docking methods.

Abstract

Decoding protein–protein interactions at the residue level is crucial for understanding cellular mechanisms and developing targeted therapeutics. We present Seq2Bind webserver, a computational framework that leverages fine-tuned protein language models (PLMs) to determine binding affinity between proteins and identify critical binding residues directly from sequences, eliminating the structural requirements that limit affinity prediction tools. We fine-tuned four architectures, including ProtBERT, ProtT5, Evolutionary Scale Modeling 2 (ESM2), and Bidirectional Long Short-Term Memory on the SKEMPI 2.0 dataset. Through systematic alanine mutagenesis on each residue for 6063 dimer proteins from Protein Data Bank, we evaluated each model’s ability to identify interface residues. Performance was assessed using N-factor metrics, where N-factor = 3 evaluates whether true residues appear within…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Figures5

Click any figure to enlarge with its caption.

Schematic representation of the deep learning workflow for predicting experimental protein–protein binding affinity. Two protein sequences are tokenized and processed through a Siamese neural network (two identical, weight-sharing pre-trained PLM encoders) to generate sequence embeddings. After max pooling, the vectors are combined by a Hadamard product. The joint representation is then passed to an MLP regression head (the gear icon), followed by a ReLU activation function, to produce an affinity score as the final output. The error is backpropagated to fine-tune the PLM encoders.

(A) Loss curves for ProtBERT, ProtT5, ESM2, and BiLSTM across training and (B) validation epochs. Both training and validation MAE are shown using distinct marker styles. Among the models evaluated, ProtBERT achieves the lowest validation MAE, indicating the best overall predictive performance. (C) Distribution of −∆G values for the training and validation dataset. (D) MAE of −∆G values for each model on training and validation datasets.

Performance comparison of interface residue prediction models across different N factors. (A) Distribution of success rates (%) for each model (ESM2, BiLSTM, ProtBERT, T5) across analyzed protein complexes for N factors 1, 2, and 3. Success rate is defined as the average percentage of ground-truth interface residues correctly predicted by the model for 6063 protein complexes. (B) Heatmap showing the mean success rate % across different models (x-axis) and N factors (y-axis).

(A) Interface-residue recovery across N-factor thresholds for sequence-based models, docking, and mutation-effect predictors. Bars indicate the success rate (percentage of ground-truth interface residues recovered) aggregated over 14 protein–protein complexes. For sequence models (ESM, LSTM, ProtBERT, T5) and mutation predictors (SAAMBE-3D, DDMut-PPI, mCSM-PPI2, SAAMBE-SEQ, MutationExplorer), three grouped bars correspond to N = 1, 2, and 3, where the top N, 2N, or 3N highest-ranked residues per chain are counted as correct if present in the ground-truth interface. The HADDOCK bar (leftmost) reports its aggregate success rate across complexes. (B) Stacked horizontal bar charts showing the count of correctly predicted interface residues per PDB ID for each model. The total length of the bar represents the sum of correct predictions by all models for that complex (including residues predicted correctly by multiple models). Three different parts represent results for N factors 1, 2, and 3. To prevent overcounting, the multiple models section was used to indicate that the correct residue was predicted by more than one model.

Interface-residue recovery on two benchmark complexes (PDB IDs 5WBX and 5V89). For each complex, we display the experimentally validated interacting residues as sticks (ground truth) and annotate those positions that are recovered by at least two of the four models tested for panel (A) 5WBX (a SK/IK channel positive modulators) and by our best-performing model esm2 for panel (B) 5V89 (a DCN1-like protein 4 PONY domain bound to Cullin-1).

Tables1

Table 1.. Predictive accuracy of models for N factor 1, Patch 1 for different types of residue bonds for 14 complexes

Bond type	Ground-truth count	ESM2-recovery %	BiLSTM-recovery %	ProtBERT-recovery %	T5-recovery %
H-bond	87	19.5	11.5	16.1	18.4
Ionic bond	6	33.3	50.0	16.7	0.0
p-Cation	5	0.0	20.0	80.0	20.0
pp-Stacking	15	26.7	13.3	26.7	20.0
Disulfide	2	0.0	50.0	50.0	50.0

Equations1

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} - {\mathrm{\Delta }}G = 298{\mathrm{\ }} \times 1.9872{\mathrm{\ }} \times {\mathrm{\ }}\frac{{\ln {{K}_D}{\mathrm{\ }}}}{{1000}}\ \end{eqnarray*}\end{document}

Funding4

—National Science Foundation10.13039/100000001
—Iowa State University Startup Grant (Building a World of Difference Faculty Fellow)
—Center for Industrial Research and Service10.13039/100022753
—Established Program to Stimulate Competitive Research10.13039/100005714

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

Topicsvaccines and immunoinformatics approaches · Protein Structure and Dynamics · Bioinformatics and Genomic Networks

Full text

Introduction

Protein–protein interactions (PPIs) play a pivotal role in facilitating various cellular processes, influencing the regulation of gene expression, signal transduction, and the formation of complex biological networks [1–3]. The intricate interplay between proteins affects essential biological functions, and deciphering the landscape of PPIs is crucial for understanding cellular mechanisms and disease pathways [4–7]. Traditional experiments [8, 9] and high-throughput techniques [10, 11] have generated a significant amount of PPI data, but they involve expensive and time-consuming wet-lab experiments. Accurate prediction of PPIs by machine learning and deep learning is, therefore, of paramount importance in expediting campaigns for comprehension of cellular functions and drug discovery efforts. The ability to predict PPI strengths holds significant promise for unveiling the intricacies of cellular interactions (such as signal transduction, competitive inhibition, and intercellular quorum sensing) and guiding targeted therapeutic interventions.

Traditional machine learning methods, such as support vector machine [12] and random forest [13], have played a foundational role in developing effective PPI predictors. Given the primary influence of protein sequences on interactions, much research has concentrated on sequence-based PPI prediction. The key step in constructing a machine learning model for PPI involves feature engineering, converting the protein sequences into fixed-dimensional vectors. Amino acid composition, dipeptide composition, conjoint triad, Doc2Vec, and position-specific scoring matrix are prevalent sequence encoding schemes [14–18], providing descriptive representations and are often leveraged by traditional machine learning methods [19, 20]. This harnesses the inherent information within protein sequences, offering a robust strategy for deciphering complex PPI strength and structural/biochemical priors that drive these interactions. However, traditional machine learning methods often face limitations, such as nonlinear relationships and feature extraction, in capturing intricate and nuanced relationships within highly dynamic and complex biological systems [21].

Recent advancements in deep learning, particularly transfer learning approaches borrowed from the natural language processing (NLP) domain, have demonstrated significant advantages in predicting PPIs compared to traditional machine learning methods. A substantial body of work predicts residue-level determinants of PPIs, often via mutational hot-spot identification and ΔΔG estimation for single-site substitutions. Representative methods include mCSM-PPI2 [22], SAAMBE-3D [23], SAAMBE-SEQ [24], MutationExplorer [25], and DDMut-PPI [26], which leverage graph signatures, structure-based energetics, or sequence-learned features to prioritize interface residues for experimental follow-up. Notably, attention-based models, including bidirectional encoder representations from transformers (BERT) and text-to-text transfer transformer (T5), have been adapted from NLP and trained on a large corpus of protein sequences in a self-supervised fashion to learn biophysical features inherent in protein sequences [22]. In this context, protein language models (PLMs) treat sequences as analogous to sentences, with amino acids serving as the fundamental building blocks, akin to a natural vocabulary. These pre-trained PLMs have shown to serve as promising foundational bases for downstream tasks to map sequence to function. While there has been a great deal of focus on leveraging PLMs for accurate prediction of 3D protein structures individually (RGN2 [23], AlphaFold2 [24], OpenFold [25], ESMFold [26], OmegaFold [27]) or in complex with other proteins (AlphaFold-Multimer [28]), the inference times for structure prediction is long and is also beset with uncertainties proportional to the number of known family members for a protein. In addition, protein–protein docking protocols also put forward uncertainties on binding poses depending on how structured and where the binding loci are. This is why predicting antigen interactions with hypermutable, unstructured CDR3 stretches of antibodies are still elusive to structure-based methods. Protein language-model encodings (an implicit summary of both sequence and structure) is gaining traction for protein function prediction, enzyme activity [29], distant homology detection [30], allosteric site prediction [31], subcellular localization [32] etc. However, training a binding affinity predictor between two proteins, and using it downstream to directly predict the hotspot amino acids that are key for the interaction between proteins has thus far remained mostly unaddressed.

In this study, we addressed this gap and have put forward a general computational assessment platform (Seq2Bind) to identify promising pre-trained natural language encodings of proteins and fine-tuned them for protein pairs and jointly trained them to predict normalized experimental binding strength between them. We fine-tuned four different pre-trained architectures and subsequently queried the trained models to pinpoint key amino acids from each protein in a pair, which, when altered, compromises the binding strength—i.e. critical to binding. We iteratively altered amino acid types along the polypeptide backbones of both proteins in a pair to alanine (one at a time) to glean those residue positions which lead to highest drops in binding energy. We analyzed 6069 heterodimers from Protein Data Bank to assess the performance of our tool. Based on the highest binding affinity shifts, we identified the binding site for these complexes. Subsequently, residues (and some of their neighboring flanks) from both proteins were all substituted by alanine resulted in augmented signals i.e. reveal with certainty whether the language model can capture a locus’ importance in mediating interaction with the partner protein. Additionally, 14 protein pairs (relevant for human health) were used for a comparison test against other SOTA prediction and docking. For docking, we took these 14 protein pairs from the Protein Data Bank and split their chains and re-docked them using structural docking program HADDOCK3 to compute the number of native interaction residues that the top HADDOCK3 pose recovers.

By presenting an in-depth analysis of the predictive performance of PLMs on mutated protein sequences, we show that subtle point mutations are difficult to be translated into local changes in protein geometry that could (a) be picked up by deep learning-based protein structure predictors and (b) show any difference in structure-based docking protocols. The mutational landscape poses a unique challenge for predictive models, and assessing their robustness in predicting interacting residues under varying mutational scenarios is crucial for establishing their reliability. This study offers insights of PLMs in its adaptability to mutated protein sequences for not only structured proteins but also non-structural transmembrane viral proteins, which might be important for entry into a host (with consequences in disease surveillance and antibody design for intervention in humans, animals, and plants). Our results shed new light on leveraging the in situ potential of PLMs for predicting key residues that biochemically drive PPI.

Seq2Bind is freely accessible via our webserver at https://agrivax.onrender.com/seq2bind/scan as part of the StructF suite. Users can select from four distinct predictive models to perform calculations tailored to their needs. The platform offers two primary functionalities: binding energy prediction and alanine mutation scanning. Additionally, a Google Colab (https://colab.research.google.com/drive/1ssETy-TXgpGzcS0AhZrDvT3cOaF5pWrF) is provided to facilitate a user-friendly, accelerated workflow, enabling efficient execution of analyses with minimal computational overhead. Compute times on the Colab are dependent on the cloud architecture chosen. Naive estimates on compute time using NVIDIA A100-SXM4-80GB GPU architecture and 369 GB RAM takes cloud runtime ∼46 sequences/second for ProtBERT, ∼10.9 sequences/second for Evolutionary Scale Modeling 2 (ESM2), and ∼4.96 sequences/second for ProtT5 model of Seq2Bind.

Methods

Datasets

The protein dataset utilized in this study was retrieved from the publicly accessible Structural Kinetics of Energetic and Molecular Processes Interface (SKEMPI 2.0) database [33]. The dataset, encompassing a wide range of protein complexes, including endogenous and exogenous interactions, comprises 7086 pairs of proteins with experimentally determined thermodynamic data. Utilizing the affinity data, represented by the dissociation constant (KD), we computed the negative delta G as the target affinity score using the formula

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} - {\mathrm{\Delta }}G = 298{\mathrm{\ }} \times 1.9872{\mathrm{\ }} \times {\mathrm{\ }}\frac{{\ln {{K}_D}{\mathrm{\ }}}}{{1000}}\ \end{eqnarray*}\end{document}

Protein sequences were extracted from the provided PDB files, and mutations were applied using mutation codes (e.g. LI38G denotes replacing L at position 38 in chain I with G). The accuracy of amino acid residues was verified against the protein sequences. We retrieved 344 wild-type protein pairs. Subsequently, 287 pairs lacking affinity data and 1573 pairs with mismatches between mutation codes and protein sequences were excluded. After removing 183 duplicated data points, we obtained 5387 pairs of proteins for analysis. This processed dataset was then divided into two subsets: 70% (3770 data points) for training and 30% (1617 data points) for validation purposes.

Models

We employed three PLMs, ProtBERT, ProtT5 [22], and (ESM2) [34], for the embedding of protein sequences. ProtBERT and ProtT5 are based on the BERT and T5 models, both of which are deep learning architectures recently developed in the NLP field. ESM2, based on transformer architecture, has become a cornerstone in modern NLP and computational biology due to its ability to model long-range dependencies in sequences. The PLMs underwent a pretraining process on large-scale datasets using a masked language modeling objective. Specifically, during training, a masked language model randomly masks 15% of the amino acids in the input sequences, fostering an efficient learning process. The ProtBERT model was pretrained on the Uniref100 dataset, comprising a 217 million protein sequences, while the ProtT5-XL-UniRef50 model, based on the t5-3b architecture, and ESM2 (esm2_t33_650M_UR50D) were pretrained on UniRef50, a dataset containing 45 million protein sequences. [22] The encoder portion (prot_t5_xl_half_uniref50-enc) of the original ProtT5-XL-UniRef50 model was used in this study.

The embedding process of Seq2Bind begins with tokenization of a given protein sequence, followed by the addition of positional encoding to each token. This tokenized input is then passed through stacked self-attention layers, enabling the generation of context-aware embeddings. Each residue within the protein sequence is encoded into an output embedding of length 1024. The hidden state of the attention stack, with dimensions corresponding to the number of amino acids, is utilized as a feature matrix for downstream tasks, e.g. regression or classification.

PLMs were integrated into a Siamese neural network architecture (Fig. 1), a design inspired and modified from recent literature [35]. The regression head of the network was modified to include three dropouts and linear layers. The final linear layer was configured to have one output, predicting the binding energy. Additionally, a ReLU activation function was applied to the output layer to ensure that all scores were positive. Various loss functions, including mean squared error (MSE), mean absolute error (MAE), and Huber, were tested, along with different batch sizes, learning rates, and weight decay hyperparameters, to optimize performance. The MSE and MAE metrics were used as evaluative measures to assess the model’s performance.

Schematic representation of the deep learning workflow for predicting experimental protein–protein binding affinity. Two protein sequences are tokenized and processed through a Siamese neural network (two identical, weight-sharing pre-trained PLM encoders) to generate sequence embeddings. After max pooling, the vectors are combined by a Hadamard product. The joint representation is then passed to an MLP regression head (the gear icon), followed by a ReLU activation function, to produce an affinity score as the final output. The error is backpropagated to fine-tune the PLM encoders.

In addition, we employed a Bidirectional Long Short-Term Memory (BiLSTM) model [36] as our baseline. Each protein sequence was input into a BiLSTM architecture with a latent dimension of 10. The outputs from the last hidden layers served as encodings for the respective sequences. Subsequently, the encoded representations from both sequences were concatenated and fed into linear layers to produce a single output representing the predicted binding energy. To address the positional bias inherent in this architecture, where residues near the beginning of the sequence disproportionately influence predictions, we implemented an enhanced inference strategy. This approach processes protein sequences using a sliding window technique with overlapping chunks (100 residues per chunk with a 50-residue stride). Each chunk of residues is independently analyzed by the BiLSTM model. The resulting predictions are then averaged, which ensures uniform consideration of residues regardless of their position in the sequence. Overall, this approach significantly improves the model’s ability to identify critical interaction sites throughout the entire protein chain rather than focusing primarily on N-terminal regions.

Evaluation

A set of 6063 protein pairs was compiled to assess the predictive accuracy of the PLMs. All available heterodimers were first extracted from the Protein Data Bank. To avoid overlap with the training set, any protein pairs that appeared in SKEMPIv2 were excluded. Proteins with >2 chain IDs or >500 residues were also removed. This process resulted in the final evaluation set, which is a leakage-controlled prediction corpus. The final corpus contained 7 620 872 per-residue predictions across 6063 PDBs by four sequence-based models (ESM, LSTM, ProtBERT, and T5).

For comparison with other methods, a separate group of 14 protein pairs was selected. These pairs are relevant to human health and have experimentally confirmed interactions (Supplementary Table S1). Importantly, none of them appear in the initial dimer set, allowing for an independent, targeted evaluation. These include immune-related complexes, such as 4UDT (superantigen–TCR) [37], 6ISC (NK cell activation) [38], 7SO0 (chemokine neutralization) [39], 7DC7 (ATP-dependent antibody) [40], and 6WYT (antigen-binding fragments) [41], relevant in infection, inflammation, and immunotherapy. Cancer-associated interactions are represented by 6XI7 (KRAS–RAF1) [42], 8EJM (mutant spliceosome complex) [43], and 5V89 (ubiquitin ligase regulation) [44]. Neurological and vascular signaling is illustrated by 5WBX (SK channels) [45], 7F1G (BACE1 inhibition in Alzheimer’s) [46], and 6UZK (HEG1–KRIT1 complex in cerebrovascular disease) [47]. Therapeutic and viral-host models include 6UMI (migraine antibody–receptor) [48], 7E50 (plasmin–inhibitor in fibrinolysis) [49], 6T36 (HBV–PDZ signaling) [50], 4U7E (ESCRT machinery) [51], and 7WBP (Omicron RBD–hACE2 binding) [52]. Together, these structures span major biological systems implicated in human disease.

To further analyze the structures and binding interface, we conducted a secondary structure analysis of all the PDB complexes. For this, we utilized the Define Secondary Structure of Proteins algorithm [53]. Additionally, in order to comprehensively assess the models’ robustness and generalization capabilities, a series of mutations were introduced for each protein pair. Specifically, for a given protein pair, one protein sequence remained unaltered, while the sequence of the interacting partner underwent systematic mutations. These mutations involved replacing individual amino acids with alanine (A), as well as more complex alterations, such as substituting each pair of consecutive amino acids with di-alanine (AA) and extending this pattern to larger subsequences. This iterative mutation process continued until each set of 5 or 10 consecutive amino acids was replaced by 5 or 10 alanine residues. This sequence manipulation serves to simulate a diverse range of mutation scenarios, providing a comprehensive evaluation of the models’ predictive capabilities under various conditions. Additionally, we utilized RING 4.0 to investigate whether the models exhibit different success rates when predicting interface residues involved in specific interaction types [54]. These interactions involve hydrogen bonds, ionic interactions, p-cation, p–p stacking, and disulfide bonds.

While larger patch sizes (e.g. 5 and 10) facilitate the exploration of spatially extended mutational effects, evaluating performance based solely on predicted ddG magnitude can be challenging due to the potentially subtle changes in binding affinity induced by single-site substitutions. To address this, we established an evaluation criterion based on an “N factor.” This factor is applied as a multiplier to the size of the true interface residue set, denoted as n. Specifically, performance is assessed by evaluating the model’s ability to rank true interface residues among the top k predicted positions, where k is defined as N × n. We performed this analysis using N factors of 1, 2, and 3. For instance, for a complex with 20 true interface residues (n = 20), we evaluated predictions against the top 20 (1n), 40 (2n), and 60 (3n) highest-scoring positions. We benchmarked the model performance against a state-of-the-art (SOTA) structural docking method (HADDOCK3) [55]. In addition, we benchmarked five SOTA PPI mutation predictors (mCSM-PPI2, SAAMBE-3D, SAAMBE-SEQ, MutationExplorer, and DDMut-PPI) that support alanine scanning by submitting our 14 PDB complexes and mutation lists to each method’s public web server (default parameters unless noted) and retrieving per-residue scores [22–26]. For comparability, we restricted to alanine substitutions ranked residues according to each tool’s native score/sign convention. Performance was assessed with our N-factor protocol.

Results

We first fine-tuned ProtBERT, ProtT5, ESM2, and BiLSTM on the SKEMPI 2.0 mutation dataset. The training and validation loss curves, shown in Fig. 2A and B, illustrate the efficient learning process of the four models. Specifically, ProtBERT converged within five epochs, while ProtT5 achieved convergence in just two epochs. Furthermore, the validation MAE achieved by each model, presented in Fig. 2D and detailed in Supplementary Data, demonstrates consistently low prediction errors across architectures. These results underscore the effectiveness of the fine-tuned models in capturing protein–protein binding affinities, particularly given that a majority of experimental binding scores in the SKEMPI 2.0 dataset fall within the 5–15 kcal/mol range (Fig. 2C). The relatively small MAE values compared to the dynamic range of binding affinities suggest that the models are well-calibrated and capable of capturing meaningful biophysical variations in protein interaction strength.

(A) Loss curves for ProtBERT, ProtT5, ESM2, and BiLSTM across training and (B) validation epochs. Both training and validation MAE are shown using distinct marker styles. Among the models evaluated, ProtBERT achieves the lowest validation MAE, indicating the best overall predictive performance. (C) Distribution of −∆G values for the training and validation dataset. (D) MAE of −∆G values for each model on training and validation datasets.

To evaluate the predictive accuracy of PLMs in identifying interaction residues, we assessed the four models using a set of 6063 protein pairs with experimentally confirmed interactions. The assessment comprised 7 620 872 predictions across 6063 PDB entries corresponding to dimers, with a mean protein chain length of 157.1 residues (range: 1–474) (Supplementary Data). One protein in each pair retained its original sequence, while the interacting partner underwent either single amino acid substitutions (one mutation) or replacement of 5 or 10 consecutive amino acids with alanines (5 or 10 mutations). The model performance varies markedly across mutation scenarios. This performance was measured as the fraction of true amino acids captured among the top N predictions. As shown in Fig. 3, the mean success rates (%) of the prediction models increased consistently as the number of top predictions considered per chain pair (N) increased. For N = 1, success rates were 37.1% (ESM), 34.4% (LSTM), 35.9% (ProtBERT), and 35.9% (T5). At N = 2, the rates rose to 56.3%, 53.6%, 56.8%, and 55.0%, respectively. At N = 3, success rates further increased to 67.4% for ESM, 64.7% for LSTM, 68.2% for ProtBERT, and 66.1% for T5. For single‐mutation targets and N factor of 1, ESM2 leads the field with an accuracy of 37.1%, followed by ProtT5 and ProtBERT at 35.9%, and our BiLSTM baseline at 34.4%. Overall, elevating the “N factor,” the number of candidate residues considered per position, yields significant improvements. As illustrated in Fig. 4, moving from an N factor of 1–3 increases ESM2’s accuracy by >30%, from 37.1% to 67.4%. Similar upward trends are observed for all models, underscoring that allowing multiple top‐ranked hypotheses per site substantially enhances the likelihood of including the correct substitution. Moreover, this benefit is consistent across each protein complex tested, highlighting the robustness of the multi‐candidate approach in capturing true mutational events (Fig. 4).

Performance comparison of interface residue prediction models across different N factors. (A) Distribution of success rates (%) for each model (ESM2, BiLSTM, ProtBERT, T5) across analyzed protein complexes for N factors 1, 2, and 3. Success rate is defined as the average percentage of ground-truth interface residues correctly predicted by the model for 6063 protein complexes. (B) Heatmap showing the mean success rate % across different models (x-axis) and N factors (y-axis).

(A) Interface-residue recovery across N-factor thresholds for sequence-based models, docking, and mutation-effect predictors. Bars indicate the success rate (percentage of ground-truth interface residues recovered) aggregated over 14 protein–protein complexes. For sequence models (ESM, LSTM, ProtBERT, T5) and mutation predictors (SAAMBE-3D, DDMut-PPI, mCSM-PPI2, SAAMBE-SEQ, MutationExplorer), three grouped bars correspond to N = 1, 2, and 3, where the top N, 2N, or 3N highest-ranked residues per chain are counted as correct if present in the ground-truth interface. The HADDOCK bar (leftmost) reports its aggregate success rate across complexes. (B) Stacked horizontal bar charts showing the count of correctly predicted interface residues per PDB ID for each model. The total length of the bar represents the sum of correct predictions by all models for that complex (including residues predicted correctly by multiple models). Three different parts represent results for N factors 1, 2, and 3. To prevent overcounting, the multiple models section was used to indicate that the correct residue was predicted by more than one model.

Interface residue recovery was systematically evaluated across increasing N-factor thresholds [1–3] using a curated set of 14 human protein complexes outside the initial validation cohort. Two complexes were excluded after sequence mapping revealed suboptimal alignment quality (see Supplementary Data); thus, analyses proceeded with 14 of the original 16 PDB (Supplementary Table S1). The comparative performance of our method, Seq2Bind, alongside five SOTA prediction and docking techniques, is summarized in Fig. 5A. At the highest contextual depth (N-factor = 3), Seq2Bind consistently outperformed alternative approaches in identifying interacting residues. Notably, even at moderate contextualization (N-factor = 2), sequence-based language models surpassed classical docking; ESM (37.2%) and ProtBERT (35.1%) both yielded higher interface recovery rates than blind docking with HADDOCK3 (32.1%).

Interface-residue recovery on two benchmark complexes (PDB IDs 5WBX and 5V89). For each complex, we display the experimentally validated interacting residues as sticks (ground truth) and annotate those positions that are recovered by at least two of the four models tested for panel (A) 5WBX (a SK/IK channel positive modulators) and by our best-performing model esm2 for panel (B) 5V89 (a DCN1-like protein 4 PONY domain bound to Cullin-1).

Among the evaluated Seq2Bind variants, ESM achieved the highest recovery and demonstrated continuous improvement with increasing N-factor: 19.2% (N = 1), 37.2% (N = 2), and 50.5% (N = 3). In contrast, mutation-predictor baselines exhibited uniformly lower success rates across all N-factors. For N factors 1–3, recovery for DDMut-PPI increased from 28.9% to 41.0%, SAAMBE-3D from 19.4% to 38.6%, mCSM-PPI2 from 16.8% to 33.6%, SAAMBE-SEQ from 11.5% to 31.5%, and MutationExplorer from 9.0% to 27.5%. Despite these gains, all five methods lagged behind Seq2Bind and HADDOCK, except for SAAMBE-3D, which marginally exceeded ESM at only N-factor 1 (19.4% versus 19.2%) (Fig. 4A). Collectively, these results establish large language model-based approaches—particularly ESM and ProtBERT—as the most accurate for benchmarked protein interface prediction, with docking providing a robust classical baseline and mutation-scanning predictors trailing across all N-factor thresholds. Contrary to our initial hypothesis, enlarging the patch size offers only marginal gains in prediction quality (see Supplementary Fig. S1), suggesting that contextual breadth alone is not the primary driver of improved residue recovery.

Figure 4B reveals significant heterogeneity in prediction performance across different protein complexes. Complexes such as 4UDT and 8EJM consistently show higher counts of correctly identified interface residues compared to others like 6UZK or 7WBP, irrespective of the patch size or N-factor size. Increasing the N factor led to a significant rise in correct predictions, notably enabling models that initially failed to identify any correct residues at stricter thresholds to successfully capture true positive sites. For instance, the LSTM model, which identified no correct residues at Top 1N (Patch 1), successfully predicted interface residues at Top 3N (Supplementary Fig. S2). Furthermore, the relative contribution of each model (ESM2, BiLSTM, ProtBERT, T5) varies substantially depending on the specific complex. While certain models like ESM2 might dominate predictions for some PDB IDs, others like ProtT5 or ProtBERT demonstrate stronger performance for different complexes (e.g. ProtT5 in 5WBX and 8EJM) (Supplementary Fig. S3). Comparing the plots across patch sizes indicates that the total number of correct predictions and the best-performing model can be complex-dependent, suggesting no single patch size is universally optimal (Supplementary Fig. S3). These findings highlight that interface prediction difficulty is highly variable between protein systems.

To investigate if model performance is influenced by the underlying interaction chemistry, we analyzed the recovery rate for residues involved in specific bond types, focusing on Patch 1 results (Table 1). The ground truth interaction types were determined using RING 4.0 based on the reference structures. However, RING 4.0 did not identify interactions for every residue within the 6.5 Å interface definition used for the primary hotspot analysis, providing a subset of chemically characterized interactions. Focusing on specific non-covalent interactions [excluding Van der Waals (VdW)], we observed distinct recovery patterns. A key finding was the consistently low recovery rate (11.5%–19.5%) for hydrogen-bonded residues across all models, despite H-bonds representing the largest category of identified non-VdW interactions (N = 87) (Supplementary Fig. S4). This suggests identifying H-bond partners from sequence alone presents a significant challenge for these zero-shot models. While performance varied more substantially between models for rarer interaction types (ionic, p-cation, p–p stacking, disulfide), the low number of ground truth instances in these categories (N in the range of 2–15) limits robust interpretation of apparent model-specific strengths or weaknesses.

For complex 5WBX, most correctly retrieved contacts lie in the hydrophobic core of the interface, including Leu 6, Phe 16, Leu 17, Phe 18, and Met 75 from both chains. Increasing the top N threshold markedly boosts coverage. With N = 1, the four models recover an average of 12 true contacts, which rises to 20 and 28 at N = 2 and 3, respectively. Increasing the N factor benefits ProtBERT significantly, as its hit count increases four to five-fold with the broader candidate list (Supplementary Data), Patch-size enlargement, in contrast, confers only marginal additional gains (Supplementary Fig. S5). For ProtT5, we observed a trend of decreasing performance with larger patch sizes. On the other hand, complex 5V89 is dominated by basic side chains. ESM-2 reliably recovers His28, Lys51, and Lys52, and, at higher N values, additionally identifies Arg111 on chain A and Asp55, Glu59, Glu61, Arg65, and Asp71 on chain B. Overall success rates climb sharply when moving from N = 1–3 (Supplementary Data), underscoring the value of permitting multiple ranked hypotheses per site. Together, these snapshots illustrate the model’s capacity to pinpoint key interface determinants, particularly when a modest expansion of the prediction set is allowed, while also revealing complexes where simpler architectures such as the BiLSTM still underperforms.

The core dataset includes ~7.6 million alanine scanning predictions, with each of the four models—ESM, LSTM, ProtBERT, and T5—contributing roughly 1.9 million predictions. ESM and ProtBERT predicted higher destabilizing mutation rates (46.7% and 58.1%) alongside stabilizing rates of 32.9% and 36.3%, respectively. T5 showed a comparable stabilizing rate (37.6%) but a lower destabilizing rate (42.9%), while LSTM exhibited markedly low detection rates for both stabilizing (3.6%) and destabilizing (0.3%) mutations, with near-neutral mean ΔΔG values across all models (Supplementary Data and Supplementary Fig. S6). These findings reveal distinct model-specific sensitivities in predicting mutation effects. On the other hand, increasing the N factor further improves success rates for N = 4 and 5 (Supplementary Fig. S7). However, this increase effectively broadens the scope by including a larger proportion of residues, thereby potentially inflating success rates by considering most residues rather than truly informative ones. This is further reflected in our analysis of protein length relative to actual interface size (Supplementary Fig. S8). This highlights the need for cautious interpretation when expanding scanning parameters to balance detection sensitivity and specificity.

Subsequently, the observed trend of higher accuracy in the 10-mutation scenario compared to single mutations can be attributed to how the models process sequence data and detect patterns. Transformer-based PLMs, such as ProtBERT, ProtT5, and ESM2, rely heavily on self-attention mechanisms to capture long-range dependencies and contextual information across entire sequences. In the 10-mutation case, the extensive sequence alterations introduce a more prominent signal of disruption in the protein sequence, making it easier for these models to identify significant changes. By contrast, single mutations represent subtler alterations, which may be harder for the models to discern amidst the inherent variability of natural protein sequences.

In contrast to transformer-based models, the BiLSTM, which serves as a baseline, processes sequences sequentially, leveraging recurrent mechanisms to capture temporal dependencies in both forward and backward directions. While BiLSTM does not use attention mechanisms, its ability to generalize in the 10-mutation scenario may stem from its focus on capturing local sequential patterns rather than relying on long-range relationships. The higher accuracy of BiLSTM in this scenario suggests that it may be less sensitive to overfitting subtle changes, offering a complementary perspective on handling sequence disruptions.

From an NLP perspective, PLMs treat protein sequences as tokenized data and learn statistical patterns across these tokens. Larger, contiguous mutations create a clearer anomaly relative to the learned patterns of natural sequences, making it easier for PLMs to detect interaction residues. Single mutations, however, lack this pronounced contextual disruption, posing challenges for both PLMs and BiLSTM in distinguishing biologically significant mutations from background noise.

Discussion

Despite the growing utility of PLMs in modeling PPIs, our findings highlight critical limitations in their ability to reliably identify residue-level interaction sites through fine-tuning. Although ProtBERT, ProtT5, and ESM2, when fine-tuned on binding affinity data, exhibited strong predictive accuracy for global binding strength, their performance in pinpointing interaction interfaces across diverse protein complexes remained inconsistent. Notably, while alanine-scanning mutagenesis revealed a subset of correctly predicted interface residues, the recovery rates, particularly for hydrogen-bonded contacts, were uniformly low across all models. This suggests that single-point mutations introduce subtle sequence perturbations that PLMs struggle to associate with local structural or energetic disruptions, particularly in the absence of explicit 3D structural context. Moreover, the predictive accuracy varied markedly between complexes and interaction chemistries, suggesting that fine-tuned models remain sensitive to dataset bias and system-specific variability. The higher accuracy observed in larger sequence perturbations (e.g. 10-residue alanine patches) further suggests that PLMs may be more attuned to detecting broader contextual disruptions than fine-scale contact determinants.

To improve the interpretability and robustness of interaction residue prediction, future studies could integrate structural priors during model training, such as incorporating residue–residue distance maps or attention-guided contact constraints from experimentally resolved structures. Multi-modal training frameworks that jointly learn from both sequence and structure could bridge the gap between global sequence embeddings and local geometric features critical for interface formation. Additionally, rather than relying solely on alanine scanning, leveraging generative adversarial mutation strategies [56] or in silico evolution simulations [57] may better expose latent model biases and enhance sensitivity to biologically relevant perturbations. Benchmarking models across chemically annotated interaction types (e.g. hydrogen bonds, ionic, π-stacking) with balanced ground-truth sets will also be essential to isolate model-specific strengths and limitations. Finally, expanding training datasets to include more diverse and experimentally resolved complexes, especially those with non-canonical or transient interactions, may improve generalization and help disentangle sequence-derived signals from structure-imposed constraints.

Our analysis revealed that increasing the N factor, and thus the number of top-ranked predictions considered, significantly improves the success rate of PLMs in identifying known interface residues (Fig. 3). Consequently, model selection and threshold tuning are crucial for downstream applications. The demonstrated increase in the count of unique correct predictions with higher N factors (Fig. 3) confirms that expanding the prediction threshold effectively captures a greater number of distinct true positive interface sites, providing a more comprehensive map of the predicted interaction surface. Further insights into the characteristics of these top predictions were gained by examining their associated predicted ddG values (Supplementary Fig. S6). Consequently, since the active residues in PPIs typically remain unknown before experimental characterization, molecular docking and simulations are often unsuccessful and waste computational resources. With structural information unavailable for most of the proteome, and considering the challenges posed by disordered regions and non-globular proteins, methods that bypass structural requirements address a critical need [58].

Our analysis of the top predicted residues by the four models that were not part of ground truth revealed substantial spatial dispersion from the true interface region. Distance calculations between predicted residues and the nearest true interface residues for the 14 complexes showed that polar contacts predominated, suggesting electrostatic interactions play a key role (Supplementary Data). While many predicted residues lacked direct interface contact, they may represent components of allosteric networks that influence binding through long-range effects. The prevalence of polar interactions supports this interpretation, as hydrogen-bonding networks facilitate allosteric communication (Supplementary Figs S4 and S9). These findings suggest that prediction models may capture a broader functional landscape [59–61]. Overall, they identify both direct interface residues and allosterically coupled sites important for protein–protein binding.

For certain models, notably LSTM and T5, the predicted ddG values exhibit a remarkably narrow range among 6063 complexes, −0.368 to 0.07 kJ/mol for LSTM and −0.795 to 0.822 kJ/mol for T5. This limited magnitude and range of predicted effects among the top candidates may suggest that these models face challenges in accurately predicting subtle changes in binding affinity resulting from mutations. This difficulty could potentially stem from inherent limitations in predicting nuanced energetic effects. Additionally, characteristics of the training data, such as potential biases or limitations in the diversity of experimental mutation data available in our SKEMPI v2 dataset, can impact the robustness of models for fine-grained ddG prediction across all mutation types and structural contexts [62]. Nevertheless, the ability of these models to enrich for known interface residues through computationally predicted alanine scanning, particularly when appropriate prediction thresholds are applied via metrics like the N factor, underscores their utility as a valuable tool for guiding experimental efforts in the characterization of protein interaction interfaces. Additionally, ESM2 (−2.211 to 3.504 kJ/mol) and ProtBERT (−3.422 to 3.172 kJ/mol) effectively capture a comparatively wider spectrum of mutational effects resulting from single amino acid substitutions.

Sequence-based methods like Seq2Bind offer highly accurate binding affinity predictions rapidly, but may overlook intricate 3D protein–ligand interactions [63]. This limitation is partly addressed by contemporary PLMs that integrate structural and functional information in their training [64]. In contrast, structure-based tools such as HADDOCK3 explicitly model shape complementarity and conformational flexibility via symmetry distance restraints, yet they struggle with proteins that are unstable or intrinsically disordered [65, 66]. Additionally, pipelines involving AlphaFold structure prediction followed by docking introduce compounded uncertainty and may affect accuracy due to prediction errors, especially in disordered regions [67]. Here, sequence-based approaches present a promising alternative, circumventing these structural challenges while enabling efficient, scalable binding affinity predictions.

Overall, our approach provides a valuable preliminary step before resource-intensive molecular docking, offering superior accuracy (ESM N-factor = 2 versus HADDOCK3, Fig. 4A), significantly reduced computational overhead (minutes versus hours for complete docking protocols), and broader applicability to proteins lacking experimentally determined structures.

Supplementary Material

lqaf154_Supplemental_Files

Bibliography67

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Braun P, Gingras A. History of protein–protein interactions: from egg-white to complex networks. Proteomics. 2012;12:1478–98. 10.1002/pmic.201100563.22711592 · doi ↗ · pubmed ↗
2Garlick JM, Mapp AK. Selective modulation of dynamic protein complexes. Cell Chem Biol. 2020;27:986–97. 10.1016/j.chembiol.2020.07.019.32783965 PMC 7469457 · doi ↗ · pubmed ↗
3Zhang Y, Gao P, Yuan JS. Plant protein–protein interaction network and interactome. Curr Genomics. 2010;11:40–6. 10.2174/138920210790218016.20808522 PMC 2851115 · doi ↗ · pubmed ↗
4Ideker T, Sharan R. Protein networks in disease. Genome Res. 2008;18:644–52. 10.1101/gr.071852.107.18381899 PMC 3863981 · doi ↗ · pubmed ↗
5De Las Rivas J, Fontanillo C. Protein–protein interaction networks: unraveling the wiring of molecular machines within the cell. Brief Funct Genomics. 2012;11:489–96. 10.1093/bfgp/els 036.22908212 · doi ↗ · pubmed ↗
6Gonzalez MW, Kann MG. Chapter 4: protein interactions and disease. P Lo S Comput Biol. 2012;8:e 1002819. 10.1371/journal.pcbi.1002819.23300410 PMC 3531279 · doi ↗ · pubmed ↗
7Ivanov AA, Revennaugh B, Rusnak L et al. The Onco P Pi Portal: an integrative resource to explore and prioritize protein–protein interactions for cancer target discovery. Bioinformatics. 2018;34:1183–91. 10.1093/bioinformatics/btx 743.29186335 PMC 6030952 · doi ↗ · pubmed ↗
8Jung SO, Ro H, Kho BH et al. Surface plasmon resonance imaging-based protein arrays for high-throughput screening of protein–protein interaction inhibitors. Proteomics. 2005;5:4427–31. 10.1002/pmic.200500001.16196090 · doi ↗ · pubmed ↗