Leveraging protein language models and a scoring function for indel characterization and transfer learning
Oriol Gracia Carmona, Vilde Leipart, Gro V. Amdam, Christine Orengo, Franca Fraternali

TL;DR
The paper introduces IndeLLM, a new method using protein language models to better assess the impact of insertions and deletions in proteins, improving accuracy and interpretability.
Contribution
IndeLLM introduces a novel scoring framework and Siamese network for indel pathogenicity prediction with improved accuracy and interpretability.
Findings
IndeLLM achieves performance comparable to existing predictors with minimal computing resources.
A Siamese network built using transfer learning outperforms tested indel predictors with a Matthews correlation coefficient of 0.77.
The framework enables mapping of indel effects to specific protein regions, enhancing interpretability.
Abstract
Protein language models (PLMs) are increasingly used to assess the impact of genetic variants, achieving high accuracy and often outperforming traditional pathogenicity predictors. They enable zero-shot inference, making predictions without task-specific fine-tuning, though studying in-frame insertions and deletions (indels) remains challenging due to altered protein lengths and limited annotated datasets. Here, we present IndeLLM, a scoring approach for indel pathogenicity that accounts for sequence length differences. Our zero-shot method relies solely on sequence information, requires minimal computing resources, and achieves performance comparable to existing predictors. Building on this, we developed a Siamese network via transfer learning that outperformed all tested indel predictors (Matthews correlation coefficient = 0.77). To enhance accessibility, we provide a plug-and-play…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Rare Diseases · Bioinformatics and Genomic Networks · Machine Learning in Bioinformatics
Introduction
Insertion and deletion variants (indels) are among the most prevalent forms of genetic variation, accounting for about 18% of all variation in humans, with individuals sometimes carrying hundreds of these mutations across their exome.1^,^2 These variations are commonly categorized into in-frame indels, which add or remove entire codons without disturbing the reading frame, and frameshift indels, which shift the reading frame by adding or removing nucleotides in non-triplet amounts, producing entirely new downstream protein sequences. The effects of indels can vary significantly, with some leading to substantial changes in protein structure and function, while others remain largely benign.3^,^4
Frameshift indels, which disrupt the reading frame and typically result in loss of protein function, are generally considered pathogenic.5 Here, we are considering in-frame indels, which, in contrast, preserve the reading frame while altering the amino acid sequence. These indels can potentially impact protein stability and function. Their effects are less predictable than frameshift indels and are often comparable to those of missense variants.6
Although in-frame indels are known to cause several well-documented monogenic diseases, such as cystic fibrosis, eye disorders such as childhood cataracts and retinal dystrophies, and several types of human cancers,5^,^6^,^7 their pathogenicity has historically been less studied than that of single-nucleotide variants (SNVs).1^,^8
As a result, the mechanism behind pathogenicity for in-frame indels is far from understood. Despite this, some specialized computational approaches are available to predict in-frame indel pathogenicity. The tools employ diverse strategies, often relying on the use of several manually encoded predictive features, such as sequence conservation, indel size, and protein structure and function, that have been seen to correlate with pathogenicity.8^,^9^,^10^,^11^,^12 However, selecting such features is based on an incomplete knowledge of the underlying mechanisms. It may not accurately capture the full complexity of indel pathogenicity or may potentially introduce biases that emphasize certain aspects while overlooking others. In addition, most available tools are often tailored to the human genome, as creating more broadly applicable tools is challenging due to limited available pathogenicity annotations for indels from other species.
Protein language models (PLMs) have the ability to meet these challenges. These models have emerged as a transformative approach in artificial intelligence, particularly for specialized fields like protein sequence analysis. PLMs are able to capture intricate relationships in proteins only by using protein sequences as input through training on massive unlabeled protein sequence datasets. The obtained embeddings contain rich biological information, which can be used as input for transfer learning approaches. PLMs offer a complementary technique to those based on genomic features and can be used to infer the potential effects of variants at the protein level.
These PLMs can capture complex information about protein secondary and tertiary structures across a wide range of organisms, from bacteria to humans, using the amino acid sequence as its only input feature, potentially providing insights beyond those available from traditional methods at a faster speed. Advances in computational power and the growth of available protein sequences drive the field of PLMs to create models with advanced transformer architectures containing billions of parameters. Today, a diverse range of PLMs, varying in size and architecture, is available, from general-purpose to specialized models.13^,^14^,^15^,^16^,^17^,^18
Transformer-based architectures and PLMs have shown great success and revolutionized the SNV prediction field,13^,^19^,^20 both when used as zero-shot inference methods (using the predicted logits with no additional training21) and when fine-tuned or used as features for other models. Zero-shot inference approaches are appealing since they do not risk overfitting to the data. However, when dealing with indels, using these techniques is not straightforward because of the difference in length between the sequences, which adds extra challenges.19^,^22 Recent efforts have focused on developing methods leveraging PLMs, either as zero-shot inference tools using a generalized scoring function described by Brandes et al.19 or as features to train other downstream models.9^,^23 However, these approaches still have limits in their accuracy or possible use cases. For example, SHINE, an elastic net model trained on a principal-component decomposition of PLM embeddings, is limited to small in-frame indels of one or two amino acids and requires two independent models, one for deletions and one for insertions.23
Despite these efforts, it remains unclear which PLMs provide the most accurate latent representations of the protein effects caused by indels, how best to transfer those representations, which scoring functions perform best for zero-shot indel prediction, and what the limitations of these methods are. Furthermore, the interpretability of predictions from PLMs can be challenging, as the patterns they capture are often complex and not immediately intuitive.24^,^25
In this work, we present an in-depth analysis of the inner workings of general scoring functions for zero-shot pathogenicity classification for in-frame indels. We suggest an improved scoring regime that achieved an improvement of 0.07 points in the Matthews correlation coefficient (MCC) over a general scoring approach,19 reaching an MCC comparable to the best-performing methods for indel prediction without any further training. From there, we test different transfer learning approaches on a simple one-hidden-layer Siamese network and manage to obtain unprecedented performance through a biology-guided slicing of the PLM embeddings. Finally, we present an analysis approach that aids in understanding the model predictions by looking at the local effects of the indel on the rest of the protein amino acids.
Results
Efficient scoring approach for zero-shot inference using protein language models
Our dataset has 7,500 indels, of which 2,409 are insertions and 5,091 are deletions. Of these, 2,878 are classified as likely pathogenic or pathogenic, while 4,622 are classified as likely benign or benign. We used the dataset to assess different scoring methods for the zero-shot inference using PLMs. The PLM of choice to test the scoring methods was ESM2 (650M parameters),15 which has shown the best performance compared with the other tested PLMs (Figure S3). Using PLMs as zero-shot predictors is not straightforward due to the varying lengths of both the sequences and the indel sizes.22 Brandes et al.19 described a generalizable approach to score sequences using the pseudo-log likelihood (PLL) of each sequence calculated as the sum of the log probabilities of each observed amino acid (see methods for more details). The described general approach, referred to as Brandes scoring, achieved a reasonable performance of MCC 0.58 (Figures 1A and 1B).Figure 1. Comparison of zero-shot scoring methods for ESM2Evaluation of different zero-shot scoring methods using the probabilities of ESM2 (650M parameters). The different scoring methods were IndeLLM (probabilities of overlapping regions), IndeLLM abs (using the absolute probability of overlapping regions), IndeLLM masked (logits from masked positions), Brandes (sum of all probabilities), and Brandes abs (using the absolute sum of probabilities). We compared the scoring methods by (A) plotting the ROC curve and calculating the AUC and (B) calculating the MCC scores. See Table S1 for all scores.
One concern of such a scoring method is that the PLLs being compared are not comparable due to the difference in length in the wild-type and mutated sequences. To mitigate this, we removed the probabilities of the deleted or inserted amino acids, calculating a PLL for the overlapping regions. Doing this eliminates the issue of incomparable likelihoods by indirectly inferring the indel’s effect. Another concern is that of fluctuations in probabilities within less-conserved regions. Here, the PLM assigns low probabilities to both the wild-type and the mutant sequences, which can disproportionately influence the final score when using methods such as log-probability summation or direct probability multiplication (Figure S2). To mitigate this effect, we propose using the sum of probabilities instead. This simplified scoring approach, which we denote as IndeLLM, produced an improvement in the predictions, ranking first among the tested scoring methods (Figures 1A and 1B; Table S1).
Another factor to consider is the impact of the indels, which significantly increase the likelihood of a protein being potentially pathogenic. To account for that, the absolute value of the difference between PLLs or scores is used instead. However, this scoring method, denoted as IndeLLM abs or Brandes abs, showed no improvement compared with using just the plain subtraction of PLLs or scores (Figures 1A and 1B).
Finally, we also compared two different ways to calculate the marginal probabilities, one using a single encoding step and then using that to compute the logits at each position (IndeLLM). The other uses a masking objective by masking each position individually and then using the logits of the masked position only (IndeLLM masked). Both approaches showed marginal differences in performance (Figures 1A and 1B; MCC IndeLLM 0.65 vs. IndeLLM masked 0.62), with the single encoding approach requiring significantly less computation time.
Leveraging transfer learning and Siamese networks for indel prediction
We leveraged PLMs in combination with a Siamese network to predict the pathogenicity of indels. Since PLMs are pre-trained models, we can extract high-quality embeddings that encode the structural and functional properties of sequences without requiring extensive task-specific training. These embeddings were fed into a smaller, task-specific architecture, a Siamese network, to efficiently compare and analyze sequence pairs.
Due to the small size of the labeled indel dataset (n = 7,500), we used a small multilayer perceptron (MLP) model with only one hidden layer. This mitigated overfitting by limiting the model’s complexity and, thus, its capacity to memorize the training data by picking up patterns in the noise that are not generalizable to the populations. In particular, using small Siamese networks for pathogenicity prediction has shown high performance in single-point mutation pathogenicity prediction.20
We used 80% (n = 5,960) of the labeled indels for the training dataset, while the remaining indels were split into a validation and a test dataset (n = 819 and n = 721, respectively). A robust model should be able to predict the pathogenicity of unseen protein sequences and indels. Therefore, it is essential to avoid redundancy between training, test, and validation datasets (data leakage). We clustered all wild-type protein sequences by sequence identity above 50% and randomly divided the clusters into the three datasets, achieving a maximum of 50% identity between all three datasets. We also ensured that the distributions of indel type (insertions or deletions), indel size (short [1 or 2 amino acids] or long [≥3 amino acids]), and indel classification (pathogenic and benign) were equal to the distributions in the complete dataset (see methods for details).
We used a setup similar to the one suggested by VariPred,20 namely, a Siamese network with one hidden layer, using a LeakyReLu as an activation function and a dropout rate of 0.5. Other hyperparameters, such as the number of neurons in the hidden layer, batch size, learning rate, and negative slope parameter for the LeakyReLu, were optimized through a simple grid search (Figure 2A).Figure 2. Siamese network architecture and performance(A) Siamese network setup. From the wild-type and mutated sequences, we generated four types of weights: the mean embeddings for the wild-type and mutated sequences (the full sequences), the IndeLLM score (using only overlapping regions, colored green), the indel type and length (based on the gap position and size, colored red), and the mean embeddings of the inserted or deleted amino acids (indel sequence, colored orange). The setup includes only one hidden layer of eight nodes.(B) Table of the weight used in models 1–4 marked with black dots and the same coloring as in (A).(C) The MCC (light blue), F1 (blue), and AUC (dark blue) scores for the mean of models 1–4 and IndeLLM Zero-shot using the test dataset.
The difference in length between the wild-type and the mutated sequences adds an extra layer of complexity when deciding which part of the embedding to use as input. Other approaches using PLM embeddings as input features have solved this by separating insertions from deletions.23 Here, we tested four different Siamese architectures to find which approaches provide the best transfer of the relevant latent space (Figures 2A and 2B).
The four models are variations of a Siamese network designed for sequence analysis. Model 1 utilizes mean embeddings from the last hidden layer for both wild-type and mutant sequences. Model 2 builds on this by incorporating the IndeLLM score as an additional input. Model 3 further extends this by including the indel type (insertion or deletion) and length as extra inputs. Model 4 splits the embeddings into overlapping and non-overlapping (indel) regions while also integrating indel type and length as inputs.
All replicates for model 1 using a more classical representation of the wild-type and mutant sequences by extracting the mean embeddings performed equal to the IndeLLM score in terms of area under the curve (AUC) of the receiver operating characteristic (ROC) curve but generally underperformed in terms of MCC and F1 score (performance on the test dataset was AUC, 0.86 vs. 0.86; MCC, 0.59 vs. 0.62; and F1, 0.78 vs. 0.83, respectively), indicating that the mean embeddings did not contain unique information when compared with the proposed IndeLLM score. Adding the IndeLLM score, model 2, did not improve the performance (MCC, 0.59). Model 3 was the only model from the group with an MCC comparable to the IndeLLM score (0.66 vs. 0.62, respectively), suggesting that providing the indel length and type contributed information on top of the plain mean embeddings. Model 4 performed significantly better than the IndeLLM score and any of the other three models, achieving MCC performance on the test set of 0.77, a 24% improvement from the IndeLLM score. This suggests that the information captured by the embeddings of the inserted or deleted amino acids adds additional information to the sequence embeddings. A more detailed breakdown of the performances of each model on each of the available datasets can be found in Table S2.
Comparing indel predictor performance with the IndeLLM scoring approach and Siamese model
We used the scores of nine different tools published by Cannon et al.9 and evaluated their performance and indications of overfitting (Figures S4 and S5). We concluded that the best-performing indel pathogenicity prediction tools were Provean and MutPred-indel, which were used to obtain pathogenicity prediction for all 7,500 indels. We compared performance using the IndeLLM Zero-shot score to the Provean and MutPred-indel on all indels (n = 7,500), the training (n = 5,960), validation (n = 819), and test (n = 721) datasets. We observed consistent pathogenicity prediction performance for all methods (Figure 3). MutPred-indel and IndeLLM have comparable performance (MCC MutPred-indel 0.57 vs. MCC IndeLLM 0.62). IndeLLM Siamese outperformed all methods and has 5.5% better prediction accuracy than the best-performing prediction tool, Provean (MCC IndeLLM Siamese 0.77 vs. MCC Provean 0.73) (Figure 3F). See Table S3 for all scores.Figure 3. Performance comparison of IndeLLM and pathogenicity prediction tools across datasetsEvaluation of the performance of the best-performing pathogenicity prediction tools, (A) Provean and (B) MutPred-indel, compared with (C) IndeLLM Zero-shot and (D) IndeLLM Siamese on all collected indels (blue) and the different dataset splits (green) used for training (train.), validation (val.), and testing (test). In (E), we compare their performance on the test dataset. In (A)–(E), we plotted the ROC curve and calculated the AUC, while in (F), we show the MCC scores for each method per dataset grouping and sorted by the MCC on the test data split.
The Siamese network outperformed the zero-shot inference approach by 24%. To assess the potential for overfitting, we evaluated the model on a holdout dataset derived from a recent iteration of ProteinGym, excluding any variants previously included in the curated training set. IndeLLM Siamese maintained comparable performance on this independent set, achieving an AUC of 0.90 and an MCC of 0.79 (Figures S5 and S6; Tables S4 and S5).
Examining the confusion matrix (Figure 4) reveals a high reduction in false negatives for insertions (19.51% for IndeLLM Zero-shot and 6.10% for IndeLLM Siamese) (Figures 4B and 4D). For the IndeLLM score predictions, we observe that the false-negative predictions for insertions have a high mean probability for the inserted amino acid, similar to the real benign insertions (Figure S7). Additionally, these false-negative insertions tend to be longer than benign insertions. However, this effect cannot be observed using the IndeLLM Siamese model. This suggests that some insertions may not introduce damaging amino acids but may instead cause pathogenicity through increased stability or altered fit, which the IndeLLM Zero-shot scoring approach fails to detect. By incorporating sequence embeddings for the inserted or deleted amino acids in Siamese model 4, we were able to account for these instances and accurately predict pathogenicity.Figure 4. Confusion matrices for IndeLLM Zero-shot and Siamese models(A) Confusion matrix for all deletions using IndeLLM Zero-shot scoring.(B) Confusion matrix for all insertions using IndeLLM Zero-shot scoring.(C) Confusion matrix for all deletions using IndeLLM Siamese network.(D) Confusion matrix for all insertions using IndeLLM Siamese network.All confusion matrices are colored by the percentage of true positives for benign (left) and pathogenic (right) insertions or deletions.
Toward interpretable indel predictions
Interpretability is essential for the usability of machine models. Understanding how these models predict changes in protein fitness allows users to validate results, ensuring that the predictions align with biological principles and are not artifacts or model biases.
The zero-shot scoring method by PLMs per amino acid allows us to obtain information on the consequences of the indels for the protein environment. The model assigns a value to each amino acid in the sequence based on its predicted likelihood given the rest of the sequence, which indicates the evolutionary patterns. By subtracting the per-amino-acid wild-type probabilities from the per-amino-acid mutated probabilities (except the deleted or inserted amino acids) for the entire protein sequence, we obtain a difference in value per amino acid. The differences range from 1 to −1, where 1 can be interpreted as the indel introducing beneficial consequences for the amino acids while −1 is detrimental. By plotting the differences in amino acid values, we can visualize the impact of the indel on the protein environment and, therefore, gain a better insight into why the indel is predicted as pathogenic or benign.
Our dataset reports four indels for a single protein, fibroblast growth factor receptor 1 (FGFR1). FGFR1 is a cell-surface membrane receptor built of three extracellular immunoglobulin (Ig)-like domains (D1–D3), a short hydrophobic transmembrane region, and a cytoplasmic tyrosine kinase domain.26 Of the four indels, one is in the tyrosine kinase domain, and three are in the extracellular Ig-like domains (Table 1). The differences in amino acid probabilities illustrate which regions the indels impact (Figures 5A–5D). The two pathogenic deletions have a negative effect on the surrounding regions, while the two benign indels do not affect the protein. By coloring the 3D model by the difference, we provide an explanation for the prediction. Here, we show that deleting a methionine in position 535 is predicted to destabilize the αC helix (Figure 5E), a structural element essential for the activation of the enzyme.26 We also found a five amino acid deletion in the first extracellular Ig-like domain (D1) in FGFR1. The long deletion occurs in a loop region, and IndeLLM predicts a destabilization and rearrangement of the domain, since we observe several interactions predicted to be altered (lost interactions in red and new interactions in blue, Figure 5F). We tested our hypothesis by predicting the structure of the mutated domain using AlphaFold,27 which also predicts an unfolding of the domain (Figure S8).Table 1. Example of indels reported for the FGFR1 protein (UniProt: P11362)ProteinIndel typeaa posFGFR1 domainWT sequenceMut sequenceIndel sizeClinVarIndeLLMFGFR1deletion535tyrosine kinase domainMKK1pathogenic/likely pathogenicpathogenicFGFR1insertion133disorderedDDD1likely benignbenignFGFR1deletion132disorderedDDD1likely benignbenignFGFR1deletion92Ig-like D1VPADSGG5likely pathogenicpathogenicFigure 5Mapping sequence probability differences onto protein domains and structures(A–D) The differences between the wild-type and the mutated sequence probabilities (y axis) are plotted per amino acid in FGFR1 (x axis). The colored boxes on the plots represented the domains in FGFR1 (D1–D3 in green, transmembrane region in blue, and the tyrosine kinase domain in red). In each plot, the indel position is labeled with an arrow and the indel type is noted in the title.(E) PDB structure of the tyrosine kinase domain (PDB: 4UWY) colored by the difference score (from red [≤ −0.2] to blue [≥0.2]). The indels are labeled and colored yellow. The impacted structural element (αC helix) is labeled, as well as the ATP binding pocket (dotted circle).(F) PDB structure of the Ig-like domain 1 (PDB: 2CR3) using the same coloring scheme as in (E). The N and C termini are labeled.(G and H) The plots are the same as in (A)–(D) but for the GLMN protein sequence. The colored boxes represented the N-terminal (yellow) and C-terminal (pink) domains.(I) PDB structure of the GLMN protein (PDB: 4F52) is colored as in (E). The N- and C-terminal domains are labeled in colored boxes above the structure.
Next we considered the effects of indels in glomulin (GLMN), which is essential for normal vasculature development and is part of a large protein complex (GLMN-RBX1-CUL1 complex).28 GLMN (HEAT-like) adopts a repeated fold of α helices organized in the N- and C-terminal domains. Our dataset reports two indels in the GLMN protein (Table 2). The pathogenic deletion is in the C-terminal domain, a domain that is essential for interactions with subunits in the complex.28 The difference in probabilities between the wild-type and the mutated sequences demonstrates that the deletion of asparagine in position 393 destabilizes large regions of the protein, particularly in the C-terminal domain (Figures 5G and 5I). In contrast, the benign insertion downstream of the N-terminal domain has minimal implications for the protein (Figure 5H).Table 2. Example of indels reported for the GLMN protein (UniProt: Q92990)ProteinIndel typeaa posGLMN domainWT sequenceMut sequenceIndel sizeDDDIndeLLMGLMNdeletion393C-terminal domainNKK1pathogenicpathogenicGLMNinsertion278N-terminal domain–E1likely benignbenign
Discussion
We have presented here a framework for the prediction of indel scores in the context of zero-shot inference. By scoring only amino acids that are identical in the wild-type and mutant sequences, previous limitations caused by indel-induced length differences were overcome by our method. Specifically, without length correction, pathogenicity scores were heightened for deletions and decreased for insertions. This difference originates from the extra amino acids in either the mutant or the wild-type sequence, making scores not directly comparable.
To address these issues, a zero-shot scoring approach that resolves the bias introduced by non-comparable PLLs has been introduced here. Our method utilizes the context awareness of PLMs, which capture sequence dependencies and evolutionary relationships. Rather than including the inserted or deleted amino acids in the PLL calculation, we compute the scores using only the amino acids present in both wild-type and mutant sequences. By using only overlapping regions, one removes length biases, decreases noise, and ensures a balanced indel pathogenicity scoring. Given the limited availability of labeled data for indels, this represents a promising approach, as it allows for accurate pathogenicity assessment without overfitting to known cases.
Moreover, we also propose a simplified zero-shot scoring approach utilizing the sum of probabilities instead of the sum of log probabilities or multiplication of individual probabilities. This modification addresses a key issue: when relying on log sums or probability products, small changes in positions with already low likelihoods, typically found in poorly conserved regions,15 can disproportionately influence the overall score. This happens because a small absolute change in a low probability corresponds to a large fold change, resulting in an exaggerated impact on the final score. In contrast, larger absolute changes at well-conserved positions, which have higher baseline probabilities, may result in smaller relative fold changes and thus contribute less to the score under those methods, despite being potentially more biologically meaningful. By summing raw probabilities, our method reduces this imbalance, ensuring that variations in high-confidence regions are not overshadowed by noise in low-confidence ones. This scoring scheme, which we term IndeLLM Zero-shot, offers a more balanced evaluation of mutation effects. IndeLLM Zero-shot scoring substantially improved the predictions, boosting PLMs from 0.58 to 0.65 (MCC score), levels comparable to those of methods trained with labeled data and manually curated features. Furthermore, we anticipate that as PLMs continue to evolve, the performance gap between zero-shot and supervised approaches will continue to narrow. In addition, our scoring strategy offers a context-aware method for evaluating indels with potential applications that extends beyond variant effect prediction in areas such as multiple sequence alignment.
One appealing use of PLMs is their ability to generate latent representations as input features for downstream models. Various strategies have been proposed to integrate PLM embeddings across applications, including the prediction of indels.19^,^23 Ideally, such models should maintain simplicity to minimize the risk of overfitting when working with limited datasets. In this study, we combined the approach described in IndeLLM Zero-shot with a lightweight Siamese network consisting of a single fully connected hidden layer with eight nodes. Despite its simplicity, this approach achieved state-of-the-art performance with no evidence of overfitting (IndeLLM Siamese vs. Provean AUC: 0.94 vs. 0.91). To better understand the contributions of specific features, we conducted an ablation study, revealing that splitting the embedding of indel amino acids from the rest of the sequence embedding significantly improved performance. This highlights the importance of integrating our biological knowledge when designing the model and during feature engineering. By leveraging this understanding, we can isolate the most informative parts of the embeddings, preventing signal dilution and reducing noise while preserving rich sequence representations. This key modification enhanced model performance without adding complexity. The proposed framework, termed IndeLLM Siamese, outperformed previous approaches in classification accuracy (MCC 0.77) and demonstrated greater computational efficiency compared with previously suggested methods (Note S8). These findings indicate that the IndeLLM Siamese model can serve as a valuable tool for the community, both as a high-performing predictor and as an example of effective feature selection in transfer learning. We suggest that incorporating more biological knowledge during feature selection could further enhance variant effect prediction in a more controlled and precise manner.
A closer examination revealed that the improvements in the IndeLLM Siamese, compared with the IndeLLM Zero-shot scoring, come from the reduction in the insertions that were wrongly predicted as benign (false negatives). The confusion matrix for IndeLLM reveals a high false-negative rate for insertions (19.51%), suggesting a systematic error in the predictions for insertions. The majority of the false-negative predictions for insertions involved amino acids with higher estimated probabilities than the average observed for pathogenic insertions (Figure S7). These findings suggest that such insertions could be causing a pathogenicity effect due to increased stability or altered fit or, perhaps, gain of function.19 IndeLLM Zero-shot scores assess the potential damage caused by the indel on the protein sequence and use that as a proxy for pathogenicity. This can explain why gain-of-function cases are not captured properly. We mitigated this by adding information about the inserted or deleted amino acids in IndeLLM Siamese model 4 (Figure 2). The negative rate for insertions dropped drastically to 6.10%. By comparing the mean inserted probabilities of the false-negative predicted insertions from IndeLLM Zero-shot and IndeLLM Siamese model 4, we observed a drop from 110 cases to 3 cases (Figure S7). This suggests that the IndeLLM Siamese model can better capture gain-of-function cases by leveraging embeddings of inserted amino acids. However, further improvements are needed to refine this capability. Mapping embeddings to identify key regions of interest presents a promising direction for future advancements.
Finally, we present an approach to interpreting the predictions by examining individual amino acid probability changes. The obtained differences can then be mapped back to the structure to spot regions predicted to be affected by the indels. This allows experts to use this information in terms of structure and function.
The Siamese network and zero-shot scoring approaches introduced in IndeLLM provide a low-complexity, biologically informed framework that can be seamlessly integrated into standard genomic analysis pipelines. These methods are particularly useful for evaluating pre-selected variants of uncertain significance (VUSs), offering insights into their potential functional impact at the protein level. Because both approaches rely solely on protein sequence information, they serve as a valuable complementing tool to existing techniques that leverage genomic or population-level data. Depending on the analysis stage and the specific type of variants under investigation, threshold values can be adjusted to better balance false-positive and false-negative rates, allowing the approach to be tailored to different research or clinical priorities.
While the developed IndeLLM Siamese network achieved the highest performance, it requires labeled training data, limiting its applicability primarily to human sequences. However, the underlying design principles can be extended to future models. In contrast, the IndeLLM Zero-shot approach offers broader applicability and ease of use, albeit with a trade-off in predictive accuracy.
In particular, the IndeLLM Zero-shot method operates at the per-position level, enabling preliminary identification of potential mechanisms of disruption, such as altered motifs or conserved regions, without requiring labeled training data. This makes it especially useful in scenarios where experimental validation data are limited. An example of the utility of this approach in a real-world setting was the study of variants in a honeybee protein.29 It is important to note, however, that certain limitations persist when applying the IndeLLM Zero-shot scoring methodology. While the proposed approach addresses key challenges, such as eliminating the confounding effects of sequence length differences and mitigating the overemphasis on low-likelihood regions by aggregating probabilities, it does not explicitly incorporate the identities of the inserted or deleted amino acids. As a result, it may misclassify variants in which the specific biochemical properties of the indel residues are the primary drivers of pathogenicity rather than disruptions to local sequence motifs or structural context.
Additionally, the IndeLLM Zero-shot scoring method may underperform on large indels, where substantial portions of the sequence (such as whole domains) are added or removed. In such cases, the model may underestimate the functional impact due to the absence of explicit sequence content, leading to reduced sensitivity in detecting highly disruptive events (see Figure 4). These limitations highlight potential areas for future improvement, such as integrating indel residue identity or adopting hybrid models that combine sequence-based scoring with contextual structural information.
Several avenues remain open for enhancing the model predictive power and interpretability by incorporating additional sources of structural and functional biological information, for example, Pfam domains, CATH functional families, conservation scores, or data on disrupted motifs. Furthermore, the question of how to best leverage more advanced transformer architectures remains an active area of investigation, particularly those that integrate additional data modalities such as structural or functional annotations like ESM3,16 ProSST,30 or ProsT5.17 While current results from the tested models indicate that these multimodal transformer models seem to underperform compared with pure sequence-based PLMs in terms of sequence likelihoods (Table S2), it is important to note that additional data channels other than sequence (for example, structure or functional annotations) were not utilized in those comparisons. As such, significant untapped potential remains for adapting these models to the specific challenges of indel interpretation.
We believe this information can aid in developing the indel predictor field. In line with that goal, much of our attention has been placed on making the method easy to use and accessible. Source code, digital notebooks, and datasets combining data from the most recent published datasets9^,^19^,^23 can be found in the resource availability section.
Methods
Dataset generation (including handling of long sequences)
We retrieved datasets of indels in the human genome from three recent studies on indel predictions.9^,^19^,^23 The first study, Cannon et al.,9 evaluated the performance of nine in silico pathogenicity prediction tools for small in-frame indels. Their published dataset contains 3,964 human indels (genome build GRCh38) collected from gnomAD, ClinVar, and the Deciphering Developmental Disorders (DDD) study,31^,^32^,^33 with prediction classifications from nine tools: CADD, CAPICE, FATHMM-indel, MutPred-indel, MutationTaster2021, PROVEAN, SIFT-indel, VEST-indel, and VVP.8^,^10^,^11^,^34^,^35^,^36^,^37^,^38^,^39 The Fan et al.23 study presented a PLM-based (ESM1b) pathogenicity predictor for short in-frame indels called SHINE. We retrieved the 1–2 amino acid indel training and validation dataset for SHINE. The Fan et al. dataset contains 5,457 indels from ClinVar and gnomAD. The Brandes et al. study included a dataset that was used to benchmark the performance of a PLM (ESM1b) on in-frame indel predictions, which is part of a genome-wide disease variant effect prediction study.19 The Brandes et al. dataset contained 3,470 ClinVar indels.
The Cannon et al. dataset did not contain the variant impacts on protein level, so we submitted the reported chromosome position and nucleotide variation to Ensembl Variant Effect Predictor (VEP40) to obtain the reference and alternative amino acids for most indels (some transcription IDs were no longer available). Then, we used the available transcription IDs (cross-referenced with the output from VEP) to obtain the full-length peptide sequence from Ensembl BioMart.41 We quality checked that the dataset contained only in-frame indels, leading to a single protein impact per indel, excluding frameshift variants. The Fan et al. dataset contained protein IDs, which we submitted to BioMart to obtain the full-length peptide sequences (some protein IDs were no longer available). The Brandes et al. dataset included the cropped wild-type and mutated protein sequences that we could use directly in our study.
We combined all the unique indels from the three studies creating our dataset (n = 7,500). The dataset includes 7,500 indels, of which 2,409 are insertions and 5,091 are deletions. Of these, 2,878 are classified as likely pathogenic or pathogenic, while 4,622 are classified as likely benign or benign. The indels range from 1 to 223 amino acids inserted or deleted (length) and represent a distribution of short indels (1–2 amino acids, n = 3,882) and longer indels (≥3 amino acids, n = 3,618) (Figure S1). We used this dataset for all analyses, except the specific performance analysis comparing our scoring approach to the nine indel prediction tools, in which we used the Cannon et al. subset of the dataset. Cannon et al. included the reported scores from the nine prediction tools. The Cannon et al. subset has 3,478 indels from gnomAD (n = 647), ClinVar (n = 2577), and DDD (n = 254), of which 1,115 are insertions and 2,363 are deletions. Of these, 1,518 are classified as likely pathogenic or pathogenic, while 1,960 are classified as likely benign or benign. Based on this specific performance analysis, we found that Provean37 and MutPred-indel11 showed no signs of overfitting and were among the few methods that have not been discontinued. We collected prediction scores for all 7,500 indels using Provean and MutPred-indel, which are included in our dataset.
Due to the size limitations of some PLMs, it is not possible to fit sequences with more amino acids than the allowed number of tokens. Models that allow for sequences of any length, such as ESM2, which uses RoPe positional encoding, quickly run into memory limitations due to the memory required to grow quadratically with protein length. Because of this, sequences exceeding 1,000 amino acids were truncated around the indel location, leaving a buffer region of 500 amino acids down and upstream of the indel. Since the attention mechanism grows quadratically with sequence length, most available PLMs have limits on the distance between tokens for which attention is computed, usually capped at 500 amino acids.15 By cropping the sequences and leaving 500 amino acids around the indel location, we guarantee that the maximum amount of the learned self-attention can be used for the effect inference. The resulting cropped sequences have a limit size of 1,000 amino acids plus the length of the indel. In cases where the indel length exceeds the model’s size limit, the number of buffer amino acids around the indel is symmetrically reduced to a length that fits the model (1,022 amino acids). All sequences, crops, and annotations used in this study are provided in the resource availability section.
Indel scoring
Brandes et al. described a generalized approach for scoring in-frame indels as a zero-shot inference task.19 In summary, the described approach calculates a PLL of each sequence s as
where L is the sequence length, si is the amino acid at position i, and log(P(xi = si|s)) is the log likelihood predicted by ESM1b for observing the input amino acid si at position i given the entire input sequence s. In this context, the output of the PLM is treated as a sequence of random variables x = x1, …, xl, where each xᵢ represents the probability distribution for observing 1 of the 20 standard amino acids at position i. The effect score of an in-frame indel is calculated as the difference in PLLs between the mutated (s_mut_) and the wild-type (s_WT_) sequences: PLL(s_m_ᵤ_t_) − PLL(s_WT_). Since the scoring above calculates the PLL of the whole sequence, which was first presented in the Brandes et al. study,19 we refer to this scoring as “Brandes.” One limitation of such a scoring method is that it relies on subtracting incomparable values due to the length difference of the sequences, especially when dealing with longer indels. The study showed attempts to limit that effect using a scaling factor dependent on the length difference but showed no improvements compared with the original Brandes score.19 In this work, we present an alternative scoring approach to obtain more comparable PLLs by calculating the PLL of only the overlapping amino acids. Since PLMs are context-aware language models, the indels’ effect is observable on the marginal probabilities of the rest of the amino acids in the sequence, allowing for the subtraction of comparable PLLs. The scoring can be computed as before, and the PLL of the overlapping regions (PLL_overlap_) can be calculated as
where soverlap is the amino acids present in both the wild-type and the mutated sequences. Building on this rationale, we propose a simplified scoring approach obtained by performing the sum of the probabilities over each of the overlapping positions, decreasing in this way the impact of fluctuations in poorly conserved regions. We refer to this scoring as IndeLLM. The optimal pathogenicity threshold for the entire dataset using the IndeLLM scoring approach is −0.59. This threshold was determined by maximizing the difference between the true- and the false-positive rates (TPRs and FPRs) on the ROC curves.
Finally, we also compare the predictions’ performance when using a masking objective for the assessed position or when using inference without masked tokens.
Siamese network architecture
For the Siamese network, we split the dataset into training, validation, and test datasets containing 80%, 10%, and 10% of the indels, respectively. We used CD-HIT42 to cluster wild-type sequences with a maximum 50% sequence identity (2,902 clusters), and the sequences in each cluster were always kept in the same dataset so that our training, validation, and test datasets contained diverse sequences. The clusters were randomly distributed into the three datasets while ensuring close to equal fractions of indel types, lengths, and pathogenicity classification as in the entire dataset. We considered indels of 1–2 amino acid short, while the remaining were long. See Table 3 for the distributions per dataset.Table 3. Distribution of indels for the full dataset and splitsDatasetTotal indelsShort deletionsLong deletionsShort insertionsLong insertionsBenignPathogenicFull7,5003,3741,7171,5348754,6222,878Training5,9602,6901,3571,2256883,7742,186Validation81936719416197453366Test72131716614890395326
To enhance indel classification and explore transfer learning potential using PLMs, we implemented a shallow feedforward neural network (FNN) inspired by the architecture described in VariPred for SNVs.20 The FNN was trained solely on the class labels, keeping the PLM parameters frozen. A simple grid search was employed for hyperparameter optimization, with the final parameters for each model detailed in Figure 2. To mitigate overfitting, the FNN was designed with a single fully connected hidden layer. Each model was trained five times with different initial weights to evaluate reliability.
The input features extracted for the models were as follows: mean embeddings of the last hidden layer for the wild-type sequence, mean embeddings of the last hidden layer for the mutant sequence, mean embeddings of the last hidden layer for the deleted or inserted amino acids, the IndeLLM score, the length of the indel, and the type of indel (0 for deletion, 1 for insertion). For models utilizing indel embeddings, the embeddings corresponding to the indel region were excluded from the longest sequence’s embeddings to ensure that only corresponding regions of the protein were compared.
Four final model architectures were developed, each leveraging a different combination of these features.
- (1)Model 1: a basic Siamese network that used the mean embeddings from the last hidden layer of the PLM for the whole wild-type and mutant sequences.
- (2)Model 2: built on model 1 by adding the IndeLLM score as an additional input parameter.
- (3)Model 3: extended model 2 by incorporating the indel type (insertion or deletion) and indel length as additional parameters.
- (4)Model 4: this model took a more advanced approach by splitting embeddings into overlapping and non-overlapping regions. The inputs included the mean embeddings from the overlapping regions of the wild-type and mutant sequences, the mean embeddings of the indel (non-overlapping region), the indel type, and its length. This biologically informed architecture aimed to capture sequence structure better. Each model was evaluated to determine the contribution of its specific features to the indel classification task. The sequence embeddings were extracted using ESM2 with 650M parameters. A final output layer with two nodes and a SoftMax activation function was used to classify variants as pathogenic or benign. Only the pathogenic node’s output was considered for classification. The AUC on the validation dataset was used as the early stopping criterion during training. The best-performing pathogenicity threshold for all the data is 0.46.
Performance analysis
To compare the performance of the pathogenicity prediction tools and PLMs, we computed the AUC of the ROC, the F1 score, and the MCC score and confusion matrices using Scikit-learn.43 The ROC curve provided the FPRs and TPRs at different threshold values. When comparing any approach in our study, we used the calculated FPR and TPR to find the optimal threshold, rather than the default threshold provided by each pathogenicity prediction tool, to make a fair comparison. The FPR and TPR were also used to calculate the ROC curve’s AUC. Our calculations are provided in our analysis Jupyter Notebooks.
Interpretability analysis
Since PLMs are context aware, one can compare the changes of the predicted amino acid probabilities at each position due to the surrounding context by subtracting the individual probabilities from each other for the amino acids that are overlapping:
where (si|smut)is the first amino acid in the mutated sequence smut, and (si|swt) is the same amino acid but in the wild-type sequence swt. This results in a different score for each position in the sequence.
These localized changes in probabilities can then be mapped back to the 3D structure of the protein of interest. This allows one to observe the regions where the model predicts significant changes due to the indel and understand which positions drive the pathogenicity prediction results.
To illustrate this approach, we identified genes in our dataset with multiple indels reported, both pathogenic and benign. We selected two proteins, FGFR1 and GLMN, which had their 3D protein structure resolved at high resolution. We show two domains of FGFR1, the apo structure of the intracellular tyrosine kinase domain (PDB: 4UWY,44 Figure 5E) and the first of the three extracellular Ig-like domains (PDB: 2CR3,45 Figure 5F), and the complete GLMN chain extracted from the GLMN-RBX1-CUL1 complex (PDB: 4F52,28 Figure 5I). The probability changes (difference scores) were obtained by aligning the wild-type and mutated sequences using the Bio.AlignIO package from Biopython46 and color was rendered on the structure using PyMOL.
Resource availability
Lead contact
Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Oriol Gracia Carmona ([email protected]).
Materials availability
This study did not generate new unique reagents or materials.
Data and code availability
- •All the data used on this project can be found on GitHub (https://github.com/OriolGraCar/IndeLLM) and HuggingFace.47
- •The source code for IndeLLM, all analyses, and examples are available in our GitHub repository (https://github.com/OriolGraCar/IndeLLM). The original code has also been deposited at Zenodo48 and is publicly available) as of the date of publication. We also released a plug-and-play Google Colab version of the code,48 which is also available at https://colab.research.google.com/github/OriolGraCar/IndeLLM/blob/main/IndeLLM.ipynb.
Acknowledgments
The authors acknowledge support from the 10.13039/501100000268Biotechnology and Biological Sciences Research Council (BB/T002212/1 to F.F. as principal investigator). F.F. and O.G.C. are thankful for the BHF grant (RG/F/22/110079). V.L. and G.V.A. acknowledge 10.13039/501100005416The Research Council of Norway grant nos. 335244 and 350231 for funding toward running costs, travel grants, and conference support.
Author contributions
O.G.C. and V.L. contributed to the conceptualization, visualization, methodology, software development, validation, formal analysis, data curation, writing of the original draft, and review and editing of the manuscript. F.F., C.O., and G.V.A. contributed to the writing of the original draft as well as to the review and editing of the manuscript.
Declaration of interests
The authors declare no competing interests.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used generative-AI-powered tools in order to fix grammar and spelling mistakes. The authors have reviewed and edited the content as needed and take full responsibility for the content of the publication.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Mullaney J.M.Mills R.E.Pittard W.S.Devine S.E.Small insertions and deletions (INDE Ls) in human genomes Hum. Mol. Genet.192010 R 131R 13610.1093/hmg/ddq 40020858594 PMC 2953750 · doi ↗ · pubmed ↗
- 2Mills R.E.Luttig C.T.Larkins C.E.Beauchamp A.Tsui C.Pittard W.S.Devine S.E.An initial map of insertion and deletion (INDEL) variation in the human genome Genome Res.1620061182119010.1101/gr.456580616902084 PMC 1557762 · doi ↗ · pubmed ↗
- 3Kim R.Guo J.t.Systematic analysis of short internal indels and their impact on protein folding BMC Struct. Biol.1020102410.1186/1472-6807-10-24PMC 292434320684774 · doi ↗ · pubmed ↗
- 4Lin M.Whitmire S.Chen J.Farrel A.Shi X.Guo J.T.Effects of short indels on protein structure and function in human genomes Sci. Rep.72017931310.1038/s 41598-017-09287-x PMC 557095628839204 · doi ↗ · pubmed ↗
- 5Farinha C.M.Callebaut I.Molecular mechanisms of cystic fibrosis – how mutations lead to misfunction and guide therapy Biosci. Rep.422022 BSR 2021200610.1042/BSR 20212006 PMC 925158535707985 · doi ↗ · pubmed ↗
- 6Sergouniotis P.I.Barton S.J.Waller S.Perveen R.Ellingford J.M.Campbell C.Hall G.Gillespie R.L.Bhaskar S.S.Ramsden S.C.The role of small in-frame insertions/deletions in inherited eye disorders and how structural modelling can help estimate their pathogenicity Orphanet J. Rare Dis.11201612510.1186/s 13023-016-0505-0PMC 502446327628848 · doi ↗ · pubmed ↗
- 7Ye K.Wang J.Jayasinghe R.Lameijer E.-W.Mc Michael J.F.Ning J.Mc Lellan M.D.Xie M.Cao S.Yellapantula V.Systematic discovery of complex insertions and deletions in human cancers Nat. Med.2220169710410.1038/nm.400226657142 PMC 5003782 · doi ↗ · pubmed ↗
- 8Douville C.Masica D.L.Stenson P.D.Cooper D.N.Gygax D.M.Kim R.Ryan M.Karchin R.Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel)Hum. Mutat.372016283510.1002/humu.2291126442818 PMC 5057310 · doi ↗ · pubmed ↗
