# Leveraging protein language models and a scoring function for indel characterization and transfer learning

**Authors:** Oriol Gracia Carmona, Vilde Leipart, Gro V. Amdam, Christine Orengo, Franca Fraternali

PMC · DOI: 10.1016/j.patter.2025.101425 · Patterns · 2025-11-26

## TL;DR

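The paper introduces IndeLLM, a scoring approach that uses protein language models to assess the pathogenicity of in-frame insertions and deletions while accounting for sequence length differences. A Siamese network built on it via transfer learning reaches a Matthews correlation coefficient of 0.77.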

## Contribution

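The paper contributes a length-aware, zero-shot scoring framework for indel pathogenicity and a compact Siamese network, built on that framework via transfer learning, which improves prediction accuracy and allows indel effects to be mapped to specific protein regions.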

## Key findings

- The zero-shot IndeLLM score relies solely on sequence information and achieves performance comparable to existing predictors with minimal computing resources.
- A Siamese network built via transfer learning outperforms all tested indel predictors, with a Matthews correlation coefficient of 0.77.
- The framework enables mapping of indel effects to specific protein regions, enhancing interpretability.
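
For readers unfamiliar with the metric reported above, the Matthews correlation coefficient (MCC) summarizes a binary confusion matrix as a single value between -1 and 1. The sketch below computes it from raw counts; the function name and the example counts are illustrative assumptions and do not come from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for binary confusion-matrix counts.

    By convention, 0.0 is returned when the denominator is zero
    (for example, when every prediction falls in a single class).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts chosen only for illustration (not results from the paper).
example_mcc = matthews_corrcoef(tp=45, tn=44, fp=6, fn=5)
print(f"MCC = {example_mcc:.2f}")
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which makes it a common choice when pathogenic and benign classes are imbalanced.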

## Abstract

Protein language models (PLMs) are increasingly used to assess the impact of genetic variants, achieving high accuracy and often outperforming traditional pathogenicity predictors. They enable zero-shot inference, making predictions without task-specific fine-tuning, though studying in-frame insertions and deletions (indels) remains challenging due to altered protein lengths and limited annotated datasets. Here, we present IndeLLM, a scoring approach for indel pathogenicity that accounts for sequence length differences. Our zero-shot method relies solely on sequence information, requires minimal computing resources, and achieves performance comparable to existing predictors. Building on this, we developed a Siamese network via transfer learning that outperformed all tested indel predictors (Matthews correlation coefficient = 0.77). To enhance accessibility, we provide a plug-and-play Google Colab notebook for using IndeLLM and visualizing the impact of indels on protein sequence and structure. The tool is freely available on GitHub and Google Colab.
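
To illustrate why sequence length matters when scoring indels, the following minimal sketch compares a summed log-likelihood difference with a length-normalized one. It is not the IndeLLM scoring function, which is defined in the full paper; the function name, the mean-based normalization, and the toy per-residue log-probabilities are assumptions made only for illustration.

```python
from statistics import fmean


def length_normalized_score(wt_logprobs, mut_logprobs):
    """Mutant minus wild-type mean per-residue log-probability.

    Hypothetical helper: averaging removes the trivial dependence on
    sequence length that a plain sum would carry.
    """
    if not wt_logprobs or not mut_logprobs:
        raise ValueError("both sequences need at least one residue score")
    return fmean(mut_logprobs) - fmean(wt_logprobs)


# Toy per-residue log-probabilities, standing in for PLM output (assumed values).
wt = [-0.5, -0.4, -0.6, -0.5]   # wild type, 4 residues
mut = [-0.5, -0.4, -0.5]        # mutant with one residue deleted

raw_sum_diff = sum(mut) - sum(wt)              # shifts mainly because the mutant is shorter
norm_diff = length_normalized_score(wt, mut)   # compares per-residue averages instead

print(f"summed difference:     {raw_sum_diff:.3f}")
print(f"normalized difference: {norm_diff:.3f}")
```

With these toy values the summed difference is dominated by the missing residue, while the per-residue mean changes only slightly.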

- Improved scoring framework for indel effects using protein language models
- Increased interpretability by mapping indel impact to specific protein regions
- Improved prediction accuracy using a compact Siamese network
- Practical guidelines to enhance transfer learning for indel-focused studies

Studying the effects of insertions and deletions (indels) remains a significant challenge, particularly since they alter sequence length. This issue complicates the study of indels when using high-performing protein language models (PLMs). Existing pathogenicity prediction tools are based on limited knowledge, are primarily focused on human proteins, and typically lack interpretability. We introduce an improved scoring framework that applies PLMs to studying indels. Our approach takes advantage of these models’ ability to generalize across different organisms. It also enables researchers to directly map indel effects to specific protein regions, offering improved interpretability and structural insight. We also outline a set of practical guidelines for transfer learning with PLMs to enhance the efficiency and effectiveness of indel-related studies. We design a compact Siamese network, a smaller task-specific architecture combined with the framework, that outperforms current state-of-the-art indel pathogenicity predictors. This approach contributes to more robust and interpretable protein modeling, with potential implications for indel annotation, comparative genomics, and disease-related variant analysis.
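
As a rough illustration of the Siamese idea mentioned above, the sketch below passes a wild-type and a mutant embedding through the same shared-weight encoder and scores their difference. The layer sizes, the random weights, the absolute-difference comparison, and the use of fixed-size pooled PLM embeddings as inputs are all assumptions; the architecture actually used in the paper may differ.

```python
import math
import random


def encode(embedding, weights):
    """Shared branch: one linear layer followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, embedding))) for row in weights]


def siamese_forward(wt_emb, mut_emb, shared_weights, head_weights, head_bias=0.0):
    """Encode both inputs with the SAME weights, compare them, squash to (0, 1)."""
    h_wt = encode(wt_emb, shared_weights)
    h_mut = encode(mut_emb, shared_weights)
    diff = [abs(a - b) for a, b in zip(h_wt, h_mut)]  # symmetric comparison
    logit = sum(w * d for w, d in zip(head_weights, diff)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Untrained random weights and toy embeddings (assumed sizes, illustration only).
rng = random.Random(0)
emb_dim, hidden_dim = 8, 4
shared = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(hidden_dim)]
head = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

wt_emb = [rng.uniform(-1, 1) for _ in range(emb_dim)]
mut_emb = [x + rng.uniform(-0.2, 0.2) for x in wt_emb]  # perturbed copy of the wild type

p_same = siamese_forward(wt_emb, wt_emb, shared, head)  # identical inputs give a zero difference
p_mut = siamese_forward(wt_emb, mut_emb, shared, head)
print(p_same, p_mut)
```

Because both branches share one set of weights, identical inputs always produce a zero difference vector; in a trained model the head would learn which differences indicate a pathogenic indel.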

Insertions and deletions (indels) are difficult to study because they change protein sequence length, limiting existing tools. Here, IndeLLM is presented: a framework that leverages protein language models to assess indel effects in a more interpretable and generalizable way. The method maps indel impact to protein regions, improves predictive accuracy with a Siamese network, and provides guidelines for transfer learning. IndeLLM offers new opportunities for indel annotation and disease-related variant analysis.

## Full-text entities

- **Genes:** GLMN (glomulin, FKBP associated protein) [NCBI Gene 11146] {aka FAP, FAP48, FAP68, FKBPAP, GLML, GVM}, FGFR1 (fibroblast growth factor receptor 1) [NCBI Gene 2260] {aka BFGFR, CD331, CEK, ECCL, FGFBR, FGFR-1}, RBX1 (ring-box 1) [NCBI Gene 9978] {aka BA554C12.1, RNF75, ROC1}, CUL1 (cullin 1) [NCBI Gene 8454], TXK (TXK tyrosine kinase) [NCBI Gene 7294] {aka BTKL, PSCTK5, PTK4, RLK, TKL}
- **Diseases:** cystic fibrosis (MESH:D003550), retinal dystrophies (MESH:D058499), cancers (MESH:D009369), eye disorders (MESH:D005128), DDD (MESH:D002658), cataracts (MESH:D002386)
- **Chemicals:** amino acid (MESH:D000596), ATP (MESH:D000255)
- **Species:** Homo sapiens (human, species) [taxon 9606], Apis mellifera (bee, species) [taxon 7460]
- **Mutations:** deletion of asparagine in position 393, methionine in position 535

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12921505/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12921505/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/PMC12921505/full.md

---
Source: https://tomesphere.com/paper/PMC12921505