# Sequence-structure based prediction of pathogenicity for amino acid substitutions in proteins associated with primary immunodeficiencies

**Authors:** Ekaterina S. Porfireva, Anton D. Zadorozhny, Anastasia V. Rudik, Dmitry A. Filimonov, Alexey A. Lagunin

PMC · DOI: 10.3389/fimmu.2025.1492751 · Frontiers in Immunology · 2025-02-05

## TL;DR

This paper introduces sequence-structure-property relationship (SSPR) models that predict whether amino acid substitutions in 25 proteins associated with primary immunodeficiencies are pathogenic, with the aim of supporting diagnosis of these rare immune disorders.

## Contribution

The novel contribution is a set of per-protein classification models, built on sequence-structure-property relationships (SSPR), that predict the pathogenicity of amino acid substitutions in PID-related proteins.

## Key findings

- The best SSPR models achieved an average ROC AUC of 0.831 ± 0.037 across all genes in 5-fold cross-validation.
- The models outperformed widely used bioinformatics tools, such as SIFT4G and Polyphen2 HDIV, on the reported accuracy metrics.
- A web application, SAV-Pred, was developed to make these predictions accessible to medical professionals.

## Abstract

Primary immunodeficiencies (PIDs) are a group of rare genetic disorders characterized by dysfunction of immune system components. Early diagnosis and treatment are essential to prevent severe or life-threatening complications. PIDs manifest with diverse clinical symptoms, which makes accurate diagnosis challenging. A key aspect of PID diagnosis is identifying specific amino acid substitutions in the proteins associated with heritable diseases. In this study, we have developed classification sequence-structure-property relationships (SSPR) models for predicting the pathogenicity of amino acid substitutions (AAS) in 25 proteins associated with the most important and genetically studied PIDs. These proteins are encoded by the genes IL2RG, JAK3, RAG1, RAG2, ADA, DCLRE1C, CD40LG, WAS, ATM, STAT3, KMT2D, BTK, FOXP3, AIRE, FAS, ELANE, ITGB2, CYBB, G6PD, GATA2, STAT1, IFIH1, NLRP3, MEFV, and SERPING1.

The data on 4825 pathogenic and benign AASs in the selected proteins were extracted from ClinVar and gnomAD. SSPR models were created for each protein using the MultiPASS software based on the Bayesian algorithm and different levels of MNA (Multilevel Neighborhoods of Atoms) descriptors for the representation of structural formulas of protein fragments including AAS.
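As a rough illustration of what a "protein fragment including AAS" could look like at the sequence level, the sketch below cuts a window around a substitution and returns the wild-type and mutant fragments. It is not the MultiPASS or MNA descriptor procedure, which works on structural formulas. The function names, the `flank` parameter, and the toy sequence are all hypothetical.

```python
import re


def parse_substitution(sub):
    """Parse a one-letter substitution such as 'A4G' into (ref, position, alt)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", sub)
    if match is None:
        raise ValueError(f"unrecognized substitution: {sub!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def fragments_around_substitution(sequence, sub, flank=5):
    """Return (wild_type_fragment, mutant_fragment) around the substituted residue.

    `flank` is a hypothetical window half-width, not a value from the paper.
    Positions in `sub` are 1-based, as in standard protein variant notation.
    """
    ref, position, alt = parse_substitution(sub)
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {sequence[index]}"
        )
    start = max(0, index - flank)
    end = min(len(sequence), index + flank + 1)
    wild_type = sequence[start:end]
    mutant = sequence[start:index] + alt + sequence[index + 1:end]
    return wild_type, mutant


# Toy sequence for illustration only; it is not one of the 25 studied proteins.
toy_sequence = "MKTAYIAKQRQISFVK"
wild_type, mutant = fragments_around_substitution(toy_sequence, "A4G", flank=2)
print(wild_type, mutant)
```

In a real pipeline, each fragment would then be converted to a structural formula and described with MNA descriptors before classification; that step is outside the scope of this sketch.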

The accuracy of prediction was assessed through 5-fold cross-validation and compared with that of other bioinformatics tools, such as SIFT4G, Polyphen2 HDIV, FATHMM, MetaSVM, PROVEAN, ClinPred, and AlphaMissense. The best SSPR models demonstrated high accuracy, with an average ROC AUC of 0.831 ± 0.037, balanced accuracy of 0.763 ± 0.034, MCC of 0.457 ± 0.06, and F-measure of 0.623 ± 0.07 across all genes, outperforming the most popular bioinformatics tools.
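For readers less familiar with the four metrics reported above, the following minimal sketch computes them from their standard definitions. It assumes pathogenic variants are the positive class and that the F-measure is the F1 score; the abstract does not state either. The function names and the example counts are illustrative and are not data from the paper.

```python
import math


def threshold_metrics(tp, fp, tn, fn):
    """Balanced accuracy, MCC and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = (
        2 * precision * sensitivity / (precision + sensitivity)
        if (precision + sensitivity)
        else 0.0
    )
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"balanced_accuracy": balanced_accuracy, "mcc": mcc, "f1": f1}


def roc_auc(positive_scores, negative_scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up counts and scores, for illustration only.
print(threshold_metrics(tp=40, fp=10, tn=40, fn=10))
print(roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

Balanced accuracy and MCC are less sensitive to class imbalance than plain accuracy, which matters when benign and pathogenic variants are unevenly represented for a given gene.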

The best SSPR models for predicting the pathogenicity of amino acid substitutions associated with PIDs have been implemented in a freely available web application, SAV-Pred (Single Amino acid Variants Predictor, http://www.way2drug.com/SAV-Pred/), which may be a useful tool for medical geneticists and clinicians. Examples of the use of SAV-Pred for some clinical cases of PIDs are provided.

## Linked entities

- **Genes:** IL2RG (interleukin 2 receptor subunit gamma) [NCBI Gene 3561], JAK3 (Janus kinase 3) [NCBI Gene 3718], RAG1 (recombination activating 1) [NCBI Gene 5896], RAG2 (recombination activating 2) [NCBI Gene 5897], ADA (adenosine deaminase) [NCBI Gene 100], DCLRE1C (DNA cross-link repair 1C) [NCBI Gene 64421], CD40LG (CD40 ligand) [NCBI Gene 959], WAS (WASP actin nucleation promoting factor) [NCBI Gene 7454], ATM (ATM serine/threonine kinase) [NCBI Gene 472], STAT3 (signal transducer and activator of transcription 3) [NCBI Gene 6774], KMT2D (lysine methyltransferase 2D) [NCBI Gene 8085], BTK (Bruton tyrosine kinase) [NCBI Gene 695], FOXP3 (forkhead box P3) [NCBI Gene 50943], AIRE (autoimmune regulator) [NCBI Gene 326], FAS (Fas cell surface death receptor) [NCBI Gene 355], ELANE (elastase, neutrophil expressed) [NCBI Gene 1991], ITGB2 (integrin subunit beta 2) [NCBI Gene 3689], CYBB (cytochrome b-245 beta chain) [NCBI Gene 1536], G6PD (glucose-6-phosphate dehydrogenase) [NCBI Gene 2539], GATA2 (GATA binding protein 2) [NCBI Gene 2624], STAT1 (signal transducer and activator of transcription 1) [NCBI Gene 6772], IFIH1 (interferon induced with helicase C domain 1) [NCBI Gene 64135], NLRP3 (NLR family pyrin domain containing 3) [NCBI Gene 114548], MEFV (MEFV innate immunity regulator, pyrin) [NCBI Gene 4210], SERPING1 (serpin family G member 1) [NCBI Gene 710]

## Full-text entities

- **Genes:** DCLRE1C (DNA cross-link repair 1C) [NCBI Gene 64421] {aka A-SCID, DCLREC1C, RS-SCID, SCIDA, SNM1C}, RAG2 (recombination activating 2) [NCBI Gene 5897] {aka RAG-2}, JAK3 (Janus kinase 3) [NCBI Gene 3718] {aka JAK-3, JAK3_HUMAN, JAKL, L-JAK, LJAK}, NLRP3 (NLR family pyrin domain containing 3) [NCBI Gene 114548] {aka AGTAVPRL, AII, AVP, C1orf7, CIAS1, CLR1.1}, G6PD (glucose-6-phosphate dehydrogenase) [NCBI Gene 2539] {aka CNSHA1, G6PD1}, ELANE (elastase, neutrophil expressed) [NCBI Gene 1991] {aka ELA2, GE, HLE, HNE, NE, PMN-E}, ADA (adenosine deaminase) [NCBI Gene 100] {aka ADA1}, AIRE (autoimmune regulator) [NCBI Gene 326] {aka AIRE1, APECED, APS1, APSI, PGA1}, ATM (ATM serine/threonine kinase) [NCBI Gene 472] {aka AT1, ATA, ATC, ATD, ATDC, ATE}, IFIH1 (interferon induced with helicase C domain 1) [NCBI Gene 64135] {aka AGS7, Hlcd, IDDM19, IMD95, MDA-5, MDA5}, MEFV (MEFV innate immunity regulator, pyrin) [NCBI Gene 4210] {aka FMF, MEF, PAAND, TRIM20}, CD40LG (CD40 ligand) [NCBI Gene 959] {aka CD154, CD40L, HIGM1, IGM, IMD3, T-BAM}, FOXP3 (forkhead box P3) [NCBI Gene 50943] {aka AIID, DIETER, IPEX, JM2, PIDX, XPID}, FAS (Fas cell surface death receptor) [NCBI Gene 355] {aka ALPS1A, APO-1, APT1, CD95, FAS1, FASTM}, IL2RG (interleukin 2 receptor subunit gamma) [NCBI Gene 3561] {aka CD132, CIDX, IL-2RG, IMD4, P64, SCIDX}, BTK (Bruton tyrosine kinase) [NCBI Gene 695] {aka AGMX1, AT, ATK, BPK, IGHD3, IMD1}, SERPING1 (serpin family G member 1) [NCBI Gene 710] {aka C1IN, C1INH, C1NH, HAE1, HAE2}, STAT3 (signal transducer and activator of transcription 3) [NCBI Gene 6774] {aka ADMIO, ADMIO1, APRF, HIES}, GATA2 (GATA binding protein 2) [NCBI Gene 2624] {aka DCML, IMD21, MONOMAC, NFE1B}, CYBB (cytochrome b-245 beta chain) [NCBI Gene 1536] {aka AMCBX2, CGD, CGDX, GP91-1, GP91-PHOX, GP91PHOX}, KMT2D (lysine methyltransferase 2D) [NCBI Gene 8085] {aka AAD10, ALR, BCAHH, CAGL114, KABUK1, KMS}, WAS (WASP actin nucleation promoting factor) [NCBI Gene 7454] {aka IMD2, SCNX, THC, THC1, WASP, WASPA}, ITGB2 (integrin subunit beta 2) [NCBI Gene 3689] {aka CD18, LAD, LCAMB, LFA-1, MAC-1, MF17}, STAT1 (signal transducer and activator of transcription 1) [NCBI Gene 6772] {aka CANDF7, IMD31A, IMD31B, IMD31C, ISGF-3, STAT91}, RAG1 (recombination activating 1) [NCBI Gene 5896] {aka RAG-1, RNF74}
- **Diseases:** dysfunction (MESH:D006331), genetic disorders (MESH:D030342), PIDs (MESH:D000081207)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11835853/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11835853/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC11835853/full.md

---
Source: https://tomesphere.com/paper/PMC11835853